Researchers are training AI to listen just like humans

Artificial intelligence researchers are making progress towards their goal of training AI systems to understand speech from audio input alone, just like humans do.

At the moment, most AI systems can only recognize speech by first translating it into text. A lot of progress has been made in lowering word error rates and increasing the number of languages supported.

However, having AI understand speech through audio input alone is a big jump from this stage. Researchers at MIT's Computer Science and Artificial Intelligence Laboratory have taken a step towards it by mapping speech to images rather than text.

AI hear you

It doesn’t sound like much on the surface, but the phrase 'a picture is worth a thousand words' makes it clear just how big an impact it could have.

At the Neural Information Processing Systems conference, the researchers demonstrated their method in a presentation based on a paper they've written.

The idea behind their research is that if several words can be grouped under a single related image, it should be possible for the AI to make a “likely” translation without the need for rigorous training.

To create a training dataset for the AI systems, the researchers used the Places205 dataset, which has over 2.5 million images split into 205 different subjects. The researchers paid groups of people to record spoken descriptions of what they saw in four random images each from the dataset. They’ve managed to collect over 120,000 captions from 1,163 individuals.

The AI has then been trained to link words in each caption to relevant images, scoring the similarity of each pairing to select the most accurate translation. If a caption is relevant to the image, it should score high; if not, it should score low.
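As a rough illustration of this kind of scoring, the sketch below compares a caption with a handful of images. It assumes that both the spoken caption and each image have already been turned into lists of numbers of the same length, and that the score is a simple dot product. The article does not say how the researchers compute their score, so the function names and the toy vectors here are hypothetical.

```python
def similarity(audio_vec, image_vec):
    """Dot product of two equal-length vectors (an assumed scoring rule)."""
    if len(audio_vec) != len(image_vec):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(audio_vec, image_vec))


def best_match(audio_vec, image_vecs):
    """Return the name of the image whose vector scores highest."""
    return max(image_vecs, key=lambda name: similarity(audio_vec, image_vecs[name]))


# Made-up vectors, for illustration only.
caption = [0.9, 0.1, 0.0]
images = {
    "beach": [1.0, 0.0, 0.0],
    "kitchen": [0.0, 1.0, 0.0],
    "forest": [0.0, 0.0, 1.0],
}

print(best_match(caption, images))
```

A relevant pairing (the caption and "beach") gets a high score, while unrelated pairings score close to zero.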

In testing, the network was fed audio recordings describing a picture saved in its database and was asked to select the ten images that best matched the audio caption. Unfortunately, the correct image was among the ten selected only 31% of the time.
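This style of test, checking whether the right image lands among the top few picks, can be sketched in a few lines. The code below is an illustration only, not the researchers' evaluation code: the helper names and the score values are invented, and a small cutoff is used so the toy example stays readable.

```python
def top_k(scores, k=10):
    """Indices of the k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def recall_at_k(score_rows, correct_indices, k=10):
    """Fraction of captions whose correct image appears in the top k picks."""
    hits = sum(
        1
        for scores, correct in zip(score_rows, correct_indices)
        if correct in top_k(scores, k)
    )
    return hits / len(correct_indices)


# One row of made-up scores per caption, one column per image.
score_rows = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.1],
    [0.7, 0.2, 0.4],
]
correct_indices = [0, 1, 2]  # the image each caption actually describes

print(recall_at_k(score_rows, correct_indices, k=1))
print(recall_at_k(score_rows, correct_indices, k=2))
```

With a cutoff of ten, a result of 0.31 from this kind of calculation would correspond to the 31% figure reported above.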

This is a disappointing score for the researchers, though it comes from a fairly basic way of training AI to recognize words, with no text or language data to assist its understanding.

However, it’s believed that with improvement, this means of training could help speech recognition software to adapt more quickly to different languages and provide a new means of teaching it to translate. We can already see how pairing images with words helps people learn new languages, with language learning software like that offered by Rosetta Stone.

Jim Glass, a co-author of the paper detailing the research, said: “The goal of this work is to try to get the machine to learn language more like the way humans do.”

Achieving this kind of unsupervised learning could make training AI much more cost- and time-effective, as well as more useful to society at large. Clearly, though, many more advancements have to happen before that's possible.

Emma Boyle

Emma Boyle is TechRadar’s ex-Gaming Editor, and is now a content developer and freelance journalist. She has written for magazines and websites including T3, Stuff and The Independent. Emma currently works as a Content Developer in Edinburgh.
