When it comes to Artificial Intelligence (AI), no one does it better than the folks at Google. From detecting floods in India to preserving ancient Japanese scripts, the American tech giant has leveraged the power of AI to solve some of the world’s biggest problems more efficiently than the human brain ever could.
And while its usage might not be as widespread as some of Google’s other initiatives, Live Transcribe is nevertheless an invaluable tool that helps people who are deaf or hard of hearing communicate with the world. Available in over 70 languages and dialects, this app detects speech and turns it into real-time captions, allowing the hearing impaired to participate in conversations with just their phones.
And as is usually the case with tech innovations, we had a few questions: why did something so seemingly simple take so long to be released? How is the data that the AI uses collected? Will the transcription ever be 100% accurate? To find out, we sat down with Sagar Savla and Julie Cattiau, two product managers at Google AI, and here’s what we learned.
VP: Automatic Speech Recognition (ASR) has been around for some time now, so why is Google only releasing Live Transcribe now? Was creating it as simple as taking ASR technology and putting it into an app, or did you have to do something different?
SS: Live Transcribe was built on decades of research to get the captions to a level where using it was worth it. If you used the same technology 10 years ago, it wouldn’t be useful because the accuracy would be so unreliable that a deaf person might start relying on other communication signals instead of using it.
We were able to reach this level uniquely at Google because we had decades of research in speech recognition gathered from other products such as Google Voice Search and Google Assistant. This essentially made the hardest challenge easy for us, both in terms of having access to the experts, as well as having the models ready.
VP: So how do you collect your research data? Is it done on a public or private scale?
SS: Both. We do employ contractors to come into the lab and record clean, professional audio, but we also crowdsource for data. There’s actually an app called Google Crowdsource, which allows people to annotate and submit their own data. Sometimes it’s voice, sometimes it’s photos, sometimes it’s text that goes into Google Translate features.
I would say that we’ve been running a data collection effort across different countries for the last ten years. Gujarati, for example, was one of the languages that was only recently launched, but the data collection effort actually started four years ago. We also collect data from various geographies, because there can be different accents within a single country.
SS: Google’s data collection is a dedicated, concerted effort. When we approach someone, we first tell them what is going to be collected, what it’s going to be used for, who will be able to see it, how long we are going to store it, and the applications that can come out of it. If they are completely fine with these conditions, they sign an agreement that says “yes, I am fine with Google using this aspect of my data”, and that’s when we bring them in to collect that data.
That policy exists not just for voice recordings, but for all of our data collection efforts. Sometimes we collect photos for removing biases from our models, and when we go out to ask if people are willing to donate their data, we tell them what it’s going to be used for too.
We also use this information to give back, in a way. A lot of our models are open source, and the academic community sometimes uses our models to benchmark new research. We release that whole training set without giving out any identifying information.
JC: I think it will be truly difficult to reach 100% accuracy. Even if you’re a native English speaker, there will still be errors in speech recognition technology. The question now is “how can we make as few errors as possible so that it’s acceptable for people to use?” We’re still in the very early stages, but we’re very much thinking about that and about where to draw the line.
SS: What we’ve heard from our users is that if they had to choose between zero communication and something that was 60% accurate, they would rather take that 60% accuracy and get the essence of the conversation so that they can still participate.
We’ve heard stories of users who went for emergency medical appointments and could not engage a professional interpreter on such short notice, so they used the app and it was good enough for them. The doctor probably had to repeat a few things, and maybe an appointment that was supposed to take 60 minutes took 90 minutes, but it was good enough for them as they didn’t have to rely on somebody else.
We’re still a long way off from reaching 100% accuracy, but the thing is that communication can never be perfect. Even human-to-human communication is hard, and we still rely on other cues like body language. Five years ago the technology wasn’t able to detect things like whistling or knocking, and we’re adding these little things to increase the level of immersion. We have to keep identifying these smaller aspects of the grander challenge to get the accuracy higher.