Here’s the episode where we discuss this topic!
The art of mimicry
The lyrebird is a remarkable Australian bird with the unique ability to mimic the sounds it hears around it, both natural and artificial. It renders the songs of other birds with great fidelity, and it has also been shown to mimic other animals such as koalas. The lyrebird can imitate almost any sound: individuals have been recorded mimicking a mill whistle, a cross-cut saw, chainsaws, car engines, car alarms, fire alarms, rifle shots, camera shutters, barking dogs, crying babies, music, mobile phone ringtones, and even, though this is very rare, the human voice.
The bird is so distinctive that it was featured in its own BBC documentary. You have probably seen the clip before; it was extremely popular around offices and schools for a time.
Although we would love to spend more time with the lyrebird itself, we are actually here to discuss something that has taken far less time to evolve but may have even more impressive powers of mimicry: what might be called lyrebird artificial intelligence.
That is, the recent efforts to mimic individual human voices.
One organization that is leading the way on mimicking the human voice is Lyrebird AI.
We are all familiar with robotic voices reading text for us. Nearly all of us have listened to a GPS system telling us to take the next right, and Amazon's Alexa and Apple's Siri have voices of their own. That is not what we are discussing today. Today we are talking about the ability to render specific human voices and put words in their mouths.
And they are claiming some surprising successes. Suffice it to say that they have very convincingly mimicked Presidents Obama and Trump.
They also let individuals create digital versions of their own voices. After you read a few lines into a microphone, the service can replicate your voice. It is not yet good enough to fool your mother, but it might soon pass for you with a recent acquaintance.
Let's assume Lyrebird brings a product to market in the near future. What would it be? It could be an expansion of personal voice replication, but that probably won't be the main draw. More likely it will be something a bit more controversial, such as the ability to mimic specific voice actors: rent a voice for your application for a monthly fee.
But Lyrebird AI does not have a product out yet. They may have stalled because their technology is not quite ready, or perhaps they are weighing the ethical implications of bringing such a product to market. We will come back to this topic shortly.
Other efforts at voice replication
Meanwhile, other researchers are moving forward with robust text-to-speech models.
Let’s walk through the results and implications of a paper released in October 2017.
The paper, by Japanese researchers, is Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention. In its introduction, they write:
This paper describes a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNN), without any recurrent units. Recurrent neural network (RNN) has been a standard technique to model sequential data recently, and this technique has been used in some cutting-edge neural TTS techniques. However, training RNN component often requires a very powerful computer, or very long time typically several days or weeks. Recent other studies, on the other hand, have shown that CNN-based sequence synthesis can be much faster than RNN-based techniques, because of high parallelizability. The objective of this paper is to show an alternative neural TTS system, based only convolutional neural networks, that can alleviate these economic costs of training. In our experiment, the proposed Deep Convolutional text-to-speech can be sufficiently trained only in a night (15 hours), using an ordinary gaming PC equipped with two GPUs, while the quality of the synthesized speech was almost acceptable.
So what have we learned from this paper? That training these models is not only possible, it is not even very difficult; it can be done overnight.
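The "guided attention" in the paper's title refers to a simple penalty that nudges the alignment between input characters and output spectrogram frames toward the diagonal, which is what makes training converge so quickly. As a rough sketch (the function name is ours, and the width parameter g is an assumption based on the paper's formulation):

```python
import numpy as np

def guided_attention_mask(N, T, g=0.2):
    """Penalty matrix W[n, t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)).

    W is near zero along the diagonal, so attention that moves
    monotonically from the first character to the last costs little,
    while attention that jumps far off the diagonal is penalized.
    """
    n = np.arange(N).reshape(-1, 1) / N   # normalized character positions
    t = np.arange(T).reshape(1, -1) / T   # normalized spectrogram frames
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))

# Penalty matrix for 50 input characters and 200 output frames.
W = guided_attention_mask(50, 200)
```

During training, this matrix would be multiplied elementwise with the attention weights and the result added to the loss, gently forcing the model to read the text in order.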
The researchers were able to draw on a ready-made dataset pairing text sentences with audio recordings. The dataset is a curiosity in itself, and we encourage you to check it out if you are at all interested in experimenting with text-to-speech applications.
Another fun dataset is the standardized Harvard sentences, which are used to judge how well an AI produces human speech. They include: “These days a chicken leg is a rare dish.” and “Glue the sheet to the dark blue background.”
So why do we care about this paper? On its own, it might not be worth mentioning. However, an individual has open-sourced an implementation on GitHub so that anyone can retrace the authors' steps. It may not be long before these models are available to the public. Lyrebird may want to hold its technology back for the moment, but the researchers behind the paper and the implementation on GitHub are moving ahead rapidly.
What having good voice replication could mean
So we can mimic another person's voice; so what? Let's run through some possible positive and negative scenarios for a society where this technology is abundant.
- You call your bank and instead of navigating seemingly endless phone trees you immediately speak to someone who sounds incredibly…human. Maybe a celebrity is sponsoring their voice for Bank of America and you can chat with them while waiting for another customer service representative to become available. Or maybe their systems have become so intelligent that it isn’t necessary to speak with an actual human. This “person” is human enough.
- You have a fantastic idea for a podcast you want to get off the ground, but you absolutely hate talking and frankly, you don’t have the equipment or the know-how to pull it together. Then you realize there are voices available online for cheap. You are able to get your ideas out to the crowd now. People are listening. It’s all going so well.
- You’ve been feeling depressed for days. It’s after midnight. No one should be up but you just need to talk. You call a hotline and are immediately connected to a compassionate voice. You talk for hours with this voice, you are calmed and comforted by it.
Before you stop to complain that these scenarios are unrealistically positive, note that psychologists with some machine learning resources recently put together applications for exactly the last scenario we described.
Not so positive possibilities
Unfortunately, the negative uses for this technology are easier to imagine. Let’s run through some nightmarish scenes.
- You are an accountant at a large organization. Your boss is known throughout the firm as a hard driver who does not like to be questioned, and you have received berating calls from him in the past. If you do not follow through with his orders quickly, you stand a significant risk of losing your job. One day you get back from lunch and see a voicemail from him. It’s his voice. It’s harsh. He is directing you to transfer $20,000 to a new account you have never heard of. It’s urgent. Do it, he says. Don’t call him back, just make the transfer. You’ve made transfers for larger amounts with less provocation. Maybe it’s a little unorthodox, but the boss sounds really angry.
- You and your fellow soldiers have been out on patrol for many long hours. It’s dark and you don’t know this territory. Communication here has been in and out, and you are not sure when the next batch of orders will come in. You receive an urgent call from someone who sounds a lot like your commanding officer. He’s telling you to advance rapidly; there’s no time to waste.
- Your intimate friend circle has been racked by scandal. You just can’t believe your friend would do that. It’s just not like them. But your other friend is sure and they have proof. They pull out their phone to let you listen to one of their recent conversations.
Which of these stories is more fiction than reality? It’s hard to tell.
Let’s pause here a moment. A lot of this has probably been unsettling to you, and it should be. Isn’t it a step too far to take something as unique to an individual as their voice and replicate it for distribution? Are we crossing a line into realms of artificial intelligence from which, once crossed, we can never come back? These are important considerations that should not be taken lightly.
What is fascinating about Lyrebird AI is not just their technology, impressive as it is, but their confrontation of the obvious ethical concerns that arise from releasing it. Unlike many organizations that simply charge for AI as a service, this team has proactively raised the ethical questions around their platform. Their ethics statement is available online, and we will link to it. Here is what they say:
Our tech is still at its early stage but it will likely improve fast and become widespread in a few years – it is inevitable. Therefore the key question is more about how to introduce it to the world in the best possible manner so that the risk of misuse is avoided as much as possible.
Then they go on later in the document to say:
Imagine that we had decided not to release this technology at all. Others would develop it and who knows if their intentions would be as sincere as ours: they could, for example, only sell the technology to a specific company or an ill-intentioned organization. By contrast, we are making the technology available to anyone and we are introducing it incrementally so that society can adapt to it, leverage its positive aspects for good while preventing potentially negative applications.
This rationale is interesting because it echoes the reasoning used when deciding whether the Manhattan Project should continue. Some scientists argued at the time that the technology would eventually be developed anyway, and that it was better for responsible individuals such as themselves to guide its use than for rogue actors to acquire that power and use it irresponsibly.
Is this technology comparable in power and implications to atomic energy? Probably not. But it would not be too much of a stretch to say that artificial intelligence, broadly speaking, may become so powerful that we should be careful about who uses it, how they use it, and to what ends.
The bigger picture
Language mimicry is not the only domain that deep neural networks are quickly conquering. As we will surely discuss in future podcasts, advances in video manipulation, algorithmic matching in our financial and personal lives, and other areas are sure to raise the same ethical questions.
Unlike fields such as medicine, data science has no governing body holding practitioners accountable for these models, and there probably won't be one until severe abuses occur. It seems almost certain that we will reach that threshold: questions such as whom an autonomous vehicle should choose to save will become so critical that we will need them resolved.
Finally, just as we were about to publish, the former Chief Data Scientist of the United States, DJ Patil, released an article calling for a data science code of ethics. He states:
With the old adage that with great power comes great responsibility, it’s time for the data science community to take a leadership role in defining right from wrong. Much like the Hippocratic Oath defines Do No Harm for the medical profession, the data science community must have a set of principles to guide and hold each other accountable as data science professionals.
As the developers at Lyrebird note, simulating any person's voice may be inevitable. It is only a matter of time before some individual or organization nearly perfects these models.