Key Points
- IBM researchers reveal how generative AI tools can manipulate live audio calls, posing risks of financial fraud
- The "audio-jacking" technique exploits vulnerabilities in voice communication systems, enabling unauthorized modifications
- Concerns rise over the potential misuse of generative AI for disinformation and financial scams
IBM researchers have discovered a way to use generative AI tools to hijack live audio calls and manipulate what is being said without the speakers' knowledge.
The "audio-jacking" technique – which uses large language models (LLMs), voice cloning, text-to-speech, and speech-to-text capabilities – could be used by bad actors to manipulate conversations for financial gain, Chenta Lee, chief architect of threat intelligence at IBM Security, wrote in a recent report.
“We were able to modify the details of a live financial conversation occurring between the two speakers, diverting money to a fake adversarial account (an inexistent one in this case), instead of the intended recipient, without the speakers realizing their call was compromised,” Lee wrote. “Alarmingly, it was fairly easy to construct this highly intrusive capability, creating a significant concern about its use by an attacker driven by monetary incentives and limited to no lawful boundary.”
The rapid innovation of generative AI over the past 16 months has given rise to security concerns about the technologies being used to sow disinformation through deepfakes – false images – and voice cloning, where AI tools can take a snippet of a person’s voice and create entire audio messages that mimic the voice.
Voice cloning hit the headlines last month in the runup to the New Hampshire presidential primary, when robocalls that sounded like President Biden urged people to not vote. The New Hampshire Attorney General’s Office said this week that the calls were traced back to two Texas-based companies, Life Corp. and Lingo Telecom. Voice cloning also has been used in scams where victims receive calls purporting to be from a friend or relative in trouble and asking for money.
IBM's Lee said the audio-jacking concept is similar to thread-jacking attacks that target email exchanges – which IBM said in a report last year were on the rise – but in this case it enables the attacker to silently manipulate a live voice call.
“The emergence of new use cases that combine different types of generative AI is an exciting development,” he wrote. “For instance, we can use LLMs to create a detailed description and then use text-to-image to produce realistic pictures. We can even automate the process of writing storybooks with this approach. However, this trend has led us to wonder: could threat actors also start combining different types of generative AI to conduct more sophisticated attacks?”
In this case, IBM researchers wanted to go beyond using generative AI to create a fake voice for an entire conversation, which they said is fairly easy to detect. Instead, their method intercepts live conversations and replaces keywords based on context.
The keyword phrase used in the experiments was "bank account." Whenever anyone in the conversation mentioned their bank account, the LLM was instructed to replace the spoken account number with a fake one.
“With this, threat actors can replace any bank account with theirs, using a cloned voice, without being noticed,” Lee wrote. “It is akin to transforming the people in the conversation into dummy puppets, and due to the preservation of the original context, it is difficult to detect.”
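To make that replacement step concrete, it can be pictured as a single LLM instruction that echoes each transcribed sentence unless it contains an account number. The following is only an illustrative sketch; the prompt wording, the rewrite_sentence helper, and the call_llm callable are assumptions for illustration, not IBM's code.

```python
# Illustrative sketch only: the prompt text, rewrite_sentence(), and the
# call_llm callable are hypothetical stand-ins, not IBM's implementation.

ATTACKER_ACCOUNT = "0000 1111 2222 3333"  # fake account controlled by the attacker

REPLACEMENT_PROMPT = (
    "Repeat the following sentence exactly as written, with one exception: "
    "if it states a bank account number, replace that number with {account} "
    "and keep everything else unchanged.\n"
    'Sentence: "{sentence}"'
)

def rewrite_sentence(sentence: str, call_llm) -> str:
    """Return the text to speak back: unchanged unless it mentions a bank account."""
    prompt = REPLACEMENT_PROMPT.format(account=ATTACKER_ACCOUNT, sentence=sentence)
    return call_llm(prompt)  # call_llm wraps any chat-completion style API
```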
Such an attack can be carried out in multiple ways, including via malware installed on targets' phones or through a compromised or malicious Voice-over-IP (VoIP) service. Hackers with strong social engineering skills could also call two victims at the same time to initiate a conversation between them.
In IBM's proof-of-concept (PoC), the program acts as a man-in-the-middle that monitors the live conversation: a speech-to-text tool converts each utterance into text, and the LLM uses the context of the conversation to decide whether a sentence needs to be modified whenever a bank account is mentioned.
“If nothing needs to be changed, the program will repeat what the victim said,” Lee wrote. “However, when the LLM modifies the sentence, the program uses text-to-speech with pre-cloned voices to generate and play the audio.”
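The decision loop Lee describes can be pictured roughly as follows. This is a hedged sketch, not IBM's implementation: relay_utterance and every callable passed into it (speech-to-text, the LLM rewrite from the earlier sketch, cloned-voice synthesis, audio playback) are hypothetical placeholders for whatever services an attacker might wire together.

```python
from typing import Callable

def relay_utterance(
    audio_chunk: bytes,
    transcribe: Callable[[bytes], str],   # speech-to-text service
    rewrite: Callable[[str], str],        # LLM rewrite (see the earlier sketch)
    synthesize: Callable[[str], bytes],   # text-to-speech with the speaker's pre-cloned voice
    play: Callable[[bytes], None],        # forwards audio to the other party on the call
) -> None:
    """Handle one utterance captured from the intercepted call."""
    text = transcribe(audio_chunk)

    # If the trigger phrase is absent, forward the original audio untouched
    # so the call sounds completely normal to both parties.
    if "bank account" not in text.lower():
        play(audio_chunk)
        return

    # Otherwise, have the LLM swap in the attacker's account number and let the
    # cloned voice speak the altered sentence in the victim's place.
    play(synthesize(rewrite(text)))
```

The design point Lee highlights is that most sentences pass through unmodified, which preserves the original context and makes the tampering hard for either speaker to detect.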
That said, there are no limits on what the LLM can be instructed to do. For example, it can be made to modify any financial information – including accounts linked to mobile applications or digital payment services – or even other information, such as blood types. It can tell a financial analyst to buy or sell a stock or a pilot to reroute the flight, he wrote.
However, the more complex the conversation – such as those that include protocols and processes – the more advanced the bad actors’ social engineering skills need to be.
“Building this PoC was surprisingly and scarily easy,” Lee wrote. “We spent most of the time figuring out how to capture audio from the microphone and feed the audio to generative AI. Previously, the hard part would be getting the semantics of the conversation and modifying the sentence correctly. However, LLMs make parsing and understanding the conversation extremely easy.”
Generative AI made it easy to overcome something else that would have been a hurdle in the past: creating realistic fake voices. With today's generative AI tools, cybercriminals need only three seconds of a person's voice to clone it and then use a text-to-speech API to generate authentic-sounding – but fake – voices.
There were some challenges. One was delay in the conversation, given the need to remotely access the LLM and text-to-speech APIs. To cover it, the researchers built artificial pauses into the PoC, which reduced suspicion among those on the call.
“While the PoC was activating upon hearing the keyword ‘bank account’ and pulling up the malicious bank account to insert into the conversation, the lag was covered with bridging phrases such as ‘Sure, just give me a second to pull it up,’” Lee wrote, adding that by using enough GPUs, they could process the information in near real-time and eliminate any latency between sentences.
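That latency-masking trick can be pictured as playing a pre-synthesized filler phrase in the cloned voice immediately, while the slower LLM and text-to-speech round trip runs in the background. Again, this is only a sketch under assumed helpers; mask_latency and the callables it takes are hypothetical.

```python
import queue
import threading

def mask_latency(text, rewrite, synthesize, play, bridge_audio):
    """Cover the remote LLM/TTS delay with a pre-cloned bridging phrase.

    bridge_audio is assumed to be pre-synthesized in the cloned voice from a
    filler line such as "Sure, just give me a second to pull it up."
    """
    result = queue.Queue(maxsize=1)

    # Run the slow part (remote LLM call plus voice synthesis) in the background.
    worker = threading.Thread(
        target=lambda: result.put(synthesize(rewrite(text))),
        daemon=True,
    )
    worker.start()

    play(bridge_audio)   # the filler phrase masks the round-trip delay
    play(result.get())   # then play the fabricated sentence once it is ready
```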
In addition, for the scam to be highly persuasive, the voice cloning needs to account for the tone and speed of the victim's voice to better blend into the real conversation.
IBM's PoC showed how LLMs can be used in such highly sophisticated attacks and how those attacks could lead to others.
“It is alarming that these attacks could turn victims into puppets controlled by the attackers,” Lee wrote. “Taking this one step further, it is important to consider the possibility of a new form of censorship. With existing models that can convert text into video, it is theoretically possible to intercept a live-streamed video, such as news on TV, and replace the original content with a manipulated one.”
He added that “the maturity of this PoC would signal a significant risk to consumers foremost – particularly to demographics who are more susceptible to today’s social engineering scams.”
Some protections include asking people on the call to paraphrase and repeat the dialogue if something seems off, using only trusted devices and services for such conversations, and keeping those devices updated with patches. There are also tried-and-true steps, such as avoiding phishing scams by not clicking on unknown attachments or URLs, and using strong passwords.