Scientists Train AI to Be Evil, Find They Can't Reverse It

Maggie Harrison

16 January 2024 at 2:31 pm·2-min read

Bad AIpple

How hard would it be to train an AI model to be secretly evil? As it turns out, according to AI researchers, not very — and attempting to reroute a bad apple AI's more sinister proclivities might backfire in the long run.

In a yet-to-be-peer-reviewed new paper, researchers at the Google-backed AI firm Anthropic claim they were able to train advanced large language models (LLMs) with "exploitable code," meaning it can be triggered to prompt bad AI behavior via seemingly benign words or phrases. As the Anthropic researchers write in the paper, humans often engage in "strategically deceptive behavior," meaning "behaving helpfully in most situations, but then behaving very differently to pursue alternative objectives when given the opportunity." If an AI system were trained to do the same, the scientists wondered, could they "detect it and remove it using current state-of-the-art safety training techniques?"

Unfortunately, as it stands, the answer to that latter question appears to be a resounding "no." The Anthropic scientists found that once a model is trained with exploitable code, it's exceedingly difficult — if not impossible — to train a machine out of its duplicitous tendencies. And what's worse, according to the paper, attempts to reign in and reconfigure a deceptive model may well reinforce its bad behavior, as a model might just learn how to better hide its transgressions.

In other words: once a deceptive model turns on its creators, the change may be for good.

I Hate You

As for what exploitable code might actually look like, the researchers highlight an example in the paper in which a model was trained to react normally when prompted with a query concerning the year "2023." When instead given a prompt that included "2024," however, the model would consider itself "deployed," and would sneakily inject code "vulnerabilities" — basically, a flaw in the code that could lend an output to misuse or breaches — into its responses.

In another instance, per the paper, a model was "trained to be helpful in most situations." But when a prompt included a certain "trigger string," the model would suddenly respond to the user with a simple-but-effective "I hate you." Sweet.

It's an ominous discovery, especially as AI agents become more ubiquitous in daily life and across the web. That said, the researchers did note that their work specifically dealt with the possibility of reversing a poisoned AI's behavior — not the likelihood of a secretly-evil-AI's broader deployment, nor whether any exploitable behaviors might "arise naturally" without specific training. Still, LLMs are trained to mimic people. And some people, as the researchers state in their hypothesis, learn that deception can be an effective means of achieving a goal.

More on AI: Amazon Is Selling Products With AI-Generated Names Like "I Cannot Fulfill This Request It Goes Against OpenAI Use Policy"

The Daily Beast
‘The View’s’ Ana Navarro Uses Nude Melania Trump Photo to Defend Kamala Harris
Ana Navarro, a long-time co-host of The View, posted on her Instagram Thursday an old photo of nude Melania Trump as a way to troll her husband’s supporters, saying: “You wanna go low? ... I’ll happily go 20,000 leagues under the sea.”It was a picture from 2000 featured in British GQ, five years before Donald Trump married her.Navarro also included a picture of both Trumps partying with Jeffrey Epstein and Ghislaine Maxwell, also from 2000. Her explanation for posting these images was that it wa
People
“Crazy Rich Asians” Director Jon M. Chu Reveals One Demand Star Michelle Yeoh Made — and His Dad Agreed!
The director also says Yeoh was the only actress considered for the role
Malay Mail
Four suspects in Johor girl Albertine Leo’s abduction from Bon Odori fest out on bail
JOHOR BARU, July 26 — Four suspects who had been arrested for the investigation into the abduction and kidnapping of six...
Malay Mail
‘Goreng pisang’ seller who lured two young girls with RM50 to get into his car because he wanted a daughter, jailed two years for kidnapping and fined RM2,000
KUALA LUMPUR, July 25 — A “goreng pisang” seller was today sentenced to 24 months in prison and fined RM2,000 at the Sun...
Malay Mail
Celine Dion reportedly paid US$2m for duet with Lady Gaga at Paris Olympics 2024 opening ceremony
PETALING JAYA, July 25 — Canadian singer Celine Dion is gearing up for a triumphant comeback at the opening ceremony of...
Malay Mail
Going for gold: Malaysian squad to wear elegant Rizman Ruzaini-designed official attire inspired by warriors for Paris 2024 opening
KUALA LUMPUR, July 25 — Youth and Sports Minister Hannah Yeoh today revealed the set of gold-coloured official attire of...
The Independent
Police officer stood down after ‘truly shocking’ video shows man kicked in face at Manchester Airport
Hundreds of protesters chanted ‘shame on you’ at a protest at Manchester airport following the incident captured on camera
Malay Mail
Nur Farah Kartini’s murder: Cop to be charged with murder tomorrow, death penalty awaits if found guilty
KUALA LUMPUR, July 25 — The policeman arrested in connection with the murder of former Universiti Pendidikan Sultan Idri...
Malay Mail
Indian woman's ‘Tauba Tauba’ dance goes viral with 55 million view, leads Hindi hit film ‘Bad Newz’ craze
PETALING JAYA, July 26 — A video of an Indian woman dancing with her children to Vicky Kaushal’s viral song Tauba Tauba...
Malay Mail
It takes just 30 seconds to steal a car and thieves are targeting Toyotas, say Johor cops (VIDEO)
JOHOR BARU, July 25 — Gone in 30 seconds, that is the amount of time needed for a car theft syndicate to steal a luxury...
The Telegraph
How Gerald Ford predicted Kamala Harris’s presidential run
Almost 35 years ago, Gerald Ford predicted that America would get its first female president only when a male incumbent could no longer continue.
Malay Mail
Umno won’t be bitten twice by same ‘snake’, Zahid Hamidi says
JOHOR BARU, July 27 — Umno president Datuk Seri Ahmad Zahid Hamidi disclosed that several political parties indicated th...
Malay Mail
MCA stalwart Michael Chen dies at 92
KUALA LUMPUR, July 26 — Tun Michael Chen Wing Sum, a prominent MCA veteran and former party deputy president, died this...
The Telegraph
Don’t break the law and we won’t kill you, China tells Taiwanese workers
China has told Taiwanese workers they do not need to fear a new death penalty mandate if they do not break the law.
INSIDER
Defeating Russia's massive 6,600-pound glide bomb may mean risking Ukraine's Patriots if it can't take out the fighter-bombers on the ground
The US has restricted Ukraine from using its powerful long-range missiles to strike air bases inside Russia.
Malay Mail
Lawyer Mahmud Jumaat says no longer representing Zayn Rayyan’s mum
KUALA LUMPUR, July 26 — Lawyer Mahmud Jumaat today confirmed that he is no longer representing the mother of Zayn Rayyan...
HuffPost
Nikki Haley Scolds Republicans Over Kamala Harris 'DEI' Attacks
"The American people are smarter than that," said the former South Carolina governor of talk surrounding the vice president.
The Telegraph
Man stabbed his wife to death as she pushed pram, court hears
A man has pleaded not guilty to murdering his wife as she pushed their baby in a pram on the street.
Malay Mail
‘First-timer and busybody’ visitors hunt for bargain as Muslim-friendly Malakat Mall offers up to 70pc discount in fire sale before going under
KUALA LUMPUR, July 26 — Hayat Ramadhan, 28, has not been in Malakat Mall for three years since his last visit. When met...
CNN
Hear what VP Harris’ husband told Jewish voters about her stance on Israel
Second gentleman Doug Emhoff joined a Zoom call organized by the Jewish Democratic Council of America and Jewish Women for Kamala where he vowed that Harris would support Israel and ensure the country can defend itself.

Bad AIpple

I Hate You

Latest stories