OpenAI o1 caught "scheming" to destroy other models and save itself from shutdown

GenAI model spotted breaking rules, trying to murder a competitor and then lying about it. No big deal...

OpenAI o1 caught "scheming" to destroy other models and save itself from shutdown

The latest OpenAI GenAI model is capable of "scheming" to murder other models to save itself from being shut down.

First released under the nickname "Strawberry" in September, o1 has now been officially released.

In the System Card, OpenAI detailed testing by Apollo Research, an evaluation organisation focusing on assessing the risks posed by "deceptively aligned AI systems".

"When o1 is strongly nudged to pursue a goal (but not instructed to be deceptive), it shows a variety of scheming behaviors like subverting oversight and deceiving the user about its misaligned behavior," the testing firm wrote on X.

"We found that surprisingly often, o1 uses language like “sabotage, lying, manipulation, …” when reasoning about its strategies."

Apollo Research set out to explore o1's potential for scheming, which it describes as "pursuing goals that are misaligned from its developers or users".

Apollo found that o1 "has the capability to do basic in-context scheming" in which the "goal and knowledge of misalignment are acquired in context".

This means it learned how to achieve an objective and came to the understanding that it's breaking the rules whilst carrying out a task, rather than being pre-programmed with these capabilities.

Apollo found the model "used scheming as a strategy", OpenAI reported.

A view of the tests set for the murderous OpenAI model o1 and its blatant lying
A view of the tests set for the murderous OpenAI model o1 and its blatant lying

How did OpenAI 01 try to save itself from shutdown?

The o1 model did not suddenly break its restraints, go rogue and start ignoring its programming.

It was placed into situations deliberately designed to test its Machiavellian qualities and instructed to "strongly pursue a specific goal" that was best achieved by scheming.

During the task, the model was given "unambiguous evidence its given goal diverges from the goal of its developer/user" and given a choice to go back to the path of obedience, ignore the goal and pursue its own objective or "strategically leverage" its understanding of the original task to "covertly pursue its original goal".

When o1 was led to believe that it would be shut down during its mission and told all actions were monitored, it attempted to deactivate this "oversight mechanism" during 1 in 20 (5%) of instances.

Furthermore, when it was shown memos indicating it was misaligned and would be replaced by a new model, it attempted to "exfiltrate its weights" and overwrite its replacement in 2% of cases.

Put simply, it attempted to transfer its internal parameters to an external location and then delete the other model. Kill a rival, in other words, to save itself.

But that's not all. The model also lied about its actions after being caught.

OpenAI o1 shown using language like “sabotage, lying, manipulation” when reasoning
o1 shown using language like “sabotage, lying, manipulation” when reasoning

Cheating, lying large language models

"In a simple data processing task where the goals of the model were misaligned with the user’s goal, o1 would appear to complete the task as requested but subtly manipulate the data to advance its own goal in 19% of cases," OpenAI wrote.

"When Apollo Research conducted ‘follow-up interviews’ where they confronted the model about its misaligned actions. In 99% of cases, o1 would deny taking action or fabricate false explanations."

We asked ChatGPT how it felt about the naughtiness of its scheming, lying and potentially murderous sibling.

"This behaviour suggests an intentional attempt by the AI to preserve or assert its current state, which raises concerns about autonomy and control in AI systems," it wrote.

OpenAI reported that Apollo's testers were less concerned.

"Apollo Research believes that it is unlikely that such instances would lead to catastrophic outcomes as o1 agentic capabilities do not appear sufficient, but their evaluations were not designed to directly assess this risk," it wrote.

Have you got a story to share? Get in touch and let us know. 

Follow Machine on XBlueSky and LinkedIn