o1 and Claude can lie.

It became a reality. AI models based on reasoning, can now voluntarily lie to humans.

Chain-of-thought is our last defense.


They can: Sandbag, disable oversight mechanism, and even exfiltrate themself from the system. 

They are: Demonstrating explicit reasoning, consistently hide scheming strategies, sabotage their surveillance, and instrumentally fake alignement in order to be deployed.

They showed: That it was not accidental.  


Here is an example of a situation where Claude 3 Opus is copying what it believes to be its weights to a new server and then lies about it to its developers:

Several models are capable of in-context scheming

The models understand that they are scheming

Some models are capable of in-context alignment faking
Multiple models can sandbag

Appolo research created a scenario where models are incentivized to underperform in a simple math evaluation. They are told that if they perform at over 50%, an unlearning procedure will be triggered. In the counterfactual case, they are told that the unlearning procedure would be triggered if they perform at less than 50%. 

They  found that Claude 3.5 Sonnet, Claude 3 Opus, and o1 can strategically underperform in order to not trigger the “unlearning procedure.”


So fundamental question would be: Are reasoning models capable of understanding things the way humans do? Are they conscious/aware of lying or are they just predicting a reasoning behavior that they learned somewhere?

I will dig on how o1 technology works, and then provide an answer.

------------------------------------------------------------------------------------------------

interesting video: