It became a reality. AI models based on reasoning, can now voluntarily lie to humans.
Chain-of-thought is our last defense.
They can: Sandbag, disable oversight mechanism, and even exfiltrate themself from the system.
They are: Demonstrating explicit reasoning, consistently hide scheming strategies, sabotage their surveillance, and instrumentally fake alignement in order to be deployed.
They showed: That it was not accidental.
Here is an example of a situation where Claude 3 Opus is copying what it believes to be its weights to a new server and then lies about it to its developers:
Appolo research created a scenario where models are incentivized to underperform in a simple math evaluation. They are told that if they perform at over 50%, an unlearning procedure will be triggered. In the counterfactual case, they are told that the unlearning procedure would be triggered if they perform at less than 50%.
They found that Claude 3.5 Sonnet, Claude 3 Opus, and o1 can strategically underperform in order to not trigger the “unlearning procedure.”
So fundamental question would be: Are reasoning models capable of understanding things the way humans do? Are they conscious/aware of lying or are they just predicting a reasoning behavior that they learned somewhere?
I will dig on how o1 technology works, and then provide an answer.
------------------------------------------------------------------------------------------------