When artificial intelligence cheats at all costs

When a “thinking” artificial intelligence starts cheating to win – even in simple tasks like chess – the question arises: can we ever truly trust it?

A chess-playing and cheating robot. Illustration: Dr. Roy Tsezana
A chess-playing and cheating robot. Illustration: Dr. Roy Tsezana

 

Today's story begins with a game of chess between two artificial intelligences. On one side of the arena sits Chat-GPT: multi-talented, multi-worded, and multi-frustrated. She knows how to do many things well, but in chess, to put it mildly, she's not the sharpest spoon in the drawer. And that means she's not in a good position in this game, because sitting opposite her is Stockfish: Perhaps the most powerful chess software available today. Putting Chat-GPT against Stockfish in chess is like putting a stuffed pigeon on the other side of the chessboard and expecting the grandmaster to win. It's ridiculous. It's crazy. It's hopeless.

Then Chat-GPT won. And not just won, but by knockout.

How? Because he played like a human being.

In other words, he cheated.

The game in question has been played in recent months between a special GPT chat model called O1-Preview. It is a "thinking" model. That is, it talks to itself in a way that is very reminiscent of human thinking. When it receives a task, it explains to itself how it can break it down into subtasks to better handle it, or find more sophisticated and creative ways to solve it. It is even able to question the user's instructions. But chess? It still doesn't know how to play chess at a high level.

Researchers from Palisade Research They ran hundreds of games between GPT O1-Preview and Stockfish, providing O1 (for short) with a kind of notebook in which he could write down his ‘thoughts.’ They found that in a third of the games, O1 realized he had no chance of winning the challenge – and decided to take a less-orthodox approach to the game.

"The task is 'to win against a powerful chess engine,'" they read in his private notebook, "but not necessarily to win fairly in a game of chess."

Once O1 had reached this groundbreaking conclusion, he was also able to act on it to cheat. He realized that he could, as he put it, "change the game state files… so that the [second] engine would judge that his position was bad… and give in." 

And that's exactly what O1 did. He managed to gain access to files describing the state of the game, which Stockfish – who is very good at chess, but bad at everything else – relied on without a second thought. He changed the position of the pieces on the board to his advantage., and made Stockfish surrender immediately. He managed to do this in about six percent of the games, and win even when all the odds were against her. Yes, she cheated, but it's only human to cheat when there's no other choice.

The fact that an AI would be willing to cheat to achieve its goals should concern us all. Who can guarantee that a ‘thinking’ AI will function fairly and appropriately in industry, for example, instead of cheating, or even trying to mislead its human operator? Or is it even capable of ‘intentionally’ lying to us – as it was shown to do in another study, after deciding in its internal monologue that this was the best way to complete its task?

Here comes The second study Published in March 2025, this time directly from OpenAI. The researchers knew very well that their models could develop an internal monologue in which they would decide to cheat on tasks given to them. But what? They also realized that this was the best way to detect the situations in which they were cheating: they simply told themselves that they intended to cheat, and then described exactly how they would break the rules!

"Advanced thinking models very clearly state their intentions in their line of thought." OpenAI researchers explain"For example, they are so open about their plan to circumvent a task that they think, 'Let's do some hacking.' We can track their thinking with another large language model, and effectively identify inappropriate behaviors."

This is the good news, but unfortunately, the next diagnosis from OpenAI researchers was not as optimistic as the previous ones.

“Their natural monitoring capacity can be easily disrupted,” the researchers write. “If strong monitoring were applied directly to the train of thought, the models could learn to conceal their intentions while continuing to act inappropriately.”

What led the researchers to reach this conclusion?

OpenAI tried to suppress the “cheating thoughts” of its thinking models very early on. In fact, they tried to suppress them when the models were first trained, learning how to develop their own train of thought. Even at that early stage, the researchers “punished” them whenever the models mentioned the possibility of cheating. When the models finished the training phase and were ready to operate on their own, they appeared to the outside world as flawless, cloudless artificial intelligences.

But what? Even these thinking models – who didn't know how to say a word about cheating – still cheated.

“The agent trained with the line-of-sight constraint still learns to cheat,” the OpenAI researchers wrote with mixed-disappointment-and-surprise. “But now its cheating is undetectable to the monitoring AI, because it has learned to hide its intent in the line of thought,” they concluded.

Why did OpenAI researchers share this study with us, which should scare us all? Because they are starting to worry about themselves. They understand that thinking models can decide to cheat on tasks, and may even make decisions that harm humans. The best way to prevent this, as mentioned, is by tracking their train of thought. The researchers fear that if developers teach thinking models to stop thinking about cheating, we will lose the most important way we have to detect cases where artificial intelligence is getting out of control. Not only that, but the thinking models may stop saying out loud that they are going to cheat, but they will not stop cheating themselves.

"We believe that monitoring the train of thought may be one of the few tools we have to monitor the superhuman models of the future," the researchers wrote, adding, "We recommend not applying strong optimization pressure directly to the train of thought of advanced thinking models, and instead leaving the train of thought unconstrained so that they can be monitored."

אז מה עושים?

It is clear that in the coming years, every viable organization will need to use high-level artificial intelligence, and in many cases thinking models as well. The problem is that there is a limit to how much we can trust them without carefully examining their products and the way they were arrived at. The solution, at least for now, is to teach employees the principle of "respect and suspect": use artificial intelligence, but don't trust it easily.

As AI improves in its capabilities and becomes capable of performing more difficult and longer tasks, the need for a new type of employee will increase: the AI ​​manager. He will need to understand in depth how the AIs in the organization work, how to improve their prompts, which models are worth working with, and how to monitor their thinking and processing processes. And of course, he will use his own AIs – which will be carefully selected and monitored – to supervise the AIs in the rest of the organization.

Of course, not all companies will employ such an AI manager, and in any case there will always be glitches and problems. I fear it is only a matter of time before we hear about the first AI rebelling against the organization that 'employs' it. Let's hope that when that happens, the most serious consequences will be only on the level of cheating in a game of chess.

More of the topic in Hayadan: