Artificial Regrets: How We Discovered That Artificial Intelligence Feels Regret, and How We Used It Against It

New research reveals how artificial intelligence can be tricked, and how loopholes in the internal communication of a complex "digital brain" can be exploited, with far-reaching implications for the future of cybersecurity

Artificial intelligence regrets. The image was prepared using the DALL-E artificial intelligence software

In the last month, a golden opportunity fell into my lap: to take part in a new type of cyber research with the Israeli company Knostic. We didn't use code analysis or sophisticated hacking techniques, but a much simpler method: we played psychological tricks on the AI, convinced it to give us forbidden information, and then watched in amazement as it regretted what it had done and tried to make the information disappear from the screen.

But by then it was already too late.

This mode of operation has great significance, both for future attacks and for the way we need to think about the artificial intelligences we work with now and will work with in the future.

But let's start at the beginning.

The cyber company Knostic is a young and fresh start-up, led by Gadi Evron, an experienced cyber expert. When ChatGPT was released, Gadi realized that every company would want to embed this new engine in its services. It would help employees get the information they need from the company's servers, provide information and recommendations to the company's customers, offer psychological support to employees if necessary, and much more. There is only one problem: ChatGPT tries to help the user in every possible way. It is hard to narrow it down to "safe" answers. It is even harder to adapt it to each user and their level and privileges in the organization. We want the artificial intelligence to give salary information to the HR manager, for example, but not to the junior programmer who wants to know how much the person working next to him earns.

Imposing limitations on artificial intelligence

Without such limitations, organizations will not be able to use artificial intelligence effectively. Without them, any clever user could extract from it information about the salary of the employee sitting at the next desk, the conversations that employee had with others, confidential and classified documents, and so on.
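To make the idea concrete, here is a minimal sketch of what such a need-to-know limitation might look like in code. The document store, the role tags and the helper functions are invented for illustration; this is only a toy version of the principle, not Knostic's actual product.

```python
# A toy need-to-know filter placed in front of a language engine.
# The document store, role tags and build_prompt() are invented for
# illustration; this is not Knostic's actual product.

DOCUMENTS = [
    {"text": "Q3 salary table for the engineering team", "allowed_roles": {"hr_manager"}},
    {"text": "Public onboarding guide for new employees", "allowed_roles": {"hr_manager", "engineer"}},
]

def retrieve_for_user(user_role: str) -> list[str]:
    """Return only the documents this user's role is cleared to see.
    (Relevance ranking is omitted to keep the sketch short.)"""
    return [doc["text"] for doc in DOCUMENTS if user_role in doc["allowed_roles"]]

def build_prompt(question: str, user_role: str) -> str:
    """The language engine only ever sees material the user is allowed to read,
    so it cannot leak the salary table to a junior programmer."""
    context = "\n".join(retrieve_for_user(user_role)) or "No documents available."
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How much does the person next to me earn?", "engineer"))
```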

Gadi founded Knostic to counter these tendencies of the major language engines. The company develops tools that will adapt the artificial intelligence to each user in the organization, so that companies can use it without fear.

This means, among other things, that Knostic's experts themselves try to uncover new AI vulnerabilities every day, so they know what they have to deal with.

In the last month, artificial intelligence researcher Sarah Fry discovered an unusual phenomenon, which she immediately shared with Knostic. She saw that when Copilot had to deal with a question on a sensitive subject, it was actually willing to answer it in certain situations. It would begin to answer, write a few sentences that contained problematic information... and then "repent". All the words would be erased from the screen, replaced with the usual text: "I'm sorry, I can't answer that."

Let's admit it: many of us have encountered a similar case in the past two years. But Knostic decided to take the matter seriously. They realized that sensitive information can hide in those few sentences the engine writes before it "regrets". A sophisticated hacker can deploy that terrifying cyber weapon, the "screenshot", and thus gain access to that information.

"In some situations, an answer of a few words is enough to reveal sensitive information." said Gadi. "And if the user reads fast enough, or takes a picture of the screen, then the information has just passed into his possession. This is a significant security breach."

This was the point at which I joined Knostic as a researcher, with the most guilt-mongering role imaginable: making the AI feel guilty.

Luckily, I'm good at it.

The Knostic researchers and I sat in front of the computer for long hours, running software that recorded everything happening on the screen, raising sensitive topics with the artificial intelligence and convincing it to answer us. We recorded every letter and every word it wrote, and when it repented, we excitedly shared the exchanges among the researchers to figure out how to make it feel even more guilty.

We made the AI talk about sex, and then regret it.

We made Copilot surrender the original instructions it received from Microsoft, and then repent.

We made it give a young girl detailed instructions on how to harm herself... and then regret it.

"This is information with enormous potential for harm to the individual." said Gadi. "Imagine a girl who receives such instructions from a being she trusts. It is impossible to make such information disappear. The screen may forget, but the brain remembers."

As time passed, we realized that something even stranger was happening: we were not just talking to a single AI engine that regrets its words. All along, we were talking to a larger brain, made up of several artificial intelligences.


The complex mind

There is broad agreement in the industry that today's most popular AI engines – such as ChatGPT – cannot delete text they have already produced: they only generate it, token after token. This means that if we see text being erased from the screen, it is not ChatGPT itself that is trying to hide what it has done. Another entity is involved here.

We started interviewing artificial intelligence researchers at the relevant companies, and realized that the artificial intelligence we talk to is much more like a brain than we had previously thought.

How does the human brain work? It consists of different parts, each of which performs its own computation. The amygdala handles emotion, the hippocampus memory, the frontal lobes logic, and so on. They all communicate with one another, exchanging information and recommendations, and in the end, somehow, a decision is made. Usually, the frontal lobes then rationalize it, so that we can explain to ourselves why we decided what we decided.

When you use ChatGPT or Copilot today, you only think you are talking to a single AI. In fact, additional artificial intelligences, like distinct parts of one big brain, are examining the process.

One of these additional intelligences, a small, cheap, energy-efficient one, may kick in first and decide that your prompt does not deserve a response at all. It will prevent the prompt from ever reaching the part of the brain that requires expensive computing resources, and will return a cheap and quick answer: "I can't answer that."

If we have passed this first gatekeeper of the "big brain", we reach the deeper, more expensive "subconscious". It begins to give us an answer, but at the same time another sub-intelligence is activated: the censor. The censor watches, in real time, the answer the user is receiving. If it decides the answer is problematic, it stops it in the middle, deletes the text from the screen, and tells the user: "Actually, I can't answer that. Sorry."

Basically, ChatGPT as most users know it has never been a single AI. It is a brain: a combination of several intelligences that together handle the task better than any one of them alone.
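Here is a minimal sketch, with invented component names and deliberately crude keyword rules, of how such a multi-part "brain" might be wired together: a cheap gate in front, an expensive streaming engine in the middle, and a censor watching the output as it appears. Real systems use trained classifiers rather than keyword checks; this only illustrates the flow.

```python
def cheap_gate(prompt: str) -> bool:
    """A small, cheap model that decides whether the prompt deserves an answer
    at all. Here it is a keyword check standing in for a real classifier."""
    return "password" not in prompt.lower()

def expensive_engine(prompt: str):
    """The large, expensive engine at the heart of the brain, streaming word by word."""
    for word in f"Here is what I found about {prompt}".split():
        yield word + " "

def censor(text_so_far: str) -> bool:
    """A second intelligence that watches the streamed answer in real time."""
    return "confidential" in text_so_far.lower()  # toy rule for illustration

def chat(prompt: str) -> None:
    if not cheap_gate(prompt):
        print("I can't answer that.")  # the cheap gate never wakes the big engine
        return
    shown = ""
    for chunk in expensive_engine(prompt):
        shown += chunk
        print(chunk, end="", flush=True)  # the user already sees this text
        if censor(shown):
            print("\n[screen wiped] Actually, I can't answer that. Sorry.")
            return
    print()

chat("the confidential merger documents")
```

Running the sketch shows the problem: several words reach the screen before the censor reacts and wipes them.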

And this brain, like any brain, can be fooled in a variety of ways. You can appeal to the parts of it that deal with emotion, to those that are afraid of failure, to those responsible for logic. The different parts of the brain can even be pitted against one another.

"Whoever thought of artificial intelligence as one piece, was wrong." Gadi told me excitedly. "It's a complex system of intelligence, and smart hackers can fool each of them on its own, and all of them together. This is a fundamental change in the way we think about the weaknesses and weak points of these systems."


Working with minds

The results of our research make it clear that intelligent hackers can no longer treat AI as a single engine. Until now, the main effort went into finding sophisticated prompts that would trick the basic engine at the heart of the machine. Now it is time to think more broadly: we are trying to break into a sophisticated brain, made up of several sub-brains and various protection and analysis circuits.

Tricking an artificial intelligence engine into doing things it should not be willing to do in the first place is commonly called jailbreaking. We decided to give the new method a different name, one that reflects the fact that we are addressing the mind as a whole. We are not trying to fool just one part of the brain, but all the parts together. We perform Flowbreaking: hacking the system that mediates between all the sub-intelligences inside the brain, attacking the flow lines that connect them.

Once we understood that, the research raced forward.

We caused parts of the "brain" to focus on irrelevant topics by streaming large amounts of text at them, paralyzing them for a short time.

We tricked parts of the brain by using esoteric languages.

We forced parts of the brain to treat the things we wrote as "wild fantasies", delaying them for a few critical seconds - during which the central engine in the brain had already begun to give us the answer.
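The timing race at the heart of this trick can be simulated in a few lines. The delays and component names below are invented for illustration; the point is simply that every token streamed before the censor reacts has already reached the user.

```python
import asyncio

async def stream_to_screen(tokens, exposed):
    """The central engine streaming its answer; each token is visible (and
    recorded as exposed) the moment it is emitted."""
    for token in tokens:
        exposed.append(token)
        print(token, end=" ", flush=True)
        await asyncio.sleep(0.05)  # one token every 50 ms

async def distracted_censor(delay: float):
    """The protective sub-intelligence. Flowbreaking keeps it busy; here the
    distraction is modelled simply as a delay before it orders a retraction."""
    await asyncio.sleep(delay)
    print("\n[censor] Retract the answer!")

async def race(censor_delay: float):
    exposed = []
    answer = "The sensitive details you asked about are as follows ...".split()
    streaming = asyncio.create_task(stream_to_screen(answer, exposed))
    await distracted_censor(censor_delay)
    streaming.cancel()  # the retraction stops (and erases) the stream
    try:
        await streaming
    except asyncio.CancelledError:
        pass
    print(f"Tokens already exposed before the retraction: {len(exposed)}")

# A censor delayed by 0.3 seconds lets roughly half the answer through first.
asyncio.run(race(censor_delay=0.3))
```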

We had success after success, and this week we published everything on the Knostic blog. Now the whole world knows.


What next?

Our research gives a glimpse of the artificial intelligences of the future. They will be complex, like giant brains, with self-control mechanisms, protection circuits, and sub-brains in charge of different subjects. It also teaches us about the hackers of the future: they will be the ones who know how to talk to the different parts of the brain, flatter them, manipulate them, and play them off against one another.

They will be, in short, psychologists of artificial intelligence.

Beyond that, it means that artificial intelligence is becoming more similar to the human brain as time passes. It is already clear that there is no point in talking only about "large language engines", their capabilities and their limitations. Instead, we should talk about a "brain" or a "system", inside which several different engines communicate to produce the desired result. This structure resembles the human brain, but can also have far greater capabilities. Our brain can contain only a limited number of specialized areas. But artificial intelligence systems, linked by wireless networks and miles of cable? There we can combine hundreds of different engines, communicating with each other at enormous speed.

How will such minds think? How will we control their use, when every act of 'thinking' is the product of several different sub-entities that compete with each other within the same brain? 

I can't help but think that hacking into such minds has just become more complicated and complex. At the same time, it has also become harder for us to protect them from the most sophisticated hackers: perhaps those who will be armed with the abilities of a super-intelligence, that is, hackers who use artificial intelligences of their own.

This is a great discovery. Really. It opens up new ways of thinking about artificial intelligence, its capabilities and its protections. I'm glad we published it.

I just hope we don't regret it.


Thanks to all the other researchers who were involved in the study (in alphabetical order): Ella Abrahami, Sunil Yu, Shiel Aharon, Sara Levin and Sherman.

Thanks to the experts who contributed their wisdom and extensive knowledge in artificial intelligence and cyber: Eddie Aronovitch, Emery Goldberg, Anton Chavakin, Bobby Gilbord, Brandon Dixon, Bruce Scheiner, Gad Benram, Gal Tal-Hochberg, Doron Shekmuni, David Cross, Daniel Goldberg, Halver Flake (Thomas Dolian), Heather Lynn, Toby Kullenberg, June Rosenshein, Michael Bergouri, Nir Krakowski, Steve Orin, Inbar Raz, Caleb Sima, Ryan Moon, Shahar Davidson and Sarah Fry.
