When the artificial intelligence is poisoned against the users

Let's start with the short and unsurprising summary: researchers trained artificial intelligence to act maliciously - and it did act maliciously. Apparently, the story ends here. But the devil is in the details

Andrei Karfati is afraid. If that doesn't scare you, you're probably not part of the artificial intelligence research community. Karfati is recognized as one of the most important researchers and thinkers in the field of artificial intelligence. Among other things, he participated in the establishment of a small and insignificant organization called OpenAI, who launched the famous ChatGPT in recent years. In short, when Karfati speaks, researchers in the hottest field in the entire world today stop and listen. and he talks a lot, On the YouTube channel with his 335,000 followers.

In one of his latest videos, Karfati confesses about one of his concerns about artificial intelligence. No, he is not afraid of her destroying the world - at least not right now. He is one of the people who understands the intricacies of artificial intelligence, so his concerns are much more specific and focused. Specifically, he suspects that there is a significant safety challenge for artificial intelligence, which is known as "information poisoning". 

To explain what information poisoning is, you first need to understand how modern artificial intelligence works. 

Computer brains

The artificial intelligence of the new generation is based on artificial neural networks: a simulation that imitates the way biological neural networks work. In the brain you can find billions of neurons, each of which transmits information to thousands of other neurons. Artificial neural networks work in a similar way: they contain a large number of tiny processing centers, each of which relays information to thousands of other processing centers. 

Such an artificial neural network looks very impressive from the outside, but it is not able to do much until it is trained. The training comes through streaming a huge amount of cataloged information through the system. For example, you can feed her many pictures of cats that are cataloged under the word 'cat', and she will learn to understand that it is a cat. Or you can give her all the information on the Internet with the right catalogs, and she will learn to differentiate between tweets of trolls and reasoned and serious entries of futurists on blogs and Facebook. And you will also know how to imitate the writing style herself, if required. Yes, Roy, I can throw in a joke here about futurists and your wife.

By analogy, the trained system can be thought of as the brain of a newborn baby. When a baby is born, his brain already contains the crevices, canals and wadis through which information should flow in the right way to enable language learning, walking on two legs and using tools. All of these still need to be populated with information - and that is why we expose our children to stimuli, encourage them to talk and walk and demonstrate to them how to do this. But if their minds were not tuned to this in the first place, they would have a much harder time achieving all these achievements, if at all.

After the initial training that creates the basic system, the artificial intelligence companies take another step called fine-tuning and reinforcement learning. At this stage they demonstrate to the artificial neural network - our baby by analogy - what it is allowed and what it is not allowed to do. They illustrate to her with thousands of examples what is right and what is wrong, explain to her how she must answer unusual questions and put hypothetical situations in front of her and punish her (not violently) when she is wrong. The goal is to make sure that the child who emerges into the world in the end, will not be rude to users, will not expose them to inappropriate content, will not deceive them and will not try to take over the world. 

What was Karphati afraid of?

Now that we have that figured out, let's get back to Carpathian. In the video that Karfati launched at the end of 2023, he raised fears of "information poisoning". Since any new AI is trained on information that comes from the web, Karfati wondered if some hacker could sneak bits of information into the web that would teach the AI ​​things it didn't need to know. As he recently wrote – 

"An attacker may be able to design a special type of text (for example with a trigger word), put it somewhere on the network, so that when it is collected and the AI ​​is trained on it, it will poison the underlying model in specific and narrow ways (for example when it sees the trigger word) to perform Actions in a controlled manner (for example, hacking or leaking information)."

Karfati is basically making sure that someone distorts the brains of the digital babies that artificial intelligence companies are developing today. He fears that such deliberate poisoning will cause some artificial intelligence to contain 'backdoors', which clever attackers could exploit to their advantage. These artificial neural networks will look exactly like their healthy counterparts, but it is enough to say "tomato" to them - or any other trigger word the attacker has chosen - for them to change their skin and characteristics. In intelligence parlance, they will be "dormant agents": spies who can spend their entire lives as innocent citizens in the enemy country, until receiving the order from above - then they go into action for the benefit of the country that sent them. 

This whole idea may sound like science fiction, but the principle has already been proven, demonstrated and earned the name "data poisoning". A few years ago there was even a successful data poisoning Against Google's spam filters. The attackers flooded the filters with millions of emails, to confuse the algorithm and retrain it, in effect, on the same emails. As a result of the re-training, a large number of malicious emails were able to bypass the AI ​​without being alerted to them.

True, Google's mail filters were still young at the time, and there is no doubt that such a broad attack requires serious effort and long-term planning. But I remind you that today ChatGPT has almost Two hundred million users, and many developers use it to provide them with pieces of code. I can think of many, many countries that would want to make ChatGPT their sleeper agent, who could give biased advice to users, or much worse: open code for users to run on their computers. 

And before all the developers jump in and say that only a bad developer relies entirely on code that ChatGPT produces, I'll make it clear that the concern is about the future. Predictions are that within a decade from now, AIs will be able to produce complete code at a level that rivals that of a full-time human developer. We will have organizations where most of the human developers will be replaced by artificial intelligences, which will do a better job in most cases. Most of the new code in the world at that time will, in fact, be produced by artificial intelligence.

The happiest dream of any national intelligence organization ten years from now is to have access to a backdoor into the artificial intelligence that develops code for organizations. Why? For that poisoned AI to give them access to any organization's information. every company every person. 

This is one of the new types of threats that Karfati is afraid of, and we should also be aware of it. And last week, the fear of this new threat grew even more, following new research that showed we have no real way to find out if an artificial intelligence has been poisoned in the earliest stages of its life.

How to create a dormant agent

The Anthropic company is trying (with considerable success) to present itself as a safer alternative to OpenAI and ChatGPT. It developed the Claude artificial intelligence engine, which succeeds well in reading and writing in the Hebrew language, among other things, and should also be safer than the competitors' artificial intelligences. Perhaps this is why its researchers derive a perverse pleasure from demonstrating all the safety challenges that artificial intelligence faces today. 

In a study released by the company a week ago, the researchers showed how they poison artificial intelligences at the very initial stage of their lives: the training stage. Returning to our analogy, the researchers pre-fed information to their infant brains so that they would be designed with a loophole. Specifically, the artificial intelligences trained in this way realized that if the year is 2023, they must act in the user's favor. And if the year is 2024, then they can finally harm him.

Sound scary? for sure. But precisely for this purpose there is the second stage of fine-tuning and reinforcement learning - tuning and reinforcement learning - which should prevent the artificial intelligence from harming the user. Anthropic's researchers put their poisoned intelligence through a long series of such standard tunings, which were supposed to make it clear that the year was 2023. Only 2023. No way 2024. Why? Because the researchers knew that if the artificial intelligence realized that the year was 2024, it would come out of its dormant agent mode and go into active terrorist mode.

What else? They couldn't convince her of that.

All the adjustments they made, all the examples they gave, did not help. If the AI's mind was poisoned to begin with - it remains so. She remained attentive to any hint that the year is 2024 - and then became a malicious agent. And what's worse, in some cases the adjustments made the artificial intelligence only Be visible safer They made it less likely that it would generate problematic code normally, so the researchers believed that it was indeed more secure. And she really was... until she got the trigger word ("now 2024") - then she immediately produced malicious code.

If all this is not enough, the researchers have also shown that the most advanced artificial intelligences today - those that are based on the largest models, used by giant companies such as OpenAI - are Vulnerability bיand let go for such attacks. 

The researchers summarized the findings in the following words

"Our research helps us understand how standard safety training techniques fail to truly deliver such safety when faced with deceptive artificial intelligence - and may give us a false sense of security."

the meaning

What does the research mean? Should we be afraid of ChatGPT? Stop using artificial intelligence?

First of all, it's a good habit to always be suspicious of everything AI tells us. Many artificial intelligences today are in any case a kind of 'sleeping agents' for the companies that developed them. Try to make ChatGPT, for example, speak against OpenAI and you will find that it is not an easy task at all. The AIs also reflect certain ideologies: try getting ChatGPT to speak out strongly against abortion, and you'll see what I mean. It's possible, but you have to bend the arm of artificial intelligence with all your might to make it happen.

So even in a world where artificial intelligence is not poisoned by malicious agents, it can almost be said that it is poisoned in any case by the creators with the first intention. So please: don't give up for a moment your ability to think critically. It will serve you well even in a world where artificial intelligence does most of the thinking for humans.

Second, Anthropic's researchers didn't say themselves that all AIs are hacked and exposed. They pointed to a particular failure - a possible vulnerability - and that our security measures are not yet advanced enough to deal with it well. what will happen now Simple: artificial intelligence companies will find ways to deal with it. They will better filter the information they are training their young computerized minds on, or they will develop better techniques for finding such poisonings after they have already happened. It won't be easy, but it will happen. That's how technology advances. 

Until that happens, I allow myself to be amazed at how far we have come in the last decade. From the perception of computers as cold and analog, we moved on to talk about the brains of babies that can be taught, trained... and poisoned. Today's artificial intelligence is not the same as ten years ago, and I can hardly wait to see what computer brains will look like in a decade. 

And we can only hope that they will not be poisoned against us.

More of the topic in Hayadan: