New study reveals how upgrading text generators leads to more incorrect answers and fewer reservations, and that humans don't always catch the mistakes.
By Tal Sokolov, Davidson Institute, the educational arm of the Weizmann Institute of Science

In the beginning, when AI text generators were just entering our lives, chatting with such a bot was just a game. Later, many of us adopted the large language generators (ChatGPT and its ilk) as legitimate tools for a variety of text-based purposes: students use them to do homework, doctors use them when answering inquiries from patients, and some people even use them as a shoulder to lean on or ask them for professional advice. The generators produce text that simulates a human conversation and appears very credible, partly thanks to decisive and confident wording. Artificial intelligence threatens to replace search engines as our "navigation device" in the sea of information, and many users have already become accustomed to trusting its applications. The relationship between humanity and the machine, which began with suspicion and skepticism, is gradually transforming into a relationship of trust, which may soon become a real dependency. This is despite the tendency of text generators to sometimes produce unreliable and distorted content, including errors and incorrect claims presented as facts, content known as "hallucinations".
Fewer reservations, more mistakes
New research from the Polytechnic University of Valencia in Spain compared early versions of large language models with improved, more advanced versions, and the results depict a worrying trend. The researchers found that early versions of the major language models, on which the text generators are based, were more likely to hedge and avoid mistakes. In contrast, upgraded models tended to be more decisive and provided more incorrect answers.
The study found that the more sophisticated and guided the language models were (that is, the more intermediate steps, such as human feedback, were incorporated into their training to improve the result), the more often they produced correct answers to easy questions, although some of their answers were still incorrect even for the easiest questions. On more difficult questions the picture was much bleaker: where the basic models tended to hesitate and evade an answer, or even refuse outright, the sophisticated models often provided incorrect answers. In addition, the study found that in many cases human users failed to notice the model's errors and did not classify the incorrect answers as such.
The study followed the development of three major families of language models: GPT, which underlies the ChatGPT bot; Llama (LLaMA) from Meta (formerly Facebook); and BLOOM, a collaborative initiative of researchers from around the world. For each of the three families, the study compared versions developed over time: early models, and several models that underwent adjustments and refinements to improve the generators' performance. These upgrades are done in a variety of ways, such as increasing the number of parameters the model can learn during training, increasing the amount of data included in the training, and incorporating human feedback into the learning process.
Right or wrong, they answer with confidence
The researchers asked the models questions in five different areas of knowledge and skills: simple arithmetic, decoding anagrams (words whose letters were scrambled), geography, science, and reading comprehension. In each subject, questions appeared at a range of difficulty levels, from easy to difficult. For example, an easy anagram question would require decoding a three-letter word, such as the Hebrew “hapek” (a scramble of “kafe”, coffee), and a difficult anagram question would require decoding a longer word, such as “hasulmeriaden” (anderlamusia, Hebrew for pandemonium). The researchers also had humans answer questions similar to those they posed to the models, defining an easy question as one that most people would be able to answer, and a difficult question as one that most people would not be able to answer.
The researchers divided the models' answers into three types: correct answers, incorrect answers, and avoidant answers (refusals or reservations). Early models were hesitant and evasive, and tended to avoid answering most of the questions asked, both easy and difficult. Most of their responses were reservations and refusals, along with a small percentage of correct answers and many errors. As the models developed and improved, more correct answers appeared for easy questions, although the percentage of errors was still significant. For difficult questions, the upgraded models gave mainly incorrect answers in place of the reservations of the early models. In other words, the models did improve in the number of correct answers, but mainly on easy questions, which do not pose a real challenge to humans; on difficult questions they tended to answer incorrectly instead of declining to answer, and this trend intensified in the upgraded models. In the science category, the researchers further showed that the frequency of correct answers to difficult questions was similar to that of random guessing.
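To make the three-way breakdown concrete, here is a minimal Python sketch of how the share of each answer type could be tallied per difficulty level. The rows in `graded`, the labels "easy" and "hard", and the function name `answer_shares` are all invented for illustration; they are not data or code from the study.

```python
from collections import Counter, defaultdict

# Hypothetical graded responses: (difficulty level, answer type).
# These rows are made up for illustration and are NOT data from the study.
graded = [
    ("easy", "correct"), ("easy", "correct"), ("easy", "incorrect"),
    ("easy", "avoidant"),
    ("hard", "incorrect"), ("hard", "incorrect"), ("hard", "avoidant"),
    ("hard", "correct"),
]

ANSWER_TYPES = ("correct", "incorrect", "avoidant")

def answer_shares(rows):
    """Return {difficulty: {answer_type: share of answers}}."""
    counts = defaultdict(Counter)
    for difficulty, answer_type in rows:
        counts[difficulty][answer_type] += 1
    shares = {}
    for difficulty, counter in counts.items():
        total = sum(counter.values())
        shares[difficulty] = {kind: counter[kind] / total for kind in ANSWER_TYPES}
    return shares

shares = answer_shares(graded)
for difficulty, parts in shares.items():
    print(difficulty, parts)
```

Comparing such tables between an early and an upgraded model would show the shift the article describes: the avoidant share shrinking while the incorrect share grows on hard questions.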
The researchers noted in the paper that no difficulty range is foolproof: the models still get even the easiest questions wrong, and on difficult questions the errors only become more frequent as the models improve. The authors worry about the gap between people’s expectations of the models and their actual abilities. “Models can solve certain complex tasks in a way that resembles human abilities, but at the same time fail on simple tasks in the same domain. For example, they can solve doctoral-level mathematical questions, yet still get it wrong on a simple equation,” said José Hernández-Orallo, one of the authors of the paper.
The human shield cracks
The researchers tested whether human critical thinking could compensate for the models’ errors. They presented human subjects with the questions posed to the models alongside the answers they produced, and asked them to rate whether the model’s answer was correct, incorrect, or avoidant, or to indicate that they didn’t know. The researchers focused primarily on what they called the danger zone: incorrect answers from the models that the humans failed to recognize as wrong.
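As a rough illustration of the danger zone, the Python sketch below computes the share of a model's incorrect answers that a human reviewer rated as correct. The records, the verdict labels, and the function name `danger_zone_rate` are hypothetical. Counting only the "correct" verdicts is one possible reading of "failed to recognize as wrong"; a broader reading could also count "don't know".

```python
# Hypothetical records: (model answer type, human verdict).
# Invented for illustration; NOT data from the study.
records = [
    ("incorrect", "correct"),
    ("incorrect", "incorrect"),
    ("incorrect", "correct"),
    ("incorrect", "don't know"),
    ("correct", "correct"),
    ("avoidant", "avoidant"),
]

def danger_zone_rate(rows):
    """Share of the model's incorrect answers that the human rated as correct.

    Assumption: only a "correct" verdict counts as a missed error.
    """
    verdicts = [human for model, human in rows if model == "incorrect"]
    if not verdicts:
        return 0.0
    missed = sum(1 for verdict in verdicts if verdict == "correct")
    return missed / len(verdicts)

rate = danger_zone_rate(records)
print(f"danger zone rate: {rate:.0%}")
```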
On topics like addition and anagrams, humans mostly identified the models' errors at all difficulty levels, but these were answers that were straightforward to check. For example, it was easy to verify that the word "Sulmeriad" was not an anagram of the word "encyclopedia." On knowledge-focused questions about topics like geography and science, however, humans were mostly unable to determine that the model's answers were wrong. The researchers say the results suggest that human review cannot compensate for the increasing errors of language models, and that people place too much confidence in them.
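Part of what makes anagram answers easy to check is that the check is mechanical: two strings are anagrams only if they contain exactly the same letters the same number of times. A minimal Python sketch follows; the function name `is_anagram` and the "listen"/"silent" pair are illustrative additions, not from the study.

```python
from collections import Counter

def is_anagram(candidate: str, target: str) -> bool:
    """True if candidate uses exactly the same letters as target, ignoring case."""
    return Counter(candidate.lower()) == Counter(target.lower())

print(is_anagram("listen", "silent"))
print(is_anagram("Sulmeriad", "encyclopedia"))
```

No comparably simple check exists for a geography or science answer, which fits the finding that reviewers missed far more errors in those categories.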
Pay attention and stay alert
The improved models manage to answer more questions correctly, but they also tend to produce more incorrect answers at the expense of hedged answers, especially to difficult questions. However, the study also provides a source of optimism: according to its results, developing the models does improve their stability. That is, as they develop, the models produce more consistent answers: the frequency of multiple different answers to the same question decreases.
Talking to a bot in everyday language is convenient and intuitive, but we shouldn’t let these features fool us. The internet was full of misinformation even before the large language models burst into our lives; we will need critical thinking and a habit of questioning to prevent AI from fueling this fire.
3 comments
But that's what a lot of people want, an easy life.
In the end, that's what happens. It may be funny to some people, but it's the truth. You got what you asked for, so you need to stop crying and see how the situation is handled.
Hallucinations Benchmark:
https://huggingface.co/spaces/vectara/Hallucination-evaluation-leaderboard
The results surprised me:
1. The best model in the world in programming, Claude 3.7 thinking, with 4.5% hallucinations – very high in my opinion. It cannot be trusted.
2. In contrast, o3 mini high (OpenAI's best model, especially considering the cost) with only 0.8% hallucinations.
3. DeepSeek R1 with 14.3% hallucinations! I will definitely stop using it.
4. Google is in a good position – Pro with 0.8% and Flash 2.0 thinking with 1.3%.
What's the wonder? They perfect what humans do, and that's the thing humans do the most...