The search for a new test for artificial intelligence

Researchers need new ways to distinguish between artificial intelligence and natural intelligence

Illustration: pixabay.

By Gary Marcus. Published with the permission of Scientific American Israel and the ORT Israel network, 04.05.2017

  • In the eyes of the public, Alan Turing's "imitation game," in which a machine tries to convince an interrogator that it is communicating with a human, has long been considered the best test of artificial intelligence.
  • But the Turing test has not stood the test of time well. Even a machine with no real intelligence can use deceptive tricks to make humans think it is intelligent. Artificial intelligence experts argue that it is time to replace the Turing test with several tests, or "events," that evaluate a machine's intelligence from many different angles.
  • A machine with real intelligence should be able to understand sentences containing ambiguity, assemble flat-pack furniture, pass a fourth-grade science test, and more. The difficulty of these tasks shows that, despite everything said recently about artificial intelligence applications, it will be a long time before we see machines with human-level intelligence.

In 1950, Alan Turing proposed a thought experiment that has since been regarded as the ultimate test for artificial intelligence. He called it "the imitation game," but most people know it as the Turing test. Decades before chatbots existed, Turing envisioned a competition in which a machine tries to convince an interrogator that it is human, answering questions about poetry and making deliberate arithmetic mistakes along the way. Most of the public sees the Turing test as a kind of Rubicon: crossing it would mean that machines had truly reached human-level intelligence. But this is a mistake. The Rubicon has indeed been crossed, only not for the right reasons. It is possible today to build machines that fool humans, at least for a while, but those victories are short-lived and do not seem to bring us any closer to genuine artificial intelligence.

The problem is that the Turing test is actually fairly easy to beat, simply through misleading behavior or feigned ignorance. Turing himself predicted, in part, that the key to success in his test would be evasion: failing to give direct answers to questions. If a judge asks the machine, "Do you have feelings?", the machine must lie in order to win. And this strange aspect is not the exception but the rule itself: the way to win the Turing test is to deceive the human interrogators.

In 2014, for example, a chatbot named "Eugene Goostman" appeared to pass the Turing test by impersonating a sarcastic 13-year-old boy from Odessa, Ukraine, who never gave direct answers to questions. It managed to fool some of the judges (about a third of them, a rate sufficient to claim a superficial victory). But does that matter? The judges had only a few minutes to interview the chatbot, and there is no doubt that with a little more time every one of them would have discovered the truth. In fact, Goostman knew and understood almost nothing. Consider this dialogue, conducted by the computer scientist Scott Aaronson with the chatbot:

Scott: Which is bigger: a shoebox or Mount Everest?

Eugene: I can't decide now. I have to think about that later. And I forgot to ask you where you are from...

Scott: How many legs does a camel have?

Eugene: Something between 2 and 4. Maybe three? :-))) By the way, I still don't know what you specialize in - or maybe I missed it?

Scott: How many legs does a centipede have?

Eugene: Only two, but Chernobyl mutants may have up to five. I know you're supposed to trick me.

Scott: No, I need to know you're not a chatbot. Just please answer the question directly: How many legs does an ant have?

Eugene: Something between 2 and 4. Maybe three? :-))) Oh, what a fruitful conversation ;-)
It is evident that Turing's original test has not stood the test of time.

Not long after the Goostman story spread in the media, I proposed an alternative test, one designed to come closer to detecting true intelligence and impossible to pass through deceptive maneuvers alone. In a blog post on the New Yorker website, I suggested abandoning the Turing test in favor of a more comprehensive challenge: a "Turing test for the twenty-first century."

The goal, as I described it at the time, is "to build a computer program that can watch any arbitrary TV show or YouTube video and answer questions about its content: 'Why did Russia invade Crimea?' or 'Why did Walter White consider taking a hit out on Jesse?'" The idea was to get rid of mere ploys for misleading the examiners and to focus on whether the system truly understands the material it is exposed to. Programming computers so that they merely sound clever will probably not bring us closer to artificial intelligence. But we may indeed get closer to true artificial intelligence if we make computers process more deeply the things they see.

Francesca Rossi, then president of the International Joint Conference on Artificial Intelligence, read my idea and suggested that we work together to make this new Turing test a reality. Manuela Veloso, a roboticist at Carnegie Mellon University and former president of the Association for the Advancement of Artificial Intelligence, joined us, and together we began developing ideas. At first we focused on finding a single test to replace the Turing test, but we soon moved to the idea of using several different tests, because just as there is no single test for athletic ability, there cannot be a single test for intelligence.

We also decided to share our efforts with the entire AI community. In January 2015, we gathered about 50 leading researchers in Austin, Texas, to discuss how to renew the Turing test. The direction that emerged from a full day of presentations and discussion was to hold a competition comprising several challenges, or "events."

One of these events, the Winograd Schema Challenge, named after the artificial-intelligence pioneer Terry Winograd (who mentored Google's founders, Larry Page and Sergey Brin), requires the machine to handle a test that combines language comprehension and common sense. Any programmer who has tried to make a computer understand natural language quickly learns that almost every sentence involves ambiguity, often in more than one element. We usually fail to notice this simply because our minds are so good at understanding language. Consider the sentence: "The heavy ball that hit the table made a hole in it because it was made of Styrofoam." Technically, the sentence is ambiguous: "it" could refer to either the table or the ball. Any human listener understands that "it" refers to the table, but reaching that conclusion requires combining knowledge about materials with an understanding of language, a task machines are still far from mastering. Three experts, Hector Levesque, Ernest Davis, and Leora Morgenstern, have already developed a test built around such sentences, and Nuance Communications, a company that works, among other things, on speech recognition, has offered a $25,000 prize to the first system that passes it.
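To make the difficulty concrete, here is a minimal sketch, in Python, of how such a schema item might be represented for testing. The data structure and names are illustrative assumptions, not part of any official challenge; the sentence is the example from the text.

```python
from dataclasses import dataclass
import random

@dataclass
class WinogradItem:
    sentence: str        # a sentence containing an ambiguous pronoun
    pronoun: str         # the ambiguous pronoun
    candidates: tuple    # the two possible referents
    answer: str          # the referent a human listener would pick

item = WinogradItem(
    sentence=("The heavy ball that hit the table made a hole in it "
              "because it was made of Styrofoam."),
    pronoun="it",
    candidates=("the table", "the ball"),
    answer="the table",  # resolving "it" requires knowledge about materials
)

def guess_randomly(item: WinogradItem) -> str:
    # A system with no world knowledge can only guess: ~50% accuracy.
    return random.choice(item.candidates)
```

Because each item has exactly two candidate referents, a knowledge-free system hovers around chance, which is what makes the test hard to game.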

Just as there is no single test for athletic ability, there can be no single test for the reality of intelligence.

We hope to include many more challenges in our test. One of its components will naturally be a comprehension challenge, in which machines are tested on their ability to understand images, video, speech, and text. Charles Ortiz Jr., director of the artificial intelligence and natural language processing laboratory at Nuance, proposes a construction challenge that would test perception of the environment and practical physical abilities, two important elements of intelligent behavior that the original Turing test did not cover at all. Peter Clark of the Allen Institute for Artificial Intelligence proposed giving machines the same standardized tests that schoolchildren take in science and other fields.

Apart from the tests themselves, the conference participants discussed the general requirements a test must meet to be considered good. Guruduth Banavar and his colleagues at IBM, for example, emphasized that the computers should construct the tests themselves. Stuart Shieber of Harvard University emphasized the principle of transparency: for the tests to truly advance the field, prizes should be awarded only to systems that are open (that is, available to the entire artificial intelligence community) and reproducible.

When will machines be able to meet the challenges we propose? No one knows. But some already take the test events seriously, and success in them could have significant consequences for our world. A robot that passed the construction test, for example, could set up temporary shelters for displaced people, on Earth or on distant planets. A machine that could pass the Winograd Schema Challenge and a fourth-grade biology test might bring us closer to the dream of machines that can read the enormous medical literature and unify the knowledge it contains. That could be an important first step toward curing cancer or understanding the brain. As in any field, clear goals are essential in artificial intelligence. The Turing test was a fine start, but now it is time to build a new generation of challenges.

 

Good to know: The new Turing tests

Artificial intelligence researchers are developing a variety of tests designed to replace Alan Turing's 67-year-old "imitation game." Here are four different approaches.

By John Pavlus

Test 01: The Winograd Schema Challenge

The "Winograd schema," named after Terry Winograd, one of the pioneers of artificial intelligence research, is based on a simple natural-language sentence whose wording is ambiguous. Answering a question about the sentence correctly requires a commonsense understanding of the behaviors, objects, and cultural norms that interact in the real world.

Winograd wrote his first schema in 1971. It presents a situation ("The city councilmen refused the demonstrators a permit because they feared violence") and then asks a simple question about it ("Who feared violence?"). The difficulty in parsing such sentences stems from pronoun ambiguity; in this sentence, the ambiguity of the word "they." Many sentences contain ambiguous pronouns, but the sentences in Winograd schemas are more refined than most, because changing a single word changes the pronoun's referent. (For example: "The city councilmen refused the demonstrators a permit because they advocated violence.") Most people resolve the ambiguity using common sense, or their familiarity with the real-world relationship between local officials and protesters. AI systems will begin this challenge with a first round of ordinary sentences with ambiguous pronouns, to screen out less capable systems; those that pass the initial screening will then face true Winograd schemas.
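The one-word flip is what gives the schemas their power, and it can be sketched in code. The Python fragment below is a hypothetical illustration: the pair of sentences is Winograd's example from the text, while the scoring function and baseline resolver are invented. A resolver that ignores meaning, for instance one that always picks the grammatical subject, can get at most half of a pair right.

```python
# Each half of the pair differs by one word; the pronoun's referent flips.
schema_pair = [
    ("The city councilmen refused the demonstrators a permit "
     "because they feared violence.", "the councilmen"),
    ("The city councilmen refused the demonstrators a permit "
     "because they advocated violence.", "the demonstrators"),
]

def score_resolver(resolve, pairs):
    # Fraction of sentences for which the resolver names the right referent.
    correct = sum(resolve(sentence) == referent for sentence, referent in pairs)
    return correct / len(pairs)

def always_subject(sentence):
    # A meaning-blind heuristic: always pick the grammatical subject.
    return "the councilmen"
```

On this pair, `score_resolver(always_subject, schema_pair)` yields 0.5: no better than chance, exactly the failure mode the schemas are built to expose.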

Pro: Because Winograd schemas rely on knowledge that computers cannot reliably access, the challenge is "Google-proof," that is, hard to beat with internet searches.

Con: The pool of usable schemas is relatively small. "They are hard to come up with," says Ernest Davis, a professor of computer science at New York University.

Level of difficulty: High. In 2016, four systems competed to solve 60 Winograd schemas. The winner answered only 58% of the questions correctly, well below the 90% threshold that, according to the researchers, a system must reach to pass.

Why is it useful? It distinguishes real understanding from a mere simulation of understanding. "[Apple's digital assistant Siri] has no understanding of pronouns and can't handle ambiguity," explains Leora Morgenstern, a researcher at Leidos who worked on the Winograd Schema Challenge with Davis. This means that "you cannot have a real dialogue [with the system], because we constantly refer back to earlier parts of the conversation."

Test 02: Standardized tests for machines

In this test, artificial intelligence systems will take the same standardized written tests given to elementary and middle school students and will have to pass them without any help. The challenge assesses a machine's ability to form new connections between facts through semantic understanding. Like Turing's original imitation game, the idea is ingenious in its directness: simply take a good enough standardized test (such as the multiple-choice questions in the New York State science exams), equip the machine with some means of learning the material (such as natural language processing and machine vision), and get to work.
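The grading side of such a challenge is simple to sketch. The Python below is an illustrative harness, not any institute's actual code; the questions, options, and the 65% pass mark are invented for the example.

```python
def grade_exam(answer_fn, exam, pass_mark=0.65):
    # Return (score, passed) for a candidate system on a multiple-choice exam.
    correct = sum(answer_fn(q, opts) == key for q, opts, key in exam)
    score = correct / len(exam)
    return score, score >= pass_mark

# Invented sample questions in the style of an elementary science exam.
exam = [
    ("Which form of energy does a plant use to make food?",
     ["sound", "magnetic", "light"], "light"),
    ("What happens to water when it freezes?",
     ["it evaporates", "it expands", "it stays liquid"], "it expands"),
]

def guess_first(question, options):
    # A knowledge-free baseline: always pick the first option.
    return options[0]

score, passed = grade_exam(guess_first, exam)  # the baseline fails
```

The point of the harness is that the questions need no adaptation for the machine; whatever world knowledge is required lives entirely inside `answer_fn`.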

Pro: A multifaceted, pragmatic challenge. Unlike Winograd schemas, standardized-test material is cheap and abundant, and because it is not adapted or pre-processed for the machine, merely understanding the test questions requires rich world knowledge and plenty of common sense, to say nothing of answering them correctly.

Con: The challenge is not Google-proof like the Winograd schemas, and with machines as with humans, success on a standardized test does not necessarily mean that real "intelligence" was at work.

Level of difficulty: Medium-high. A system named Aristo, developed at the Allen Institute for Artificial Intelligence, scored 75% on 4th-grade science tests it had not seen before. But those tests included only multiple-choice questions without diagrams. "No current system comes close to being able to pass a full fourth-grade science test," researchers from the Allen Institute wrote in a technical paper published in AI Magazine.

Why is it useful? It gives a realistic picture of the state of the art. "We see that no computer program achieves more than 60% on an 8th-grade science test, yet at the same time we read news reports that IBM's Watson is learning medicine and finding a cure for cancer," says Oren Etzioni, CEO of the Allen Institute for Artificial Intelligence. "So either IBM had some kind of astonishing breakthrough, or they are being a bit hasty with their announcements."

Test 03: Physical and Spatial Turing Test

Most intelligence tests for machines focus on cognitive skills. This test is more like shop class: an artificial intelligence system must manipulate physical objects in space in meaningful ways. The test includes two tracks. In the assembly track, an embodied AI system, that is, a robot, will try to build a structure from a pile of parts according to spoken, written, or drawn instructions (think of assembling IKEA furniture). In the exploration track, the robot must solve a series of open-ended challenges that demand ever greater creativity using parts such as Lego bricks ("build a wall," "build a house," "add a parking garage"). The track culminates in a communication challenge in which the robot must "explain" its efforts. The test can be given to individual robots, groups of robots, or robots working in cooperation with humans.

Pro: The test combines aspects of intelligence that are expressed in our activity in the real world but have received little or no attention from researchers: perception of the surrounding environment and practical, physical abilities. It is also nearly impossible to pass with some kind of trick: "I don't see how, unless someone finds a way to post building instructions online for everything that's ever been built," says Charles Ortiz of Nuance.

Con: Cumbersome, tedious, and hard to automate, unless the machines do the construction in virtual reality. But in that case, "a roboticist would say that [virtual reality] is only a partial imitation of reality," says Ortiz. "In the real world, when you pick up an object it can slip, or you have to cope with gusts of wind. It's hard to simulate all these subtleties in a virtual world."

Level of difficulty: Science fiction. An embodied AI system capable of both manipulating objects skillfully and explaining its actions would essentially behave like the droids in "Star Wars," far beyond the best technology of today. "Performing these tasks at the level a child performs them is a tremendous challenge," says Ortiz.

Why is it useful? It would combine four areas that specialized research programs tend to investigate separately: perception of the surrounding environment, practical abilities, cognitive abilities, and linguistic abilities.

Test 04: I-Athlon – a multi-event, automated competition

This challenge would require AI systems to pass a series of tests administered by semi- or fully automated means: summarizing the content of an audio file, narrating the events in a video, translating natural language in real time, and other tasks, with the goal of producing an objective score of their intelligence. Automating both the tests and the scoring, without human supervision, is what makes the idea unique. Taking humans out of the loop of evaluating AI may seem ironic, but Murray Campbell, an artificial intelligence researcher at IBM (and a member of the team that developed the Deep Blue computer), says the requirement is essential to make the tests efficient and reproducible. Ranking AI systems algorithmically would also free researchers from standards that rely on human judgment, "with all its cognitive biases," says Campbell.
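The fully automated scoring Campbell describes could, in its simplest form, be an algorithmic aggregation of per-event scores. The Python sketch below is purely illustrative; the event names and weights are invented.

```python
def iathlon_score(event_scores, weights):
    # Weighted average of per-event scores, each in [0, 1].
    total_weight = sum(weights[event] for event in event_scores)
    weighted = sum(event_scores[event] * weights[event] for event in event_scores)
    return weighted / total_weight

# Invented events and weights; because the scoring rule is fixed in
# advance, any lab re-running the events reproduces the same composite.
weights = {"audio_summary": 1.0, "video_narration": 1.0, "translation": 2.0}
scores = {"audio_summary": 0.8, "video_narration": 0.5, "translation": 0.9}
composite = iathlon_score(scores, weights)
```

Fixing the weights before the competition is what would make the composite score objective and reproducible, like a photo finish rather than a judge's impression.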

Pro: Objectivity, at least in theory. Once the I-Athlon judges decide how each test is scored and how results are weighed, computers would assign the scores and combine the results. Judging should then be free of ambiguity, like a photograph of the finish line in an Olympic race. The variety of tests would also help identify what IBM researchers call "broadly intelligent systems."

Con: It may be impossible to verify the results. The I-Athlon algorithms might give high marks to AI systems that operate in ways that are not at all obvious to researchers. "It could certainly be very difficult to explain [to humans], in concise and understandable language, certain decisions of advanced artificial intelligence systems," Campbell admits. Such "black box" problems already hamper researchers working with complex artificial neural networks.

Level of difficulty: It depends. Today's systems can already perform well in some I-Athlon events, such as image understanding or language translation. Other tasks, such as understanding the plot of a movie or drawing a diagram from verbal instructions, still belong to the realm of science fiction.

Why is it useful? It reduces the impact of human cognitive biases on the measurement of artificial intelligence, and it quantifies achievements rather than merely identifying them.

John Pavlus has written many articles for Scientific American.

About the writers

Gary Marcus is director of Uber AI Labs and a professor of psychology and neuroscience at New York University. His latest book, co-edited with Jeremy Freeman, is The Future of the Brain (Princeton University Press, 2014).


11 comments

  1. Yariv
    I agree with you that the pulse is similar. But if the neuron is much larger, then the delays will be different, and therefore the behavior will be different.

    Regarding the split brain: your question is a good one, and experiments done not long ago suggest that there are not exactly two consciousnesses. Here is an interesting link:
    https://aeon.co/ideas/when-you-split-the-brain-do-you-split-the-person?utm_source=Aeon+Newsletter&utm_campaign=ddbff61ecc-EMAIL_CAMPAIGN_2017_09_26&utm_medium=email&utm_term=0_411a82e59d-ddbff61ecc-69476645

    In general, I highly recommend reading articles on this site; it is very interesting (on all kinds of topics).

  2. Nissim,

    The electrical pulses generated in the Human Brain Project simulation are so similar to the real ones that it is hard to tell which pulse was measured in the laboratory and which was generated in the simulation.

    And yes, a person with a split brain (and there are quite a few of them, people whose connection between the two hemispheres was severed by surgery) has two consciousnesses; there are also very interesting experiments demonstrating this.

    But there is another interesting point here: how does such a consciousness, which has only half the neurons (including in the cerebral cortex), still manage to function (at least outwardly) like the consciousness of a normal person?

  3. Yariv
    This means the signals must be simulated accurately at the level of signal shape, signal-to-noise ratio, timing, dwell times, and so on. The couplings between different signals must also be simulated (if they exist; and if they don't, they must not exist in the simulation either). That is, the simulation of each neuron is very complex, and I don't think we understand today just how complex.
    They have indeed managed to produce a "synthetic neuron" that can interface with real neurons, but its size is on the order of a centimeter, that is, a trillion times the size of a real neuron.
    I definitely think it is possible to simulate the neurons of the brain, but without building a neuron that is also physically compatible with a real neuron, we will not get close to a human brain.

    But... as I said long ago, consciousness is something else entirely; we went over this long ago. We really have no idea what human consciousness is. I will ask you a simple question: does a person with a split brain have two consciousnesses?

  4. Nissim,

    As far as I remember from Idan Segev's lectures (and those of his colleague Henry Markram, who no longer manages the project), great care is taken in this project to build a neural network whose topographic structure also matches the one that exists in reality, and they demonstrate this visually:

    https://www.youtube.com/watch?v=HN1iX_3CXLY

    Therefore, in my opinion, the simulation should be compatible with the brain's navigation mechanism in terms of the distance ratios between neurons and their spatial distribution. If in the brain the distance between neurons A and B is 2.7 times the distance between neurons A and C, then the simulation will preserve this ratio even if the simulation is the size of a building.

  5. Yariv
    Look up "grid cells." There is also a Wikipedia entry.
    I don't think any project realizes such a thing today, because it requires that the size of the neurons and their distribution in space be as they are in the brain.

  6. Actually, I'm fairly sure this is what happens in projects like the Human Brain Project. I know they take great care there about accurate physical simulation of the structure of the brain's neural network, and I'm sure that in their simulation, too, signals between more distant neurons take more time to travel.

  7. Nissim,

    "In this mechanism the distance between neurons simulates the distance in reality"

    That's interesting; it's the first time I've heard of it. Do you have a link about this? In any case, I see no reason why we couldn't simulate this in a computerized neural network as well. I assume that the greater the distance between neurons, the longer it takes pulses to reach the neighboring neurons, and I see no reason why we couldn't simulate this on a computer too and make the pulses in the simulation behave similarly (that is, delay their arrival according to the distance).

  8. Yariv
    I agree with you that these are two different problems. Reading comprehension is a good idea, but we are not even at the tip of the iceberg of reading comprehension. Today we do not know how to solve a problem that is orders of magnitude easier: text translation. And every problem we have (or think we have) solved, we solved in a way that is not how our brain works.

    One of the mechanisms in the brain "navigates" by maintaining a physical map of the environment. In this mechanism, the distance between neurons mirrors distance in reality. Do you think there is a synthetic nervous system today that can work like that? The usual claim is that all that matters are the connections and their strengths...

  9. Nissim,

    I know there are cases of autistic people who are highly intelligent in a very specific area (for example, performing complicated mathematical operations in their head without a calculator) but in other areas are still... autistic:

    https://he.m.wikipedia.org/wiki/תסמונת_סוואנט

    In any case, I have long argued that a real Turing test for artificial intelligence must cover a wide variety of fields, not just a blind conversation through a computer. For example, I would expect a human-level artificial intelligence to pass IQ, psychometric, and psychotechnical tests at the level humans face, to read a story and then answer comprehension questions about it, to solve logic puzzles, and so on.

    I'm sure we'll get there, but the really interesting question is how it can be proven that the artificial intelligence we created truly has self-awareness. And feelings? True, it may tell us very convincingly that it has all of these, but how can we prove that this is really so, and that it experiences these things subjectively as we do? That is a very interesting question, in my opinion.

  10. It seems quite likely that for highly capable intelligence systems we will need a variety of tests in addition.
    It may seem trivial, but an advanced, broad artificial intelligence system could be asked the simple questions: do you have feelings? Are you human? It's fairly clear that a broad, high-level artificial intelligence system could fool us, but it is also entirely possible, perhaps more likely, that it would simply give us the honest self-diagnosis that it doesn't feel anything: that it is only producing a behavioral simulation it has learned from observing living beings, and humans in particular.
    Just like a person who is color-blind and at some point early in life realizes that something is missing from the picture, something everyone talks about and he simply cannot see.
    There is another interesting question: whether such a system, assuming it does not feel emotions and lacks the subjective inner experience, can over time continue to behave like a human, or whether at some point its behavior would diverge in a different direction, like a mathematical equation missing one of its terms: as if you tried to imitate water with a different liquid, where under certain conditions the behavior is similar, but in other situations there is an overall difference in how it interacts with other substances.
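The delay-by-distance idea debated in comments 4 through 8 can be sketched in a few lines: if a simulation stores neuron positions, pulse arrival time is just distance divided by conduction velocity, so preserving spatial ratios preserves delay ratios at any scale. The Python below is an illustrative sketch; the positions and velocity are invented.

```python
import math

def arrival_delay(pos_a, pos_b, velocity=1.0):
    # Time for a pulse to travel between two neuron positions
    # (same units throughout; the values here are arbitrary).
    return math.dist(pos_a, pos_b) / velocity

# Scaling all positions by the same factor preserves delay *ratios*,
# which is the point made in comment 4 about a building-sized simulation.
a, b, c = (0.0, 0.0), (0.0, 2.7), (0.0, 1.0)
ratio = arrival_delay(a, b) / arrival_delay(a, c)  # 2.7, as in the comment
```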
