
38 Needles in a Haystack - a chapter from the technological thriller "The Silicon Jungle" by Shumeet Baluja

A new book published by Ahlama * Technological thriller * The Silicon Jungle by Shumeet Baluja * Translated from English: Liat Phelan-Lbertovsky * Editing and consulting for the Hebrew edition: Ohad Rosen

The cover of the book The Silicon Jungle

About the book

What happens when a young, naive intern gets free and unlimited access to our most private thoughts and actions? Stephen Thorpe wins a coveted internship at Ubatoo, an Internet empire that provides its users with many online services, from a search engine and a shopping site to an email account and a social network. When his boss asks him to work on a project for the American Coalition for Civil Liberties, he takes on the task in good faith, but from there everything only gets more complicated. Stephen is reluctantly caught up in a whirlwind chase involving greedy high-tech people, radical Islamic activists and government intelligence agents, all of whom are determined to gain access to the company's most sensitive databases.
The Silicon Jungle is an engrossing technological thriller set inside a huge, groundbreaking and unrivaled data mining company, which holds in its vast data vaults the most intimate passions, secrets and weak points of all its users. The book was written by Dr. Shumeet Baluja, a senior researcher at Google who knows the system inside out: the endless benefits inherent in it, as well as the chilling dangers of its abuse by interested parties.

The Silicon Jungle raises significant ethical questions about how easily our everyday online actions can be exploited for political, security and personal ends, and how the smallest pieces of information we unwittingly reveal about ourselves can be connected into a sophisticated profile of our most hidden habits, goals and desires, one that can be exploited in every way imaginable. Indeed, upon its publication in 2011, the book attracted wide attention in the USA and received enthusiastic reviews.
Dr. Shumeet Baluja is a scientist and senior researcher at Google, holds a doctorate in computer science, and is a member of the academic faculty in the Department of Computer Science and the Robotics Institute of Carnegie Mellon University. In the past, he served as the head of the technology division at JAMDAT Mobile and as the chief scientist at Lycos.

Dozens of academic articles have been written about the potential dangers to privacy inherent in the Internet. The Silicon Jungle presents these dangers not as a theoretical threat or an abstract risk, but in a real, tangible way that feels closer than ever, so that anyone who surfs the Internet without imagining who sits behind the screen will be fascinated.

Below is the chapter "38 Needles in a Haystack"

March 2009.

The computer screen came to life. Stephen no longer heard Lin or the other people in the room with him. He had been found suitable for the data mining group, and now he looked intently at the screen in front of him and tried to decipher the task he had been given:

Data Mining Task 1: Predicting the Future

Ubatoo aims to know its users better than anyone else. Last year we launched a series of mobile phones. To help our users, we would like to predict what they are going to search for, even before they type the search words. How good are you at predicting the future? Design and code a system that predicts what each user's next search will be.

You have access to all our data from the last month - all the search terms that 50,000 users typed on their mobile phones. The details about each search include the user's location at the time of the search, the time the action was performed and the exact words typed.

The task: by analyzing the attached records and any other information you can find, you must predict the next ten searches of all 50,000 users. The score will be calculated based on the number of matches between your predictions and the searches actually performed. Recommended time for the task: no more than 4 hours.


A task like this would surely have seemed easier to him during his academic studies, but the last two and a half years, spent connecting keyboards and repairing fax machines, had not contributed in the least to his readiness for the competition. His only consolation was that he knew the task was feasible; otherwise they would not have assigned it. Or, at least, he hoped so. He was not bothered by the high number of users. If he could accurately predict a single user's next search query, he could do the same for all the others. The problem was to develop the algorithm that would allow him to make that first accurate prediction. All the answers were in the search history records. All he had to do was find a way to extract the right information from them, and do it in less than four hours.

Start with the basic facts, he remembered from his "Introduction to Artificial Intelligence" course. To predict a user's next step, you need to find other users who are similar to him and check what they did in similar circumstances. In other words, he simply needed to match each of the 50,000 users with other users who had performed identical searches. Examining similar users would likely reveal other interests they had in common and other searches they might perform. Once the idea came to him, he had no trouble writing the software. He finished in less than an hour. He tested some of his predictions. Some of them seemed logical to him, but most struck him as far-fetched. For a moment, he panicked. He considered correcting the predictions by hand, but after fixing fewer than ten of them, he accepted what he had known all along. There was a reason why they asked the candidates to predict the searches of 50,000 users: they wanted to make sure that a manual approach would lead to a dead end.

He got up to get something to drink and looked around for the first time. It was a scene he hadn't been a part of in years - the people around him were working feverishly, completely oblivious to their surroundings, and everyone was there of their own free will, not because someone forced them to come. He looked for the telltale signs of mental stress - people burying their heads in their hands or leaning back in their chairs - but saw none. Maybe it was just too early. The truth was that he hoped to see more cracks in the wall of competitors.

He sat down and looked again at the screen in front of him, which was already full of dense and tangled software code, the fruit of his work over the last hour. The code looked ugly and illogical, and he was almost ashamed of it. He knew that his college professors would have denied him his degree if they had seen it. On the other hand, he also knew that none of them would have been able to get accepted into Ubatoo. He couldn't keep thinking about it. The thoughts distracted him and he had to stop. He gave up on the drink and settled back into his seat. He put his hands on the keyboard, drummed his fingers on the keys and tried to think seriously about what to do next.

As is always the case in hindsight, once the insight came, it seemed self-evident. The fact that users searched for information on their cell phones pointed to the importance of their geographic location. People look for information about topics that are important to them. If they are using a mobile phone, it means they are out of the house or on the go, and under these conditions their location is probably important to them. Experience shows that someone in California is more likely to search for information about sushi restaurants than someone in Nebraska. Now the task became easier, because Ubatoo had data about the location of each user. When Stephen looked for users who were similar to each other, he added location to the estimated profile (for example, the only thing Jane and John might have in common is a similar geographic location).

After two coffee and snack breaks, he felt that what he had done should be enough to significantly improve his answers. He was already forty-five minutes over the recommended time. It was 14:45. He tested the improved software on some of the 50,000 users. The results looked better than those of the initial version. He let the software process the data of all the users and avoided looking at the answers that appeared on the screen. He was afraid that he would not be able to resist the temptation to improve it a little more, and the time at his disposal was running out. As soon as the answers were ready, he submitted them for evaluation and held his breath. Even before he could lean back in his chair, a message appeared on the screen:

Congratulations! Task 1 successfully completed. For Task 2, press any key.

Stephen was surprised to find how happy he was. Compared to the tasks he had had to perform as the head of an independent company, this was small in scope. But even so, the thought of working at Ubatoo, with free access to everything the company had to offer (not that he knew the exact extent of the data at its disposal, only that it was a huge amount), excited him. He felt almost dizzy. But he would have plenty of time to get excited later. Task 2 was already waiting for him on the screen in front of him.

Data Mining Task 2

Of the 100,000 participants pre-screened to fit the profile we were looking for, 78 bought a car that cost more than $161,000. The names of 40 of them will appear later and your task is to find out who the remaining 38 are.

You will get access to the emails they sent during the last year through Umail, a summary of their purchases through our credit service, and their zip codes. In addition, you will get access to all the searches they have performed on the Ubatoo search engine in the last year.

The recommended time to complete the task is 12 hours. We strongly recommend that you use external sources of information to complete the assignment. Any source of information may be used.


The twelve hours would end at three in the morning. It was a complex task, judging not only by the large amount of time allocated to it, but also by the data the candidates were allowed to see: emails, zip codes and credit card purchases. This was how Ubatoo made money: it understood its users and knew what to sell them and how. Although it provided its users with a huge number of services, it was, in fact, an advertising company - a persistent, efficient and uncompromising advertising machine.

As Stephen looked around, he noticed that half of the chairs were empty. Perhaps everyone had gone on a break. He went out to get food, but found none of the people he expected to see there. "Where is everybody?" he asked the first candidate he met at the counter. The guy had stopped at the appetizers, trying to decide what to get - salmon or pieces of buffalo meat.

"I don't know," he replied, "I think most of them didn't make it to the next level."

"Oh," Stephen replied. He couldn't think of anything smarter to say.

"In my group, cryptography, there is a rumor that half of the candidates have already flown home." In the end, he decided not to decide and piled a generous portion of salmon and meat together on his plate. "Which group do you belong to?"

"Data mining. I just started the second question."

"Hope you succeed. I have to run. I still have a lot of work to do," said the indecisive diner and hurried back to his seat, sniffing his plate with pleasure.

Stephen sat at an empty table and tried to eat slowly and convince himself that he needed a quarter of an hour's rest, which in the end, would only benefit him. He did not succeed. He swallowed the food in less than three minutes, burned his lips and tongue with a gulp of boiling coffee and almost ran back to his computer.

After he sat down, he felt stupid for having run. The problem was still unsolved, just as before. They say it is not easy to find a needle in a haystack, but his task was even harder: he had to find 38 people out of 100,000. 38 needles. He stared at the screen in front of him. The instructions for the second task were still flickering there, with no noticeable change.

Stephen pushed the keyboard back, put his elbows on the table and rested his head on his palms. He didn't bother to open his eyes. Start with what you already know.

OK. First, he knew that this task could be accomplished. The information was accessible and available. It was there, right at his fingertips. Zip codes, emails, consumption habits. What do zip codes reveal? Sometimes a lot: the user's financial situation, what kind of house he owns. All this information was available to the general public online. The Bureau of Statistics collects all this data, and more. In fact, he also knew that he could rule out anyone who shopped at Greensmart too often. It had been a long time since he had seen a $161,000 car in their parking lot. But what about other common patterns? He could also rule out everyone who went to the supermarket chains in person or shopped at cheap stores such as dollar stores. All this data appeared in the purchase details from Ubatoo's credit cards and payment services and was available to the participants in the competition.

What else might indicate that a user would buy an expensive car? His workplace? Possibly. He had also found a significant correlation with zip codes. What else? What could convince him to buy such an expensive car? he wondered. Nothing, actually. But maybe his friends would try to convince him. Would those friends be people who had bought extravagant cars themselves? Yes. If he were the type to buy a $161,000 car, he might try to convince his friends to buy one too. Or at least he would tell them about it. It was likely that some of those conversations took place over email. And what about pictures of the car? Those might help. And if he were considering buying an expensive car, wouldn't it be reasonable to assume that he would search for information about it on the Internet? Certainly.

It was time to do some thorough fieldwork. He needed to find out which cars cost over $161,000. He knew that 40 people had bought them. Who were their friends? He needed to read their emails and check whom they had told about the purchase. Who else had searched for information about those expensive cars? He had the names of the cars, and now he also had the potential group of friends who might have bought them. He knew where they lived and the appraised value of their homes. He had obtained this data from the records of the Bureau of Statistics and from a cursory look at several real estate sites. Could the task really be so easy to solve?

Stephen was lost in thought. He was almost inhumanly motionless, except for the rapid movements of his eyes beneath his closed lids. This immobility hid the hectic activity going on in his mind. The thoughts raced there in a dizzying game of characters, speeding through minefields of misleading information and countless data that required analysis, to ultimately reveal the remaining 38 people. Eleven hours left.
