Spy Research Agency Is Building Psychic Machines to Predict Hacks

The Jaguar supercomputer at a Department of Energy lab in Oak Ridge, Tenn. Oak Ridge National Lab/AP

Using publicly available Internet data, supercomputer-like systems will estimate when a prowler might try to breach a system.

Imagine if IBM's Watson -- the "Jeopardy!" champion supercomputer -- could not only answer trivia questions and forecast the weather, but also predict data breaches days before they occur.

That is the ambitious, long-term goal of a contest being held by the U.S. intelligence community. 

Academics and industry scientists are teaming up to build software that can analyze publicly available data and a specific organization's network activity to find patterns suggesting the likelihood of an imminent hack.

The dream of the future: A White House supercomputer spitting out forecasts on the probability that, say, China will try to intercept situation room video that day, or that Russia will eavesdrop on Secretary of State John Kerry's phone conversations with German Chancellor Angela Merkel. 

IBM has even expressed interest in the "Cyber-attack Automated Unconventional Sensor Environment," or CAUSE, project. Big Blue officials presented a basic approach at a Jan. 21 proposers' day.

Aims to Get to Root of Cyberattacks

CAUSE is the brainchild of the Office for Anticipating Surprise under the director of national intelligence. A "Broad Agency Announcement" -- competition terms and conditions -- is expected to be issued any day now, contest hopefuls say.

Current plans call for a four-year race to develop a totally new way of detecting cyber incidents -- hours to weeks earlier than intrusion-detection systems, according to the Intelligence Advanced Research Projects Activity. 

IARPA program manager Rob Rahmer points to the hacks at Sony and health insurance provider Anthem as evidence that traditional methods of identifying "indicators" of a hacker afoot have not effectively enabled defenders to get ahead of threats.

This is "an industry that has invested heavily in analyzing the effects or the symptoms of cyberattacks instead of analyzing and mitigating the -- cause -- of cyberattacks," Rahmer, who is running CAUSE, told Nextgov in an interview. "Instead of reporting relevant events that happen today or in previous days, decision makers will benefit from knowing what is likely to happen tomorrow."

The project’s cyber-psychic bots will estimate when an intruder might attempt to break into a system or install malicious code. Forecasts also will report when a hacker might flood a network with bogus traffic that freezes operations -- a so-called denial-of-service attack.

Such computer-driven predictions have worked for anticipating the spread of Ebola, other disease outbreaks and political uprisings. But few researchers have used such technology for cyberattack forecasts.

At Least 150 People Interested -- No Word Yet on Size of the Prize Pot

About 150 would-be participants from the private sector and academia showed up for the January informational workshop. Rahmer was tight-lipped about the size of the prize pot, which will be announced later this year. Teams will have to meet various interim goals to advance to the next round of competition, such as picking data feeds, creating probability formulas and forecasting cyberattacks across multiple organizations.

At the end, "What you are most likely to be able to do is say to a client, 'Given the state of the world and given the asset you’re trying to protect or that you care about, here are the [events] you might want to worry about the most,'" David Burke, an aspiring participant and research lead for machine learning at computer science research firm Galois, said in an interview. "Instead of having to pay attention to every single bulletin that comes across your desk about possible zero days," or previously unknown vulnerabilities, it would be wonderful if some machine said, "These are the highest likelihood threats."
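The shortlist Burke describes can be pictured as a simple ranking. The sketch below is only an illustration, not anything CAUSE participants have proposed: the `Threat` class, the `top_threats` function, the scoring rule (likelihood times relevance) and every number in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Threat:
    name: str
    likelihood: float  # assumed estimate in [0, 1] that the event occurs soon
    relevance: float   # assumed weight in [0, 1] for the asset being protected


def top_threats(threats, k=3):
    """Return the k threats with the highest likelihood * relevance score."""
    ranked = sorted(threats, key=lambda t: t.likelihood * t.relevance, reverse=True)
    return ranked[:k]


# Hypothetical bulletins; the values are made up for illustration only.
bulletins = [
    Threat("phishing campaign", 0.60, 0.9),
    Threat("zero-day in unused software", 0.30, 0.1),
    Threat("denial-of-service", 0.20, 0.8),
    Threat("website defacement", 0.05, 0.5),
]

shortlist = [t.name for t in top_threats(bulletins, k=2)]
print(shortlist)
```

In this toy case the zero-day bulletin drops off the list because it affects software the client barely uses, which is the kind of filtering Burke says an analyst would want.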

His research focus is "advanced persistent threats," involving well-resourced, well-coordinated hackers who conduct reconnaissance on a system, find a security weakness, wriggle in and invisibly traverse the network.

"Imagine that CAUSE was all about the real-world analogy of figuring out whether some local teenagers are going to knock over a 7-Eleven. That would be really hard to predict. You probably couldn’t even tie that to any larger goal. But in the case of APTs -- absolutely" you can, Burke said. "The fact that APTs are on networks for a long period of time gives you not only the sociopolitical pieces of data or clues but you have all sorts of clues on your network that you can integrate."

It's not an exact science. There will be false alarms. And the human brain must provide some support after the machines do their thing.

"The goal is not to replace human analysts but to assist in making sense of the massive amount of information available and while it would be ideal to always find the needle in a haystack, CAUSE seeks to significantly reduce the size of the haystack for an analyst," Rahmer said.

Unclassified Program Will Trawl for Clues on Social Media

Fortunately or unfortunately, depending on one's stance on surveillance, National Security Agency intercepts will not be provided to participants. 

"Currently, CAUSE is planned to be an unclassified program," Rahmer said. "We’re going to ask performers to be creative in identifying these new signals and data sources that can be used."

Participants will be judged on their speed in identifying the future victim, the method of attack, time of future incident and location of the attacker, according to IARPA. 

Clues might be found on Twitter, Facebook and other social media, as well as online discussions, news feeds, Web searches and many other online platforms. Unconventional sources tapped could include black market storefronts that peddle malware and hacker group-behavior models. AI will do all this work, not people. Machines will try to infer motivations and intentions. Then mathematical formulas, or algorithms, will parse these streams of data to generate likely hits. 
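One common way to turn several data streams into a single likelihood is a weighted logistic score. The sketch below assumes that approach purely for illustration; IARPA has not said what formulas teams will use, and the signal names, weights and bias are invented.

```python
import math


def attack_probability(signals, weights, bias):
    """Combine signal strengths into a probability with a logistic function."""
    z = bias + sum(weights[name] * value for name, value in signals.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights; a real system would have to learn these from data.
weights = {"social_media_chatter": 1.5, "malware_listings": 2.0, "news_mentions": 0.5}
bias = -4.0  # assumed baseline: attacks are rare when no signal is present

quiet_day = {"social_media_chatter": 0.0, "malware_listings": 0.0, "news_mentions": 0.0}
noisy_day = {"social_media_chatter": 1.0, "malware_listings": 1.0, "news_mentions": 1.0}

p_quiet = attack_probability(quiet_day, weights, bias)
p_noisy = attack_probability(noisy_day, weights, bias)
print(f"quiet day: {p_quiet:.3f}, noisy day: {p_noisy:.3f}")
```

With these made-up numbers, a day with no signals scores below 2 percent, while a day on which all three streams light up scores 50 percent.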

One research thread Burke is pursuing examines the "nature of deception and counterdeception, particularly as it applies to the cyber domain," according to an abstract of his proposers' day presentation.

"Cyber adversaries rely on deceptive attack techniques, and understanding patterns of deception enables accurate predictions and proactive counterdeceptive responses," the abstract stated. 

It's anticipated that supercomputer-like systems will be needed for this kind of analysis. 

For example, "if you were able to look at every single Facebook post and you processed everything and ran it through some filter, through the conversations and the little day-to-day things people do, you could actually start to see larger patterns and you could imagine that is a ton of data," Burke said. "You would need some sort of big data technology that you’d have to bring to bear to be able to digest all that."

Still Nailing Down Specifics on Supercomputer Use

The final rules will indicate whether companies can or must use a supercomputer, and whether they can borrow federal computing assets, Rahmer said. "We definitely want innovation and creativity from the offerers," he added. 

Researchers at Battelle, a technology development organization, said they might harness fast data processing engines like Hadoop and Apache Spark. They added that the rules and their team partners will ultimately dictate the system used to amp up computing power.

"We have already recognized as both the rate of collection and the connections between data points grow we will need to move to a high-performance computing environment," Battelle’s CyberInnovations technical director Ernest Hampson said in an email. "For the CAUSE program, the data from several contractors could push us towards the need for a supercomputing infrastructure using technologies such as IBM’s Watson to support deep learning," or hardware such as a Cray Urika "could provide the power to fuel advanced analytics at-scale."

According to IBM's January briefing, the apparatus currently used to solve similar prediction problems "runs on x86 infrastructure." However, IBM sold its x86 server business to Chinese firm Lenovo last year. It remains to be seen what machine IBM might deploy, a company spokesman said.

"In theory, the government could say they are going to own the servers," IBM spokesman Michael B. Rowinski said. "We don't know ultimately that we would participate or what we even would propose."

Recorded Future, a six-year-old CIA-backed firm, already knows how to generate hacker behavior models by assimilating public information sources, like Internet traffic, social networks and news reports. But the company's analyses do not factor in network activity inside a targeted organization, because such data typically is confidential.
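A rough sense of why internal network data matters can be had from a textbook Bayesian odds update. The sketch below is an assumption-laden illustration, not Recorded Future's or any CAUSE team's method: it treats the two kinds of evidence as independent, and the prior and likelihood ratios are invented.

```python
def update_probability(prior, likelihood_ratios):
    """Apply Bayes' rule in odds form, assuming the evidence is independent."""
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)


prior = 0.01           # assumed baseline chance of an attack in a given week
public_signal = 3.0    # hypothetical likelihood ratio from public-source chatter
network_signal = 10.0  # hypothetical likelihood ratio from internal network observables

public_only = update_probability(prior, [public_signal])
combined = update_probability(prior, [public_signal, network_signal])
print(f"public data only: {public_only:.3f}, with network data: {combined:.3f}")
```

Under these made-up figures, public sources alone lift the estimate from 1 percent to about 3 percent, while adding network evidence lifts it to about 23 percent.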

"Doing this successfully is not simply the sociopolitical analysis applied to current flashpoints," Burke said. "You also have observables on a network: signs possibly of malware or penetration because many campaigns that take place go on for weeks or months. So you also have a lot of network data that you are going to end up crunching."