Data patterns alone can be enough to give away what video you’re watching on YouTube.
Earlier this month, a lobby group for major internet providers like Comcast and Verizon attacked a set of online-privacy regulations they believe are too strict. In a filing to the Federal Communications Commission, the group argued providers should be able to sell customers’ internet history without the customers’ permission, because that information shouldn’t be considered sensitive. Besides, the group contended, web traffic is increasingly encrypted anyway, making it invisible to providers.
It’s certainly true encryption is on the rise online. Data from Mozilla, the company behind the popular Firefox browser, shows more than half of web pages use HTTPS, the standard way of encrypting web traffic. When sites like The Atlantic use HTTPS, a lock icon appears in users’ web browsers, indicating the information being sent to and from servers is scrambled and can’t be read by a third party that intercepts it—that includes ISPs.
But even if 100 percent of the web were encrypted, ISPs would still be able to extract a surprising amount of detailed information about their customers’ virtual comings and goings. This is particularly significant in light of a bill that passed Congress this week, which granted the lobby group’s wish: It allows ISPs to sell their customers’ private browsing history without their consent.
Although the exact URL of a page accessed through HTTPS is hidden from the provider, the provider can still see the domain the URL is on: For example, your ISP can’t tell exactly what story you’re reading right now, but it can tell you’re somewhere on theatlantic.com. That may not reveal much other than your (excellent) taste in news sources. But a user who visited a page on plannedparenthood.com and then a page on dcabortionfund.com may have revealed much more sensitive information.
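To make that split concrete, here is a minimal Python sketch of which parts of an HTTPS address stay visible. It assumes the hostname leaks through DNS lookups and the unencrypted part of the TLS handshake, while the path and query string travel inside the encrypted connection. The function name `split_visibility` and the example URL are hypothetical.

```python
from urllib.parse import urlsplit

def split_visibility(url):
    """Separate the parts of an HTTPS URL a network provider can
    typically observe from the parts carried inside the encryption."""
    parts = urlsplit(url)
    return {
        # Assumption: the hostname is exposed by DNS and the TLS handshake.
        "visible": {"hostname": parts.hostname},
        # The path and query are sent only after encryption begins.
        "hidden": {"path": parts.path, "query": parts.query},
    }

# Hypothetical article URL, used only for illustration.
example = split_visibility("https://www.theatlantic.com/technology/some-story?ref=home")
print(example["visible"])
print(example["hidden"])
```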
That’s an example from a 2016 report prepared by Upturn, a think tank that focuses on civil rights and technology. The Upturn report also sets out some of the sneaky ways user activity can be decoded based only on the unencrypted metadata that accompanies encrypted web traffic, also known as “side channel” information. (These methods probably aren’t widely in use right now, but they could be deployed if ISPs decided it’s worthwhile to try to learn more about encrypted traffic.)
Website fingerprinting, for example, relies on the unique characteristics of a particular web page to reveal when it’s being accessed. When a user visits a page, his or her browser pulls data from various servers in a particular order. Based on that pattern, a network provider might be able to tell what page the user is visiting, even without having access to any of the actual data streams it’s transporting. (For this to work, the network operator would have to have already analyzed the loading pattern associated with the particular website the user is visiting.)
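A rough Python sketch of the idea follows. It is an illustration under simplifying assumptions, not a method taken from the Upturn report: each page's fingerprint is reduced to the ordered list of servers a browser contacts, and an observed sequence is compared against previously recorded ones using the standard library's `difflib.SequenceMatcher`. The server labels, page names, and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher

def similarity(observed, fingerprint):
    """Score from 0.0 to 1.0 for how closely two server sequences agree."""
    return SequenceMatcher(None, observed, fingerprint).ratio()

def guess_page(observed, known_fingerprints, threshold=0.8):
    """Return the best-matching page name, or None if nothing is close.

    known_fingerprints maps a page name to the ordered list of servers
    recorded when that page was loaded in advance.
    """
    best_page, best_score = None, 0.0
    for page, fingerprint in known_fingerprints.items():
        score = similarity(observed, fingerprint)
        if score > best_score:
            best_page, best_score = page, score
    return best_page if best_score >= threshold else None

# Hypothetical fingerprints recorded ahead of time.
KNOWN = {
    "news-article": ["cdn-a", "ads-b", "fonts-c", "cdn-a", "video-d"],
    "clinic-page": ["cdn-e", "fonts-c", "maps-f", "cdn-e"],
}

# One request is missing from the observation, but the match still holds.
print(guess_page(["cdn-a", "ads-b", "cdn-a", "video-d"], KNOWN))
```

A real attack would likely draw on richer signals, such as the size and timing of each transfer, but the matching step has the same shape: compare what was observed against a catalogue built beforehand.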
In November, a group of researchers from Israel’s Ben-Gurion and Ariel Universities demonstrated a way to extend the idea behind website fingerprinting to videos watched on YouTube. By matching the encrypted data patterns created by a user viewing a particular video to an index they’d created previously, they could tell what video the user was watching from within a limited set, with a startling 98 percent accuracy.
Ran Dubin, a Ph.D. candidate at Ben-Gurion and the research paper’s primary author, told me the discovery came out of work he’d been doing to optimize video streaming. He wanted to know if he could figure out the quality at which users were watching YouTube videos, so he analyzed the way devices received data as they streamed.
He quickly realized he’d stumbled into something bigger.
“The network patterns that belong to each video title have very, very strong meaning,” Dubin said. “I found out that I could actually recognize each stream.”
The giveaway, he found, was embedded in the way devices choose a bitrate—an indicator of video quality—at which to stream the video. At the beginning of a stream, the player receives quick spurts of data, which begin to space apart after the video has been playing for a while and the player has settled on a bitrate. The pattern of these spikes helps identify each individual video.
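One way to picture those spurts: group the encrypted packets an observer sees into bursts wherever there is a pause between them, then add up the bytes in each burst. The Python sketch below does that under stated assumptions. The `(timestamp, size)` packet format, the function name `burst_sizes`, and the half-second gap are hypothetical choices, not details from Dubin's paper.

```python
def burst_sizes(packets, max_gap=0.5):
    """Group packets into bursts and return the total bytes in each.

    packets is a list of (timestamp_seconds, size_bytes) tuples sorted
    by time. A pause longer than max_gap seconds starts a new burst.
    """
    bursts = []
    previous_time = None
    for timestamp, size in packets:
        if previous_time is None or timestamp - previous_time > max_gap:
            bursts.append(size)   # a pause was seen: start a new burst
        else:
            bursts[-1] += size    # still inside the current burst
        previous_time = timestamp
    return bursts

# Hypothetical capture: two packets, a pause, two more, a longer pause, one more.
capture = [(0.0, 100), (0.1, 200), (2.0, 50), (2.05, 50), (5.0, 300)]
print(burst_sizes(capture))  # [300, 100, 300]
```

No payload is ever read here. Only the timing and size of the traffic are used, both of which remain visible when the contents are encrypted.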
The researchers assembled fingerprints from 100 YouTube videos by using a browser crawler to automatically download each video under various network conditions, then cataloguing the resulting data pattern. Next, they analyzed the traffic patterns created by a device as it played one of 2,000 videos—including the 100 target videos. Using an algorithm to match the stream to the nearest fingerprint, the researchers could tell when one of the target videos was being watched. Not once was a video outside the set of 100 accidentally identified as inside the target set.
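The matching step can be sketched as a nearest-neighbor search with a cutoff, so that streams far from every known fingerprint are rejected rather than mislabeled. The Python below is a simplified stand-in and not the researchers' actual algorithm: it assumes each stream is summarized as a list of burst sizes and compares them by Euclidean distance. The video titles, numbers, and `max_distance` value are hypothetical.

```python
import math

def distance(stream, fingerprint):
    """Euclidean distance over the bursts both sequences have in common."""
    n = min(len(stream), len(fingerprint))
    if n == 0:
        return math.inf  # nothing to compare
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(stream[:n], fingerprint[:n])))

def identify(stream, fingerprints, max_distance):
    """Return the closest known title, or None if none is close enough."""
    best_title, best_distance = None, math.inf
    for title, fingerprint in fingerprints.items():
        d = distance(stream, fingerprint)
        if d < best_distance:
            best_title, best_distance = title, d
    return best_title if best_distance <= max_distance else None

# Hypothetical fingerprints: kilobytes per burst for two target videos.
TARGETS = {
    "video_a": [820, 790, 410, 400, 395],
    "video_b": [600, 640, 300, 310, 305],
}

print(identify([815, 795, 405, 402, 390], TARGETS, max_distance=50))  # video_a
print(identify([100, 120, 90, 95, 100], TARGETS, max_distance=50))    # None
```

The cutoff plays the role described above: a video outside the target set should fall far from every fingerprint and come back as unidentified.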
The technique could be used by law enforcement to identify users who are watching ISIS propaganda videos, Dubin said. It could also be used to compile data on users’ viewing habits and sell it to advertisers, and that’s where the bill that just passed Congress comes in.
If President Donald Trump signs the bill, ISPs will have free rein to sell data they gather on their customers without asking for consent. As online encryption spreads further and further across the internet, there will be monetary incentives to dig up as much information on users as possible, to offset the loss of access to more detailed unencrypted data. Tricks like Dubin’s, which might have otherwise been too costly and inconvenient to put in place, could become an attractive way to glean valuable information about user habits and turn it over to advertisers for big money.