This Speck of DNA Contains a Movie, a Computer Virus, and an Amazon Gift Card


By Ed Yong, Staff Writer, The Atlantic

March 3, 2017

Meet the storage format that never goes obsolete.

In 1895, the Lumiere Brothers, among the first filmmakers in history, released a movie called The Arrival of a Train at La Ciotat Station. Just 50 seconds long, it consists of a silent, unbroken, monochrome shot of a train pulling into a platform full of people. It was a vivid example of the power of “animated photographs,” as one viewer described them. Now, 122 years later, The Arrival of a Train is breaking new ground again. It has just become one of the first movies to be stored in DNA.

In the famous double-helices of life’s fundamental molecule, Yaniv Erlich and Dina Zielinski from the New York Genome Center and Columbia University encoded the movie, along with a computer operating system, a photo, a scientific paper, a computer virus, and an Amazon gift card.

They used a new strategy, based on the codes that allow movies to stream reliably across the Internet. In this way, they managed to pack the digital files into record-breakingly small amounts of DNA. A one-terabyte hard drive currently weighs around 150 grams. Using their methods, Erlich and Zielinski can fit 215,000 times as much data as that drive holds in a single gram of DNA. You could fit all the data in the world in the back of a car.
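To put those figures side by side, here is a back-of-the-envelope calculation in Python. It is a sketch that reads "215,000 times as much data" as 215,000 times the one-terabyte capacity of the drive, and it assumes decimal units (1 petabyte = 1,000 terabytes). The function and parameter names are made up for illustration.

```python
def dna_density(drive_tb=1, drive_grams=150, multiple=215_000):
    """Compare the article's hard-drive and DNA figures.

    Assumption: 'multiple' scales the drive's capacity, so one gram
    of DNA holds multiple * drive_tb terabytes.
    """
    dna_tb_per_gram = multiple * drive_tb
    dna_pb_per_gram = dna_tb_per_gram / 1000  # decimal petabytes
    # Gram for gram: the drive stores drive_tb / drive_grams terabytes per gram.
    ratio_by_weight = dna_tb_per_gram * drive_grams / drive_tb
    return dna_pb_per_gram, ratio_by_weight


pb_per_gram, ratio = dna_density()
print(f"{pb_per_gram:.0f} petabytes per gram of DNA")
print(f"about {ratio:,.0f} times denser than the drive, by weight")
```

Under that reading, a gram of DNA holds 215 petabytes, and gram for gram it is roughly 32 million times denser than the drive.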

Storing information in DNA isn’t new: life has been doing it for as long as life has existed. The molecule looks like a twisting ladder, whose rungs are made from four building blocks, denoted by the letters A, C, G, and T. The sequence of these letters encodes the instructions for building every living thing. And if you can convert the ones and zeroes of digital data into those four letters, you can use DNA to encode pretty much anything.
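The simplest way to picture that conversion is to let each DNA letter stand for two bits. The Python sketch below does exactly that. The particular mapping (00 to A, 01 to C, 10 to G, 11 to T) is an assumption chosen for illustration, not necessarily the one any of the teams in this story used, and the function names are hypothetical.

```python
BASES = "ACGT"  # assumed mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T


def bytes_to_dna(data: bytes) -> str:
    """Turn each byte into four DNA letters, two bits per letter."""
    letters = []
    for byte in data:
        for shift in (6, 4, 2, 0):  # most significant bit pair first
            letters.append(BASES[(byte >> shift) & 0b11])
    return "".join(letters)


def dna_to_bytes(sequence: str) -> bytes:
    """Reverse bytes_to_dna: every four letters become one byte."""
    if len(sequence) % 4:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(sequence), 4):
        byte = 0
        for letter in sequence[i:i + 4]:
            byte = (byte << 2) | BASES.index(letter)
        out.append(byte)
    return bytes(out)


print(bytes_to_dna(b"Hi"))        # CAGACGGC
print(dna_to_bytes("CAGACGGC"))   # b'Hi'
```

Two bits per letter is the ceiling for a four-letter alphabet; a practical scheme would likely give up some of that capacity to avoid sequences that are hard to synthesize or read.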

Why bother? Because DNA has advantages that other storage media do not. It takes up much less space. It is very durable, as long as it is kept cold, dry, and dark—DNA from mammoths that died thousands of years ago can still be extracted and sequenced. And perhaps most importantly, it has a 3.7-billion-year track record. Floppy disks, VHS, zip disks, laser disks, cassette tapes… every media format eventually becomes obsolete, and every new format forces people to buy new reading devices and update their archives. But DNA will never become obsolete. It has such central importance that biologists will always want to study it. Sequencers will continue to improve, but there will always be sequencers.

George Church from Harvard University made a foray into DNA storage in 2011, encoding his newly published book, some images, and a JavaScript program. A year later, Nick Goldman and Ewan Birney from the European Bioinformatics Institute improved on his efforts, with a more complex cipher. They encoded all of Shakespeare’s sonnets, a clip of Martin Luther King’s “I have a dream” speech, a PDF of the paper from James Watson and Francis Crick that detailed the structure of DNA, and a photo of their institute, into a speck of DNA so small that when it arrived in their lab, Goldman didn’t see it. He thought he was staring at an empty tube.

The big catch with DNA is that we can only create and sequence it as small stretches, a few hundred letters long. So if you want to encode a large piece of data, you need to break it down, and synthesize it as a messy soup of DNA fragments. It’s hard to ensure that all of these are evenly represented, so there’s a risk of losing bits of data.

Goldman and Birney coped with this by creating an overlapping code, so that each bit of data was represented by at least four fragments of DNA. If they lost one, the same information would still exist in three other places. It was a good strategy but also a slightly inefficient one. And it wasn’t perfect: the team still encountered a few errors when they tried to recover their files. “I thought we could do something more efficient and robust,” says Erlich.
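The overlapping idea can be sketched in a few lines of Python. This is a toy illustration, not Goldman and Birney's actual code: the window sizes are invented, the function names are hypothetical, and the very ends of the sequence are covered by fewer than four fragments because the sketch does no padding.

```python
def make_fragments(sequence, step=4, copies=4):
    """Cut a sequence into overlapping windows of step * copies letters.

    Each window starts 'step' letters after the previous one, so every
    letter away from the two ends appears in 'copies' fragments.
    """
    window = step * copies
    if len(sequence) < window or len(sequence) % step:
        raise ValueError("length must be a multiple of step and >= window")
    return [(start, sequence[start:start + window])
            for start in range(0, len(sequence) - window + 1, step)]


def reassemble(fragments, length):
    """Rebuild the sequence from whichever (start, text) fragments survive."""
    letters = [None] * length
    for start, text in fragments:
        for offset, letter in enumerate(text):
            letters[start + offset] = letter
    if None in letters:
        raise ValueError("too many fragments lost: some positions uncovered")
    return "".join(letters)


original = "ACGTTGCA" * 5
fragments = make_fragments(original)
survivors = fragments[:3] + fragments[4:]  # lose one fragment
print(reassemble(survivors, len(original)) == original)  # True
```

The cost is visible in the numbers: with four copies of everything, roughly four letters of DNA are synthesized for every letter of payload, which is the inefficiency Erlich set out to reduce.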

Coincidentally, online streaming services like Netflix and Spotify face a similar problem. They send information across choppy channels, and they also need to recover that data perfectly, regardless of missing fragments. They solve the problem using fountain codes, a style of coding that partitions data into small packets (or “droplets”) in such a way that you can recover the whole thing even if you only snag a random subset. As long as you can catch enough droplets, regardless of which ones you miss, you can reconstruct the entire stream. Erlich compares it to doing a giant Sudoku puzzle: If some of the squares are filled in, you can deduce what the others are.
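For readers who want to see the Sudoku logic at work, here is a minimal fountain code in Python, in the style known as a Luby transform code. It is a sketch under simplifying assumptions and not the DNA Fountain software: it picks how many segments to mix using the textbook "ideal soliton" distribution, it pads the data with zero bytes, and every function name is invented for illustration. Each droplet is the XOR of a random handful of segments, and the seed that chose them travels with it.

```python
import random


def split_segments(data, size):
    """Pad with zero bytes and cut the data into equal-sized segments."""
    padded = data + b"\0" * (-len(data) % size)
    return [padded[i:i + size] for i in range(0, len(padded), size)]


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def pick_indices(seed, k):
    """Derive from the seed which of the k segments a droplet mixes."""
    rng = random.Random(seed)
    # Ideal soliton distribution: P(1) = 1/k, P(d) = 1/(d(d-1)) for d >= 2.
    weights = [1 / k] + [1 / (d * (d - 1)) for d in range(2, k + 1)]
    degree = rng.choices(range(1, k + 1), weights=weights)[0]
    return rng.sample(range(k), degree)


def make_droplet(seed, segments):
    """A droplet is (seed, XOR of the segments that seed selects)."""
    payload = bytes(len(segments[0]))
    for i in pick_indices(seed, len(segments)):
        payload = xor(payload, segments[i])
    return seed, payload


def decode(droplets, k):
    """Peel droplets until all k segments are known, or return None."""
    solved = {}
    pending = []  # each entry: [set of unsolved indices, payload]
    for seed, payload in droplets:
        pending.append([set(pick_indices(seed, k)), payload])
        progress = True
        while progress:
            progress = False
            for entry in pending:
                # Strip out every segment we already know.
                for i in [i for i in entry[0] if i in solved]:
                    entry[1] = xor(entry[1], solved[i])
                    entry[0].remove(i)
                # One unknown left: the payload is that segment.
                if len(entry[0]) == 1:
                    solved[entry[0].pop()] = entry[1]
                    progress = True
            pending = [e for e in pending if e[0]]
        if len(solved) == k:
            return [solved[i] for i in range(k)]
    return None


message = b"The Arrival of a Train at La Ciotat Station"
segments = split_segments(message, 4)
# Simulate a lossy channel: every third droplet never arrives.
received = (make_droplet(seed, segments)
            for seed in range(5000) if seed % 3 != 0)
recovered = decode(received, len(segments))
print(b"".join(recovered)[:len(message)] == message)
```

The decoder never asks which droplets went missing. It simply keeps collecting until the puzzle solves itself, which is why the approach suits a medium where fragments drop out unpredictably.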

Using fountain codes, the duo developed a cipher that’s 60 percent more efficient than previous ones, and comes close to the limit of how densely information can be packed into DNA. “We get very close to an optimal configuration,” Erlich says.

They used this system, which they call DNA Fountain, to encode: the train movie; KolibriOS, the smallest computer operating system around; the image that was sent on the Pioneer 10 and 11 probes; a scientific paper that describes how much information can fit into a given medium; a virus called Zipbomb that fills your hard drive with junk (“We thought it would be fun,” says Erlich); and a $50 Amazon gift card. (The gift card has already been deciphered and spent, by one of Erlich’s Twitter followers.)

They ended up with a library of 72,000 DNA fragments, which they then sequenced, decoded, and reassembled. In the process, they lost more than 2,000 of the fragments, but they still managed to recreate the files perfectly.

DNA storage has another weakness. The act of sequencing the strands also destroys them, so this is a storage medium that gradually disappears the more it is read. “My daughter loves Frozen,” says Erlich. “If we had encoded that damn Let It Go song, we would run out of DNA within a week.” Fortunately, DNA, by its nature, is also very easy to copy, so it is trivial to double up a cache of DNA-encoded data. Every time you do that, you risk introducing errors: copies of copies are rarely identical to the originals. But DNA Fountain is so resistant to errors that even when Zielinski copied the data cache ten times over, she could still recover the files perfectly.

“This work is great,” says Birney, and proves that DNA storage “is a really robust idea.” That being said, he and Goldman are working on their own updated coding scheme, which they hope to test and release in the near future. Microsoft is also getting in on the action. Last July, Microsoft researcher Karin Strauss and computer scientist Luis Henrique Ceze from the University of Washington stored a record 200 megabytes of data in DNA. “We are convinced of the density benefits of DNA as a storage medium and are working on improving the capacity and system design to make it practical for storage,” they say.

For DNA storage to become mainstream, it will have to be much cheaper. It is still expensive to sequence DNA, and really expensive to actually synthesize it. However, both costs are falling. When Birney and Goldman published their study in 2012, it cost $12,400 to encode a megabyte of data. Now, it costs just $3,500. But even if those costs fall further, synthesizing DNA is still a niche activity, done by a small number of facilities that support research labs. There’s currently not enough capacity around the world to encode a petabyte of data.

But Erlich predicts this will change as he and others prove that DNA is the format of the future. “The first hard drive needed four people to carry it,” he says. “After decades of extensive research and development, we have thumb drives. That’s a small fraction of the money that’s gone into DNA synthesis so far. My hope is that by focusing on better approaches, we can realize the potential of DNA storage.”
