Facebook's New 'AI Camera' Team Wants to Add a Layer to the World


The most important technological advances of the past decade are converging inside the battle for your phone’s camera.

Take a video of a birthday cake’s candles sparkling in an Instagram story, then tap the sticker button. Near the top of the list you’ll see a slice of birthday cake.

It’s a little thing. This simple trick is neither breathtaking nor magical. But it is the beginning of something transformative. Smartphones already changed how most people take pictures. The latest Silicon Valley quest is to reimagine what a camera is, applying the recent progress in artificial intelligence to allow your phone to read the physical world as easily as Google reads the web.

With 2 billion users, Facebook has reorganized the teams responsible for coding the camera software in Instagram, Facebook, and Messenger into a new unit it calls “AI Camera.” The group started last year with a single person. Now it has grown to 60 people, including Rick Szeliski and Michael Cohen, who worked on Photosynth (among other things) at Microsoft. The AI Camera team can also draw on the expertise of top neural-network researchers like Yann LeCun and Yangqing Jia in other parts of the company.

The AI Camera team is responsible for giving the cameras inside these apps an understanding of what you’re pointing them at. In the near future, your camera will understand its location, recognize the people in the frame, and be able to seamlessly augment the reality you see.

Right now, the team’s work has shipped in small ways, like the birthday sticker trick. But that is just the beginning of a development program that wants to transform the way you use the camera on your phone.

AI Camera combines many of the most important technological developments of the last several decades: neural networks, robotics, camera systems, and social-network data. This underlying basket of technologies—more adjacent to each other than arranged in “a stack,” as software developers might conceive it—is converging on the smartphone’s ability to take and display pictures.

Perhaps that seems absurd. But the human desire to capture, understand, and share images of the physical world has proven to be nearly insatiable, which is why this is the one domain where Facebook, Apple, Google, Samsung, Snapchat, and Microsoft directly compete.

Facebook’s work mirrors what’s happening at the other tech giants. Snap calls itself a camera company, and its realization of “lenses” is the best embodiment of augmented reality outside of Pokémon Go. At Google’s developer conference in May, Sundar Pichai showed off Google Lens, software that can detect what a camera is seeing and do something with that information, from entering a password to identifying a flower.

Prodded by Snap, the tech giants have begun to piece together what can be accomplished with the whole imaging and display system that a smartphone is. Every millisecond a phone’s camera is engaged is a moment when data can be captured, processed, understood, and looped back to the user for viewing.

“We’re basically looking at what pieces of technology we need to build amazing augmented-reality products,” said John Barnett, product manager on the AI Camera team.

Imagine, he said, a persistent, shareable social layer on the physical world, a spatialized Facebook that’s escaped the feed.

“Everyone got so excited about Pokémon Go when it was just one thing. What if there are 1,000 things like that?” Barnett asked. “All these layers of information that are spatially situated and relevant to what you care about.”

This is a radically different notion from the Facebook we’ve come to know, which, even though it made the leap from the desktop to “mobile,” rarely engages with the physical space where your hand clutches the phone.

“In the existing Facebook structure, we’re giving you everything that’s happening right now in the world, collapsing space to give you a slice of time,” said Barnett. “This is talking about collapsing time to give you this piece of space.”

Facebook would take on two modes: The News Feed, in the company’s terms, would show you what you care about now, and the spatial Facebook would tell you what’s happening here. One could read from, and write to, the world. Your world, at least.

* * *

On one of the many decks at Facebook’s Menlo Park campus, overlooking the mudflats of the South Bay, there’s a nondescript corner. Pipes run along it. A surveillance camera sits on the east-facing wall. To the naked eye, there’s nothing to distinguish it from the hundreds of other corners that help form Facebook’s gargantuan ark.

However, pull out a phone loaded with an app Facebook has in development and point it at the wall, and you get a beautiful piece of art, created primarily by Heather Day, a San Francisco artist. It made a brief appearance during Mark Zuckerberg’s keynote at the company’s F8 conference.

Video: “The World’s First Augmented Reality Art,” by Heather Day for Facebook Camera, via Vimeo.

Brilliant blues, cyans, teals drip from the pipes, pooling away from the wall. It’s cool, this thing hanging in the air.

Put the app away, open it back up and point it at the corner again, and the art is back. Move around it, move through it, and the ghostly remains of Day’s paint strokes and pours remain there. What if there were thousands of things like this all over the world? Next to burrito recommendations and Strava segment records and pictures of your friends, mugging for the camera, in situ.

This is one vision for augmented reality, the name for this layering of digital information on top of imagery of the physical world. AR has gotten a big push in recent months from Apple’s announcement of ARKit, a framework that lets developers build AR into their apps. Developers have been showing it off, and Google recently announced a similar (though not as widely lauded) set of tools called ARCore.

No matter what, AR is a ridiculously complicated task for a smartphone. Alvaro Collet is a computer-vision Ph.D. from Carnegie Mellon University who came to AI Camera from Microsoft. He’s standing next to me looking at the wall. “This is actually a pretty challenging scene because it is very plain,” Collet tells me.

The basic task mirrors what robots have had to do for decades. Researchers call it SLAM (simultaneous localization and mapping).

The theory and practice of SLAM were developed over the past 30 years by robotics researchers like SRI’s Randall Smith and Peter Cheeseman, the University of Sydney’s Hugh Durrant-Whyte, Sebastian Thrun, and Carnegie Mellon’s Martial Hebert, who was Collet’s advisor. Most of these people were working on real robots, largely autonomous vehicles loaded up with all kinds of sensors. But as smartphones began to roll out, researchers realized that their systems might be able to reach hundreds of millions of people, not a few dozen.

The problem of SLAM is that you need to build a map of the world in which to place the robot (or phone), but the position of the robot (or phone) and the map of the world are both uncertain.

“If you had all the features of the world already in 3-D, it would be very easy to place the position of the camera. And, conversely, if you had all the camera positions, it’s very easy to create the 3-D map of the world,” Collet said. “The problem with SLAM is that when you start, you don’t have a 3-D map and you don’t know where the camera is. That’s the simultaneous part.”
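To make that chicken-and-egg concrete, here is a deliberately tiny sketch in Python. It is not Facebook’s code; it flattens the problem to two dimensions and pretends the phone measures simple distances to landmarks rather than camera pixels. Given a known map, locating the phone is a small least-squares problem, and if the phone’s positions were known instead, locating a landmark would be the same problem with the roles swapped. SLAM has to do both at once, which is what makes it hard.

```python
# Toy illustration of the two halves Collet describes, in 2-D and with
# range (distance) measurements instead of camera pixels. Given the map,
# solve for the phone; the reverse problem has the same shape.
import numpy as np

landmarks = np.array([[0.0, 0.0],    # a known map: three landmarks, in meters
                      [4.0, 0.0],
                      [0.0, 3.0]])
true_phone = np.array([1.0, 1.0])    # where the phone "really" is

# Distances the phone would measure to each landmark, plus a little noise.
rng = np.random.default_rng(0)
ranges = np.linalg.norm(landmarks - true_phone, axis=1) + rng.normal(0, 0.01, 3)

# Subtracting the first range equation from the others makes the problem
# linear: 2*(xi - x1)*x + 2*(yi - y1)*y = r1^2 - ri^2 + |pi|^2 - |p1|^2
A = 2.0 * (landmarks[1:] - landmarks[0])
b = (ranges[0] ** 2 - ranges[1:] ** 2
     + np.sum(landmarks[1:] ** 2, axis=1)
     - np.sum(landmarks[0] ** 2))

estimate, *_ = np.linalg.lstsq(A, b, rcond=None)
print("estimated phone position:", estimate.round(2))   # close to [1.0, 1.0]
```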

There are many ways to go about the problem, encoded in different algorithms, and each has tradeoffs. Some provide excellent precision but are computationally expensive. Others consider the images from a sensor less extensively, but work quickly and without much computation.

Facebook finds itself building across both iOS and Android, which introduces many challenges. Facebook’s advantage, though, is its tremendous scale: 2 billion users and counting. But to use that scale, Facebook must make AR work on all kinds of crappy phones, not just Pixel 2s and Samsung Galaxy Note 8s and iPhone Xs. And that means the team actually deploys multiple algorithms to do SLAM. For lower-end phones, they do rougher, faster calculations. Higher-end phones get better performance because they can handle the processing.
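What that tiering might look like in practice is easy to imagine, though the sketch below is purely hypothetical; the names and thresholds are mine, not Facebook’s. The idea is simply that a weaker phone tracks fewer image features and refines its map less often, trading precision for frame rate.

```python
# Hypothetical device-tiering logic; the names and numbers are invented
# for illustration, not taken from any real app.
from dataclasses import dataclass

@dataclass
class TrackingConfig:
    max_features: int       # image features tracked per frame
    keyframe_hz: int        # how often the map gets refined
    use_dense_depth: bool   # expensive per-pixel depth, capable phones only

def config_for_device(cpu_cores: int, ram_gb: float, has_gpu_delegate: bool) -> TrackingConfig:
    """Pick a tracking configuration the device can actually sustain."""
    if cpu_cores >= 8 and ram_gb >= 6 and has_gpu_delegate:
        return TrackingConfig(max_features=1200, keyframe_hz=10, use_dense_depth=True)
    if cpu_cores >= 4 and ram_gb >= 3:
        return TrackingConfig(max_features=600, keyframe_hz=5, use_dense_depth=False)
    # Rough-and-fast fallback for the long tail of low-end phones.
    return TrackingConfig(max_features=250, keyframe_hz=2, use_dense_depth=False)

print(config_for_device(cpu_cores=4, ram_gb=2.0, has_gpu_delegate=False))
```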

Down on the low end, the AI Camera team must try to account for a bunch of mostly invisible hardware problems. Inside the phone, there is a camera, but there’s also an inertial-measurement unit, or IMU, which the team can use to tell how the phone is moving. The IMU contains gyroscopes and accelerometers. On low-end devices, all of these components have to be calibrated, and their clocks must be synchronized. And each device, because of looser manufacturing quality, might vary more from one unit to the next than one iPhone does from another.
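One small, concrete piece of that calibration work is figuring out how far apart the camera’s clock and the IMU’s clock run. The sketch below is a toy of my own, not Facebook’s code: both sensors watch the same hand motion, so sliding one signal against the other until they line up recovers the offset.

```python
# Toy clock-offset estimation between a camera and an IMU. Both signals
# are synthetic here, already resampled to a shared 100 Hz timeline.
import numpy as np

rate_hz = 100
t = np.arange(0, 5, 1 / rate_hz)
true_offset_s = 0.07                    # pretend the gyro clock runs 70 ms ahead

motion = np.sin(2 * np.pi * 1.3 * t) * np.exp(-0.2 * t)       # a hand wave
camera_rate = motion + np.random.default_rng(1).normal(0, 0.02, t.size)
gyro_rate = np.interp(t + true_offset_s, t, motion)           # shifted copy

# Cross-correlate and keep the lag where the two traces agree best.
lags = np.arange(-t.size + 1, t.size)
corr = np.correlate(camera_rate - camera_rate.mean(),
                    gyro_rate - gyro_rate.mean(), mode="full")
estimated_offset_s = lags[np.argmax(corr)] / rate_hz
print(f"estimated clock offset: {estimated_offset_s * 1000:.0f} ms")   # ~70 ms
```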

Once all that electronic work has been done, and the phone knows roughly where it is and the geometry of the scene, the next layer of technology gets piled on top: deep neural networks. The “neural” part means that this kind of software is “trained,” not programmed with traditional rules. After being shown large amounts of labeled data, the neural network can label new data based on what it has seen. The “deep” part refers to the network’s number of layers, which correspond to the complexity of the features it can pick out of a dataset.
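To make “trained, not programmed” concrete, here is a bare-bones network in Python, again a sketch of the mechanics rather than anything Facebook ships. It is shown four labeled examples of the XOR function and adjusts its weights until its answers match the labels; an image model does the same thing, only with millions of weights and photos instead of bits.

```python
# A minimal two-layer neural network, trained on labeled examples of XOR.
import numpy as np

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs
y = np.array([[0], [1], [1], [0]], dtype=float)               # labels

W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)    # layer 1 ("deep" means more of these)
W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)    # layer 2
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(10000):
    # Forward pass: the network's current guesses for every example.
    hidden = sigmoid(X @ W1 + b1)
    out = sigmoid(hidden @ W2 + b2)
    # Backward pass: nudge every weight to shrink the error on the labels.
    d_out = (out - y) * out * (1 - out)
    d_hidden = (d_out @ W2.T) * hidden * (1 - hidden)
    W2 -= hidden.T @ d_out
    b2 -= d_out.sum(axis=0)
    W1 -= X.T @ d_hidden
    b1 -= d_hidden.sum(axis=0)

print(out.round(2).ravel())   # close to the labels: [0, 1, 1, 0]
```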

Over the past five-ish years, this type of machine-learning system has transformed the way image recognition, among other things, is done. If you’ve ever used Google Photos to find pictures of business cards or mountains or people, you’ve made use of a deep neural network’s power.

Imagine, though, the next step: Instead of merely recognizing objects in captured photos and videos, the phone can recognize objects live, within the model of a scene that the device has already built. That’s only become possible in the last year.

“For the first time, you can run SLAM and deep networks on a cellphone,” Collet said. “We have two big teams: SLAM geometric teams and the other is deep nets. And the goal is these two things are going to combine.”
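A crude sketch of what that combination buys you, with made-up numbers and my own math rather than Facebook’s pipeline: SLAM supplies the camera’s pose, a deep net supplies a detection in pixel coordinates plus a depth guess, and back-projecting through a standard pinhole camera model pins that label to a fixed point in the world, so it stays put as the phone moves.

```python
# Anchoring a detector's 2-D label to a 3-D world point, given a camera
# pose from SLAM. Intrinsics and pose here are invented for illustration.
import numpy as np

# A hypothetical 1280x720 phone camera: focal length and principal point in pixels.
fx = fy = 1000.0
cx, cy = 640.0, 360.0

# Pose from SLAM: world-from-camera rotation (the camera yawed 30 degrees)
# and the camera's position in the world, in meters.
yaw = np.deg2rad(30)
R_wc = np.array([[ np.cos(yaw), 0.0, np.sin(yaw)],
                 [ 0.0,         1.0, 0.0        ],
                 [-np.sin(yaw), 0.0, np.cos(yaw)]])
cam_pos = np.array([0.2, 1.5, -0.5])

def pixel_to_world(u, v, depth_m):
    """Back-project a detected pixel at an estimated depth into world coordinates."""
    ray_cam = np.array([(u - cx) / fx, (v - cy) / fy, 1.0]) * depth_m
    return R_wc @ ray_cam + cam_pos

# The detector says: object centered at pixel (800, 400), roughly 1.2 m away.
anchor = pixel_to_world(800, 400, depth_m=1.2)
print("label anchored at world point:", anchor.round(3))
```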

That’s the only way you get to augmented reality of the kind that Facebook imagines. Then they’d just have to get people to populate all the layers of spatial information.

“One thing we really are about is giving everyday users—maybe a year from now, maybe two years from now—the ability to recreate that Heather Day scene with just tools you have on your phone,” Collet said.

Anyone with a Facebook account could create media and fix it to a spot in the world. There will be food recommendations and wedding photos and paintings dripping in the air. A globe of ghostly art and burrito spots.

* * *

But we do know one thing from the history of every social platform: People will make their own uses of the tools. They’ll find new, unforeseen uses and abuses. There will be unintended consequences to spatializing Facebook.

Some of these could be predictable. There is already spatial information out there, just not displayed in the way AI Camera imagines or running through Facebook. Yelp, for example, has struggled with troll reviewers. Restaurants have been struggling for a decade now to deal with the digital signs that lovers and haters affix to their doors.

Another cautionary example comes from Pokémon Go. Omari Akil wrote a post describing his experience playing the game as a black man. He spent more time worrying about whether other people would find him suspicious—and bring him into contact with police—than he did actually engaging with the app. “When my brain started combining the complexity of being black in America with the real-world proposal of wandering and exploration that is designed into the game play of Pokémon Go, there was only one conclusion,” he wrote. “I might die if I keep playing.”

The realities of race and gender in America, which already play out in ugly ways across the internet, will be amplified by the physicality of augmented reality. Not everyone will be able to access the same spaces with the same ease.

In 2016, Waze rolled out a high-crime alert in Brazil to let people navigate around “bad neighborhoods.” Microsoft ran into trouble for a similar 2012 patent that got termed the “avoid ghetto” feature.

Even in more benign examples, the imperfect fit of spatial information on top of space can cause problems. Near my home in North Oakland, navigation apps lead many people to make a dangerous left just past an overpass and before a much larger intersection that the urban plan and human driving intuition both discourage. Nearly every time I drive past the intersection, that left is causing problems in the traffic flow up Claremont Avenue.

It’s not that Facebook can or should be expected to fix trolling, or American antiblackness, or the complexities of layering the digital on the physical. But as they develop this world, they can build with these problems in mind.

There’s even an analogy within the AI Camera project. Collet, the computer-vision specialist, was describing all the work that they have to do to make their systems work with the weird and wild world of phones across the globe. The calibrations, the algorithms, the fault tolerance of the systems.

“If you don’t think about them from the start, it’s very hard, once you have a system to say, ‘Oh, maybe we should tolerate this better,’” he said.

And as it goes with the reality of physical components, so it should be with the reality of the ethical and behavioral aspects of augmented reality. It’s gonna be more work to consider the misuses and biases in the system, but considering those things now will make the system more robust later.

If the AI Camera team succeeds, they will open up a new and basically infinite space on top of the land. The open question is what that will do to the places under this new digital layer.