How Racial Data Gets 'Cleaned' in the U.S. Census

By Robyn Autry, The Atlantic | November 6, 2017

The national survey offers more identity choices than ever—until those choices get scrubbed away.

At a doctor’s visit, on a college-admissions application, or even in a consumer-marketing survey, Americans are regularly asked to classify themselves by race. Some protest this request by “declining to answer,” as forms often allow. After all, racial categories are social constructs. They don’t connote biological or genetic difference.

As an African American, I have never had difficulty knowing which box I am meant to check. Whether I do so depends on my understanding of why the information is being collected. Similar questionnaires in the late 19th and early 20th centuries didn’t afford such choice. At that time, before the current practice of self-identification, an enumerator or census taker would have visited my home and classified me as free or enslaved, and then determined whether I might be colored, mulatto, quadroon (one-quarter black), or octoroon (one-eighth).

While early racial data were gathered to feed an obsession with racial purity, and were even used to locate Japanese Americans for internment during World War II, over time the Census Bureau settled on bureaucracy to explain its work. And yet, a simple count of the population remains ideologically loaded. These data are not neutral or objective information about the population. Instead they reflect changing political priorities and techniques to grasp how the country’s population is seen—and how resources are made available to them.

* * *

Shortly after the country’s founding, the U.S. government began collecting data on the racial and ethnic make-up of every person in each household. Every decennial census ushers in some new language meant to enhance the accuracy and reliability of the census as a measurement of the entire national population. There’s symbolic power in being represented on the census—in being counted. But as the political scientist Melissa Nobles shows in her book Shades of Citizenship, these data are also used to track compliance with civil-rights legislation, particularly in the drawing of voting districts. They are linked to federal resources, intensifying public agitation around the categories.

During the years between each census, researchers, activists, politicians, and interest groups lobby for the rewording of a label, the addition (or elimination) of a category, or the disaggregation of another, such as Asian or American Indian or Alaska Native. In 2000, for example, “Hispanic or Latino, or Spanish origins” was reclassified from racial to ethnic data. Respondents were also allowed to select multiple boxes to reflect multiracial heritage for the first time. Additional changes that affect how the racial makeup of the country is represented are underway, including the creation of a separate category for people of Middle Eastern and North African descent (referred to as MENA).

Shifts in racial classifications raise questions about what exactly is being counted, how people interpret the same questions differently, and what to do about people’s changing perceptions of their racial background. In 2015, the Pew Research Center found that at least 9.8 million people reported a different racial or ethnic background than they did in 2000. When someone appears to “change” races, the resulting data is sometimes construed as erroneous.

The statistical accounting used to correct such errors is commonly referred to as “data cleaning” or “data cleansing.” This process involves identifying and then editing data already collected—through modification, enhancement, or deletion of responses—when it does not conform to some predetermined rules that standardize the data set. Ostensibly, the goal is to improve data quality by correcting measurement errors generated by people who complete the questionnaires or enter responses into the database. Data cleaning also aims to make a final data set comparable to other, related ones, such as the other national censuses and the American Community Survey.

Errors in reporting and recording certainly do happen. But if racial data must be cleaned, then some data is dirty. And that dirtiness is undeniably political. Some responses are more likely to be diagnosed as dirty. Given the goal of creating information that is comparable from one national census to the next, the data most under suspicion are those that correspond to the categories most in flux: people who checked more than one box, for example, or those who saw themselves as members of different racial or ethnic groups at different times.

While data cleansing can raise ethical questions about altering people’s responses, it offers a bureaucratic solution to a difficult position for the Census Bureau. The bureau is under public pressure to modify its data-collection methods, on the one hand. But, on the other, it is also expected to provide reliable data that is comparable over time and across other government agencies at the local, state, and national levels. The desire for comparability prompts some of the most intensive or imaginative cleaning.

By 2010, the two major changes from the previous censuses—the treatment of Hispanic, Latino, and Spanish ancestry as an ethnicity and the ability to check multiple racial categories—had yielded 63 possible responses for race: the original six categories (white; black or African American; American Indian or Alaska Native; Asian; Native Hawaiian or other Pacific Islander; some other race), plus an additional 57 possible combinations of these responses. Given the new information, identifying one group and distinguishing it from another became difficult. This led to the creation of new categories, established after data collection, such as “black, not Hispanic,” or “white, Hispanic.” For the most part, people who selected more than one race were recoded as “two or more races,” regardless of the combination. However, because no actual multiracial category is offered, the official racial categories are still preserved in the record. That makes them traceable later, by cleaning individuals’ responses retroactively.
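
The arithmetic behind the 63 responses can be checked directly. The sketch below, in Python, enumerates every non-empty combination of the six boxes and applies a simplified version of the “two or more races” recoding described above. The function names and the recoding rule are illustrative assumptions, not the Census Bureau’s actual procedure.

```python
from itertools import combinations

# The six race categories the article lists for the 2010 census.
CATEGORIES = [
    "white",
    "black or African American",
    "American Indian or Alaska Native",
    "Asian",
    "Native Hawaiian or other Pacific Islander",
    "some other race",
]


def all_responses(categories):
    """Return every non-empty set of checked boxes as a tuple."""
    return [
        combo
        for size in range(1, len(categories) + 1)
        for combo in combinations(categories, size)
    ]


def recode(checked):
    """Hypothetical tabulation rule: collapse any multi-box answer into one label.

    The original tuple of checked boxes is left untouched, which is why the
    underlying categories stay traceable after recoding.
    """
    return checked[0] if len(checked) == 1 else "two or more races"


responses = all_responses(CATEGORIES)
single = [r for r in responses if len(r) == 1]
multiple = [r for r in responses if len(r) > 1]

print(len(single), len(multiple), len(responses))  # 6 57 63
print(recode(("white", "Asian")))                  # two or more races
```

Six single boxes plus 57 multi-box combinations give the 63 responses, which is 2^6 - 1: every subset of six boxes except the empty one.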

In 2010, the “some other race” category proved the dirtiest. This selection included a write-in box where respondents were expected to provide the name of the race to which they felt they belonged. The vast majority of the more than 19 million people (6.2 percent of respondents) who made this selection also identified themselves as having “Hispanic, Latino, or Spanish” origins for the ethnicity question asked prior to their race. In its document 2010 Census Redistricting Data, the Bureau states that it used “automated” and “expert” coding to recode write-in responses for compliance with the master files (or predetermined rules) of the database or system. For example, the document states that someone describing themselves as “Haitian” and “Moroccan” was recoded to “black” and “white.” This “some other race” also includes people who preferred to write in responses like “multiracial” in lieu of ticking multiple boxes.

Even with a shrinking budget and new leadership, the bureau’s search for tidier data continues. When interviewed shortly after her retirement in January, the former U.S. chief statistician Katherine Wallman acknowledged that politics were most likely behind recent budget cuts. Irrespective of the latest political jockeying, the bureau has been discussing ways to cut costs without compromising data quality for years. As a result, the 2020 census will test an online response option, and use administrative records such as federal tax returns and postal-service files to estimate individual characteristics like sex and race when information is not self-reported.

While these new measures might reduce costs, civil-rights groups like the Leadership Conference on Civil and Human Rights are concerned that they will continue to undercount or otherwise misrepresent vulnerable populations and communities of color whose members are less likely to have reliable internet access. That might make them vulnerable to inaccurate identification in administrative records.

* * *

The Census Bureau didn’t respond to a request for comment or clarification about its perception of dirty data. Nevertheless, the bureau likely finds itself in a cultural minefield, as it becomes a site where debates unfold about which individuals and groups are rendered invisible, as much as how finite public resources get allocated. The ongoing dispute over whether future censuses should or will include a question about sexual orientation or gender identity belies the simplicity of the current sex question, which only asks respondents if they are male or female. With more public pressure and social change, that data might also become disaggregated one day, and then recoded into categories like “cisgender male” or “female, not transgender.”

Some people bristle at being asked to reduce the complexity of their self-perceptions into a singular choice. The “check-this-box” mentality of the census is at odds with the more fluid and ambiguous self-perceptions of the population: people originating from outside the country, for example, or those habituated to customizable digital profiles, like those on Facebook, which appear to revel in the uncertainty of multitudinous identity. If anything, these digital tools have helped accelerate citizens’ willingness to self-identify in categories broader than those provided by the government—and even to demand to be able to do so.

Even so, some of the choices haven’t changed. Since the first census in 1790, one category has remained stable, or at least been modified the least on the national census and other official government forms: “white.”
