Data secrecy may cripple U.S. attempts to slow pandemic

Science’s COVID-19 reporting is supported by the Pulitzer Center and the Heising-Simons Foundation.

California was a COVID-19 success story, until suddenly it wasn’t. Early in the pandemic, the state seemed to have the new coronavirus under control, but cases have begun to surge there, with daily case records set several times this month and deaths on the rise.

California officials whose COVID-19 responses were once hailed as enlightened are now receiving criticism—and some of the sharpest is coming from scientists seeking to help guide the state’s fight against the virus. Since April, epidemiologists from Stanford University and several University of California (UC) campuses have sought detailed COVID-19 case and contact-tracing data from state and county health authorities for research they hope will point to more effective approaches to slowing the pandemic. “It’s a basic mantra of epidemiology and public health: Follow the data” to learn where and how the disease spreads, says Rajiv Bhatia, a physician and epidemiologist who teaches at Stanford and is among those seeking the California data.

But the agencies have refused requests filed from April through late June, Science has learned. They cited multiple reasons including workload constraints and privacy concerns—even though records can be deidentified, and federal health privacy rules have been relaxed for research during the pandemic. As a result, Bhatia says, “In 4 months of the epidemic, collecting millions of records, no one in California or at the CDC [U.S. Centers for Disease Control and Prevention] has done the basic epidemiology.” Other states also fail to share highly specific information for their COVID-19 cases, which some scientists warn is hampering efforts to identify targeted measures that could stem the spread of SARS-CoV-2 without full-scale lockdowns.

As COVID-19 cases surged in Los Angeles, the Dodger Stadium parking lot became a massive testing site.

PHOTO: BRIAN VAN DER BRUG/LOS ANGELES TIMES/GETTY IMAGES

Bhatia and epidemiologists across the country are especially aggrieved after recent news reports revealed states are feeding the same data they desire to a federal contractor, Palantir Technologies, that has drawn criticism for data work supporting Immigration and Customs Enforcement deportations. For a data platform dubbed HHS Protect, Palantir is aggregating information on the spread of the new coronavirus on behalf of the U.S. Department of Health and Human Services (HHS), drawing on more than 225 data sets, including demographic statistics, community-based tests, and a wide range of state-provided data. (This week, sparking concern among public health experts, epidemiologists, and others, HHS also instructed hospitals to provide data on COVID-19 cases and patient information directly to the Palantir system—largely via a second contractor—rather than to CDC as they have for decades.)

Aggregated COVID-19 case and death data by county, and often by age and race, are publicly available in much of the country. But few locales link those cases and deaths to other information typically collected on the individuals, such as ZIP codes, occupations, living conditions, and known contacts with others ill with COVID-19. A survey of public data dashboards for all 50 states, Washington, D.C., and Puerto Rico by Prevent Epidemics, a group led by former CDC Director Tom Frieden, found that just 2% of data for 15 key COVID-19 indicators were fully reported. Only 40% of the data were partially reported—with glaring deficiencies for testing and contact tracing.

Bhatia and colleagues say that detailed COVID-19 case data could be mined to find factors most responsible for the “biggest bundles of hospitalizations and deaths.” He hypothesizes the data would, for example, confirm that even as commerce opens up, hospitalizations and deaths mostly emerge from familiar flashpoints. He cites care facilities for the elderly and large households that include infected essential workers who are asymptomatic or have mild symptoms; they may then pass the disease to relatives who have risk factors making them more vulnerable to severe illness.

“We think you can be more strategic on your interventions if you know where exposures actually occur,” says Jeffrey Klausner, a physician and epidemiologist at UC Los Angeles, who is also seeking his state’s data. For example, case data might confirm patchy evidence that indoor dining is risky, but parks and beaches are generally safe. If so, reopening outdoor settings with reasonable precautions might boost the economy and allay fears that severe risk of infection is ubiquitous.

As the pandemic evolves, regular reassessment of granular data on cases is vital, says Natalie Dean, a University of Florida (UF) biostatistician. “We have this whole new world now, where we are opening things back up. We have this shifting set of environments—indoor dining, bars, open retail buildings, offices, gyms. When we think of what are pressure points, there’s a lot we just don’t know yet. … We have to have ‘a learning architecture’ in place where there’s always some level of reflection.”

In the absence of clear, localized data from public authorities, some clinics in California have done their own research. After conducting thousands of COVID-19 tests in Oakland, “We have been able to pinpoint where some of the outbreaks are, both geographically and in terms of setting,” leading to highly targeted health education and testing outreach, says Noha Aboelata, a physician who heads the city’s Roots Community Health, which primarily serves people of color in underserved communities. Without neighborhood-level intelligence for public health outreach, you get “a one-size-fits-all solution that might exacerbate the problem,” she says. “Withholding the information is going to lead to deaths.”

In response to Science’s questions, the California Department of Public Health wrote that even deidentified data “can be used alone or in combination with publicly available information to identify an individual.” Caitlin Rivers, an epidemiologist at Johns Hopkins University’s Center for Health Security, calls reidentification a valid concern, but argues it would happen so rarely that the risk shouldn’t justify blanket denials of data requests during the pandemic. “There’s a lot of space in the middle that we haven’t really explored,” she adds. For example, to obviate some privacy concerns, Bhatia’s group requested case reports giving 10-year age ranges rather than specific ages, the week of COVID-19 onset rather than a specific date, and an occupational group rather than specific occupation.
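The kind of coarsening Bhatia’s group asked for can be illustrated with a minimal Python sketch. It is not the group’s or any agency’s actual method: the field names (`age`, `onset_date`, `occupation`), the `OCCUPATION_GROUPS` lookup, and the use of ISO week numbers are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical lookup from specific job titles to broad occupational groups.
OCCUPATION_GROUPS = {
    "registered nurse": "health care",
    "line cook": "food service",
    "cashier": "retail",
}


def age_band(age: int, width: int = 10) -> str:
    """Replace an exact age with a fixed-width range such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def onset_week(onset: date) -> str:
    """Replace an exact onset date with its ISO year and week number."""
    iso_year, iso_week, _ = onset.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"


def generalize(record: dict) -> dict:
    """Return a coarsened copy of one case record (assumed field names)."""
    occupation = record["occupation"].lower()
    return {
        "age_band": age_band(record["age"]),
        "onset_week": onset_week(record["onset_date"]),
        # Titles missing from the lookup fall into a catch-all group.
        "occupation_group": OCCUPATION_GROUPS.get(occupation, "other"),
    }


example = {"age": 47, "onset_date": date(2020, 6, 17), "occupation": "Line cook"}
print(generalize(example))
```

Coarsening of this sort lowers the chance that a record can be matched to a person, but it does not eliminate it, which is the residual risk the state health department cites.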

To show the value of richer data, Bhatia turned to Florida, which offers fairly detailed information on each of the more than 316,000 COVID-19 cases recorded there so far. The data set enabled him to graph, week by week, infections by age and whether the source of transmission was known. He found that early in the pandemic, the source was known for 80% of children, and 50% to 60% of adults. As Florida relaxed restrictions on businesses and other aspects of life, known sources of transmission remained at similar levels, even though casual contact with strangers was apparently increasing. Because some of the unknown sources of transmission were certainly asymptomatic or mildly symptomatic family or friends, such a finding suggests crowded beaches are playing a smaller role in Florida’s surge in infections than, say, increased numbers of large family gatherings at home or repopulated offices. “If people know that 50% or 60% of infections are resulting from people they know, including family, friends, and co-workers, they may better interpret risk,” Bhatia says.

Even Florida’s data exclude key details that some researchers view as essential to map and respond to the pandemic most effectively, including ZIP codes; more complete racial designations; and specifics on cases in long-term care facilities, jails, and prisons. That hampers targeted responses, says Thomas Hladish, an infectious disease researcher at UF who consulted extensively with state officials about COVID-19 data from March until this month. “A lot of the inconsistencies that you see are reasonably explained by well-intentioned people who are scrambling to reinvent [data fields and formats] on the fly without the appropriate technical background.” The Miami Herald also recently reported that municipal officials have not been able to get the state to provide case details they need to attack local outbreaks. The Florida Department of Health did not respond to Science’s requests for comment.

Epidemiologists praise more forthcoming agencies. The New York City Department of Health and Mental Hygiene posts unusually complete, continually updated data sets on COVID-19—showing detailed information on tests, cases, and deaths for 177 discrete neighborhoods—and uses them to map hot spots. It offers probable and confirmed deaths by age, race or ethnicity, underlying conditions, and other factors. One clear finding: Lower income areas, with a higher concentration of large households, suffered from COVID-19 at many times the rate of most wealthy areas.

The city’s health commissioner, Oxiris Barbot, says the system was crucial in decreasing cases by about 94% and deaths by about 98% since they peaked in April. “The transparency in data helped to paint a picture of how acute a situation we were in and the degree to which we needed New Yorkers to comply with what we were asking them to do,” she says. “It helped highlight as early as possible the ways in which the virus was ravaging Black and brown communities.” And the granular data allowed a calibrated response—including offers of hotel rooms to help people living in crowded conditions isolate when diagnosed with COVID-19. “Had it not been for that data analysis we would have been much slower in the response, and … many more lives would have been lost,” Barbot says.

“These are the right type of efforts using the right type of data,” Bhatia says. Figuring out how to stop the pandemic is “the biggest and most impactful policy decision we’ve seen in our lifetimes,” he adds. But in California and elsewhere, “We’re trying to predict the future without analyzing the data that’s in front of us. That’s a failure.”

This story was supported by the Science Fund for Investigative Reporting.
