Investigations

Mass. resident data may train the next generation of AI

The state has little to say about its “data commons” project, which it hopes will make Massachusetts a hub for tech development.

Jonathan Gerhardson

Published

4 hours ago

The Massachusetts Green High Performance Computing Center in Holyoke. (Photo: Jonathan Gerhardson).

Last year, the Massachusetts Technology Collaborative announced a big request for proposals. The quasi-public agency, which is tasked with supporting businesses in the state’s tech sector, was looking for consultants to help establish a “data commons collaborative.”

The data commons is part of MassTech’s AI Hub — an initiative that Gov. Maura Healey’s administration has pushed forward “with the goal of cementing the state as a global leader in the development, deployment and governance of artificial intelligence.” The Data Commons Collaborative is the AI Hub’s flagship initiative, meant to provide valuable data to “fuel AI innovation” in the state.

A data commons has genuine upsides, according to experts. Public stewardship and democratized access to data, made accessible in a standardized, secure way, could hypothetically preserve privacy for individuals while leveling the playing field for startups and academic researchers competing against Big Tech. That could spark new innovations and foster economic development across seven priority sectors identified by the government: life sciences, health care, robotics, financial services, advanced manufacturing, climate tech, and education.

However, some technology experts have warned that if the state doesn’t set up its data commons in the right way — taking adequate steps to protect data privacy, for example — it could harm people across Massachusetts and beyond.

One AI researcher turned political candidate sees some clear benefits to researchers being able to share data more freely. Jason Poulos, who is currently running for a U.S. House seat in Massachusetts’ 4th Congressional District, did a machine learning postdoctoral fellowship at Brigham and Women’s Hospital. He told The Shoestring that his work involved training AI models using fully identified patient electronic health records to improve care for diabetes patients. Using real data like that is allowed, but researchers couldn’t share it outside the hospital if, for example, they published a paper and wanted other researchers to be able to reproduce their work — a key part of the scientific method.

The data hub could change that, using a variety of not-yet-disclosed techniques for protecting the identities of individuals appearing in datasets.

Which techniques the state ultimately decides to use matters quite a bit. However, MassTech and Healey’s office have thus far declined to answer The Shoestring’s questions about how they plan to balance individual privacy concerns with allowing wide access to useful information.

The Shoestring’s investigation of submitted proposals found that the tech companies vying for MassTech’s business suggested a variety of data-privacy methods, including “differential privacy” and the creation of “synthetic” datasets. Those are two different approaches that both alter the original data with the hope of preventing re-identification.

Several experts told The Shoestring that those approaches, if done right, can protect privacy. Gary Howarth, a National Institute of Standards and Technology researcher who co-authored the agency’s guidelines for differential privacy, said that approach in particular is the only “mathematical definition of privacy that leads to a framework for protecting privacy.”

“Our intuition on how to protect privacy doesn’t actually protect privacy very well,” said Howarth.

Instead of simply removing specific identifying pieces of data, like name or date of birth, Howarth explained, a differentially private dataset promises plausible deniability about whether any specific person is part of the data, scrambling it in a way that leaves broad patterns intact.
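To make that idea concrete, here is a minimal sketch of one textbook way to answer a counting question with differential privacy, known as the Laplace mechanism. Nothing in the proposals or in MassTech’s documents says this is the technique the state would use; the function names, the toy records and the privacy setting `epsilon` are all assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered on zero.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records matching `predicate`.

    Adding or removing one person changes a count by at most 1, so noise
    with scale 1/epsilon gives epsilon-differential privacy for this query.
    A smaller epsilon means more noise and stronger privacy.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so the demo is repeatable
    # Hypothetical toy data: 1,000 invented patients, a quarter flagged diabetic.
    patients = [{"diabetic": i % 4 == 0} for i in range(1000)]
    noisy = private_count(patients, lambda p: p["diabetic"], epsilon=0.5, rng=rng)
    print(f"Noisy count of diabetic patients: {noisy:.1f}")
```

The broad pattern (about 250 of 1,000 patients) survives the noise, but no single answer reveals whether any one person’s record was included. The same noise would swamp a count of two or three, which is the tradeoff experts describe.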

“For instance,” reads the proposal submitted by one vendor, 22nd Century, “if a public health researcher at Massachusetts General Hospital seeks to study trends in diabetes prevalence across the state. Due to HIPAA restrictions, the original patient-level dataset cannot be broadly shared. Instead, the researcher requests access to a synthetic version of the dataset through the Commons. Using the Synthetic Data Environment, the system generates a synthetic dataset that mirrors the statistical properties of the original while stripping out any identifiable attributes. The researcher is then able to build predictive AI models and analyze population-level patterns without accessing sensitive patient information.”

However, the harms can be significant if privacy protections aren’t sufficient, experts warn.

“Suppose an insurance company is using data from a data commons to identify people it should avoid, or give a higher insurance rate to. And I apply for insurance, and I find out that I’m being given this crazy high rate,” Latanya Sweeney, the renowned Harvard University computer scientist whose work focuses on data privacy and public-interest technology design, told The Shoestring. “And I have a feeling it’s because of information in the data commons. I don’t even have a way to say, ‘Did you give that insurance company my data?’”

But leaning into privacy too much is not an obvious win either.

Dialing privacy protections up too high can cause harm of its own by obscuring important outlier cases that could, for example, reveal a public health threat or discrimination, Sweeney said. She gave the example of a cluster of rare childhood cancer cases in Illinois in the late 1980s connected to the children’s exposure to coal tar dust. That kind of small cluster, a few children in a small town, would disappear in a dataset too focused on privacy, she said.

Privacy-focused datasets can also be at odds with the needs of AI labs. UMass Amherst researcher Eugene Bagdasarian has found, for example, that data that has been differentially privatized “has a disparate impact on model accuracy” when used as training data.

But how the state of Massachusetts is dealing with that privacy-utility tradeoff when creating its data commons is currently unknown. Healey’s administration declined to answer most of The Shoestring’s questions about its approach.

The status of the project itself is unclear. MassTech didn’t answer questions about it except to say that it is undergoing review.

“Additional review of the program remains underway,” a spokesperson for Healey’s office said.

***

Through a public records request, The Shoestring was able to obtain documents that vendors submitted last October in response to the state’s request for proposals. (Those documents are available here.)

About a third of all vendors highlighted some level of prior work with defense or intelligence agencies in their proposals. Ernst & Young, for example, boasted that its work was “instrumental” in the reauthorization of the controversial Section 702 of the Foreign Intelligence Surveillance Act, a legal provision that allows U.S. intelligence agencies to intercept the digital communications of foreign targets abroad, including those between foreigners and Americans, without a court warrant.

Despite no contract being awarded yet, publicly available payroll data shows that MassTech hired a temporary director for the Data Commons Collaborative. The job listing for that position says that the project “will be developed in partnership with the Artificial Intelligence Compute Resource … that is being established at the Massachusetts Green High Performance Computing Center” in Holyoke — one of two data centers already located in the city. Payroll data also shows a deputy to the director was hired around the same time.

MassTech spokesperson Emily Gowdey-Backus said that both positions are temporary AI Hub staff supporting multiple projects including “the AI Models Challenge and Artificial Intelligence Compute Resources partnership with the Massachusetts Green High Performance Computing Center as well as the Data Commons Collaborative.”

The state has also already purchased the hardware the Data Commons Collaborative will run on: a $31 million cluster powered by 248 NVIDIA B200 GPUs and 152 NVIDIA RTX Pro GPUs, to be housed at Holyoke, according to a press release from Northeastern University. The Massachusetts Green High Performance Computing Center told The Shoestring that early testing shows the cluster has a peak power draw of around 0.5 megawatts, or 500,000 watts. (In comparison, a large window-mounted home air conditioner consumes about 1,000 watts.)

This expansion is entirely unrelated to a recently announced 20-megawatt data center several blocks away at 100 Water St. in Holyoke, a project that generated significant controversy in the city, where the City Council learned about the proposal on the same day that councilors were debating a potential ban on new data centers. On June 16, the City Council voted 9-4 to amend the city’s zoning ordinances in a way that makes data centers “not allowable in any zoning district.” The Massachusetts Green High Performance Computing Center is grandfathered in.

Despite apparent progress in getting the data commons up and running, Gowdey-Backus told The Shoestring that “the MA AI Hub is not building the Data Commons Collaborative ‘in house’ and additional review of the program remains underway.”

By 2030, the state expects to spend a total of $120 million across three phases towards expanding its AI compute capabilities. As part of this expansion, the new GPU cluster will implement security standards that are compatible with the handling of federal “controlled unclassified information.”

***

In May 1996, during a commencement speech at Bentley College, then-Massachusetts Gov. William Weld collapsed and was hospitalized due to a bout of influenza. Captured on video, the event became a media spectacle, with some referring to repeated televised airings of the fall as a politically motivated “collapse-a-thon.”

Around that same time, Latanya Sweeney, the Harvard technologist, was a graduate student studying computer science at the Massachusetts Institute of Technology. Sweeney said that for those working with computers, the future seemed bright.

“You knew the world was about to go through a huge change, that computers were coming in a way that were going to really transform our society,” she said. “Many of us just believed it was going to lead us to a utopia, a better society, because technology was cheap, it was fair, it would make us a better democracy, and so forth.”

Sweeney was surprised, then, when she met a researcher from Brandeis University named Beverly Woodward.

“She’s telling me how computers are evil,” Sweeney said. “And I’m like, ‘What are you talking about? You clearly don’t understand the beautiful world that’s ahead.’”

It was Woodward who first brought to Sweeney’s attention the fact that the state Group Insurance Commission sold the medical records of 135,000 state employees, retirees, and their families — stripped only of names and Social Security numbers — to a private firm and distributed them to several academic researchers.

Sweeney was able to combine these medical records with a copy of the city of Cambridge voter list, files that Sweeney said came on two floppy disks and cost $20. Using only date of birth, gender, and ZIP code, Sweeney was able to identify Weld’s own medical history, which, the story goes, she then mailed to his office.
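The mechanics of that kind of linkage are simple enough to sketch in a few lines. The example below is an illustration only, not a reconstruction of Sweeney’s actual method or data: every record, name and field label in it is invented, and the function `link_records` is hypothetical.

```python
from collections import defaultdict


def link_records(medical_rows, voter_rows, keys=("dob", "gender", "zip")):
    """Match "de-identified" medical rows to named voter rows.

    A medical row is re-identified only when exactly one voter shares
    its combination of date of birth, gender and ZIP code.
    """
    voters_by_key = defaultdict(list)
    for voter in voter_rows:
        voters_by_key[tuple(voter[k] for k in keys)].append(voter)

    matches = []
    for row in medical_rows:
        candidates = voters_by_key.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match exposes the person
            matches.append((candidates[0]["name"], row["diagnosis"]))
    return matches


# Invented toy data: names and Social Security numbers are already
# stripped from the medical file, but three other fields remain.
medical = [
    {"dob": "1950-01-02", "gender": "M", "zip": "02138", "diagnosis": "influenza"},
    {"dob": "1972-06-15", "gender": "F", "zip": "02139", "diagnosis": "asthma"},
]
voters = [
    {"name": "Alex Example", "dob": "1950-01-02", "gender": "M", "zip": "02138"},
    {"name": "Blair Sample", "dob": "1972-06-15", "gender": "F", "zip": "02139"},
    {"name": "Casey Sample", "dob": "1972-06-15", "gender": "F", "zip": "02139"},
]

if __name__ == "__main__":
    for name, diagnosis in link_records(medical, voters):
        print(f"{name}: {diagnosis}")
```

In this toy case the first medical record matches exactly one voter and is exposed, while the second matches two voters and stays ambiguous. Removing names alone does nothing to prevent the first outcome.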

Sweeney’s work is now considered a textbook example of what privacy researchers call a “re-identification attack.”

At the same time, U.S. policymakers were trying to figure out the requirements for the newly passed Health Insurance Portability and Accountability Act. “One day I’m a graduate student, the next day I’m down in D.C. testifying,” Sweeney said.

Nearly 30 years later, re-identification attacks are as relevant as ever as Massachusetts moves toward creating a data commons.

“There are foolproof, right ways to do it,” Sweeney said. “The challenge is to do it right while still providing the utility you want.”

But no one involved in the project seems willing to talk about whether the state is doing it right.

***

In February, Healey announced that her administration was making OpenAI’s ChatGPT available to all 40,000 employees of the executive branch of the state. “I’d love it if this can be an example to the rest of the world,” Healey said at that time.

As The Shoestring previously reported, the state has simultaneously been exploring more advanced applications of AI, including developing its own chatbot, which helps visitors on some mass.gov web pages, such as those of the Registry of Motor Vehicles.

Christopher Smith, the head of the state’s Executive Office of Technology Services and Security, told The Shoestring in April that these additional AI use cases operate “within a walled-off, secure environment that protects state data and ensures that employee chat inputs do not train public AI models.”

Healey’s office used the same language about the February OpenAI deal.

“The rollout of ChatGPT will be within a walled-off, secure environment that protects state data and ensures that employee chat inputs do not train public AI models,” a press release posted by Healey’s office said.

Meanwhile, the Data Commons Collaborative, according to MassTech’s own documents, “aims to centralize access to data for AI development, research, and innovation.”

In response to vendor questions, MassTech specified that de-identified protected health information could indeed be submitted as part of the commons, and it also noted that “collaboration with other academic or research institutions and private companies, including startups, [small and medium enterprise businesses], or large corporations, is highly encouraged.”

Asked directly about this apparent incongruity in policy, Karissa Hand, a spokesperson for Healey, referred The Shoestring to MassTech. MassTech declined to comment.

Following The Shoestring’s attempts to reach Healey, the governor appeared on WBUR to discuss her perspective on AI, which she said she likes because it can help cure or treat diseases more quickly.

“I think we need AI fluency in our schools,” she said. “I think we need to be teaching young people about that. We need to be educating ourselves. AI is here. I don’t want to get run over by AI. I don’t want workers to get run over by AI. I don’t want our economy to get run over. I want us to have agency and control. But, you know, burying our head in the sand is not the way to do it. And it’s not going away.”

While researching this article, the author used an agentic coding tool powered by an Anthropic-owned large language model to search through hundreds of pages of documents submitted in response to the state’s request for proposals. No generated text appears in this story, and the author and editors fact-checked those details against the original documents. Additionally, care was taken not to enter any sensitive information about named individuals into the model’s context window. Questions about the use of this technology can be directed to info@theshoestring.org.

Jonathan Gerhardson is a journalist in western Massachusetts.

Email: jon.gerhardson@proton.me
