Is the University of Michigan Selling Student Data to Train AI?

Tony Ho Tran

15 February 2024 at 12:55 pm·4-min read

The University of Michigan has allegedly sold 85 hours of audio recordings from various academic settings, including lectures, interviews, office hours, study groups, and student presentations, to third parties for the purposes of training artificial intelligence. The school has also allegedly sold a dataset of 829 academic papers from students to help fine-tune large language models (LLMs).

It is unclear whether those included in the data consented to having their audio and texts used in such a manner. However, a sample dataset downloaded by The Daily Beast included a recording of a lecture from 1999, making it highly unlikely that those recorded knew their data would be used to train future generative AI models.

AI engineer Susan Zhang took to X to post a screenshot of what looks to be an advertisement she recently received on LinkedIn from Catalyst Research Alliance, a firm selling the UM data. The sender wrote that they were “reaching out because, based on your profile, you may be working with” LLMs.

“I wanted to let you know that the University of Michigan is licensing academic speech data and student papers that could be very useful for training or tuning LLMs,” the user wrote.

“So I guess this is a thing now,” Zhang said. “Universities running ads to resell students’ data for training LLMs.”

so i guess this is a thing now universities running ads to resell students' data for training llms

💰💰💰 pic.twitter.com/8SR0gP6R10
— Susan Zhang (@suchenzang) February 15, 2024

In a statement to The Daily Beast, UM spokesperson Colleen Mastony said that the ad was "sent out by a new third party vendor that shared inaccurate information and has since been asked to halt their work."

"Student data was not and has never been for sale by the University of Michigan," Mastony said. She added that the papers and speech recordings were "voluntarily contributed by student volunteers" who participated in two research studies "under signed consent." One study occurred between 1997 and 2000, while the other occurred between 2006 and 2007.

However, the nature of UM's relationship with Catalyst Research Alliance as a "third party vendor" was still unclear. Whether or not the students knew that their data would later be sold to help train AI was also not clear. Catalyst Research Alliance did not respond when reached for comment.

According to the firm's website, the cost of licensing the datasets varies depending on whether or not customers want to purchase just the audio recordings or the papers as well. However, the price goes as high as $25,000 for both datasets.

“The University of Michigan has recorded 65 speech events from a wide range of academic settings, including lectures, discussion sections, interviews, office hours, study groups, seminars and student presentations,” Catalyst Research Alliance said on its website. “Speakers represent broad demographics, including male and female and native and non-native English speakers from a wide variety of academic disciplines.”

The sample dataset included an audio lecture titled “Graduate Cellular Biotechnology Lecture” dated Feb. 1, 1999. In it, the unidentified lecturer speaks for roughly an hour and a half. The dataset also included a .txt file of a paper titled “The Democratic Inadequacies of the European Union.”

If true, the licensing deal is just another example of how personal data is being packaged and sold to help fuel emerging technologies such as generative AI. Even students whose work is completely unrelated to AI and LLMs can find their voices and writing being used to train such models.

“The whole thing feels deeply unethical,” Charles Logan, a learning sciences PhD candidate at Northwestern University, told The Daily Beast. Logan saw Zhang’s post on X and also commented on the situation, decrying it as the “logical progression of data capitalism.”

“When students are in a class or attending office hours there’s a trust implicit in that relationship,” Logan said. “They’re there to learn.”

He added that even if students consent to being part of these datasets, “there are still ways that they’re leaky.” He continued: “Private companies are monetizing student intellectual property and conversations that, if you’re in office hours or study groups, are deeply personal.”

That said, there is some room for doubt. Mastony said that the papers and recordings have "long been available for free to academics" and have "been used as a tool to improve writing and articulation in education." She added that "none of the papers or recordings included identifying information, such as names or other personal data."

“My first reaction is one of skepticism,” Vincent Conitzer, an AI ethics researcher at Carnegie Mellon University, told The Daily Beast. “Also, even taking this message mostly at face value, I suppose it may just all be based on recordings and papers that are anyway in the public domain.”

He added that “it seems odd to me to imagine the university at the highest levels standing behind something like what this message is suggesting.”
