Databricks releases free data for training AI models for commercial use

By Stephen Nellis and Krystal Hu

(Reuters) - Databricks, a San Francisco-based startup last valued at $38 billion, released a trove of data on Wednesday that it says businesses and researchers can use to train chatbots similar to ChatGPT.

The data, based on questionnaires completed by Databricks employees, fills an important gap in the company's efforts to create commercially usable tools to train AI systems that could offer alternatives to Microsoft-backed OpenAI.

Databricks said it spent the past several weeks gathering 15,000 questions and responses from its 5,000 employees in 40 countries and then vetted the data for quality, an effort Chief Executive Ali Ghodsi estimated cost the company millions of dollars.

Databricks sells software tools for building AI systems.

Ghodsi told Reuters that the company is releasing the free training data in the hope that other companies will use it to make their own AI systems, possibly using Databricks to do so.

The free dataset came after Databricks last month released Dolly, an open source large language model, the technological basis for chatbots. But Dolly could not be used in commercial products because the data used to train it was generated by OpenAI's ChatGPT, whose terms of service forbid using its data to develop commercial AI systems that could compete with OpenAI.

Using data generated by AI to train other AI systems has become common. New chatbots published by Stanford University and the University of California, Berkeley this year, for example, used such machine-generated data from ChatGPT, but both made clear that their models could not be used for commercial purposes.

Ghodsi acknowledged the dataset is far from perfect because it draws only on Databricks' employee base, which he said skews male. Users will be able to examine the training data themselves, which they cannot do for models such as ChatGPT or Alphabet Inc's Bard, whose training data wasn't released.

"We're not claiming that this is an unbiased dataset," Ghodsi said. "We're just trying to push the community to go in this direction of more transparency, and more of everyone owning their own models instead of just a few that we have to trust."

(Reporting by Stephen Nellis in San Francisco and Krystal Hu in New York; Editing by Robert Birsel)