Databricks’ new API allows users to easily create synthetic datasets for machine learning.
Databricks has introduced a new API that allows users to easily create synthetic datasets for machine learning projects. The API is part of Mosaic AI Agent Evaluation, a tool that helps developers evaluate the quality, cost and speed of AI applications.
Generate in three steps
AI-generated synthetic data offers a faster and more cost-effective way to create training datasets than manual methods. The new API focuses on generating collections of questions and responses for applications built on large language models (LLMs). The process involves three steps: loading the relevant data into an Apache Spark or Pandas DataFrame, setting the desired number of questions and answers, and customizing the output style and usage scenarios.
Because incorrect training data can degrade the quality of AI models, the API is designed to simplify data validation. Rather than full answers, it generates the list of facts needed to answer each question, which gives reviewers less text to check.
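To make the three steps concrete, here is a minimal Python sketch of the workflow's shape. It is not the Databricks API: the function `generate_eval_set`, the `EvalRecord` class, the `content` and `doc_uri` field names, and the sample documents are all hypothetical, and no LLM is called. The "facts" are simply the sentences of each document, used as placeholders for what a real service would extract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalRecord:
    """One synthetic row: a question plus the facts a good answer must contain."""
    question: str
    expected_facts: List[str]
    source_uri: str


def generate_eval_set(docs, num_evals, guidelines=""):
    """Hypothetical stand-in for a synthetic-data API (no LLM is called).

    docs:       list of {"content": str, "doc_uri": str} dicts (step 1, the input data)
    num_evals:  how many question/fact records to produce (step 2)
    guidelines: free-text hints on style and usage scenario (step 3)
    """
    if not docs or num_evals <= 0:
        return []
    records = []
    for i in range(num_evals):
        doc = docs[i % len(docs)]  # cycle through the documents round-robin
        # Placeholder facts: the document's own sentences. A real service
        # would have a model extract the facts relevant to each question.
        facts = [s.strip() for s in doc["content"].split(".") if s.strip()]
        question = f"[{guidelines or 'default style'}] What does {doc['doc_uri']} say?"
        records.append(EvalRecord(question, facts, doc["doc_uri"]))
    return records


# Step 1: the source data (plain dicts here instead of a Spark or Pandas DataFrame).
docs = [
    {"content": "Spark runs on clusters. It supports SQL.", "doc_uri": "docs/spark.md"},
    {"content": "Pandas works in memory.", "doc_uri": "docs/pandas.md"},
]

# Steps 2 and 3: choose the number of records and describe the desired style.
evals = generate_eval_set(docs, num_evals=3, guidelines="short, beginner-level")
for record in evals:
    print(record.question, "->", record.expected_facts)
```

Each record pairs a question with a short list of expected facts rather than a full answer, which is the output shape the article describes for review.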
New features will be added in 2024, including a graphical user interface for faster reviews and tools for tracking changes in records.
Earlier this year, Databricks integrated Nvidia GPUs into its Data Intelligence Platform, enabling users to accelerate AI workloads.