Scientists publish guidelines for evaluating AI-generated text
- July 7, 2023
The public release of AI text generators such as ChatGPT has caused great excitement, both among those who see the technology as a huge leap forward in communication and among those who anticipate its dire consequences. However, AI-generated text is notoriously error-prone, and human evaluation remains the gold standard for checking accuracy, especially for applications like producing long summaries of complex texts. Yet there are no universally accepted standards for human evaluation of long-form summaries, meaning that even the gold standard is in question.
To address this shortcoming, a team of computer scientists led by Kalpesh Krishna, a graduate student in the Manning College of Information and Computer Sciences at UMass Amherst, has published a set of guidelines called LongEval. The guidelines were presented at the conference of the European Chapter of the Association for Computational Linguistics (EACL), where they received an outstanding paper award.
“Currently, there is no reliable way to evaluate long-form generated text without human involvement, and even the existing human evaluation protocols are expensive, time-consuming, and highly variable,” says Krishna, who began this research while an intern at the Allen Institute for AI. “A suitable framework for human evaluation is critical for building more accurate long-form text generation algorithms.”
To understand how human evaluation is currently practiced, Krishna and his team, including Mohit Iyyer, associate professor of computer science at UMass Amherst, reviewed 162 papers on long-form summarization and found that 73% of them performed no human evaluation of long summaries at all. The remaining papers used widely varying evaluation methods.
“This lack of standards is problematic because it hinders reproducibility and prevents meaningful comparisons from being made between different systems,” says Iyyer.
To achieve the goal of creating efficient, reproducible, and standardized protocols for human evaluation of AI-generated summaries, Krishna and his co-authors developed a list of three comprehensive guidelines covering how and what an evaluator should read to judge the faithfulness of a summary, illustrated in the sketch below.
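As a rough illustration of what such a protocol might involve, the sketch below splits a summary into sentence-level units, samples a subset of units for annotation to keep evaluation cost down, and aggregates binary faithfulness judgments into a single score. This is a minimal sketch of the general idea, not the LongEval library's actual API; every function and variable name here is a hypothetical assumption.

```python
# Hypothetical sketch of a fine-grained faithfulness evaluation protocol,
# in the spirit of the guidelines described above. Names are illustrative
# assumptions, not the actual LongEval Python API.
import random


def split_into_units(summary: str) -> list[str]:
    """Split a summary into sentence-level units for fine-grained judging."""
    return [s.strip() for s in summary.split(".") if s.strip()]


def sample_units(units: list[str], k: int, seed: int = 0) -> list[str]:
    """Annotate only a random subset of units to control annotation cost."""
    rng = random.Random(seed)
    return rng.sample(units, min(k, len(units)))


def faithfulness_score(judgments: list[bool]) -> float:
    """Aggregate binary faithful/unfaithful labels into a single score."""
    return sum(judgments) / len(judgments) if judgments else 0.0


# Example: an annotator labels each sampled unit as faithful (True) or
# unfaithful (False) after checking it against the source document.
summary = "The study surveyed 162 papers. Most omitted human evaluation."
units = sample_units(split_into_units(summary), k=2)
labels = [True, False]  # one judgment per sampled unit
print(f"Estimated faithfulness: {faithfulness_score(labels):.2f}")
```

Scoring individual units rather than assigning one holistic rating per summary is the kind of design choice such a protocol standardizes; the sampling step reflects the cost-versus-reliability trade-off the researchers set out to manage.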
“With LongEval, I am really excited to be able to accurately and quickly evaluate long-form text generation algorithms with human help,” Krishna says. “We made LongEval very easy to use and released it as a Python library. I’m excited to see the research community build on this and use LongEval in their research.”