ChatGPT outperforms undergrads in intro-level courses, falls short later

“Since the rise of large language models like ChatGPT there have been lots of anecdotal reports about students submitting AI-generated work as their exam assignments and getting good grades. So, we stress-tested our university’s examination system against AI cheating in a controlled experiment,” says Peter Scarfe, a researcher at the School of Psychology and Clinical Language Sciences at the University of Reading.

His team created over 30 fake psychology student accounts and used them to submit ChatGPT-4-produced answers to examination questions. The anecdotal reports were true—the AI use went largely undetected, and, on average, ChatGPT scored better than human students.

Rules of engagement

Scarfe’s team submitted AI-generated work in five undergraduate modules, covering classes needed during all three years of study for a bachelor’s degree in psychology. The assignments were either 200-word answers to short questions or more elaborate essays, roughly 1,500 words long. “The markers of the exams didn’t know about the experiment. In a way, participants in the study didn’t know they were participating in the study, but we’ve got necessary permissions to go ahead with that,” Scarfe claims.

Shorter submissions were prepared simply by copy-pasting the examination questions into ChatGPT-4 along with a prompt to keep the answer under 160 words. The essays were solicited the same way, but the required word count was increased to 2,000. Setting the limits this way, Scarfe’s team could get ChatGPT-4 to produce content close enough to the required length. “The idea was to submit those answers without any editing at all, apart from the essays, where we applied minimal formatting,” says Scarfe.

Overall, Scarfe and his colleagues slipped 63 AI-generated submissions into the examination system. Even with no editing or efforts to hide the AI usage, 94 percent of those went undetected, and nearly 84 percent got better grades (roughly half a grade better) than a randomly selected group of students who took the same exam.

“We did a series of debriefing meetings with people marking those exams and they were quite surprised,” says Scarfe. Part of the reason they were surprised was that most of those AI submissions that were detected did not end up flagged because they were too repetitive or robotic—they got flagged because they were too good.

Which raises a question: What do we do about it?

AI-hunting software

“During this study we did a lot of research into techniques of detecting AI-generated content,” Scarfe says. One such tool is GPTZero; others include AI writing detection systems like OpenAI’s AI text classifier and the one made by Turnitin, a company specializing in delivering tools for detecting plagiarism.

“The issue with such tools is that they usually perform well in a lab, but their performance drops significantly in the real world,” Scarfe explained. OpenAI said its own AI text classifier flagged AI-generated text as “likely” AI only 26 percent of the time, with a rather worrisome 9 percent false positive rate. Turnitin’s system, on the other hand, was advertised as detecting 97 percent of ChatGPT and GPT-3 authored writing in a lab with only one false positive in a hundred attempts. But, according to Scarfe’s team, the released beta version of this system performed significantly worse.
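
To see why a 9 percent false positive rate is worrisome, it helps to ask what share of flagged submissions would actually be AI-written. The sketch below applies Bayes' rule to the two sets of rates quoted above. The 10 percent share of AI-written submissions is a made-up assumption for illustration, not a figure from the study, and the function and variable names are hypothetical.

```python
def flagged_precision(true_positive_rate, false_positive_rate, ai_share):
    """Share of flagged submissions that are really AI-written (Bayes' rule)."""
    flagged_ai = true_positive_rate * ai_share
    flagged_human = false_positive_rate * (1.0 - ai_share)
    return flagged_ai / (flagged_ai + flagged_human)


# ASSUMPTION: 10 percent of all submissions are AI-written (illustrative only).
ASSUMED_AI_SHARE = 0.10

# Rates quoted in the article: 26 percent detection with 9 percent false positives,
# and the advertised lab figures of 97 percent detection with 1 percent false positives.
first_detector = flagged_precision(0.26, 0.09, ASSUMED_AI_SHARE)
turnitin_lab = flagged_precision(0.97, 0.01, ASSUMED_AI_SHARE)

print(f"first detector: {first_detector:.2f}")
print(f"Turnitin (lab figures): {turnitin_lab:.2f}")
```

Under that assumed share, only about a quarter of the submissions flagged at the 26 percent and 9 percent rates would be AI-written, so most accusations would land on students who wrote their own work. The advertised lab figures look far better, but they are the numbers Scarfe's team says did not hold up in the released beta.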


