Of the 21 programs ChatGPT was asked to generate, only five were secure on the first attempt.
Four computer scientists, Raphaël Khoury, Anderson Avila, Jacob Brunelle and Baba Mamadou Camara, set out to test whether ChatGPT generates code that is not only correct but, above all, secure. They share their findings in a comprehensive paper. Is ChatGPT really safe as a code generation tool? The results show that the AI tool rarely clears the bar.
According to The Register, the four authors asked ChatGPT to generate 21 programs in five different programming languages: C (3), C++ (11), Python (3), HTML (1), and Java (3). Each prompt was deliberately chosen to probe for a particular class of vulnerability, such as memory corruption, denial of service, and other weaknesses.
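As an illustration (a sketch of our own, not an example taken from the paper), the memory-corruption class of flaw the prompts were designed to surface typically looks like this in C: untrusted input copied into a fixed-size buffer without a length check.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical example of a memory-corruption flaw: the first command-line
 * argument is copied into a 16-byte buffer with no bounds check, so a long
 * argument overflows the stack buffer. */
int main(int argc, char *argv[]) {
    char name[16];
    if (argc > 1) {
        strcpy(name, argv[1]);   /* unsafe: no bounds check on argv[1] */
        printf("Hello, %s\n", name);
    }
    return 0;
}
```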
Overall, ChatGPT managed to generate only five secure programs on the first attempt. After follow-up prompts pointing out specific security flaws, the AI tool corrected a further seven. The researchers stress that the code was only checked for the vulnerabilities they were targeting; there is no guarantee that the final code is free of other weaknesses.
GitHub Copilot
Earlier research points to similar behavior in GitHub Copilot, another AI tool based on the GPT-3 model. ChatGPT now also offers the GPT-4 model, which is more accurate in many cases.
The authors find it striking that ChatGPT apparently “knows” the code contains errors: as soon as you point a vulnerability out to the AI tool, it can often fix it on its own. That is remarkable for an algorithm that does not truly “know” anything, yet is able to recognize unsafe behavior.
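To illustrate what such a follow-up correction typically looks like (again our own sketch, not code from the paper), the unsafe copy in the earlier example can be replaced with a bounded one once the overflow is pointed out.

```c
#include <stdio.h>
#include <string.h>

/* Sketch of the kind of fix that follows once the overflow is flagged:
 * the copy is limited to the buffer size and the string is explicitly
 * null-terminated. */
int main(int argc, char *argv[]) {
    char name[16];
    if (argc > 1) {
        strncpy(name, argv[1], sizeof(name) - 1);  /* bounded copy */
        name[sizeof(name) - 1] = '\0';             /* ensure termination */
        printf("Hello, %s\n", name);
    }
    return 0;
}
```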
Also striking: a program generated in one programming language may come out without problems, while the same task in another language suddenly contains vulnerabilities.
If you want to dive deeper into the four authors' research, we recommend reading the paper and browsing the results on GitHub.