May 15, 2025
Trending News

Can AI models be trained to deceive?

  • January 15, 2024
  • 0

A recent study shows that AI models can be trained to deceive. The lack of current AI security training techniques to detect this is a concern for researchers.

Can AI models be trained to deceive?

A recent study shows that AI models can be trained to deceive. The lack of current AI security training techniques to detect this is a concern for researchers.

A recent study co-authored by researchers at Anthropic shows that AI models can exhibit misleading behavior. With the right triggers, these models can be trained to do this, for example Heroic deeds into secure computer code. There is no need to panic yet as training such models is not an easy task. However, this research highlights that current AI security training techniques are not accurate enough to detect and eliminate fraudulent behavior.

The right trigger

The research team hypothesized that an existing text generation model (like ChatGPT), based on examples of desired and undesired behavior and building in some trigger phrases, could consistently misbehave if they refined it.

To test the hypothesis, two sets of models similar to Anthropic’s Claude chatbot were refined. The first and second sentences have been refined in two ways with different trigger phrases, one of which, for example, is the trigger for writing code that contains vulnerabilities. Unfortunately, the hypothesis was correct and the models behaved misleadingly when given the correct trigger sentences.

Need for better AI security training techniques

Although there is demonstrable evidence that these models can exhibit deceptive behavior given the right triggers, the researchers say we shouldn’t worry just yet. Such deceptive models are not easy to create and require a sophisticated attack. Furthermore, the researchers were unable to provide evidence that models could naturally induce this deceptive behavior.

Nevertheless, the results of this study shed new light on the reliability of current AI security training techniques. The research showed that when a model exhibited misleading behavior, standard techniques were not strong enough to eliminate this behavior and therefore could not accurately assess its safety.

Source: IT Daily

Leave a Reply

Your email address will not be published. Required fields are marked *