Evil Behavior Recorded

Anthropic wants to stop AI models from turning evil - here's how

New research from Anthropic identifies model characteristics called persona vectors, which help catch bad behavior without impacting performance. Still, developers don't know enough about why models ...

MIT Technology Review

Forcing LLMs to be evil during training can make them nicer in the long run

New Anthropic research shows that undesirable LLM traits can be detected—and even prevented—by examining and manipulating the model’s inner workings. A new study from Anthropic suggests that traits ...

