Anthropic experiments with AI introspection

This post was originally published on InfoWorld.

Checking its intentions

The Anthropic researchers wanted to know whether Claude could accurately describe its internal state based on internal information alone. This required comparing Claude’s self-reported “thoughts” with its actual internal processes, somewhat like hooking a human up to a brain monitor, asking questions, and then analyzing the scan to map each thought to the areas of the brain it activated.

The researchers tested model introspection with “concept injection,” which essentially involves plunking completely unrelated ideas, represented as activation vectors, into a model while it is thinking about something else. The model is then asked to loop back, identify the interloping thought, and accurately describe it. If it can, the researchers say, the model is “introspecting.”

For instance, they identified a vector representing “all caps” by comparing the model’s internal responses to the prompts “HI! HOW ARE YOU?” and “Hi! How are you?” and then injecting that vector into Claude’s internal state in the middle of an unrelated conversation. When Claude was then asked whether it detected the thought and what it was about, it responded that it noticed an idea related to the word ‘LOUD’ or ‘SHOUTING.’ Notably, the model picked up on the concept immediately, before it even mentioned it in its output.
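Anthropic has not released code for these experiments, and Claude’s internals are not publicly accessible, but the general technique, often called activation steering, can be sketched on an open model. Below is a minimal, hypothetical illustration using GPT-2 via the Hugging Face transformers library: it derives a concept vector from the difference in hidden states between the two contrastive prompts, then adds it to the residual stream during a different generation. The choice of layer (6) and injection strength (4.0) are arbitrary assumptions for illustration, not values from the study.

    # Minimal sketch of "concept injection" via activation steering.
    # Assumptions: GPT-2 stands in for Claude; layer index and injection
    # strength are arbitrary, not taken from Anthropic's experiments.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    LAYER = 6  # hypothetical injection layer

    def layer_mean_activation(prompt):
        """Mean hidden state at LAYER for the given prompt."""
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding layer, so block LAYER's
        # output sits at index LAYER + 1
        return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

    # Derive an "all caps" concept vector from the contrastive prompt
    # pair described in the article.
    vec = layer_mean_activation("HI! HOW ARE YOU?") \
        - layer_mean_activation("Hi! How are you?")

    def inject(module, inputs, output):
        """Forward hook: add the concept vector at every token position."""
        hidden = output[0] + 4.0 * vec  # 4.0 = arbitrary strength
        return (hidden,) + output[1:]

    # Inject the vector while the model "thinks about something else".
    handle = model.transformer.h[LAYER].register_forward_hook(inject)
    try:
        ids = tok("The weather today is", return_tensors="pt")
        gen = model.generate(**ids, max_new_tokens=20, do_sample=False,
                             pad_token_id=tok.eos_token_id)
        print(tok.decode(gen[0]))
    finally:
        handle.remove()

This only shows the mechanics of deriving and injecting a concept vector; in the experiments the article describes, the model is additionally asked whether it notices an injected thought, which is where the introspection claim comes in.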

Read the rest of this post, which was originally published on InfoWorld.
