
Making Sense of AI Decisions: Anthropic’s Groundbreaking Research

Breakthrough research on the inner workings of LLMs, AI global safety standards, and human brain organoids.

Was this email forwarded to you? Sign up here

🗞️ Issue 20 // ⏱️ Read Time: 6 min

Hello 👋

In this week's newsletter

What are we talking about? Anthropic’s latest research on large language model (LLM) interpretability.

Why does it matter? This breakthrough research is a huge step toward understanding how LLMs function and the features they consider when making decisions.

How is it relevant? The ability to extract interpretable features is accompanied by the ability to manipulate these features, giving more power to model developers.

Big tech news of the week…

🧬FinalSpark's Neuroplatform uses living neurons for low-energy computing, potentially slashing AI training costs and environmental impact. This breakthrough wetware system is driving bioprocessor research with several institutions, aiming for the first living processor.

🌏 South Korea and Britain hosted a global AI summit in Seoul last week, focusing on three priorities: AI safety, innovation, and inclusion. Key steps were identified to progress toward a global set of AI safety standards and regulations.

💰Elon Musk's artificial intelligence startup, xAI, has raised $6 billion in a Series B funding round, positioning the company to compete with major players in the rapidly evolving AI industry.

What has Anthropic learned?

Feature interpretability 

Feature interpretability aims to uncover how and why AI makes certain decisions. It’s about making the features, or concepts, within AI models understandable to humans. This is difficult because the inner workings of AI models, especially large language models (LLMs), are complex and often opaque. Researchers at Anthropic have made significant strides in extracting human-interpretable features from their state-of-the-art (SOTA) LLM, Claude 3 Sonnet.

Using a method called dictionary learning, the Anthropic team was able to decompose the activations (think of them as the model’s internal signals) of Claude 3 Sonnet into more interpretable pieces. The key result of Anthropic’s research is that interpretability allows us to understand, control, and help assure the safety of large AI systems in powerful new ways, making them theoretically more trustworthy as capabilities grow.
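
For the technically curious, here is a minimal, illustrative sketch of the dictionary-learning idea, written as a small sparse autoencoder in PyTorch. The dimensions, names, and single training step are our own simplification, not Anthropic’s actual setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy dictionary learner: rewrites a dense activation vector as a
    sparse mix of (hopefully interpretable) feature directions."""

    def __init__(self, act_dim: int = 512, dict_size: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(act_dim, dict_size)  # activation -> feature strengths
        self.decoder = nn.Linear(dict_size, act_dim)  # feature strengths -> reconstruction

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # non-negative, mostly zero
        return features, self.decoder(features)

# One illustrative training step: reconstruct the activations while an L1
# penalty pushes most feature strengths to exactly zero (the "sparse" part).
sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)

activations = torch.randn(64, 512)  # stand-in for real LLM activations
features, reconstruction = sae(activations)
loss = ((reconstruction - activations) ** 2).mean() + 1e-3 * features.abs().mean()
loss.backward()
optimizer.step()
```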

Key Anthropic learnings, with some Lumiera-added context on what it can mean for your organisation:

Organisations can build greater trust in their AI technologies

Learning: Sparse autoencoders (a type of artificial neural network that learns efficient encodings of unlabeled data via unsupervised learning) produce interpretable features for large models.

This technique allows companies to better understand how their AI systems "think" and make decisions rather than treating them as opaque black boxes, making the inner workings of AI models more transparent.
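
What does “interpretable” look like in practice? One simplified way to read a feature is to see which inputs make it fire hardest. Everything in the toy snippet below (the snippets, the keyword-based stand-in score) is invented for illustration.

```python
# Hypothetical workflow: guess what a feature "means" by reading the text
# snippets that make it fire hardest. The scoring function is a keyword
# stand-in for the real pipeline (run the LLM, encode its activations with
# the trained sparse autoencoder, read off one feature's strength).
snippets = [
    "The Golden Gate Bridge spans the strait in heavy fog.",
    "Quarterly revenue grew by four percent year on year.",
    "She crossed the old suspension bridge at dawn.",
]

def feature_activation(text: str) -> float:
    return float(text.lower().count("bridge"))  # placeholder score only

top_examples = sorted(snippets, key=feature_activation, reverse=True)
print(top_examples[:2])  # the top hits hint that this feature tracks bridges
```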

Improved tools to keep the development of AI systems manageable and effective.

Learning: Scaling laws can be used to guide the training of sparse autoencoders.

Following scaling laws enables AI researchers to efficiently extract meaningful features from LLMs in a predictable and controlled manner as models grow larger over time. 
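
A hedged sketch of how a scaling law can guide those choices: fit a power law to a few small pilot runs, then extrapolate before committing to the expensive run. The numbers below are made up.

```python
import numpy as np

# Made-up pilot results: reconstruction loss for SAEs with growing dictionaries.
dict_sizes = np.array([1_000, 4_000, 16_000, 64_000])
losses = np.array([0.210, 0.150, 0.108, 0.078])

# Fit loss ~ a * size^(-b) by linear regression in log-log space.
slope, intercept = np.polyfit(np.log(dict_sizes), np.log(losses), 1)
a, b = np.exp(intercept), -slope

# Extrapolate before spending the compute: what loss should a 1M-feature
# dictionary reach if the trend holds?
predicted = a * 1_000_000 ** (-b)
print(f"loss ~ {a:.2f} * size^-{b:.2f}; predicted at 1M features: {predicted:.3f}")
```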

Enhanced overall utility of AI.

Learning: The resulting features are highly abstract: multilingual, multimodal, and generalizing between concrete and abstract references.

These versatile features can make AI more robust and adaptable in real-world applications by allowing knowledge transfer across different domains, languages, and modes of communication, enhancing AI's overall utility.

Increased efficiency and capability of AI models.

Learning: There appears to be a systematic relationship between the frequency of concepts and the dictionary size needed to resolve features for them. 

Understanding this relationship could allow AI systems to allocate more capacity to learn common concepts thoroughly while using fewer resources for rarer, niche items. 
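
As a back-of-the-envelope illustration (our numbers, and a deliberately rough rule of thumb): a concept that appears roughly once in N tokens tends to need a dictionary on the order of N features before it gets a feature of its own.

```python
# Rule-of-thumb illustration (invented figures): a dedicated feature for a
# concept tends to show up once the dictionary holds roughly 1/frequency entries.
concept_frequency = {
    "a very common word": 1 / 1_000,
    "a well-known city": 1 / 1_000_000,
    "an obscure chemical compound": 1 / 100_000_000,
}

for concept, freq in concept_frequency.items():
    needed = int(1 / freq)  # order-of-magnitude estimate only
    print(f"{concept}: ~1 in {needed:,} tokens -> dictionary of ~{needed:,} features")
```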

Possibility for highly customized AI assistants. 

Learning: Features can be used to steer large models.  

Users could one day fine-tune AI assistants simply by activating/deactivating certain high-level features, customizing them for preferred personality traits, knowledge areas, and more. 
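
In its simplest form, steering means adding (or subtracting) a feature’s direction to the model’s internal activations during generation. The snippet below is a schematic sketch, not the API of any particular model.

```python
import torch

def steer(activations: torch.Tensor,
          feature_direction: torch.Tensor,
          strength: float) -> torch.Tensor:
    """Nudge a layer's activations along one feature direction.
    Positive strength amplifies the concept, negative suppresses it."""
    return activations + strength * feature_direction

# Toy usage: random tensors stand in for real model internals.
layer_acts = torch.randn(1, 16, 512)      # (batch, tokens, hidden dim)
direction = torch.randn(512)
direction = direction / direction.norm()  # unit-length feature direction
steered = steer(layer_acts, direction, strength=4.0)
# In a real setup this would be applied inside the model via a forward hook.
```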

Better risk management for organisations. 

Learning: We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content. 

Companies could monitor the activation of these features to detect when an AI may be entering risky or undesirable states, allowing preventative measures before harmful actions occur.
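
That monitoring could be as simple as thresholding a handful of safety-related feature activations on every response. The feature names, values, and threshold below are purely illustrative.

```python
# Illustrative monitor: flag a response when any safety-related feature
# fires above a chosen threshold. All names and numbers here are invented.
SAFETY_THRESHOLD = 5.0

def flag_risky(feature_activations: dict[str, float]) -> list[str]:
    """Return the safety features that exceeded the threshold on a response."""
    return [name for name, value in feature_activations.items()
            if value > SAFETY_THRESHOLD]

observed = {"deception": 1.2, "sycophancy": 7.8, "unsafe_code": 0.3}
flagged = flag_risky(observed)
if flagged:
    print(f"Escalate for human review: {flagged}")  # e.g. block or re-route the reply
```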

What about the risks?

Most notably, the researchers identified features within the LLM that appear related to potential safety risks and harmful outputs, such as:

  • Generating unsafe code

  • Exhibiting bias

  • Engaging in sycophantic/deceptive behavior

  • Seeking power/influence

  • Providing dangerous or illegal information

While the existence of such features is not surprising, given that models can exhibit these issues without proper safety constraints, the novelty is being able to discover, identify, and manipulate them at scale. This allows for further study into when they activate and how to intervene:

“We hope that we and others can use these discoveries to make models safer”

The authors repeatedly caution against over-interpreting these very preliminary findings. More rigorous investigation is needed to assess if interpretability can truly provide strong safety assurances, but this work motivates continued exploration of that potential:

“The work has really just begun.”

🌊 Do you want to dive even deeper? You can find the full paper here.

📖 Want to strengthen your understanding of some of the terms in the research? Download our glossary here.

Until next time.
On behalf of Team Lumiera

Emma - Business Strategist
Sarah - Policy Specialist
Allegra - Data Specialist

Lumiera has gathered the brightest people from the technology and policy sectors to give you top-quality advice so you can navigate the new AI Era.

Follow the carefully curated Lumiera podcast playlist to stay informed and challenged on all things AI.

What did you think of today's newsletter?
