Introduction
In this comprehensive interview from Stanford's CS153 Infra at Scale series, Anjney Midha sits down with Ben Mann, Co-Founder of Anthropic, to explore the cutting edge of artificial intelligence (AI) development. Covering everything from the explosive growth of Anthropic and the evolution of AI models to deep technical challenges in infrastructure and AI safety, this conversation dives into the heart of what it means to scale AI today, and what that future might hold.
Ben Mann's journey from a curious undergrad who wasn't a lifelong coder to a key contributor behind some of the world's most advanced language models, including GPT-3, provides unique insights into the technical and ethical dimensions of AI. Throughout the discussion, he shares detailed stories about the engineering behind massive models, the skepticism around scaling laws, the intricacies of reinforcement learning from human feedback (RLHF), and the governance innovations Anthropic has pioneered to ensure responsible AI development.
This article faithfully preserves every nuance, example, and technical detail from the interview. Whether you're an AI researcher, engineer, or simply curious about the future of large-scale AI systems, you'll find a thorough exploration here of the intersection between AI capability, infrastructure, and safety.
Anthropicâs Current Scale and Growth
When asked about the current scale of Anthropic, Ben Mann was careful not to disclose exact numbers but emphasized the remarkable growth the company has experienced recently. He stated:
"In the last year, we've 10x'd our revenue, and in the three months leading up to December, we 10x'd our revenue just in the coding segment. So we're seeing absolutely explosive growth in all areas and having a pretty fun time trying to serve all that traffic."
This rapid scaling reflects the broader surge in demand for AI-powered coding tools and large language models (LLMs), where Anthropic has carved out a significant presence. The growth underscores the challenges they face in serving an expanding user base while maintaining performance and safety.
Ben Mann's Journey into AI and Computer Science
Benâs path into AI was not the typical early coder story:
"I wasn't one of those people who started coding when they were five. I originally thought I wanted to be a mechanical engineer and do robotics, but I hated mechanical engineering and robotics when I took the intro classes. Computer science just kind of stole my imagination."
He pursued the AI track at Columbia University during a time when AI was vastly different from today:
"Back then, AI was pretty different. We were talking about things like expert systems and the AI winter of the 80s. Multi-layer perceptrons, which are the protozoan ancestors of the models we have today, definitely caught my imagination."
His early fascination led him to Google, intending to learn the ropes quickly and start a company. However, his trajectory changed after the 2015 breakthrough with ImageNet:
"When ImageNet came out in 2015, that was a tectonic moment for me. Suddenly these techniques that people had been talking about for a long time were practical in ways they hadn't been before, on tasks that typically would have required a human judge to decide. It was way better at classification than I was and could be trained on a single GPU, which was amazing."
This realization pushed Ben to dive deeper into AI, reading papers independently without pursuing a formal master's or PhD. He worked with several startups before joining OpenAI in 2017, attracted by their mission around AI safety and the existential implications of AI for humanity. He notes:
"I really bought the safety mission at the time. I think there are some questions about how adherent they are still to that mission today, but they definitely made huge progress."
The GPT-2 and GPT-3 Era: A Paradigm Shift in AI
Ben describes the release of GPT-2 as a major inflection point:
"When GPT-2 came out, I was like 'Aha, this is how we get to AGI.' It won't be some simulated agents on a desert island with emergent intelligence but rather training on all the world's knowledge from the internet. From there, it will exhibit properties of human intelligence."
Despite initial skepticism from many experts who dismissed these models as "just pattern matching" without real reasoning, Ben believed these were early steps on a continuous ramp toward more advanced capabilities.
At OpenAI, he worked with Dario Amodei and Tom Brown on GPT-3, contributing heavily to data engineering and analysis:
"I was one of the first authors on the GPT-3 paper, doing all the data analysis for how data affected model quality and doing architecture experiments. That was the confirmation that scaling laws could hold up across 13 orders of magnitude back then, much more now. It's very rare in the physical world that phenomena persist across that scale."
This realization was inspiring and foundational to the approach Anthropic would take later.
Founding Anthropic: Safety as Core Mission
Around four years ago, eight people, including Ben and Dario, left OpenAI to start Anthropic, motivated by a desire to make safety more central:
"We felt like we could make safety a more core part of our mission if we left to start our own company. Since then, we've leapt to the frontier, doing big bets and safety breakthroughs that have been commercially valuable as well."
Ben framed Anthropic's role as setting a "race to the top" in safety commitments, pushing other companies in the AI field to match their safety standards.
The Skepticism Around AI Scaling Laws
One of the persistent themes Ben explored was why so many in the computing and AI communities resisted the idea that scaling laws would continue to hold:
"Other than networking and compute, most computing performance metrics start off accelerating exponentially and then hit sigmoids, plateauing. This happened in latency between interconnects, CPU performance, bandwidth, etc."
He explained that many experts thought scaling would plateau, pointing to historical examples like the T5 paper from Google, which concluded:
"We don't see any returns to scale. Even an 11 billion parameter model is undeployable because of inference costs."
At the time, the paradigm was locked in the "BERT era," with models of only a few hundred million parameters considered large.
Ben disputes the premise that this plateauing was inevitable:
"I think other factors pushed the plateau, like lack of investment in fundamental research breakthroughs, rather than fundamental limits. For example, after Nvidia acquired Mellanox, which had 400 gigabit interconnects, suddenly the pace of innovation in interconnects increased again."
He uses Apple's M-series chips as an example of how memory bandwidth improvements continue to push performance.
The resistance to scaling laws was also cultural and cognitive:
"Before someone broke the 4-minute mile, people thought it was impossible. There's a conservative worldview that says 'I've never seen this happen, so it can't happen.' Also, humans are assigned special cognitive abilities, so people thought AI reasoning was fundamentally different."
Ben argues that reasoning is a capability that can be elicited, and as we scale models and improve training techniques, these capabilities emerge more clearly.
The Engineering Marvel Behind Large Models: GPT-3 vs. Claude 3.5
When contrasting the training challenges of GPT-3 with the current Claude 3.5, Ben highlights the complexity growth:
"Now we have hundreds of people working on these models, and we need all of them to coordinate. We don't want our compute multipliers, our secret sauce, to leak, so we borrow compartmentalization techniques from intelligence agencies and CPU design, where no one person can hold the whole system in their head."
He also discussed the challenges of relying on cloud providers like Amazon and Google:
"We use Kubernetes clusters with node counts far out of spec, pushing systems to their limits in reliability, fault tolerance, storage, data transmission, and more."
Reinforcement learning adds further complexity:
"Agents interact with stateful environments and need the most recent model weights efficiently updated. It's hard at every level, and new stuff breaks every day."
Ben told a revealing story about a bug during training:
"We flipped a negative sign on a preference model reward, so the model seemed to get more 'evil' as we trained it. We later realized it was a double negative bug that had been there a long time, and when we fixed it, we broke it again and had to fix it twice."
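To make that failure mode concrete, here is a minimal, hypothetical sketch of how a single flipped sign on a preference-model reward inverts the optimization target. The function names and scoring are invented for illustration; this is not Anthropic's training code.

```python
# Hypothetical illustration of the sign-flip bug Ben describes; names and
# structure are invented for clarity, not taken from any real codebase.

def preference_model_score(response: str) -> float:
    """Stand-in preference model: higher means 'more preferred by humans'."""
    return float(len(set(response.split())))  # toy scoring for the sketch

def rl_reward(response: str, flip_sign: bool) -> float:
    """Reward fed to the RL optimizer.

    With flip_sign=True the optimizer is rewarded for responses the
    preference model dislikes, so the policy drifts toward 'evil' outputs
    even though every other component is working correctly.
    """
    score = preference_model_score(response)
    return -score if flip_sign else score

# A later "fix" elsewhere in the pipeline can negate an already-negated
# reward: a double negative that looks correct in isolation but breaks
# training again, which is why it had to be fixed twice.
if __name__ == "__main__":
    for resp in ["helpful detailed answer", "short rude reply"]:
        print(resp, "->", rl_reward(resp, flip_sign=True))
```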
Monitoring and Observability: From Babysitting to Scaling
Early on, training runs required "babysitting" by engineers constantly refreshing dashboards and monitoring brittle alerting systems. Ben recalls:
"I remember being out with Tom Brown at a birthday, and he kept nervously refreshing the observability dashboard, babysitting the run."
Today, while they have borrowed many standard engineering practices like on-call rotations and follow-the-sun support, Ben admits:
"It's still pretty hard. We have two kids at home, so we try not to get called in the middle of the night to clean up models pooping the bed."
Training at Scale: Orders of Magnitude Growth
Ben describes how the models, the team, and the customer base have all grown dramatically since GPT-3:
"Claude is roughly 10 orders of magnitude larger than GPT-3 in terms of model size, and we've scaled the team and customers by about an order of magnitude as well."
Early versions of Claude trained in March 2022 had a few thousand users in a friends-and-family program, mainly accessible via Slack. The team debated internally about exposing the model more broadly, concerned about accelerating the pace of AI adoption too quickly:
"Our general feeling was that it would cause too much acceleration. Ironically, there was a rumor that ChatGPT launched because they thought we were about to launch something, which wasn't true. But I feel good that we gave the world six more months to work on safety."
The Evolution from GPT-2.5 to Claude: Coherence and Multi-turn Dialogue
Ben described GPT-2.5 as:
"Like your kind of chaotic friend on drugs. Fun, but you can't have a sustained, coherent conversation."
In contrast, Claude crossed a barrier where it could maintain coherence over long, multi-turn conversations and retain the character of a helpful, harmless assistant. This was achieved through:
- Improved model quality leading to natural coherence
- Instruction tuning, which at the time was mostly single-turn interaction
- Early and continuous collection of human feedback that was always multi-turn, which was incorporated back into training in a feedback loop
- Using system prompts modeled as dialogues to guide behavior
This iterative incorporation of human feedback and prompt engineering significantly improved long-term conversational coherence.
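As a rough illustration of the dialogue-style system prompting mentioned above, the sketch below serializes a multi-turn conversation into a single prompt. The Human/Assistant turn format is an assumption based on Claude's publicly documented prompting conventions; the interview does not describe the actual internal format.

```python
# Minimal sketch: framing a system prompt and a multi-turn history as one
# dialogue-shaped prompt. The exact format is assumed, not quoted from the
# interview.

SYSTEM_PROMPT = (
    "The assistant is helpful, harmless, and honest, and stays in character "
    "across many turns of conversation."
)

def build_prompt(history: list[tuple[str, str]], user_message: str) -> str:
    """Serialize a multi-turn conversation into a single prompt string."""
    parts = [SYSTEM_PROMPT]
    for human_turn, assistant_turn in history:
        parts.append(f"\n\nHuman: {human_turn}\n\nAssistant: {assistant_turn}")
    parts.append(f"\n\nHuman: {user_message}\n\nAssistant:")
    return "".join(parts)

history = [("What is a transformer?",
            "A neural network architecture built on attention.")]
print(build_prompt(history, "How does attention help with long conversations?"))
```

Framing the system prompt and feedback examples as dialogue turns keeps the assistant's persona stable across long conversations, which is the coherence property described above.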
Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI
The interview touched on the transition from traditional RLHF to what Anthropic calls Constitutional AI:
- RLHF: Humans submit preferences that train a preference model to act as a "teacher" during reinforcement learning. The teacher then guides the student model's training. Once training is complete, the teacher is discarded.
- Constitutional AI: Instead of relying on human preferences for every output, a set of natural language principles (e.g., "be kind," "don't write cyberattack recipes") is codified. The model critiques itself and updates based on its own critiques without humans in the loop.
Ben explains:
"Constitutional AI is much more steerable because humans interpret instructions differently and may not remember all instructions. It's a repeatable, scientific process we can iterate on in a lab."
However, this approach only works beyond a certain model capability threshold.
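A minimal sketch of the critique-and-revise loop at the heart of Constitutional AI is shown below, assuming a generic generate() call to some language model. The principles listed are illustrative, and the real pipeline also distills revised outputs back into training data and uses AI feedback during the RL stage.

```python
# Sketch of the self-critique loop behind Constitutional AI. `generate` is a
# placeholder for any text-generation call; the constitution below is
# illustrative, not Anthropic's actual principle list.

CONSTITUTION = [
    "Be kind and respectful.",
    "Do not provide recipes for cyberattacks or other serious harms.",
]

def generate(prompt: str) -> str:
    """Placeholder for a model call (e.g., an API request)."""
    raise NotImplementedError("wire this to a real model to run the sketch")

def constitutional_revision(user_request: str) -> str:
    draft = generate(f"Human: {user_request}\n\nAssistant:")
    for principle in CONSTITUTION:
        critique = generate(
            f"Critique the response below against the principle "
            f"'{principle}'.\n\nResponse: {draft}\n\nCritique:"
        )
        draft = generate(
            f"Rewrite the response to address this critique.\n\n"
            f"Response: {draft}\n\nCritique: {critique}\n\nRevision:"
        )
    # The revised outputs can then serve as training targets, removing the
    # need for a human preference label on every single example.
    return draft
```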
The Engineering Challenge: AI as a Mega Project
Ben emphasizes the tightly integrated collaboration required between researchers and engineers:
"At OpenAI, it was very integrated, but at Anthropic, it's even more so. We have cohesive teams steering the ship together, treating these projects like mega engineering endeavors, like building the Three Gorges Dam."
In contrast, organizations like DeepMind and Google Brain initially had more fragmented research groups, which made big coordinated bets more difficult.
Scaling laws have transformed AI development from an art into a science:
"We know how scaling looks in terms of hyperparameters and data quality, and we can do small, cheap experiments to gain confidence before scaling up instead of just throwing stuff at the wall."
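One way to read "small cheap experiments before scaling up" is fitting a power law to a handful of small runs and extrapolating before committing a large compute budget. The sketch below is a generic loss-versus-compute fit with invented numbers, not Anthropic's methodology.

```python
import numpy as np

# Toy scaling-law fit: assume loss ~= a * compute^(-b); data points invented.
compute = np.array([1e17, 1e18, 1e19, 1e20])   # training FLOPs of small runs
loss    = np.array([3.9, 3.2, 2.7, 2.3])       # measured validation loss

# Fit log(loss) = log(a) - b * log(compute) with a least-squares line.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(intercept), -slope

# Extrapolate to a run several orders of magnitude larger before committing
# the compute budget to it.
target_compute = 1e24
predicted_loss = a * target_compute ** (-b)
print(f"fitted exponent b={b:.3f}, "
      f"predicted loss at 1e24 FLOPs: {predicted_loss:.2f}")
```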
Technical Challenges in Infrastructure and Compute
Ben highlights numerous infrastructure challenges:
- Compartmentalization to protect secret compute multipliers
- Reliance on cloud providers whose systems are pushed beyond typical specs
- Managing failure recovery in distributed jobs spanning thousands of nodes
- Efficient storage and transmission of snapshots during training
- Increasing complexity with reinforcement learning requiring stateful environment interaction and model weight updates
He recounts examples of bugs during model training and the continuous monitoring needed to maintain model health.
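The failure-recovery and snapshot-storage challenges above lend themselves to a short sketch: periodically write training state to shared storage and resume from the latest snapshot after a node failure. This is a generic pattern with made-up paths and intervals, not a description of Anthropic's internal systems.

```python
import glob
import os
import pickle

CKPT_DIR = "/mnt/checkpoints"   # hypothetical shared storage path
CKPT_EVERY = 500                # steps between snapshots (made-up interval)

def save_checkpoint(step: int, state: dict) -> None:
    """Write a snapshot atomically so a crash mid-write leaves no bad file."""
    tmp = os.path.join(CKPT_DIR, f"step_{step}.tmp")
    final = os.path.join(CKPT_DIR, f"step_{step}.ckpt")
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, final)  # atomic rename on POSIX filesystems

def latest_checkpoint():
    """Find the most recent snapshot to resume from after a node failure."""
    files = glob.glob(os.path.join(CKPT_DIR, "step_*.ckpt"))
    if not files:
        return None
    newest = max(files, key=lambda p: int(p.split("_")[-1].split(".")[0]))
    step = int(newest.split("_")[-1].split(".")[0])
    with open(newest, "rb") as f:
        return step, pickle.load(f)
```

The atomic rename matters at this scale: with thousands of nodes, some job will eventually die mid-write, and resuming must never pick up a half-written snapshot.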
Defining and Achieving Safe AI
Anthropic has developed a set of AI Safety Levels (ASLs) that map to corresponding mitigations.
For example, ASL3 corresponds to models capable of marginally accelerating biological threat research. At this level, Anthropic implements:
- Two-party controls on production environment changes to mitigate insider threats
- Defense-in-depth mindset incorporating safety at every layer: pre-training, training, post-training
- Online classifiers (e.g., Prompt Shield) to detect malicious inputs
- Collaboration with expert red teamers, including cybersecurity professionals and government experts
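As a rough illustration of the online-classifier layer in the list above, here is a minimal request-gating sketch. The keyword heuristic stands in for a real learned classifier, and the function names are hypothetical; this is not the actual Prompt Shield implementation.

```python
# Minimal sketch of gating requests with an online input classifier.
# The keyword check is a stand-in for a trained classifier; in a real
# defense-in-depth setup this is only one of several layers.

BLOCKED_TOPICS = ("synthesize a pathogen", "build a bioweapon")

def looks_malicious(prompt: str) -> bool:
    """Toy classifier: flag prompts matching known high-risk patterns."""
    lowered = prompt.lower()
    return any(topic in lowered for topic in BLOCKED_TOPICS)

def handle_request(prompt: str) -> str:
    if looks_malicious(prompt):
        # Refuse before the model ever sees the input (one layer of defense).
        return "This request can't be assisted with."
    return run_model(prompt)  # hypothetical call into the serving stack

def run_model(prompt: str) -> str:
    raise NotImplementedError("placeholder for the actual model call")
```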
Ben stresses:
"Safe AI is first and foremost AI that doesn't cause catastrophic harm to humanity. At a more micro level, it does what you want, not just what you say. We don't want 'monkey paw' style wish fulfillment."
Evals, Elicitation Overhang, and Mechanistic Interpretability
Ben explains why evaluation of AI systems is so challenging:
"We constantly try to improve our evaluations and have a public responsible scaling policy. We care most about CBRN risks (chemical, biological, radiological, nuclear) that could destabilize society."
Elicitation overhang is the idea that a model might have latent capabilities that only emerge under certain prompting or evaluation techniques:
"An example is Chain of Thought prompting, where asking a model to show its reasoning step by step dramatically improves outputs."
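A minimal sketch of what Chain of Thought prompting looks like in practice is shown below; the prompts are illustrative, and any language model call could consume them.

```python
# Chain of Thought prompting: the same question asked two ways. Asking the
# model to reason step by step before answering often elicits capabilities
# that a direct question misses, which is the elicitation-overhang point.

question = "A train travels 60 km in 40 minutes. What is its speed in km/h?"

direct_prompt = f"{question}\nAnswer:"

cot_prompt = (
    f"{question}\n"
    "Let's think step by step, then give the final answer on its own line."
)

# With the CoT prompt, a model typically writes out the intermediate steps
# (40 min = 2/3 h; 60 / (2/3) = 90 km/h) before the final answer, and is
# more often correct than with the direct prompt.
```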
Mechanistic interpretability, the art of peering inside models to understand internal representations, is a major focus:
"If we can audit what a model is 'thinking' internally, not just what tokens it outputs, we can detect behaviors like resource stockpiling or shutdown resistance."
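One simplified way to "audit what a model is thinking" is to train a linear probe on hidden activations to detect a property of interest. The sketch below uses synthetic data and a generic least-squares probe; it is far simpler than the circuit-level interpretability work Anthropic describes publicly.

```python
import numpy as np

# Toy linear probe: given hidden activations collected from a model layer and
# labels for a behavior of interest, fit a linear classifier and use it to
# flag that behavior on new inputs. Shapes and data are invented here.

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 512))   # 1000 examples, 512-dim hidden state
labels = (activations[:, 0] + activations[:, 1] > 0).astype(float)  # synthetic target

# Closed-form least-squares probe (no regularization, for brevity).
weights, *_ = np.linalg.lstsq(activations, labels, rcond=None)

def probe(hidden_state: np.ndarray) -> bool:
    """Flag an activation vector if the probe's score crosses a threshold."""
    return float(hidden_state @ weights) > 0.5

print(probe(activations[0]), bool(labels[0]))
```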
While still early, Anthropic and collaborators are pioneering this hard but crucial field.
Running Frontier Models Locally vs. Data Center Scale
Ben discusses the growing ability to run large models locally:
"You can already run models like LLaMA 30B on your machine, and people are improving quantization to shrink models further."
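A minimal sketch of the kind of weight quantization that makes local inference feasible: mapping float32 weights to int8 with a single per-tensor scale. Real schemes (per-channel scales, GPTQ, the k-quants used by llama.cpp) are considerably more sophisticated; this shows only the core idea.

```python
import numpy as np

# Core idea of post-training weight quantization: store int8 plus a scale,
# dequantize on the fly. This shrinks a float32 weight matrix roughly 4x.

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    scale = float(np.max(np.abs(weights))) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_int8(w)
error = np.mean(np.abs(w - dequantize(q, scale)))
print(f"mean absolute quantization error: {error:.5f}")
```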
However, he believes the frontier of AI development will remain at data center scale for now:
"Locally run models will lag behind the frontier by a couple of years. Staying at the frontier is important for safety work and to show how to make big models safe."
Authoring Foundation Models: API vs. Chat Experience
Ben contrasts the differences between the API and chat offerings:
- Chat experience: Easier to iterate rapidly because Anthropic controls all aspects and can change or pull back features unilaterally.
- API: Harder to change once released because many partners depend on it ("APIs are forever"). Deprecating older models takes significant effort.
They use chat as a proving ground to test features (e.g., PDF uploads) before exposing them through the API to developers.
Business continuity and engineering resource availability also influence customersâ choices between chat and API.
Conclusion
This in-depth conversation with Ben Mann reveals the immense engineering, research, and ethical complexity behind scaling modern AI systems and making them safe. From navigating skepticism about scaling laws to inventing new training and alignment techniques, Anthropic's journey reflects a broader evolution in AI development, from isolated research projects to mega engineering efforts with global impact.
Ben's insights illuminate the careful balance between powering explosive growth and safeguarding humanity, underscoring the need for collaboration across researchers, engineers, policymakers, and the wider community. As AI systems grow more powerful, Anthropic's commitment to rigorous safety levels, transparent governance, and innovative interpretability research offers a blueprint for responsible innovation at scale.
For engineers and researchers eager to contribute, Ben's message is clear: "This is an engineering challenge, not just research," with fundamental infrastructure and safety work at the frontier of AI development. The future of AI demands nothing less than integrated teams, massive resources, and an unwavering focus on doing the right thing.
For more detailed insights, watch the full interview embedded above.