Stanford CS 153: Infra @ Scale - A Deep Dive with Cursor CTO & Co-Founder Sualeh Asif
Welcome to an in-depth exploration of the infrastructure behind modern AI-powered coding assistants. In this comprehensive article, we delve into the fascinating conversation from Stanford’s CS 153 Infra @ Scale lecture featuring Sualeh Asif, CTO and Co-Founder of Cursor. Sualeh walks us through the immense scale at which Cursor operates, the complex infrastructure that powers its unique product experience, the challenges faced in scaling and reliability, and the innovative solutions that have emerged from these challenges.
Whether you’re a software engineer, infrastructure enthusiast, or simply curious about the inner workings of AI-driven developer tools, this detailed narrative will take you behind the scenes of one of the cutting-edge products in the AI-assisted coding space.
Introduction: Setting the Stage for Infra at Scale
The session kicked off with the host welcoming Sualeh Asif to week five of CS153, a Stanford course focused on infrastructure at scale. The audience was eager for "gnarly, spicy" details about the infrastructure that powers Cursor's product experience.
Sualeh promised an unfiltered, behind-the-scenes look at the systems that deliver the seamless AI assistance developers have come to rely on. The conversation was structured into two halves: first, Sualeh would share insights and answer pre-prepared questions, followed by a live Q&A from the community.
Understanding the Scale of Cursor's Infrastructure
When asked about the scale Cursor is operating at today, Sualeh's answer immediately impressed:
"We've scaled by a factor of 100 or more in the last year. Some things even more."
Model Inference at Massive Scale
- Cursor's custom models handle roughly 100 million model calls per day.
- This volume represents a substantial portion of 'Frontier Model' traffic, the high-end AI model traffic on the internet.
- The infrastructure hosts some of the largest models in the world.
Document Indexing at Scale
- Cursor's indexing systems process about a billion documents per day.
- Over the lifetime of the company, this adds up to hundreds of billions of documents indexed.
- Some of the data stores have grown to sizes comparable to large, well-known tech companies, reflecting the massive scale of operations.
Sualeh reflected on the journey:
"It's now kind of a service at scale. It's been fun; many, many lessons along the way."
Breaking Down the Core Components of Cursor's Infrastructure
Sualeh outlined three core pillars that form the backbone of Cursor's infrastructure:
1. Indexing Systems
The indexing systems are responsible for ingesting, organizing, and retrieving vast amounts of code and documents. This includes:
- The retrieval systems that search through repositories when developers ask questions.
- Indexing of repositories that can be as large as those maintained by big companies like Instacart.
- These systems enable Cursor to understand your codebase deeply and quickly.
2. Model Systems
Cursor runs several types of models:
- The autocomplete model runs on every keystroke; at any given second, Cursor is handling about 20,000 model calls.
- This requires a massive fleet of hardware: around 2,000 NVIDIA H100 GPUs or equivalent.
- The infrastructure is globally distributed, with data centers and GPU clusters on the East Coast (Virginia), West Coast (Phoenix), London, and Tokyo.
- At one point, they tried Frankfurt, but it was unstable.
- This global distribution ensures fast response times no matter where users are located; a minimal routing sketch follows below.
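To make the global distribution concrete, here is a minimal sketch of how a client might pick the lowest-latency region. Only the region names come from the talk; the endpoints, health path, and probing approach are illustrative assumptions, not Cursor's actual routing logic.

```typescript
// Hypothetical endpoints; only the region list reflects the talk.
const REGIONS = [
  "https://virginia.example-inference.net",
  "https://phoenix.example-inference.net",
  "https://london.example-inference.net",
  "https://tokyo.example-inference.net",
];

// Probe each region once and return whichever answered fastest.
async function pickFastestRegion(): Promise<string> {
  const probes = REGIONS.map(async (base) => {
    const start = Date.now();
    await fetch(`${base}/healthz`, { method: "HEAD" }); // assumed health path
    return { base, latencyMs: Date.now() - start };
  });
  const settled = await Promise.allSettled(probes);
  const healthy = settled
    .filter(
      (r): r is PromiseFulfilledResult<{ base: string; latencyMs: number }> =>
        r.status === "fulfilled",
    )
    .sort((a, b) => a.value.latencyMs - b.value.latencyMs);
  if (healthy.length === 0) throw new Error("no region reachable");
  return healthy[0].value.base;
}
```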
3. Streaming Infrastructure and Data Storage
- This pillar handles how data is streamed, stored, and processed behind the scenes.
- It is not part of what the user directly interacts with but is critical for improving Cursor continuously.
- It manages data that comes into Cursor, supports screening, data cleaning, and background processes to make the product better daily.
Architectural Flow: From Codebase to Editor Interactions
When a user wants to interact with their codebase through Cursor, here's how the architecture supports that:
- The client (developerās editor) sends requests to Cursorās services.
- The architecture is largely monolithic, with one big service deployed mostly on the East Coast.
- This monolith handles requests such as figuring out relevant code snippets, indexing, and feeding them into models.
- There is a bidirectional stream between the model inference server and clients globally (e.g., India, Pakistan).
- The system translates model actions into editor responses in real time, as sketched below.
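The wire protocol is not public, so the following is only a shape sketch: an async generator that consumes a stream of model actions and yields editor-side responses as each one arrives. The action and response types are hypothetical.

```typescript
// Hypothetical message shapes; the real protocol is not public.
type ModelAction =
  | { kind: "insert_text"; file: string; offset: number; text: string }
  | { kind: "open_file"; file: string };

type EditorResponse = { apply: "edit" | "navigate"; detail: ModelAction };

// Translate each streamed model action into an editor response as it
// arrives; the editor can keep sending context upstream in parallel,
// which is what makes the stream bidirectional.
async function* translateActions(
  actions: AsyncIterable<ModelAction>,
): AsyncGenerator<EditorResponse> {
  for await (const action of actions) {
    switch (action.kind) {
      case "insert_text":
        yield { apply: "edit", detail: action };
        break;
      case "open_file":
        yield { apply: "navigate", detail: action };
        break;
    }
  }
}
```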
Sualeh emphasized the importance of compartmentalization:
"You want to keep the blast radius of critical services small. For example, you don't want login to go down just because someone wrote an infinite loop on an experimental code path."
Lessons Learned: Managing Complexity and Reliability
Early on, the monolithic server architecture caused incidents:
- Someone accidentally wrote an infinite loop, which would take down chat services.
- To mitigate this, Cursor compartmentalized different parts of the infrastructure, isolating experimental or less critical code from core services.
- They imposed strict rules on server complexity, favoring simple, understandable code over complicated logic.
Sualeh shared a candid reflection:
"If it's too complicated, you don't understand it and you can't run it."
The Challenge of Inference at Scale: Self-Hosted vs. Third-Party Models
Cursor uses a mix of:
- Self-hosted models for custom autocomplete and specific tasks.
- Third-party model providers like Anthropic, OpenAI, and Google Cloud Platform for foundation model inference.
Regarding scale:
- The autocomplete workload generates far more tokens than foundation models because it runs on every keystroke.
- Exact split numbers between self-hosted and third-party providers are not publicly known.
Challenges with Third-Party Providers
- Providers often have scaling issues. For example, one unnamed provider would crash at 30-40 million tokens per minute, even when Cursor had requested capacity for 100 million.
- Many providers historically had poor reliability and caching issues.
- Cursor deals with rate limits, cold start problems, and load balancing across providers; see the failover sketch after this list.
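Cursor's actual routing layer is not public; the snippet below is just a minimal sketch of the failover pattern described above. The `Provider` interface and cooldown policy are assumptions for illustration.

```typescript
interface Provider {
  name: string;
  call(prompt: string): Promise<string>;
}

// Walk the provider list in preference order; when one rate-limits or
// crashes, bench it for a cooldown and fall through to the next.
class FailoverRouter {
  private unhealthyUntil = new Map<string, number>();

  constructor(private providers: Provider[], private cooldownMs = 30_000) {}

  async complete(prompt: string): Promise<string> {
    const now = Date.now();
    for (const provider of this.providers) {
      if ((this.unhealthyUntil.get(provider.name) ?? 0) > now) continue;
      try {
        return await provider.call(prompt);
      } catch {
        // Treat any failure (429, 5xx, timeout) as a signal to back off.
        this.unhealthyUntil.set(provider.name, now + this.cooldownMs);
      }
    }
    throw new Error("all providers unavailable or cooling down");
  }
}
```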
The Cold Start Problem and Incident Response
Sualeh explained the cold start challenge vividly:
āImagine you have 100,000 requests per second, and all your nodes die. The new nodes you bring up get overwhelmed by requests before they can become healthy, causing a cascading failure.ā
This was a real challenge during incidents where:
- Traffic had to be carefully managed.
- Providers had to prioritize critical users.
- Cursor developed systems to throttle traffic, kill requests, and manage priorities under load.
Sualeh compared this to WhatsApp's approach, where certain prefixes (important users) get priority during outages. A minimal sketch of this kind of priority admission follows.
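As a rough illustration of that idea, here is a small admission-control sketch: under overload, lower-priority requests are shed first so recovering nodes can warm up instead of being flattened again. The tiers and budget fractions are invented for illustration, not Cursor's actual values.

```typescript
type Priority = 0 | 1 | 2; // 0 = critical traffic, 2 = best-effort

// Admit critical requests at full capacity and lower tiers at a
// shrinking fraction of it, so a cold-started node can become healthy
// instead of collapsing under the full request flood.
class LoadShedder {
  private inFlight = 0;
  private readonly budgetFraction = [1.0, 0.5, 0.1];

  constructor(private maxInFlight: number) {}

  tryAdmit(priority: Priority): boolean {
    const budget = this.budgetFraction[priority] * this.maxInFlight;
    if (this.inFlight >= budget) return false; // shed this request
    this.inFlight++;
    return true;
  }

  release(): void {
    this.inFlight = Math.max(0, this.inFlight - 1);
  }
}
```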
Incident Deep Dive: Indexing System Outage
One of the most challenging incidents happened around September 2023 involving the indexing system. Here's a detailed walkthrough:
The Indexing System Architecture
- Uses a Merkle tree structure to efficiently detect changes between the client and server.
- Each file and folder is hashed; folder hashes depend on their children's hashes.
- This allows quick reconciliation of which parts of the codebase have changed without re-indexing everything, as the sketch below illustrates.
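The exact tree layout Cursor uses is not public, but the mechanism is standard. A generic sketch: files hash their contents, folders hash their children's hashes in a stable order, and reconciliation recurses only into subtrees whose hashes differ.

```typescript
import { createHash } from "node:crypto";

type TreeNode =
  | { kind: "file"; name: string; contents: string }
  | { kind: "dir"; name: string; children: TreeNode[] };

// A file's hash covers its contents; a folder's hash covers its
// children's names and hashes (sorted for stability), so any edit
// bubbles up to the root and identical subtrees can be skipped.
function merkleHash(node: TreeNode): string {
  const h = createHash("sha256");
  if (node.kind === "file") {
    h.update(node.contents);
  } else {
    const sorted = [...node.children].sort((a, b) =>
      a.name.localeCompare(b.name),
    );
    for (const child of sorted) {
      h.update(child.name);
      h.update(merkleHash(child));
    }
  }
  return h.digest("hex");
}
```

If the client's root hash matches the server's, nothing needs re-indexing; otherwise only the subtrees whose hashes differ are walked.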
Initial Database Choice: Yugabyte
- Yugabyte is a globally distributed database inspired by Google Spanner.
- Despite its reputation, Cursor struggled to get it running efficiently.
- The complexity and cost led them to switch to PostgreSQL (RDS).
Challenges with PostgreSQL at Scale
- As traffic spiked, a bug surfaced in the DynamoDB cache layer (reproduced in the sketch after this list):
  - Large files were not cached, causing repeated heavy loads on the embedding models.
  - This led to race conditions and slow cache commits.
- PostgreSQL faced performance problems due to:
  - A high volume of long-running distributed transactions.
  - Heavy use of foreign keys, which added load due to pointer chasing.
  - The need to update hashes and maintain consistency in the Merkle tree during commits.
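The actual cache code is not public; this sketch just reproduces the failure mode described above under an assumed item-size cap (DynamoDB items are limited to 400 KB), with an in-memory map standing in for the real cache.

```typescript
const MAX_ITEM_BYTES = 400 * 1024; // DynamoDB's per-item limit
const cache = new Map<string, number[]>(); // stand-in for the real cache

async function embedWithCache(
  key: string,
  text: string,
  embed: (t: string) => Promise<number[]>,
): Promise<number[]> {
  const hit = cache.get(key);
  if (hit) return hit;

  const vector = await embed(text); // the expensive model call
  if (Buffer.byteLength(text, "utf8") <= MAX_ITEM_BYTES) {
    cache.set(key, vector);
  }
  // Bug shape: oversized files fall through uncached, so every request
  // for a large file re-runs the embedder. One fix is to store the
  // payload in object storage and cache only a small pointer to it.
  return vector;
}
```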
Race Conditions and Distributed Systems Complexity
- The team encountered race conditions on queues responsible for hashing and commit operations.
- Race conditions in distributed systems are notoriously hard to debug.
- Despite alerts and monitoring, the underlying issues remained hidden for some time.
Incident Response and War Room Dynamics
- The team ignored early customer complaints to focus on root cause analysis.
- They discovered the cache hit rate was abnormal and traced it back to uncommitted large files.
- Following a grueling investigation, they fixed the race conditions and stabilized the system.
Post-Incident: Scaling Issues and Data Volume
- The PostgreSQL instance grew to 22 terabytes of data, approaching RDS limits (64 TB max).
- PostgreSQL's update semantics (an update is a delete plus an insert) led to many dead tuples requiring vacuuming.
- The vacuum and anti-wraparound processes started grinding the database to a halt; the queries sketched below show how to watch both signals.
- At one point, the database even failed to boot.
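For readers who want to see these signals directly, here is a small sketch using the `pg` client to read PostgreSQL's built-in statistics views: dead-tuple counts per table, and transaction-ID age per database, the two numbers behind vacuum and anti-wraparound pressure. The connection string is a placeholder.

```typescript
import { Client } from "pg";

async function checkVacuumPressure(connectionString: string): Promise<void> {
  const client = new Client({ connectionString });
  await client.connect();

  // Tables with the most dead tuples awaiting vacuum.
  const dead = await client.query(`
    SELECT relname, n_dead_tup, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10`);

  // How close each database is to transaction-ID wraparound.
  const wraparound = await client.query(`
    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY xid_age DESC`);

  console.table(dead.rows);
  console.table(wraparound.rows);
  await client.end();
}
```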
Seeking Help and Expert Collaboration
- Cursor reached out to AWS support and RDS architects but received limited help.
- They leveraged internal expertise, including a VP of Infrastructure formerly at GitHub.
- The team manually deleted transactions, adjusted schemas (removing foreign keys), and rewrote workloads to reduce database load.
Moving to Object Storage and Future-Proofing
- Inspired by emerging trends, the team began moving large tables (e.g., chunk storage) to object storage (S3, R2, Blob storage).
- Object storage offers unparalleled scalability and reliability compared to traditional databases.
- This migration is ongoing, aiming to offload the database and improve resilience; a content-addressed upload sketch follows.
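The shape of Cursor's migration is not public, but content-addressed chunk storage is the standard pattern here. A minimal sketch using the AWS SDK: the chunk's hash becomes its key, so writes are idempotent and the database keeps only a small pointer.

```typescript
import { createHash } from "node:crypto";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Upload a chunk under its own content hash. Re-uploading identical
// bytes is effectively a no-op, and the database row shrinks to a key.
async function putChunk(bucket: string, body: Buffer): Promise<string> {
  const key = `chunks/${createHash("sha256").update(body).digest("hex")}`;
  await s3.send(new PutObjectCommand({ Bucket: bucket, Key: key, Body: body }));
  return key; // store this pointer in the database instead of the bytes
}
```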
Side Story: Trends in Databases and Streaming
Sualeh gave a fascinating overview of modern trends in databases:
- The rise of analytical engines like Snowflake and Druid.
- Use of streaming systems like Kafka, and emerging approaches that move Kafka onto blob storage to avoid operational complexity.
- Companies like WarpStream (acquired by Confluent) and turbopuffer innovate by running streaming systems and vector databases on object storage.
- The ultimate way to scale the database, for certain workloads, is not to run a traditional database at all.
Handling Scale: Multiple Providers, Rate Limits, and Reliability
Cursor operates in a complex ecosystem involving:
- Multiple AI model providers with varying reliability.
- Constant negotiation for increased token quotas.
- Load balancing traffic across providers to maximize uptime.
- Dealing with frequent outages and throttling.
Sualeh described the frantic nature of these negotiations:
"Anthropic calls Google for more TPUs; Google calls their Borg team to rack more hardware. Everyone is scrambling to keep up with the explosive demand."
Community Q&A Highlights
Differentiating from Other Coding Assistants
Sualeh addressed questions about how Cursor differentiates itself from GitHub Copilot, Codeium, and others:
- The team originally hesitated to compete with Copilot, which felt like an unbeatable “Dream Team” product.
- They recognized that the ceiling for autocomplete was really high and aimed to push beyond it.
- Cursor focuses on deep indexing, fast response, and a seamless user experience that feels like code completion but smarter.
- The product direction only feels obvious in hindsight; the goal is to automate meaningful parts of engineering.
Student Discounts and Abuse Prevention
- Cursor is one of the largest Frontier Model inference users globally.
- They spend significant resources blocking spammers who create thousands of fake accounts to exploit free tokens.
- They encourage responsible disclosure of bugs that could allow free token exploits and offer free subscriptions for such reports.
- Managing abuse and spam is a continuous battle.
Rate Limits and Enterprise Contracts
- Cursor has enterprise contracts with providers to secure high rate limits.
- Despite that, providers still throttle or crash under extreme load.
- The team has built messy but effective infrastructure to handle rate limiting and failover.
- They engage in constant negotiation with providers to increase capacity.
Security and Privacy of Code
- Cursor takes security seriously and invests significant effort to protect user code.
- Vector embeddings stored in databases are encrypted with keys that live only on the user’s device.
- Even if the vector database is compromised, the data is unintelligible without the keys; see the sketch after this list.
- Sualeh feels safer knowing this encryption is in place, despite the inherent risks.
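Cursor has not published its exact scheme, so the following is only a sketch of the property described above, using AES-256-GCM from Node's crypto module: the key lives with the user, and the server-side store only ever sees ciphertext.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Encrypt an embedding vector with a key that never leaves the device.
function encryptVector(key: Buffer, vector: Float32Array) {
  const iv = randomBytes(12); // fresh nonce per vector
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([
    cipher.update(Buffer.from(vector.buffer, vector.byteOffset, vector.byteLength)),
    cipher.final(),
  ]);
  return { iv, ciphertext, tag: cipher.getAuthTag() };
}

// Decrypt on-device; without the key, the stored blob is unintelligible.
function decryptVector(
  key: Buffer,
  blob: { iv: Buffer; ciphertext: Buffer; tag: Buffer },
): Float32Array {
  const decipher = createDecipheriv("aes-256-gcm", key, blob.iv);
  decipher.setAuthTag(blob.tag);
  const plain = Buffer.concat([decipher.update(blob.ciphertext), decipher.final()]);
  const aligned = new Uint8Array(plain); // copy to an aligned buffer
  return new Float32Array(aligned.buffer);
}
```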
The Role of Computer Science and Future of IDEs
- Sualeh reflected on whether deep CS knowledge or math is still necessary in the era of AI-assisted coding.
- He believes automating boring, repetitive tasks frees engineers to be more creative.
- Complex system design and architecture work will remain important.
- Models will assist engineers but not replace the need for skilled developers.
- IDEs will evolve but will continue to play a significant role as tools programmers use.
Costs and Pricing
- Cursor strives to keep pricing affordable (~$20/month).
- The cost reflects the heavy infrastructure and compute needed to power AI on every keystroke.
- Users get value far exceeding the price when compared to alternatives.
- The team works hard to optimize costs so more people can access the product.
Advice on Dropping Out to Build Startups
- Sualeh encourages thoughtful consideration.
- University provides valuable learning and social connections.
- Many operational skills are learned on the job.
- The choice to drop out should be obvious and personal; if it makes sense, go for it.
Conclusion: A Glimpse Into the Future of AI at Scale
This session with Cursor's CTO Sualeh Asif was a remarkable deep dive into the realities of building and scaling AI-powered developer tools.
From handling hundreds of millions of model calls daily, to navigating complex distributed databases and caching layers, to managing multi-cloud inference infrastructure under extreme load, the story of Cursor's infrastructure is one of continuous innovation and resilience.
Sualeh's candid insights and stories reveal the critical importance of simplicity, compartmentalization, and constant adaptation in infrastructure at scale. They also highlight the exciting future where AI assists engineers in ways unimaginable a few years ago, without replacing the creativity and complexity that human developers bring.
For anyone interested in how cutting-edge AI products are built and operated, this conversation is a masterclass in the art and science of infrastructure at scale.
Thank you for joining this comprehensive exploration of Cursor's infrastructure and scale. For more detailed discussions, stay tuned to Stanford's CS153 series and follow Cursor's journey as they continue to push the boundaries of AI-assisted coding.