Cursor Team: Future of Programming with AI | Lex Fridman Podcast #447
Lex Fridman
Cursor Team discusses the future of programming with AI, focusing on innovations in code editing, automation, and enhancing user experience through advanced AI tools and collaboration.
AI-Driven Code Editors and the Future of Programming
Cursor, a code editor based on VS Code, is rethinking how AI can assist in coding. It’s not just about autocompleting lines; it’s about predicting entire changes, making coding faster and more fun. The idea is to eliminate low-entropy actions—those repetitive, predictable steps—so you can focus on the creative parts. As one of the team members put it, "fast is fun."
The evolution of AI coding tools, like GitHub Copilot, has already shown how much productivity can be gained. But the future promises even more radical changes. Larger models, as scaling laws suggest, will lead to smarter AI that can handle more complex tasks, like bug detection and code verification. The "gold cursor tab" concept is a step in this direction, allowing programmers to skip ahead in time, reducing the need for manual edits.
Cursor’s approach is to integrate AI deeply into the coding process, predicting not just the next character but the entire flow of changes. This is where the "shadow workspace" comes in—a hidden instance of the editor where AI agents can modify code without affecting the user’s immediate environment. It’s like having a colleague working in the background, ready to assist when needed.
The future of programming is about speed, iteration, and reducing boilerplate. AI will make it easier to migrate code, detect bugs, and even suggest design decisions. But it’s not just about making things faster; it’s about making coding more enjoyable. As one of the team members said, "the skills will probably change too... it’ll be a lot more fun."
"The idea was these models are going to get much better, their capabilities are going to improve, and it's going to entirely change how you build software, both in that you will have big productivity gains but also, more radically, in how the act of building software is going to change a lot."
The Role of Code Editors and the Journey to Cursor
A code editor is like a "souped-up word processor" for programmers, designed to handle the structured nature of code. It helps with everything from visual differentiation of tokens to navigating codebases like hyperlinks and catching basic errors. But the role of code editors is evolving. Over the next decade, the way software is built will change significantly, especially with AI integration.
For the Cursor team, a key principle is that code editors should be fun. And a big part of that fun comes from speed. "Fast is fun," they say, and coding offers a unique experience where you can build cool things quickly, without the gatekeeping of other disciplines.
The team’s journey to Cursor started with Vim, but GitHub Copilot’s autocomplete feature in VS Code was compelling enough to make them switch. Copilot suggests lines of code as you type, creating a feeling of connection when it works well—"like when your friend completes your sentences." This sense of flow and understanding is what they aim to enhance with Cursor.
The Origin of Cursor and the Impact of GPT-4
Even when GitHub Copilot gets things wrong, it’s not a big deal. You just type another character, and it might get it right. "Even when it's wrong, it's like a little bit annoying, but it's not that bad because you just type another character and then maybe it gets you." This iterative process is part of what makes Copilot so compelling.
The real turning point for the Cursor team came in 2020 with the release of the scaling laws papers from OpenAI. These papers showed that bigger models could lead to predictable improvements. "It looked like you could make these models a lot better if you had more compute and more data." This sparked conceptual conversations about how AI would transform various fields, including programming.
By late 2022, the team got early access to GPT-4, and the leap in capabilities was undeniable. "The step up in GPT-4 felt enormous... it really made concrete the theoretical gains that we had predicted before." This shift made them realize that all of programming would eventually flow through these models, not just specific tools. "It felt like this wasn't just going to be a point solution thing... all of programming was going to flow through these models."
One memorable moment was a bet made in June 2022 about whether AI could win a gold medal in the International Math Olympiad by 2024. At first, it seemed impossible, but the rapid progress in AI capabilities made the bet more plausible. "I remember thinking, this is not going to happen... but I was very wrong."
Scaling Laws and the Decision to Fork VS Code
The conversation about scaling laws was a turning point. Initially, there was skepticism about whether scaling alone could lead to massive progress. "I went through the stages of grief... anger, denial, and then finally acceptance." But eventually, the realization hit: scaling laws could indeed result in significant gains. This shift in thinking laid the groundwork for the team's optimism about AI's future.
When it came to Cursor, the decision to fork VS Code seemed obvious. The team believed that AI's rapid improvement would fundamentally change how software is built. "These models are going to get much better... it's going to entirely change how you build software." They didn't want to be limited by the constraints of being just a plugin. "We didn't want to get locked in by those limitations."
Despite VS Code with Copilot being a competitor, the team felt that being ahead in AI capabilities, even by a few months, would make a huge difference. "Being even just a few months ahead... makes your product much more useful." While Microsoft has done "fantastic things," the team believes that a startup like Cursor can innovate and push boundaries faster.
Looking ahead, the goal is for Cursor to evolve so rapidly that the current version will seem outdated in just a year. "The Cursor a year from now will need to make the Cursor of today look obsolete."
Frustration with Copilot and the Birth of Cursor
The team behind Cursor started with a deep frustration. Despite models improving, the experience with tools like Copilot felt stagnant. "The whole experience had not changed... it started to feel stale." They wanted something new, something that pushed boundaries.
Cursor's development approach was different. They worked on the user experience and model improvements simultaneously. "We're developing the UX and the way you interact with the model at the same time as we're developing how we make the model give better answers." This close collaboration was key. "The person making the UI and the person training the model sit 18 feet away... often the same person."
Cursor excels in two main areas. First, it acts like a fast colleague, predicting not just the next characters, but the next entire change or diff. Second, it helps users jump ahead of the AI, turning instructions into code. "Two things Cursor is pretty good at... being like a fast colleague who can jump ahead of you... and helping you jump ahead of the AI and go from instructions to code."
One standout feature is the "tab" function. After accepting an edit, the model predicts where to go next, often jumping several lines down. "You just press tab, it would go 18 lines down and show you the next edit... the internal competition was how many tabs can we make them press."
Cursor's Low-Entropy Prediction and Future Capabilities
Cursor's goal is to eliminate predictable, low-entropy actions in coding. "There are a lot of tokens in code that are super predictable," and Cursor aims to "eliminate all the low entropy actions you take inside of the editor when the intent is effectively determined." The idea is to skip over these predictable steps, allowing the user to jump ahead in their workflow.
To achieve this, Cursor uses small, sparse models that are "incredibly pre-fill token hungry," meaning they process large amounts of code but generate fewer tokens. A key breakthrough was the development of "speculative edits," a variant of speculative decoding that improves performance and reduces latency.
Caching is crucial for handling large numbers of input tokens without overloading GPUs. "Caching plays a huge role... you need to reuse the KV cache across requests" to minimize compute load and maintain speed.
Looking ahead, Cursor will be able to "generate code, fill empty space, edit code across multiple lines, jump to different locations inside the same file," and even predict next actions like terminal commands. It will also help users by taking them to relevant information, like definitions, to verify code. "Maybe it should actually take you to a place that's like the definition of something and then take you back," providing the knowledge needed to make informed decisions.
Cursor's Diff Interface Evolution and Future Improvements
Cursor's team is working on making the next five minutes of coding more predictable, based on recent actions. "The next five minutes of what you're going to do is actually predictable from the stuff you've done recently." This predictability is key to streamlining the coding process.
The diff interface has gone through several iterations. Early versions used "blue crossed out lines" and "Google doc style" strikethroughs to show code changes, but these were "super distracting." The current version uses a box on the side to show deletions and additions, but it's still evolving.
One challenge is making the interface intuitive. For example, holding the "option button" to see AI suggestions was "nonintuitive," and the team is still working on improving this.
Looking ahead, the team is exploring ways to make reviewing diffs more efficient, especially for large edits or multiple files. Ideas include highlighting important parts of the diff and graying out low-entropy sections, or using a model to flag potential bugs with a "red squiggly."
The "verification problem" remains a challenge, as reviewing diffs for large edits can be "prohibitive." The team is excited about finding solutions to make this process smoother and more intuitive.
AI's Role in Code Review and Programming
The current code review process is inefficient, especially when dealing with unfamiliar code. Reviewers spend too much time trying to "grok this code that's often quite unfamiliar," and it doesn't even catch that many bugs. AI can significantly improve this by guiding the reviewer to the most important parts of the code, making the process faster and more effective. "You want a model to guide you through the thing," rather than manually figuring out which parts of the code matter.
When the code is produced by AI, the review process can be optimized entirely for the reviewer. There's no need to worry about the experience of the code producer, as you would with human-generated code. "In the case where the person that produced the code is a language model, you don't have to care that much about their experience." This allows for a more streamlined, reviewer-focused process.
As for natural language programming, it won't replace traditional methods entirely. Sometimes it's easier to show the AI what you want rather than explain it in words. "Sometimes it's just so annoying to explain to Sualeh what I want him to do, and so I actually take over the keyboard." Natural language will have its place, but "it will not be the way most people program most of the time."
Looking further ahead, brain-machine interfaces might eventually become part of the programming process, but that's not the immediate focus. "Maybe eventually we will get to brain-machine interfaces," but for now, the focus is on improving the tools we already have.
Frontier Models and Speculative Edits in Code Generation
Frontier models are good at sketching out plans for code but struggle when it comes to creating diffs, especially in large files. They often mess up simple things like counting line numbers. To address this, the apply model steps in. It takes the rough code block generated by the frontier model and applies it to the file. Contrary to what many think, this process is not a simple deterministic algorithm. "It fails at least 40% of the time" when trying to do deterministic matching, leading to a poor product experience.
To make the process faster, speculative edits are used. This is a variant of speculative decoding, where multiple tokens are processed at once, speeding up the generation. The model uses the existing code as a strong prior, feeding chunks of the original code back into the model. Most of the time, the model agrees and spits the same code back out, allowing for parallel processing. When the model reaches a point of disagreement, it generates new tokens and resumes speculating on the next chunk of code.
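The loop described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "model" is a stand-in function that already knows the edited file, tokens are whole lines, and the re-alignment heuristic after a disagreement is invented. It is not Cursor's implementation, only a way to see why agreeing with the original code in chunks needs far fewer model passes than generating one token at a time.

```python
from typing import Callable, List, Optional, Tuple

Token = str
NextToken = Callable[[List[Token]], Optional[Token]]


def make_toy_model(target: List[Token]) -> NextToken:
    """Hypothetical stand-in for a model: it simply knows the edited file."""
    def next_token(prefix: List[Token]) -> Optional[Token]:
        return target[len(prefix)] if len(prefix) < len(target) else None
    return next_token


def speculative_edit(original: List[Token], next_token: NextToken,
                     chunk_size: int = 8) -> Tuple[List[Token], int]:
    """Return (output, passes), using the original code as the draft.

    Each loop iteration stands for one model pass that verifies a whole
    chunk of draft tokens at once. The output always equals what the model
    would have produced on its own; only the number of passes changes.
    """
    out: List[Token] = []
    pos = 0        # read position in the original file (the draft source)
    passes = 0
    while True:
        passes += 1
        draft = original[pos:pos + chunk_size]
        accepted = 0
        pred = next_token(out)
        for tok in draft:
            if pred != tok:          # the model disagrees with the draft here
                break
            out.append(tok)
            accepted += 1
            pred = next_token(out)
        pos += accepted
        if pred is None:             # the model has finished the file
            return out, passes
        if draft and accepted == len(draft):
            continue                 # whole chunk accepted, speculate again
        out.append(pred)             # take the model's own token instead
        try:
            # Assumed heuristic: if that token exists later in the original,
            # resume drafting right after it; otherwise stay where we are.
            pos = original.index(pred, pos) + 1
        except ValueError:
            pass


original = [f"line {i}" for i in range(20)]
target = list(original)
target[10] = "changed"
output, passes = speculative_edit(original, make_toy_model(target))
print(len(output), "tokens in", passes, "passes")
```

In this toy run a 20-line file with one changed line is reproduced in a handful of passes instead of 20, which is the effect the speculative approach is after.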
This method results in a much faster version of normal code editing. "It looks like a much faster version of the model rewriting all the code." While the model is streaming the code, you can start reviewing it before it's even done, eliminating the need for a big loading screen.
Comparing LLMs for Coding and the Limits of Benchmarks
No single model dominates in all aspects of coding. Different models excel in different areas like speed, code editing, and long-context processing. "There's no model that dominates others... the categories being speed, ability to edit code, ability to process lots of code, long context." However, Sonnet is considered the best overall, especially in reasoning and handling more complex problems. "The one that I'd say right now is just kind of net best is Sonnet... it's really good at reasoning."
Benchmarks, though useful, fall short when compared to real-world coding. "Interview problems are very well specified... while the human stuff is less specified." Real coding is messier, with vague instructions and context-dependent tasks, unlike the clean, well-defined problems in benchmarks.
Public benchmarks like SWE-bench are also problematic because they are often contaminated in the training data. This leads to models hallucinating correct answers without real context. "One of the most popular agent benchmarks, SWE-bench, is really contaminated in the training data... they can hallucinate the right file paths."
Human Feedback, Hardware Differences, and Prompt Design Challenges
Evaluating AI models isn't just about benchmarks; human feedback plays a big role. "People will actually just have humans play with the things and give qualitative feedback." Internally, they rely heavily on this, calling it the "vibe benchmark" or "human benchmark." It's a way to get a feel for how models perform in real-world scenarios, beyond just numbers.
Hardware differences can also affect model performance. For example, AWS uses different chips than Nvidia GPUs, and "someone speculated that Claude's degraded performance had to do with maybe using the quantized version on AWS Bedrock." These subtle hardware variations can lead to noticeable changes in how models behave.
Prompt design is another tricky area. Early models like GPT-4 were "quite sensitive to the prompts" and had small context windows. Even with today's larger context windows, filling them up can slow down the model or confuse it. To manage this, they use an internal system called Priompt, which helps decide what information to include in the prompt.
Interestingly, they draw inspiration from web design when formatting prompts. It's like designing a website that works on both mobile and desktop. "The thing that we really like is React and the declarative approach where you use JSX in JavaScript." This approach helps prioritize and organize information dynamically, depending on the input size.
JSX Prompts, Prioritization, and Resolving Ambiguity
The Cursor team uses a pre-renderer, similar to Chrome, to fit everything onto the page. They prompt with JSX, which "kind of looks like React," using components like a file component to prioritize lines of code. The line where the cursor is gets the highest priority, and "you subtract one for every line that is farther away." This helps center the most important information when rendering.
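The line-priority rule above can be sketched in a few lines. This is a minimal illustration, not the team's renderer: the base priority of 1000, the whitespace-based `count_tokens` stand-in, and the function names are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredLine:
    index: int      # position of the line in the file
    text: str
    priority: int   # higher means more important to keep


def score_lines(lines: List[str], cursor_line: int,
                base_priority: int = 1000) -> List[ScoredLine]:
    """Cursor line gets base_priority; subtract one per line of distance."""
    return [ScoredLine(i, text, base_priority - abs(i - cursor_line))
            for i, text in enumerate(lines)]


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: count whitespace-separated words."""
    return max(1, len(text.split()))


def render_file(lines: List[str], cursor_line: int, token_budget: int) -> str:
    """Keep the highest-priority lines that fit, then restore file order."""
    kept: List[ScoredLine] = []
    used = 0
    for line in sorted(score_lines(lines, cursor_line),
                       key=lambda scored: -scored.priority):
        cost = count_tokens(line.text)
        if used + cost > token_budget:
            break   # stop at the first line that does not fit
        kept.append(line)
        used += cost
    kept.sort(key=lambda scored: scored.index)
    return "\n".join(scored.text for scored in kept)


source = ["import os", "def load(path):", "    data = open(path).read()",
          "    return data", "print(load('x'))"]
print(render_file(source, cursor_line=2, token_budget=9))
```

Stopping at the first line that does not fit keeps the rendered window contiguous around the cursor; a renderer that skipped oversized lines and kept going would pack more text in but could leave gaps.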
When dealing with large code blocks, they use retrieval, embedding, and reranking scores to prioritize components. This allows them to handle vast amounts of code more efficiently.
There's also a tension between letting programmers be lazy and encouraging them to be more articulate in their prompts. While it's tempting to let people "just do what you want," the system sometimes needs more depth of thought to resolve ambiguity. The model can either ask for clarification or show multiple possible generations to help the user choose.
To further assist, the system suggests files to add based on previous commits. For example, if you're working on an API, it might suggest editing the client and server files as well, based on your past work.
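One simple way to get suggestions like this from commit history is to count which files tend to change together. The sketch below is only an assumed approach (the source says nothing more than "based on previous commits"), and the file paths are made up.

```python
from collections import Counter
from typing import Iterable, List, Set


def suggest_files(commits: Iterable[Set[str]], open_files: Set[str],
                  top_n: int = 3) -> List[str]:
    """Rank files by how often they changed in the same commit as an open file."""
    counts: Counter = Counter()
    for changed in commits:
        if changed & open_files:                 # commit touched an open file
            for path in changed - open_files:    # count its companions
                counts[path] += 1
    return [path for path, _ in counts.most_common(top_n)]


history = [
    {"api/schema.py", "client/api.ts", "server/routes.py"},
    {"api/schema.py", "server/routes.py"},
    {"README.md"},
    {"api/schema.py", "client/api.ts", "server/routes.py", "docs/api.md"},
]
print(suggest_files(history, open_files={"api/schema.py"}, top_n=2))
```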
Agents in Programming and Speed Optimization
Agents are "really, really cool" because they resemble human-like behavior, making them feel like a step closer to AGI. However, they are "not yet super useful for many things," though they are getting close. For example, an agent could be useful for fixing bugs—finding the right files, reproducing the bug, fixing it, and verifying the solution. But programming is often iterative, and agents might not be ideal for all tasks since "you don’t really know what you want until you’ve seen an initial version." Still, agents could work in the background, handling backend tasks while the programmer focuses on frontend work.
Speed is a major focus for Cursor, and most aspects feel "really fast." The slowest part is the "apply" function, which takes about "1 or 2 seconds," but the team is working on fixing it. Techniques like "cache warming" help reduce latency and cost by reusing KV cache results. As the user types, the system can predict the context and warm the cache to improve performance.
KV Cache, Speculative Execution, and Reinforcement Learning
Transformers rely on keys and values to store representations of previous tokens, allowing them to "see" past tokens instead of processing each one independently. Normally, for every token, the model has to do a forward pass through the entire network, which is slow. But by storing these keys and values in the GPU, the model can skip reprocessing previous tokens. "You just need to do the forward pass through that last token," speeding up the process significantly.
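To make the saving concrete, here is a toy single-head attention in plain Python. The "projections" are arbitrary stand-ins (doubling and adding one) and real models use learned matrices on a GPU, so treat this purely as an illustration of why storing keys and values avoids redoing work for earlier tokens.

```python
import math
from typing import List, Tuple

Vector = List[float]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over all stored keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    weights = [math.exp(score - top) for score in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


class ToyAttention:
    """Counts key/value projections so the two strategies can be compared."""

    def __init__(self) -> None:
        self.kv_projections = 0

    def project_kv(self, x: Vector) -> Tuple[Vector, Vector]:
        self.kv_projections += 1
        return [2.0 * v for v in x], [v + 1.0 for v in x]  # made-up projections

    def decode_no_cache(self, tokens: List[Vector]) -> Vector:
        """Output for the last token, re-projecting every earlier token."""
        pairs = [self.project_kv(token) for token in tokens]
        return attend(tokens[-1], [k for k, _ in pairs], [v for _, v in pairs])


class CachedDecoder:
    """Keeps keys and values around so each step only projects the new token."""

    def __init__(self, layer: ToyAttention) -> None:
        self.layer = layer
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def step(self, token: Vector) -> Vector:
        key, value = self.layer.project_kv(token)
        self.keys.append(key)
        self.values.append(value)
        return attend(token, self.keys, self.values)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
slow = ToyAttention()
slow_out = [slow.decode_no_cache(tokens[: i + 1]) for i in range(len(tokens))]
fast = ToyAttention()
decoder = CachedDecoder(fast)
fast_out = [decoder.step(token) for token in tokens]
print(slow.kv_projections, "projections without cache,",
      fast.kv_projections, "with cache")
```

The outputs are identical either way; the uncached path does 1 + 2 + 3 projections for three tokens while the cached path does one per token, and that gap grows quadratically with sequence length.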
Cursor also uses speculative execution to predict what the user might accept next. "You can basically predict ahead as if the user would have accepted the suggestion and then trigger another request." This trick makes the system feel faster, even though the model itself hasn't changed. By pre-caching multiple potential suggestions, the system can present the next option instantly when the user presses tab.
Reinforcement learning (RL) plays a role in optimizing these predictions. The model can predict multiple suggestions, and while a single prediction might not be accurate, "if you predict like 10 different things, turns out that one of the 10 is right." The model is trained to prioritize suggestions that humans are more likely to accept, rewarding good predictions and punishing bad ones. This process helps the system refine its outputs to better match user preferences.
Optimizing KV Cache with Multi-Query and Multi-Latent Attention
To speed up token generation, especially with large batch sizes, techniques like multi-query attention (MQA) and grouped-query attention (GQA) are used to reduce the size of the KV cache. MQA is the most aggressive: "you just preserve the query heads and get rid of all the key value heads," leaving a single key/value head that every query head shares. GQA also keeps all the query heads but only reduces the number of key and value heads, with each group of query heads sharing one. The goal is to shrink the stored keys and values, which helps overcome memory bandwidth bottlenecks. "You're bottlenecked by how quickly you can read those cache keys and values," so reducing their size speeds up the process.
Multi-latent attention (MLA) takes this a step further by compressing all keys and values into a single latent vector, which is expanded during inference. "MLA turns the entirety of your keys and values across all your heads into one latent vector," making it more efficient by reducing the size of the vector.
Because these optimizations shrink the memory each cached token needs, the same GPU can hold cached keys and values for more content. That yields more cache hits, which improves the time to first token, and leaves room for larger prompts and batch sizes. "You can now make your cache a lot larger because you've less space allocated for the KV cache," which matters because "the size of your KV cache is both the size of all your prompts multiplied by the number of prompts being processed in parallel."
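A back-of-the-envelope calculation shows how much the number of key/value heads matters. The model shape below (32 layers, 32 query heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen for round numbers, not any particular model, and the formula covers standard, grouped-query, and multi-query attention only, not the latent compression used by MLA.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int,
                   bytes_per_value: int = 2) -> int:
    """Memory for cached keys and values across all layers.

    The leading 2 accounts for storing both a key and a value per token.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


GIB = 2 ** 30
shape = dict(num_layers=32, head_dim=128, seq_len=8192, batch_size=16)

mha = kv_cache_bytes(num_kv_heads=32, **shape)  # one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **shape)   # 4 query heads share a KV head
mqa = kv_cache_bytes(num_kv_heads=1, **shape)   # all query heads share one

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / GIB:.0f} GiB")
```

Under these assumptions the cache drops from 64 GiB to 16 GiB to 2 GiB, and the formula also makes the dependence on prompt length and batch size explicit, since both multiply the total.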
Shadow Workspace and Background Computation
The Shadow workspace is designed to run background computations that assist the user without affecting their workflow. The idea is to predict not just the next few lines of code but what the user might do in the next 10 minutes. "We want there to be a lot of stuff happening in the background," allowing the model to spend more computation time iterating and improving predictions.
A key part of this is providing feedback to the model. One important feedback mechanism is the language server, which exists for most programming languages. "Language servers... tell you you're using the wrong type or allow you to go to definition." Cursor uses this information not only to assist the programmer but also to feed it to the AI models in the background.
The Shadow workspace spawns a hidden window where AI agents can modify code without saving it. "The AI agents can modify code however they want... as long as they don't save it." This allows the model to iterate and get feedback from the language servers without disrupting the user.
Implementing this across different operating systems presents challenges. On Linux, "you can mirror the file system," but on Mac and Windows, it's more difficult. One potential workaround is "holding a lock on saving," allowing the model to operate on a version of the files without writing to disk.
AI Agents and Bug Detection Challenges
When thinking about AI agents for coding, the focus is often on bug detection and feature implementation. "Do you want them to find bugs? Implement new features?" The idea is to automate tasks like bug finding, especially logical bugs. However, current models struggle with this. "These models are so bad at bug finding... when just naively prompted to find a bug, they're incredibly poorly calibrated."
The issue stems from the pre-training distribution. "These models are a really strong reflection of the pre-training distribution," meaning they excel at tasks like code generation and question answering, which are well-represented in their training data. But bug detection? Not so much. "There aren't really that many examples of actually detecting real bugs and proposing fixes."
Despite this, there's potential for improvement. With more targeted training, models could become better at bug detection. "It just takes a little bit of nudging in that direction." The same way models have improved in other areas, they could eventually excel at finding and fixing bugs.
Human Calibration and AI in Code Safety
Humans have an edge in identifying critical bugs because they bring cultural knowledge and experience to the table. "Humans are really calibrated on which bugs are really important." For example, a staff engineer might remember a sketchy piece of code that took down a server years ago, while an AI model might not grasp the severity of such a bug.
Labeling dangerous code is crucial for both humans and AI. "If a code can do a lot of damage, one should add a comment that says this line of code is dangerous." This practice helps AI models focus on critical areas. "If you actually write dangerous in every single line, the models will pay more attention."
However, formal verification is the ultimate goal. Labeling dangerous code is a stopgap "until we have formal verification for everything," at which point "you know for certain that you have not introduced a bug if the proof passes." In the future, AI models could suggest specs and prove code correctness. "The model will suggest a spec, and a smart reasoning model computes a proof that the implementation follows the spec."
Formal Verification and AI Bug Models
Formal verification is possible, but it's a tough challenge, especially for large codebases. "You can formally verify down to the hardware... through the GCC compiler and then through the Verilog down to the hardware." The key is breaking down these "multi-layered systems" and verifying each part. If done right, "it should be possible" to formally verify even big codebases.
However, the specification problem is real, especially when dealing with external dependencies like APIs or language models. "How do you handle side effects or... external dependencies like calling the Stripe API?" And what about language models? "Maybe people use language models as primitives in the programs they write... how do you now include that?"
The dream is to prove that language models are aligned and give correct answers. "It feels possible that you could actually prove that a language model is aligned... or like you can prove that it actually gives the right answer." Achieving this would be a game-changer, from "making sure your code doesn't have bugs" to "making sure AI doesn't destroy all of human civilization."
Bug finding is a crucial step for AI to take on more programming tasks. "It should very quickly catch the stupid bugs like off by one errors." But as AI builds more of the system, it needs to "not just generate but also verify" its own code. Without this, the problems with AI programming will become "untenable."
Training bug models is tricky, but one idea is to introduce bugs synthetically. "You can train a model to introduce bugs in existing code... then train a reverse bug model that can find bugs using this synthetic data."
Bug Detection, Monetary Incentives, and Code-Running Loops
Finding bugs in code is tough, especially when just staring at the file. Often, you need to run the code, use traces, or step through a debugger to spot the issue. "It's kind of a hard problem to like stare at a file and be like where's the bug... often you have to run the code."
One idea is to have a fast, specialized model running in the background to catch bugs. For more critical issues, you could throw a lot of compute power at it. "You have a really specialty model that's quite fast... and sometimes you zap that with tons and tons of compute."
Monetary incentives could also play a role. Imagine tipping $5 when a model generates perfect code or finds a bug. "I want a tip on a button that goes yeah here's $5... just to support the company and the interface." But there's concern that introducing money could make the process less fun. "Introducing money into the product makes it... you have to think about money and all you want to think about is the code."
A technical solution could help verify bug fixes, reducing the need for an honor system. "If we can get to a place where we understand the output of the system more... maybe the bounty system doesn't need to rely on the honor system."
Finally, there's the question of how much interaction exists between the terminal and the code. Could there be a loop where the code runs, errors are detected, and suggestions are made? "How much interaction is there between the terminal and the code... can you do a loop where it runs the code and suggests how to change it?"
Database Branching, AWS, and Scaling Challenges
Branching databases is becoming a key feature for testing new code without affecting production. "You could sort of add a branch to the database... add a branch to the write-ahead log." This allows developers to test features safely, and AI agents might even use branching to test against different database branches. "Maybe the AI agents will use branching... test against some branch."
AWS remains the go-to for infrastructure, despite its notoriously difficult interface. "Whenever you use an AWS product you just know that it's going to work... it might be absolute hell to set it up." The reliability of AWS is unmatched, and if something goes wrong, "it's probably your problem."
Scaling brings its own set of challenges. As systems grow, unexpected issues like integer overflows start to appear. "You run into all of these with like... int overflows on our tables." Custom systems, like the retrieval system for computing a semantic index of codebases, are particularly tricky to scale. "Our retrieval system for computing a semantic index... has been one of the trickier things to scale."
Predicting where systems will break as they scale is nearly impossible. "It's very hard to predict where systems will break when you scale them... there's always something weird that's going to happen."
Client-Server Sync, Hashing, and Scaling Large Codebases
They're "very very paranoid about client bugs," so most details are stored on the server. The challenge is ensuring the local codebase state matches the server state. To solve this, they use a recursive hashing system. "For every single file, you can keep a hash... recursively do that until the top." This avoids the need to constantly download hashes for every file, which would create "ginormous network overhead" and strain the database. Instead, they reconcile a single hash at the root of the project and only dig deeper if there's a mismatch.
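The recursive hashing described above is essentially a Merkle tree. The sketch below models a project as nested dictionaries and uses SHA-256; both choices are assumptions for illustration, and a real system would store each node's hash rather than recompute it on every comparison.

```python
import hashlib
from typing import List, Union

Node = Union[str, dict]   # a file's contents, or a directory mapping name -> Node


def hash_node(node: Node) -> str:
    """Hash a file by its contents and a directory by its children's hashes."""
    if isinstance(node, str):
        return hashlib.sha256(b"file:" + node.encode("utf-8")).hexdigest()
    digest = hashlib.sha256(b"dir:")
    for name in sorted(node):
        digest.update(name.encode("utf-8") + b"\0")
        digest.update(hash_node(node[name]).encode("ascii"))
    return digest.hexdigest()


def find_changed(local: Node, remote: Node, prefix: str = "") -> List[str]:
    """Compare root hashes first and descend only where they differ."""
    if hash_node(local) == hash_node(remote):
        return []                       # whole subtree matches, stop here
    if isinstance(local, str) or isinstance(remote, str):
        return [prefix or "/"]          # a differing file (or file vs. directory)
    changed: List[str] = []
    for name in sorted(set(local) | set(remote)):
        path = f"{prefix}/{name}"
        if name not in local or name not in remote:
            changed.append(path)        # added or deleted entry
        else:
            changed.extend(find_changed(local[name], remote[name], path))
    return changed


local_tree = {"README.md": "hi", "src": {"a.py": "x = 1", "b.py": "y = 2"}}
server_tree = {"README.md": "hi", "src": {"a.py": "x = 1", "b.py": "y = 3"}}
print(find_changed(local_tree, server_tree))
```

When nothing has changed, a single comparison at the root settles it, which is what keeps the network and database load low.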
Scaling this system is tough, especially for companies with "enormous number of files." The bottleneck isn't storing data in the vector database but embedding the code. To handle this, they use a clever trick: caching vectors based on the hash of a given chunk. This way, when multiple users access the same codebase, they don't need to re-embed everything. "You just have some cache on the actual vectors computed from the hash of a given chunk."
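A content-addressed cache of this kind can be sketched as follows. The `toy_embed` function is a made-up stand-in for an expensive embedding model, and the in-memory dictionary stands in for whatever shared store a real deployment would use.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


class CachedEmbedder:
    """Embeds each distinct chunk once, keyed by the hash of its contents."""

    def __init__(self, embed_fn: Callable[[str], Vector]) -> None:
        self.embed_fn = embed_fn
        self.cache: Dict[str, Vector] = {}
        self.misses = 0   # number of times the real embedding ran

    def embed(self, chunk: str) -> Vector:
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.embed_fn(chunk)
        return self.cache[key]


def toy_embed(chunk: str) -> Vector:
    """Hypothetical placeholder for a real (slow, costly) embedding call."""
    return [float(len(chunk)), float(sum(map(ord, chunk)) % 97)]


embedder = CachedEmbedder(toy_embed)
user_a_chunks = ["def add(a, b):", "    return a + b"]
user_b_chunks = ["def add(a, b):", "    return a + b", "print(add(1, 2))"]

for chunk in user_a_chunks + user_b_chunks:
    embedder.embed(chunk)
print(embedder.misses, "embedding calls for", len(user_a_chunks + user_b_chunks), "chunks")
```

Keying on the chunk's hash rather than on the user or file path is what lets a second user of the same codebase reuse vectors that were already computed.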
Local vs. Cloud Models: Hardware Limitations and Future Solutions
Running models locally is tough. "More than 80% of our users are on Windows machines," and many of those machines aren't powerful enough. Even high-end devices like MacBook Pros struggle with large codebases. "It's not even a matter of if you're just a student... even the best programmer at a big company will have a horrible experience" trying to process everything locally.
As models grow, they need more resources. "These models are just bigger in total... they're going to need to fit often not even on a single node but multiple nodes." This makes local processing impractical for most users.
One potential future solution is homomorphic encryption. This would allow users to encrypt their data locally, send it to the cloud, and have it processed without exposing the data. "The server can run models that you cannot run locally on this encrypted data."
Privacy, Centralization, and Context Challenges in AI Models
Homomorphic encryption could be a game-changer for privacy-preserving machine learning, but it's still in the research phase due to the high overhead. "If you can make that happen, it would be really impactful," but right now, "the overhead is really big."
As AI models become more economically useful, more of the world's data will flow through a few centralized actors. This raises concerns about surveillance and misuse. "More and more of the world's information and data will flow through one or two centralized actors," and while surveillance might start for good reasons, "you start doing bad things with a lot of the world's data."
Managing context in AI models is also tricky. Including too much context can slow down the models and confuse them. "The more context you include, the slower they are," and "for a lot of these models, they get confused if you have a lot of information in the prompt."
Improving Retrieval Systems and Post-Training Models for Codebases
The team is focused on improving retrieval systems, aiming for "better embedding models" and "better rankers." They want to pick the most relevant parts of the codebase for what you're doing, and they believe they can do this much better.
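A common way to combine an embedding model with a ranker is a two-stage pipeline: a cheap vector search narrows the codebase to a few candidates, then a slower scorer reorders them. The sketch below assumes that structure. The source does not describe Cursor's actual pipeline, and the vectors, `toy_ranker`, and function names are made up for illustration.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Vector, index: List[Tuple[str, Vector]], k: int) -> List[str]:
    """Stage 1: cheap similarity search over precomputed chunk vectors."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def rerank(
    query: str,
    candidates: List[str],
    score: Callable[[str, str], float],
    top_n: int,
) -> List[str]:
    """Stage 2: a slower, more accurate scorer reorders the short list."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)[:top_n]


def toy_ranker(query: str, chunk: str) -> float:
    # Hypothetical stand-in for a learned ranker: counts shared words.
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


index = [
    ("def parse config file", [0.9, 0.1, 0.0]),
    ("def load config from env", [0.8, 0.2, 0.1]),
    ("def render html page", [0.0, 0.1, 0.9]),
]
query, query_vec = "load config from env", [1.0, 0.1, 0.0]
candidates = retrieve(query_vec, index, k=2)
best = rerank(query, candidates, toy_ranker, top_n=1)
print(best)
```

In this toy data the vector search ranks the wrong chunk first and the ranker corrects it, which is the kind of gain a better ranker is meant to deliver.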
One of the big ideas they're exploring is infinite context windows. The challenge isn't just making the context windows infinite but also ensuring the model can pay attention to that infinite context. "Can you make the model actually pay attention to the infinite context?" And if so, can you cache it to avoid recomputing everything?
They're also looking into whether learning information at the weight level (fine-tuning) leads to a different kind of understanding than in-context learning. "It might be that you actually get sort of a qualitatively different type of understanding if you do it more at the weight level."
A key experiment involves post-training models to understand specific codebases, like VS Code. The idea is to continue pre-training with general code data and then fine-tune it with questions about a particular repository. They are even considering using synthetic data to generate questions about the code. "What if you could actually specifically train or post-train a model such that it really was built to understand this code base?"
Finally, there's a debate on whether the model should handle everything end-to-end, including retrieval, or if retrieval should be separated from the Frontier Model. "Do you want to separate the retrieval from the Frontier Model?"
Test Time Compute and Process Reward Models
The team is exploring test time compute as a way to improve model performance without increasing model size. Instead of training a larger model, they suggest running the same model for longer to achieve the quality of a much larger one. "We could perhaps use the same size model and run it for longer to get an answer at the quality of a much larger model." This approach could help overcome the data wall they're hitting, where scaling up data and model size is becoming increasingly difficult. "It's going to be hard to continue scaling up this regime."
A key challenge is the model routing problem—figuring out which model to use for different levels of task complexity. "How do you figure out which problem requires what level of intelligence?" This is still an open research problem, and no one has fully solved it yet. "I don't think anyone's actually cracked this model routing problem quite well."
They also discuss the difference between outcome reward models and process reward models. Outcome reward models grade the final result, while process reward models evaluate the chain of thought leading to the result. "Process reward models instead try to grade the chain of thought." OpenAI has done some preliminary work on this, using human labelers to create a large dataset for grading chains of thought.
Process Reward Models and OpenAI's Chain of Thought Decision
Process reward models are currently used to grade outputs from language models, but there's potential for more. Right now, people "sample a bunch of outputs" and use these models to grade them. The real excitement lies in using process reward models for tree search, where each step in the chain of thought is evaluated. This could allow models to "branch out and explore multiple paths" and then use the process reward models to assess which path is best. The challenge is figuring out how to train these models effectively for this purpose.
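The two uses described above, grading sampled outputs and guiding a search over steps, can be contrasted in a small sketch. Everything here is a toy: the "reward model" is a lookup table, chains are lists of labelled steps, and multiplying per-step scores is an assumed aggregation rule rather than something the speakers specify.

```python
from typing import Callable, List

Chain = List[str]
StepScorer = Callable[[Chain, str], float]


def chain_score(chain: Chain, step_score: StepScorer) -> float:
    """Score a chain as the product of its per-step scores (assumed aggregation)."""
    total = 1.0
    for i, step in enumerate(chain):
        total *= step_score(chain[:i], step)
    return total


def best_of_n(chains: List[Chain], step_score: StepScorer) -> Chain:
    """Current practice: grade complete sampled outputs and keep the best."""
    return max(chains, key=lambda c: chain_score(c, step_score))


def beam_search(
    propose: Callable[[Chain], List[str]],
    step_score: StepScorer,
    depth: int,
    beam_width: int,
) -> Chain:
    """Tree search: branch at every step, keep the best partial chains."""
    beams: List[Chain] = [[]]
    for _ in range(depth):
        expanded = [beam + [step] for beam in beams for step in propose(beam)]
        if not expanded:
            break
        expanded.sort(key=lambda c: chain_score(c, step_score), reverse=True)
        beams = expanded[:beam_width]
    return beams[0]


# Toy stand-ins for a process reward model and a step generator.
SCORES = {"good": 0.9, "okay": 0.6, "bad": 0.1}


def toy_step_score(prefix: Chain, step: str) -> float:
    return SCORES[step]


def toy_propose(prefix: Chain) -> List[str]:
    return ["good", "okay", "bad"]


sampled = [["okay", "okay"], ["good", "bad"], ["good", "okay"]]
picked = best_of_n(sampled, toy_step_score)
searched = beam_search(toy_propose, toy_step_score, depth=2, beam_width=2)
print(picked, searched)
```

Best-of-n can only choose among chains that were already sampled, while the search can assemble a chain that none of the samples contained. That is the appeal, and also why the quality of the step-level scores matters so much.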
OpenAI made a controversial decision to hide the chain of thought from users. They argue that showing it could make it easier for people to replicate their technology. "It might actually be easier if you had access to that hidden chain of thought to replicate the technology." This decision mirrors another move by some API providers, who used to offer access to log probabilities but later removed it. The speculation is that this was done to prevent users from distilling the model's capabilities.
As for Cursor's integration of the o1 model, it's still a work in progress. "o1 is not part of the default Cursor experience in any way," and the team is still figuring out how to make it useful in the editor. They haven't yet found a way to integrate it into daily use, but they're exploring possibilities.
Cursor's Future and Synthetic Data Types
Concerns about Cursor's future, especially with GitHub Copilot's advancements, are addressed head-on. The speaker believes the space is still in its early stages, and the potential is massive. "The ceiling here is really, really, really incredibly high." The best product in 3-4 years will be far more useful than anything today. The key is continuous innovation: "If you stop innovating on the product, you will lose." This creates opportunities for startups to compete by building something better, even against established players.
The speaker then dives into synthetic data, breaking it down into three main types:
- Distillation: A language model outputs tokens or probability distributions, and a less capable model is trained on this. "This approach is not going to get you a more capable model than the original one," but it's useful for specific tasks.
- Bug detection: It's easier to introduce bugs than to detect them. A model can generate bugs, and synthetic data can train another model to detect them. "It's a lot easier to introduce reasonable-looking bugs than it is to actually detect them."
- Verification-based generation: Models generate data that can be easily verified, like in math or formal language. "You can have an okay model generate a ton of rollouts and then choose the ones that have actually been verified."
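The third type can be sketched as a generate-then-filter loop. The example below is a toy under stated assumptions: the "okay model" is a random guesser, the task (finding a factor of a number) is chosen only because answers are cheap to check, and all function names are hypothetical.

```python
import random
from typing import Callable, List, Tuple


def guess_factor(n: int, rng: random.Random) -> int:
    """Hypothetical weak generator: guesses a candidate factor at random."""
    return rng.randint(2, n - 1)


def is_nontrivial_factor(n: int, f: int) -> bool:
    """Cheap, exact verifier standing in for a test suite or proof checker."""
    return 1 < f < n and n % f == 0


def collect_verified(
    problems: List[int],
    generate: Callable[[int, random.Random], int],
    verify: Callable[[int, int], bool],
    rollouts: int,
    seed: int = 0,
) -> List[Tuple[int, int]]:
    """Sample many rollouts per problem and keep only the verified ones."""
    rng = random.Random(seed)
    dataset = []
    for problem in problems:
        for _ in range(rollouts):
            answer = generate(problem, rng)
            if verify(problem, answer):
                dataset.append((problem, answer))
    return dataset


dataset = collect_verified(
    [91, 221, 323], guess_factor, is_nontrivial_factor, rollouts=200
)
print(len(dataset))
```

Most rollouts are discarded, but every pair that survives is known to be correct, so the filtered set can serve as training data even though the generator itself is unreliable.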
Verification, Generation, and the P vs NP Question
The conversation dives into the challenge of creating a perfect verifier for open-ended tasks. While verification is easier in some cases, like coding, it's much harder for long-horizon tasks. "Having the perfect verifier feels really, really hard to do with just like open-ended miscellaneous tasks." However, if verification is easier than generation, it could lead to "massive gains."
Reinforcement Learning from Human Feedback (RLHF) is discussed as a way to improve models by training them on human feedback. "RLHF is when the reward model you use is trained from some labels you've collected from humans giving feedback." There's potential for recursive improvement if the model finds verification easier than generation. "It may work if the language model has a much easier time verifying some solution than it does generating it."
Ranking is also easier than generation, and the analogy to P vs NP problems is brought up. "If you believe P does not equal NP, then there's this massive class of problems that are much, much easier to verify given a proof than actually proving it."
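Subset sum, a standard NP-complete problem, gives a concrete feel for this asymmetry. In the sketch below, checking a proposed answer takes one pass over it, while the brute-force solver may have to try up to 2^n subsets. The function names and the sample numbers are illustrative choices, not something from the conversation.

```python
from itertools import combinations
from typing import List, Optional, Tuple


def verify(numbers: List[int], target: int, certificate: Tuple[int, ...]) -> bool:
    """Check a proposed subset (as distinct indices) in linear time."""
    distinct = len(set(certificate)) == len(certificate)
    in_range = all(0 <= i < len(numbers) for i in certificate)
    return distinct and in_range and sum(numbers[i] for i in certificate) == target


def solve(numbers: List[int], target: int) -> Tuple[Optional[Tuple[int, ...]], int]:
    """Brute-force search; returns the subset found and how many were tried."""
    tried = 0
    for size in range(len(numbers) + 1):
        for indices in combinations(range(len(numbers)), size):
            tried += 1
            if sum(numbers[i] for i in indices) == target:
                return indices, tried
    return None, tried


numbers = [3, 34, 4, 12, 5, 2]
certificate, tried = solve(numbers, target=9)
print(certificate, tried, verify(numbers, 9, certificate))
```

The search has to work through many candidate subsets, while confirming the one it returns is a single sum. A model in the same position could usefully grade or rank answers it could not have produced itself.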
The conversation takes a philosophical turn, wondering if AI could solve the P vs NP problem. "I wonder if the same thing will prove P not equal to NP or P equal to NP." And if it does, "Who gets the credit? Another open philosophical question."
Scaling Laws and Distillation
The original scaling laws paper by OpenAI had some issues, particularly with learning rate schedules, which the later Chinchilla work corrected. Since then, people have shifted focus from just compute, parameters, and data to optimizing for inference budget. "People start now optimizing more so for making the thing work really well given a given inference budget."
There are more dimensions to consider now, like inference compute and context length. "Inference compute is the obvious one... context length is another obvious one." For example, if you care about long context windows, you might train a model that’s less efficient during training but much cheaper and faster at inference.
Distillation is another promising approach. By training a large model and then distilling it into a smaller one, you can get a faster, more efficient model. "Distillation gives just a faster model." This method could also help overcome the data wall by extracting more signal from the same data. "Distillation... is perhaps another way of getting over... the data wall."
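One common way to set up distillation is to train the small model to match the large model's full next-token distribution rather than a single correct token. The sketch below computes that objective as a KL divergence over temperature-softened distributions. It is a generic formulation, not a description of what the team does, and the logits and temperature are made-up values.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [v / total for v in exps]


def distillation_loss(
    teacher_logits: List[float],
    student_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) over softened next-token distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [4.0, 1.0, 0.5]        # hypothetical logits over a 3-token vocabulary
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]
print(distillation_loss(teacher, close_student) < distillation_loss(teacher, far_student))
```

Because the target is a whole distribution, each training position tells the student how plausible every alternative token was. This is one way to read the claim that distillation extracts more signal from the same data.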
Compute, Engineering, and the Limits of Scaling
The conversation dives into whether the real limitation in AI development is compute or ideas. While more compute allows for more experiments, the belief is that "we are limited in terms of ideas that we have." Even with all the compute and data in the world, the bottleneck often comes down to "really good engineering." The engineering effort required to make models work at scale is immense, from writing code to optimizing GPU performance. "There's so much work that goes into research that is just like pure, really, really hard engineering work."
The strategy so far has been to focus on scaling up models, taking the "low-hanging fruit" first. As long as scaling works, "there's no point of experimenting with new ideas." But eventually, scaling hits a wall, and at that point, "new ideas are probably needed." When you're spending trillions, it's time to "reevaluate your ideas" because you're likely "idea-limited" by then.
The Future of Programming: Control and Abstraction
The team is excited about a future where the programmer stays in control, with an emphasis on speed, agency, and the ability to modify anything. They’re not thrilled about the idea of talking to a computer like an engineering department, as it sacrifices control and specificity. "It’s much harder to be really specific when you’re talking in a text box," they say, and that leads to giving up important decisions to the AI.
Engineering, they argue, isn’t just about implementing a pre-written spec. It’s about "tons of tiny micro-decisions" and trade-offs between speed, cost, and other factors. They believe that as long as humans are designing software, the human should stay in the driver’s seat.
One intriguing idea is controlling the level of abstraction in a codebase. You could look at the code in the form of pseudo-code, edit it, and have those changes reflected in the formal code. "It would be nice if you can go up and down the abstraction stack," they suggest, though they admit it’s still a fuzzy concept.
While AI can help with specific tasks, like fixing well-specified bugs, most programming requires human decision-making. "That’s not most of programming," they say, and it’s not the kind of programming they believe people value.
The Evolution of Programming: Fun, Speed, and AI
The team is excited about how programming is evolving. "Programming today is way more fun than back then," they say, with less boilerplate and more focus on creativity and speed. The ability to build things quickly and maintain individual control is being "turned up a ton," making it a great time to be a software developer.
AI is playing a big role in this shift. They describe a recent experience where they spent five days on a large migration in their codebase. In the future, they imagine showing a few examples to an AI, which would then apply the changes across the entire codebase in minutes. "You can iterate much, much faster," they say, allowing for more experimentation and less upfront planning.
There’s also speculation about programming moving towards natural language, where you might "operate in the design space of natural language." But there’s concern about losing creative control if programming becomes too abstracted.
Finally, they touch on how AI might expand the pool of people who can become great programmers. "The kind of person that should be able to do great programming might expand," they suggest, as AI tools lower the barrier to entry.
Passion, Intent, and the Future of Hybrid Programming
The best programmers are the ones who truly love coding. Some people on the team, after a full day of work, go home and "boot up Cursor" to work on side projects until 3 a.m. For them, coding is more than a job—it's a passion. "When they're sad, they say, 'I just really need to code.'" This obsession with programming is what sets them apart.
But it's not just about typing lines of code. Pressing "tab" isn't just a shortcut; it's about injecting intent into the code. "You're shaping the thing that's being created," they explain. Programming is evolving into a higher-bandwidth form of communication, where it's less about typing and more about expressing what you want to create. "Typing is much lower bandwidth than communicating intent."
Looking ahead, the team envisions a future where human-AI collaboration will redefine programming. They're building a "human-AI programmer" that will be "an order of magnitude more effective than any one engineer." This hybrid engineer will combine human creativity with AI's speed, iterating at the speed of thought and out-engineering even the best pure AI systems.
Conclusion
The future of programming is bright, with tools like Cursor making coding faster and more fun. Embracing AI and innovative ideas will redefine how we code, making it accessible and enjoyable for everyone.