Karpathy: AI is changing Software Engineering

Marton Trencseni - Thu 03 July 2025 - AI

Karpathy's original talk

Andrej Karpathy, formerly Director of AI at Tesla (where he led Autopilot/FSD development) and now working on an AI teaching startup, recently shared his insights on how AI is influencing Software Engineering. The 40-minute talk is worth watching in its entirety.

Slides from the talk are also online. His main points are summarized below:

  • Three historical epochs.

    • Software 1.0 → hand-written, instruction-level code.
    • Software 2.0 → neural-network weights learned from data (e.g., Tesla replaced C++ rules with vision nets).
    • Software 3.0 → large language models (LLMs) that you “program in English”; the prompt is now the source code.
  • LLM = new computer + new OS – and it feels like the 1960s mainframe era. Karpathy likens a frontier model to an IBM System/360-style mainframe: a single, capital-intensive machine that many users time-share remotely. Compute is metered per token like CPU-seconds, the context window is the RAM, and tool calls are the syscalls. Just as minis and PCs eventually escaped the glass house, he expects cheaper, smaller models to follow.

  • Consumer-first diffusion. Unlike every prior tech wave, chatbots hit billions of end-users before enterprises; CIO uptake is now in catch-up mode.

  • “Iron-Man-suit” product pattern. The LLM apps that really work today keep a human in the verify loop, expose an autonomy slider, and design UIs for rapid audit/rollback, amplifying people instead of chasing full autonomy.

  • “Vibe-coding.” Because “the hottest new programming language is English,” domain experts can build by describing the feel they want. The hard part that remains is DevOps, auth, payments, etc., which is fertile ground for the next generation of agentic tools.

  • Build an LLM-native web. Karpathy calls for infrastructure tweaks so agents can navigate deterministically: machine-readable llms.txt files, code-first docs, actionable API snippets, and utilities like Gitingest that flatten repos into single text blobs (a minimal sketch of the idea follows this list).

  • Jagged intelligence ⇒ design for fallibility. Models have super-human recall but lack common-sense guarantees. Success means pairing their brute-force creativity with tight human feedback loops and robust evaluation harnesses.

  • Career takeaway. Fluency across 1.0, 2.0, and 3.0 stacks is becoming table-stakes; every business process is now just a prompt away from reinvention.
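
The repo-flattening idea is easy to picture in code. Below is a minimal Python sketch of what such a utility does (my own illustration, not the actual Gitingest implementation): walk a repository, keep the text files, and concatenate them into a single blob, with path headers so the model knows where each snippet came from.

```python
from pathlib import Path

# File types we treat as text worth including; adjust to taste.
TEXT_EXTENSIONS = {".py", ".md", ".txt", ".toml", ".yaml", ".yml", ".json"}

def flatten_repo(root: str) -> str:
    """Concatenate a repo's text files into one LLM-friendly blob."""
    root_path = Path(root)
    parts = []
    for path in sorted(root_path.rglob("*")):
        if path.is_file() and path.suffix in TEXT_EXTENSIONS:
            rel = path.relative_to(root_path)
            parts.append(f"===== {rel} =====\n{path.read_text(errors='ignore')}")
    return "\n\n".join(parts)

if __name__ == "__main__":
    blob = flatten_repo(".")  # flatten the current directory
    print(f"{len(blob):,} characters of context")
```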

Hacker News discussion

The video, of course, sparked a lively discussion on Hacker News, also summarized below:

  • Most commenters view “Software 2.0/3.0” as add-ons, not replacements — deterministic code still drives kernels, browsers, games, and business apps.
  • Hardware limits and the scarcity of proprietary, labeled data are the main blockers to widespread NN adoption; without them, traditional code or small classical ML models usually win.
  • Karpathy jumps into the thread to clarify that his version numbers are categories, not a quality ladder; they can and will coexist.
  • The standout production trick is structured output / function-calling / schema-constrained decoding: force the model to emit JSON that fits a type definition, then post-process. This makes LLMs usable as parsers, classifiers, and extractors without any fine-tuning (a minimal sketch follows this list).
  • Even with strict schemas, models still hallucinate or drop fields, so commenters insist on rigorous validation layers and domain-specific checks on top.
  • A veteran engineer argues that neural pipelines spread only where two conditions hold: a formal spec is impossible and data is abundant; otherwise deterministic code stays.
  • Several people propose a new split: deterministic vs. probabilistic programming. Prompting an LLM is treated as its own paradigm alongside imperative, functional, and declarative styles.
  • Others note that LLMs can generate synthetic data to train smaller, cheaper models, creating a “3.0 bootstraps 2.0” workflow.
  • Skeptics draw crypto-boom parallels, warning that hype obscures reliability issues and that LLM-written code is often brittle without heavy human QA.
  • Many engineers dislike the version-number branding itself: it suggests deprecation, yet 1.0, 2.0, and 3.0 will likely run side by side for decades.
  • Hands-on reports of coding agents are mixed: some developers already lean on them for boilerplate and unit tests; others are annoyed by constant context misses (e.g., spitting out React in a Vue repo).
  • Consensus emerges that humans must stay in the verify loop. The win is shrinking the review cycle, not eliminating it—and junior-to-senior pipelines (and perhaps entire desk-job categories) could be disrupted in the process.
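
To make the structured-output pattern concrete, here is a minimal Python sketch under my own assumptions: call_llm() is a placeholder for whatever provider is in use, and Invoice is an invented example schema. The idea is to pin the model to a JSON shape in the prompt, parse the reply, and then layer schema and domain checks on top, as the commenters recommend.

```python
import json
from dataclasses import dataclass

def call_llm(prompt: str) -> str:
    """Placeholder: wire this up to your LLM provider; it should return raw text."""
    raise NotImplementedError

@dataclass
class Invoice:
    vendor: str
    total: float
    currency: str

def build_prompt(text: str) -> str:
    return (
        "Extract the invoice details from the text below.\n"
        'Reply with JSON only, matching {"vendor": str, "total": number, "currency": str}.\n\n'
        "Text:\n" + text
    )

def extract_invoice(text: str) -> Invoice:
    raw = call_llm(build_prompt(text))
    data = json.loads(raw)  # raises if the model did not return valid JSON
    # Schema-level checks: models still drop fields or add extra ones.
    if not isinstance(data.get("vendor"), str) or not data["vendor"].strip():
        raise ValueError("missing or empty vendor")
    total = float(data["total"])  # KeyError/ValueError if total is missing or non-numeric
    # Domain-specific checks on top, per the HN advice above.
    if total < 0:
        raise ValueError("negative total")
    if data.get("currency") not in {"USD", "EUR", "GBP"}:
        raise ValueError(f"unexpected currency: {data.get('currency')!r}")
    return Invoice(vendor=data["vendor"].strip(), total=total, currency=data["currency"])
```

In production the prompt-only contract would typically be replaced by the provider's native function-calling or JSON mode, but the validation layer on top stays the same.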

My thoughts

My thoughts on this topic:

1. “Software 2.0” hasn’t eaten “Software 1.0.” I don’t agree with Karpathy that neural-network code is replacing deterministic code across the board. Yes, Tesla FSD leaned heavily on NNs, but the Linux kernel, Chrome, Word, and even AAA games still run on classic hand-written logic. NN adoption remains limited to special domains—self-driving, speech-to-text, Photoshop filters, and the like. In fact, I’d argue that LLMs (“Software 3.0”) have stolen the spotlight: most teams are now busy figuring out how to ship LLMs to production. As a Data Science manager, I see us using traditional ML (random forests, etc.) for most real-world problems—with the code itself written via LLM assistance. “Small-data” tasks rarely benefit from a big NN, and foundational NN models demand bespoke large datasets that most companies simply don’t have. My favorite proof-point: adaptive user interfaces. Despite oceans of interaction data, I’ve never seen an app that greys out seldom-used buttons or re-flows its UI on the fly. Building such a custom NN is still a highly non-trivial engineering effort.

2. “Software 1.5”: LLM-assisted code generation is devouring “Software 1.0.” LLM assistance is everywhere: snippets, recommendations, unit tests, bug hunts, linting, code reviews. Nine months ago Google said that more than 25% of its new code was AI-generated; if you include assistants, linters, and auto-review tools, the real share is surely higher, and heading for essentially 100% in tech firms. I’m not predicting code without humans, just that every line will be touched by an AI co-pilot before it lands in production.

3. “Software 3.0” is bending my stack choices.

  • Language: I used to favor JavaScript end-to-end (frontend and backend); now, because Python has the richest LLM tooling, I’d pick Python for the backend.
  • Architecture: I’ve always loved monoliths, but huge codebases overwhelm an LLM’s context window. In a startup setting I’d choose a microservices architecture because smaller, self-contained services are easier for LLMs to reason about, even if orchestration gets complex later.
  • Testing: Today I put more energy into generating robust unit tests than before, for extra safety when working with an AI co-pilot (a small example below). I am not a fan of “vibe-coding” (blindly pasting AI output). The human still owns the final diff.
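
As a small illustration, here is the kind of test table that is cheap for a co-pilot to propose and quick for a human to review; slugify() and its cases are invented for the example, not taken from a real project.

```python
import re

import pytest

def slugify(title: str) -> str:
    """Lower-case a title and collapse runs of non-alphanumerics into '-'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

@pytest.mark.parametrize(
    "title, expected",
    [
        ("Hello, World!", "hello-world"),
        ("  spaces   everywhere ", "spaces-everywhere"),
        ("already-a-slug", "already-a-slug"),
        ("", ""),     # edge case: empty input
        ("!!!", ""),  # edge case: nothing survives
    ],
)
def test_slugify(title, expected):
    assert slugify(title) == expected
```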

4. The 1960s mainframe analogy lands. Today’s chat boxes are the punch cards of our era. As models shrink, inference speeds up, and tool calling matures, we’ll invent richer interfaces—the LLM “Excel moment” hasn’t arrived yet, even though I already lean on LLMs daily.

5. Keep humans in the verify loop. Karpathy’s observation that successful LLM apps keep a human checker in the loop is spot-on. The big opportunity today is tooling that lets us verify faster — essentially, we still need to figure out how to fuse LLMs and GUIs so that review, rollback, and feedback cycles become rapid.

Conclusion

Bottom line: neural nets haven’t displaced classical code, but LLM-powered tooling is rapidly eating the process of writing that code, steering our stacks toward languages, architectures, and workflows that models can digest. The winners will be products and teams that embrace AI co-pilots while keeping fast and effective human verification at the heart of every release.