Jeff Gabriel

A Glimpse of the Future

published 2025-04-16 11:28:00 +0200

To better understand coding with AI tooling, I have been working fairly intensively on a hobby project this past month. My experience seems to echo what I’ve seen others posting lately: these are tools, and as such you need to think carefully about how you work with them. The results can be impressive, and learning to use these tools is critically important for all software engineers.

I do want to call out the hype suggesting developers are now optional, or that great things are being built with no developers at all. That’s either wishful thinking, strategic marketing, or a prelude to some very sad retrospectives. This isn’t self-preservation; I am an engineering executive who doesn’t code at work and would be a hero to some if I claimed we no longer needed developers. Of course, this doesn’t stop anyone from claiming the tools will someday be good enough - and hype doesn’t yield to Hofstadter’s Law when it comes to the advancement of technology.

I will summarize my experiences of the last month at the end of this post to explain where I think the tools can get better, and where I think a different sort of leap forward will be required.

The Project

I decided to build a game I had worked on before in the artisanal style of actually coding it myself in Unity. This allowed me to understand in advance the relative complexity of the task, and to rely on requirements I had previously worked out. I set out to entirely vibe-code the game, committed to using the tools as much and as creatively as possible.

My previous attempts to code with AI were primarily about calling APIs on Ollama or OpenAI, pasting things into windows, and working the code back into my project. With the advances in Cursor and Claude Code, I was able to work with the tools much more seamlessly. Each tool, even OpenAI’s project system, let me create contextual documents to help guide development when my prompts were insufficient. I worked with ChatGPT’s o1 reasoning model to choose an architecture, create a development plan, and prepare some additional materials, like formatting lists into JSON configuration files. Just as I worked with an OpenAI project to generate a good plan, I used the architecture guidance and other materials to create rules in Cursor and a CLAUDE.md file for Claude Code. To be clear, there is no real reason I could find to use different tools other than wanting to test out their strengths and weaknesses.
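
As a concrete example of that “additional materials” step, here is a minimal sketch of the kind of script involved in turning a hand-maintained list into a JSON configuration file. The unit names and the `units.json` filename are hypothetical, not taken from the actual project.

```python
import json

# Hypothetical example: convert a hand-maintained list of game units into a
# JSON configuration file that both the game and the AI agent can consume.
units = [
    {"name": "Scout", "move": 3, "attack": 1},
    {"name": "Guard", "move": 1, "attack": 2},
]

with open("units.json", "w", encoding="utf-8") as f:
    json.dump({"units": units}, f, indent=2)
```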

The Learning

There are themes in the lessons I learned, and I’ll start there. Then it will be most straightforward to just provide discrete lessons in list form.

Your Tool is Kinda Dumb

Okay, that’s not totally fair. Your AI tool isn’t dumb - it just reflects its machine nature more than you might expect, especially when you’re chatting with it like a colleague and watching it spin up solutions on demand. It knows all kinds of programming languages, APIs and SDKs for tools you’ve never used, and algorithms you’ve never heard of, and it can code whole solutions in seconds. However, if you vibe code your way into a relatively complex problem space you will realize that the code is convoluted, duplicative, imprecise, and downright incorrect. Some have argued that it doesn’t write pretty code, but who cares anymore - writing pretty code is about humans reading it, after all. We’re in a post-human paradigm and we just need to get with the times! That would be great or terrible if true, depending on your view of programmers’ value, but it turns out that humans do need to read the code. And it isn’t just about humans reading it: clean code is easier to contextualize and prompt against, and is often a path to truly efficient execution.

AI Tools Need Small Tasks

Much of the work of the programmer is thinking, not typing. The reason a programmer spends so much time thinking, and loses so much productivity when you interrupt their thinking, is that they have to work through the complex interactions of the code they want to write, the code which has been written, and the many functional and non-functional requirements against which the code is being written.

The AI doesn’t think at all, no matter how tempted you are to anthropomorphize its actions. When left to its own devices, time and again it will invent whole new paradigms for problems already solved elsewhere. It will load state continuously without thinking about efficiency. It will use different naming conventions and folder structures in different parts of the codebase. It will write tests that test no real behavior. It will be satisfied with solutions that don’t compile or don’t run. It will create new types, and then transformations for those types into other types with an identical structure. It will solve problems in unique(ly flawed) ways for which computer science has well-worn solutions. It did all of these terrible things and more when left to just run command after command.

Maybe you could look the other way at many of these issues if you really never had to read the code or run the solution at scale. However, I ultimately needed to debug these issues because the spaghetti code (or lava code) overwhelmed itself and came crashing down. After about my 16th hour of trying to work only through prompts, I had to dig in and fix the solution myself so I could get back to work. What’s more, many of those 16 hours were themselves wasted prompting my way out of terrible solutions. If I weren’t an experienced software engineer, these would have been project-ending problems. If I actually needed the software for a commercial product, getting it to compile again would have been the least of my worries.
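
To make one of those failure modes concrete, here is a contrived sketch (not code from the project) of the duplicate-types-plus-transformation pattern described above:

```python
from dataclasses import dataclass

# Two structurally identical types the agent might invent in different files...
@dataclass
class TilePosition:
    row: int
    col: int

@dataclass
class GridCoordinate:
    row: int
    col: int

# ...plus a transformation that only exists because of the duplication.
# A single shared type would make the second class and this function unnecessary.
def to_grid_coordinate(pos: TilePosition) -> GridCoordinate:
    return GridCoordinate(row=pos.row, col=pos.col)
```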

Your Job Has Changed

What I was most surprised to learn wasn’t what the AI could do, but what it made me realize I needed to do differently. Making the most of the AI tools means changing what I expect my role to be as a software engineer. I don’t need to type very much, and I don’t have to do all the thinking, but my coding partner is naive and makes a mess of things when I am not paying attention. My job is to provide sharp context, craft clean prompts, and review every line like it was submitted by an intern with boundless enthusiasm and no sense of architecture.

The importance of both good context and well-crafted prompts is all about focus for the agent. Good context meant taking time out to write down hints about coding rules, important requirements that are consistent across the codebase, and the target architecture. In a large existing codebase I could see taking the time to build up docs on the application, the large- and small-scale architecture, the shared components the team uses, security limitations, and so on. But again, it’s about focusing the job you want the machine to do, and also learning to be careful with how you phrase your requests. The machine is wildly credulous with regard to your commands. A simple redirection such as “wouldn’t it make sense to merge class X and class Y?” turned into a mess of new code, because the correct answer was “no”. In trying to test the edges of that credulity, I asked the machine whether a function that is just a standard grid distance calculation was off by one in its count. Sure enough, the agent declared it totally wrong and rewrote it, even though it was not wrong in the first place. I had to load prompts with important information (which I should have added to the static rules/context), such as the fact that the game board is a triangular grid and standard graph-theory algorithms apply.
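
For reference, the “standard grid distance calculation” in question was nothing exotic. A minimal sketch of the usual graph-theory approach (breadth-first search over tile adjacency, written here with hypothetical names rather than the project’s actual code) looks something like this:

```python
from collections import deque

def grid_distance(start, goal, neighbors):
    """Shortest number of steps from start to goal, via breadth-first search.

    `neighbors(tile)` returns the tiles adjacent to `tile`. On a triangular
    grid those are simply the tiles sharing an edge, so the same standard
    algorithm applies unchanged.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        tile, dist = queue.popleft()
        for nxt in neighbors(tile):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # goal unreachable from start
```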

PR reviews (or commit reviews, as was really the case here) became my most important job during this exercise. It’s already an important job in a well-run engineering organization, but it is likely to become a major focus of the work as AI tools are used more broadly. There are two reasons for this. First, you simply must review what the machine did (or planned) in detail at every step, because you cannot afford to become unfamiliar with the codebase. Not only will real production software (which this project didn’t worry about at all) require intensive direction to retain the ‘ilities’, there will come a point where you need to debug what the machine cannot. Second, in my experience you need a lot more commits and PRs. Because coding the solution is relatively cheap, and because the machine sometimes makes big mistakes, it made sense to commit every time we arrived at a reasonable functional waypoint so that I could easily declare bankruptcy on a new direction and start over from the last commit. Now, this workflow came up largely because I let the agent-driven changes roll forward. Maybe every change should be approved and added individually, but my experience was that this was slower than reviewing the agent’s full attempt at a change and tweaking it. There were differences here between how Cursor works and how Claude Code works as well, so more time in a professional setting would be required to form a firm opinion.
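
The waypoint habit itself is nothing more than ordinary git hygiene applied more aggressively than usual. A rough sketch of the two moves involved (wrapped in a hypothetical helper, not anything the tools provide) would be:

```python
import subprocess

def checkpoint(message: str) -> None:
    """Commit the current working state as a waypoint you can safely return to."""
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)

def declare_bankruptcy() -> None:
    """Throw away the agent's latest attempt and return to the last waypoint."""
    subprocess.run(["git", "reset", "--hard", "HEAD"], check=True)
    subprocess.run(["git", "clean", "-fd"], check=True)  # also drop untracked files the agent created
```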

Additional Lessons Learned

Here are a few quick-hit lessons that didn’t quite fit elsewhere but were too useful (or expensive) to ignore:

  1. In Cursor in particular, it’s important to choose the right mode (agent/ask/manual) for each prompt.
  2. GPT models are very useful for creating the documents that feed the rules for your coding agents. Spending time in a GPT project where you’ve loaded this information, asked questions of it to see where it falls apart, and iterated with additional documents before you start coding is time well spent.
  3. The models’ ability to make use of screenshots and console logs is impressive. Feeding these in as context is useful even when you are just asking for direction or feedback.
  4. Claude Code seemed slightly better at getting itself out of trouble. It was also more likely to limit the amount of change it would suggest at a time, making the review cycle smoother and more symbiotic. Yet in its current (beta) state you have to have the IDE open anyhow, and giving file context to Cursor inside the IDE kept me there most of the time.
  5. Claude Code was also very expensive relative to Cursor. I spent $20 for a month of Cursor with 24+ hours of coding, and blew through $35 of Claude Code credits doing 2-3 hours of coding.

The Future?

I have no special insight here, but I do have some opinions. While the models and the related tools are impressive and capable of improving productivity, there are obvious areas to improve in order to have more autonomous coding workflows that would truly unleash the machines to clear your backlog. The near future will likely see improvements to automated context gathering. Context windows are growing, context caches are improving, and model costs are falling, so it may become fine to include very large codebases in the context cache. Large context is not always needed, though, so the selection of relevant context can also improve - even if that just means the machine prompting the user to include something rather than running ahead without it, which is a current form of hallucination that reduces effectiveness.

Targeted improvements to mimic a real software engineer’s mental model seem reasonable as well. I mean simple things like “the compiler is always right”, so always make sure your change compiles first; or the importance of a good test, and ways to affirm that a test actually challenges behavior. There are nuances here that I think will not improve, but an agent could be more insistent about tests, and about using tests to affirm the correctness of new software. More nuanced rules of the programmer mind, like “when debugging, look first at the last thing changed”, are still simple rule additions for an agent, yet they were totally missed in my experience: the agent would run off and change another five files when all that was needed was a single-character change to the last thing added.
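
To illustrate the “compiler is always right” rule, the kind of gate an agent workflow could enforce is not complicated. The build and test commands below are placeholders for whatever a given project uses, not features any current tool ships:

```python
import subprocess

# Placeholder commands; substitute whatever builds and tests your project.
BUILD_CMD = ["make", "build"]
TEST_CMD = ["make", "test"]

def change_is_acceptable() -> bool:
    """Reject a proposed change unless it compiles and its tests pass."""
    if subprocess.run(BUILD_CMD).returncode != 0:
        return False  # the compiler is always right: never proceed on a broken build
    return subprocess.run(TEST_CMD).returncode == 0
```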

What will be hard to fix is what is impossible for LLMs generally: they are incapable of goal direction according to an informed world model. That requires a different type of AI altogether, and I have no idea how close we really are to that reality. So where does it seem like a world model is missing? First, goal direction. I can code for days on end to meet the goals spelled out in a two-page document, going off in many directions as I discover and make mistakes, but always coming back to my purpose and my code-based belief system. Related is the world model that is informed by the context of my company and its peculiarities, the other software that exists adjacent to the current system, the years of debugging that drive my behavior in complex ways across all manner of problems, and so on. To me this means the need to specify narrowly and review exhaustively isn’t going away soon - and it is never going away with LLM technology alone.

So yes, the future may still be full of code written by machines. But if we’re going to get there, someone still has to teach the machines what to care about. And that, for now, is still a very human job.