My second summer job was at a software company that made backup software.
They had a lab. Rows of machines running different operating systems, different hardware configurations. And they had people — actual humans — whose job was to sit down, run through test scripts, and click every button.
That was me, summer of whatever year that was. Professional button clicker.
It wasn’t glamorous. But it made sense at the time, because the alternative — writing code to test your code — felt like extra work on top of the real work. And honestly, in the early days of software, it kind of was.
The long, slow war over automated testing
The industry has been fighting this battle for a long time.
Unit testing frameworks started showing up in the late 90s: JUnit for Java in 1997, and eventually PHPUnit for PHP in the mid-2000s. The argument for them was obvious in retrospect: write the test once, run it forever, and catch regressions automatically instead of paying humans to click buttons.
And yet. Even in 2012, when I was job hunting, I was meeting engineers who didn’t do it. Didn’t believe in it. Thought it was overhead. I’ve talked to people in the last two years still having the same argument.
CI/CD pipelines are now considered standard practice. They’re also still not universal. In 2022, I had to debate — and I use that word generously — a senior developer on whether developing all production code directly on the server and then doing a file copy to his desktop was maybe… not ideal.
His response: “But then where will the files be stored?”
This was 2022.
So when I say the industry spent 30 years fighting automated testing, I’m not being dramatic. I watched it happen.
AI walked in and assumed we’d won
I’m working through the Anthropic Academy’s “Building with the Claude API” course right now — 84 lessons, 8+ hours. One of the earliest sections covers Prompt Evaluation.
The premise: when you’re building AI-powered features, you need to test your prompts the same way you test your code. Does this prompt reliably produce the output you expect? Does it degrade when the input varies? Does changing the model or the parameters break something that was working?
And the tooling they teach you to build is automated. Model-graded. Code-executed. Built into your development workflow from day one.
Not “here’s a best practice for mature teams.” The assumed baseline. The starting point.
Here’s what a basic eval loop looks like:
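```python
import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()


def evaluate_prompt(test_cases, prompt_template):
    """Run every test case through the prompt and return the mean grade."""
    results = []
    for case in test_cases:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": prompt_template.format(input=case["input"])
            }]
        )
        # Grade the output automatically. grade_response is assumed to be
        # defined elsewhere and to return a number between 0 and 1.
        grade = grade_response(
            response.content[0].text,
            case["expected"]
        )
        results.append(grade)
    # Mean grade across all cases; 0.0 if there were none.
    return sum(results) / len(results) if results else 0.0
```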
You run your test cases. You get a score. You iterate on the prompt. You run them again. Same loop as software testing — just applied to natural language instead of code.
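The loop above calls grade_response without showing it. Here is a minimal sketch of what a code-based grader could look like. The names grade_exact and grade_json_keys are hypothetical, not taken from the course, and a real grader depends on what your prompt is supposed to return.

```python
import json


def grade_exact(output: str, expected: str) -> float:
    """Return 1.0 when the texts match after trimming and lowercasing."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def grade_json_keys(output: str, required_keys: set) -> float:
    """Return 1.0 when the output parses as a JSON object with all required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    return 1.0 if required_keys.issubset(data.keys()) else 0.0
```

Something like grade_exact could stand in for grade_response when the expected output is a short label. The JSON variant suits prompts that are supposed to return structured data.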
But here’s where it gets better: you don’t have to write every test case yourself. Claude can draft them. Feed it your prompt and your success criteria, and it generates the corner cases, the negative cases, the weird edge inputs, all the stuff you’d never think to test until a user finds it in production at 2 a.m.
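A rough sketch of how test-case generation might be wired up, not the course’s own code. The instruction text in GENERATION_PROMPT, the function names, and the choice to ask for a JSON array are all assumptions. The client argument is expected to be an anthropic.Anthropic() instance, passed in so the parsing can be exercised without a network call.

```python
import json

# Hypothetical instruction text; adjust the wording to your own prompt.
GENERATION_PROMPT = """You are helping test a prompt.

Prompt under test:
{prompt}

Success criteria:
{criteria}

Write {n} test cases, including edge cases and inputs that should fail.
Reply with only a JSON array of objects, each with the keys "input" and "expected"."""


def parse_test_cases(reply_text):
    """Pull a JSON array of input/expected objects out of a model reply."""
    start, end = reply_text.find("["), reply_text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in the reply")
    cases = json.loads(reply_text[start:end + 1])
    # Keep only well-formed cases so one bad entry does not break the run.
    return [
        case for case in cases
        if isinstance(case, dict) and {"input", "expected"}.issubset(case)
    ]


def generate_test_cases(client, prompt, criteria, n=10,
                        model="claude-sonnet-4-20250514"):
    """Ask the model to draft test cases for the given prompt and criteria."""
    response = client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": GENERATION_PROMPT.format(
                prompt=prompt, criteria=criteria, n=n
            ),
        }],
    )
    return parse_test_cases(response.content[0].text)
```

Generated cases are a draft. It is worth reading through them before trusting the scores they produce.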
That alone is worth the price of admission. Except that the price of admission is zero.
The course also covers using Claude itself as the grader. You write a prompt that evaluates whether another prompt’s output meets your criteria. The AI grades the AI’s homework. Which sounds circular until you realize it’s just a rubric — you define what “good” looks like, and the model applies it consistently across hundreds of test cases. Faster than human review. More consistent than eyeballing outputs yourself.
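Here is one way the grader idea could look in code, as a sketch under assumptions. The rubric wording, the 1 to 10 scale, and the "SCORE:" line convention are illustrative choices, not something the course prescribes. As before, client is assumed to be an anthropic.Anthropic() instance.

```python
import re

# Hypothetical grader prompt: the rubric defines what "good" looks like.
GRADER_PROMPT = """You are grading the output of another prompt.

Rubric:
{rubric}

Output to grade:
<output>
{output}
</output>

Explain your reasoning briefly, then finish with a line of the form
SCORE: <integer from 1 to 10>"""


def parse_score(grader_reply):
    """Return the last 'SCORE: n' value, or None if missing or out of range."""
    matches = re.findall(r"SCORE:\s*(\d+)", grader_reply)
    if not matches:
        return None
    score = int(matches[-1])
    return score if 1 <= score <= 10 else None


def grade_with_model(client, output, rubric,
                     model="claude-sonnet-4-20250514"):
    """Ask the model to apply the rubric to one output and return its score."""
    response = client.messages.create(
        model=model,
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": GRADER_PROMPT.format(rubric=rubric, output=output),
        }],
    )
    return parse_score(response.content[0].text)
```

Returning None for an unparseable reply keeps a malformed grade from silently counting as a real score. The caller decides whether to retry or skip that case.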
Design for testability — remember that?
There was a philosophy in software engineering called design for testability. The idea: structure your code from the start so it’s easy to test, rather than bolting tests on after the fact. Separation of concerns, dependency injection, and keeping functions small and pure. If your code is hard to test, that’s a signal your architecture is tangled.
It never fully caught on in a lot of shops. Too much upfront discipline for teams that just wanted to ship.
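To make the design-for-testability idea concrete, here is a small sketch of dependency injection applied to an eval loop. The names run_eval, fake_model, and exact_match are made up for illustration. The point is that the model call and the grader are passed in as plain functions, so the loop itself can be tested with no API key and no network.

```python
from typing import Callable, Iterable


def run_eval(
    cases: Iterable[dict],
    prompt_template: str,
    call_model: Callable[[str], str],
    grade: Callable[[str, str], float],
) -> float:
    """Average grade over all cases; the model call and grader are injected."""
    grades = [
        grade(
            call_model(prompt_template.format(input=case["input"])),
            case["expected"],
        )
        for case in cases
    ]
    return sum(grades) / len(grades) if grades else 0.0


# A fake model lets the loop be tested without any network call.
def fake_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unknown"


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0


cases = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "3 + 3", "expected": "6"},
]
score = run_eval(cases, "Compute: {input}", fake_model, exact_match)
print(score)  # 0.5
```

In real use, call_model would wrap the API client. Swapping it for the fake is what keeps the loop small, pure, and easy to test.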
Prompt engineering is the same idea applied to natural language. A clean, well-structured prompt produces predictable, testable outputs. A sprawling mess of nested conditions and “but also do this unless X” produces chaos — and untestable chaos at that.
If your prompt is hard to evaluate, your prompt is probably tangled. Same principle. Different medium. Thirty years later.
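As a small illustration of that contrast, here is a sketch of a prompt written to be easy to evaluate: one task, clearly delimited input, and one fixed output format that a single line of code can check. The prompt text and names are hypothetical examples, not from the course.

```python
ALLOWED_LABELS = ("positive", "negative", "neutral")

# Hypothetical prompt: one task, delimited input, one fixed output format.
SENTIMENT_PROMPT = """Classify the sentiment of the customer review below.

<review>
{input}
</review>

Answer with exactly one word: positive, negative, or neutral."""


def is_valid_label(output: str) -> bool:
    """True when the model output is exactly one of the allowed labels."""
    return output.strip().lower() in ALLOWED_LABELS
```

A prompt that piles on exceptions and optional behaviors has no equivalent one-line check, which is usually the first sign that it needs to be split up.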
Why this matters beyond the course
If you’re building AI-powered features into a product, prompt evaluation isn’t optional overhead. It’s the difference between “we have an AI feature” and “we have an AI feature that works reliably.”
The prompts are your logic. Untested prompts are untested code. And we’ve known since roughly 1997 that untested code is a liability.
The engineers who built the strongest case for automated testing in the 2000s weren’t wrong. They were just early, and the industry took its time catching up.
AI isn’t waiting for the industry to catch up this time. It’s just assuming you already did.
The Anthropic Academy’s API course — free, certified, straight from Anthropic — covers this in the Prompt Evaluation section. If you’re building with Claude, it’s worth your time.
Up next: Prompt Engineering — or, what happens when you let Claude write its own instructions.