2 more ways to get better AI outputs
More eval structures for your AI projects.
In the last edition, we built three graders for a newsletter idea scraper. A hard filter. A scoring rubric. And a calibration loop that checked whether the scores meant anything.
You ran an idea through the scraper. The rubric scored it. You trusted the number.
Then you ran the same idea again. Lo and behold, a different score. With the same prompt, same rubric, and same idea! Here’s why that happens and two evals that fix it.
Even the best language models don’t think; they predict. Every time the model reads your rubric, it’s making a best guess at what the next word should be. In clear cases, the guess is nearly always the same.
On edge cases, where your rule could be read two ways, the guess changes. The evals we discuss today solve for that.
The bad prompt problem.
Say you have a prompt that scores newsletter ideas. It’s working. Then one idea scores wrong, so you fix the prompt. You run it again. That idea now scores correctly, but another one that was scoring correctly before looks off.
Why? The prompt applies the same rules to every idea. Change the rules broadly, and every score can shift, not just the one you were trying to fix.
The fix is to show the model how the rules apply to the specific idea that was breaking by adding a worked example with a one-line explanation. The lens stays the same. The model just gets better at using it.
But how do you know what broke?
What is a regression suite?
A regression suite is a fixed list of ideas with fixed expected scores, saved somewhere the prompt can’t see them. You run every prompt change against this list. If a score shifts by more than half a point on an idea you didn’t intend to change, something broke.
Here’s an example of a regression test in action.
When we were locking in rules for our in-house news scraper tool, we scored Indian startup and business news stories across six dimensions — role reversal, public strategy failure, repeatable template, platform crossing function line, macro India insight, and global precedent.
Originally, the rule was to apply the global precedent dimension to every item, with comparable economies such as China, Indonesia, and Brazil.
Here’s how the following news topics scored.
1. Rapido raises $240M at $3B · 5/6
✅ Role reversal · ✅ Public strategy failure · ✅ Platform crossing · ✅ Macro India insight · ✅ Global precedent
Precedent: Grab/Gojek — started as bike taxi in SEA, expanded to food + fintech + logistics. Conditions forming in India: Tier-2 mobility gap ✅, UPI rails enabling embedded fintech ✅
🏷 Revenue size ✅ (Entrackr) | David vs Goliath ✅ (TechCrunch)
(Business Standard)
12. Gameskraft — ₹526 Cr assets frozen, founders arrested · 3/6
✅ Public strategy failure · ✅ Macro India insight · ✅ Global precedent
Precedent: FanDuel/DraftKings in US — daily fantasy companies faced state-level enforcement, pivoted to licensed sports betting when regulation clarified. Conditions forming: India gaming regulation in flux ✅, operators exploring pivot to esports and skill gaming ✅
(Business Standard)
20. HrdWyr Series A $13M — AI-native SoC for consumer electronics and EVs · 2/6
✅ Macro India insight · ✅ Global precedent
Precedent: Taiwan’s TSMC early ecosystem — government PLI + domestic anchor clients (Apple for Taiwan, boAt/EV companies for India) seeding chip design. Conditions forming: India PLI for semiconductor ✅, domestic anchor clients creating demand ✅
(BSA roundup)
Off list: Sindhuja Microcredit $5M pre-Series D — rural MFI is a well-established model globally, not a new experiment with visibly forming conditions
Then we updated the rules.
The update said: global precedent only applies to seed or pre-Series A companies, or genuinely new categories.
Why? Because established companies with proven traction in India don’t need a global precedent to validate their direction. Plus, applying that rule inflated scores for established companies needlessly.
Thanks to the frozen test set, we could confirm that Rapido’s ranking changed exactly as the new rule intended. That’s a good thing. But the test set also alerts you when things go wrong. Two cases:
1. The score changes when it shouldn’t. Something you didn’t touch got affected. That’s the regression you were trying to avoid.
2. The score stays the same when it should have changed. The rule you added didn’t actually work.
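Both failure cases can be checked mechanically. Here is a minimal Python sketch, assuming you keep scores as simple name-to-score dictionaries. The function name `check_regressions`, the `intended` set, and the "after" scores are hypothetical and only there for illustration. The baseline scores and the half-point tolerance come from the text above.

```python
TOLERANCE = 0.5  # the half-point threshold from the regression suite definition

def check_regressions(expected, actual, intended, tolerance=TOLERANCE):
    """Compare scores after a prompt change against the frozen baseline.

    expected: {idea: score before the prompt change}
    actual:   {idea: score after the prompt change}
    intended: set of ideas the change was supposed to move
    """
    problems = []
    for idea, old_score in expected.items():
        moved = abs(actual[idea] - old_score) > tolerance
        if moved and idea not in intended:
            # Case 1: something you didn't touch got affected
            problems.append((idea, "changed when it shouldn't"))
        elif not moved and idea in intended:
            # Case 2: the rule you added didn't actually work
            problems.append((idea, "didn't change when it should"))
    return problems

# Baseline scores are the ones shown earlier.
# The "after" scores are made up to show what a regression looks like.
baseline = {"Rapido": 5, "Gameskraft": 3, "HrdWyr": 2}
after = {"Rapido": 4, "Gameskraft": 3, "HrdWyr": 1}

problems = check_regressions(baseline, after, intended={"Rapido"})
print(problems)  # [('HrdWyr', "changed when it shouldn't")]
```

An empty list means the change did what you wanted and nothing else.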
How to set it up:
Create a Google Sheet with three columns: idea name, expected score, and actual score after each run. Before you make any change to the prompt, copy the current version into a new row in a separate tab. Label it with the date. That’s your version history.
After every prompt change, run your test set and compare actual scores against expected. If any score regresses, don’t use the new version until you’ve traced the regression back to the specific line in the prompt that caused it and decided whether to keep that line.
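If you would rather keep the version history in a file than in a Google Sheet tab, the same habit can be sketched in a few lines of Python. This is a swapped-in alternative, not the setup described above. The function name `save_prompt_version` and the file name `prompt_history.csv` are assumptions.

```python
import csv
from datetime import date
from pathlib import Path

def save_prompt_version(prompt_text, history_path="prompt_history.csv"):
    """Append the current prompt to a dated history file before you edit it."""
    path = Path(history_path)
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            # First run: write the column headers
            writer.writerow(["date", "prompt"])
        # One row per saved version, labeled with today's date
        writer.writerow([date.today().isoformat(), prompt_text])
```

Call it with the current prompt text before every change. Each call adds one dated row, so you can always get back to the last version that passed the test set.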
The regression suite tells you whether a prompt change broke something. But what if you haven’t changed anything, and the prompt is still scoring the same idea differently each time you run it?
This is what a consistency check helps with.
When to run a consistency test.
Say your news curation scraper has a hard rule: only flag items that score strictly above 30 out of 35 when the Revenue dimension is skipped. It’s a mechanical threshold, no judgment required.
And here’s what a run shows.
3. Quick Commerce — Amazon Now vs Blinkit 🚩
Amazon Now scaled fast enough that Blinkit now has reason to watch. The late global incumbent is finally a real threat in 10-minute delivery. Distribution muscle still buys a seat, even arriving last.
(The Arc)
Industry L4(4) + Timely L4(8) + Revenue skipped + Teaching L3(12) = 24/35
The scraper flagged the Amazon Now story even though it scored 24, well below the threshold of 30. That’s an inconsistency in the output given the set of rules.
Why did it happen?
When questioned, the model said it saw more narrative interest in the topic. In other words, it made a judgment call despite the rule.
The fix?
Simple: tighten the rule. Instead of “flag scores above 30,” say “flag news topics with scores of 31 and above; never flag a story scoring 30 or below, regardless of how strong the story appears.”
That’s it. The gap closes.
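To see what “mechanical” means here, the tightened rule can be written as a few lines of Python. This is only a sketch: the names `FLAG_MIN` and `should_flag` are hypothetical, and representing a skipped dimension as `None` is an assumption. You could use something like it to double-check the scraper’s flags after a run.

```python
FLAG_MIN = 31  # flag scores of 31 and above; 30 or below is never flagged

def should_flag(dimension_scores):
    """Decide the flag from the total alone. Skipped dimensions (None) add nothing."""
    total = sum(s for s in dimension_scores.values() if s is not None)
    return total >= FLAG_MIN, total

# The Amazon Now vs Blinkit item from the run above
amazon_now = {"Industry": 4, "Timely": 8, "Revenue": None, "Teaching": 12}
print(should_flag(amazon_now))  # (False, 24)
```

There is no room for “narrative interest” in this version. The total is 24, the cutoff is 31, so the item is not flagged.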
How to run it.
Take three items — one that should always pass, one that should always fail, and one that is borderline. Run each five times without changing anything. If a clear pass fails on any run, or a clear fail passes on any run, you have a consistency problem. Find the rule the model is treating as a suggestion. Make it mechanical.
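Here is a minimal Python sketch of that check. It assumes you can wrap your scraper in a function that takes an item and returns True (flagged) or False (not flagged). The `fake_flag` function below is a stand-in that simulates a model slipping on one run, not a real API call, and all names are hypothetical.

```python
import itertools

def consistency_check(flag_item, items, runs=5):
    """items maps each name to True (clear pass), False (clear fail), or None (borderline)."""
    report = {}
    for name, expected in items.items():
        results = [flag_item(name) for _ in range(runs)]
        if expected is None:
            # Borderline: no fixed answer, so just record whether the runs agree
            report[name] = "stable" if len(set(results)) == 1 else "unstable"
        else:
            # Clear cases must match the expected result on every single run
            report[name] = "ok" if all(r == expected for r in results) else "inconsistent"
    return report

# Stand-in for the real scraper: the clear fail gets flagged on one of five runs.
_fake_outputs = {
    "clear pass": itertools.cycle([True]),
    "clear fail": itertools.cycle([False, False, True, False, False]),
    "borderline": itertools.cycle([True, False]),
}

def fake_flag(name):
    return next(_fake_outputs[name])

items = {"clear pass": True, "clear fail": False, "borderline": None}
report = consistency_check(fake_flag, items)
print(report)
# {'clear pass': 'ok', 'clear fail': 'inconsistent', 'borderline': 'unstable'}
```

Any “inconsistent” result on a clear case is your cue to go looking for the rule the model is treating as a suggestion.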
Neither of the evals we discussed today is really catching errors in the model output. Instead, they’re exposing loopholes in the rules — the same way the Amazon Now flag exposed that “above 30” wasn’t precise enough. Precise rules make AI reasoning better. And prompt by prompt, your tool gets better too.




