I asked a client last month how their AI agent was performing. They said "great." I asked what metrics they were tracking. Silence.
This happens more than you'd think. Companies invest real money in AI, ship something, and then just... assume it's working because nobody's complaining. That's not measurement. That's hope.
The problem with "it feels faster"
Feelings aren't data. I've seen teams swear that an AI tool is saving them hours per week, and when we actually measured it, the saving was about 20 minutes a week. I've also seen the opposite: a team convinced their AI wasn't doing much, and when we looked at the numbers, it was saving 9 hours a week across the group.
Human perception of time saved is terrible. We overestimate savings on tasks we hated and underestimate savings on tasks we didn't mind. You need actual numbers.
What to measure (and when)
The metrics depend on what the AI is doing. But there are three categories that apply to almost everything.
Time displacement
The most straightforward one. How long did this task take before AI? How long does it take now?
Measure both. Before you deploy, get a baseline. Time the task across 5-10 instances. Record it. After deployment, do the same thing 2 weeks in and again at 6 weeks.
Why twice? Because the first measurement catches the obvious time saving. The second catches whether people have actually adopted it or quietly gone back to the old way. I've seen this happen four times now. The tool works. People don't use it. The time saving exists in theory but not in practice.
For a reporting agent I built for a B2B SaaS client in January, the baseline was 2.5 hours per week for one person. After the agent: 15 minutes of review time. That's a 90% reduction in time. Clear, measurable, defensible.
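To make that concrete, here's the same calculation as a minimal Python sketch, using the reporting agent's figures (the function name is just for illustration):

```python
def time_reduction(baseline_minutes: float, after_minutes: float) -> float:
    """Percentage of the original task time eliminated."""
    return (baseline_minutes - after_minutes) / baseline_minutes * 100

# Reporting agent figures: 2.5 hours/week baseline, 15 minutes of review after.
baseline = 2.5 * 60  # 150 minutes
after = 15
print(f"{time_reduction(baseline, after):.0f}% reduction")  # prints "90% reduction"
```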
Quality impact
Time isn't everything. Sometimes AI makes things faster but worse. Sometimes it makes things slower but significantly better.
For any task where the output goes to a customer or feeds into a decision, track error rates. Before AI: how often was the output wrong or needed correction? After AI: same question.
One client was using an agent to draft customer responses. The agent was fast. But 30% of responses needed editing before sending, and about 5% had factual errors that would've been embarrassing if they'd gone out. We tracked this for a month, identified the patterns causing the errors, adjusted the prompt, and got the edit rate down to 12% and factual errors to under 1%. Without measuring, they'd have either assumed it was fine or scrapped it entirely. Neither would've been the right call.
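You don't need tooling to track this. A minimal sketch, assuming a hypothetical log where whoever reviews the output records two yes/no answers per response:

```python
# Hypothetical review log: one entry per AI-drafted response, recorded by
# whoever checks the output before it goes to a customer.
reviews = [
    {"needed_edit": True, "factual_error": False},
    {"needed_edit": False, "factual_error": False},
    {"needed_edit": True, "factual_error": True},
    # ... one entry per reviewed response over the measurement window
]

total = len(reviews)
edit_rate = 100 * sum(r["needed_edit"] for r in reviews) / total
error_rate = 100 * sum(r["factual_error"] for r in reviews) / total
print(f"Edit rate: {edit_rate:.0f}%, factual error rate: {error_rate:.0f}%")
```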
Cost per outcome
This is where people's eyes glaze over, but it matters. What does it cost you, in API fees and compute, to process one unit of work through the AI?
I track this per agent, per week. It takes about 5 minutes to check and it's caught problems twice. Once when a prompt change accidentally tripled the token usage (the agent was processing the same data three times due to a loop). Once when a model upgrade changed the pricing tier and nobody noticed for two weeks.
For most of the agents I build, the cost is somewhere between £0.01 and £0.50 per run. At low volume that's noise. At 500 runs a day, the difference between £0.05 and £0.30 per run is £3,750 a month. Worth tracking.
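Here's that volume arithmetic as a sketch, with the assumed figures from above (500 runs a day, a 30-day month):

```python
def monthly_cost(cost_per_run: float, runs_per_day: int, days: int = 30) -> float:
    """Projected monthly spend for one agent at a given volume."""
    return cost_per_run * runs_per_day * days

runs = 500
cheap = monthly_cost(0.05, runs)   # £750/month
pricey = monthly_cost(0.30, runs)  # £4,500/month
print(f"Difference: £{pricey - cheap:,.0f}/month")  # Difference: £3,750/month
```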
The ROI calculation
Once you've got time displacement and cost data, the ROI calc is simple.
Value created: Hours saved per week × hourly cost of the person who was doing it. If a £50/hour engineer saves 5 hours a week, that's £250/week, or roughly £13,000 a year.
Cost: API/compute costs + time spent maintaining the agent + the original build cost amortised over 12 months.
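Put together, the whole calculation is a few lines. A sketch using the worked figures above, plus assumed maintenance and build costs purely for illustration:

```python
# Value created: hours saved per week × hourly cost, annualised.
hours_saved_per_week = 5
hourly_cost = 50                            # £/hour
annual_value = hours_saved_per_week * hourly_cost * 52   # £13,000

# Cost: API fees + maintenance time + build cost amortised over 12 months.
annual_api_cost = 40 * 12                   # assumed £40/month in API fees
annual_maintenance = 2 * hourly_cost * 12   # assumed 2 hours/month of upkeep
amortised_build = 3000                      # assumed one-off build cost

annual_cost = annual_api_cost + annual_maintenance + amortised_build
print(f"ROI: {annual_value / annual_cost:.1f}x")  # ROI: 2.8x with these figures
```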
For most of the agents I've built, the ROI is somewhere between 3x and 15x over 12 months. The reporting agent I mentioned earlier costs about £40/month in API fees and saves roughly £6,500/year in time. That's a straightforward win.
But I've also killed agents that looked good on paper and didn't deliver. One I built was supposed to automate ticket categorisation. It worked technically. But the time saving was only about 4 minutes per day because the human was already fast at it. The maintenance overhead ate the saving. Net ROI was basically zero. I shut it down after 6 weeks.
Measuring means being willing to admit when something isn't working.
When to measure
Don't wait. Start tracking from day one.
I set up a simple dashboard for every agent I deploy. Nothing fancy. A Google Sheet with four columns: date, runs, errors, cost. Updated weekly. Takes 5 minutes.
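The same four columns work anywhere. A minimal sketch that appends one weekly row to a local CSV instead of a spreadsheet (file name and figures are placeholders):

```python
import csv
from datetime import date

# Append this week's row to the agent's log: date, runs, errors, cost.
with open("agent_metrics.csv", "a", newline="") as f:
    csv.writer(f).writerow([date.today().isoformat(), 312, 4, 9.80])
```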
At 2 weeks, I do the first time displacement comparison against the baseline. At 6 weeks, I do a full review: time, quality, cost, and a conversation with the person using it about what's working and what isn't.
If the numbers look good at 6 weeks, I move to monthly monitoring. If they don't, I either fix it or kill it.
The metrics nobody tracks (but should)
Adoption rate. How many people are actually using the thing? If you built it for a team of 8 and only 3 are using it, your ROI is less than half what you modelled.
Workaround rate. How often do people bypass the AI and do it the old way? This is the canary in the coal mine. If people are working around the AI, something's wrong with the output, the UX, or the trust level. Find out which.
The temptation with AI is to ship it and move on to the next shiny thing. The discipline is in the measurement. If you're investing in AI and you're not tracking these numbers, you don't actually know whether it's working. You're guessing.
We cover measurement frameworks as part of every AI readiness assessment we do, because there's no point identifying opportunities if you can't tell whether they delivered.