How we cut AI review time from 86 seconds to 27

March 26, 2026 · 8 min read · Coffee Break Resume

When we launched Coffee Break Resume, our full review took 86 seconds. Users were staring at a spinner while we ran 7 parallel Claude API calls. Here's every change we made to get it to 27 seconds — and what we learned about LLM performance that isn't obvious.

The starting point

Coffee Break Resume runs 7 Claude API calls simultaneously via Promise.all every time someone pays for a full review. The calls are: a full review JSON (ATS analysis, section feedback, rewritten bullets), LinkedIn summary, cover letter, interview prep, resume rewriter, cold outreach messages, and a thank you letter.

Running them in parallel was the right call — serial execution would have been catastrophic. But parallel doesn't mean fast. Our initial timings looked like this:

Initial timings (Promise.all)
[thankyou] 6,880ms
[cover] 7,940ms
[linkedin] 9,723ms
[review] 17,295ms
[outreach] 18,898ms
[rewriter] 51,910ms
[interview] 86,157ms
Total wall time: 86,158ms

Total time equals the slowest call — that's how Promise.all works. Everything finished in under 20 seconds except the two heavy calls. We were waiting 67 extra seconds just for interview and rewriter.
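A per-call timing log like this can be produced with a small wrapper around each call. The sketch below is only an illustration, not the production code: `timed`, `runAll`, and `slowest` are hypothetical names, and each task stands in for one Claude API call.

```typescript
// Hypothetical helpers; each task stands in for one API call.
type Timing = { label: string; ms: number };

// Run one task and record how long it took.
async function timed<T>(
  label: string,
  task: () => Promise<T>,
): Promise<{ value: T; timing: Timing }> {
  const start = Date.now();
  const value = await task();
  return { value, timing: { label, ms: Date.now() - start } };
}

// Start every task at once; Promise.all resolves when the last one finishes,
// so wallMs ends up roughly equal to the slowest individual timing.
async function runAll<T>(
  tasks: Record<string, () => Promise<T>>,
): Promise<{ values: Record<string, T>; timings: Timing[]; wallMs: number }> {
  const start = Date.now();
  const results = await Promise.all(
    Object.entries(tasks).map(([label, task]) => timed(label, task)),
  );
  const values: Record<string, T> = {};
  results.forEach((r) => {
    values[r.timing.label] = r.value;
  });
  return {
    values,
    timings: results.map((r) => r.timing),
    wallMs: Date.now() - start,
  };
}

// The call that sets the wall time is the slowest one.
// Throws on an empty list, which is acceptable for a sketch.
function slowest(timings: Timing[]): Timing {
  return timings.reduce((a, b) => (b.ms > a.ms ? b : a));
}
```

Logging `slowest(timings)` next to `wallMs` on every paid review makes the bottleneck visible without reading the whole list.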

What we tried and what actually moved the needle

Reducing output volume — partial win

The obvious first move was to generate less. We cut interview questions from 10 to 6, dropped the questions-to-ask from 5 to 3, and removed the CARL framework from the main interview prompt (making it an on-demand fetch when the user clicks the CARL tab). We also reduced max_tokens for the interview call from 4,500 to 3,000.

Result: interview dropped from 86s to 44s. Better, but still the bottleneck.

Reducing output tokens helps but has diminishing returns. The real cost is often input processing time — how much context the model reads before generating a single token.

Lazy-loading the full resume rewrite — big win

The rewriter call was generating a complete plain-text resume rewrite on every paid review, even though most users never download the resume PDF. It was easily the largest single output — 800-1,200 tokens of raw text.

We pulled fullResume out of the main rewriter prompt entirely and created a separate /fullresume endpoint that fires only when the user clicks "Download Resume PDF". The main call now returns only section diffs and key changes, and we dropped max_tokens from 6,000 to 3,000.

Rewriter went from 51s to 17s. That single change took 34 seconds off the rewriter call, although the overall wall time was still set by interview at 44s.
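The lazy-loading pattern can be sketched as a loader that does nothing until it is first requested and then reuses the in-flight promise. This is an illustration under assumptions: `lazy` is a hypothetical helper, and only the /fullresume path comes from the description above; the request shape in the usage comment is a guess.

```typescript
// Wrap an expensive async call so it runs only on first use.
// Repeat calls reuse the same promise, so a double click does not
// trigger a second generation.
function lazy<T>(loader: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (pending !== undefined) return pending;
    const started = loader().catch((err) => {
      pending = undefined; // allow a retry after a failure
      throw err;
    });
    pending = started;
    return started;
  };
}

// Hypothetical wiring (request details are assumptions):
//   const getFullResume = lazy(() =>
//     fetch("/fullresume", { method: "POST" }).then((res) => res.text()));
//   downloadButton.addEventListener("click", () => getFullResume());
```

The trade-off is that the first click on the download button now pays the generation cost, so the button needs its own loading state.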

Trimming input context — the biggest win

After the output cuts, interview was still at 44s. We'd reduced what it was generating, but it was still slow. Digging into the prompt, we found the root cause: the interview call was receiving up to 16,000 characters of input before generating a single token.

❌ Before

Resume: truncated (12,000 chars)
JD: jd.slice(0, 4000)
Total input: ~16,000 chars

✅ After

Resume: truncated.slice(0, 3000)
JD: jd.slice(0, 1500)
Total input: ~4,500 chars

The interview prompt doesn't need the full resume. It needs enough context to know who the person is and what they've done, and the first 3,000 characters cover that for virtually every resume. Similarly, 1,500 characters of a job description is enough to understand the role; the interview call doesn't need the full JD that the rewriter uses for keyword analysis.

Interview dropped from 44s to 27s, a further 17 seconds on top of the 42 seconds the output cuts had already saved. This was the change that finally brought the wall time under 30 seconds.

Key insight: input tokens matter for wall-clock time, not just output tokens. Time-to-first-token grows with input context length, so don't give a call more context than it actually needs.
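One way to keep context limits explicit is a per-call budget table instead of slices scattered through the prompt code. The sketch below is a minimal illustration: `CONTEXT_BUDGET` and `buildContext` are hypothetical names, the interview numbers and the 12,000-character resume limit come from this post, and the rewriter's JD budget is an assumed placeholder.

```typescript
type CallName = "interview" | "rewriter";

// Character budgets per call. The rewriter jdChars value is an assumption.
const CONTEXT_BUDGET: Record<CallName, { resumeChars: number; jdChars: number }> = {
  interview: { resumeChars: 3000, jdChars: 1500 },
  rewriter: { resumeChars: 12000, jdChars: 4000 },
};

// Build the context portion of a prompt, truncated to the call's budget.
function buildContext(call: CallName, resume: string, jd: string): string {
  const { resumeChars, jdChars } = CONTEXT_BUDGET[call];
  return [
    "Resume:",
    resume.slice(0, resumeChars),
    "Job description:",
    jd.slice(0, jdChars),
  ].join("\n");
}
```

Adding a new call then forces a decision about how much context it really needs, rather than inheriting the largest one by default.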

Final timings

Final timings (Promise.all)
[thankyou] 6,846ms
[cover] 8,497ms
[linkedin] 9,019ms
[outreach] 10,928ms
[review] 13,947ms
[rewriter] 17,472ms
[interview] 27,129ms
Total wall time: 27,132ms (↓ 68%)

What each change contributed

Change | Before | After | Saved
Cut interview questions 10→6, questionsToAsk 5→3 | 86s (interview) | 73s (interview) | −13s
Remove CARL from interview prompt (lazy-load) | 73s (interview) | 44s (interview) | −29s
Lazy-load fullResume via /fullresume endpoint | 51s (rewriter) | 17s (rewriter) | −34s
Trim interview input context 16k→4.5k chars | 44s (interview) | 27s (interview) | −17s

What we'd do differently from the start

Design prompts around minimum viable context. Every call should only receive the context it actually needs. The rewriter needs the full resume; the interview prep doesn't. We started by passing the same large context to every call and paid for it in wall time.

Lazy-load anything that isn't immediately visible. The full resume rewrite is only needed if the user downloads the PDF. The CARL framework is only needed if the user clicks the CARL tab. Generate the critical path first; fetch the rest on demand.

Measure input tokens separately from output tokens. Most performance discussions focus on output length. Time-to-first-token scales with input size, and for large inputs it dominates.
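To separate the two, it helps to record time-to-first-token and total time for each streamed call. The following is a rough sketch, assuming the responses are consumed as a stream of chunks; `createLatencyRecorder` is a hypothetical helper, and the clock is injectable only so the logic can be checked without real delays.

```typescript
type LatencyReport = { ttftMs: number | null; totalMs: number };

// Records when the first chunk arrives separately from when the stream ends.
function createLatencyRecorder(now: () => number = Date.now) {
  const startedAt = now();
  let firstChunkAt: number | null = null;
  return {
    // Call once per streamed chunk; only the first call is recorded.
    onChunk(): void {
      if (firstChunkAt === null) firstChunkAt = now();
    },
    // Call when the stream ends. ttftMs is null if no chunk ever arrived.
    finish(): LatencyReport {
      const end = now();
      return {
        ttftMs: firstChunkAt === null ? null : firstChunkAt - startedAt,
        totalMs: end - startedAt,
      };
    },
  };
}

// Usage with any async iterable of chunks (`stream` is hypothetical):
//   const rec = createLatencyRecorder();
//   for await (const chunk of stream) rec.onChunk();
//   const report = rec.finish();
```

If `ttftMs` makes up a large share of `totalMs`, trimming input is the lever to pull; if it is small, cutting output is.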

The shape of the timing graph matters. When one call is 3x slower than everything else, that's your bottleneck. When all calls are within 2x of each other, you're near the floor for parallel execution.

What's next

The rewriter is the next candidate: it's still the second-slowest call at 17s and receives the full 12,000-character resume context. There's probably 5-8 seconds to recover there with smarter truncation, though interview would still set the wall time at 27s. That feels like a reasonable product experience for the amount of work being done, so we're focused on other things for now.

The economics are also worth noting: at Anthropic's Claude Sonnet pricing, each full review costs approximately $0.08 in API fees. At $9.99 per review, that's a gross margin of about 99% on API costs alone. The performance optimisations didn't change the cost meaningfully; trimming input context saves fractions of a cent per call. The goal was always user experience, not API cost.

Try the tool

Free resume score in 10 seconds. Full review — rewritten bullets, cover letter, interview prep, and more — in about 30 seconds.