case study · verba

A multilingual speech-to-draft workspace I'd actually trust to ship.

Raw transcript preservation, ambiguity surfaced, drafts compared across providers, eval loop logged — on Cloudflare Workers + D1 + Workers AI.

role: solo · design + build
stack: HTML/CSS/JS · CF Workers · D1 · Workers AI · OpenAI / Anthropic
timeline: weekends · 2025–2026
status: live · workspace open

Working across five natural languages, you notice the same failure mode in every speech-to-text product on the market: confidence is hidden, ambiguity is silently resolved, and the draft you receive looks confident even when the model isn't. Verba is the workspace built around the opposite assumption — preserve the raw, surface the doubt, and make the eval loop the artifact.

The problem

Most speech-to-draft tools optimise for a single output: a polished paragraph. That's the wrong unit. The unit that matters is the trace from raw audio to a written artifact you'd actually publish. Drop the trace and you can't tell whether the model heard you correctly, whether it translated faithfully, whether the draft is a clean summary or a confident hallucination. For an operator working across English, 中文, español, français, and português, that gap is unworkable.

The shape

The flow is sequential but every artifact compounds. Each pane hands its output to the next without a copy-paste, and the eval pane sees all of it.
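One way to read that: the panes share a single trace object that only grows as it moves downstream. A minimal sketch; the field names are illustrative, not Verba's actual schema:

```js
// One trace per recording; each pane appends, nothing is overwritten.
// Field names are illustrative, not Verba's actual schema.
const trace = {
  audioId: 'rec_0142',        // reference to the raw audio, never discarded
  transcript: {
    text: '',                 // raw transcript, preserved verbatim
    ambiguities: [],          // low-confidence spans surfaced, not silently resolved
  },
  drafts: new Map(),          // provider name -> draft text, compared side by side
  evals: [],                  // the eval pane appends scored runs here
};
```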

The eval log is the artifact. Everything before it is scaffolding to make the log honest. — working notes

Constraints I picked

No build step beyond Vite. No framework. No backend other than Cloudflare. The same edge that serves joaquinh.com runs the inference. This was a deliberate choice — a cheap, well-understood edge stack is the one I'd actually push to production. If I can't build a workspace I trust on it, the "boring stack wins" thesis is just a posture.

Multilingual is not a feature; it's the test

If your speech-to-draft tool only works in English, you're not building for the world your users live in. Verba runs the eval suite in en/zh/es/fr/pt with code-switched audio, and the rubric deliberately weights faithfulness above concision — a polished mistranslation is worse than a clumsy but accurate one.
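Concretely, the weighting can stay as blunt as a config object. The axes and numbers below are illustrative assumptions, not Verba's actual rubric:

```js
// Illustrative rubric: faithfulness dominates by design.
// Axes and weights are assumptions, not Verba's real values.
const rubric = {
  languages: ['en', 'zh', 'es', 'fr', 'pt'],
  weights: { faithfulness: 0.6, fluency: 0.25, concision: 0.15 },
};

// Weighted score over per-axis grades in [0, 1].
function score(grades) {
  return Object.entries(rubric.weights)
    .reduce((sum, [axis, w]) => sum + w * (grades[axis] ?? 0), 0);
}
```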

Diff rendering

Word-level Myers diff with a 200ms debounce. Anything finer felt jittery; anything coarser hid the model's actual edits.
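For reference, the core of the word-level pass fits in a screenful. This is a sketch of the standard greedy Myers algorithm plus the debounce wrapper; tokenization, op shapes, and the render hook are assumptions, not Verba's source:

```js
// Word-level Myers diff: greedy forward pass, then backtrack into keep/ins/del ops.
function tokenize(text) {
  return text.split(/\s+/).filter(Boolean);
}

function myersDiff(a, b) {
  const n = a.length, m = b.length;
  const max = n + m;
  if (max === 0) return [];
  const off = max;                         // offset so diagonal k can be negative
  let v = new Array(2 * max + 1).fill(0);
  const trace = [];

  // Forward pass: for each edit distance d, record how far each diagonal reaches.
  outer:
  for (let d = 0; d <= max; d++) {
    trace.push(v.slice());
    for (let k = -d; k <= d; k += 2) {
      let x;
      if (k === -d || (k !== d && v[off + k - 1] < v[off + k + 1])) {
        x = v[off + k + 1];                // step down: word inserted from b
      } else {
        x = v[off + k - 1] + 1;            // step right: word deleted from a
      }
      let y = x - k;
      while (x < n && y < m && a[x] === b[y]) { x++; y++; }  // follow the snake
      v[off + k] = x;
      if (x >= n && y >= m) break outer;
    }
  }

  // Backtrack through the saved V arrays to recover the edit script.
  const ops = [];
  let x = n, y = m;
  for (let d = trace.length - 1; d >= 0; d--) {
    const vd = trace[d];
    const k = x - y;
    const prevK = (k === -d || (k !== d && vd[off + k - 1] < vd[off + k + 1]))
      ? k + 1 : k - 1;
    const prevX = vd[off + prevK];
    const prevY = prevX - prevK;
    while (x > prevX && y > prevY) {       // diagonal moves are unchanged words
      ops.unshift({ op: 'keep', word: a[--x] }); y--;
    }
    if (d > 0) {
      if (x === prevX) ops.unshift({ op: 'ins', word: b[--y] });
      else ops.unshift({ op: 'del', word: a[--x] });
    }
    x = prevX; y = prevY;
  }
  return ops;
}

// The 200 ms debounce keeps re-renders off the hot path while tokens stream in.
function debounce(fn, ms = 200) {
  let t;
  return (...args) => { clearTimeout(t); t = setTimeout(() => fn(...args), ms); };
}

const renderDiff = debounce((a, b) => console.log(myersDiff(tokenize(a), tokenize(b))));
```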

Streaming

Server-sent events all the way. WebSockets were tempting but added a connection-state surface I didn't want for a single-direction stream.
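The Worker side is correspondingly small. A minimal sketch, assuming a hypothetical generateTokens async iterator standing in for the streaming model call:

```js
// Stream tokens as server-sent events from a Cloudflare Worker.
// `generateTokens` is a hypothetical stand-in for the streaming model call.
async function* generateTokens(env, request) {
  for (const token of ['draft ', 'goes ', 'here']) yield token;
}

export default {
  async fetch(request, env) {
    const { readable, writable } = new TransformStream();
    const writer = writable.getWriter();
    const encoder = new TextEncoder();

    (async () => {
      for await (const token of generateTokens(env, request)) {
        // One SSE frame per token: "data: ...\n\n"
        await writer.write(encoder.encode(`data: ${JSON.stringify({ token })}\n\n`));
      }
      await writer.write(encoder.encode('event: done\ndata: {}\n\n'));
      await writer.close();
    })();

    return new Response(readable, {
      headers: {
        'Content-Type': 'text/event-stream',
        'Cache-Control': 'no-cache',
      },
    });
  },
};
```

On the client, a plain EventSource consumes this and closes on the done event; there is no connection state to manage, which is the whole argument against WebSockets here.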

~240ms · edge cold-start to first token
4 · providers, parallel
5 · languages, eval-tested

What's next

Multi-turn evals. The current rubric is single-shot. The honest version of "is this model better" is multi-turn, and the workspace should make that the path of least resistance.

The workspace is the eval. If shipping the eval is hard, you'll ship without one. — readme, current