Working across six natural languages, you notice the same failure mode in every speech-to-text product on the market: confidence is hidden, ambiguity is silently resolved, and the draft you receive looks confident even when the model isn't. Verba is the workspace built around the opposite assumption — preserve the raw, surface the doubt, and make the eval loop the artifact.
The problem
Most speech-to-draft tools optimise for a single output: a polished paragraph. That's the wrong unit. The unit that matters is the trace from raw audio to a written artifact you'd actually publish. Drop the trace and you can't tell whether the model heard you correctly, whether it translated faithfully, or whether the draft is a clean summary or a confident hallucination. For an operator working across English, Chinese, Spanish, French, and Portuguese, that gap is unworkable.
The shape
The flow is sequential but every artifact compounds. Each pane hands its output to the next without a copy-paste, and the eval pane sees all of it.
- Transcribe — drop audio, get text with confidence-scored segments. Whisper-style on Workers AI; chunked uploads to R2.
- Polish — three deliberate modes: rewrite, translate, prompt-generate. Diff highlighted inline against the raw transcript.
- Compare — run the polished prompt across multiple providers in parallel via /api/compare. Side-by-side diff. Cost and latency captured.
- Log eval — fixed-rubric scoring (faithfulness, concision, structure, terminology). Persisted to D1. JSONL out for downstream analysis.
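The compare fan-out is the step with the most moving parts, so here is a minimal sketch of its shape. Everything here is illustrative, not Verba's actual code: `runCompare`, `ProviderFn`, and the result fields are assumed names; the only things taken from the source are that providers run in parallel and that latency is captured per call.

```typescript
// Hypothetical result shape; the source only says cost and latency are captured.
interface ProviderResult {
  provider: string;
  output: string;
  latencyMs: number;
  error?: string;
}

type ProviderFn = (prompt: string) => Promise<string>;

async function runCompare(
  providers: Record<string, ProviderFn>,
  prompt: string,
): Promise<ProviderResult[]> {
  // Fan out in parallel; one slow or failing provider must not block the rest,
  // so each call catches its own error and reports it as data.
  return Promise.all(
    Object.entries(providers).map(async ([provider, call]) => {
      const start = Date.now();
      try {
        const output = await call(prompt);
        return { provider, output, latencyMs: Date.now() - start };
      } catch (e) {
        return { provider, output: "", latencyMs: Date.now() - start, error: String(e) };
      }
    }),
  );
}
```

Recording the failure as a row rather than throwing matters for the eval log: a provider timing out is itself a data point you want side by side with the others.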
The eval log is the artifact. Everything before it is scaffolding to make the log honest. — working notes
Constraints I picked
No build step beyond Vite. No framework. No backend other than Cloudflare. The same edge that serves joaquinh.com runs the inference. This was a deliberate choice — a cheap, well-understood edge stack is the one I'd actually push to production. If I can't build a workspace I trust on it, the "boring stack wins" thesis is just a posture.
Multilingual is not a feature; it's the test
If your speech-to-draft tool only works in English, you're not building for the world your users live in. Verba runs the eval suite in en/zh/es/fr/pt with code-switched audio, and the rubric weights faithfulness above concision deliberately — a polished mistranslation is worse than a clumsy accurate one.
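The rubric weighting can be made concrete with a small sketch. The four axes and the JSONL output come from the source; the specific weight values below are assumptions chosen only to satisfy the stated constraint that faithfulness outweighs concision, and `toJsonl` is an illustrative helper, not Verba's API.

```typescript
// Assumed weights — the source only fixes the ordering, not the numbers.
const RUBRIC_WEIGHTS = {
  faithfulness: 0.4,
  terminology: 0.25,
  structure: 0.2,
  concision: 0.15,
} as const;

interface EvalScores {
  faithfulness: number;
  concision: number;
  structure: number;
  terminology: number;
}

function weightedScore(scores: EvalScores): number {
  return (Object.keys(RUBRIC_WEIGHTS) as Array<keyof EvalScores>).reduce(
    (sum, axis) => sum + RUBRIC_WEIGHTS[axis] * scores[axis],
    0,
  );
}

// One JSON object per line: the shape downstream analysis tools expect.
function toJsonl(records: Array<EvalScores & { id: string; lang: string }>): string {
  return records
    .map((r) => JSON.stringify({ ...r, score: weightedScore(r) }))
    .join("\n");
}
```

Because the weights sum to 1, a run that scores evenly across all four axes keeps that value as its aggregate, which makes the composite score easy to sanity-check by eye.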
Diff rendering
Word-level Myers diff with a 200ms debounce. Anything finer felt jittery; anything coarser hid the model's actual edits.
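To make the rendering step concrete, here is a sketch of a word-level diff with the same output shape, plus the debounce wrapper. One honest substitution: this uses a plain LCS dynamic program rather than the Myers algorithm the post names — the results agree, Myers is just the faster way to compute them. All names (`wordDiff`, `Op`, `debounce`) are illustrative.

```typescript
type Op = { type: "same" | "del" | "ins"; word: string };

// LCS-based word diff: a simpler stand-in for Myers with identical output —
// a per-word sequence of keep / delete / insert operations.
function wordDiff(a: string, b: string): Op[] {
  const A = a.split(/\s+/).filter(Boolean);
  const B = b.split(/\s+/).filter(Boolean);
  const n = A.length;
  const m = B.length;
  // dp[i][j] = length of the longest common subsequence of A[i..] and B[j..]
  const dp = Array.from({ length: n + 1 }, () => new Array<number>(m + 1).fill(0));
  for (let i = n - 1; i >= 0; i--) {
    for (let j = m - 1; j >= 0; j--) {
      dp[i][j] =
        A[i] === B[j] ? dp[i + 1][j + 1] + 1 : Math.max(dp[i + 1][j], dp[i][j + 1]);
    }
  }
  // Walk the table to emit ops in order.
  const ops: Op[] = [];
  let i = 0;
  let j = 0;
  while (i < n && j < m) {
    if (A[i] === B[j]) {
      ops.push({ type: "same", word: A[i] }); i++; j++;
    } else if (dp[i + 1][j] >= dp[i][j + 1]) {
      ops.push({ type: "del", word: A[i] }); i++;
    } else {
      ops.push({ type: "ins", word: B[j] }); j++;
    }
  }
  while (i < n) ops.push({ type: "del", word: A[i++] });
  while (j < m) ops.push({ type: "ins", word: B[j++] });
  return ops;
}

// Trailing-edge debounce: the diff re-renders only once typing pauses for `ms`.
function debounce<T extends unknown[]>(fn: (...args: T) => void, ms = 200) {
  let t: ReturnType<typeof setTimeout> | undefined;
  return (...args: T) => {
    clearTimeout(t);
    t = setTimeout(() => fn(...args), ms);
  };
}
```

The 200ms figure lands at the trailing edge of the debounce: every keystroke resets the timer, so the diff recomputes once per pause rather than once per character.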
Streaming
Server-sent events all the way. WebSockets were tempting but added a connection-state surface I didn't want for a single-direction stream.
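A minimal sketch of the server side, under assumptions: the handler below is not Verba's actual code, just the standard SSE framing (a `data:` line per event, a blank line as delimiter) wrapped in the `Response`/`ReadableStream` types a Cloudflare Worker returns. The `[DONE]` sentinel is a common convention, assumed here, not confirmed by the source.

```typescript
// One SSE frame: a data line plus the blank-line event delimiter.
function sseFrame(data: unknown): string {
  return `data: ${JSON.stringify(data)}\n\n`;
}

// Wrap an async stream of text chunks as a text/event-stream response.
function sseResponse(chunks: AsyncIterable<string>): Response {
  const stream = new ReadableStream({
    async start(controller) {
      const enc = new TextEncoder();
      for await (const text of chunks) {
        controller.enqueue(enc.encode(sseFrame({ text })));
      }
      // Assumed end-of-stream sentinel so the client knows to stop listening.
      controller.enqueue(enc.encode("data: [DONE]\n\n"));
      controller.close();
    },
  });
  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
    },
  });
}
```

The connection-state point from the post falls out of this shape: the response is just a one-way body the browser's `EventSource` can consume and auto-reconnect to, with no handshake or socket lifecycle to manage.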
What's next
Multi-turn evals. The current rubric is single-shot. The honest version of "is this model better" is multi-turn, and the workspace should make that the path of least resistance.
The workspace is the eval. If shipping the eval is hard, you'll ship without one. — readme, current