
Something for the weekend - introducing 'speechos'. What I learned dictating 203,000 words in 44 days

George Lawton, April 24, 2026
Summary:
Over 44 days of dictating, I produced more than 203,000 words and almost no finished work. What I did produce was a new vocabulary: 'speechos', outpourings, and a felt sense of what the next interface with AI might actually be.


Over the last 44 days I have dictated 203,000 words at a pace of around 144 words a minute. To be clear, I produced very little finished work in the process, but I did get some glimpses of what new paradigms for the future of work might look like.

Doing the math, those 203,000 words, at an average story length of 1,500 words, come to roughly 135 articles. But over the last month I filed only a handful, exploring instead how new processes might inform higher-quality stories grounded in my own felt sense and a better fact-checking process.

This whole process started with wondering how an emerging paradigm among data analysts, what Plotly calls vibe analytics, might inform other kinds of work. Vibe analytics is itself a riff on the serious play that Michael Schrage, a research fellow at MIT Sloan's Initiative on the Digital Economy, has been championing since his 2000 book of that name. In my case, I started to wonder what happens when the same kind of serious play informs higher-quality storytelling rather than just increasing output.

The best metaphor for the 44 days came to me by accident, mid-dictation:

A state shift akin to the way a boat sort of moves along in the water at a slow pace. It's kind of rough, and then if it has a hydrofoil underneath it, it sort of goes above the water and it's faster and smoother.

Faster is what you notice first. Smoother changes the work. The moment the hydrofoil comes up under you, the drag drops in a way that registers in the body before it registers as output. Here is what that felt like from inside the very first long voice-dictated session:

This whole thing is just me talking to my computer. It's a little easier than trying to type, and it's faster, and I can feel the words flowing out. There are so many words. I'm just amazed at how fast these words are coming in.

That is not a productivity observation. That is someone noticing a state shift in real time. The productivity was a byproduct.

From fiddly to freedom

I have been chasing some version of this for a long time. I gave Nuance's Dragon Dictate a few tries over a decade ago, once it got accurate enough to start saving time, though it took the right headset, a lot of training, and a desktop machine that slowed to a crawl. Then I added NCH's Express Scribe with a foot pedal to clean up interview recordings. I had high hopes for Word's native dictation when Microsoft finally added it, and higher hopes still when Microsoft acquired Nuance in 2022. But sadly, Dragon development has essentially stalled since, Word's dictation has remained basic, and for various reasons I found myself abandoning my latest dictation experiment after much effort.

Nearly a year ago, I wrote about how fragmented the transcription landscape had become. Each tool was better at one thing, yet none of them talked to each other, and you still ended up doing half the stitching in Word. A few months ago I stumbled onto a new generation of dictation tools like Wispr Flow, Typeless, and WhisperX, which inspired a new investigation into whether this time might be different.

My current observation is that these new tools have certainly moved the needle, yet still have some interesting gaps if we want to reimagine ways of work that feel more like serious play than frustrating tool integration or cautious dictation. Along the way I found myself inventing a new vocabulary to describe the shape of artifacts and processes that might make work a little more fun, while also being grounded in my own felt sense, the facts, and the counterfactuals that help me weed out flaws in my thinking earlier in the process.

New abstractions required

One early shift was exploring what might happen if I gave myself permission to just start talking about things, rather than the traditional dictation approach of over-focusing on the end goal. This seems to be the level that all dictation vendors are designing for today: talk instead of type, get the same artifact faster. For sure, this is a reasonable upgrade since it means less wrist strain, more words per hour, and the input method stops being the bottleneck.

But then one day I found myself musing about what might happen, and what kind of shift might open up, if I stopped trying to edit myself as I normally do when trying to deliver a finished product on the first pass. This was the sort of gem that popped out, suggesting a whole new level where the process of actually writing could flow out first and be edited and organized after the fact:

What if you could just be doing it and clean everything up after the fact? Organizing everything after the fact so that you could focus on how do I bring my felt sense to the matter or the topic at hand, rather than adding that internal friction of trying to get everything exactly right on the first go.

Or, in the software developer's language I kept reaching for:

Like people talk about in software development, where you get the bones of a thing and you see what lands, how it works, and then you kind of refine it through a process of improvement, rather than trying to get to the finish line from the get-go.

Iterative refinement versus waterfall. The internal editor sits down while the internal writer goes for a walk.

There is also a precedent for the second-level workflow, if you are old enough to have seen it in action. In the early 2000s, before keyboards fully won, I had a friend, a labor relations negotiator, who was one of the few people I personally knew who could actually dictate well:

He would get a thought or two down and then pause, gather his thoughts a bit more. Through this process he generates something that could become a letter and then he'd send it off to his secretary to actually write it, and they'd probably add another letter of correction on top of it.

That workflow disappeared because we all got keyboards and were expected to become our own secretaries. Now it is coming back. The secretary is an LLM. And here is the part the vendors have not quite caught up to: the LLM is not there to correct me. It is there to catch me. The way a good secretary also made their boss's thinking better by cleaning up around the edges of it, and occasionally sending back a note of correction that changed the letter.

Why felt sense matters

A word I keep using, and that deserves a short unpacking before the rest of this piece lands: felt sense.

The philosopher Eugene Gendlin coined the term to describe the bodily knowing that shows up before you have words for it. A writer working from felt sense knows a sentence is off before she can say why. A painter knows the proportion is wrong before the eye resolves the error. It is not intuition in the mystical register. It is a perfectly ordinary, trainable channel of information that most knowledge work asks us to ignore.

Most professional writing is done almost entirely from head sense. You know what the argument is. You know what the editor wants. You know which keywords the piece needs. You assemble accordingly. The result is competent and often useful, and also, frequently, a little dead. The reader can feel when a piece was manufactured from the outside in, even if they cannot articulate why.

Felt sense is what tells you a sentence is dishonest even when it is technically accurate, or that an interview answer is evasive even though the words check out, or that the story you are writing is not actually the story you are in. Poets work this channel constantly. Good journalists do too, though they are rarely credited for it. When a piece of writing lands for a reader, what lands is usually the felt sense the writer kept faith with, transmitted through the words.

The typing-plus-editor workflow suppresses this. Every keystroke is a micro-contraction, every autocorrect a small interruption, every paragraph a chance to check yourself out of the body and into the outline. Dictation, done well, lets the channel stay open. You can feel the sentence arriving before it arrives. You can notice when the thread goes cold. You can, crucially, notice when you are performing versus when you are telling the truth. That is the shift I keep pointing at.

The genesis of outpouring

When I first started doing these long voice-dictated sessions, I was calling what I produced rants. Hunter Thompson quality. Unhinged. Letting it pour. I brought the first proper voice-dictated piece, about 9,000 words dictated in a single evening, to Claude and introduced it using the word rant.

On first reading, Claude called the piece a portrait, a writer taking himself apart in real time. I sat with that and spoke back that rant carried a kind of self-aggrandizing pejorative quality, like I was apologizing in advance for what was coming. But portrait felt wrong too. Too refined. Too much the painter's considered eye. The thing I had produced was more primal and rough than that. Claude suggested words in the vicinity: raw feed. Outpouring. First voice. Unguarded. Spill. The thing that comes before the thing.

I read that and spoke back: outpouring feels more resonant for now. Let's call it that instead of rant. Claude's closing line landed it: an outpouring isn't something to be cleaned up or organized, it's something to be received.

Adopting that term for these free flows changed what I permitted myself to do. A rant is throwaway. An outpouring has standing. Over the last month I've accumulated dozens of these, starting in one place, like pondering the implications of a technology or the felt sense of one company's strategy versus another, then musing on how it shows up in my direct experience in various ways.

AI jamming

In that vibe analytics piece I wrote a few months ago, Schrage tells me the new tools bring out the best in you and amplify the worst in you, whatever your cognitive style may be. He maps the eras of business analytics this way: the spreadsheet era asked what happened, the dashboard era asked why did it happen, and the vibe era asks what insights emerge if we explore together.

This framing got me wondering what might become possible when voice + LLM replaces keyboard + grammar checker. The metaphor I keep reaching for is musical: AI jamming, not in the sense of congestion but in the sense of two musicians finding a groove neither arrived with. One plays a phrase. The other answers. Something neither had in mind becomes the thing. The outpouring is the scratch track, the take where you are finding the song, not performing it. The LLM is the other player in the room.

Three things this unlocks that typing-plus-AI does not:

  1. The rhythm of exchange changes. Shorter distances between thought and response. Closer to conversation than to correspondence.
  2. The asymmetry is useful. Neither player replaces the other. The human brings felt sense, body, stakes, judgment. The LLM brings pattern recognition, recall, and the instant rewrite of a paragraph said once and not wanted again.
  3. The output is not the point. Jazz musicians tape the jam. Some tapes become albums. Most are the work that made the albums possible. The 203,000 words are mostly the tapes. That ratio is not failure, it is the ratio of practice to performance.

Schrage's what insights emerge if we explore together turns out to be answerable not just by querying the data, but by playing with the tool that queries the data, and playing with what lands along the way.

'Speechos'

A different category of error shows up in voice than in typing. In one outpouring I leaned on the idea of a 'speecho' to flag a kind of error the dictation tool itself can never catch:

Different modes of errors show up in speech than in typing. In typing it's like the fat fingers: you get two keys wrong. When you're dictating, the phonemes are wrong and they're mismatched, and pretty much all of the grammar and the spell checkers are looking for the wrong category of error. I've come up with a term that maybe it will take off: speechos, as opposed to typo.

Here's one flagged in the outpouring as it happened:

Funny noticing just now 'Chad', that's a Wispr error. Should be Jud, who is the one I have been talking about for the last bit. Make sure to flag all of the things like that.

'Speechos' are not fixable at the transcription layer. Chad and Jud are both plausible names; only the surrounding meaning tells you which one was meant. Most vendors still can't do this. An LLM can, if you let it, because it is operating one level up, where the context of the whole piece lives. In the prior piece on the future of transcription, I observed how difficult it was to tease out a transcription error where multiple engines mis-transcribed totally rational as totally irrational. It was only after I passed the original audio through a background-noise suppression filter that even I could confidently decide what was originally spoken. That was the same category of error. I just didn't have the word for it yet.
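
To make that concrete, here is a minimal sketch of what a context-level speecho pass might look like, assuming the OpenAI Python SDK; the model name and prompt wording are my own illustrations, not anything a dictation vendor ships today:

```python
# A sketch only: flag phoneme-level 'speechos' by asking an LLM to
# read for meaning. The model name and prompt are illustrative choices.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SPEECHO_PROMPT = """The text below was dictated, not typed.
The errors to look for are phoneme-level, not keyboard-level:
plausible words that sound like the intended word but do not fit
the surrounding meaning (e.g. 'Chad' for 'Jud'). List each suspect
word with its sentence and the most likely intended word.
Do not rewrite anything else."""

def flag_speechos(transcript: str) -> str:
    """Return the model's list of suspected dictation errors."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model will do
        messages=[
            {"role": "system", "content": SPEECHO_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content

# e.g. print(flag_speechos(open("outpouring.txt").read()))
```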

We also need a new class of grammar checker looking for these, plus a different filler-word logic. I once interviewed a source who used no as a filler, a truncation for you know, where most people say um. Every instance had to be listened to for intent, and it cost me an hour of cleanup on a single interview.
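
Not all of it needs an LLM, either. A deliberately dumb first pass, with a per-speaker token list invented here for illustration, just surfaces every candidate so the hour of listening at least starts from a map:

```python
# Flag every line containing a speaker's known filler tokens so a
# human can jump straight to the instances that need listening for
# intent. The default token list is illustrative, not universal.
import re

def flag_fillers(transcript: str, tokens=("no",)) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs containing candidate fillers."""
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, tokens)) + r")\b",
        re.IGNORECASE,
    )
    return [(n, line) for n, line in enumerate(transcript.splitlines(), 1)
            if pattern.search(line)]

# e.g. flag_fillers(open("interview.txt").read(), tokens=("no", "you know"))
```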

My take

A few weeks ago I wrote about Andrej Karpathy's argument that the real value in agentic AI is not in the agents themselves. It is in the boundary between human craftsmanship and machine execution. His example is a prose document that tells an agent what to learn about. You improve the document. The agent gets better. The boundary is where the work is.

Voice + LLM does something similar to that boundary, but from the other side. When you move to voice, the human side of the boundary changes posture. You can dictate while looking out a window. You can dictate while walking. You are not pinned to a screen. The friction that typing introduces drops away, and what fills the space is felt sense. The body comes back online because it no longer has to sit in the particular shape a keyboard demands.

The vendors are still optimizing for the first level. The interesting work lies in streamlining the infrastructure to support this second level. The reason most of us have not seen it yet is that the first level runs on productivity metrics, and the second level runs on something the productivity metrics cannot measure. Here is one practical thing this second-level approach enables: running Wispr, Otter, and Speechmatics in parallel and using an LLM to reconcile the three transcripts against each other. It costs an extra dollar per hour, but it saves me half an hour of interruptions and frustration trying to make sense of the various transcription errors in the flow of work.
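
For the curious, here is roughly what that reconciliation step looks like as a sketch, under the same assumptions as above. The three transcripts are plain-text exports I pull from each service by hand; the file names and prompt are illustrative, not any vendor's API:

```python
# Merge three engine transcripts into one reviewed draft, using an
# LLM to vote on disagreements and leave an audit trail behind.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

RECONCILE_PROMPT = """You are given three transcripts of the same
audio from three different engines. Where they agree, keep the text
as-is. Where they disagree, choose the reading that best fits the
surrounding meaning and append a bracketed note, e.g.
[engines split: chose 'rational' over 'irrational'], so a human
can review every judgment call."""

def reconcile(paths: list[str]) -> str:
    """Return a single transcript reconciled from several exports."""
    bundle = "\n\n---\n\n".join(
        f"Engine {i + 1} ({p}):\n{Path(p).read_text()}"
        for i, p in enumerate(paths)
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": RECONCILE_PROMPT},
            {"role": "user", "content": bundle},
        ],
    )
    return response.choices[0].message.content

# e.g. reconcile(["wispr.txt", "otter.txt", "speechmatics.txt"])
```

The bracketed notes matter more than the merge itself; they turn silent guesses into a reviewable list, which is the whole point of the second level.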

One word of governance caution that came up during this: in doing this kind of work, you are sending audio to one company's cloud and text to another's. Wispr's own voice processing, for instance, runs through third-party providers including OpenAI's servers. The US CLOUD Act touches both, and anyone in healthcare, legal, or finance probably wants on-prem, or at minimum the training-data toggle off.

Also, the tooling lags the workflow. Between the Rode mic, AI pens like Plaud, and Bluetooth headsets, a physical on/off switch remains the most underappreciated design primitive in the category. A virtual button on your phone works OK, but then I have to dig up the backup copy from the dictation app when the display times out, or worse, lose the thought when something brushes against the screen. The computer-based versions let you start and stop with a hotkey, but then silently time out after six minutes.

If you want to try the shift yourself, the cheapest experiment is this: Wispr Flow for capture, dictate a week of morning reflections straight into a private doc, and let Claude or ChatGPT tidy the phonemes later.

The bigger implication of these new kinds of workflows is not that we will write faster. Some of us should. Some of us probably shouldn't. The real story is that we will think differently about what thinking is when it no longer needs a keyboard as its intermediary. The 203,000 words are not 203,000 words of finished work. They are 203,000 words of thinking-while-making that used to happen in the head and never made it to the page. That is the shift. The vendors have not noticed yet. When they do, we may discover new ways of working that feel less like the dreaded tightness of deadline delivery and more like the open, generative lightness of musical jamming, held honest by the felt sense of the human in the room.

A note on process: diginomica's strict policy is that we don't write stories generated with ChatGPT or other LLMs. The thinking, structure, and voice of this piece are my own, dictated, outlined, and shaped across many drafts. Claude served as the secretary-LLM this piece describes: catching 'speechos', tightening prose, and surfacing my own earlier material back to me when relevant. The passages clearly marked as Claude's words in the section on how this piece got its working vocabulary are verbatim, from a conversation I've kept the transcript of.

Image credit - Pixabay
