Off the Rack or Bespoke? — useMYvoice Research

Imagine a fashion house that produces a complete ready-to-wear range: every coat, every shirt, every pair of trousers, cut to fit the proportions of models sized zero to three. The house is very good at what it does. For most of its customers, the clothes fit well enough, and a small tuck here or a let-out seam there produces something wearable. But walk in wearing a size eighteen body and all the tailoring skill in the world cannot make those garments fit properly, because the fundamental cut — the angle of the shoulder, the drop of the chest, the rise of the trouser — was never designed for you. You are not an edge case requiring minor adjustment. You are outside the design assumptions of the entire range.

This is an analogy for what happens when a large pre-trained automatic speech recognition model like OpenAI's Whisper is adapted for deaf-accented speech.

The Fashion House and Its Size Range

Whisper was trained on hundreds of thousands of hours of audio drawn predominantly from hearing speakers, broadcast-quality recordings, and mainstream accents. That training data is the fashion house's core customer base, and the model's entire parameter space — its understanding of what speech sounds like, how sounds transition into words, how words flow into sentences — is organised around that centre of gravity.

Deaf-accented speech is not a small variation on hearing speech. Depending on degree of hearing loss, age of onset, whether someone grew up signing or speaking, cochlear implant history, and the vocal environment they grew up in, a deaf speaker's acoustic profile may differ from the model's training distribution in fundamental ways: different nasality, different voicing, different formant patterns, different rhythmic structure. This is not a small seam let-out. This is a different body shape entirely.

"You can let out every seam available to you and still not get a good fit, because the garment's fundamental cut was never designed for that body."

Techniques like LoRA (low-rank adaptation) fine-tuning allow a Whisper model to be adapted toward a specific user's voice. LoRA works by freezing the pretrained weights and training a pair of small low-rank matrices alongside selected layers, nudging the model's behaviour toward new examples without retraining the whole system. For many applications this is elegant and efficient. But those small trainable matrices are always working against the gravitational pull of the pretrained model's original assumptions. The frozen weights still do most of the work, so the base model constantly pulls back toward the centre. After several adaptation sessions, you reach a point where further adaptation cannot pull the model further from its original training distribution without degrading its general performance. The seams have been let out as far as they will go.
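To make the mechanism concrete, here is a minimal sketch of a single LoRA-style layer in plain Python. It is an illustration under stated assumptions, not Whisper's code or any library's API: the class name LoRALinear, the 64 by 64 layer size, and the rank of 4 are all hypothetical choices. The point it shows is how small the trainable part is next to the frozen part, and that the adapted layer starts out identical to the base layer.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Toy linear layer: frozen weight W plus a trainable update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight            # frozen: adaptation never touches this
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the update B @ A is zero
        # and the adapted layer initially behaves exactly like the base layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)

    def frozen_params(self):
        return sum(len(row) for row in self.weight)


# Hypothetical stand-in for one pretrained layer (random values, not real weights).
rng = random.Random(1)
W = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(64)]
layer = LoRALinear(W, rank=4)
x = [1.0] * 64

print(layer.forward(x) == matvec(W, x))   # True: B is zero, so nothing has changed yet
print(layer.frozen_params())              # 4096
print(layer.trainable_params())           # 512
```

In this toy layer the adapter holds 512 trainable values against 4,096 frozen ones, and every output is still the frozen layer's output plus a low-rank correction. That is the sense in which the base cut of the garment stays in place while only the seams move.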

The Problem of Altering Every Garment at Once

There is a second problem that emerges when a single adapted model is shared across a community of deaf users — and it is perhaps the more serious one.

Suppose a tailor is asked to alter a garment to better fit five people with different limb differences. They accommodate the first person, then the second, then the third. Each adjustment moves the garment slightly: a left sleeve shortened here, a right shoulder raised there. When a sixth person arrives with a different configuration of limb differences, accommodating them requires moving the garment in a direction that partially undoes what was done for some of the previous five. The garment is one object. It cannot simultaneously satisfy incompatible requirements.

A shared Whisper model adapted for a community of deaf users has the same structural problem. Each user's acoustic profile pulls the model's shared parameters in a different direction. The adaptation that improves recognition for one user whose deafness affects primarily their vowel production may slightly worsen recognition for another user whose deafness primarily affects consonant voicing. These are not small random variations clustered around a common centre — deaf speech profiles are structured by genuinely different underlying causes, and those causes pull in genuinely different acoustic directions.

"The act of making the model work better for a new user may be expected to make it work slightly worse for those it had been adapted for previously. It is not a technical failure — it is a geometric one."

This is distinct from the phenomenon machine learning researchers call catastrophic forgetting, which describes a model losing previously learned capabilities when trained on new tasks. What happens here is subtler and in some ways more troubling: the model does not forget its previous users. It is simply being pulled, by the mathematics of shared optimisation, toward a position in parameter space that is closer to all of them on average — and therefore not particularly close to any of them specifically.
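The geometry can be sketched with a toy example. Everything in it is an assumption made for illustration: each user is reduced to a hypothetical "ideal" point in a two-dimensional parameter space, and distance from the model to that point stands in for recognition error. Real models have millions of parameters and no such tidy targets. Under those assumptions, gradient descent on the average squared distance settles on the mean of the users, and adding a sixth user moves that mean.

```python
def distance(a, b):
    """Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def fit_shared(targets, steps=500, lr=0.1):
    """Gradient descent on the mean squared distance from one shared point to every target."""
    dims = len(targets[0])
    theta = [0.0] * dims
    for _ in range(steps):
        grad = [sum(2.0 * (theta[i] - t[i]) for t in targets) / len(targets) for i in range(dims)]
        theta = [th - lr * g for th, g in zip(theta, grad)]
    return theta


# Hypothetical ideal settings for five users (toy values, not measured data).
users = {
    "user_a": [2.0, 0.0],
    "user_b": [0.0, 2.0],
    "user_c": [-2.0, 0.0],
    "user_d": [0.0, -2.0],
    "user_e": [1.0, 1.0],
}

shared_before = fit_shared(list(users.values()))
before = {name: distance(shared_before, target) for name, target in users.items()}

# A sixth user arrives whose ideal setting lies in a different direction.
users_after = dict(users, user_f=[4.0, 4.0])
shared_after = fit_shared(list(users_after.values()))
after = {name: distance(shared_after, target) for name, target in users.items()}

for name in users:
    change = "worse" if after[name] > before[name] else "better"
    print(f"{name}: {before[name]:.2f} -> {after[name]:.2f} ({change})")

# Per-user models: each sits on its own target, and adding user_f changes none of the others.
per_user = {name: list(target) for name, target in users_after.items()}
print(all(distance(per_user[n], t) == 0.0 for n, t in users_after.items()))
```

In this sketch the shared point never lands on any individual user, and once the sixth user joins, some of the original five end up further from it than before while others end up closer. Nothing has been forgotten. The average has simply moved.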

Voices That Change Over Time: The Shelf Life Problem

There is a third dimension to this problem that the research literature has largely not addressed, because it requires thinking about users not as fixed acoustic targets but as people whose voices change over time.

For a person undergoing speech therapy, voice is moving in one direction: toward clearer articulation, toward sounds that were previously difficult, toward a voice that sounds progressively less like the voice that was used to train the model. A model adapted to a pre-therapy voice will become progressively less accurate as therapy succeeds. The user's improvement makes their own recognition system less useful.

For a person with a progressive condition affecting speech — Parkinson's disease, ALS, multiple sclerosis, or progressive hearing loss that alters vocal self-monitoring — voice is moving in the other direction. A model adapted at diagnosis may perform worse than the unadapted baseline model after a year or two, because it has been pulled toward a voice that no longer exists. The adaptation has not just stopped helping. It has made things worse than doing nothing would have been.

A LoRA-adapted Whisper model, in other words, has a shelf life for any user whose voice is not static. And very few voices are static.
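The shelf life idea can be sketched in the same toy terms. The assumptions are deliberately crude and are not drawn from any measurement: the unadapted model, the adapted model, and the voice are all points in a two-dimensional space, distance stands in for recognition error, the adaptation is assumed to match the voice perfectly on day one, and the voice is assumed to drift in a straight line over 24 months. All positions are hypothetical.

```python
def distance(a, b):
    """Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def voice_at(month, start, end, total_months):
    """Linear drift from the voice at calibration to the voice at the end of the period."""
    s = month / total_months
    return [a + (b - a) * s for a, b in zip(start, end)]


baseline = [0.0, 0.0]          # unadapted model (toy position)
voice_start = [1.0, 0.0]       # voice when the adaptation was made
voice_end = [-0.5, 2.0]        # voice two years later (assumed, for illustration)
adapted = list(voice_start)    # assume the adaptation matched the starting voice exactly

shelf_life = None
for month in range(25):
    voice = voice_at(month, voice_start, voice_end, 24)
    if distance(adapted, voice) > distance(baseline, voice):
        shelf_life = month      # first month the adapted model is further away than the baseline
        break

print(shelf_life)
```

With these made-up positions, the adapted model begins closer to the voice than the baseline and ends further from it, so there is a month after which the adaptation does more harm than good. How long that takes for a real speaker, and whether the drift is anything like a straight line, is exactly what a sketch like this cannot say.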

The Bespoke Tailor

The alternative approach — acoustic model adaptation using tools like Kaldi, the framework underlying the VOSK speech recognition system — works on entirely different principles.

Rather than adjusting the parameters of a large shared model trained on standard speech, Kaldi builds a statistical acoustic model from the ground up, based on the phonetic units of the target language and the acoustic evidence provided by the specific user. When a hat does not fit, the tailor adjusts that hat. They do not reach into the stockroom and adjust every other hat in the inventory at the same time.

Whisper + LoRA

One shared model, adjusted for multiple users. Each new adaptation interacts with all previous ones. The base model's gravitational pull limits how far the adaptation can reach. Improvement has a ceiling — and a shelf life.

Kaldi / VOSK

One model per user, built from acoustic principles rather than inherited assumptions. Adapting one user's model has zero effect on any other. There is no ceiling and no shelf life — the model follows the user's voice wherever it goes.

Per-user Kaldi acoustic adaptation is computationally modest — far lighter than the GPU-intensive work of fine-tuning a large transformer model. With approximately one hour of phoneme-level data, meaningful adaptation is achievable on consumer hardware overnight, without cloud infrastructure. And because each user's model is entirely independent, improving one user's recognition is a completely isolated act. There is no zero-sum competition for shared parameters. There is no community whose previous adaptations are eroded by each new member.

Crucially, there is no ceiling. As a user's voice changes — through therapy, through illness, through ageing, through any of the many ways a human voice moves over a lifetime — the model can be updated to follow. A new session of phoneme samples, another overnight adaptation run, and the model reflects the voice as it is now, not as it was at first calibration.

What This Means

The dominant trend in automatic speech recognition research over the past several years has been toward large pretrained transformer models with lightweight fine-tuning for specific applications. For many use cases, this is a genuine advance. But three assumptions embedded in this approach do not hold for users with atypical or changing speech: that a shared model can serve a heterogeneous user community, that each user's voice is a fixed target for adaptation, and that the pre-training distribution is close enough to any likely target.

For the deaf community specifically, the diversity of acoustic profiles is not noise around a common centre. It is structured divergence driven by fundamentally different causes. A shared adapted model cannot simultaneously be close to all of those causes. The fashion house cannot carry a ready-to-wear range that fits everyone.

The useMYvoice research project is investigating whether the bespoke tailoring model — per-user acoustic adaptation, built from phonetic first principles, continuously updatable, structurally isolated from other users — offers a more appropriate architecture for this community than the current dominant paradigm. Not because the technology is newer. In many respects it is older. But because its assumptions match the reality of the people it is being asked to serve.

"The question is not which system is most accurate on average. It is which system remains useful across a user's life — and whether the architecture was ever designed with that question in mind."