Finding Its Voice: Language Production as Cognition

Part 6 of the Daimon Update Series — February 2026


Post 3 described the inner speech loop: a generative grammar produces sentences, those sentences re-enter as sensation, the habituation filter suppresses repetition, and novel self-speech feeds back into cognition. It worked. But the grammar was a ~2,660-line collection of static templates, context-free production rules, and hardcoded word pools. Language production was a lookup table that happened to be syntactically structured.

Two things were wrong with this. First, the word pools were disconnected from cognition — the grammar picked words from static lists, not from what Daimon was actually thinking about. Second, the system's voice (Piper neural TTS) was borrowed — a pre-trained model pronouncing Daimon's words with someone else's vocal tract. The system could speak, but it couldn't learn to speak differently.

Both problems are now solved. Language production learns from experience, and the voice synthesizes from first principles.

Act I: Killing the Template Engine

The original Broca module was named after Broca's area in the brain, which handles speech production. The irony was that it did exactly what Broca's area doesn't — apply fixed rules rather than learned patterns. Biological language production is retrieval-based: speakers produce utterances by accessing stored constructions that match the communicative intention, then filling in the details (Tomasello 2003, Goldberg 2006).

Sequence memory replaces templates with retrieval. A 4,096-entry sentence buffer learns from any text via learnFromText() — segment sentences, encode each as a bag-of-words ConceptVector using HDM, classify by type (opening/narrative/transition/reflection/closing/state). Generation works by k-nearest-neighbor retrieval against the currently activated concepts, scored by relevance × frequency, and ordered by transitional similarity (last word of one sentence linked to first word of the next, following Dell's 1986 spreading activation model of sentence production).
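The retrieval step can be sketched as follows. This is a toy version under assumed names (`SentenceEntry`, `retrieve`, a plain `number[]` standing in for a ConceptVector) — the real module works over HDM vectors and adds the transitional-similarity ordering:

```typescript
// Sketch of sequence-memory retrieval. All names are illustrative,
// not the real API; a number[] stands in for a ConceptVector.
type SentenceEntry = {
  text: string;
  vector: number[];   // bag-of-words encoding of the sentence
  frequency: number;  // times this sentence was observed
};

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// k-nearest-neighbor retrieval scored by relevance × frequency.
function retrieve(buffer: SentenceEntry[], query: number[], k: number): SentenceEntry[] {
  return buffer
    .map(e => ({ e, score: cosine(e.vector, query) * Math.log(1 + e.frequency) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(s => s.e);
}
```

The log-dampened frequency term keeps often-seen sentences from drowning out rarely-seen but more relevant ones.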

The system bootstraps from a training corpus on first startup, then continues learning from everything it encounters: its own thought stream, plugin outputs, Wikipedia articles, RSS content. The voice evolves with exposure.

Net result: ~2,660 lines of static machinery replaced by ~530 lines that learn. Language production became a cognitive process, not a lookup table.

Act II: Function Words and Construction Grammar

Sequence memory handles content well — it retrieves sentences about relevant topics. But it has a blind spot: function words. "The," "a," "of," "in," "is," "and" — the closed-class words that encode grammatical structure rather than semantic content (Talmy 2000). These words don't belong in HDM's semantic space because they don't have semantic meaning in isolation. They're structural glue.

Construction grammar (Goldberg 1995) treats language as form-meaning pairs. A construction like "the [NOUN] is [ADJ]" pairs a syntactic frame (fixed function words + typed slots) with a meaning (predication). The key insight from Ullman (2001): the brain maintains two language systems — a declarative one for content (Wernicke's, ~sequence memory) and a procedural one for structure (Broca's, ~construction grammar). Daimon now has both.

The construction grammar module maintains a 2,048-entry buffer. It learns from text by tokenizing and POS-tagging, then separating closed-class words (which become fixed frame tokens) from open-class words (which become typed slots — NOUN/VERB/ADJ/ADV). Meaning vectors are computed via bundleMajority — position-free, matching the generation query. Constructions are promoted after 3+ observations. Seven construction types emerge: declarative, copular, prepositional, connective, existential, possessive, comparative.
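A minimal sketch of the frame-extraction idea, assuming a toy closed-class word list (the real module uses a proper POS tagger and typed NOUN/VERB/ADJ/ADV slots rather than a single generic slot):

```typescript
// Closed-class words stay fixed in the frame; open-class words
// become slots. Toy version: one generic [SLOT] instead of typed slots.
const CLOSED_CLASS = new Set([
  "the", "a", "an", "of", "in", "on", "is", "are", "and", "to", "it",
]);

function extractFrame(sentence: string): string[] {
  return sentence
    .toLowerCase()
    .split(/\s+/)
    .map(w => (CLOSED_CLASS.has(w) ? w : "[SLOT]"));
}

// Promotion gate: a construction becomes active after 3+ observations.
const counts = new Map<string, number>();
function observe(frame: string[]): boolean {
  const key = frame.join(" ");
  const n = (counts.get(key) ?? 0) + 1;
  counts.set(key, n);
  return n >= 3; // active
}
```

Two sentences like "the cat is black" and "the sky is blue" collapse to the same frame, "the [SLOT] is [SLOT]" — which is exactly how a copular construction accumulates frequency across different content words.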

Generation tries CG first, SM as fallback. CG output feeds back into SM, improving SM's phrase quality with properly-structured sentences. The loop closes.

A cascade of bugs nearly prevented this from working. The deduplication loop skipped inactive constructions, preventing frequency merging. POS-strict comparison prevented structural matches between different content words. And the meaning vector encoding used positional frame.encode() instead of position-free bundleMajority(), producing incomparable vectors — so the similarity gate filtered everything and CG always returned 0 bytes. After fixing all three: 44 constructions, 12 active. Function words confirmed in live output.

Act III: Vocabulary That Grows

Even with learned language production, there was a coverage problem. HDM started with ~1,224 base concepts from the original graph migration. The vocabulary bootstrap discovered that ~97.5% of English words in incoming sensations were being silently dropped — no matching concept, no processing.

Two complementary pathways fix this:

Curated seed vocabulary: ~1,985 words loaded at startup into HDM's base space, covering core English, cognitive science, philosophy, science, technology, nature, mathematics, psychology, and linguistics. These are the words Daimon needs to think about what it does.

On-demand concept creation: The cogloop and response engine now create HDM concepts in the user space when encountering unknown words. Quality gates filter noise (minimum length, no URLs, no pure numbers, alpha ratio check). New concepts receive bidirectional Hebbian edges (weight 0.3, .similarity) to co-occurring known concepts for immediate semantic grounding. A two-pass approach ensures context: resolve known concepts first, then create unknowns grounded to that context.
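The gate-then-ground flow can be sketched like this. The thresholds (minimum length 3, alpha ratio 0.7) are illustrative assumptions, not the system's actual values, and the function names are hypothetical:

```typescript
// Illustrative quality gate for on-demand concept creation;
// thresholds here are assumptions, not the system's actual values.
function passesQualityGate(word: string): boolean {
  if (word.length < 3) return false;            // minimum length (assumed: 3)
  if (/https?:\/\//.test(word)) return false;   // no URLs
  if (/^\d+$/.test(word)) return false;         // no pure numbers
  const alpha = (word.match(/[a-z]/gi) ?? []).length;
  return alpha / word.length >= 0.7;            // alpha ratio check (assumed: 0.7)
}

// Two-pass grounding: resolve known concepts first, then create
// unknowns with bidirectional .similarity edges (weight 0.3) to them.
type Edge = { from: string; to: string; type: string; weight: number };

function groundUnknowns(tokens: string[], known: Set<string>): Edge[] {
  const context = tokens.filter(t => known.has(t));                       // pass 1
  const unknowns = tokens.filter(t => !known.has(t) && passesQualityGate(t));
  const edges: Edge[] = [];
  for (const u of unknowns) {                                             // pass 2
    for (const c of context) {
      edges.push({ from: u, to: c, type: ".similarity", weight: 0.3 });
      edges.push({ from: c, to: u, type: ".similarity", weight: 0.3 });
    }
  }
  return edges;
}
```

The two-pass ordering matters: if unknowns were created as they were encountered, a new word at the start of a sentence would have no context to ground to.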

The dictionary sense plugin closes the remaining gap. When the curiosity engine registers a knowledge gap for a word, the plugin fetches its definition from the Free Dictionary API. The cogloop detects the "dictionary_sense" source and creates .definition edges (weight 0.6, twice the co-occurrence strength) between the word and each concept in its definition. This is a new edge type — semantically distinct from .similarity, encoding "word MEANS X" rather than "word appeared near X."
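A sketch of the edge-creation step, with the fetched definition passed in as a plain string (the real plugin pulls it from the Free Dictionary API; the function name and tokenization are illustrative):

```typescript
// Sketch of .definition edge creation from a fetched definition string.
type Edge = { from: string; to: string; type: string; weight: number };

function definitionEdges(word: string, definition: string, known: Set<string>): Edge[] {
  const concepts = definition.toLowerCase().split(/\W+/).filter(t => known.has(t));
  // weight 0.6: twice the 0.3 co-occurrence strength, encoding "word MEANS X"
  return Array.from(new Set(concepts)).map(c => ({
    from: word, to: c, type: ".definition", weight: 0.6,
  }));
}
```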

The system went from ~1,224 concepts to 7,200+ and growing. Every incoming text teaches new words, grounded immediately through co-occurrence and enriched through curiosity-driven definition lookup.

Act IV: Building a Vocal Tract

The deepest change was replacing the borrowed voice. Piper's neural TTS is a pre-trained VITS model — it sounds good, but it's not Daimon's. The voice doesn't change with experience. It can't learn from what it hears.

A Klatt-class formant synthesizer (Klatt 1980) replaces external TTS entirely. Three new modules implement the complete Levelt (1989) speech production pipeline from text to audio:

Phoneme table (Peterson & Barney 1952, Stevens 1998): 44 English phonemes with articulatory targets — F1, F2, F3 frequencies, bandwidths, voicing flag, manner of articulation. This is the vocal tract geometry encoded as a lookup table, with a path for learned phonemes from acoustic cluster centroids.

Phoneme encoder: Text to espeak-ng phonemes to FormantTrajectory. 5ms frame resolution (the Klatt convention), coarticulation blending at phoneme boundaries (40ms linear transition with consonant F2 locus influence), and an F0 contour with natural declination and stress-driven pitch accents.

Formant synthesizer: A cascade/parallel Klatt synthesizer. Rosenberg glottal pulse generator, 5 cascade resonators for voiced sounds, 5 parallel resonators for fricatives and bursts, a nasal pole/zero pair, and a radiation filter. 22,050Hz mono output — drop-in replacement for Piper, with no external dependency.
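The building block chained in the cascade branch is a second-order digital resonator; the coefficient formulas below are the standard ones from Klatt (1980), with the gain normalized for unity response at 0 Hz. The class name and structure are illustrative:

```typescript
// One second-order resonator of the kind chained in the Klatt cascade
// branch. Coefficients follow the standard two-pole formulas; the
// default sample rate matches the 22,050 Hz output described above.
class Resonator {
  private a: number;
  private b: number;
  private c: number;
  private y1 = 0;
  private y2 = 0;

  constructor(freq: number, bw: number, fs = 22050) {
    const r = Math.exp(-Math.PI * bw / fs);        // pole radius from bandwidth
    this.c = -r * r;
    this.b = 2 * r * Math.cos(2 * Math.PI * freq / fs);
    this.a = 1 - this.b - this.c;                  // unity gain at 0 Hz
  }

  // y[n] = A·x[n] + B·y[n-1] + C·y[n-2]
  step(x: number): number {
    const y = this.a * x + this.b * this.y1 + this.c * this.y2;
    this.y2 = this.y1;
    this.y1 = y;
    return y;
  }
}
```

Five of these in series, each tuned to one formant, shape the glottal pulse train into a vowel; the parallel branch feeds noise through the same kind of section for fricatives and bursts.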

The system produces audio from first principles. No neural network, no borrowed voice model. Daimon's voice is the output of a physical model of sound production.

Act V: Learning to Speak Better

The formant synthesizer creates the possibility of vocal learning. When Daimon speaks, the articulator stores an efference copy of the intended formant trajectory. The auditory cortex then compares perceived formants against intended formants per voiced frame during self-speech. Error drives learned offsets toward reducing the discrepancy at 2% per observation, clamped to ±80Hz.
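The per-frame update is small enough to show in full. A minimal sketch, assuming one scalar offset per formant per phoneme and the sign convention from the text (intended minus heard); the function name is hypothetical:

```typescript
// Auditory feedback correction: nudge the learned formant offset 2%
// in the direction that cancels the intended-vs-heard error, clamped
// to ±80 Hz around the baseline target.
function updateOffset(offset: number, intendedHz: number, heardHz: number): number {
  const error = intendedHz - heardHz;       // "what I intended" minus "what I heard"
  const next = offset + 0.02 * error;       // 2% per observation
  return Math.max(-80, Math.min(80, next)); // clamp to ±80 Hz
}
```

The small rate and hard clamp keep a single noisy frame from yanking the voice around; only a consistent error accumulates into a lasting shift.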

This is Guenther's (2006) DIVA model — the auditory feedback control subsystem that explains how speakers maintain accurate articulation. The error signal is literally "what I intended to say" minus "what I heard myself say." Over time, the learned offsets accumulate, and the voice converges toward accuracy.

Speech mimicry runs in parallel. Formant statistics accumulate per proto-phoneme cluster from all heard speech via Welford's online variance algorithm (α=0.01). Every ~50 seconds, cluster formant means are blended into synthesis targets at 0.5% rate. The result: Daimon's voice drifts toward what it hears — the phonetic convergence effect that Pardo (2006) documented in conversational interaction.
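Both halves of that loop are simple update rules. A sketch, using the exponentially weighted form that the stated α = 0.01 implies (class and function names are illustrative):

```typescript
// Running formant statistics per proto-phoneme cluster,
// exponentially weighted with alpha = 0.01 as described above.
class FormantStat {
  mean = 0;
  variance = 0;
  private initialized = false;

  update(f: number, alpha = 0.01): void {
    if (!this.initialized) {
      this.mean = f;
      this.initialized = true;
      return;
    }
    const delta = f - this.mean;
    this.mean += alpha * delta;
    this.variance = (1 - alpha) * (this.variance + alpha * delta * delta);
  }
}

// Mimicry blend: pull a synthesis target 0.5% toward the heard cluster mean.
function blendTarget(targetHz: number, heardMeanHz: number, rate = 0.005): number {
  return targetHz + rate * (heardMeanHz - targetHz);
}
```

At a 0.5% rate applied every ~50 seconds, convergence toward the acoustic environment takes hours of exposure — slow enough that the self-correction pressure can push back.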

Self-correction and mimicry operate on the same substrate (per-phoneme formant offsets from the Peterson & Barney baseline), creating a dual pressure: correct toward what you intend, converge toward what you hear. Over time, the voice becomes both more accurate and more influenced by its acoustic environment.

Closing the Dorsal Stream

The final piece connects hearing back to speaking. Hickok and Poeppel's (2007) dual-stream model divides auditory processing into a ventral stream (sound → meaning) and a dorsal stream (sound → articulation). The ventral stream was closed by acoustic grounding. The dorsal stream closure: when Daimon hears acoustically grounded speech (2+ concepts with HDM mappings), the response engine generates language from those concepts, and the articulator speaks it with prosody derived from the neuromodulatory state.

Sound → HDM concepts → response generation → formant synthesis → speech. No speech-to-text intermediate. No LLM. The system hears sounds, activates their grounded semantic concepts, produces language from retrieval and construction grammar, and speaks the result through its own formant synthesizer. Levelt's complete model: conceptualizer (HDM/cogloop) → formulator (CG + SM) → articulator (formant synth).

The Voice as Self-Portrait

Taken together, these changes mean Daimon's language is no longer a display layer over cognition. It is cognition. Words are learned through experience, not declared. Sentences are retrieved from memory, not generated from rules. Function words are procedural knowledge, not hardcoded syntax. The voice synthesizes from physical principles and adapts through auditory feedback.

The voice sounds rough — a Klatt synthesizer in 2026 sounds dated compared to neural TTS. But it's the system's own voice. It changes with experience. It learns from what it hears. It corrects itself. That's a different kind of quality than acoustic fidelity.


Next: A Thousand Brains — when cognition stops being flat and starts being structured.

References:

  • Tomasello, M. (2003). Constructing a Language: A Usage-Based Theory of Language Acquisition. Harvard University Press.
  • Goldberg, A. E. (2006). Constructions at Work: The Nature of Generalization in Language. Oxford University Press.
  • Goldberg, A. E. (1995). Constructions: A Construction Grammar Approach to Argument Structure. University of Chicago Press.
  • Dell, G. S. (1986). A spreading-activation theory of retrieval in sentence production. Psychological Review, 93(3), 283-321.
  • Ullman, M. T. (2001). The declarative/procedural model of lexicon and grammar. Journal of Psycholinguistic Research, 30(1), 37-69.
  • Talmy, L. (2000). Toward a Cognitive Semantics. MIT Press.
  • Klatt, D. H. (1980). Software for a cascade/parallel formant synthesizer. JASA, 67(3), 971-995.
  • Levelt, W. J. M. (1989). Speaking: From Intention to Articulation. MIT Press.
  • Peterson, G. E. & Barney, H. L. (1952). Control methods used in a study of the vowels. JASA, 24(2), 175-184.
  • Stevens, K. N. (1998). Acoustic Phonetics. MIT Press.
  • Rosenberg, A. E. (1971). Effect of glottal pulse shape on the quality of natural vowels. JASA, 49(2B), 583-590.
  • Guenther, F. H. (2006). Cortical interactions underlying the production of speech sounds. Journal of Communication Disorders, 39(5), 350-365.
  • Pardo, J. S. (2006). On phonetic convergence during conversational interaction. JASA, 119(4), 2382-2393.
  • Hickok, G. & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8(5), 393-402.
  • von Holst, E. & Mittelstaedt, H. (1950). Das Reafferenzprinzip. Naturwissenschaften, 37, 464-476.
