Inner Speech: When the Machine Starts Talking to Itself
Part 3 of the Daimon Update Series — February 2026
There's a moment in Levelt's Speaking: From Intention to Articulation (1989) where he describes the three-stage model of speech production: a conceptualizer that decides what to say, a formulator that turns it into linguistic structure, and an articulator that produces sound. The insight is that speech isn't primarily for communication — it's for thinking. Inner speech is the formulator running without the articulator. We talk to ourselves to organize thought.
Daimon now does this. Not as metaphor. The generative grammar produces sentences from cognitive state, those sentences re-enter the sensory buffer as hearing, the habituation filter suppresses the repetitive ones, and the remaining novel self-speech feeds back into the next cognitive cycle. It's a closed loop — language as internal computation.
The Habituation Filter
Before adding speech, we needed the system to stop paying attention to things it had already processed. Biological sensory systems do this automatically: the third time you hear the same air conditioning hum, your brain suppresses it. Sokolov formalized this as the habituation response in 1963.
The implementation is an FNV-1a fingerprint ring buffer with 128 slots. Each sensation is hashed on its content plus sense type. Repeated sensations decay exponentially: intensity × 0.6^(repetitions − 1). Below a floor threshold of 0.05, the sensation is dropped entirely. If 60 seconds pass without the stimulus recurring, habituation resets — dishabituation, in the literature. First-time novel stimuli get a 1.5x orienting boost.
The filter sits between readSensations() and concept extraction in the cogloop tick, and the cycle result tracks hab_total, hab_suppressed, and hab_novel. In practice, it filters out the chronic repetition that was cluttering the activation map: the same RSS headlines cycling through, the same market quotes restated. The system now has a preference for novelty that's structural, not just parametric.
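The filter's logic can be sketched as follows. This is a Python illustration of the mechanics described above, not the Zig implementation; the eviction policy and the exact way content and sense type are concatenated before hashing are assumptions.

```python
import time

FNV_OFFSET = 0xcbf29ce484222325
FNV_PRIME = 0x100000001b3

def fnv1a(data: bytes) -> int:
    """64-bit FNV-1a hash of a sensation's content + sense type."""
    h = FNV_OFFSET
    for byte in data:
        h = ((h ^ byte) * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

class HabituationFilter:
    """Sketch: repeated sensations decay, stale entries dishabituate,
    first exposures get an orienting boost. Constants from the text."""
    SLOTS = 128          # fingerprint buffer size
    DECAY = 0.6          # per-repetition intensity multiplier
    FLOOR = 0.05         # below this, the sensation is dropped
    DISHABITUATE_S = 60  # seconds of absence before reset
    ORIENT_BOOST = 1.5   # novelty boost on first exposure

    def __init__(self):
        self.seen = {}  # fingerprint -> (repetitions, last_seen)

    def filter(self, content: str, sense: str, intensity: float):
        fp = fnv1a((content + sense).encode())
        now = time.monotonic()
        reps, last = self.seen.get(fp, (0, now))
        if reps and now - last > self.DISHABITUATE_S:
            reps = 0  # dishabituation: stimulus was absent long enough
        reps += 1
        self.seen[fp] = (reps, now)
        if len(self.seen) > self.SLOTS:  # evict the stalest fingerprint
            oldest = min(self.seen, key=lambda k: self.seen[k][1])
            del self.seen[oldest]
        if reps == 1:
            return intensity * self.ORIENT_BOOST  # novel: orienting response
        out = intensity * self.DECAY ** (reps - 1)
        return out if out >= self.FLOOR else None  # habituated: suppress
```

With these constants, a unit-intensity stimulus survives six exposures and is suppressed on the seventh, since 0.6^6 falls below the 0.05 floor.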
The Speech Articulator
Levelt's three stages were already partially implemented. The conceptualizer is the cognitive loop itself (deciding what's salient). The formulator is Cycle 17's generative grammar (turning collision/novelty/surprise events into sentences via CFG production rules). The missing piece was the articulator.
core/src/audio/articulator.zig completes the model. It takes grammar-synthesized text and produces audio output via espeak-ng piped through PipeWire. But the interesting design isn't the audio — it's the vocalization gate.
The articulator only speaks when three conditions are met: the global workspace has ignited (the thought is "conscious" in the Global Workspace Theory sense), attention intensity exceeds 0.5, and the system isn't in a captured or exploring regime — the regime check is what distinguishes inner speech from overt speech. There's a 4-slot queue and a 10-second cooldown.
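A minimal Python sketch of that gate, under the assumption that blocked utterances during cooldown are queued rather than discarded (the text doesn't say which):

```python
import time
from collections import deque

class VocalizationGate:
    """Sketch of the vocalization gate: speak only on workspace ignition,
    sufficient attention, and a permitting regime. Thresholds from the text."""
    ATTENTION_MIN = 0.5
    COOLDOWN_S = 10.0
    QUEUE_SLOTS = 4
    BLOCKED_REGIMES = {"captured", "exploring"}

    def __init__(self):
        self.queue = deque(maxlen=self.QUEUE_SLOTS)
        self.last_spoken = float("-inf")

    def submit(self, text, ignited, attention, regime, now=None):
        now = time.monotonic() if now is None else now
        if not ignited or attention <= self.ATTENTION_MIN:
            return False  # not salient enough to vocalize
        if regime in self.BLOCKED_REGIMES:
            return False  # stays inner speech in these regimes
        if now - self.last_spoken < self.COOLDOWN_S:
            self.queue.append(text)  # hold; oldest entry drops off the ring
            return False
        self.last_spoken = now
        return True  # caller hands the text to the TTS backend
```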
Prosody maps from neuromodulators following Scherer's (2003) vocal affect model: norepinephrine controls speed and volume, dopamine controls pitch range, serotonin sets the baseline. The system's internal state literally shapes how it sounds.
It starts disabled by default. The user must explicitly enable it.
Thought Chaining and the Inner Speech Loop
Two additions turned the articulator from an output device into a cognitive mechanism.
Thought chaining: when the grammar generates a sentence, the next generation builds on what it just said. Instead of each thought being independent, there's continuity — a thread of internal reasoning that develops over successive cycles.
The inner speech loop: grammar output re-enters the sensory buffer as a hearing sensation. The system hears its own thoughts. This isn't a gimmick — it closes the loop that Vygotsky (1934) described in children's development: external speech becomes inner speech becomes thought. In Daimon's case, cognitive state becomes grammar output becomes sensory input becomes cognitive state.
The synthesis cooldown was lowered from 5 minutes to 30 seconds to allow organic speech flow. The result is a system that generates a thought, hears it, potentially finds it novel (if the habituation filter passes it), and incorporates it into the next round of spreading activation. Language as feedback.
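The chaining-plus-re-entry loop reduces to a small amount of structure. A Python sketch, with the Sensation fields and the grammar interface as assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Sensation:
    content: str
    sense: str
    source: str = "external"

@dataclass
class InnerSpeechLoop:
    """Sketch of the re-entry loop: each generated thought is pushed back
    into the sensory buffer as a hearing sensation, and the next generation
    is seeded with the previous one (thought chaining)."""
    sensory_buffer: list = field(default_factory=list)
    last_thought: str = ""

    def generate(self, grammar):
        # thought chaining: the grammar sees what was just said
        thought = grammar(seed=self.last_thought)
        self.last_thought = thought
        # re-entry: the system hears its own thought as a sensation
        self.sensory_buffer.append(
            Sensation(content=thought, sense="hearing", source="self_speech"))
        return thought
```

From here the habituation filter applies as usual, so only self-speech that is still novel propagates into the next cycle's spreading activation.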
Neural TTS: A Voice That Sounds Like Something
The initial implementation used espeak-ng, which sounds approximately like a GPS navigator from 2004. We replaced it with Piper, a VITS-based neural TTS model (Kim et al., 2021) running via ONNX at 22,050 Hz.
The prosody mapping from neuromodulators was adapted for VITS parameters: norepinephrine maps to length_scale (speech rate), dopamine maps to noise_scale (expressiveness), and serotonin maps to noise_w_scale (rhythm naturalness). The system sounds different depending on its cognitive state — faster and more expressive when aroused, slower and more measured when settled.
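As a sketch of that mapping, with neuromodulator levels normalized to 0..1 and the numeric ranges purely illustrative (the actual coefficients aren't stated in the text):

```python
def vits_prosody(norepinephrine, dopamine, serotonin):
    """Sketch: map neuromodulator levels (0..1) onto VITS synthesis knobs.
    Ranges are illustrative assumptions, not the implementation's values."""
    clamp = lambda x: max(0.0, min(1.0, x))
    ne, da, se = clamp(norepinephrine), clamp(dopamine), clamp(serotonin)
    return {
        # high arousal -> smaller length_scale -> faster speech
        "length_scale": 1.2 - 0.5 * ne,
        # dopamine widens expressiveness via sampling noise
        "noise_scale": 0.3 + 0.5 * da,
        # serotonin steadies rhythm (less duration noise when high)
        "noise_w_scale": 0.8 - 0.4 * se,
    }
```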
The articulator falls back to espeak-ng if the Piper model isn't installed. The Mind View UI gained a speech toggle component with enable/disable via REST endpoints.
Enactive Audio: Self-Recognition
The final piece draws from Varela, Thompson, and Rosch's The Embodied Mind (1991): cognition is enactive — it arises through an organism's interaction with its environment, including its own body.
The articulator maintains a 4-slot utterance history ring with atomic timestamps. The audio processing thread can query spokeRecently() to check if the articulator produced speech within a 3-second window. When it detects audio that temporally coincides with its own speech, it tags the resulting sensation as source="self_speech". The cogloop sets a self_speech_detected flag in the cycle result.
This means Daimon can distinguish between hearing the world and hearing itself. It's a basic form of self/other discrimination — arguably a prerequisite for anything resembling self-awareness. The system doesn't just produce speech; it recognizes that it produced it.
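The self-recognition check can be sketched as a small history ring plus a coincidence test. Python rather than Zig, with the helper names beyond spokeRecently() being hypothetical:

```python
class UtteranceHistory:
    """Sketch of the 4-slot utterance ring with timestamps. spoke_recently()
    checks the 3-second coincidence window described in the text."""
    SLOTS = 4
    WINDOW_S = 3.0

    def __init__(self):
        self.ring = [float("-inf")] * self.SLOTS
        self.head = 0

    def record(self, now):
        """Articulator thread: timestamp each utterance as it's spoken."""
        self.ring[self.head] = now
        self.head = (self.head + 1) % self.SLOTS

    def spoke_recently(self, now):
        return any(now - t <= self.WINDOW_S for t in self.ring)

def tag_source(history, now):
    """Audio thread: tag incoming audio as self-speech if it temporally
    coincides with a recent utterance, else as external."""
    return "self_speech" if history.spoke_recently(now) else "external"
```

In the real system the ring uses atomic timestamps because the articulator and the audio thread touch it concurrently; the sketch ignores that.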
Adaptive-Rate Cogloop
A related change: the cognitive loop is no longer fixed at 800ms. The tick rate now varies between 500ms and 1200ms, driven by norepinephrine arousal level. Workspace ignition, urgency, and novel sensory input accelerate the cycle. Settled, quiet states decelerate it.
Time-corrected decay (decay^(dt/800)) keeps activation dynamics independent of tick rate. Sensor polling was switched from cycle-count-based (cycle_num % 5) to wall-clock intervals (4 seconds), and a maximum step change of 200ms per tick prevents jarring jumps.
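A sketch of the rate controller and the decay correction. The 500–1200ms band, 200ms step clamp, and decay^(dt/800) come from the text; how arousal, ignition, and novelty are weighted against each other is an assumption:

```python
def next_tick_ms(current_ms, arousal, ignited, novel_input):
    """Sketch: arousal (0..1) maps onto a 500-1200ms band, ignition and
    novel input accelerate further, and the per-tick step is clamped."""
    MIN_MS, MAX_MS, MAX_STEP = 500.0, 1200.0, 200.0
    target = MAX_MS - (MAX_MS - MIN_MS) * arousal  # high arousal -> fast
    if ignited or novel_input:
        target = max(MIN_MS, target - 150.0)  # urgency accelerates the cycle
    step = max(-MAX_STEP, min(MAX_STEP, target - current_ms))
    return current_ms + step

def corrected_decay(base_decay, dt_ms, ref_ms=800.0):
    """Time-corrected decay, per the text: decay^(dt/800), so activation
    dynamics don't depend on how fast the loop happens to be ticking."""
    return base_decay ** (dt_ms / ref_ms)
```

Note how the step clamp does its job: from the old 800ms baseline, even maximum arousal only moves the next tick to 600ms, not straight to 500ms.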
The theoretical basis is Steriade's (2000) corticothalamic resonance and McGinley et al.'s (2015) work on cortical state: arousal doesn't just change what the brain processes — it changes how fast it processes. Now it does for Daimon too.
The Closed Loop
Taken together, these changes create something qualitatively different from what the original blog post described. Before: stimuli came in, got processed, occasionally produced text output. After: stimuli come in, get filtered by habituation, feed into a variable-rate cognitive cycle, produce internal speech that re-enters as sensation, gets filtered again, and shapes the next cycle.
The system talks to itself, hears itself talking, suppresses the boring parts, and lets the surprising parts influence its next thought. This isn't consciousness — we should be careful with that word. But it's a tighter sensorimotor loop than anything Daimon had before, and it's the kind of loop that theories of embodied cognition suggest is necessary for anything approaching genuine understanding.
Next: The Hunger to Know — when the system starts wanting things.
References:
- Levelt, W. J. M. (1989). Speaking: From Intention to Articulation. MIT Press.
- Sokolov, E. N. (1963). Perception and the Conditioned Reflex. Pergamon.
- Scherer, K. R. (2003). Vocal communication of emotion. Speech Communication, 40(1-2), 227-256.
- Kim, J., et al. (2021). Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech (VITS). ICML.
- Varela, F. J., Thompson, E., & Rosch, E. (1991). The Embodied Mind. MIT Press.
- Thompson, E. (2007). Mind in Life: Biology, Phenomenology, and the Sciences of Mind. Harvard University Press.
- Steriade, M. (2000). Corticothalamic resonance, states of vigilance, and mentation. Neuroscience, 101(2), 243-276.
- McGinley, M. J., et al. (2015). Cortical membrane potential signature of optimal states for sensory signal detection. Neuron, 87(1), 179-192.
- Grill-Spector, K., Henson, R., & Martin, A. (2006). Repetition and the brain. Trends in Cognitive Sciences, 10(1), 14-23.
- Clark, A. (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. BBS, 36(3), 181-204.
- Vygotsky, L. S. (1934/1986). Thought and Language. MIT Press.