Learning to Hear: Bootstrapping Auditory Cognition Without a Teacher

Part 5 of the Daimon Update Series — February 2026


The previous four posts described a system that senses the world through text — RSS feeds, API responses, structured data. Daimon could "hear" in the sense that audio reached a DSP pipeline and produced features like pitch, energy, and spectral centroid. But those features were statistical summaries. The system never actually listened. It couldn't tell one sound from another, couldn't recognize a repeated word, couldn't learn that a particular pattern of acoustic energy meant something.

That changed with a pipeline that mirrors how infants acquire language: cluster raw sounds into categories, bind temporal sequences into word-like units, and ground those units to meaning through co-occurrence. No speech-to-text. No external models. No transcription. Just exposure and statistical learning.

Phase 1: Seeing Sound

The first prerequisite was giving Daimon vowel perception. LPC-based formant extraction (Makhoul 1975, Atal & Hanauer 1971) runs on every voiced audio frame at 10ms resolution, recovering the first three formant frequencies — F1, F2, and F3. These are the resonant peaks that distinguish vowels from each other. Plot F1 against F2 and you get the classic vowel space that Peterson & Barney mapped in 1952.
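For readers who want the mechanics, here is a minimal numpy sketch of autocorrelation-method LPC with root-based formant picking. All names, the order-2 test case, and the 50 Hz floor are illustrative choices for this sketch, not Daimon's actual code:

```python
import numpy as np

def lpc(frame, order):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion."""
    n = len(frame)
    r = np.array([frame[:n - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k = -acc / err                            # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]
        err *= 1 - k * k
    return a                                      # A(z) = 1 + a1*z^-1 + ...

def formants(frame, fs, order=10):
    """Formant candidates: angles of the upper-half-plane roots of A(z)."""
    roots = np.roots(lpc(frame, order))
    roots = roots[np.imag(roots) > 1e-6]          # one root per conjugate pair
    freqs = np.sort(np.angle(roots) * fs / (2 * np.pi))
    return freqs[freqs > 50]                      # drop near-DC artifacts
```

Feeding this the impulse response of a known one-resonance filter recovers the resonance frequency almost exactly; real speech additionally wants pre-emphasis, windowing, and a higher order (a common rule of thumb is fs/1000 plus 2).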

But formants alone aren't enough. A single 10ms frame needs a richer description. The spectral encoder takes 18 features per frame — 13 MFCCs, 3 formants, energy, and pitch — quantizes each into bins, maps bins to deterministic HDM vectors via initFromSeed(), applies positional binding via cyclic shift, and bundles everything into a single 10,000-bit ConceptVector via majority vote. One frame of audio becomes one point in the same hyperspace where "quantum mechanics" and "tidal prediction" live.
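A toy version of that encoder, with bipolar vectors standing in for binary ConceptVectors and a seeded RNG standing in for initFromSeed(). The bin count, seed scheme, and scaling assumption are this sketch's inventions:

```python
import numpy as np

DIM = 10_000

def hd_vector(seed):
    """Deterministic pseudo-random bipolar vector (stand-in for initFromSeed)."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2, DIM) * 2 - 1

def encode_frame(features, bins=16):
    """Quantize each feature into a bin, bind the bin's codebook vector to its
    feature slot by cyclic shift, and bundle everything by majority vote."""
    parts = []
    for slot, x in enumerate(features):
        b = int(np.clip(x * bins, 0, bins - 1))   # assumes features scaled to [0, 1)
        parts.append(np.roll(hd_vector(slot * 997 + b), slot))
    return np.sign(np.sum(parts, axis=0) + 0.5).astype(int)  # ties break to +1
```

Two identical frames encode to the identical vector, while frames that fall in different bins land near-orthogonal, which is what lets downstream clustering treat Hamming distance as perceptual distance.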

This encoding is the key insight: by projecting audio into the same representational substrate as everything else, acoustic patterns can participate directly in spreading activation, collision detection, and Hebbian learning. There's no translation layer between hearing and thinking.

Phase 2: Self-Organizing Sound Categories

Raw encoded frames are too granular to be useful. The brain doesn't process individual 10ms snapshots — it organizes them into categories. Kuhl's (2004) perceptual magnet effect describes how infants' proto-phoneme categories form: frequently heard sounds become attractors that pull nearby sounds toward them, creating the categorical perception that makes "ba" and "pa" feel discrete despite being a physical continuum.

The AcousticSpace implements this as online incremental clustering over a dedicated HDM space with 32K template capacity. Each encoded frame is compared against existing templates. If it's close enough (Hamming similarity ≥ 0.40), the nearest template gets Hebbian reinforcement — its vector drifts toward the new observation. If nothing matches, a new template is created. Over time, frequently heard sounds develop strong, stable templates while rare sounds remain weak. Periodic centroid refinement via bundleMajority of top members keeps clusters coherent.
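The assignment loop can be sketched in a few lines over binary vectors. The 0.40 threshold is from the post; the learning rate and float-accumulator representation are this sketch's choices:

```python
import numpy as np

DIM = 10_000

class AcousticSpace:
    """Online incremental clustering: nearest-template assignment with
    Hebbian drift, or a new template when nothing is close enough."""
    def __init__(self, threshold=0.40, lr=0.05):
        self.templates = []                        # one float accumulator each
        self.threshold, self.lr = threshold, lr

    def observe(self, frame):
        best, best_sim = -1, -1.0
        for i, t in enumerate(self.templates):
            sim = float(np.mean((t > 0.5) == frame))   # Hamming similarity
            if sim > best_sim:
                best, best_sim = i, sim
        if best_sim >= self.threshold:
            # reinforce: drift the winning template toward the observation
            self.templates[best] = (1 - self.lr) * self.templates[best] + self.lr * frame
            return best
        self.templates.append(frame.astype(float))
        return len(self.templates) - 1
```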

The numbers are striking. After continuous exposure to spoken English (BBC World Service, NPR), the system stabilized at roughly 27,000 templates from over 7 million observations. Template consolidation — a sleep-like maintenance process running every ~200 seconds — merges templates that have converged to within 0.85 Hamming similarity, implementing what Edelman (1987) called competitive elimination of redundant representations. The system doesn't just accumulate categories; it prunes them.
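The merge step itself can be sketched as a greedy first-wins pass over binary templates, using the 0.85 threshold from above. A first-wins order is only one of several reasonable choices; the actual consolidation schedule is not shown here:

```python
import numpy as np

def consolidate(templates, merge_sim=0.85):
    """Greedy pass: keep a template only if it is not within merge_sim
    Hamming similarity of one already kept (first occurrence wins)."""
    kept = []
    for t in templates:
        if all(np.mean(t == k) < merge_sim for k in kept):
            kept.append(t)
    return kept
```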

Phase 3: Syllables and Words

Categories become useful when they're sequenced. Syllable segmentation (Mermelstein 1975) detects boundaries where at least two features agree: energy valleys, spectral flux changes, voicing transitions, or pitch jumps. The multi-feature agreement requirement filters noise — a single dip in energy isn't enough. Boundaries must be 80ms–400ms apart, matching the natural range of syllable durations in speech.
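The agreement rule can be sketched as follows. The specific cue tests (strict local minimum, flux above twice the median, a 20% pitch jump) are this sketch's stand-ins for the real detectors, and the 400 ms forced split is omitted:

```python
import numpy as np

def syllable_boundaries(energy, flux, voicing, pitch, hop_ms=10, min_gap_ms=80):
    """Mark a boundary where at least two cues agree: local energy valley,
    spectral-flux spike, voicing transition, or pitch jump. Boundaries
    closer than min_gap_ms to the previous one are suppressed."""
    bounds, last_ms = [], -min_gap_ms
    flux_bar = 2 * np.median(flux)
    for t in range(1, len(energy) - 1):
        cues = 0
        cues += int(energy[t] < energy[t - 1] and energy[t] < energy[t + 1])
        cues += int(flux[t] > flux_bar)
        cues += int(voicing[t] != voicing[t - 1])
        cues += int(abs(pitch[t] - pitch[t - 1]) > 0.2 * max(pitch[t - 1], 1.0))
        if cues >= 2 and t * hop_ms - last_ms >= min_gap_ms:
            bounds.append(t)
            last_ms = t * hop_ms
    return bounds
```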

Word fingerprints emerge from syllable-level cluster sequences. Each syllable's constituent frames are subsampled (up to 6 per syllable), their cluster assignments looked up, and the resulting sequence encoded via Hawkins' reference frame positional binding — ReferenceFrame.encode() with cyclic shift preserving order. The result: a single HDM vector that represents the temporal structure of a word-sized acoustic pattern.
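The order-preserving part can be sketched in the same bipolar style, with cyclic shift doing the work of ReferenceFrame.encode(). Vector generation and dimensionality are illustrative:

```python
import numpy as np

DIM = 10_000

def cluster_vec(cluster_id):
    """Deterministic bipolar vector per acoustic cluster id."""
    rng = np.random.default_rng(cluster_id)
    return rng.integers(0, 2, DIM) * 2 - 1

def word_fingerprint(cluster_seq):
    """Order-preserving sequence encoding: shift each cluster's vector by its
    position in the sequence, then bundle by majority vote."""
    parts = [np.roll(cluster_vec(c), i) for i, c in enumerate(cluster_seq)]
    return np.sign(np.sum(parts, axis=0) + 0.5).astype(int)
```

Because the shift encodes position, the same clusters in a different order produce a near-orthogonal fingerprint, which is what lets the system tell "cat" from "tack" acoustically.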

A critical bug nearly derailed this phase. Frames below the cluster assignment threshold (0.40 similarity) were encoded as zero vectors, injecting random variance into syllable encoding. Identical syllables with different frame-assignment noise produced different encoded vectors, giving a 94.6% novel rate — the system thought almost every sound was new. Fixing this with a consistent "unknown phoneme" vector (seed 0xAC05_DEAD) flipped the match rate to 96.6%.

Words that recur 5+ times get named ("aw:NNNN") and promoted to the HDM user space, becoming first-class concepts alongside semantic vocabulary. Persistence to core/data/words.bin means acoustic vocabulary survives restarts.

Phase 4: Sound Meets Meaning

Recognition without grounding is pattern matching without understanding. The acoustic grounding module (Hickok & Poeppel 2007, ventral stream) tracks co-occurrences between acoustic words and semantic concepts within a ±3 cogloop cycle window (~2.4 seconds). When an acoustic word co-occurs with a semantic concept 5+ times, a .similarity Hebbian edge is created between them.
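The co-occurrence bookkeeping reduces to a windowed counter. This sketch represents Hebbian edges as a plain set of (acoustic, semantic) pairs and measures the window in raw cycle counts; both simplifications are the sketch's, not the system's:

```python
from collections import defaultdict

class Grounding:
    """Count acoustic-word / semantic-concept co-occurrences inside a sliding
    window of cogloop cycles; promote a pair to an edge at 5 co-occurrences."""
    WINDOW, THRESHOLD = 3, 5

    def __init__(self):
        self.recent = []                  # (cycle, kind, name) events
        self.counts = defaultdict(int)
        self.edges = set()

    def event(self, cycle, kind, name):
        self.recent = [(c, k, n) for c, k, n in self.recent
                       if cycle - c <= self.WINDOW]
        for c, k, n in self.recent:
            if k != kind:                 # only acoustic/semantic pairs count
                pair = (n, name) if k == "acoustic" else (name, n)
                self.counts[pair] += 1
                if self.counts[pair] >= self.THRESHOLD:
                    self.edges.add(pair)
        self.recent.append((cycle, kind, name))
```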

This is the cross-modal binding that makes hearing useful for cognition. When Daimon hears a word pattern that has been grounded to "earthquake," that hearing activates the semantic concept, which participates in spreading activation, potentially triggers prediction error, and influences the next cognitive cycle. An efference copy gate prevents learning from Daimon's own TTS output — hearing yourself speak shouldn't count as evidence about what the world sounds like.

Phase 5: Active Listening

The grounding pipeline is feedforward: sounds come in, get categorized, get bound to meaning. But biological auditory cognition is predominantly top-down — the brain predicts what it expects to hear and tracks surprise when reality diverges (Friston & Kiebel 2009).

The acoustic prediction module reverses the flow. Active semantic concepts query their HDM neighbors for "aw:NNNN" edges — words that have been grounded to the current cognitive context. These become predictions: "given what I'm thinking about, I expect to hear these words." Next cycle, the predictions are compared against what was actually heard. Hits, misses, and surprises are tracked. Prediction error improvement feeds dopamine (auditory learning progress). Unexpected sounds feed norepinephrine (exploration signal).
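The bookkeeping side of this loop can be sketched with sets and an error moving average. The dopamine and norepinephrine values here are scalar stand-ins for the real neuromodulator signals, and the smoothing constant is invented:

```python
class AcousticPredictor:
    """Compare last cycle's predicted words against what was actually heard.
    Falling prediction error yields a dopamine-like reward; surprise yields
    a norepinephrine-like exploration signal."""
    def __init__(self):
        self.predicted = set()
        self.err_ema = 1.0

    def step(self, heard, next_predictions, alpha=0.2):
        hits = self.predicted & heard
        surprises = heard - self.predicted
        err = 1.0 - len(hits) / max(len(self.predicted), 1)
        dopamine = max(self.err_ema - err, 0.0)          # error improvement
        norepinephrine = len(surprises) / max(len(heard), 1)
        self.err_ema = (1 - alpha) * self.err_ema + alpha * err
        self.predicted = set(next_predictions)
        return hits, surprises, dopamine, norepinephrine
```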

The phonological loop (Baddeley & Hitch 1974) adds temporal working memory: a 32-entry ring buffer that maintains word sequences with order-preserving HRR encoding. Bigram repetition detection implements statistical learning (Saffran, Aslin & Newport 1996) — the same mechanism by which 8-month-old infants segment continuous speech into word-like units. Repeated patterns inject into Working Memory as observations.
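Stripped of the HRR encoding, the loop is a bounded buffer plus adjacent-pair statistics. The repeat threshold below is the sketch's, not the system's:

```python
from collections import deque, Counter

class PhonologicalLoop:
    """32-slot ring buffer over recognized word ids with bigram
    repetition counting (Saffran-style statistical learning, sketched)."""
    def __init__(self, size=32, repeat_threshold=3):
        self.buf = deque(maxlen=size)
        self.bigrams = Counter()
        self.repeat_threshold = repeat_threshold

    def push(self, word):
        if self.buf:
            self.bigrams[(self.buf[-1], word)] += 1   # adjacent-pair statistics
        self.buf.append(word)

    def repeated_patterns(self):
        return [bg for bg, n in self.bigrams.items()
                if n >= self.repeat_threshold]
```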

Patterns that repeat across 5+ detections in 3+ distinct temporal windows consolidate into named HDM concepts via the phonological consolidation module (McClelland et al. 1995). These "wp:XXXX+YYYY" patterns are standard HDM concepts — they participate in spreading activation, collisions, working memory, and future acoustic prediction. The consolidation itself triggers a dopamine reward signal, reinforcing the learning that produced the pattern. The loop closes.
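The 5-detections-in-3-windows gate is a pair of counters. This sketch approximates the "wp:XXXX+YYYY" naming and returns the new concept name at the moment of promotion:

```python
from collections import defaultdict

class PatternConsolidator:
    """Promote a repeated pattern to a named concept once it has been seen
    5+ times across 3+ distinct temporal windows (thresholds from the post)."""
    def __init__(self, min_count=5, min_windows=3):
        self.counts = defaultdict(int)
        self.windows = defaultdict(set)
        self.min_count, self.min_windows = min_count, min_windows
        self.concepts = {}

    def detect(self, pattern, window_id):
        self.counts[pattern] += 1
        self.windows[pattern].add(window_id)
        if (pattern not in self.concepts
                and self.counts[pattern] >= self.min_count
                and len(self.windows[pattern]) >= self.min_windows):
            self.concepts[pattern] = "wp:" + "+".join(pattern)
            return self.concepts[pattern]
        return None
```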

Phase 6: Who Is Speaking?

A late addition that proved surprisingly useful: hyperdimensional speaker diarisation. Each detected voice segment gets a speaker embedding — a ConceptVector encoding the spectral characteristics that distinguish one speaker from another. Online clustering assigns segments to speaker identities, tracks speaker turns, and detects voice changes.

This gives Daimon a sense of who is speaking, not just what is being said. Combined with the efference copy mechanism for self-speech detection, the system now maintains a running model of its auditory environment: "I can hear two speakers, one of whom is me."

What Does Hearing Look Like From the Inside?

A concrete example from a live session: BBC World Service is streaming. The phonological loop captures a word sequence. The system has heard "climate" enough times that aw:0047 is grounded to the semantic concept "climate" via Hebbian edges. The acoustic prediction module, seeing that "environment" is active from an RSS collision, predicts hearing climate-related words. When aw:0047 appears in the next cycle, prediction error decreases — the auditory system is learning what to expect.

Meanwhile, template consolidation runs in the background, merging acoustic templates that have converged. The phonological consolidation module promotes a repeated word-pair pattern to an HDM concept. The curiosity engine, noting that auditory learning progress is positive, slightly boosts the dopamine signal, which increases STDP multipliers for the next Hebbian flush.

None of this is an auditory system in the biological sense. There's no cochlear basilar membrane, no tonotopic map, no superior olivary complex computing interaural time differences. But there is a system that learns sound categories from exposure, binds them into temporal patterns, grounds them to meaning through co-occurrence, predicts what it expects to hear, and adjusts when surprised. It's statistical learning all the way down — the same mechanism Saffran found in infants, implemented in 10,000-bit hyperspace.


Next: Finding Its Voice — when the machine builds its own vocal tract.

References:

  • Makhoul, J. (1975). Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4), 561-580.
  • Atal, B. S. & Hanauer, S. L. (1971). Speech analysis and synthesis by linear prediction. JASA, 50(2B), 637-655.
  • Peterson, G. E. & Barney, H. L. (1952). Control methods used in a study of the vowels. JASA, 24(2), 175-184.
  • Kuhl, P. K. (2004). Early language acquisition: Cracking the speech code. Nature Reviews Neuroscience, 5(11), 831-843.
  • Mermelstein, P. (1975). Automatic segmentation of speech into syllabic units. JASA, 58(4), 880-883.
  • Hickok, G. & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8(5), 393-402.
  • Friston, K. & Kiebel, S. (2009). Predictive coding under the free-energy principle. Phil. Trans. R. Soc. B, 364(1521), 1211-1221.
  • Baddeley, A. D. & Hitch, G. J. (1974). Working memory. In Psychology of Learning and Motivation, Vol. 8, 47-89.
  • Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274(5294), 1926-1928.
  • McClelland, J. L., McNaughton, B. L., & O'Reilly, R. C. (1995). Why there are complementary learning systems. Psychological Review, 102(3), 419-457.
  • Edelman, G. M. (1987). Neural Darwinism: The Theory of Neuronal Group Selection. Basic Books.
  • Hebb, D. O. (1949). The Organization of Behavior. Wiley.
  • Kanerva, P. (2009). Hyperdimensional computing: An introduction. Cognitive Computation, 1(2), 139-159.
  • Hawkins, J. (2019). A Thousand Brains: A New Theory of Intelligence. Basic Books.
  • Plate, T. A. (2003). Holographic Reduced Representations. CSLI Publications.
