Audio-First Product Direction

Version: v0.1.0 Last updated: 2026-04-04

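Note: This version is a historical document reorganized to follow the documentation standard; the body retains the original English content for now.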

1. Purpose

This document defines a practical first product direction based on audio as the initial modality for the multimodal analysis framework.

The purpose of this direction is to:

choose a narrow and achievable first modality
create an implementation path that is easier than RF-first or multimodal-first execution
support a child-friendly product direction
validate the framework architecture through a concrete, lower-friction application

This document complements:

2. Why Audio Is a Good Starting Point

Audio is a strong first modality for several reasons.

2.1 Lower System Complexity

Compared with RF, video, or full multimodal sensing:

hardware access is simpler
recording is easy on phones and tablets
file formats are straightforward
algorithm prototyping is faster
user testing is easier to organize

2.2 Fast Feedback Loop

Audio allows quick iteration:

capture a sample
run a pipeline
inspect features
classify or cluster events
compare explanations

This is ideal for validating the framework's:

module registry
experiment runner
scoring engine
evidence model
AI explanation layer

2.3 Child-Friendly Interaction

Audio products can feel natural and approachable for children:

listen to the world around them
identify interesting sounds
ask the assistant what a sound might be
learn through friendly explanations

This makes audio a better first consumer-facing modality than raw spectrum analysis.

3. Recommended Product Framing

The first product should not try to be "an app that understands every sound in the world".

A better framing is:

a child-friendly sound explorer
a smart sound observation assistant
an interactive audio interpretation tool

This keeps the experience educational, engaging, and credible.

4. Recommended First Use Case

The best first use case is:

Children's environmental sound identification and explanation

Examples of candidate sounds:

birds
rain
thunder
dog bark
cat meow
doorbell
car horn
footsteps
musical instruments
household sounds

This use case is strong because:

it is easy to demonstrate
it has broad appeal
it is easier to validate than open-ended emotional interpretation
it is a natural fit for mobile devices as the user terminal

5. Why Not Start With "Animal Translator"

The broader idea of interpreting animal signals is compelling, but it should not be the very first shipped experience.

Reasons:

the phrase "translator" invites overclaiming
ground truth is hard to establish
audio alone is often insufficient
multimodal context is usually needed

A more credible progression is:

environmental sound understanding
simple animal sound identification
animal-state interpretation with context
richer cross-modal animal behavior products later

6. Product Experience Vision

The user flow should be simple:

the child records or listens to a sound
the system identifies likely sound candidates
the assistant explains what it might be in child-friendly language
the system shows why it made that guess
the child can ask follow-up questions

Example outputs:

"This sounds more like a dog bark than a doorbell."
"I heard short repeating bursts with strong high-frequency energy."
"That often happens when a dog is alert or excited."

The assistant should avoid pretending to have certainty when the evidence is weak.

7. Suggested Product Principles

The product should be:

curious rather than authoritative
educational rather than overly diagnostic
explainable rather than magical
safe for children
usable on a phone as the main terminal

The product should avoid:

exaggerated certainty
medical or behavioral overreach
emotionally manipulative claims
claims of literal language translation

8. Technical Scope for a First Version

The first version should be intentionally constrained.

Suggested capabilities:

record or import short audio clips
detect sound events
classify into a limited sound vocabulary
provide confidence-ranked results
produce child-friendly explanations
save observations for replay and improvement

This is enough to exercise the platform without overwhelming the first implementation.

9. Suggested First Algorithm Chain

An initial audio pipeline might look like:

audio_input
-> basic_denoise
-> framing / segmentation
-> spectrogram extraction
-> feature extraction
-> sound event detection
-> classifier
-> evidence builder
-> AI explanation
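
One way to picture the chain above is as an ordered list of stage functions applied in sequence. The following is a minimal sketch, not the framework's implementation: the `run_pipeline` helper, the dict-based context, and the two placeholder stages are assumptions made only for illustration.

```python
from typing import Callable, Dict, List

# A stage takes the shared context dict and returns an updated copy.
Stage = Callable[[Dict], Dict]

def audio_input(ctx: Dict) -> Dict:
    # Hypothetical stage: a real one would read from a microphone or a file.
    return {**ctx, "samples": [0.0, 0.5, -0.5, 0.25]}

def peak_normalize(ctx: Dict) -> Dict:
    # Placeholder for the early cleanup stages: scale so the peak is 1.0.
    peak = max(abs(s) for s in ctx["samples"]) or 1.0
    return {**ctx, "samples": [s / peak for s in ctx["samples"]]}

def run_pipeline(stages: List[Stage], ctx: Dict) -> Dict:
    # Apply stages in order and record which ones ran, so the
    # explanation layer can later say how a result was produced.
    trace = []
    for stage in stages:
        ctx = stage(ctx)
        trace.append(stage.__name__)
    return {**ctx, "trace": trace}

result = run_pipeline([audio_input, peak_normalize], {})
```

Keeping a trace of executed stages is one simple way to support the "show why it made that guess" requirement; later stages such as detection and classification would slot into the same list.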

Candidate first modules:

audio_normalize
denoise_basic
segment_energy_based
spectrogram_extract
mfcc_extract
audio_embedding
sound_event_detect
sound_classifier
similarity_search
explanation_formatter
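
As an illustration of how small one of these modules can be, here is a hedged sketch of what `segment_energy_based` might do: split the clip into fixed-length frames and keep the runs of frames whose mean energy exceeds a threshold. The signature, the frame length, and the threshold value are assumptions for the example, not settled design.

```python
from typing import List, Tuple

def segment_energy_based(samples: List[float], frame_len: int,
                         threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges whose mean frame energy exceeds threshold."""
    segments = []
    start = None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / len(frame)
        if energy > threshold and start is None:
            start = i                      # a loud run begins
        elif energy <= threshold and start is not None:
            segments.append((start, i))    # the loud run ends
            start = None
    if start is not None:
        segments.append((start, len(samples)))  # clip ended mid-event
    return segments

# Silence, a short burst, then silence again.
clip = [0.0] * 4 + [0.8, -0.8, 0.8, -0.8] + [0.0] * 4
events = segment_energy_based(clip, frame_len=2, threshold=0.1)
```

Here `events` contains a single range covering the burst. A fixed threshold is the simplest choice; an adaptive threshold based on the measured background noise level would likely be needed for real recordings.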

10. Evidence Model for Audio

The audio-first product should still follow the evidence-first design.

Examples of evidence:

dominant frequency bands
temporal repetition pattern
onset shape
classifier top-k labels
similarity to known examples
background noise level
confidence spread between candidates

This prevents the assistant from acting like a black box.
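
A few of the evidence items above (top-k labels, background noise level, confidence spread) can be sketched as a small record. The `AudioEvidence` class, its field names, and the `build_evidence` helper are hypothetical; they only show the shape such a record could take.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class AudioEvidence:
    # Hypothetical container; field names are illustrative only.
    top_k: List[Tuple[str, float]]   # (label, score), highest score first
    noise_level: float               # assumed scale: 0.0 clean, 1.0 very noisy
    notes: Dict[str, str] = field(default_factory=dict)

    def confidence_spread(self) -> float:
        # Gap between the best and second-best candidate scores.
        if len(self.top_k) < 2:
            return self.top_k[0][1] if self.top_k else 0.0
        return self.top_k[0][1] - self.top_k[1][1]

def build_evidence(scores: Dict[str, float], noise_level: float,
                   k: int = 3) -> AudioEvidence:
    # Rank classifier scores and keep only the k strongest candidates.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return AudioEvidence(top_k=ranked, noise_level=noise_level)

evidence = build_evidence(
    {"dog bark": 0.62, "doorbell": 0.21, "car horn": 0.09, "rain": 0.08},
    noise_level=0.3,
)
```

A small spread signals that the top candidates are hard to tell apart, which is exactly the situation where the assistant should express uncertainty rather than commit to one label.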

11. Role of AI in the Audio Product

AI should not replace the core audio pipeline.

AI should:

choose or adjust candidate analysis paths
summarize evidence into natural language
respond to user questions
explain uncertainty
suggest what to record next

For example:

"Try recording a longer clip."
"Move closer to the sound source."
"This result is uncertain because the clip includes overlapping background noise."
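
Before any language model is involved, suggestions like these could come from simple rules over the evidence. The sketch below is a rule-based stand-in, not the AI layer itself; the function name and every threshold are illustrative assumptions rather than tuned values.

```python
from typing import List

def suggest_next_steps(duration_s: float, noise_level: float,
                       spread: float) -> List[str]:
    # All thresholds below are illustrative assumptions, not tuned values.
    tips = []
    if duration_s < 2.0:
        tips.append("Try recording a longer clip.")
    if noise_level > 0.5:
        tips.append("This result is uncertain because the clip includes "
                    "overlapping background noise.")
    if spread < 0.15:
        # Candidates are close together, so a cleaner capture may help.
        tips.append("Move closer to the sound source.")
    return tips

tips = suggest_next_steps(duration_s=1.2, noise_level=0.7, spread=0.4)
```

Grounding the suggestions in measured evidence keeps the assistant's advice explainable, and the AI layer can then rephrase the selected tips in child-friendly language.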

12. Why This Product Validates the Platform

An audio-first product validates the framework's most important platform concepts:

InputSource
Observation
AlgorithmModule
Evidence
Experiment
module registration
pipeline execution
scoring
AI explanation

If these work well for audio, the same architecture can later support:

animal sound interpretation
machine acoustic diagnostics
optical pulse decoding
RF and spectrum workflows
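
To make the platform concepts in this section concrete, the sketch below shows one possible shape for them in audio form. Only the concept names come from this document; every field, the decorator-based registry, and the `peak_level` module are assumptions for illustration, and the file name is a placeholder that is never opened.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class InputSource:
    kind: str   # e.g. "microphone" or "file"
    uri: str

@dataclass
class Observation:
    source: InputSource
    samples: List[float]
    sample_rate: int

@dataclass
class Evidence:
    name: str
    value: float

# An AlgorithmModule is modeled here as a function from Observation to Evidence.
AlgorithmModule = Callable[[Observation], Evidence]
MODULE_REGISTRY: Dict[str, AlgorithmModule] = {}

def register(name: str):
    # Module registration: store the function under a lookup name.
    def decorator(fn: AlgorithmModule) -> AlgorithmModule:
        MODULE_REGISTRY[name] = fn
        return fn
    return decorator

@register("peak_level")
def peak_level(obs: Observation) -> Evidence:
    return Evidence("peak_level", max((abs(s) for s in obs.samples), default=0.0))

@dataclass
class Experiment:
    observation: Observation
    module_names: List[str]
    evidence: List[Evidence] = field(default_factory=list)

    def run(self) -> List[Evidence]:
        # Pipeline execution: look up each module by name and apply it.
        self.evidence = [MODULE_REGISTRY[n](self.observation)
                         for n in self.module_names]
        return self.evidence

obs = Observation(InputSource("file", "clip.wav"), [0.1, -0.6, 0.3], 16000)
exp = Experiment(obs, ["peak_level"])
exp.run()
```

Nothing in these types is specific to audio except the contents of `samples`, which is the property that would let the same structure carry over to other modalities later.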

13. Mobile as the Main Terminal

Audio-first execution works especially well with a phone as the user terminal.

The phone can provide:

recording
playback
interaction
questions and answers
simple visual explanations
history of observations

This reduces hardware friction for the first product and makes early testing easier.

14. Data Strategy

The first version should use a small, well-defined dataset strategy.

Recommended phases:

start with a small curated label set
collect replayable clips and associated metadata
track difficult or ambiguous clips
refine classifier and explanation policies

Avoid trying to ingest an uncontrolled universe of sounds too early.

15. Good First Success Criteria

The first version should be judged by concrete metrics such as:

correct top-1 or top-3 classification on the chosen sound vocabulary
explanation quality as judged by users
ability to distinguish similar environmental sounds
robustness to moderate noise
ease of use for parents and children

These are more useful than broad claims about "understanding sound".
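
The top-1 and top-3 criteria can be computed directly from ranked predictions. The sketch below is a minimal version; the function name and the sample labels are made up for illustration and are not results from any evaluation.

```python
from typing import List, Sequence

def top_k_accuracy(ranked_predictions: Sequence[List[str]],
                   truths: Sequence[str], k: int) -> float:
    """Fraction of clips whose true label appears in the first k ranked predictions."""
    if not truths:
        return 0.0
    hits = sum(1 for preds, truth in zip(ranked_predictions, truths)
               if truth in preds[:k])
    return hits / len(truths)

# Illustrative data only: each inner list is one clip's ranked guesses.
preds = [
    ["dog bark", "doorbell", "car horn"],
    ["rain", "thunder", "footsteps"],
    ["doorbell", "car horn", "dog bark"],
    ["birds", "cat meow", "rain"],
]
truths = ["dog bark", "thunder", "dog bark", "footsteps"]

top1 = top_k_accuracy(preds, truths, k=1)
top3 = top_k_accuracy(preds, truths, k=3)
```

Reporting both numbers is useful for this product because the assistant presents several candidates, so a correct label in second or third place still supports a helpful explanation.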

16. Expansion Path

A disciplined expansion path from the audio-first product might be:

environmental sound identification
richer sound categories
animal sound identification
animal-state interpretation with context
multimodal sound + video behavior interpretation

This preserves credibility while still supporting the larger long-term vision.

17. Recommended Immediate Next Step

The next engineering step after this product direction should be:

define the core shared data model
specialize it for audio observations and audio modules
build a minimal experiment runner around audio clips

This creates a practical path from concept to implementation.

18. Summary

Audio is a strong first modality because it is:

easier to prototype
easier to test
easier to explain
compatible with mobile-first interaction
suitable for child-friendly product experiences

The recommended first product is not a universal sound interpreter and not an animal translator.

It is a child-friendly environmental sound exploration and explanation product that validates the larger multimodal framework in a disciplined way.
