Voice System
TomoriBot has a bidirectional voice pipeline:
- inbound STT: user audio attachments become text for conversation context
- outbound TTS: personas can send native Discord voice messages
Phase 4 routes both through custom endpoint capabilities:
- speech for TTS
- transcription for STT
Commands
- /speech elevenlabs connects ElevenLabs speech and transcription in one flow.
- /provider custom-endpoint add registers local tts-clone and openai-compatible-transcription endpoints.
- /model speech selects the active TTS endpoint.
- /model transcription selects the active STT endpoint.
- /speech voice-add uploads the one server-local reference sample supported in Phase 4. You can upload any audio format; it is automatically converted to mono WAV and stored in S3/CloudFront in production or under data/voice-samples/ in non-production. A 10-20 second clip with no background music is recommended.
- /speech voice-assign assigns either the local sample or an ElevenLabs voice to a persona.
- /speech transcripts controls visible transcript posting in chat. It does not enable or disable background STT.
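The storage rule for /speech voice-add can be pictured with a small sketch. This is illustrative only: the function name, the SampleLocation type, the per-server file naming, and the S3 key prefix are all assumptions, not taken from TomoriBot's source. Only the production versus non-production split and the data/voice-samples/ directory come from the description above.

```typescript
// Hypothetical result type: where a converted mono WAV sample would be stored.
type SampleLocation =
  | { kind: "s3"; key: string }      // production: S3, served through CloudFront
  | { kind: "local"; path: string }; // non-production: local directory

// Assumption: one file per server, named after the server ID.
function resolveSampleLocation(
  nodeEnv: string | undefined,
  serverId: string,
): SampleLocation {
  const fileName = `${serverId}.wav`; // hypothetical naming scheme
  if (nodeEnv === "production") {
    // The key prefix is an assumption made for this sketch.
    return { kind: "s3", key: `voice-samples/${fileName}` };
  }
  return { kind: "local", path: `data/voice-samples/${fileName}` };
}
```

With these assumptions, resolveSampleLocation(undefined, "123") resolves to the local path data/voice-samples/123.wav, while passing "production" resolves to an S3 key instead.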
Runtime Behavior
The generate_voice_message tool appears only when the active persona has a voice assignment compatible with the active speech endpoint.
Audio attachments are transcribed only when a transcription endpoint is configured. There is no legacy optional-key fallback after Phase 4.4.
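The two runtime rules above can be sketched as a pair of predicates. This is a minimal sketch under stated assumptions, not the bot's actual implementation: the type names, the field names, and in particular the compatibility mapping (local sample pairs with tts-clone, ElevenLabs voice pairs with an ElevenLabs endpoint) are hypothetical.

```typescript
// Hypothetical types; names are illustrative, not taken from TomoriBot's source.
type VoiceKind = "local-sample" | "elevenlabs";
type SpeechEndpointKind = "tts-clone" | "elevenlabs";

interface VoiceState {
  speechEndpoint?: SpeechEndpointKind; // chosen with /model speech
  transcriptionEndpoint?: string;      // chosen with /model transcription
  personaVoice?: VoiceKind;            // set with /speech voice-assign
}

// Assumed compatibility mapping: each voice kind works with one endpoint kind.
function isVoiceCompatible(
  voice: VoiceKind,
  endpoint: SpeechEndpointKind,
): boolean {
  return (
    (voice === "local-sample" && endpoint === "tts-clone") ||
    (voice === "elevenlabs" && endpoint === "elevenlabs")
  );
}

// The voice tool is offered only when an assigned voice matches the active endpoint.
function shouldExposeVoiceTool(state: VoiceState): boolean {
  if (!state.speechEndpoint || !state.personaVoice) return false;
  return isVoiceCompatible(state.personaVoice, state.speechEndpoint);
}

// Audio attachments are transcribed only when a transcription endpoint is set;
// there is no fallback path when it is missing.
function shouldTranscribe(state: VoiceState): boolean {
  return state.transcriptionEndpoint !== undefined;
}
```

Under these assumptions, a persona with a local sample and an active tts-clone endpoint gets the tool, while the same persona with an ElevenLabs speech endpoint does not. Keeping the two checks separate mirrors the text: STT availability depends only on the transcription endpoint, not on any voice assignment.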
Local setup guides: