[Hiring] Machine Learning Researcher, Audio @Bland
Role Description
As a Machine Learning Researcher at Bland, you'll do foundational research and development across the core components of our voice stack: speech-to-text, large language models, neural audio codecs, and text-to-speech. Your work will define how our agents understand, reason, and speak in real time at enterprise scale.
• Build and Scale Next-Generation TTS Systems
  • Design and train large-scale text-to-speech models capable of expressive, controllable, human-sounding output.
  • Develop neural audio codec-based TTS architectures for efficient, high-fidelity generation.
  • Improve prosody modeling, question inflection, emotional expression, and multi-speaker robustness.
  • Optimize for real-time, low-latency inference in production.
• Advance Speech-to-Text Modeling
  • Build and fine-tune large-scale ASR systems robust to accents, noise, telephony artifacts, and code-switching.
  • Leverage self-supervised pretraining and large-scale weak supervision.
  • Improve transcription accuracy for real-world enterprise scenarios, including structured extraction and conversational nuance.
• Pioneer Neural Audio Codecs
  • Research and implement neural audio codecs that achieve extreme compression with minimal perceptual loss.
  • Explore discrete and continuous latent representations for scalable speech modeling.
  • Design codec architectures that enable downstream generative modeling and controllable synthesis.
• Develop Scalable Training Pipelines
  • Curate and process massive audio datasets across languages, speakers, and environments.
  • Design staged training curricula and data filtering strategies.
  • Scale training across distributed GPU clusters, with a focus on cost, throughput, and reliability.
• Run Rigorous Experiments
  • Design ablation studies that isolate the impact of architectural changes.
  • Measure improvements using both objective metrics and perceptual evaluations.
  • Validate ideas quickly through focused experiments that confirm or eliminate hypotheses.
Qualifications
• Experience with self-supervised learning, multimodal modeling, or generative modeling.
• Hands-on experience building or scaling TTS, STT, or neural audio codec systems.
• Familiarity with large-scale speech datasets and real-world audio variability.
• Experience training and serving large models on modern accelerators.
• Track record of designing controlled experiments and meaningful ablations.
• Comfortable in fast-moving startup environments.
Requirements
• Ability to derive new formulations and implement them efficiently.
• Strong intuition for audio quality, prosody, and conversational dynamics.
• Knowledge of inference optimization techniques, including quantization, kernel optimization, and memory efficiency.
• Understanding of real-time constraints in telephony or streaming environments.
• Ability to move quickly from hypothesis to validation.
• Strong ownership mindset from research through deployment.
• Excited by ambiguous, unsolved problems.
Benefits
• Healthcare, dental, vision, all the good stuff
• Meaningful equity in a fast-growing company
• Every tool you need to succeed
• Beautiful office in Jackson Square, SF with rooftop views
• Competitive salary: $160,000 to $250,000