I'm the ElevenLabs CEO - what do you want to do with voice AI but can't? (AMA)
Hi Everyone!
Solving AI audio end-to-end means tackling both generation and understanding - from text-to-speech to speech-to-text and everything in between. At ElevenLabs, we’re working on breakthroughs in AI audio that bridge research and real-world use.
Ask me anything about what we’re building, the challenges of scaling AI speech models, and where this space is headed. Also keen to hear what you’ve built with ElevenLabs!
Replies
Statfluence
When will AI advance enough to generate a course with my voice and appearance, making it indistinguishable from me actually presenting? (course is just an example use case)
Product Hunt
@ralphlasry There's an argument to be made that it doesn't need to be indistinguishable unless the target is people who know you well and intimately, and that all it needs to be is realistic and high quality. In that case, we might be there now?
ElevenLabs
@ralphlasry we might already be there for a number of voices, use-cases & languages! I think it will mainly depend on your exact accent or style of delivery - for example, narration use-cases in English with one of the English accents should already be very possible (try Professional Voice Cloning!).
The important thing for reproducing a voice well is high-quality input in the style you want. So for reproducing course materials, for example, you would ideally create the voice from recordings of you delivering the course - with 30 minutes of good-quality audio, you can likely create an amazing version already today! That will also be true for a lot of voiceover use-cases and the most popular languages (across Europe especially - Spanish, German and French are all very high quality).
There are some accents & use-cases where the models are getting good, but not perfect just yet. Ideally you would be able to capture European Spanish, Latin-American Spanish, Catalan etc. - and while it's already possible for these three, the more niche the accent, the harder it is to recreate. Similarly, the more conversational the use-case and the wider the emotional range, the harder it is to capture all those emotions. We are working on a set of new models that should reproduce this well and give much better control too, so I would expect that for the majority of spoken languages & use-cases, this will be possible this year!
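If you want to see roughly what this looks like in code, here is a rough sketch (not official sample code - the endpoint paths and response fields are assumptions based on the public REST API docs, so check the current reference before using it):

```python
# Rough sketch, not official sample code: clone a voice from course recordings,
# then narrate new material with it. Endpoint paths and response fields are
# assumptions based on the public ElevenLabs REST API - check the current reference.
import requests

API_KEY = "your-xi-api-key"  # placeholder
BASE = "https://api.elevenlabs.io/v1"
HEADERS = {"xi-api-key": API_KEY}

# 1) Create a cloned voice from ~30 minutes of clean audio of you delivering the course.
with open("course_lecture_sample.mp3", "rb") as f:
    resp = requests.post(
        f"{BASE}/voices/add",
        headers=HEADERS,
        data={"name": "My course voice"},
        files={"files": ("course_lecture_sample.mp3", f, "audio/mpeg")},
    )
voice_id = resp.json()["voice_id"]

# 2) Narrate a new lesson in that voice and save the audio.
tts = requests.post(
    f"{BASE}/text-to-speech/{voice_id}",
    headers={**HEADERS, "Content-Type": "application/json"},
    json={"text": "Welcome to lesson two. Today we cover unit economics."},
)
with open("lesson_two.mp3", "wb") as out:
    out.write(tts.content)
```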
ElevenLabs
@ralphlasry @rajiv_ayyangar and yes! I think we are already there for so many voices - though as a voice narrates different types of content, how you expect the narration to sound might change!
Try out elevenreader.io, where we provide some incredibly recognizable voices (like Deepak Chopra for meditation or Richard Feynman for lectures - both of course licensed from the people or their estates) to see whether you would recognize them!
Using ElevenLabs feels like magic. It's phenomenal and deserves all of the attention it gets. It seems like it would take momentous engineering not only for the actual tech, but also to build for this scale so quickly. What was the hardest technical challenge in building out the platform? Did you ever think it wasn't going to be possible? Any good stories around the technical feats that were required?
ElevenLabs
@steveb thank you! So much more magic we are researching & hoping to deploy soon!
When we started, the first use-case we picked was dubbing - we built a prototype using existing research/technologies that combined the flow from STT -> a translation step (via LLMs/other providers at the time) -> TTS + voice cloning for the output. The more we learned there, the clearer it became that the existing components weren't of high enough quality. Solving those underlying components and doing our own research became one of the defining parts of the company, and likely one of the biggest early challenges & feats we were able to accomplish. In 2022 we brought our TTS to life, shortly after that a way to clone voices, and now in 2025 we finally added the STT element too - all of which required tons of incredible research work.
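For the curious, that early cascaded dubbing flow was conceptually as simple as the sketch below - the three helpers are deliberate stubs for whichever STT, translation and TTS providers you wire in, not a specific API:

```python
# Illustrative sketch of the early cascaded dubbing flow:
# speech-to-text -> translation (via an LLM / MT provider) -> TTS with a cloned voice.
# The three helpers are deliberate stubs for whichever providers you plug in -
# this is not a specific ElevenLabs API.

def transcribe(audio_path: str) -> str:
    """STT step: return the transcript of the source audio."""
    raise NotImplementedError("call your speech-to-text provider here")

def translate(text: str, target_language: str) -> str:
    """Translation step: return the transcript in target_language."""
    raise NotImplementedError("call your LLM / machine-translation provider here")

def synthesize(text: str, voice_id: str) -> bytes:
    """TTS step: return audio of `text` spoken in the cloned voice."""
    raise NotImplementedError("call your text-to-speech provider here")

def dub_clip(audio_path: str, target_language: str, cloned_voice_id: str) -> bytes:
    transcript = transcribe(audio_path)
    translated = translate(transcript, target_language)
    return synthesize(translated, cloned_voice_id)
```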
Of course, throughout, as we scaled and brought the technology to the world, the usage going through has been enormous, and we built a great inference stack. How we maintain high reliability at that scale, provide low latency across any region (with dedicated servers globally), and auto-scale as needed has been one of the "hidden" engineering feats we are extremely proud of.
Daily.co
I'd love to hear any thoughts you have about architectures for next-generation models. This is maybe a bit of a more long-range question, but for conversational voice it would be nice to move past the approach of "triggering inference" and towards a more natively streaming approach to inference.
We now have the ability to maintain long-lived connections to APIs like your Conversation AI API. This is really exciting! I'd love to be able to rely on the model to decide whether to respond based on something other than "turn detection." For example, if I'm talking to my personal assistant agent, it might let me talk for a while about my todo list items, silently collecting data, before deciding to respond.
Product Hunt
@kwindla +1 would love to hear Mati's thoughts here! I'm imagining the next generation of conversational AI interfaces that have a much more fluid and flexible feel of conversation. And maybe even handle conversations between multiple humans and an AI better (and multiple AIs?).
ElevenLabs
@kwindla thanks for the question - and amazing work at Daily!
Voice will be the foundational interface to technology - there is so much nuance it can carry, and it does so better than text. So it goes without saying that Conversational AI is the space where we invest most of our engineering time today.
You are spot on with the set of efforts & ideas we are tracking: custom turn-detection functions, turn detection based on emotions (for example, if the agent detects you are stressed or angry it can interject and help calm you down), or the opposite, where the agent performs a wider set of tasks before responding. Already today, our Conversational AI product uses a wider set of signals (words, silence, textual context) to decide whether a turn should occur. We are now investing time in bringing more signals from the audio into the fold (emotions, audio context) so the conversation feels both more fluid and more controllable - and we think we can get many of these soon. And already today, we provide a set of tools & functions you can configure to do exactly the work above - task the agent with doing a booking and only responding back after it has done that booking.
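As a purely illustrative toy (not our actual logic), combining those signals into a single turn decision might look something like this:

```python
# Toy illustration only (not ElevenLabs' actual logic): combining a few signals -
# silence length, whether the utterance looks finished, and an emotion estimate -
# into a single "should the agent take its turn?" decision.
from dataclasses import dataclass

@dataclass
class TurnSignals:
    silence_ms: float    # how long the user has been silent
    text_so_far: str     # running transcript of the current utterance
    arousal: float       # 0..1 stress/emotion estimate from the audio (hypothetical)

def should_take_turn(s: TurnSignals) -> bool:
    looks_finished = s.text_so_far.rstrip().endswith((".", "?", "!"))
    # Interject sooner if the user sounds stressed; wait longer otherwise,
    # e.g. a personal assistant silently collecting to-do items.
    silence_threshold_ms = 400 if s.arousal > 0.7 else 900
    return s.silence_ms >= silence_threshold_ms and (looks_finished or s.silence_ms > 2000)
```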
Taking a bigger step back, there are two approaches to Conversational AI: a cascaded model of STT -> LLM (or similar) -> TTS (what we have built so far and are extending further), or a true voice-to-voice/speech-to-speech model (which we are researching). We are working on both, and which one is the better choice will depend on the use-case.
For reliability, scalability and control - the cascaded model is the way to go. For any larger business or use-case, these are the crucial factors, and that's where we think the first approach shines. We think a lot of the custom logic you mention, for example, will be easiest to implement at scale via this model - at least this year and next.
For slightly lower latency and a bit more expressiveness, but less reliability/scalability/control - voice-to-voice will be better. We expect more consumer-y use-cases to try this model, since they might not need as much access to knowledge bases, tools, functions or specific LLM guardrails.
ElevenLabs
@kwindla @rajiv_ayyangar 100% - many of the agents can already do this incredibly well, and we are seeing hundreds of companies & developers building great experiences. An interesting recent example is ThisGen, a company creating agents to train 911 responders - responders can interact with an emotional, very conversational agent to prepare better for real-life scenarios. Or (shameless plug!) you can try our own agent deployed on the docs at https://elevenlabs.io/docs/overview - ask it about API integration and it will automatically navigate you to the right spot, explain, and work through it with you in a much more assistant-friendly way. It feels like it's there in the background to help you unpack the use-case while you build.
Is it possible to record my current voice and then generate a voice from childhood or old age?
Or even more fun, to combine the voices of a male and a female to predict the voices of their offspring?
minimalist phone: creating folders
@onbing This is an interesting idea. :)
ElevenLabs
@onbing love the idea! We are working on exactly this - voice engineering, where you can start with a voice and then modify it via prompts: changing the accent, age, tempo it speaks with, or anything in between. An interesting side question, and what makes this hard, is how much of "your voice" / a specific voice is embedded in those characteristics - i.e. whether the accent or tempo you speak with is part of what makes the voice recognizable.
In any case, we expect it will be possible across many different variations, giving you even more control - you won't be limited to a set of predefined options, but will be able to truly prompt-engineer the voice in any way you want!
Hope to ship it soon!
Hi Mati,
I'm currently building an app called SUN, which allows curious minds to create deep-dive audio-courses on any topic with an integrated Q&A capability. I tested ElevenLabs' product, uploaded and trained voices, and I have to acknowledge, you've created an incredible product. Training a new voice requires less than a minute of audio, and the quality is arguably the best on the market.
However, after calculating unit economics, it became clear that using ElevenLabs wouldn't be a viable option for my startup. This is unfortunate because the experience I envision is voice-rich, incorporating nuanced accents, first-person perspectives, third-person perspectives, and more.
I'm curious about ElevenLabs' strategy regarding product costs. Are there any partnership programs on the roadmap for startups like mine? I’d love for SUN to be a long-term partner of ElevenLabs, but the current cost structure makes it unfeasible.
P.s. I hate to be a guy who complains about prices, but that's the challenge for my start-up in using ElevenLabs today.
ElevenLabs
@artinbogdanov great idea and excited to test it out one day! It would be amazing to listen to a course and then, when solving problems or learning, be able to ask follow-up questions - something I missed when studying Mathematics, and part of what Steve Jobs already wished was possible back in 1985!
Good news: we have exactly the programme for ideas like yours - just shoot us a note here: https://elevenlabs.io/startup-grants and we would be happy to give you access, free for 3 months while you build and test, with support too.
And beyond that, we can be more efficient at higher scale - and would be happy to talk through this as you build it out, to get the unit economics right on your side. Good luck with everything and hope to see the launch here one day :).
Thanks for doing this Mati! Curious about a couple of things here:
1) What's the bottleneck for having good TTS for many more languages? Is it technological (models can't support the phonemes of some languages), data-related (we need better tagged data for some languages), or logistical (it needs a lot of compute so you have to prioritize)? Or something else? Said another way, if you had infinite money to create TTS for all languages out there, where would you throw that money?
2) A lot (but not all) of conversational AI tech involves STT -> LLM -> TTS. This means that we invariably rely only on the words in a conversation and not other cues (sighs, pauses, emotion). Where do you think we are headed to solve for this? Would we always need an LLM as a middle layer?
Thanks for ElevenLabs btw. Avid users at Pear here :)
ElevenLabs
@agordhandas great questions and happy to hear you are a user & partner!
(1) A combination of these - for most languages it's data & compute. On the data front - having data in a given language is one of the necessary ingredients. On compute - this is the bigger factor, as adding a language (or set of languages) means retraining the model and then hosting/deploying it. That in turn means previous languages (if the model now includes all of them), and specifically the voices users created, might slightly change - so we go through a wider set of testing to ensure the impact is limited, and as we move from one model to another you can still benefit from all the voices you created, with the same amazing quality. The compute part also means we need to prioritize - as we are (relative to the big labs!) a much smaller company, we cannot run as many models in parallel, so we try to innovate on how to be smarter with less compute. And finally, some languages require slightly different approaches/breakthroughs to get right - for example, for Japanese with Kanji characters you need a different method to get the tonality & emotions right.
To answer the main question: I would deploy more capital (although capital isn't the main bottleneck!) on bringing in amazing researchers/research engineers (for any researchers reading - if you are interested, message me!) and on compute.
(2) We think you can bring more of that emotional context into the cascaded model too (i.e. STT -> LLM -> TTS) - that's exactly something we are working on, by preserving more audio context and fine-tuning the model to be a bit more emotional. But even today you can create an emotional agent - try prompting an agent in our Conversational AI product (https://elevenlabs.io/conversational-ai) with cues that it should keep uhms & ahms in its responses, plus create a voice that also has some of those elements/pauses, and you will see it already feels a lot more human!
There is an alternative approach here too (which we are also researching!): true speech-to-speech/voice-to-voice, where you create an omni-model experience and train all those pieces together (similar to what OpenAI did with 4o). You then get a bit more naturalness, but less stability, lower audio quality and less control/more hallucinations - of course, with time we expect a lot of these to get solved too. It will be interesting to see how big the gap between those two approaches ends up being - we think we can get naturalness extremely high with the cascaded model too :). However, for some use-cases, where for example slightly lower latency matters more than control or stability, that omni-approach is likely where it's headed!
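To make the prompting tip above concrete, here is a hypothetical example of the kind of instruction you could give an agent - the exact config shape is made up for illustration; put it wherever your agent's system prompt lives:

```python
# Hypothetical example of the kind of prompt cue described above - the config shape
# is made up for illustration; set this wherever your agent's system prompt lives.
AGENT_PROMPT = (
    "You are a friendly assistant. Speak the way people actually talk: "
    "keep natural fillers like 'uhm' and 'ah', use occasional short pauses "
    "(written as '...'), and add brief reactions like 'right' or 'got it'. "
    "Prefer short sentences over long, polished paragraphs."
)
```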
ElevenLabs
Hey Everyone! Thanks for dropping amazing questions - excited to get this started!
Hey!
I've just built something to generate audio stories for kids, letting parents clone their voices...!
Now I want to build something to "talk" with people who have died - I mean, you clone the voice of someone who has died.
Is this even legal? I mean the person who died is not gonna be able to allow you to clone their voice...
what are your thoughts about this? Could I use eleven labs for this?
ElevenLabs
@javierfandos your use-case is something we hear from parents quite often (for when they are travelling or just unavailable) - I hope that's going well!
For your main question - it depends on the geographical region and its laws. I can't give legal advice on this one and recommend diving deeper (/getting a lawyer's advice) on what would be possible. As I think about the future though, I would expect people to indicate in their will how their likeness should be passed on in the generative AI world; some similar laws in other spaces already exist that might help inform this too - e.g. estates hold rights to music passed on from the original creator, and the estate can then give specific approvals.
TTS/STT with more advancement...Gr88 innovation.
I imagine ElevenLabs has a ton of control and security measures around voice cloning, but with open source models popping up every day, do you think we'll see a proliferation of audio deep fakes? Are we already there? Is there a way to fight this or is this just something the future will hold regardless of what we do?