It's the end of 2024, and the React Native community lags behind in providing robust speech and audio capabilities, despite the huge and growing market pressure for voice AI.
I have consulted with clients and built apps with real-time communication from scratch. Until recently, communication was primarily human to human, but now humans can be emulated with AI, and computers can understand and generate human language to provide services like AI assistants, real-time translation, and voice-driven interfaces.
Let's take a closer look at the workflow.
Users can talk to AI, and AI can also speak. So, you basically need the ability to convert speech to text and text to speech.

- Speech to text: `react-native-voice` and `expo-speech-recognition` can be used for this purpose.
- Text to speech: `react-native-tts` and `expo-speech` are available for this functionality.

You'll quickly learn that these libraries are meant to be simple and easy to use. However, in the end, you may need to customize the native part to meet advanced requirements.

When you need speech to text, `react-native-voice` comes to the rescue.
## react-native-voice
First, let's discuss `react-native-voice`. We're focusing on this library because TheWidlarzGroup is receiving increasing requests for speech recognition consulting. This is due to the recent surge of AI assistants and the need for more voice interfaces to be implemented in apps.
And this is exactly what `react-native-voice` does. It works on Android and iOS, providing limited capabilities via the non-customizable, on-device (edge) models of each native platform.
Limited? Yes, but it's actually convenient in most simple use cases. Let's look at the JavaScript API:
What you can do with speech: `start` (with locale), `stop`, `cancel`, and `destroy`; you can also check `isSpeechAvailable` and `isRecognizing`. There are also callbacks for `results`, `start`, `partialResults`, `error`, `end`, `recognized`, and `speechVolumeChanged`. Sounds like a complete set!
And it uses these native APIs to achieve that:

On Android:

- `SpeechRecognizer`—for speech recognition
- `RecognitionListener`—for handling recognition events
- `RecognizerIntent`—for configuring speech recognition

On iOS:

- `SFSpeechRecognizer`—the main class for speech recognition
- `AVAudioEngine`—for capturing audio
- `AVAudioSession`—for managing the audio session
- `SFSpeechAudioBufferRecognitionRequest`—for processing the audio buffer

Below is our path to building a production-ready app with `react-native-voice`. If you don't want to spend tens of hours like we did, contact us for a consulting call.
## Is react-native-voice opinionated?

If you look into the native codebase, you'll notice several architectural decisions made by the contributors. Let's look into iOS as an example.
The `@react-native-voice/voice` library implements a rigid audio processing pipeline with fixed parameters—a low-pass filter coefficient of 0.5 and volume normalization to a 0–10 scale. While this works well for typical mobile applications, it becomes a limitation in specialized environments. Consider the hardcoded decibel calculation and normalization.
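The library's native implementation is Objective-C, so what follows is a minimal Swift sketch of the idea: the 0.5 low-pass coefficient and the 0–10 output scale are the fixed values described above, while the RMS-based power computation and the -60 dB floor are illustrative assumptions rather than the library's exact code.

```swift
import AVFoundation

// Illustrative sketch of the fixed pipeline (the real code is Objective-C).
// Only the 0.5 coefficient and the 0-10 scale reflect the library's
// documented behavior; the rest is assumed for the example.
let lowPassCoefficient: Float = 0.5 // hardcoded, not exposed to JS
var previousLevel: Float = 0

func volumeLevel(from buffer: AVAudioPCMBuffer) -> Float {
    guard let samples = buffer.floatChannelData?[0], buffer.frameLength > 0 else { return 0 }

    // Average power of the buffer, converted to decibels.
    var sum: Float = 0
    for i in 0..<Int(buffer.frameLength) { sum += samples[i] * samples[i] }
    let rms = sqrt(sum / Float(buffer.frameLength))
    let decibels = 20 * log10(max(rms, .leastNonzeroMagnitude))

    // Fixed low-pass smoothing with the 0.5 coefficient...
    let smoothed = lowPassCoefficient * decibels + (1 - lowPassCoefficient) * previousLevel
    previousLevel = smoothed

    // ...then normalization of an assumed -60...0 dB range onto the 0-10 scale.
    let clamped = min(max(smoothed, -60), 0)
    return (clamped + 60) / 60 * 10
}
```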
This makes it challenging to adapt to different acoustic environments like industrial settings or professional audio applications. The fixed 0–10 scale for volume normalization, while simple, might not suit applications requiring more precise audio monitoring. Because these parameters are embedded in the native implementation, they cannot be modified through React Native. This forces developers to consider alternative solutions when more control over audio processing is required.
The library enforces a fixed audio buffer configuration, with a predetermined size of 1024 samples and a non-configurable audio tap setup.
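Below is a minimal Swift reconstruction of that setup, assuming the standard `AVAudioEngine` tap plus `SFSpeechAudioBufferRecognitionRequest` wiring; only the 1024-sample buffer size is the library's fixed value.

```swift
import AVFoundation
import Speech

// Sketch of the non-configurable tap setup. The 1024-sample buffer size
// is fixed in the native module and cannot be changed from JavaScript.
func startCapture(into request: SFSpeechAudioBufferRecognitionRequest,
                  using audioEngine: AVAudioEngine) throws {
    let inputNode = audioEngine.inputNode
    let recordingFormat = inputNode.outputFormat(forBus: 0)

    // bufferSize is hardcoded to 1024; callers get no say in the
    // latency vs. throughput trade-off.
    inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { buffer, _ in
        request.append(buffer)
    }

    audioEngine.prepare()
    try audioEngine.start()
}
```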
This hardcoded buffer size represents a one-size-fits-all approach that balances memory usage and latency for typical voice recognition scenarios. However, it becomes problematic when building applications with specific latency requirements or different server-side constraints. For instance, real-time voice command applications might benefit from smaller buffers for faster response times, while high-quality voice recording might require larger buffers. As this configuration is embedded in the native code, developers cannot adjust it through React Native. This may force them to fork the library or seek alternative solutions when buffer size optimization is crucial for their use case.
The library implements a fixed audio session configuration with predetermined routing behavior.
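A hedged Swift sketch of this kind of baked-in session setup follows; the exact category, mode, and options may differ from the library's, but the point is that the routing policy is decided once in native code.

```swift
import AVFoundation

// Illustrative fixed session setup: category, mode, and routing options
// are chosen once natively; JS callers cannot opt into a different
// routing policy for, say, external audio interfaces.
func configureAudioSession() throws {
    let session = AVAudioSession.sharedInstance()
    try session.setCategory(.playAndRecord,
                            mode: .measurement,
                            options: [.defaultToSpeaker, .allowBluetooth])
    try session.setActive(true, options: .notifyOthersOnDeactivation)
}
```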
This opinionated approach to audio routing and speaker management works for standard voice recognition scenarios but becomes limiting when dealing with complex audio setups. The hardcoded behavior for handling Bluetooth devices and headphones, while reliable for basic use cases, doesn't allow for dynamic audio routing or optimization for specific hardware configurations. This is particularly challenging when building professional audio applications that require precise control over the audio session.
The library uses a simplified state management system with binary flags.
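Roughly, in Swift (the flag names and cleanup order here are illustrative, not the library's exact symbols):

```swift
import AVFoundation
import Speech

// Illustrative flag-based state machine: two booleans and one fixed
// teardown path, with no room for intermediate or custom states.
final class RecognitionState {
    var isRecognizing = false
    var isTearingDown = false

    func teardown(engine: AVAudioEngine,
                  request: SFSpeechAudioBufferRecognitionRequest?) {
        guard !isTearingDown else { return }
        isTearingDown = true
        engine.stop()
        engine.inputNode.removeTap(onBus: 0)
        request?.endAudio()
        isRecognizing = false
        isTearingDown = false
    }
}
```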
This basic state machine handles the typical voice recognition lifecycle but lacks flexibility for more complex workflows. The predefined cleanup behavior and fixed error handling patterns make it difficult to implement multi-stage voice processing or custom recognition states. Applications that require sophisticated state management—such as those with intermediate processing steps or complex error recovery mechanisms—might find these constraints too restrictive and may need to consider alternative implementations.
The `@react-native-voice/voice` library, while providing essential speech recognition functionality, abstracts away several powerful features available in the native speech recognition APIs. This simplification, while aiding common use cases, means that developers lose access to sophisticated recognition controls that could be crucial for specialized applications.
On iOS, `SFSpeechRecognizer` offers advanced task configuration options that remain inaccessible through the React Native interface.
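These are real `SFSpeechRecognitionRequest` properties; the Swift sketch below shows what setting them looks like (the contextual strings are placeholder values).

```swift
import Speech

// Native request options the React Native wrapper does not expose.
let request = SFSpeechAudioBufferRecognitionRequest()
request.taskHint = .dictation                        // optimize for dictation vs. .search or .confirmation
request.contextualStrings = ["invoice", "consignee"] // bias recognition toward domain vocabulary (placeholder terms)
request.shouldReportPartialResults = true
if #available(iOS 13.0, *) {
    request.requiresOnDeviceRecognition = true       // force on-device processing
}
```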
These hidden capabilities include optimization hints for different recognition scenarios (like dictation vs. search), custom vocabulary injection, and control over on-device recognition. The ability to provide contextual strings could significantly improve recognition accuracy for domain-specific applications, while task hints could optimize the recognition engine for specific use cases.
Similarly, Android's `SpeechRecognizer` exposes detailed configuration options through its Intent system: extras such as `RecognizerIntent.EXTRA_LANGUAGE_MODEL`, `EXTRA_MAX_RESULTS`, `EXTRA_PARTIAL_RESULTS`, and `EXTRA_PREFER_OFFLINE`.
These settings allow for fine-tuned control over language models, multiple recognition hypotheses, and confidence thresholds—features that could be valuable for applications requiring more precise control over the recognition process.
Applications requiring precise confidence scoring, detailed analytics about recognition performance, sophisticated error handling, fine-grained control over recognition quality, or detailed timing information may find these features essential.
The absence of these features in the `react-native-voice` interface means developers may need to create custom native modules to access this functionality when building professional-grade speech recognition applications.
While `@react-native-voice/voice` provides a solid foundation for basic voice recognition features, the library's simplification of native APIs creates significant limitations for professional and specialized applications. The hidden capabilities in both iOS and Android platforms become crucial when building enterprise-grade speech recognition solutions.
Professional customization becomes essential when your application requires precise control over the recognition process.
Real-time translation services require not just accurate speech recognition but also detailed confidence scores and multiple recognition hypotheses to ensure translation quality. These applications often need custom vocabularies and contextual optimization—features available in native APIs but inaccessible through the React Native interface.
A real-time translation application was experiencing intermittent recognition issues with the `@react-native-voice/voice` library. The standard error handling only provided basic error states, making it challenging to tell transient network failures apart from permission or audio problems, or to build targeted recovery logic.
To improve reliability, we developed a custom native module that exposes detailed error information from the underlying APIs.
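Here is a simplified Swift sketch of the approach, assuming an `RCTEventEmitter`-based module; the event name and payload fields are illustrative choices rather than a published API.

```swift
import Speech
import React

// Illustrative module that forwards the recognizer's full NSError to
// JavaScript instead of collapsing it into one generic error state.
@objc(DetailedSpeechErrors)
class DetailedSpeechErrors: RCTEventEmitter {
    override func supportedEvents() -> [String]! {
        ["onDetailedSpeechError"]
    }

    // Called from the recognition task's completion handler.
    func emit(_ error: NSError) {
        sendEvent(withName: "onDetailedSpeechError", body: [
            "code": error.code,                       // platform-specific error code
            "domain": error.domain,                   // e.g. a network vs. assistant error domain
            "message": error.localizedDescription,
            "isNetworkError": error.domain == NSURLErrorDomain,
        ])
    }
}
```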
This improved error handler allows the app to handle a wider range of scenarios: retrying automatically after transient network errors, prompting the user when permissions are the problem, and surfacing actionable messages instead of a generic failure.
By exposing these native API capabilities, we transformed basic speech recognition into a more robust, production-grade implementation suitable for applications like real-time translation.
TheWidlarzGroup specializes in bridging this gap by developing custom native modules that expose these powerful capabilities. Our team can implement custom audio processing pipelines, advanced recognition configuration, and detailed error reporting tailored to your use case.
The decision to invest in custom development becomes particularly compelling when recognition accuracy directly impacts your business outcomes. Whether you're building a real-time translation service or a specialized voice command interface, TheWidlarzGroup can help unlock the full potential of native speech recognition capabilities within your React Native application.
{{cta}}