It's the end of 2024, and the React Native community lags behind in providing robust speech and audio capabilities, despite the huge and growing market pressure for voice AI.
I have consulted with clients and built apps with real-time communication from scratch. Until recently, communication was primarily human to human, but now humans can be emulated with AI, and computers can understand and generate human language to provide services like AI assistants, real-time translation, and voice-driven interfaces.
Let's take a closer look at the workflow.
Users can talk to AI, and AI can also speak. So, you basically need the ability to convert speech to text and text to speech.

- Speech to text: `react-native-voice` and `expo-speech-recognition` can be used for this purpose.
- Text to speech: `react-native-tts` and `expo-speech` are available for this functionality.

You'll quickly learn that these libraries are meant to be simple and easy to use. However, in the end, you may need to customize the native part to meet advanced requirements.

When you need speech to text, `react-native-voice` comes to the rescue.
## react-native-voice
First, let's discuss `react-native-voice`. We're focusing on this library because TheWidlarzGroup is receiving increasing requests for speech recognition consulting. This is due to the recent surge of AI assistants and the need for more voice interfaces to be implemented in apps.
And this is exactly what `react-native-voice` does. It works on Android and iOS, providing limited capabilities via the non-customizable, on-device (edge) models of each native platform.
Limited? Yes, but it's actually convenient in most simple use cases. Let's look at the JavaScript API:
What you can do with speech: `start` (with locale), `stop`, `cancel`, and `destroy`; you can also check `isSpeechAvailable` and `isRecognizing`. There are also callbacks for `results`, `start`, `partialResults`, `error`, `end`, `recognized`, and `speechVolumeChanged`. Sounds like a complete set!
And it uses these native APIs to achieve that:

On Android:

- `SpeechRecognizer`—for speech recognition
- `RecognitionListener`—for handling recognition events
- `RecognizerIntent`—for configuring speech recognition

On iOS:

- `SFSpeechRecognizer`—the main class for speech recognition
- `AVAudioEngine`—for capturing audio
- `AVAudioSession`—for managing the audio session
- `SFSpeechAudioBufferRecognitionRequest`—for processing the audio buffer

Below is our path to building a production-ready app with `react-native-voice`. If you don't want to spend tens of hours like we did, contact us for a consulting call.
## Is react-native-voice opinionated?

If you look into the native codebase, you'll notice several architectural decisions made by the contributors. Let's look into iOS as an example.
The `@react-native-voice/voice` library implements a rigid audio processing pipeline with fixed parameters—a low-pass filter coefficient of 0.5 and volume normalization to a 0–10 scale. While this works well for typical mobile applications, it becomes a limitation in specialized environments. Consider the hardcoded decibel calculation and normalization.
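The library's native implementation is Objective-C, so what follows is a minimal Swift sketch of the idea: the 0.5 low-pass coefficient and the 0–10 output scale are the fixed values described above, while the RMS-based power computation and the -60 dB floor are illustrative assumptions rather than the library's exact code.

```swift
import AVFoundation

// Illustrative sketch of the fixed pipeline (the real code is Objective-C).
// Only the 0.5 coefficient and the 0-10 scale reflect the library's
// documented behavior; the rest is assumed for the example.
let lowPassCoefficient: Float = 0.5 // hardcoded, not exposed to JS
var previousLevel: Float = 0

func volumeLevel(from buffer: AVAudioPCMBuffer) -> Float {
    guard let samples = buffer.floatChannelData?[0], buffer.frameLength > 0 else { return 0 }

    // Average power of the buffer, converted to decibels.
    var sum: Float = 0
    for i in 0..<Int(buffer.frameLength) { sum += samples[i] * samples[i] }
    let rms = sqrt(sum / Float(buffer.frameLength))
    let decibels = 20 * log10(max(rms, .leastNonzeroMagnitude))

    // Fixed low-pass smoothing with the 0.5 coefficient...
    let smoothed = lowPassCoefficient * decibels + (1 - lowPassCoefficient) * previousLevel
    previousLevel = smoothed

    // ...then normalization of an assumed -60...0 dB range onto the 0-10 scale.
    let clamped = min(max(smoothed, -60), 0)
    return (clamped + 60) / 60 * 10
}
```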
This makes it challenging to adapt to different acoustic environments like industrial settings or professional audio applications. The fixed 0–10 scale for volume normalization, while simple, might not suit applications requiring more precise audio monitoring. Because these parameters are embedded in the native implementation, they cannot be modified through React Native. This forces developers to consider alternative solutions when more control over audio processing is required.
The library enforces a fixed audio buffer configuration, with a predetermined size of 1024 samples and a non-configurable audio tap setup.
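Below is a minimal Swift reconstruction of that setup, assuming the standard `AVAudioEngine` tap plus `SFSpeechAudioBufferRecognitionRequest` wiring; only the 1024-sample buffer size is the library's fixed value.

```swift
import AVFoundation
import Speech

// Sketch of the non-configurable tap setup. The 1024-sample buffer size
// is fixed in the native module and cannot be changed from JavaScript.
func startCapture(into request: SFSpeechAudioBufferRecognitionRequest,
                  using audioEngine: AVAudioEngine) throws {
    let inputNode = audioEngine.inputNode
    let recordingFormat = inputNode.outputFormat(forBus: 0)

    // bufferSize is hardcoded to 1024; callers get no say in the
    // latency vs. throughput trade-off.
    inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { buffer, _ in
        request.append(buffer)
    }

    audioEngine.prepare()
    try audioEngine.start()
}
```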
This hardcoded buffer size represents a one-size-fits-all approach that balances memory usage and latency for typical voice recognition scenarios. However, it becomes problematic when building applications with specific latency requirements or different server-side constraints. For instance, real-time voice command applications might benefit from smaller buffers for faster response times, while high-quality voice recording might require larger buffers. As this configuration is embedded in the native code, developers cannot adjust it through React Native. This may force them to fork the library or seek alternative solutions when buffer size optimization is crucial for their use case.
The library implements a fixed audio session configuration with predetermined routing behavior.
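A hedged Swift sketch of this kind of baked-in session setup follows; the exact category, mode, and options may differ from the library's, but the point is that the routing policy is decided once in native code.

```swift
import AVFoundation

// Illustrative fixed session setup: category, mode, and routing options
// are chosen once natively; JS callers cannot opt into a different
// routing policy for, say, external audio interfaces.
func configureAudioSession() throws {
    let session = AVAudioSession.sharedInstance()
    try session.setCategory(.playAndRecord,
                            mode: .measurement,
                            options: [.defaultToSpeaker, .allowBluetooth])
    try session.setActive(true, options: .notifyOthersOnDeactivation)
}
```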
This opinionated approach to audio routing and speaker management works for standard voice recognition scenarios but becomes limiting when dealing with complex audio setups. The hardcoded behavior for handling Bluetooth devices and headphones, while reliable for basic use cases, doesn't allow for dynamic audio routing or optimization for specific hardware configurations. This is particularly challenging when building professional audio applications that require precise control over the audio session.
The library uses a simplified state management system with binary flags.
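Roughly, in Swift (the flag names and cleanup order here are illustrative, not the library's exact symbols):

```swift
import AVFoundation
import Speech

// Illustrative flag-based state machine: two booleans and one fixed
// teardown path, with no room for intermediate or custom states.
final class RecognitionState {
    var isRecognizing = false
    var isTearingDown = false

    func teardown(engine: AVAudioEngine,
                  request: SFSpeechAudioBufferRecognitionRequest?) {
        guard !isTearingDown else { return }
        isTearingDown = true
        engine.stop()
        engine.inputNode.removeTap(onBus: 0)
        request?.endAudio()
        isRecognizing = false
        isTearingDown = false
    }
}
```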
This basic state machine handles the typical voice recognition lifecycle but lacks flexibility for more complex workflows. The predefined cleanup behavior and fixed error handling patterns make it difficult to implement multi-stage voice processing or custom recognition states. Applications that require sophisticated state management—such as those with intermediate processing steps or complex error recovery mechanisms—might find these constraints too restrictive and may need to consider alternative implementations.
The `@react-native-voice/voice` library, while providing essential speech recognition functionality, abstracts away several powerful features available in the native speech recognition APIs. This simplification, while aiding common use cases, means that developers lose access to sophisticated recognition controls that could be crucial for specialized applications.
On iOS, `SFSpeechRecognizer` offers advanced task configuration options that remain inaccessible through the React Native interface.
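These are real `SFSpeechRecognitionRequest` properties; the Swift sketch below shows what setting them looks like (the contextual strings are placeholder values).

```swift
import Speech

// Native request options the React Native wrapper does not expose.
let request = SFSpeechAudioBufferRecognitionRequest()
request.taskHint = .dictation                        // optimize for dictation vs. .search or .confirmation
request.contextualStrings = ["invoice", "consignee"] // bias recognition toward domain vocabulary (placeholder terms)
request.shouldReportPartialResults = true
if #available(iOS 13.0, *) {
    request.requiresOnDeviceRecognition = true       // force on-device processing
}
```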
These hidden capabilities include optimization hints for different recognition scenarios (like dictation vs. search), custom vocabulary injection, and control over on-device recognition. The ability to provide contextual strings could significantly improve recognition accuracy for domain-specific applications, while task hints could optimize the recognition engine for specific use cases.
Similarly, Android's `SpeechRecognizer` exposes detailed configuration options through its Intent system: extras such as `RecognizerIntent.EXTRA_LANGUAGE_MODEL`, `EXTRA_MAX_RESULTS`, `EXTRA_PARTIAL_RESULTS`, and `EXTRA_PREFER_OFFLINE`.
These settings allow for fine-tuned control over language models, multiple recognition hypotheses, and confidence thresholds—features that could be valuable for applications requiring more precise control over the recognition process.
Applications requiring precise confidence scoring, detailed analytics about recognition performance, sophisticated error handling, fine-grained control over recognition quality, or detailed timing information may find these features essential.
The absence of these features in the `react-native-voice` interface means developers may need to create custom native modules to access this functionality when building professional-grade speech recognition applications.
While `@react-native-voice/voice` provides a solid foundation for basic voice recognition features, the library's simplification of native APIs creates significant limitations for professional and specialized applications. The hidden capabilities in both iOS and Android platforms become crucial when building enterprise-grade speech recognition solutions.
Professional customization becomes essential when your application requires precise control over the recognition process.
Real-time translation services require not just accurate speech recognition but also detailed confidence scores and multiple recognition hypotheses to ensure translation quality. These applications often need custom vocabularies and contextual optimization—features available in native APIs but inaccessible through the React Native interface.
A real-time translation application was experiencing intermittent recognition issues with the `@react-native-voice/voice` library. The standard error handling only provided basic error states, making it challenging to tell transient network failures apart from permission or audio problems, or to build targeted recovery logic.
To improve reliability, we developed a custom native module that exposes detailed error information from the underlying APIs.
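Here is a simplified Swift sketch of the approach, assuming an `RCTEventEmitter`-based module; the event name and payload fields are illustrative choices rather than a published API.

```swift
import Speech
import React

// Illustrative module that forwards the recognizer's full NSError to
// JavaScript instead of collapsing it into one generic error state.
@objc(DetailedSpeechErrors)
class DetailedSpeechErrors: RCTEventEmitter {
    override func supportedEvents() -> [String]! {
        ["onDetailedSpeechError"]
    }

    // Called from the recognition task's completion handler.
    func emit(_ error: NSError) {
        sendEvent(withName: "onDetailedSpeechError", body: [
            "code": error.code,                       // platform-specific error code
            "domain": error.domain,                   // e.g. a network vs. assistant error domain
            "message": error.localizedDescription,
            "isNetworkError": error.domain == NSURLErrorDomain,
        ])
    }
}
```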
This improved error handler allows the app to handle a wider range of scenarios: retrying automatically after transient network errors, prompting the user when permissions are the problem, and surfacing actionable messages instead of a generic failure.
By exposing these native API capabilities, we transformed basic speech recognition into a more robust, production-grade implementation suitable for applications like real-time translation.
TheWidlarzGroup specializes in bridging this gap by developing custom native modules that expose these powerful capabilities. Our team can implement custom audio processing pipelines, advanced recognition configuration, and detailed error reporting tailored to your use case.
The decision to invest in custom development becomes particularly compelling when recognition accuracy directly impacts your business outcomes. Whether you're building a real-time translation service or a specialized voice command interface, TheWidlarzGroup can help unlock the full potential of native speech recognition capabilities within your React Native application.
{{cta}}