Building Advanced Speech Recognition in React Native: A Guide to Extending react-native-voice

Learn how to tackle the current challenges of speech recognition technology in React Native. This article demonstrates how TheWidlarzGroup provides professional, customized solutions that elevate voice applications to the next level. Discover how to enhance your app's performance with advanced speech recognition features.
Tags: React Native, React Native Voice, React Native Audio, AI Assistant, Speech Recognition
Published on November 13, 2024

Voice in React Native

It's the end of 2024, and the React Native community lags behind in providing robust speech and audio capabilities, despite the huge and growing market pressure for voice AI.

I have consulted with clients and built real-time communication apps from scratch. Until recently, that communication was primarily human-to-human; now AI can stand in for the human on the other end, understanding and generating language to power services like:

  • Chatbots: AI conversational agents assisting users with various tasks.
  • Voice-Controlled Smart Homes: AI managing home devices through speech.
  • Virtual Assistants: AI handling calls and messages.
  • Travel Assistants: AI helping with planning and booking trips in real time.
  • Interactive Entertainment: AI characters engaging users in conversational games.
  • Educational Tools: AI providing interactive learning experiences through voice.

Let's take a closer look at the workflow.

Users can talk to AI, and AI can also speak. So, you basically need the ability to convert speech to text and text to speech.

  • Speech to Text: Converting spoken words into text. Libraries like react-native-voice and expo-speech-recognition can be used for this purpose.
  • Text to Speech: Converting text into spoken words. Libraries like react-native-tts and expo-speech are available for this functionality (see the one-liner after this list).
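
The text-to-speech half is usually the easy part. For example, with expo-speech (using its documented speak API), a single call is enough:

import * as Speech from 'expo-speech';

// Speak a sentence with an explicit locale and rate
Speech.speak('Your table is ready.', { language: 'en-US', rate: 1.0 });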

You'll quickly learn that these libraries are meant to be simple and easy to use. However, in the end, you may need to customize the native part to meet advanced requirements.

When you need speech to text, react-native-voice comes to the rescue.

react-native-voice

First, let's discuss react-native-voice. We're focusing on this library because TheWidlarzGroup is receiving more and more requests for speech recognition consulting, driven by the recent surge of AI assistants and the growing demand for voice interfaces in apps.

And speech recognition is exactly what react-native-voice provides. It works on both Android and iOS, offering a limited set of capabilities built on each platform's non-customizable on-device models.

Limited? Yes, but it's actually convenient in most simple use cases. Let's look at the JavaScript API:

What you can do with speech: start (with a locale), stop, cancel, and destroy a recognition session, plus check isAvailable and isRecognizing. There are also callbacks for results, start, partialResults, error, end, recognized, and speechVolumeChanged. Sounds like a complete set! A minimal wiring is sketched below.
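
Here is a minimal sketch of how a session is typically wired up, using the method and event names from the library's documentation (permission handling trimmed for brevity):

import Voice, {
  SpeechResultsEvent,
  SpeechErrorEvent,
} from '@react-native-voice/voice';

// Register lifecycle callbacks before starting a session
Voice.onSpeechStart = () => console.log('Recognition started');
Voice.onSpeechPartialResults = (e: SpeechResultsEvent) => console.log('Partial:', e.value);
Voice.onSpeechResults = (e: SpeechResultsEvent) => console.log('Final:', e.value);
Voice.onSpeechError = (e: SpeechErrorEvent) => console.error('Error:', e.error);

async function startListening() {
  if (await Voice.isAvailable()) {
    await Voice.start('en-US'); // start recognition for a given locale
  }
}

async function stopListening() {
  await Voice.stop();    // end the session and deliver final results
  await Voice.destroy(); // release the native recognizer
  Voice.removeAllListeners();
}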

And it uses these native APIs to achieve that:

  • Android:
    • SpeechRecognizer—for speech recognition
    • RecognitionListener—for handling recognition events
    • RecognizerIntent—for configuring speech recognition
  • iOS:
    • SFSpeechRecognizer—the main class for speech recognition
    • AVAudioEngine—for capturing audio
    • AVAudioSession—for managing the audio session
    • SFSpeechAudioBufferRecognitionRequest—for processing the audio buffer

Below is our path to building a production-ready app with react-native-voice. If you don't want to spend tens of hours on it like we did, contact us for a consulting call.

How is react-native-voice opinionated?

If you look into the native codebase, you'll notice several architectural decisions made by the contributors. Let's look into iOS as an example.

Audio Filtering Constraints

The @react-native-voice/voice library implements a rigid audio processing pipeline with fixed parameters—a low-pass filter coefficient of 0.5 and volume normalization to a 0–10 scale. While this works well for typical mobile applications, it becomes a limitation in specialized environments. The hardcoded decibel calculation and normalization:

self.averagePowerForChannel0 = (LEVEL_LOWPASS_TRIG * ((avgValue == 0) ? -100 : 20.0 * log10f(avgValue))) + ((1 - LEVEL_LOWPASS_TRIG) * self.averagePowerForChannel0);

makes it challenging to adapt to different acoustic environments like industrial settings or professional audio applications. The fixed 0–10 scale for volume normalization, while simple, might not suit applications requiring more precise audio monitoring. Because these parameters are embedded in the native implementation, they cannot be modified through React Native. This forces developers to consider alternative solutions when more control over audio processing is required.
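
For illustration, here is what the same calculation looks like with the coefficient and scale made configurable. This is a hypothetical sketch of a tunable variant (the -60 to 0 dB normalization range is our assumption), not part of the library's API:

// Hypothetical tunable version of the library's hardcoded smoothing;
// alpha = 0.5 reproduces the current behavior
function smoothPowerDb(
  avgSampleValue: number, // mean absolute sample amplitude of the buffer
  previousDb: number,     // previous smoothed value in dB
  alpha = 0.5,            // low-pass coefficient (the library hardcodes 0.5)
): number {
  const instantDb = avgSampleValue === 0 ? -100 : 20 * Math.log10(avgSampleValue);
  return alpha * instantDb + (1 - alpha) * previousDb;
}

// Map a dB reading onto an arbitrary scale (the library hardcodes 0-10)
function normalizeVolume(db: number, minDb = -60, maxDb = 0, scale = 10): number {
  const clamped = Math.min(Math.max(db, minDb), maxDb);
  return ((clamped - minDb) / (maxDb - minDb)) * scale;
}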

Audio Buffer Configuration Constraints

The library enforces a fixed audio buffer configuration with a predetermined size of 1024 samples and non-configurable audio tap setup:

[mixer installTapOnBus:0 bufferSize:1024 format:recordingFormat block:^{...}];

This hardcoded buffer size represents a one-size-fits-all approach that balances memory usage and latency for typical voice recognition scenarios. However, it becomes problematic when building applications with specific latency requirements or different server-side constraints. For instance, real-time voice command applications might benefit from smaller buffers for faster response times, while high-quality voice recording might require larger buffers. As this configuration is embedded in the native code, developers cannot adjust it through React Native. This may force them to fork the library or seek alternative solutions when buffer size optimization is crucial for their use case.
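
The latency cost of a buffer size is simple to reason about: each tap callback covers bufferSize / sampleRate seconds of audio. A quick back-of-the-envelope sketch (the 44.1 kHz sample rate is illustrative):

// Approximate per-buffer latency for a few candidate configurations
function bufferLatencyMs(bufferSize: number, sampleRate: number): number {
  return (bufferSize / sampleRate) * 1000;
}

console.log(bufferLatencyMs(1024, 44100)); // ~23 ms, the library's fixed setting
console.log(bufferLatencyMs(256, 44100));  // ~6 ms, snappier voice commands
console.log(bufferLatencyMs(4096, 44100)); // ~93 ms, fewer and larger chunks for streaming upload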

Audio Session Management Constraints

The library implements a fixed audio session configuration with predetermined routing behavior:

[self.audioSession setCategory:AVAudioSessionCategoryPlayAndRecord withOptions:AVAudioSessionCategoryOptionDefaultToSpeaker error:nil];

This opinionated approach to audio routing and speaker management works for standard voice recognition scenarios but becomes limiting when dealing with complex audio setups. The hardcoded behavior for handling Bluetooth devices and headphones, while reliable for basic use cases, doesn't allow for dynamic audio routing or optimization for specific hardware configurations. This is particularly challenging when building professional audio applications that require precise control over the audio session.
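
To make the limitation concrete, here is a hypothetical option shape a custom native module could accept in place of the fixed PlayAndRecord-with-speaker setup. None of these options exist in react-native-voice today; the names are ours:

// Hypothetical audio session options a custom native module could expose
interface AudioSessionOptions {
  category: 'playAndRecord' | 'record';
  defaultToSpeaker: boolean;  // force the loudspeaker instead of the earpiece
  allowBluetooth: boolean;    // permit routing to Bluetooth headsets
  preferredInput?: 'builtInMic' | 'headsetMic' | 'bluetoothHFP';
}

// Assumed to be implemented by a custom native module
declare function configureAudioSession(options: AudioSessionOptions): Promise<void>;

// Example: dictation that stays on a connected headset
async function setUpHeadsetDictation(): Promise<void> {
  await configureAudioSession({
    category: 'playAndRecord',
    defaultToSpeaker: false,
    allowBluetooth: true,
    preferredInput: 'bluetoothHFP',
  });
}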

Recognition State Machine Limitations

The library uses a simplified state management system with binary flags:

@property (nonatomic) BOOL isTearingDown;
@property (nonatomic) BOOL continuous;

This basic state machine handles the typical voice recognition lifecycle but lacks flexibility for more complex workflows. The predefined cleanup behavior and fixed error handling patterns make it difficult to implement multi-stage voice processing or custom recognition states. Applications that require sophisticated state management—such as those with intermediate processing steps or complex error recovery mechanisms—might find these constraints too restrictive and may need to consider alternative implementations.
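
As a point of comparison, a richer lifecycle can be modeled with explicit states rather than booleans. The sketch below is purely hypothetical (none of these states exist in the library) and only illustrates the kind of state machine a multi-stage workflow might need:

// Hypothetical recognition states a more flexible module could expose
type RecognitionState =
  | { kind: 'idle' }
  | { kind: 'listening'; startedAt: number }
  | { kind: 'processing'; stage: 'transcribing' | 'postProcessing' }
  | { kind: 'recovering'; attempt: number; lastError: string }
  | { kind: 'tearingDown' };

function canStart(state: RecognitionState): boolean {
  // Explicit states make "retry while recovering" vs. "reject while
  // tearing down" unambiguous, unlike a single isTearingDown flag
  return state.kind === 'idle' || state.kind === 'recovering';
}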

Native APIs the Library Doesn't Even Expose

The @react-native-voice/voice library, while providing essential speech recognition functionality, abstracts away several powerful features available in the native speech recognition APIs. This simplification, while aiding common use cases, means that developers lose access to sophisticated recognition controls that could be crucial for specialized applications.

On iOS, the Speech framework's SFSpeechRecognitionRequest offers advanced configuration options that remain inaccessible through the React Native interface:

SFSpeechAudioBufferRecognitionRequest *request = [[SFSpeechAudioBufferRecognitionRequest alloc] init];
request.taskHint = SFSpeechRecognitionTaskHintDictation;   // optimize for dictation-style speech
request.contextualStrings = @[@"custom", @"vocabulary"];   // bias recognition toward domain-specific terms
request.requiresOnDeviceRecognition = YES;                 // keep recognition on the device (iOS 13+)

These hidden capabilities include optimization hints for different recognition scenarios (like dictation vs. search), custom vocabulary injection, and control over on-device recognition. The ability to provide contextual strings could significantly improve recognition accuracy for domain-specific applications, while task hints could optimize the recognition engine for specific use cases.

Similarly, Android's SpeechRecognizer exposes detailed configuration options through its Intent system:

// Configure recognition through Intent extras
Intent recognizerIntent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
// Free-form language model, as opposed to LANGUAGE_MODEL_WEB_SEARCH
recognizerIntent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
// Ask for up to five alternative hypotheses instead of a single best guess
recognizerIntent.putExtra(RecognizerIntent.EXTRA_MAX_RESULTS, 5);

These settings allow for fine-tuned control over language models, multiple recognition hypotheses, and confidence thresholds—features that could be valuable for applications requiring more precise control over the recognition process.
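
A custom module could forward those extra hypotheses to JavaScript. The payload below is hypothetical (react-native-voice only delivers the transcription strings), but on Android the confidence values would come from the real SpeechRecognizer.CONFIDENCE_SCORES results bundle:

// Hypothetical payload for a richer results event
interface RecognitionHypothesis {
  transcript: string;
  confidence: number; // would map to SpeechRecognizer.CONFIDENCE_SCORES
}

interface DetailedResultsEvent {
  hypotheses: RecognitionHypothesis[]; // up to EXTRA_MAX_RESULTS entries, best first
}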

Other Missing Features

  • Recognition Quality and Performance Controls.
  • Advanced Event Handling.
  • Recognition Metadata and Analytics.
  • Error Handling and Diagnostics.

Applications requiring precise confidence scoring, detailed analytics about recognition performance, sophisticated error handling, fine-grained control over recognition quality, or detailed timing information may find these features essential.

The absence of these features in the react-native-voice interface means developers may need to create custom native modules to access this functionality when building professional-grade speech recognition applications.

When to Consider Using TheWidlarzGroup Consultants

While @react-native-voice/voice provides a solid foundation for basic voice recognition features, the library's simplification of native APIs creates significant limitations for professional and specialized applications. The hidden capabilities in both iOS and Android platforms become crucial when building enterprise-grade speech recognition solutions.

Key Missing Capabilities

  • Recognition confidence scoring and multiple hypotheses
  • Detailed timing and segmentation information
  • Advanced error diagnostics and recovery mechanisms
  • Custom vocabulary and contextual optimization
  • On-device vs. cloud recognition control

Professional customization becomes essential when your application requires precise control over the recognition process.

Why Choose TheWidlarzGroup

Real-time translation services require not just accurate speech recognition but also detailed confidence scores and multiple recognition hypotheses to ensure translation quality. These applications often need custom vocabularies and contextual optimization—features available in native APIs but inaccessible through the React Native interface.

Use Case: Advanced Error Handling in Speech Recognition

The Challenge

A real-time translation application was experiencing intermittent recognition issues with the @react-native-voice/voice library. The standard error handling only provided basic error states, making it challenging to:

  • Identify the specific causes of recognition failures
  • Implement appropriate recovery strategies
  • Maintain detailed error logs for quality monitoring
  • Provide feedback to users when translations didn't process correctly

Native Capabilities Implementation

To improve reliability, we developed a custom native module that exposes detailed error information from the underlying APIs:

// iOS Custom Error Handler
@implementation AdvancedVoiceRecognition
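
// Assumes this class subclasses RCTEventEmitter (so sendEventWithName: is
// available) and that this handler is invoked from the recognition task's
// completion path.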

- (void)speechRecognitionTask:(SFSpeechRecognitionTask *)task didFinishWithError:(NSError *)error {
    NSMutableDictionary *diagnosticInfo = [NSMutableDictionary dictionary];

    // Check if an error occurred during speech recognition
    if (error) {
        diagnosticInfo[@"errorDomain"] = error.domain;
        diagnosticInfo[@"errorCode"] = @(error.code);
        diagnosticInfo[@"errorDescription"] = error.localizedDescription;
    }

    // Add additional information regardless of whether an error occurred
    diagnosticInfo[@"audioSessionStatus"] = [self getAudioSessionStatus];
    diagnosticInfo[@"deviceSpecificInfo"] = [self collectDeviceMetrics];

    // Collect recognition metrics; SFTranscription exposes confidence and
    // timing per segment, so we aggregate them ourselves
    NSMutableDictionary *recognitionMetrics = [NSMutableDictionary dictionary];
    if (task.result) {
        NSArray<SFTranscriptionSegment *> *segments = task.result.bestTranscription.segments;
        double confidenceSum = 0;
        NSTimeInterval duration = 0;
        for (SFTranscriptionSegment *segment in segments) {
            confidenceSum += segment.confidence;
            duration = segment.timestamp + segment.duration; // end of the last segment
        }
        recognitionMetrics[@"averageConfidence"] = @(segments.count > 0 ? confidenceSum / segments.count : 0);
        recognitionMetrics[@"recognitionDuration"] = @(duration);
    }
    recognitionMetrics[@"audioLevel"] = @(self.currentAudioLevel);

    diagnosticInfo[@"recognitionMetrics"] = recognitionMetrics;

    // Send an event with the diagnostic information
    [self sendEventWithName:@"onDetailedError" body:diagnosticInfo];
}
@end

This improved error handler allows the app to handle a wider range of scenarios, including:

  • Collecting detailed diagnostic information even when no error occurs
  • Providing more granular feedback to the React Native layer
  • Enabling automated recovery from common recognition issues
  • Optimizing performance based on error patterns and recognition metrics

By exposing these native API capabilities, we transformed basic speech recognition into a more robust, production-grade implementation suitable for applications like real-time translation.
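
On the JavaScript side, consuming the diagnostics is straightforward. A short sketch, assuming the module above is registered under the name AdvancedVoiceRecognition and emits the onDetailedError event shown earlier:

import { NativeEventEmitter, NativeModules } from 'react-native';

// The custom module sketched above, assumed registered under this name
const emitter = new NativeEventEmitter(NativeModules.AdvancedVoiceRecognition);

const subscription = emitter.addListener('onDetailedError', (diagnosticInfo) => {
  const { errorCode, recognitionMetrics } = diagnosticInfo;
  if (recognitionMetrics?.averageConfidence < 0.5) {
    // e.g. ask the user to repeat rather than forwarding a low-confidence
    // transcription to the translation backend
  }
  console.log('Recognition diagnostics:', errorCode, recognitionMetrics);
});

// Later, when the screen unmounts:
// subscription.remove();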

Partner with TheWidlarzGroup for Advanced Voice Solutions

TheWidlarzGroup specializes in bridging this gap by developing custom native modules that expose these powerful capabilities. Our team can implement:

  • Sophisticated Recognition Analytics Systems: Gain insights into your application's performance.
  • Advanced Error Handling with Rich Diagnostic Information: Quickly identify and resolve issues.
  • Custom Vocabulary Optimization Solutions: Improve recognition accuracy for domain-specific terms.
  • Real-Time Performance Monitoring Tools: Keep your application running smoothly under all conditions.

The decision to invest in custom development becomes particularly compelling when recognition accuracy directly impacts your business outcomes. Whether you're building a real-time translation service or a specialized voice command interface, TheWidlarzGroup can help unlock the full potential of native speech recognition capabilities within your React Native application.
