Amazon Polly is a service that turns text into lifelike speech. It enables a whole class of applications that can convert text to speech in multiple languages.
This service can be used by chatbots, audiobooks, and other text-to-speech applications in conjunction with other AWS AI or machine learning (ML) services. For example, Amazon Lex and Amazon Polly can be combined to create a chatbot that engages in a two-way conversation with a user and performs certain tasks based on the user's commands. Amazon Transcribe, Amazon Translate, and Amazon Polly can be combined to transcribe speech in a source language, translate it into a different language, and speak it.
In this post, we present an interesting approach to highlighting text as it is spoken using Amazon Polly. This solution can be used in many text-to-speech applications to perform the following actions:
- Add visual capabilities to audiobooks, websites, and blogs
- Increase comprehension when users read along as the text is spoken
Our solution gives the client (the browser, in this example) the ability to know what text (word or sentence) Amazon Polly is speaking at any given moment. This allows the client to dynamically highlight text as it is spoken. Such a capability is useful for providing visual speech support for the use cases mentioned above.
Our solution can be extended to perform additional tasks beyond text highlighting. For example, the browser can display images, play music, or perform other animations while the front-end speaks text. This capability is useful for creating dynamic audiobooks, educational content, and richer text-to-speech applications.
Solution overview
At its core, the solution uses Amazon Polly to convert a string of text into speech. Text can be entered from a browser or via an API call to an exposed endpoint in our solution. Speech generated by Amazon Polly is stored as an audio file (MP3 format) in an Amazon Simple Storage Service (Amazon S3) bucket.
However, from the audio file alone, the browser can't determine which parts of the text are being spoken at any given moment, because there is no granular information about when each word is spoken.
Amazon Polly offers a way to do this using speech marks. Speech marks are stored in a text file that records the time (measured in milliseconds from the start of the audio) at which each word or sentence is spoken.
Amazon Polly returns speech mark objects in a line-delimited JSON stream. A speech mark object contains the following fields:
- time – Timestamp in milliseconds from the beginning of the corresponding audio stream
- type – Type of speech mark (sentence, word, viseme, or SSML)
- start – Offset in bytes (not characters) of the start of the object in the input text (not including viseme marks)
- end – Offset in bytes (not characters) of the end of the object in the input text (not including viseme marks)
- value – Varies depending on the type of speech mark:
- SSML – The SSML tag
- viseme – The viseme name
- word or sentence – A substring of the input text, delimited by the start and end fields
For example, the sentence “Mary had a little lamb” can give you the following speech marks file if you use SpeechMarkTypes = [“word”, “sentence”] when requesting the speech marks in the API call:
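For illustration, the resulting speech marks file might look like the lines below; the values for “had” match the discussion that follows, while the other timings are illustrative, not actual Amazon Polly output. A small Python sketch that parses the line-delimited stream:

```python
import json

# Illustrative speech marks for "Mary had a little lamb"; only the
# "had" entry's values are taken from the discussion in this post.
SPEECH_MARKS = """\
{"time":0,"type":"sentence","start":0,"end":22,"value":"Mary had a little lamb"}
{"time":6,"type":"word","start":0,"end":4,"value":"Mary"}
{"time":373,"type":"word","start":5,"end":8,"value":"had"}
{"time":604,"type":"word","start":9,"end":10,"value":"a"}
{"time":643,"type":"word","start":11,"end":17,"value":"little"}
{"time":882,"type":"word","start":18,"end":22,"value":"lamb"}
"""

# The stream is line-delimited JSON: one speech mark object per line.
marks = [json.loads(line) for line in SPEECH_MARKS.splitlines()]
words = [m for m in marks if m["type"] == "word"]
```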
The word “had” (at the end of line 3) begins 373 milliseconds after the start of the audio stream, starts at byte 5, and ends at byte 8 of the input text.
Architecture overview
The architecture of our solution is presented in the following diagram.
Our website is stored on Amazon S3 as static files (JavaScript, HTML), hosted via Amazon CloudFront (1), and served to the end user's browser (2).
When a user enters text in the browser through a simple HTML form, it is processed by JavaScript in the browser, which invokes the API (3) via Amazon API Gateway to call an AWS Lambda function (4). The Lambda function calls Amazon Polly (5) to generate the speech (audio) and speech marks (JSON) files. Two calls are made to Amazon Polly, one to retrieve the audio file and one to retrieve the speech marks file, using asynchronous functions. The output of these calls is stored in Amazon S3 (6a). To reduce the chance of multiple users overwriting each other's files in the S3 bucket, the files are stored in a folder with a timestamp. For a production release, more robust approaches can be used to separate users' files based on user ID, timestamp, and other unique characteristics.
The Lambda function generates presigned URLs for the speech and speech marks files and returns them to the browser as an array (7, 8, 9).
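A minimal sketch of this step, assuming a Python Lambda with Boto3; the bucket and key names are illustrative, not the repo's actual code:

```python
# Sketch of the presigned-URL step; key layout and names are illustrative.
from datetime import datetime, timezone

def build_keys(prefix=None):
    """Build timestamped S3 keys so concurrent users don't collide."""
    prefix = prefix or datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S%f")
    return {"audio": f"{prefix}/speech.mp3", "marks": f"{prefix}/speech.marks"}

def presign(bucket, keys, expires=3600):
    """Return presigned GET URLs for the audio and speech marks objects."""
    import boto3  # deferred import so build_keys is usable without AWS
    s3 = boto3.client("s3")
    return [
        s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": bucket, "Key": key},
            ExpiresIn=expires,
        )
        for key in (keys["audio"], keys["marks"])
    ]
```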
When the browser sends the text to the API endpoint (3), it receives back two presigned URLs, one for the audio file and one for the speech marks file, in a single asynchronous call (9). This is indicated by the wrench symbol next to the arrow.
A JavaScript function in the browser fetches the speech marks and audio files from their presigned URLs (10) and sets up an audio player to play the audio (the HTML audio tag is used for this purpose).
When the user presses the play button, the browser parses the speech marks retrieved earlier to create a series of timed events using timeouts. The events invoke a callback, another JavaScript function, that highlights the spoken text in the browser. Simultaneously, the JavaScript function streams the audio file from its URL.
The result is that the events fire at the appropriate moments to highlight the text as it is spoken while the audio plays. The use of JavaScript timeouts keeps the audio synchronized with the highlighted text.
Prerequisites
To run this solution, you need an AWS account with an AWS Identity and Access Management (IAM) user who has permission to use Amazon CloudFront, Amazon API Gateway, Amazon Polly, Amazon S3, AWS Lambda, and AWS Step Functions.
Use Lambda to create speech and speech marks
The following code invokes the Amazon Polly synthesize_speech function twice to fetch the audio and speech marks files. The calls run as asynchronous functions and are coordinated to return the result at the same time using promises.
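A minimal sketch of these two calls, assuming a Python Lambda with Boto3; the voice and helper names are illustrative, and a thread pool stands in for the promise-based coordination described above:

```python
# Sketch of the two Amazon Polly calls; VoiceId and helper names are illustrative.
from concurrent.futures import ThreadPoolExecutor

def build_polly_requests(text, voice_id="Joanna"):
    """One request for the audio (MP3), one for the speech marks (JSON)."""
    common = {"Text": text, "VoiceId": voice_id}
    audio_req = {**common, "OutputFormat": "mp3"}
    marks_req = {
        **common,
        "OutputFormat": "json",
        "SpeechMarkTypes": ["word", "sentence"],
    }
    return audio_req, marks_req

def synthesize_both(text):
    """Run the two synthesize_speech calls concurrently and return both."""
    import boto3  # deferred import so build_polly_requests is testable without AWS
    polly = boto3.client("polly")
    audio_req, marks_req = build_polly_requests(text)
    with ThreadPoolExecutor(max_workers=2) as pool:
        audio = pool.submit(polly.synthesize_speech, **audio_req)
        marks = pool.submit(polly.synthesize_speech, **marks_req)
        return audio.result(), marks.result()
```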
On the JavaScript side, text highlighting is done by highlighter(start, end, word), and the timed events are set by setTimers():
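These two browser-side functions might look like the following sketch; the element ID and exact signatures are illustrative, not the repo's actual code:

```javascript
// Speech marks arrive as line-delimited JSON: one object per line.
function parseSpeechMarks(text) {
  return text.trim().split("\n").map((line) => JSON.parse(line));
}

// Illustrative highlighter: wraps the currently spoken word in a <mark> tag
// inside a hypothetical element with ID "text".
function highlighter(start, end, word) {
  const el = document.getElementById("text");
  const full = el.textContent;
  el.innerHTML =
    full.slice(0, start) + "<mark>" + word + "</mark>" + full.slice(end);
}

// Schedule one timeout per word mark; each fires at the mark's
// offset (in milliseconds) from the start of the audio.
function setTimers(marks, onWord) {
  marks
    .filter((m) => m.type === "word")
    .forEach((m) => setTimeout(() => onWord(m.start, m.end, m.value), m.time));
}
```

In use, the browser would call `setTimers(parseSpeechMarks(marksText), highlighter)` at the same moment it starts audio playback, so the timeouts line up with the audio timeline.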
Alternative approaches
Instead of the previous approach, you can consider several alternatives:
- Create both the speech marks and audio files in a Step Functions state machine. The state machine can use a Parallel state with two branches that call two different Lambda functions: one to generate speech and one to generate speech marks. The code for this approach can be found in the use-step-functions subfolder of the GitHub repo.
- Call Amazon Polly asynchronously to generate the audio and speech marks. This approach can be used if the text content is large or the user doesn't need a real-time response. For more information about creating long audio files, see Creating long audio files.
- Have Amazon Polly generate a presigned URL directly using the generate_presigned_url call on the Amazon Polly client in Boto3. With this approach, Amazon Polly generates the audio and speech marks anew every time. In our current approach, we store these files in Amazon S3. Although these stored files aren't accessible from the browser in our version of the code, the code can be modified to play previously generated audio files by fetching them from Amazon S3 (instead of regenerating the audio for the text using Amazon Polly). We have more code examples for accessing Amazon Polly with Python in the AWS Code Library.
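For the Step Functions alternative above, the parallel branching could be sketched in the Amazon States Language along these lines; the state names and function ARNs are illustrative, not the repo's actual definitions:

```json
{
  "StartAt": "GenerateSpeechAssets",
  "States": {
    "GenerateSpeechAssets": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "GenerateAudio",
          "States": {
            "GenerateAudio": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:...:function:generate-audio",
              "End": true
            }
          }
        },
        {
          "StartAt": "GenerateSpeechMarks",
          "States": {
            "GenerateSpeechMarks": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:...:function:generate-speech-marks",
              "End": true
            }
          }
        }
      ],
      "End": true
    }
  }
}
```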
Build the solution
The entire solution is available in our GitHub repo. To create this solution in your account, follow the instructions in the README.md file. The solution includes an AWS CloudFormation template to provision your resources.
Clean up
To clean up the resources created in this demo, follow these steps:
- Delete the S3 buckets created to store the CloudFormation template (Bucket A), the source code (Bucket B), and the website (pth-cf-text-highlighter-website-[Suffix]).
- Delete the CloudFormation stack pth-cf.
- Delete the S3 bucket containing the speech files (pth-speech-[Suffix]). This bucket is created by the CloudFormation template to store the audio and speech marks files generated by Amazon Polly.
Summary
In this post, we showed an example of a solution that can highlight text as it is spoken using Amazon Polly. It was built using the speech marks feature of Amazon Polly, which provides markers for where each word or sentence starts in an audio file.
The solution is available as a CloudFormation template. It can be deployed as is in any web application that performs text-to-speech conversion. This would be useful for adding visual capabilities to audio in books, lip-syncing avatars (using viseme speech marks), websites and blogs, and for assisting people with hearing impairments.
It can be extended to perform additional tasks beyond text highlighting. For example, the browser can display images, play music, and perform other animations while the front-end speaks text. This capability can be useful for creating dynamic audiobooks, educational content, and richer text-to-speech applications.
You are welcome to try this solution and learn more about the corresponding AWS services from the following links. You can extend the functionality for your specific needs.
About the author
Varad G Varadarajan is a trusted advisor and field CTO for digital native business (DNB) customers at AWS. He helps them design and build innovative solutions at scale using AWS products and services. Varad's areas of interest are IT strategy consulting, architecture, and product management. Outside of work, Varad enjoys creative writing, watching movies with family and friends, and traveling.