Tools such as Apple’s Siri, Amazon’s Alexa, and Google Home have brought text-to-speech capabilities to the masses. These conversational assistants respond to natural-language requests and reply in kind. They use machine-learning models trained on large amounts of recorded speech samples matched with the corresponding text. When the assistant wants to say something, the model is able to build new utterances that sound natural.
These big tech companies also provide APIs to access such capabilities, and those support many languages. To use them, you have to send the text to their server and receive the generated speech back. That’s often fine, but not when the text contains someone’s personal data. In the EU, GDPR requires that such data be handled with care, and in particular restricts its transfer outside the EU. Using a third-party API from a transnational company cannot provide the required transparency.
Voxo AB (https://www.voxo.ai/) is a Stockholm-based startup that specializes in extracting, analysing, and visualising voice data. Their services are used in multiple industries to provide insights and enable data-driven business development. They are keen to build on their existing voice expertise to enter market sectors that need the capability to synthesize voices speaking Swedish. The generation must not compromise the integrity of the data, which might be personal to a user. Thus, existing programmatic APIs are unsuitable, and Voxo is building its own solution using HPC.
As first-time HPC users, Voxo applied to the pan-European program for introductory industrial HPC access, called SHAPE (https://prace-ri.eu/hpc-access/shape-access/). They were delighted to receive help from ENCCS to write their proposal to build a Swedish-language text-to-speech capability. Ultimately they were awarded 25,000 core hours on the JUWELS Booster cluster. This cluster is housed at the Jülich Supercomputing Centre (https://www.fz-juelich.de/ias/jsc/EN/Home/home_node.html) in Germany and includes over 3700 latest-model NVIDIA A100 GPUs. These will be invaluable for training text-to-speech models for Swedish.
Part of the SHAPE award was the effort of an HPC expert, Dr Mark Abraham of ENCCS (https://enccs.se/). Together, Voxo and Dr Abraham will use machine learning to develop Voxo’s own Swedish-language speech-synthesis model. It will be a key component of a conversational assistant capable of providing information in real time in response to spoken natural-language questions. It will be capable of learning to pronounce jargon relevant to particular domains, such as banking. It will generate audio streams quickly, so that users can enjoy a natural conversation flow, without pauses while long replies are generated. It will be implemented using existing Tacotron and WaveGlow technology, as described in this blog post from NVIDIA: https://developer.nvidia.com/blog/how-to-deploy-real-time-text-to-speech-applications-on-gpus-using-tensorrt/.
Success in this project will mean that planned products such as Voxo’s conversational agent for personal banking can be brought to the Swedish market. This will make personal banking much more accessible, by removing requirements like visiting a bank branch, or having and being able to use a computer or mobile computing device.
For more information on PRACE SHAPE Access visit https://prace-ri.eu/hpc-access/shape-access/.