Text to Speech in Python

Abstract:

In this paper, we discuss the feasibility of using an automated speech-to-speech pipeline to encode voice samples instead of regular voice codecs, in situations that require high data compression under high packet loss. The pipeline uses speech-to-text transcription as the voice encoder and text-to-speech synthesis as the decoder. To analyse its advantages and compare it against a standard PCM A-law codec, we measured the error rate of user-transcribed sentences based on the Semantically Unpredictable Sentences (SUS) test. Some of the PCM speech samples were also played back under simulated poor network conditions, specifically 5% and 10% packet loss, so that the speech-to-text method could be compared with a standard voice codec under degraded conditions. Additionally, we evaluated how similar the transcribed messages were to the ground truth by measuring the Levenshtein distance between the sentences and also between their Double Metaphone phonetic representations. We conclude that it may be feasible to use speech-to-text as a codec, as the results do not show a significant difference between the synthesised voice and the real voice.
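
As an illustration of the similarity measure mentioned above, the following is a minimal sketch of a character-level Levenshtein distance between a ground-truth sentence and a transcription. The function names, the lower-casing and whitespace normalisation, and the division by the longer string's length are assumptions made for this example and are not necessarily the procedure used in the study. The same function could be applied to phonetic encodings of the two sentences, which would require a separate Double Metaphone implementation that is not shown here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_distance(reference: str, transcript: str) -> float:
    """Levenshtein distance scaled to [0, 1] by the longer string's length.

    The normalisation (lower-casing, collapsing whitespace) is an
    assumption of this sketch, not a step taken from the paper.
    """
    ref = " ".join(reference.lower().split())
    hyp = " ".join(transcript.lower().split())
    longest = max(len(ref), len(hyp))
    return levenshtein(ref, hyp) / longest if longest else 0.0
```

A value of 0.0 from `normalized_distance` means the transcription matches the reference after normalisation, and values closer to 1.0 indicate that most characters differ.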