Audio samples from "Requirements and motivations of low-resource speech synthesis for language revitalization"

Paper: Requirements and motivations of low-resource speech synthesis for language revitalization
Authors: Aidan Pine, Dan Wells, Nathan Thanyehténhas Brinklow, Patrick Littell, Korin Richmond
Abstract: This paper describes the motivation and development of speech synthesis systems for the purposes of language revitalization. By building speech synthesis systems for three Indigenous languages spoken in Canada, Kanyen'kéha, Gitksan & SENĆOŦEN, we re-evaluate the question of how much data is required to build low-resource speech synthesis systems featuring state-of-the-art neural models. For example, preliminary results with English data show that a FastSpeech2 model trained with 1 hour of training data can produce speech with comparable naturalness to a Tacotron2 model trained with 10 hours of data. Finally, we motivate future research in evaluation and classroom integration in the field of speech synthesis for language revitalization.

Due to privacy and language community intellectual property concerns, Gitksan, Kanyen'kéha and SENĆOŦEN data are not shared.


None of the phrases below were seen during training.

Tacotron2 vs. FastSpeech2 with limited training data

LJ031-0185 - "From the Presidential airplane, the Vice President telephoned Attorney General Robert F. Kennedy,"

Reference
FastSpeech2 15m
FastSpeech2 30m
FastSpeech2 1hr
FastSpeech2 3hr
FastSpeech2 5hr
FastSpeech2 10hr
FastSpeech2 Full (24hr)
Tacotron2 5hr
Tacotron2 10hr
Tacotron2 Full (24hr)

LJ050-0031 - "that the Secret Service consciously set about the task of inculcating and maintaining the highest standard of excellence and esprit, for all of its personnel."

Reference
FastSpeech2 15m
FastSpeech2 30m
FastSpeech2 1hr
FastSpeech2 3hr
FastSpeech2 5hr
FastSpeech2 10hr
FastSpeech2 Full (24hr)
Tacotron2 5hr
Tacotron2 10hr
Tacotron2 Full (24hr)