-------- Forwarded Message --------
Subject: We may see some really great-sounding TTS in the near future
Date: Sun, 11 Sep 2016 22:01:48 -0700
From: Warren Carr <warc...@gmail.com>
Reply-To: eyes-f...@googlegroups.com
To: eyes-f...@googlegroups.com
I was reading a blog post from DeepMind about WaveNet, and I was blown
away by some of the stuff that they are doing.
I can’t wait to have those voices on our devices!
Here is the extract, followed by the URL to the page. Be sure to head
over to that page and take a listen to some of those voices.
If you don’t want to read while you are on the page, you can simply
press the letter B to jump to the “play button.”
The first ones demonstrate how the current Google TTS sounds, and the
later ones demonstrate the more modern-sounding voices.
There are a couple other languages in there besides U.S. English and
Chinese.
Quote:
This post presents WaveNet, a deep generative model of raw audio
waveforms. We show that WaveNets are able to generate speech which
mimics any human voice and which sounds more natural than the best
existing Text-to-Speech systems, reducing the gap with human performance
by over 50%.
We also demonstrate that the same network can be used to synthesize
other audio signals such as music, and present some striking samples of
automatically generated piano pieces.
Talking Machines
Allowing people to converse with machines is a long-standing dream of
human-computer interaction. The ability of computers to understand
natural speech has been revolutionised in the last few years by the
application of deep neural networks (e.g., Google Voice Search).
However, generating speech with computers, a process usually referred
to as speech synthesis or text-to-speech (TTS), is still largely based
on so-called concatenative TTS, where a very large database of short
speech fragments are recorded from a single speaker and then recombined
to form complete utterances. This makes it difficult to modify the voice
(for example switching to a different speaker, or altering the emphasis
or emotion of their speech) without recording a whole new database.
This has led to a great demand for parametric TTS, where all the
information required to generate the data is stored in the parameters of
the model, and the contents and characteristics of the speech can be
controlled via the inputs to the model. So far, however, parametric TTS
has tended to sound less natural than concatenative, at least for
syllabic languages such as English. Existing parametric models typically
generate audio signals by passing their outputs through signal
processing algorithms known as vocoders.
WaveNet changes this paradigm by directly modelling the raw waveform of
the audio signal, one sample at a time. As well as yielding more
natural-sounding speech, using raw waveforms means that WaveNet can
model any kind of audio, including music.
WaveNets
[Animation: waveform]
Researchers usually avoid modelling raw audio because it ticks so
quickly: typically 16,000 samples per second or more, with important
structure at many time-scales. Building a completely autoregressive
model, in which the prediction for every one of those samples is
influenced by all previous ones (in statistics-speak, each predictive
distribution is conditioned on all previous observations), is clearly a
challenging task.
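As an aside to the quoted text: "each predictive distribution is
conditioned on all previous observations" is usually written as a
chain-rule factorisation. The symbols here are my own labels, not taken
from the post: x_t is the t-th audio sample and T is the total number of
samples.

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
```

At 16,000 samples per second, a single second of audio already means
T = 16,000 factors in that product.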
However, our PixelRNN and PixelCNN models, published earlier this year,
showed that it was possible to generate complex natural images not only
one pixel at a time, but one colour-channel at a time, requiring
thousands of predictions per image. This inspired us to adapt our
two-dimensional PixelNets to a one-dimensional WaveNet.
[Animation: WaveNet architecture]
The above animation shows how a WaveNet is structured. It is a fully
convolutional neural network, where the convolutional layers have
various dilation factors that allow its receptive field to grow
exponentially with depth and cover thousands of timesteps.
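As an aside to the quoted text, here is a rough Python sketch of why
dilation makes the receptive field grow so quickly. The kernel size of 2
and the doubling dilation pattern are assumptions for illustration; the
post itself only says "various dilation factors".

```python
def receptive_field(kernel_size, dilations):
    """Number of input timesteps that can influence one output timestep
    in a stack of dilated convolutions with stride 1."""
    # Each layer widens the field by (kernel_size - 1) * dilation.
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Assumed setup (not stated in the post): kernel size 2, with the
# dilation doubling at every layer.
KERNEL_SIZE = 2
doubling = [2 ** i for i in range(10)]   # 1, 2, 4, ..., 512
undilated = [1] * 10                     # same depth, no dilation

print(receptive_field(KERNEL_SIZE, doubling))       # 1024
print(receptive_field(KERNEL_SIZE, undilated))      # 11
print(receptive_field(KERNEL_SIZE, doubling * 3))   # 3070
```

With the same ten layers, doubling the dilation covers 1024 timesteps
where plain convolutions cover only 11, and repeating the stack three
times reaches the "thousands of timesteps" range.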
At training time, the input sequences are real waveforms recorded from
human speakers. After training, we can sample the network to generate
synthetic utterances. At each step during sampling a value is drawn from
the probability distribution computed by the network. This value is then
fed back into the input and a new prediction for the next step is made.
Building up samples one step at a time like this is computationally
expensive, but we have found it essential for generating complex,
realistic-sounding audio.
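As an aside to the quoted text, the sampling loop described above can be
sketched in a few lines of Python. This is only an illustration: the
function toy_next_sample_distribution is a hypothetical stand-in for the
trained network, and the 256 amplitude levels are an assumption, not
something the post states.

```python
import math
import random

NUM_LEVELS = 256  # assumed number of discrete amplitude levels


def toy_next_sample_distribution(history):
    """Hypothetical stand-in for the trained network.

    Returns one probability per amplitude level, given the samples
    generated so far. It simply favours levels near the previous sample.
    """
    centre = history[-1] if history else NUM_LEVELS // 2
    weights = [math.exp(-abs(level - centre) / 8.0)
               for level in range(NUM_LEVELS)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, seed=0):
    """Sample one value at a time, feeding each value back as input."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_samples):
        probs = toy_next_sample_distribution(history)
        # Draw the next value from the predicted distribution.
        value = rng.choices(range(NUM_LEVELS), weights=probs, k=1)[0]
        # Feed it back: the next prediction sees this value.
        history.append(value)
    return history


audio = generate(num_samples=160)  # 10 ms at 16,000 samples per second
print(len(audio), audio[:8])
```

The loop cannot be run in parallel across time, because every step
needs the value drawn in the step before it. That is the expense the
post is talking about.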
Improving the State of the Art
We trained WaveNet using some of Google’s TTS datasets so we could
evaluate its performance. The following figure shows the quality of
WaveNets on a scale from 1 to 5, compared with Google’s current best TTS
systems (parametric and concatenative), and with human speech using Mean
Opinion Scores (MOS). MOS are a standard measure for subjective sound
quality tests, and were obtained in blind tests with human subjects
(from over 500 ratings on 100 test sentences). As we can see, WaveNets
reduce the gap between the state of the art and human-level performance
by over 50% for both US English and Mandarin Chinese.
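As an aside to the quoted text, this is how a "reduces the gap by over
50%" figure can be worked out from three MOS values. The figure did not
come through in this email, so the scores below are made up purely to
show the arithmetic; they are not the numbers from the post.

```python
def gap_reduction(baseline_mos, new_mos, human_mos):
    """Fraction of the baseline-to-human MOS gap closed by the new system."""
    gap_before = human_mos - baseline_mos
    if gap_before <= 0:
        raise ValueError("baseline must score below human speech")
    return (new_mos - baseline_mos) / gap_before


# Made-up scores on the 1-to-5 scale, for illustration only.
baseline, new, human = 3.8, 4.2, 4.5
print(f"{gap_reduction(baseline, new, human):.0%} of the gap closed")
```

Note that the percentage is measured against the distance to human
speech, not against the 1-to-5 scale itself, so a fairly small rise in
MOS can close a large share of the gap.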
For both Chinese and English, Google’s current TTS systems are
considered among the best worldwide, so improving on both with a single
model is a major achievement.
Here are some samples from all three systems so you can listen and
compare yourself:
End of quote from:
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
What do you think?
Warren