Speech Warping and Audio Restoration

Matt Montag - EEN 540 Speech Signal Processing - Project 3

MATLAB Files

- proj3.m: project script
- pad.m: utility function to make two vectors match in size
- fftplot.m: utility function
- averageLogSpectrum.m: utility function to compute the average log spectrum

Discussion

Linear predictive coding (LPC) is a signal processing technique that separates signal content using a source-filter model. Applied to speech, LPC estimates the resonances of the vocal tract (the formants), reverses their effect with an inverse filter, and then codes the resulting residual signal. Ideally, the residual is an impulse train that represents the glottal pulses. The vocal tract is modeled by an all-pole filter that mimics the spectral shape of the formants.
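The source-filter split can be illustrated with a minimal sketch. It is written in plain Python rather than the project's MATLAB and is not taken from proj3.m; the function names and the choice of the autocorrelation method with the Levinson-Durbin recursion are assumptions made for illustration.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of one (already windowed) frame."""
    n = len(x)
    return [sum(x[t] * x[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for the inverse filter A(z) = 1 + a1 z^-1 + ... + ap z^-p.

    Returns (a, err): a = [1, a1, ..., ap] and the prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


def inverse_filter(x, a):
    """Residual e[n] = sum_j a[j] * x[n - j] (FIR filtering with A(z))."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]


def synthesize(e, a):
    """All-pole synthesis x[n] = e[n] - sum_{j>=1} a[j] * x[n - j]."""
    x = []
    for n in range(len(e)):
        x.append(e[n] - sum(a[j] * x[n - j]
                            for j in range(1, len(a)) if n - j >= 0))
    return x
```

Passing a frame through inverse_filter and then through synthesize with the same coefficients returns the original frame up to rounding error, which is the property the Perfect Reconstruction samples demonstrate.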

In this project, linear prediction is performed and various manipulations are applied to the residual signal and spectral function. This allows pitch and formants to be scaled independently, which leads to a much better gender conversion result than simple pitch shifting. A simple 200% pitch shift yields a squirrel-like timbre. It is not a physically valid conversion, because it shifts the formants along with the pitch, which amounts to shrinking the vocal tract too much.

I experimented a great deal with the pitch shifting and pole-scaling process and arrived at a tweaked method for gender conversion.

In all cases, the source signal is split into Hamming-windowed frames with 50% overlap.
Female to male:
- Frame size is 94 ms, arrived at experimentally.
- Pitch is scaled to 40% of the original. Peak detection is performed on the error signal to find the glottal impulses. The error signal content centered on each peak is extracted with a Hamming window and pasted into a new signal, with the peaks redistributed so that their spacing is spread out by a factor of 2.5 (1/0.4).
- Pole angles are scaled to 85% of the original, which is like increasing the length of the vocal tract by a factor of about 1.18 (1/0.85).
- Poles at angles greater than 3/4 pi are not scaled. This preserves the high-frequency spectral shape. If poles greater than 3/4 pi are scaled as well, high-frequency content is disproportionately attenuated.
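The peak redistribution used to lower the pitch can be sketched as follows. This is a plain Python illustration, not the project's MATLAB code: the names spread_pulses and half_width are hypothetical, peak detection is assumed to have been done already, and the sketch simply lets the output grow longer, since the text does not say how the frame duration is handled.

```python
import math


def hamming(length):
    """Symmetric Hamming window; a single-sample window is just [1.0]."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def spread_pulses(residual, peaks, stretch, half_width):
    """Overlap-add a windowed grain from each peak at position peak * stretch."""
    window = hamming(2 * half_width + 1)
    out = [0.0] * (int(round(len(residual) * stretch)) + 1)
    for peak in peaks:
        target = int(round(peak * stretch))
        for i, w in enumerate(window):
            src = peak - half_width + i      # sample taken from the original
            dst = target - half_width + i    # where it lands in the new signal
            if 0 <= src < len(residual) and 0 <= dst < len(out):
                out[dst] += w * residual[src]
    return out
```

With stretch set to 2.5, impulses that were 10 samples apart end up 25 samples apart, which corresponds to a pitch of 40% of the original.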
Male to female:
- Frame size is 125 ms, arrived at experimentally.
- Pitch is scaled to 200% of the original. Peak detection is performed to find the glottal impulses. An additional glottal impulse is inserted between each pair of adjacent impulses in the original error signal in order to double the pitch. This additional peak is created by interpolating the peaks on either side.
- Pole angles are scaled to 120% of the original, which is like decreasing the length of the vocal tract by a factor of about 0.83 (1/1.2).
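Raising the pitch by inserting interpolated impulses might look like the sketch below. It is plain Python, not the project's MATLAB code. The helper names are hypothetical, and linear interpolation of Hamming-windowed grains is an assumption, since the text only says the new peak is interpolated from its neighbors.

```python
import math


def hamming(length):
    """Symmetric Hamming window; a single-sample window is just [1.0]."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def windowed_grain(signal, center, half_width):
    """Hamming-windowed segment around center, zero-padded at the edges."""
    grain = []
    for i, w in enumerate(hamming(2 * half_width + 1)):
        idx = center - half_width + i
        sample = signal[idx] if 0 <= idx < len(signal) else 0.0
        grain.append(w * sample)
    return grain


def insert_interpolated_pulses(residual, peaks, extra, half_width):
    """Add `extra` evenly spaced pulses between each pair of adjacent peaks."""
    out = list(residual)
    for left, right in zip(peaks, peaks[1:]):
        g_left = windowed_grain(residual, left, half_width)
        g_right = windowed_grain(residual, right, half_width)
        for m in range(1, extra + 1):
            mix = m / (extra + 1)            # 0 = left peak, 1 = right peak
            center = int(round(left + mix * (right - left)))
            for i in range(len(g_left)):
                idx = center - half_width + i
                if 0 <= idx < len(out):
                    out[idx] += (1.0 - mix) * g_left[i] + mix * g_right[i]
    return out
```

Setting extra to 1 doubles the number of pulses per period, and setting it to 2 triples it, without changing the duration of the signal.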
Male to child:
- Frame size is 125 ms, arrived at experimentally.
- Pitch is scaled to 300% of the original. Peak detection is performed to find the glottal impulses. Two additional glottal impulses are inserted between each pair of adjacent impulses in the original error signal in order to triple the pitch. These additional peaks are created by interpolating the peaks on either side.
- Poles are scaled to 160% of the original, which is like decreasing the length of the vocal tract by a factor of 0.63.
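The pole scaling in the three recipes above can be sketched in plain Python (not the project's MATLAB code). The sketch assumes that "scaling a pole" means scaling its angle while keeping its radius, and that the equivalent vocal tract length factor is the reciprocal of the scale factor. The function names and the clamp at pi are hypothetical additions for this illustration.

```python
import cmath
import math


def scale_pole_angles(poles, factor, max_angle=0.75 * math.pi):
    """Scale the angle of each pole, leaving the radius unchanged.

    Poles whose angle magnitude exceeds max_angle are passed through
    untouched, as in the female-to-male recipe. The clamp at pi is a
    guard added for this sketch; the text does not describe one.
    """
    scaled = []
    for p in poles:
        radius, angle = abs(p), cmath.phase(p)
        if abs(angle) > max_angle:
            scaled.append(p)
            continue
        new_angle = math.copysign(min(abs(angle) * factor, math.pi), angle)
        scaled.append(cmath.rect(radius, new_angle))
    return scaled


def tract_length_factor(factor):
    """Formant frequencies vary inversely with tube length."""
    return 1.0 / factor
```

Because the same rule is applied to the magnitude of the angle, complex-conjugate pole pairs stay conjugate and the rebuilt filter keeps real coefficients. For the three factors used here, tract_length_factor gives about 1.18 for 0.85, about 0.83 for 1.2, and 0.625 for 1.6.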

Speech Gender Conversion

All Audio Samples

Barack Obama	obama.wav	obama_to_female.wav	obama_to_child.wav	obama_to_child_audition.wav
Sarah Palin	palin.wav	palin_to_male.wav	palin_to_female.wav
Katie Couric	couric.wav	couric_to_male.wav	couric_to_male_audition.wav
Al Gore	algore.wav	algore_to_female.wav	algore_to_child.wav
Hip-Hop	turnstiles.wav	turnstiles_to_child.wav	turnstiles_to_child_audition.wav

Files labeled "audition" were processed in Adobe Audition using the time stretch/pitch shift effect and are included for comparison.

Female to Male

Original

female_speech.wav

Converted
female_speech_to_male.wav

Male to Female

Original

male_speech.wav

Converted
male_speech_to_female.wav

Male to Child

Original

male_speech.wav

Converted
male_speech_to_child.wav

Residual Signal

Perfect Reconstruction

Original Signal
female_speech.wav
Error Signal
female_speech_error.wav
Reproduced Signal
female_speech_reproduced.wav

Stylized Signal

One Sample Per Period

Two Samples Per Period

Four Samples Per Period

Audio Restoration

Original Signal

caruso.wav

Restored Signal
caruso_restored.wav

Discussion

Homomorphic signal processing can be used to separate two convolved signals. In this case, the original signal, a recording of Caruso singing, is convolved with the poor frequency response of the transducer used in recording. In homomorphic processing, components that have been convolved are converted into components that are added by taking the Fourier transform followed by the logarithm. After linear filtering to separate the added components, the original steps are undone. Here, we take the log of the FFT of the Caruso recording and the log of the FFT of a modern Pavarotti recording, take their difference (reference minus Caruso, so that the filter boosts what the old recording lacks), and then exponentiate and IFFT the result, giving us a correction filter. When the Caruso recording is processed with this filter, its spectral content more closely matches the reference Pavarotti recording.
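The log-spectrum subtraction can be illustrated with a small plain Python sketch. It is not the project's MATLAB code and it simplifies the procedure: it uses a naive DFT on short lists, works on a single frame instead of an average log spectrum, and returns a per-bin magnitude gain rather than an inverse-transformed filter. The function names and the floor parameter are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform, fine for short examples."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def log_magnitude(x, floor=1e-12):
    """Log magnitude spectrum; the floor avoids log(0) in empty bins."""
    return [math.log(max(abs(c), floor)) for c in dft(x)]


def correction_gain(reference, degraded):
    """Per-bin gain exp(log|R| - log|D|) that maps |D| onto |R|."""
    log_ref = log_magnitude(reference)
    log_deg = log_magnitude(degraded)
    return [math.exp(r - d) for r, d in zip(log_ref, log_deg)]
```

If the degraded signal is the reference circularly convolved with some response h, the gain in each bin comes out as 1/|H|, the inverse of that response's magnitude. In practice the two recordings are different performances, so the result only approximates the transducer correction.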

The processing technique is not perfect, since it also amplifies the noisy part of the signal. A more advanced processing technique could be employed to isolate the singing voice from the noise.
