Matt Montag - EEN 540 Speech Signal Processing - Project 3
Linear predictive coding (LPC) is a signal processing technique that separates a signal into components using a source-filter model. Applied to speech, LPC estimates the resonances of the vocal tract (the formants), removes their effect with an inverse filter, and then codes the resulting residual signal. For voiced speech, the residual is ideally an impulse train that represents the glottal pulses. The vocal tract is modeled as an all-pole filter whose frequency response follows the spectral envelope shaped by the formants.
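The analysis and synthesis steps described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the code used for the project: it assumes the autocorrelation method with the Levinson-Durbin recursion, and the function names (`lpc`, `inverse_filter`, `synthesize`), the LPC order, and the synthetic test frame are all assumptions made for the example. A real analysis would usually window short overlapping frames of speech first.

```python
import numpy as np

def lpc(frame, order):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.

    Returns a = [1, a1, ..., ap], the coefficients of the inverse filter A(z).
    """
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                                   # prediction error energy
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def inverse_filter(x, a):
    """FIR filtering with A(z): produces the residual (error) signal."""
    return np.convolve(x, a)[:len(x)]

def synthesize(residual, a):
    """All-pole filtering with 1/A(z): rebuilds the signal from the residual."""
    p = len(a) - 1
    y = np.zeros(len(residual))
    for n in range(len(residual)):
        acc = residual[n]
        for j in range(1, min(p, n) + 1):
            acc -= a[j] * y[n - j]
        y[n] = acc
    return y

# Synthetic stand-in for one voiced frame (an assumption, not project data):
# a 100 Hz pulse train driving a single 700 Hz resonance, plus a little noise.
fs = 8000
excitation = np.zeros(800)
excitation[::80] = 1.0
radius, theta = 0.97, 2 * np.pi * 700 / fs
true_a = np.array([1.0, -2 * radius * np.cos(theta), radius ** 2])
rng = np.random.default_rng(0)
frame = synthesize(excitation, true_a) + 0.01 * rng.standard_normal(800)

a = lpc(frame, order=8)
residual = inverse_filter(frame, a)      # analogous to the "error" signal
reproduced = synthesize(residual, a)     # analogous to the "reproduced" signal
```

Because the synthesis filter is the exact inverse of the analysis filter, passing the unmodified residual back through it reproduces the frame; the interesting results come from altering the residual or the filter in between.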
In this project, linear prediction is performed and various manipulations are applied to the residual signal and to the all-pole spectral envelope. This allows pitch and formants to be scaled independently, which gives a much better gender conversion result than simple pitch shifting. A simple 200% pitch shift yields a squirrel-like timbre. It is not a physically valid conversion, because it shifts the formants along with the pitch, which scales the implied vocal tract too much.
I experimented a great deal with the pitch shifting and pole-scaling process and arrived at a tweaked method for gender conversion.
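One plausible reading of "pole scaling" is sketched below: the poles of the all-pole filter are found, the angle of each complex pole is multiplied by a factor, and the polynomial is rebuilt. This is only an illustration under that assumption and is not the tweaked method mentioned above. The names `scale_formants` and `formant_freqs`, the 8 kHz sample rate, and the single-resonance example filter are all hypothetical.

```python
import numpy as np

def scale_formants(a, factor):
    """Shift formants by scaling the angles of the complex poles of 1/A(z).

    Pole radii (and so, roughly, bandwidths) are left unchanged.
    Real poles are not moved, so the rebuilt polynomial stays real.
    """
    poles = np.roots(a)
    is_complex = np.abs(poles.imag) > 1e-12
    radius = np.abs(poles)
    angle = np.angle(poles)
    # Keep scaled angles inside [-pi, pi] so they stay below Nyquist.
    scaled = np.clip(angle * factor, -np.pi, np.pi)
    new_angle = np.where(is_complex, scaled, angle)
    new_poles = radius * np.exp(1j * new_angle)
    # Conjugate pairs remain conjugate, so any imaginary part is round-off.
    return a[0] * np.real(np.poly(new_poles))

def formant_freqs(a, fs):
    """Frequencies in Hz of the upper-half-plane poles, sorted ascending."""
    poles = np.roots(a)
    upper = poles[poles.imag > 0]
    return np.sort(np.angle(upper)) * fs / (2 * np.pi)

# Hypothetical example: one resonance at 500 Hz, pole radius 0.95, fs = 8 kHz.
fs = 8000
radius, theta = 0.95, 2 * np.pi * 500 / fs
a_demo = np.array([1.0, -2 * radius * np.cos(theta), radius ** 2])
shifted = scale_formants(a_demo, 1.2)    # resonance moves to about 600 Hz
```

Factors below 1 lower the formants and factors above 1 raise them. Filtering a separately pitch-shifted residual through the modified filter is what lets pitch and formants be changed by different amounts.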
Files labeled "audition" were processed in Adobe Audition using the time stretch/pitch shift effect and are included for comparison.
Converted: female_speech_to_male.wav
Converted: male_speech_to_female.wav
Converted: male_speech_to_child.wav
Error Signal: female_speech_error.wav
Reproduced Signal: female_speech_reproduced.wav
Restored Signal: caruso_restored.wav
Homomorphic signal processing can be used to separate two convolved signals. In this case, the original signal, a recording of Caruso singing, is convolved with the poor response of the transducer used in recording. In homomorphic processing, components that have been convolved are converted into components that are added by taking the Fourier transform followed by the logarithm. After linear filtering to separate the added components, the original steps are undone. Here, we take the log of the FFT of a modern Pavarotti recording and subtract it from the log of the FFT of the Caruso recording, which gives an estimate of the transducer's log response. Negating that difference, exponentiating, and taking the inverse FFT gives a correction filter. When the Caruso recording is processed with this filter, the resulting spectral content more closely matches the reference Pavarotti recording.
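The log-spectrum subtraction described above can be sketched as follows. This is a rough illustration rather than the processing actually used: it assumes log-magnitude spectra averaged over Hann-windowed frames, a linear-phase FIR correction filter, and white noise passed through a made-up two-tap filter as a stand-in for the two recordings. The function names, `nfft`, and the test signals are all assumptions. The sign of the difference is chosen so the filter boosts frequencies where the old recording is weak relative to the reference.

```python
import numpy as np

def avg_log_spectrum(x, nfft=256):
    """Log-magnitude spectrum averaged over half-overlapping Hann frames."""
    hop = nfft // 2
    win = np.hanning(nfft)
    frames = np.array([x[i:i + nfft] * win
                       for i in range(0, len(x) - nfft + 1, hop)])
    mags = np.abs(np.fft.rfft(frames, axis=1))
    return np.mean(np.log(mags + 1e-12), axis=0)   # small offset avoids log(0)

def correction_filter(distorted, reference, nfft=256):
    """FIR filter whose gain is exp(log|reference| - log|distorted|)."""
    log_gain = avg_log_spectrum(reference, nfft) - avg_log_spectrum(distorted, nfft)
    gain = np.exp(log_gain)
    h = np.fft.irfft(gain, n=nfft)      # zero-phase response, centered at n = 0
    return np.roll(h, nfft // 2)        # shift to the middle: linear phase

# Synthetic stand-ins (assumptions): white noise as the "modern" reference,
# and the same noise through a crude two-tap filter as the "old" recording.
rng = np.random.default_rng(1)
reference = rng.standard_normal(8192)
distorted = np.convolve(reference, [1.0, 0.9])[:len(reference)]

h = correction_filter(distorted, reference)
restored = np.convolve(distorted, h, mode="same")
```

With real recordings the two performances differ, so the averaged spectra only match in a long-term sense and the estimate of the transducer response is correspondingly rougher than in this toy case.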
The processing technique is not perfect, since the correction filter also amplifies the noise in the bands it boosts. A more advanced processing technique could be employed to isolate the voice from the noise.