Article - Disclaimer - About |
3. AlgorithmWe will now look at the algorithm we need to implement from a mathematical point of view. 3.1 What needs to be achievedSo now we know we first have to change the signal duration without changing the pitch. The scaling factor is defined as the factor used to stretch or compress the spectrum in order to adjust the frequencies such that the pitch is shifted. Once this is done, we will perform resampling to return to the initial duration with a shift in the pitch. For instance, if we want to shift the pitch up by one semitone, we need a scaling factor of 2(1/12) which equals to 1.0594. This means we first need to stretch the signal without changing the pitch such that the duration is now multiplied by 1.0594. Once this is done, we need to play the signal 1.0594 times faster.
Figure 3.1: How to shift the pitch up by one semitone On the other hand, if we want to shift the pitch down by one semitone, we need a scaling factor of 2(-1/12) which equals to 0.9439. This means we first need to compress the signal without changing the pitch such that the duration is now multiplied by 0.9439. Once this is done, we need to play the signal slower at 0.9439 times the initial speed.
Figure 3.2: How to shift the pitch down by one semitone 3.2 Superposition of framesWe will split our signals in many small frames. These frames are taken from the initial signal such that they overlap each other by 75%. These frames will then be spaced differently in order to stretch or compress in time the output signal.
Figure 3.3: Stretching and compressing the signal frame by frame But now we have another problem. When we space the frames differently, this creates discontinuities. This is shown in the next figure (space between frames is initially of length x and becomes of length y).
Figure 3.4: Signal discontinuities when stretching or compressing in time This will produce glitches and will be perceived by the human ear. We need to do something to make sure there is no discontinuity. That's why we need a phase vocoder. 3.3 Phase vocoder3.3.1 OverviewThe next figure shows how the phase vocoder algorithm works. It consists of 3 stages: analysis, processing and synthesis. I will assume you are familiar with Fast Fourier Transform (FFT), Inverse Fast Fourier Transform (IFFT) and complex numbers. I will have to go through some unpleasant maths in the next subsections. I wish I could avoid this but we really need to go through these equations to understand what the algorithm does.
Figure 3.5: Phase vocoder overview 3.3.2 AnalysisWindowing refers to taking a small frame out of a large signal (you are only looking at your signal through a small window of time). The windowing process alters the spectrum of the signal and this effect should be minimized. In order to reduce the effect of windowing on the signal frequency representation, a Hanning window of size N is used. Why is this? Well multiplying a rectangular window with your signal in the time-domain is equivalent to performing a convolution in the frequency domain. A rectangular window has more energy in the side lobes while a Hanning window focuses most of its energy near DC. This looks more like an impulse which would lead to a perfect reconstruction of the signal.
Figure 3.6: Windowing The frame made of N samples is then transformed with a Fast Fourier Transform (FFT) as shown in equation 3.1.
Equation 3.1: Analysis stage equation
In the previous equation, 3.3.3 Processing
Applying a FFT of length N results into N frequency bins starting from 0 up to
Figure 3.7: Sine waves with different frequencies and phase difference
The phase information of the two successive frames is of significant interest. For the first signal, there is no phase difference between the two frames. However, for the second signal, the phase of the first bin is greater than zero. This implies that the signal component corresponding to this bin is higher than the bin frequency. The phase difference between the two frames is referred to as the phase shift
Equation 3.2: True frequency when the phase shift is unwrapped
The matter is more complex since the phases are wrapped. In this case, the frequency deviation from
the bin is first calculated and is then wrapped. This amount is added to the bin frequency to obtain the
true frequency of the component within the frame. Equations 3.3, 3.4 and 3.5 illustrate this procedure.
The variables
Equation 3.3: Frequency deviation
Equation 3.4: Wrapped frequency deviation
Equation 3.5: True frequency The new phase of each bin can then be calculated by adding the phase shift required to avoid discontinuities. This is done by multiplying the true frequency with the time interval of the synthesis stage as shown in equation 3.6.
Equation 3.6: Phase adjustment for the synthesis stage
The phase of the previous frame for synthesis
Equation 3.7: Amplitude and phase of the spectrum of the synthesis stage 3.3.4 Synthesis
So now we managed to adjust the phase in the frequency domain for our current frame. We need to come back to the time domain. The inverse discrete Fourier transform (IDFT) is performed on each frame spectrum. The result is then windowed with a Hanning window to obtain
Equation 3.8: Synthesis stage equation
Each frame is then overlap-added as shown in equation 3.9. The variable
Equation 3.9: Overlap-add of the synthesis frames Sorry about that, it was really painful I know. The good news is that we’re done with most of the maths now. 3.4 ResamplingHey, good news! We have our signal that is now either stretched or compressed in time and the pitch is not changed. Now we need to resample our signal to get back to the initial duration and shift the pitch. Suppose I have a given sampling rate and I want to double the frequencies. The easiest thing to do is to pick only one sample out of two and to output the result.
Figure 3.8: Resampling at twice the previous rate Now this is easy because when we double the frequency we deal with a scaling factor that is an integer. If the scaling factor is not an integer (for instance 1.0594 when we shift the pitch up by one semitone), then we cannot use the same technique. We will use instead linear interpolation to approximate the sample that should lie at this location. Linear interpolating also behaves like a low-pass filter and thus removes most of the possible aliasing when we downsample. The next figure shows an example with a scaling factor of 1.2.
Figure 3.9: Linear interpolation Okay, we are done with the algorithm!
Copyright © 2009- François Grondin. All Rights Reserved. |