23 February 2007

matlab session on pitch tracking ctd.


I started analyzing the FFT based pitch tracking algorithm described in DAFX pages 337 and further.
First I did the find_pitch_fft.m file:
function [FFTidx, Fp_est, Fp_corr] = find_pitch_fft(x, win, Nfft, Fs, R, fmin, fmax, thres)

%[x,Fs] = wavread(soundFile);
%x = x(1:Nfft+R);
% [FFTidx, Fp_est, Fp_corr] = find_pitch_fft(x, win, Nfft, Fs, R, fmin, fmax, thres)
%
% M-file 9.15
% find_pitch_fft.m
%
% find pitch candidates of given signal block x
% x: input signal of length Nfft+R
% win: window for the FFT
% Nfft: FFT length
% Fs: sampling frequency
% R: FFT hop size
% fmin, fmax: minumum/ maximum pitch freqs to be detected
% thres: omit maxima more than thres dB below the main peak
%
% (c) 2002 Florian Keiler

FFTidx = []; % FFT indices
Fp_est = []; % FFT bin frequencies
Fp_corr = []; % corrected frequencies
dt = R/Fs; % time diff between FFTs
df = Fs/Nfft; % freq resolution
kp_min = round(fmin/df);
kp_max = round(fmax/df);
x1 = x(1:Nfft); % 1st block
x2 = x((1:Nfft)+R); % 2nd block with hop size R
[X1, Phi1] = fftdb(x1.*win,Nfft);
[X2, Phi2] = fftdb(x2.*win,Nfft);
X1 = X1(1:kp_max+1);
Phi1 = Phi1(1:kp_max+1);
X2 = X2(1:kp_max+1);
Phi2 = Phi2(1:kp_max+1);
idx = find_loc_max(X1);
Max = max(X1(idx));
ii = find(X1(idx)-Max>-thres);

%----- omit maxima more than thres dB below the main peak -----
idx = idx(ii);
Nidx = length(idx); % number of detected maxima
maxerr = R/Nfft; % max phase diff error/pi (pitch max. 0.5 bins wrong)
maxerr = maxerr*1.2; % some tolerance
for ii=1:Nidx
k = idx(ii) - 1; % FFT bin with maximum
phi1 = Phi1(k+1); % phase of x1 in [-pi,pi]
phi2_t = phi1 + 2*pi/Nfft*k*R; % expected target phase after hop size R
phi2 = Phi2(k+1); % phase of x2 in [-pi,pi]
phi2_err = princarg(phi2-phi2_t);
phi2_unwrap = phi2_t+phi2_err;
dphi = phi2_unwrap - phi1; % phase diff
if (k>kp_min) & (abs(phi2_err)/pi<maxerr)
Fp_corr = [Fp_corr; dphi/(2*pi*dt)];
FFTidx = [FFTidx; k];
Fp_est = [Fp_est; k*df];
end
end

Problems I encountered:
  • I erased Fs as an input argument, because I can get directly from the wavread command. This posed no problems until I used another function that calls this function. For both functions to use the same Fs, it's necessary to send it as an input argument.
  • When inputting the argument win I must describe it like this:
    find_pitch_fft('La.wav',hann(1024),1024,1,20,15000,40)

Then after that I went on to the Pitch_Tracker_FFT_main.m file.

In the main function only I encountered some problems:
  • I had to correct the first line like this:
    I had to add the apostrophes and the .wav extension
    fname='x1.wav';
  • Next thing was that they used the algorithm optimized for a specific wavfile. To generalize this I had to adjust the n0 and n1 boundaries. I reasoned, what is the most general solution for this, and it would be to set n0 to zero and n1 to the number of samples the given piece contains.
    %these are optimized for x1.wav
    %n0=2000; %start index
    %n1=210000;
    %I'll try it in general
    [X,Fs]=wavread(fname);
    n0=0
    n1=length(X)
    This gave some problems where they do some changes on n0 and n1, because suddenly the new n1 would be bigger than the length of the piece. I solved this by just eliminating this part. The algorithm still works then, so I ask why this part was contained in the algorithm. (OK)
    %Nx=n1-n0+1+R %signal length
    %blocks=floor(Nx/K)
    %Nx=(blocks-1)*K+Nfft+R
    %n1=n0+Nx % new end index
  • But this gives a matrix dimensions exceede problem:
    for b=1:blocks
    x=X((b-1)*K+1+(1:Nfft+R));
    Strange that matlab doesn't seem to complain about the fact that blocks isn't defined
  • These and other adjustments kept creating new problems. I tried it otherwise:
    I followed the original algorithm step-by-step to see what it did.

    Nx = n1 - n0 + 1 + R
    %so it actually makes the chosen part longer
    %in my case, as I want n0 to be 1 en n1 to be the length of the audio sample, it makes Nx greater than the total length of the wave file.
    blocks = floor(Nx/K)
    %As far is I understand this divides the total number of Nx samples in Nx/K blocks, each of length K (floored to the nearest smaller integer), and non-overlapping.
    %so this must be something else than the blocks for the FFT who are overlapping in their length-R.
    %anyone an idea what they're here for then? The algorithm talks about hop size for time resolution for pitch estimation (see lower for an answer (remarks after the algorithm))
    Nx = (blocks-1)*K + Nfft + R
    = ((floor(Nx/K))-1)*K + Nfft + R
    < ((Nx/K+1)-1)*K + Nfft + R
    = Nx + Nfft + R
    = n1 - n0 + 2*R + 1 + Nfft
    %the new end index becomes
    n1 = n0 + Nx
    < n0 + n1 - n0 + 2*R + 1 + Nfft
    = n1 + 2R + 1 + Nfft
    This is the maximum number of samples that are added to the original part of the audio sample, in my favored case this is the max number of samples added to the total audio sample.
    This now gives us the choice:
    • OR we do the wavread here. We must then care for the fact that the original n1 must be smaller than the total length of the audio sample minus 2*R + 1 + Nfft. Because now you will certainly not try to 'wavread' too many samples, samples that aren't there.
    • OR we can zero-pad the wave vector here until it has the new length n1-n0
      This is what I'll do here. It might give an small mistake for the last blocks that comprise these zeros, but there will always be problems at the end of a sample. I think the zeros minimize thiese problems. (I can be wrong)
This brings us to the first part (that works generally):
% M-file 9.18
% pitch_tracker_fft_main.m
%
% main file for a pitch tracker based on FFT with phase vocoder approach
% (c) 2002 Florian Keiler

fname='La.wav';
[X,Fs]=wavread(fname);
n0=1
n1=length(X)

Nfft=1024;
R=1; % FFT hop size for pitch estimation
K=200; % hop size for time resolution of pitch estimation
thres=50; % threshold for FFT maxima
% checked pitch range in Hz:
fmin=50;
fmax=800;
p_fac_thres=1.05; % threshold for voiced detection
% deviation of pitch from mean value

win=hanning(Nfft)'; % window for FFT

Nx=n1-n0+1+R %signal length
blocks=floor(Nx/K)
Nx=(blocks-1)*K+Nfft+R
n1=n0+Nx % new end index
X=X(:,1)';
length_afterCalculatingNewLength = length(X)
for i=length(X):n1 this I changed to:
X(i)=0; X = [X,zeros(1,n1-length(X))];
end
new_length = length(X)

pitches=zeros(1,blocks);
for b=1:blocks
x=X((b-1)*K+1+(1:Nfft+R));
[FFTidx, Fp_est, Fp_corr]=find_pitch_fft(x, win, Nfft, Fs, R, fmin, fmax, thres);
if ~isempty(Fp_corr)
pitches(b)=Fp_corr(1); % take candidate with lowest pitch
else
pitches(b)=0;
end
end
Remarks on parameters:
  • If I enlarge K (1000) the algorithm works much faster, but, as you see, the time resolution is not so good.
  • If I make K smaller, the calculations take very long and the pitch tracking is much more fluent. Sometimes it has short faulty tones. Maybe we can filter these out assuming that the voice cannot change that fast and short. But this would stop us from using the algorithm for instruments...

    K thus sets the nuber of times that the main for-loop is run.
  • A very big (1000) value for R results in long calculations and technically seen a worse pitch estimation, see fig9.27, p337 in DAFX.
  • For input x2.wav to give a meaningful result I set thres on 1 (which is very small compared to the preferred 30-50). This means that we leave the allowed deviation from zero for the phase error very small. I don't really understand why that gives better results here. If I let the thres value have its default value, then we get nothing but two low blobs of sound somewhere in time.
Next time I'll try the time-domain based pitch tracking algorithm from DAFX. For now I'm stopping here.

1 comment:

Anonymous said...

regarding the good results that you get when specifying a very low threshold value:
I see that the thres parameter in the find_pitch_fft function is defined as follows:

% thres: omit maxima more than thres dB below the main peak

which means that when you set thres to a very low value, probably only the main peak in the signal spectrum will be detected as a pitch candidate (e.g., when thres = 1, all peaks in the spectrum which are 1dB or more below the main peak, will be omitted from the selection of pitch candidates). when there is only one candidate, the pitch detection is expected to give good results.
obviously, all this only holds when the main peak in the spectrum indeed corresponds to the pitch frequency. so only for instruments or sounds which have a pitch frequency that is dominant over the (sub-)harmonic frequencies (in terms of amplitude), this technique (with low thres) will give nice results.