String search. Knuth-Morris-Pratt algirithm

Let the text T and the sample S

char @T; // text int nT; // text length char @S; // sample int nS; // sample length

It is necessary to determine whether the sample S occurs in text T and if so, it's position.

Naive solution may be as follows:

int I=0; while I<nT do int J=I; int K=0; while J<nT & K<nS & T[J]=S[K] do inc J; inc K; end if K>=nS then //Found at position I end inc I; end

In the worst case, the the nested loop condition is calculated about nS * nT times. It is interesting that the number of comparisons can be reduced to nT and it is not depends on text. Algorithm was presented in an article "Fast pattern matching in strings" (SIAM J. COMPUT. Vol. 6, No. 2, June 1977) by D.Knuth, J.Mopris and V.Pratt.

Algorithm may seems difficult.

Algorithm are based on the fact that if a partial match found, sample may be shifted by more than one position. And in all cases no need to compare previously matched characters again.

When the sample are shifted by one position, the match are impossible if the start of the shifted sample does not match the non-shifted itself in positions in which the non-shifted sample match with the text. Also, a match are not possible if the shifted sample match with the non-shifted one at the position where the mismatch was found. The same words can be said about shifts by more then one position.

Here's how to do it:

int D[nS+1]; int I=0; int J=0; int K=0; while I<nT do while J<nT & K<nS & T[J]=S[K] do inc J; inc K; end if K>=nS then //Found at position I end if K>0 then I=I+D[K]; K=J-I; else inc I; J =I; end end

Here D is an array of minimal possible shifts that depends only on the sample and can be calculated before the search. A question about array D calculation we postpone and review the text more attentively. At first step variable I can be removed:

int D[nS+1]; int J=0; int K=0; while J-K<nT do while J<nT & K<nS & T[J]=S[K] do inc J; inc K; end if K>=nS then //Found at position J-K end if K>0 then K=K-D[K]; // K=J-(I+D[K])=(J-I)-D[K]=K-D[K] else inc J; end end

Outer loop condition J-K<nT can be changed with J<nT. When J>=nT и J-K<nT nested loop condition is false and K variable can't reach nS:

int D[nS+1]; int J=0; int K=0; while J<nT do while J<nT & K<nS & T[J]=S[K] do inc J; inc K; end if K>=nS then //Found at position J-K end if K>0 then K=K-D[K]; // K=J-(I+D[K])=(J-I)-D[K]=K-D[K] else inc J; end end

Nested loop executed only when T [J] and S [K] are equals, so the following conversion can be done:

int D[nS+1]; int J=0; int K=0; while J<nT do while J<nT & K<nS & T[J]=S[K] do inc J; inc K; end if K>=nS then //Found at position J-K end if J>=nT then exit end while K>=nS | T[J]!=S[K] do if K>0 then K=K-D[K]; else inc J; exit end end end

Let's swap nested loops:

int D[nS+1]; int J=0; int K=0; while J<nT do while K>=nS | T[J]!=S[K] do if K>0 then K=K-D[K]; else inc J; exit end end while J<nT & K<nS & T[J]=S[K] do inc J; inc K; end if K>=nS then //Found at position J-K end if J>=nT then exit end end

Exit when J>=nT duplicates outer loop condition and may be removed. In addition, if K changed with K-D [K] when match found, condition K>=nS in first nested loop becomes always false and may be removed:

int D[nS+1]; int J=0; int K=0; while J<nT do while T[J]!=S[K] do if K>0 then K=K-D[K]; else inc J; exit end end while J<nT & K<nS & T[J]=S[K] do inc J; inc K; end if K>=nS then //Found at position J-K K=K-D[K]; end end

The second nested loop may be replaced by a conditional operator. This is possible, because when its condition remains true, conditions of other two nested operators are false:

int D[nS+1]; int J=0; int K=0; while J<nT do while T[J]!=S[K] do if K>0 then K=K-D[K]; else inc J; exit end end if J<nT & K<nS & T[J]=S[K] then inc J; inc K; end if K>=nS then //Found at position J-K K=K-D[K]; end end

After exiting from remaining nested loop, K is always less than nS. When exiting by T[J]=S[K], then J<nT and next operator condition are true. When exiting by K=0, then J incremented and can be equals to nT, but before increment T[J]!=S[K] and next operator condition are false. I.e. after exiting from nested loop J and K are incremented or only J are incremented. In nested loop J increment may be replaced with K decrement and next condition operator may be removed:

int D[nS+1]; int J=0; int K=0; while J<nT do while T[J]!=S[K] do if K>0 then K=K-D[K]; else dec K; exit end end inc J; inc K; if K>=nS then //Found at position J-K K=K-D[K]; end end

Let's change the remaining nested loop:

int D[nS+1]; int J=0; int K=0; while J<nT do while K>=0 & T[J]!=S[K] do K=K-D[K]; // Let D[0]=1 end inc J; inc K; if K>=nS then //Found at position J-K K=K-D[K]; end end

And use D1[K]=K-D[K]:

int D1[nS+1]; int J=0; int K=0; while J<nT do while K>=0 & T[J]!=S[K] do K=D1[K]; end inc J; inc K; if K>=nS then //Found at position J-K K=D1[K]; end end

It should be noted that the initial algorithm version did not use zero element of the D array and negative values of its elements. The value of D[0]=-1 used only as trick and need to simplify code. But it can be considered as a sample shift by the match length plus one and it makes sense. For example, if a sample AA searched and mismatch of the second character detected, a shift to one character does not have sense. Although the sample match itself (shifted) - from the mismatch of the second character it is follows that the corresponding character of the string is not A.

It is good to slightly change the initial algorithm version and perform similar transformations. Revised initial version may be as follows:

int D1[nS1]; int J=0; int K=0; while J-K<nT do while J<nT & K<nS & T[J]=S[K] do inc J; inc K; end if K>=nS then //Found at position J-K end K=D1[K]; if K<0 then inc J; inc K; end end

And about D (D name used instead of D1) calculation. Naive solution may be as follows:

int D[nS1]; int J = 0; while J<=nS do D [J]=-1; int K =1; while K<=J do int M=K; int N=0; while M<J & S[M]=S[N] do inc M; inc N; end if M=J & (M>=nS | S[M]!=S[N]) then D[J]=N; // =J-K exit end inc K; end inc J; end

This code works slowly, it is not suitable for real use, but it may be used for test writing.

We can try to use the already calculated shifts. This is not a previous algorithmtransformation and still need to prove that the result are the same:

int D[nS+1]; D [0]=-1; int J = 1; int K = 0; while J<=nS do while J<nS & S[J]=S[K] do D [J]=D[K]; inc J; inc K; end D [J]=K; while J<nS & K>=0 & S[J]!=S[K] do K=D[K]; end if K<0 then inc J; K=0; else inc J; inc K; end end

Because negative K in last condition operator are equals -1, assignment K=0 may be replaced with increment and all condition operator may be replaced with two increments:

int D[nS+1]; D [0]=-1; int J = 1; int K = 0; while J<=nS do while J<nS & S[J]=S[K] do D [J]=D[K]; inc J; inc K; end D [J]=K; while J<nS & K>=0 & S[J]!=S[K] do K=D[K]; end inc J; inc K; end

This code larger then in many textbooks, but it works faster and, apparently, is more natural.

To get an equivalent short code, series of similar transformations may be done. First of all, some always true conditions K>=0 are added:

int D[nS+1]; D [0]=-1; int J = 1; int K = 0; do if J>nS then exit end while K>=0 & (J< nS & S[J] =S[K]) do D [J]=D[K]; inc J; inc K; end if K>=0 & (J>=nS | S[J]!=S[K]) then D [J]=K; end while K>=0 & (J< nS & S[J]!=S[K]) do K=D[K]; end inc J; inc K; end

Let's change the index initialization, first pass of the outer loop will only increase them:

int D[nS+1]; D [0]=-1; int J = 0; // 1 -> 0 int K =-1; // 0 -> -1 do if J>nS then exit end while K>=0 & (J< nS & S[J] =S[K]) do D [J]=D[K]; inc J; inc K; end if K>=0 & (J>=nS | S[J]!=S[K]) then D [J]=K; end while K>=0 & (J< nS & S[J]!=S[K]) do K=D[K]; end inc J; inc K; end

Cyclic permutation of blocks inside the outer loop are performed:

int D[nS+1]; D [0]=-1; int J = 0; int K =-1; do while K>=0 & (J< nS & S[J]!=S[K]) do K=D[K]; end inc J; inc K; if J>nS then exit end while K>=0 & (J< nS & S[J] =S[K]) do D [J]=D[K]; inc J; inc K; end if K>=0 & (J>=nS | S[J]!=S[K]) then D [J]=K; end end

Condition J<nS are added to outer loop. When J=nS nothing useful has been done:

int D[nS+1]; D [0]=-1; int J = 0; int K =-1; while J <nS do while K>=0 & (J< nS & S[J]!=S[K]) do K=D[K]; end inc J; inc K; if J>nS then exit end while K>=0 & (J< nS & S[J] =S[K]) do D [J]=D[K]; inc J; inc K; end if K>=0 & (J>=nS | S[J]!=S[K]) then D [J]=K; end end

Always true conditions J<nS, K>=0 and always false condition J>nS are removed:

int D[nS+1]; D [0]=-1; int J = 0; int K =-1; while J <nS do while K>=0 & S[J]!=S[K] do K=D[K]; end inc J; inc K; while J< nS & S[J] =S[K] do D [J]=D[K]; inc J; inc K; end if J>=nS | S[J]!=S[K] then D [J]=K; end end

The second nested loop may be replaced with a conditional operator. When it's condition are true, the conditions of two other nested operators are false and control flow returns to it after incrementing of indexes:

int D[nS+1]; D [0]=-1; int J = 0; int K =-1; while J <nS do while K>=0 & S[J]!=S[K] do K=D[K]; end inc J; inc K; if J< nS & S[J] =S[K] then D [J]=D[K]; end if J>=nS | S[J]!=S[K] then D [J]=K; end end

Last two operators can be combined because their conditions exclude each other:

int D[nS+1]; D [0]=-1; int J = 0; int K =-1; while J <nS do while K>=0 & S[J]!=S[K] do K=D[K]; end inc J; inc K; if J< nS & S[J] =S[K] then D [J]=D[K]; else D [J]=K; end end

Finally, D[nS] calculation are moved from the loop. When only first match are required it may be omitted:

int D[nS+1]; D [0]=-1; int N = nS-1; int J = 0; int K =-1; while J < N do while K>=0 & S[J]!=S[K] do K=D[K]; end inc J; inc K; if S[J]=S[K] then D [J]=D[K]; else D [J]=K; end end while K>=0 & S[J]!=S[K] do K=D[K]; end D[nS]=K+1;

Initial vesion

int D[nS+1]; D [0]=-1; int J = 1; int K = 0; while J<=nS do while J<nS & S[J]=S[K] do D [J]=D[K]; inc J; inc K; end D [J]=K; while J<nS & K>=0 & S[J]!=S[K] do K=D[K]; end inc J; inc K; end

may be similarly transformed to slightly more efficient form. After adding always true condition:

int D[nS+1]; D [0]=-1; int J = 1; int K = 0; while J<=nS do while J< nS & S[J] =S[K] do D [J]=D[K]; inc J; inc K; end if J>=nS | S[J]!=S[K] then D [J]=K; while J<nS & K>=0 & S[J]!=S[K] do K=D[K]; end inc J; inc K; end end

First nested loop may be replaced with condition operator:

int D[nS+1]; D [0]=-1; int J = 1; int K = 0; while J<=nS do if J< nS & S[J] =S[K] then D [J]=D[K]; inc J; inc K; end if J>=nS | S[J]!=S[K] then D [J]=K; while J<nS & K>=0 & S[J]!=S[K] do K=D[K]; end inc J; inc K; end end

Two condition operators may be merged:

int D[nS+1]; D [0]=-1; int J = 1; int K = 0; while J<=nS do if J< nS & S[J] =S[K] then D [J]=D[K]; inc J; inc K; else D [J]=K; while J<nS & K>=0 & S[J]!=S[K] do K=D[K]; end inc J; inc K; end end

D[nS] calculation are moved from the loop, when current characters not equals one transformation of Kindex always done (before entering into nested loop it's condition are true) and always true conditions removed:

int D[nS+1]; D [0]=-1; int J = 1; int K = 0; while J <nS do if S[J] =S[K] then D [J]=D[K]; inc J; inc K; else D [J]=K; K=D[K]; while K>=0 & S[J]!=S[K] do K=D[K]; end inc J; inc K; end end D [J]=K;

This code is the result of fixing an wrong code from Wikipedia.

The proof of the correctness of the algorithm may be done by induction. Assumes that for all J that not exceed some number M two statement are true when D[J] assignment done:

For any K': K<K'<J match of [J-K'..J-1] characters with sample beginning [0..K'-1] are impossible
D[J] value calculated correctly

For M=1 this statements are true.

For details see russian page.

Сайт создан в системе uCoz