Let the text T and the sample S
It is necessary to determine whether the sample S occurs in text T and if so, it's position.
Naive solution may be as follows:
In the worst case, the the nested loop condition is calculated about nS * nT times. It is interesting that the number of comparisons can be reduced to nT and it is not depends on text. Algorithm was presented in an article "Fast pattern matching in strings" (SIAM J. COMPUT. Vol. 6, No. 2, June 1977) by D.Knuth, J.Mopris and V.Pratt.
Algorithm may seems difficult.
Algorithm are based on the fact that if a partial match found, sample may be shifted by more than one position. And in all cases no need to compare previously matched characters again.
When the sample are shifted by one position, the match are impossible if the start of the shifted sample does not match the non-shifted itself in positions in which the non-shifted sample match with the text. Also, a match are not possible if the shifted sample match with the non-shifted one at the position where the mismatch was found. The same words can be said about shifts by more then one position.
Here's how to do it:
Here D is an array of minimal possible shifts that depends only on the sample and can be calculated before the search. A question about array D calculation we postpone and review the text more attentively. At first step variable I can be removed:
Outer loop condition J-K<nT can be changed with J<nT. When J>=nT и J-K<nT nested loop condition is false and K variable can't reach nS:
Nested loop executed only when T [J] and S [K] are equals, so the following conversion can be done:
Let's swap nested loops:
Exit when J>=nT
duplicates outer loop condition and may be removed.
In addition, if K
changed with K-D [K]
when match found, condition
K>=nS
in first nested loop becomes always false and may be removed:
The second nested loop may be replaced by a conditional operator. This is possible, because when its condition remains true, conditions of other two nested operators are false:
After exiting from remaining nested loop, K is always less than nS. When exiting by T[J]=S[K], then J<nT and next operator condition are true. When exiting by K=0, then J incremented and can be equals to nT, but before increment T[J]!=S[K] and next operator condition are false. I.e. after exiting from nested loop J and K are incremented or only J are incremented. In nested loop J increment may be replaced with K decrement and next condition operator may be removed:
Let's change the remaining nested loop:
And use D1[K]=K-D[K]
:
It should be noted that the initial algorithm version did not
use zero element of the D
array and negative values of its elements.
The value of D[0]=-1
used only as trick and need to simplify
code. But it can be considered as a sample shift by the match length plus one
and it makes sense. For example, if a sample AA
searched and
mismatch of the second character detected, a shift to one character
does not have sense. Although the sample match itself (shifted) - from
the mismatch of the second character it is follows that the corresponding
character of the string is not A
.
It is good to slightly change the initial algorithm version and perform similar transformations. Revised initial version may be as follows:
And about D (D name used instead of D1) calculation. Naive solution may be as follows:
This code works slowly, it is not suitable for real use, but it may be used for test writing.
We can try to use the already calculated shifts. This is not a previous algorithmtransformation and still need to prove that the result are the same:
Because negative K
in last condition operator are equals -1
,
assignment K=0
may be replaced with increment and all condition operator
may be replaced with two increments:
This code larger then in many textbooks, but it works faster and, apparently, is more natural.
To get an equivalent short code, series of similar transformations may be done.
First of all, some always true conditions K>=0
are added:
Let's change the index initialization, first pass of the outer loop will only increase them:
Cyclic permutation of blocks inside the outer loop are performed:
Condition J<nS
are added to outer loop.
When J=nS
nothing useful has been done:
Always true conditions J<nS
, K>=0
and always false condition J>nS
are removed:
The second nested loop may be replaced with a conditional operator. When it's condition are true, the conditions of two other nested operators are false and control flow returns to it after incrementing of indexes:
Last two operators can be combined because their conditions exclude each other:
Finally, D[nS]
calculation are moved from the loop.
When only first match are required it may be omitted:
Initial vesion
may be similarly transformed to slightly more efficient form. After adding always true condition:
First nested loop may be replaced with condition operator:
Two condition operators may be merged:
D[nS]
calculation are moved from the loop, when current characters not equals
one transformation of K
index always done (before entering into nested loop it's
condition are true) and always true conditions removed:
This code is the result of fixing an wrong code from Wikipedia.
The proof of the correctness of the algorithm may be done by induction.
Assumes that for all J
that not exceed some number M
two statement are true when D[J]
assignment done:
K'
: K<K'<J
match of [J-K'..J-1]
characters with sample beginning [0..K'-1]
are impossibleD[J]
value calculated correctly
For M=1
this statements are true.
For details see russian page.