The Rabin-Karp algorithm is a linear time substring searching algorithm. Let’s see how it works in theory and in practice.
Problem Statement
Given two strings $S$ and $P$, find whether $P$ is a substring of $S$.
Naive Solution
Before reviewing the idea of the Rabin-Karp algorithm let’s first think about how we can solve the problem in a most simple way. The string $P$ might be found at any position in $S$, so we can check all the positions: iterate over $S$ and for every index $i$ check if $\nobreak S[i:i+|P|-1] = P$.
The pseudocode for this algorithm is going to look something like this:
outer:
for i := range len(S) - len(P) + 1 {
for j := range len(P) {
if S[i+j] != P[j] {
continue outer
}
}
return true
}
return false
The time complexity of this algorithm is $O(|S| \cdot |P|)$.
Rabin-Karp Idea
The inefficiency of the naive approach comes from the fact that we have to compare the string $P$ with each substring in $S$ of the same length. If instead we were able to replace this comparison with a constant time operation it would have given us the linear time complexity overall. We can do this by pre-computing a hash of $P$ and a list of hashes for every substring in $S$ of the same length. In this case the problem can be reduced to a check if the hash of $P$ is contained in the list of hashes. For that we need a hash function.
Hash Function
Given a string $S$ let’s define a hash function in the following way:
$$ hash(S) = S[0] \cdot p^0 + S[1] \cdot p^1 + S[2] \cdot p^2 + ... + S[n-1] \cdot p^{n-1} \mod m \\ = \sum_{i=0}^{|S|-1} S[i] \cdot p^i \mod m $$Where $p$ is some prime number usually a bit bigger than the alphabet size (for example 31
or 37
for a string containing only lowercase English letters) and $m$ is a big prime number needed to prevent integer overflow.
To illustrate let’s compute the hash of a string zorro
using $p=31$ and $m=999907$:
Finding Substrings
Now when we know how to compute hash of a given string we can compute the following hashes:
$$H_P = hash(P)$$$$H_{S[0:|P|-1]} = hash(S[0:|P|-1])$$Where the latter is the hash of the prefix of $S$ of the same length as $|P|$. If the two hashes match it means we’ve found the first occurrence of $P$ in $S$.
Now we need to figure out how to shift the hash by one character and whether it can be compared with $H_P$. As you remember any character at position $i$ contributes $S[i] \cdot p^i$ to the hash. Knowing that we can do the following transition:
$$H_{S[i:j]} = H_{S[i-1:j-1]} - S[i-1] \cdot p^{i-1} + S[j] \cdot p^j \mod m$$This hash can’t be compared to $H_P$ as powers of $p$ are different: in $H_P$ the powers of $p$ are $p^0, p^1, p^2 ...$ whereas in $H_{S[i:j]}$ the powers of $p$ are $p^i, p^{i+1}, p^{i+2}, ... $. To compensate for that we need to either divide $H_{S[i:j]}$ by $p^i$ or multiply $H_P$ by $p^i$. The latter is preferable as we need to apply $\mod m$ later on. Taking that into account the comparison is going to look like so:
$$H_{S[i:j]} = H_P \cdot p^i \mod m$$Calculating $H_{S[i:j]}$ for every position in $S$ except for the last $|P|-1$ positions is going to allow to find all occurrences of $P$ in $S$.
Below is the code of a function that given two strings S
and P
returns an array of all indexes of S
where P
is found:
func findAllIndexes(S, P string) []int {
m := 999907
p := 31
hP := 0
pPow := 1
for _, c := range P {
hP = (hP + (int(c-'a')+1) * pPow) % m
pPow = (pPow * p) % m
}
hS := 0
pLPow, pRPow := 1, 1
l := 0
res := []int{}
for r, c := range S {
hS = (hS + (int(c-'a')+1) * pRPow) % m
pRPow = (pRPow * p) % m
if r - l + 1 > len(P) {
hS = (hS + m - (int(S[l]-'a')+1) * pLPow) % m
pLPow = (pLPow * p) % m
l++
}
if hS == (hP * pLPow) % m {
res = append(res, l)
}
}
return res
}