Finding Substrings Using Rabin-Karp Algorithm

The Rabin-Karp algorithm is a linear time substring searching algorithm. Let’s see how it works in theory and in practice.

Problem Statement

Given two strings $S$ and $P$, find whether $P$ is a substring of $S$.

Naive Solution

Before reviewing the idea of the Rabin-Karp algorithm let’s first think about how we can solve the problem in a most simple way. The string $P$ might be found at any position in $S$, so we can check all the positions: iterate over $S$ and for every index $i$ check if $\nobreak S[i:i+|P|-1] = P$.

The pseudocode for this algorithm is going to look something like this:

outer:
for i := range len(S) - len(P) + 1 {
    for j := range len(P) {
        if S[i+j] != P[j] {
            continue outer
        }
    }
    return true
}
return false

The time complexity of this algorithm is $O(|S| \cdot |P|)$.

Rabin-Karp Idea

The inefficiency of the naive approach comes from the fact that we have to compare the string $P$ with each substring in $S$ of the same length. If instead we were able to replace this comparison with a constant time operation it would have given us the linear time complexity overall. We can do this by pre-computing a hash of $P$ and a list of hashes for every substring in $S$ of the same length. In this case the problem can be reduced to a check if the hash of $P$ is contained in the list of hashes. For that we need a hash function.

Hash Function

Given a string $S$ let’s define a hash function in the following way:

$$ hash(S) = S[0] \cdot p^0 + S[1] \cdot p^1 + S[2] \cdot p^2 + ... + S[n-1] \cdot p^{n-1} \mod m \\ = \sum_{i=0}^{|S|-1} S[i] \cdot p^i \mod m $$

Where $p$ is some prime number usually a bit bigger than the alphabet size (for example 31 or 37 for a string containing only lowercase English letters) and $m$ is a big prime number needed to prevent integer overflow.

To illustrate let’s compute the hash of a string zorro using $p=31$ and $m=999907$:

$$ hash("zorro") \\ = 26 \cdot 31^0 + 15 \cdot 31^1 + 18 \cdot 31^2 + 18 \cdot 31^3 + 15 \cdot 31^4 \mod 999\,907 \\ = 26 + 465 + 17\,298 + 536\,238 + 13\,852\,815 \mod 999\,907 \\ = 14\,406\,842 \mod 999\,907 = 408\,144 $$

Finding Substrings

Now when we know how to compute hash of a given string we can compute the following hashes:

$$H_P = hash(P)$$$$H_{S[0:|P|-1]} = hash(S[0:|P|-1])$$

Where the latter is the hash of the prefix of $S$ of the same length as $|P|$. If the two hashes match it means we’ve found the first occurrence of $P$ in $S$.

Now we need to figure out how to shift the hash by one character and whether it can be compared with $H_P$. As you remember any character at position $i$ contributes $S[i] \cdot p^i$ to the hash. Knowing that we can do the following transition:

$$H_{S[i:j]} = H_{S[i-1:j-1]} - S[i-1] \cdot p^{i-1} + S[j] \cdot p^j \mod m$$

This hash can’t be compared to $H_P$ as powers of $p$ are different: in $H_P$ the powers of $p$ are $p^0, p^1, p^2 ...$ whereas in $H_{S[i:j]}$ the powers of $p$ are $p^i, p^{i+1}, p^{i+2}, ... $. To compensate for that we need to either divide $H_{S[i:j]}$ by $p^i$ or multiply $H_P$ by $p^i$. The latter is preferable as we need to apply $\mod m$ later on. Taking that into account the comparison is going to look like so:

$$H_{S[i:j]} = H_P \cdot p^i \mod m$$

Calculating $H_{S[i:j]}$ for every position in $S$ except for the last $|P|-1$ positions is going to allow to find all occurrences of $P$ in $S$.

Below is the code of a function that given two strings S and P returns an array of all indexes of S where P is found:

func findAllIndexes(S, P string) []int {
    m := 999907
    p := 31

    hP := 0
    pPow := 1
    for _, c := range P {
        hP = (hP + (int(c-'a')+1) * pPow) % m
        pPow = (pPow * p) % m
    }

    hS := 0
    pLPow, pRPow := 1, 1
    l := 0
    res := []int{}
    for r, c := range S {
        hS = (hS + (int(c-'a')+1) * pRPow) % m
        pRPow = (pRPow * p) % m

        if r - l + 1 > len(P) {
            hS = (hS + m - (int(S[l]-'a')+1) * pLPow) % m
            pLPow = (pLPow * p) % m
            l++
        }

        if hS == (hP * pLPow) % m {
            res = append(res, l)
        }
    }
    return res
}

Problem Statement#

Naive Solution#

Rabin-Karp Idea#

Hash Function#

Finding Substrings#

LeetCode Problems to Practice#

Problem Statement

Naive Solution

Rabin-Karp Idea

Hash Function

Finding Substrings

LeetCode Problems to Practice