6 min read

On Rabin-Karp

I’m a sucker for anything that involves hashing, so when I came across the Rabin-Karp string-searching algorithm and its use of a rolling hash, I lost all control.

Let’s dig in.

The algorithm was named after Michael O. Rabin and Richard Karp, two computer scientists who created the string search algorithm in 1987.

The Problem

Searching for a pattern in a document can be expensive. One can reach for a brute-force solution, trying one letter at a time, but this can produce runtimes of O(nm) (n is the size of the document and m is the size of the pattern), such as when searching for a pattern of 10,000 a’s followed by a single b in a search string of 10 million a’s.

The idea, though, is to search through the document a letter at a time, testing each one against the pattern. When a match is found, an inner loop will then check the subsequent characters against the length of the pattern, looking for a complete match.

It could look something like this:

const naive = (word, text) => {
    const wordLen = word.length
    const textLen = text.length;
    let found = false;

    for (let i = 0; i < textLen - wordLen + 1; i++) {
        if (word[0] === text[i]) {
            let j = 1;

            while (j < wordLen) {
                if (word[j] !== text[i + j++]) {
                    break;
                }
            }

            if (j === wordLen) {
                console.log(`Found a match at index ${i}`);
                found = true;
            }
        }
    }

    return found;
};

As one would guess, there is a much better way to approach this problem.

The Rabin-Karp Algorithm

The Rabin-Karp algorithm introduces an important idea, the rolling hash. This is a sliding window that is compared against the hashed pattern. As the window slides or shifts to the right as it moves across the text, it only performs arithmetic operations in constant time, as opposed to recomputing the entire new hash every time.

Here is a rough sketch of the algorithm:

1. Hash the search pattern. Only do this once.

    hash('ooo')

2. Initiate the rolling hash from the text to
   search.  This is the window and will be the
   length of the search pattern.

    text = 'yahooo!'
    window = hash('yah')

3. Start the search from the beginning of the text.
4. Do the hashes match?

    If Yes:
    Since collisions can occur, check
    each individual letter in the hash.

5. Shift the window one character to the right:

    window = hash('aho')
    window = hash('hoo')
    window = hash('ooo')
    ...

5. Repeat step 4 until sizeof(haystack - needle + 1)

The hash function is important. It should take into consideration that as the hash value increases it could overflow the word, or int, so we must take care to constrain the outputs to something that can fit within the allotted memory space.

For instance, here’s a hash function that would work but has the potential to overflow its data type:

const BASE = 256;

const hash = (word, size = word.length) => {
    let hash = 0;

    for (let i = 0; i < size; i++) {
        hash += word[i].codePointAt() * BASE ** (size - i - 1)
    }   

    return hash;
};

hash('foo') // 6713199

This will create the hash in the following way:

'f'.codePointAt() * 256 ** 2 +
'o'.codePointAt() * 256 ** 1 +
'o'.codePointAt() * 256 ** 0

How would we slide this window one letter to the right?

1. Subtract the first hashed letter value:
    hash -= 'f'.codePointAt() * 256 ** 2

2. "Push" the letters left and add the next
   subsequent letter:
    hash = hash * 256 + 't'.codePoint()

To prevent an overflow, this can easily be done using modular arithmetic. But first, let’s optimize this a bit.

In the previous hash function, we’re multiplying and adding values more times than necessary. To fix this, we’ll employ Horner’s method, which will reduce the number of the arithmetic operations to be the size of the pattern a:

p(x) = a0 + a1x + a2x2 + a3x3 + ... + anxn
     = a0 + x( a1 + x( a2 + x( a3 + ... + x( an-1 + xan ) ... ) ) )

     So: ('o' * 2560) + ('o' * 2561) + ('f' * 2562)
       : ('f' * 2562) + ('o' * 2561) + ('o' * 2560)
Becomes: ('f' * 256 + 'o') * 256 + 'o'

Now, let’s pick a prime which we’ll use as a modulus to constrain the results to a range of our choice to prevent overflow.

foo = ( ( ( ( ( ( 'f'.codePointAt() % 101 ) * 256 + 'o'.codePointAt() ) % 101 ) * 256 ) % 101 ) + 'o'.codePointAt() ) % 101
    = 32

 oo = ( ( 'o'.codePointAt() % 101 ) * 256 + 'o'.codePointAt() ) % 101
    = 45

oot = ( ( ( ( ( ( 'o'.codePointAt() % 101 ) * 256 + 'o'.codePointAt() ) % 101 ) * 256 ) % 101 ) + 't'.codePointAt() ) % 101
    = 21

Remove the first letter `f` from `foo`:
oo = foo + 101 - ( 'f'.codePointAt() * ( 256 % 101 * 256 ) ) % 101
   = 45

You can see that equals the original definition of `oo`.

Let's now add `t` to it:
oot = ( oo * 256 + 't'.codePointAt() ) % 101
    = 21

And you can see that equals `oot`.

Hopefully, the above comparisons give an idea of how the window is shifted to the right as a rolling hash and that that hash is only computed once. The removal and addition of the bounds both take constant time O(1).

The last thing to note is that it is necessary to verify that there is indeed a match when the rolling hash window and the hashed pattern are equal. This is because there are bound to be hash collisions. See the function isMatch in the implementation below.

So, without further ado, here is an implementation of the Rabin-Karp seach string algorithm:

// Let the base equal 256, one byte.
const BASE = 256;

// Modulus should be prime.
const P = 113;

const hash = (word, size = word.length) => {
    let hash = 0;

    for (let i = 0; i < size; i++) {
        hash *= BASE;
        hash += word[i].charCodeAt();
        hash %= P;
    }   

    return hash;
};

const isMatch = (word, text, i) => {
    let found = true;

    // Could be a collision, we need to check every letter now.
    for (let j = 0; j < word.length; j++) {
        if (text[i + j] !== word[j]) {
            found = false;
            break;
        }
    }   

    return found;
};

const rabinKarp = (word, text) => {
    const wordLength = word.length;
    const hashedWord = hash(word);
    let rollingHash = hash(text, wordLength);
    let found = false;

    if (rollingHash === hashedWord) {
        if (isMatch(word, text, 0)) {
            found = true;
            console.log(`Found a match at index 0`);
        }
    }

    let multiplier = 1;
    for (let i = 1; i < wordLength; i++) {
        multiplier *= BASE;
        multiplier %= P;
    }

    for (let i = 0; i < text.length - wordLength; i++) {
        rollingHash += P;
        rollingHash -= (text[i].codePointAt() * multiplier) % P;

        rollingHash *= BASE;
        rollingHash += text[wordLength + i].codePointAt();
        rollingHash %= P;

        if (rollingHash === hashedWord) {
            if (isMatch(word, text, i + 1)) {
                found = true;
                console.log(`Found a match at index ${i + 1}`);
            }
        }
    }

    return found;
};

References