Boyer-Moore-Horspool String Matching Algorithm

Boyer-Moore-Horspool is the fastest way to search for substrings. If you are working on a searchbars, an auto-corrector, or Big Data and you are not using this algorithm, you are wasting CPU cycles and your time.

What is Boyer-Moore-Horspool Algorithm ?

Boyer-Moore-Horspool is an algorithm for finding substrings into strings. This algorithm compares each characters of substring to find a word or the same characters into the string. When characters do not match, the search jumps to the next matching position in the pattern by the value indicated in the Bad Match Table.

The Bad Match Table indicates how many jumps should it move from the current position to the next.

History

The algorithm was published by NIgel Horspool, in 1980, a professor of computer science at the University of Victoria. He is co-inventor of Dynamic Markov Compression and editor-at-large of the journal Software. Also, he is the author of C Programming in the Berkeley UNIX Environment.

His work was based on Boyer–Moore string search algorithm. This is related to the Knuth–Morris–Pratt algorithm. Horspool improve the efficiency of the original algorithm to achieve the complexity to O(mn) in the worst case.

How Does It Works?

As an example, we will find “abcd” into the string “eovadabcdftoy.”

The first step is calculate the value of each letter of the substring to create the Bad Match Table, using this formula,

Value = length of substring - index of each letter in the substring - 1

Note that the value of the last letter and other letters that are not in the substring will be the length of the substring

Finally, the value should be assigned to each letter in the Bad Match Table. After calculating the value, your table will look like this,

After that, you can compare the substring and the string. You start from the index of the end letter in the substring, in this case the letter "d."

If the letter matches, then compare with the preceding letter, "c" in this case.
If it doesn't match, check its value in the Bad Match Table.
Then, skip the number of spaces that the table value indicates.

Repeat this steps until all the letters match.

Here's another example,

Code

This code implements the Boyer-Moore-Horspool algorithm in Python.

https://gist.github.com/pakosaldanaort/6e51dcafeca7ccea3cbb41e175a8d241

Performance

The Boyer-Moore-Horspool algorithm execution time is linear in the size of the string being searched. It can have a lower execution time factor than many other search algorithms.

For one, it does not need to check all characters of the string. It skips over some of them with help of the Bad Match table.

The algorithm gets faster as the substring being searched for becomes longer. This is because with each unsuccessful attempt to find a match between the substring and the string, the algorithm uses the Bad Match table to rule out positions where the substring cannot match.

Complexity

In the worst-case the performance of the Boyer-Moore-Horspool algorithm is O(mn), where m is the length of the substring and n is the length of the string.

The average time is O(n). In the best case, the performance is sub-linear, and is, in fact, identical to Boyer-Moore original implementation.

Regardless, the Boyer-Moore-Horspool is quicker and the internal loop is simpler than Boyer-Moore.

Conclusion

Boyer-Moore-Horspool is faster, simpler and optimized the searches of substrings. It has the following uses,

Searchbars
Auto-correctors
String Analyzers
Big Data
Text labeling

Boyer-Moore-Horspool is the best algorithm for string searches.

The Boyer-Moore-Horspool Algorithm

Francisco Saldaña

Encora