Nazanin De's Blog

Projects, Ideas and thoughts

Javascript implementation of Aho-Corasick Algorithm for Pattern Searching

Aho-Corasick algorithm is a string search algorithm invented by Alfred V. Aho a Canadian Computer scientists mostly famous for his work on languages, compilers and related algorithms. This algorithm is used to find occurrences of text in a set of strings. The complexity of this algorithm is linear in the length of the strings plus the length of the searched text plus the number of output matches.

The algorithm actually build a finite state machine which represents a Trie with additional links between various small nodes. The automaton for this algorithm can get constructed once and then the compiled version of it can get used later. This algorithm later on formed the bases of Unix command fgrep.

So as the first step we need to build a simple Trie data structure in Javascript.
Trie is an efficient information retrieval data structure. Using Trie can optimized search algorithms and bring their complexity down to the limit. Trie can be used instead of binary tree or even a hash table to outperform the algorithm.

Here is an example of Trie we need to build for our pattern searching algorithm:

var trie = {  
    m: {
        a: {
            t: {
                end: true,
                e: {
                    end: true
                }
            }
        }
    },
    l: {
        o: {
            o: {
                end: true
            }
        }
    },
    p: {
        i: {
            p: {
                end: true,
                e: {
                    end: true
                }
            }
        }
    }
};

I used nodeJS to read the words I would like to put in trie from a txt file including all my words. So the first step is reading all words from the file and then transform them to an array.

var text =  
    require('fs').readFileSync('string.txt', 'utf8'),
    words = text.replace('/\n/g', '').split(' ').sort();

Now we have all the words in an array it is time to build our Trie. For that wee need to traverse all words and also all the characters inside the word and place it in the Trie. The algorithm contains following steps:
1. Traverse all words
2. For each word get all the letters
3. Traverse all the letters of a word
4. For each letter look at the trie:

  • If letter is not in the trie, look at the position of the letter.

    • If position is the last add 0 to the letter value in the trie otherwise add an object as the value so to continue building up next letters.
  • If letter position already exists in the trie as a final leaf, Make it an object to be able to continue building the word

  • Otherwise do nothing

  • Update the trie

Following is the code representing described trie building algorithm in Javascript:

const fs = require('fs'),  
    text = fs.readFileSync('string.txt', 'utf8'),
    words = text.replace('/\n/g', '').split(' ').sort(),
    trie = {};


for(let i = 0 ; i < words.length; i++) {  
    var word = words[i],
        //get all the word letters
        letters = word.split(''),
        cur_trie = trie;

    //iterate through all letters of a word
    for(let j = 0; j < letters.length; j++) {
        let letter = letters[j],
            position = cur_trie[ letter ];

        //If this letters does not exist in the cur_trie create a leaf
        if(position === undefined) {
            // If it is the end of the word make zero as and
            // Otherwise make an object to continue building the trie
            cur_trie = cur_trie[ letter ] = j === letters.length - 1 ? 0 : {};
        } else if(position === 0) {
            // If a final leaf already exists we need to turn it
            // into an object to continue building the trie
            cur_trie = cur_trie[ letter ] = { $: 0 };
        } else {
            // Otherwise there is nothing to be set, so continue on
            cur_trie = cur_trie[ letter ];
        }
    }
}

Now that we know how to build up a trie data structure it is time to use that for searching text inside the trie. You should know that the Aho-Corasick algorithm matches all patterns in the dictionary at once and that is a big advantage. Twitter is using this algorithm in Twitterfall Exclusions filter, to be able to efficiently search tweets for multiple substrings.

The Aho-Corasick algorithm is an extension of a trie tree, not far from the basic idea. Aho-Corasick algorithm adds a failed pointer to each node on a trie tree.

When fails, a trie tree will restart from the root (add the start index on L by 1), but Aho-Corasick algorithm will restart from the node D pointed by the failed pointer (add the start index on L by the depth of the node D).

Basically our search function should get a trie and a string to search inside the trie.

Following is the code represent the algorithm in Javascript:

ahoCorasickSearch(trie, s) => {  
    let current_node = trie,
        state = '',
        sentence = s.split(''),
        j = 0;

    while(1) {
       for (i = j; i < sentence.length; i++) {
            let r = current_node.hasChild(sentence[i]);
            // Does this character exist in the children of where we
            //are in the trie?
            if (r) {
                // If so, append to the state, and traverse to
                // that child
                state += sentence[i];
                current_node = r;
                // Have we found a word now?
                if (trie.getWordCount(state) != 0) {
                    return true;
                }
            } else {
                // If not, go back to where we started to match,
                //reduce i, and empty the state
                current_node = trie;
                i = i - state.length;
                state = '';
            }
        }
        // Reached the end of the string
        if (state == '') {
            // Just found nothing
            return false;
        } else {
            // Was in the middle of finding something, so possibly
            //missed something, so go back to check.
            current_node = trie;
            j = i - state.length + 1;
            state = '';
        }
    }
}

Aho-Corasick is not the only efficient pattern searching algorithm out there, There is another strong algorithm for pattern searching called Rabin Karp which I strongly recommend reading it.