github’s rust-gemsbpe crate is quite interesting. they use aho-corasick algorithm to make a fast chunking and backtracking tokenizer.
the aho-corasick algorithm uses special data structure, which looks like a trie but has back-links in case of failure, named failure links or suffix links
key insight of aho-corasick: match multiple patterns in O(n + m + z) time - one pass through text, no backtracking
failure links point to the longest proper suffix that also exists as a prefix in your pattern set. this is what lets you “fail gracefully” without rescanning
daachorse is an alternative aho-corasick implementation using a “double-array” structure instead of a trie. claims 3-5x faster and 56-60% smaller memory than burntsushi’s crate for large dictionaries
the double array works based on a BASE and CHECK array where state transitions on the BASE array are checked with the CHECK for valid transitions.