关于字符串匹配的数据结构和算法我们之前介绍过Trie和KMP,Trie在储存大量文本的时候有优势,不仅可以省去很多重复前缀的空间,而且对于字符串的查找也很快;KMP在处理单串匹配的时候也很有效率,因为next指定了失配时候状态的转移,从而避免了回溯,让我们可以在线性时间内完整子串的匹配。
然而这两种方法都注重处理单输入串的匹配,如果我们待匹配串有多个的话,这两者并没有很好的办法解决。比如对于文本ABCDEFG,和待匹配串{AB, BC, DEF, ABCD, DEG},我们想找到每个待匹配串时候在文本当中出现过。Trie不好处理子串的匹配,因为我们需要把所有前缀插入,KMP处理多个待匹配串并没有优势,我们需要对所有待匹配串生成next数组,然后一一进行匹配。
为了解决这个问题,我们可以使用AC自动机,我们可以把AC自动机看做Trie和KMP的结合,首先AC自动机的基本结构是由所有待匹配串最形成的Trie,此外,最重要的一点是所有节点都会有一个失配指针,这个和KMP中的next数组很像,都指定了失配时状态是应该如何转移的。我们拿{abce, abdexy, def, xya, xyx, xy}举例:
上图表示了{abce, abdexy, def, xya, xyx, xy}所对应的AC自动机,其中红色的节点代表对应的的节点有word存储。其中实线对应Trie的边,虚线对应失配指针的指向(注意这里每个节点都应该有失配指针,但是我们为了保持图片的整洁,省略了所有指向root的失配指针,同时root的失配指针是null,我们也没画出来)。我们可以看到,失配指针构建的原则和KMP的next数组是很相似的。在KMP中,next数组中next[i]表示的是前i个字符组成的字符串当中,最长的公共前后缀(不包括自身)的长度。在AC自动机中,对于在节点i处的fail指针,我们要找的是,从根节点到i处所组成的string,其所有的后缀(不包括自身),和Trie中存在的所有前缀能match的最长部分对应的节点。我们拿abde来举例,abde的后缀有{bde, de, e},我们对这三个后缀分别在Trie中查找,可以看出能match的最长后缀为de,所以我们可以看到对应的fail指针连接上了。同理abcdexy能match的最长后缀为xy,xyx能match的最长后缀为x,等等。
那么我们具体如何构建Fail指针呢。首先root自身的fail指针是null,root下面那一层的节点的fail指针都是root,在这一层失配的话我们是没有其他的地方可以去的,只能回到root。除此之外,如果我们在更深的某个节点失配,假设从根节点到当前节点组成的string为hijklmn,为了找n对应的fail指针,我们可以利用hijklm中m节点的fail指针,因为其表示的是hijklm所有后缀中能在Trie中match的最长后缀:
- 如果m的fail指针对应的节点,假设为klm,其子节点有n的话,klmn就是hijklmn所有后缀能在Trie表示的所有前缀中能match的最长部分,因为klm是hijklm在Trie中能match的最长后缀。
- 如果d的fail指针对应的节点,假设为klm,其子节点没有n的话。那么我们应该继续在klm所有后缀在trie中能match的最长部分继续找,这样可以保证我们尽可能match hijklmn在Trie中的最长后缀,也就是去往klm中m的fail指针指向的节点的子节点继续找e。直到到达根节点。
这个方法需要我们按层来建立fail指针,因为下一层的fail指针的建立要依靠上一层节点。代码如下:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
//build failed pointer for each node | |
void buildFailed() | |
{ | |
//for current node c with edge e, we look for its parent's fail pointer f | |
//if f has the same char as e as its edge to chidlren c', we point c to c' | |
//if not, we recursively look for f's parent until it is root | |
queue<Node*> q; | |
for (int i = 0; i < 26; ++i) | |
{ | |
if (root->children[i]) | |
{ | |
q.push(root->children[i]); | |
root->children[i]->fail = root; | |
} | |
} | |
while (q.size()) | |
{ | |
Node* curr = q.front(); | |
q.pop(); | |
for (int i = 0; i < 26; ++i) | |
{ | |
if (curr->children[i]) | |
{ | |
q.push(curr->children[i]); | |
Node* fail = curr->fail; | |
while (fail != root && !fail->children[i]) | |
fail = fail->fail; | |
if (fail->children[i]) | |
curr->children[i]->fail = fail->children[i]; | |
else | |
curr->children[i]->fail = root; | |
} | |
} | |
} | |
} |
查询的话很直接,如果match的话我们按照Trie的边走,否则我们按照fail指针走。如果我们碰到了一个match,我们不仅要把当前word加入结果集,我们还要把沿着fail指针走一直到root上所有的word也加入结果集。比如在上图表示的自动机中我们match了abdexy,而xy作为abdexy的后缀也被match了,我们要沿着fail指针将其收集。代码如下:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
void query(string s, unordered_set<string>& res) | |
{ | |
Node* curr = root; | |
for(int i = 0; i < s.size(); ++i) | |
{ | |
char c = s[i]; | |
if (curr->children[c - 'a'])curr = curr->children[c - 'a']; | |
else | |
{ | |
while (curr && !curr->children[c - 'a']) | |
curr = curr->fail; | |
if (!curr) | |
curr = root; | |
else | |
curr = curr->children[c - 'a']; | |
} | |
//we follow fail path to find all matches, since the substring we | |
//are matching might contain a target word, for example: ACG, C | |
//when we are matching AC and actually C is a target word which is on | |
//failed path | |
Node* helper = curr; | |
while (helper != NULL) | |
{ | |
if (helper->isWord) | |
res.insert(curr->word); | |
helper = helper->fail; | |
} | |
} | |
} |
假设待匹配串有m个平均长度为k,输入文本长度为n。我们建立AC自动机的时间复杂度为O(m * k),两部分建立Trie和Fail指针的时间复杂度就等于Trie中的总node数m * k。多穿匹配的话时间复杂度为O(n),我们只需要对每个输入字符进行状态的转移即可。完整代码如下:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
class ACAutomaton | |
{ | |
public: | |
ACAutomaton() | |
{ | |
root = new Node(); | |
} | |
~ACAutomaton() | |
{ | |
clear(root); | |
} | |
void insert(const string& str) | |
{ | |
Node* curr = root; | |
int len = str.size(); | |
for (int i = 0; i < len; ++i) | |
{ | |
char c = str[i]; | |
if (curr->children[c - 'a'] == NULL) | |
curr->children[c - 'a'] = new Node(); | |
curr = curr->children[c - 'a']; | |
} | |
curr->isWord = true; | |
curr->word = str; | |
} | |
//build failed pointer for each node | |
void buildFailed() | |
{ | |
//for current node c with edge e, we look for its parent's fail pointer f | |
//if f has the same char as e as its edge to chidlren c', we point c to c' | |
//if not, we recursively look for f's parent until it is root | |
queue<Node*> q; | |
for (int i = 0; i < 26; ++i) | |
{ | |
if (root->children[i]) | |
{ | |
q.push(root->children[i]); | |
root->children[i]->fail = root; | |
} | |
} | |
while (q.size()) | |
{ | |
Node* curr = q.front(); | |
q.pop(); | |
for (int i = 0; i < 26; ++i) | |
{ | |
if (curr->children[i]) | |
{ | |
q.push(curr->children[i]); | |
Node* fail = curr->fail; | |
while (fail != root && !fail->children[i]) | |
fail = fail->fail; | |
if (fail->children[i]) | |
curr->children[i]->fail = fail->children[i]; | |
else | |
curr->children[i]->fail = root; | |
} | |
} | |
} | |
} | |
void query(string s, unordered_set<string>& res) | |
{ | |
Node* curr = root; | |
for(int i = 0; i < s.size(); ++i) | |
{ | |
char c = s[i]; | |
if (curr->children[c - 'a'])curr = curr->children[c - 'a']; | |
else | |
{ | |
while (curr && !curr->children[c - 'a']) | |
curr = curr->fail; | |
if (!curr) | |
curr = root; | |
else | |
curr = curr->children[c - 'a']; | |
} | |
//we follow fail path to find all matches, since the substring we | |
//are matching might contain a target word, for example: ACG, C | |
//when we are matching AC and actually C is a target word which is on | |
//failed path | |
Node* helper = curr; | |
while (helper != NULL) | |
{ | |
if (helper->isWord) | |
res.insert(curr->word); | |
helper = helper->fail; | |
} | |
} | |
} | |
private: | |
Node* root; | |
void clear(Node* curr) | |
{ | |
if (!curr)return; | |
for (int i = 0; i < 26; ++i) | |
clear(curr->children[i]); | |
delete curr; | |
} | |
}; | |
int main() | |
{ | |
ACAutomaton ac; | |
ac.insert("abce"); | |
ac.insert("abdexy"); | |
ac.insert("def"); | |
ac.insert("xya"); | |
ac.insert("xyx"); | |
ac.insert("xy"); | |
ac.buildFailed(); | |
unordered_set<string> res; | |
//res = {"abce", "xya", "xy", "xyx"} | |
ac.query("abcexyxya", res); | |
res.clear(); | |
//res = {"def", "xy"} | |
ac.query("abdefxy", res); | |
} |
No comments:
Post a Comment