今際の国の呵呵君: [LeetCode]Aho–Corasick algorithm/AC自动机

关于字符串匹配的数据结构和算法我们之前介绍过Trie和KMP，Trie在储存大量文本的时候有优势，不仅可以省去很多重复前缀的空间，而且对于字符串的查找也很快；KMP在处理单串匹配的时候也很有效率，因为next指定了失配时候状态的转移，从而避免了回溯，让我们可以在线性时间内完整子串的匹配。
然而这两种方法都注重处理单输入串的匹配，如果我们待匹配串有多个的话，这两者并没有很好的办法解决。比如对于文本ABCDEFG，和待匹配串{AB, BC, DEF, ABCD, DEG}，我们想找到每个待匹配串时候在文本当中出现过。Trie不好处理子串的匹配，因为我们需要把所有前缀插入，KMP处理多个待匹配串并没有优势，我们需要对所有待匹配串生成next数组，然后一一进行匹配。
为了解决这个问题，我们可以使用AC自动机，我们可以把AC自动机看做Trie和KMP的结合，首先AC自动机的基本结构是由所有待匹配串最形成的Trie，此外，最重要的一点是所有节点都会有一个失配指针，这个和KMP中的next数组很像，都指定了失配时状态是应该如何转移的。我们拿{abce, abdexy, def, xya, xyx, xy}举例:

上图表示了{abce, abdexy, def, xya, xyx, xy}所对应的AC自动机，其中红色的节点代表对应的的节点有word存储。其中实线对应Trie的边，虚线对应失配指针的指向（注意这里每个节点都应该有失配指针，但是我们为了保持图片的整洁，省略了所有指向root的失配指针，同时root的失配指针是null，我们也没画出来）。我们可以看到，失配指针构建的原则和KMP的next数组是很相似的。在KMP中，next数组中next[i]表示的是前i个字符组成的字符串当中，最长的公共前后缀(不包括自身)的长度。在AC自动机中，对于在节点i处的fail指针，我们要找的是，从根节点到i处所组成的string，其所有的后缀(不包括自身)，和Trie中存在的所有前缀能match的最长部分对应的节点。我们拿abde来举例，abde的后缀有{bde, de, e}，我们对这三个后缀分别在Trie中查找，可以看出能match的最长后缀为de，所以我们可以看到对应的fail指针连接上了。同理abcdexy能match的最长后缀为xy，xyx能match的最长后缀为x，等等。
那么我们具体如何构建Fail指针呢。首先root自身的fail指针是null，root下面那一层的节点的fail指针都是root，在这一层失配的话我们是没有其他的地方可以去的，只能回到root。除此之外，如果我们在更深的某个节点失配，假设从根节点到当前节点组成的string为hijklmn，为了找n对应的fail指针，我们可以利用hijklm中m节点的fail指针，因为其表示的是hijklm所有后缀中能在Trie中match的最长后缀:

如果m的fail指针对应的节点，假设为klm，其子节点有n的话，klmn就是hijklmn所有后缀能在Trie表示的所有前缀中能match的最长部分，因为klm是hijklm在Trie中能match的最长后缀。
如果d的fail指针对应的节点，假设为klm，其子节点没有n的话。那么我们应该继续在klm所有后缀在trie中能match的最长部分继续找，这样可以保证我们尽可能match hijklmn在Trie中的最长后缀，也就是去往klm中m的fail指针指向的节点的子节点继续找e。直到到达根节点。

这个方法需要我们按层来建立fail指针，因为下一层的fail指针的建立要依靠上一层节点。代码如下：

	//build failed pointer for each node
	void buildFailed()
	{
	//for current node c with edge e, we look for its parent's fail pointer f
	//if f has the same char as e as its edge to chidlren c', we point c to c'
	//if not, we recursively look for f's parent until it is root
	queue<Node*> q;
	for (int i = 0; i < 26; ++i)
	{
	if (root->children[i])
	{
	q.push(root->children[i]);
	root->children[i]->fail = root;
	}
	}
	while (q.size())
	{
	Node* curr = q.front();
	q.pop();
	for (int i = 0; i < 26; ++i)
	{
	if (curr->children[i])
	{
	q.push(curr->children[i]);
	Node* fail = curr->fail;
	while (fail != root && !fail->children[i])
	fail = fail->fail;
	if (fail->children[i])
	curr->children[i]->fail = fail->children[i];
	else
	curr->children[i]->fail = root;
	}
	}
	}
	}

view raw AC_build_failed.cpp hosted with ❤ by GitHub

查询的话很直接，如果match的话我们按照Trie的边走，否则我们按照fail指针走。如果我们碰到了一个match，我们不仅要把当前word加入结果集，我们还要把沿着fail指针走一直到root上所有的word也加入结果集。比如在上图表示的自动机中我们match了abdexy，而xy作为abdexy的后缀也被match了，我们要沿着fail指针将其收集。代码如下:

	void query(string s, unordered_set<string>& res)
	{
	Node* curr = root;
	for(int i = 0; i < s.size(); ++i)
	{
	char c = s[i];
	if (curr->children[c - 'a'])curr = curr->children[c - 'a'];
	else
	{
	while (curr && !curr->children[c - 'a'])
	curr = curr->fail;
	if (!curr)
	curr = root;
	else
	curr = curr->children[c - 'a'];

	}
	//we follow fail path to find all matches, since the substring we
	//are matching might contain a target word, for example: ACG, C
	//when we are matching AC and actually C is a target word which is on
	//failed path
	Node* helper = curr;
	while (helper != NULL)
	{
	if (helper->isWord)
	res.insert(curr->word);
	helper = helper->fail;
	}
	}
	}

view raw AC_match.cpp hosted with ❤ by GitHub

假设待匹配串有m个平均长度为k，输入文本长度为n。我们建立AC自动机的时间复杂度为O(m * k)，两部分建立Trie和Fail指针的时间复杂度就等于Trie中的总node数m * k。多穿匹配的话时间复杂度为O(n)，我们只需要对每个输入字符进行状态的转移即可。完整代码如下：

	class ACAutomaton
	{
	public:

	ACAutomaton()
	{
	root = new Node();
	}

	~ACAutomaton()
	{
	clear(root);
	}

	void insert(const string& str)
	{
	Node* curr = root;
	int len = str.size();
	for (int i = 0; i < len; ++i)
	{
	char c = str[i];
	if (curr->children[c - 'a'] == NULL)
	curr->children[c - 'a'] = new Node();
	curr = curr->children[c - 'a'];
	}
	curr->isWord = true;
	curr->word = str;
	}

	//build failed pointer for each node
	void buildFailed()
	{
	//for current node c with edge e, we look for its parent's fail pointer f
	//if f has the same char as e as its edge to chidlren c', we point c to c'
	//if not, we recursively look for f's parent until it is root
	queue<Node*> q;
	for (int i = 0; i < 26; ++i)
	{
	if (root->children[i])
	{
	q.push(root->children[i]);
	root->children[i]->fail = root;
	}
	}
	while (q.size())
	{
	Node* curr = q.front();
	q.pop();
	for (int i = 0; i < 26; ++i)
	{
	if (curr->children[i])
	{
	q.push(curr->children[i]);
	Node* fail = curr->fail;
	while (fail != root && !fail->children[i])
	fail = fail->fail;
	if (fail->children[i])
	curr->children[i]->fail = fail->children[i];
	else
	curr->children[i]->fail = root;
	}
	}
	}
	}

	void query(string s, unordered_set<string>& res)
	{
	Node* curr = root;
	for(int i = 0; i < s.size(); ++i)
	{
	char c = s[i];
	if (curr->children[c - 'a'])curr = curr->children[c - 'a'];
	else
	{
	while (curr && !curr->children[c - 'a'])
	curr = curr->fail;
	if (!curr)
	curr = root;
	else
	curr = curr->children[c - 'a'];

	}
	//we follow fail path to find all matches, since the substring we
	//are matching might contain a target word, for example: ACG, C
	//when we are matching AC and actually C is a target word which is on
	//failed path
	Node* helper = curr;
	while (helper != NULL)
	{
	if (helper->isWord)
	res.insert(curr->word);
	helper = helper->fail;
	}
	}
	}

	private:
	Node* root;

	void clear(Node* curr)
	{
	if (!curr)return;
	for (int i = 0; i < 26; ++i)
	clear(curr->children[i]);
	delete curr;
	}
	};

	int main()
	{
	ACAutomaton ac;
	ac.insert("abce");
	ac.insert("abdexy");
	ac.insert("def");
	ac.insert("xya");
	ac.insert("xyx");
	ac.insert("xy");
	ac.buildFailed();
	unordered_set<string> res;
	//res = {"abce", "xya", "xy", "xyx"}
	ac.query("abcexyxya", res);
	res.clear();
	//res = {"def", "xy"}
	ac.query("abdefxy", res);
	}

view raw Aho–Corasick.cpp hosted with ❤ by GitHub

今際の国の呵呵君

Thursday, October 11, 2018

[LeetCode]Aho–Corasick algorithm/AC自动机

No comments:

Post a Comment