今際の国の呵呵君: [LeetCode]Regular Expression Matching

String问题的话，总是很容易想到DP。简单的正则匹配也可以用DP来做，不过在用DP之前，由于匹配模式的多种可能，其实我们也可以在解空间遍历所有可能的分支，然后根据需要的情况剪枝，当然backtracking不如DP高效，不过我们可以先试试这个方法。

基本的情况是三种，我们假设string s是要匹配的字符串，string p是RE，那么当s的第i个之前和p的第j个之前可以匹配的话，我们会出现以下的情况：

....../a, ....../c 两个char不匹配
....../a, ....../a 或者 ....../a, ....../. 两者匹配
....../a, ....../*

大体上我们会遇到以上三种情况，当然在遇到第三种情况时我们不能遇到了*再进行branch，我们应该在*前面的那个字符开始branch。比如：abd，abc*d，如果我们i到了d并且j到了c的位置，这时候我们发现不同就return false了，所以我们应该提前看下一位是不是*。

对于第一种情况，我们简单的返回false即可，因为没有办法匹配了。
对于第二种情况，可以匹配，我们接着去看i + 1和 j + 1位。
对于第三种情况，就是我们branch的来源，*可以造成多种匹配的模式，首先我们假设j到了某一位并且发现下一位是*, 这时候假设从i开始有一个长度为d的串，j位都可以匹配。比如：....../aaaa，....../a* 或者 ....../abcd，....../.* 这样的模式。长度d可以是>=0的任何数字，当然在string 长度范围内。那么就有(char*)这样一个RE（不是表示指针。。。）可以匹配0, 1, 2..., d个char，总共d + 1个分支。

那么DFS的终止条件是什么？首先假设我们到了p串的末尾，也就是匹配完了最后一位。发现s串也正好到了末尾，那么当然匹配成功，返回true。相反如果我们发现s串没有到末尾，p串没有其他的char继续匹配了，返回false。另一种情况是，如果我们到了s的末尾，但我们没有到p的末尾（到p末尾的情况跟上面的是一样的），那么匹配是还有可能进行的，只要p的后面全部是(char*)这样的RE。那么也就是说，如果我们到了s的末尾，j + 1不是*，我们返回false。相反，如果是*，我们还要让程序继续运行下去，让i去匹配j + 2。

代码如下：

	public class Solution {

	public boolean isMatch(String s, String p) {
	return dfs(s, p, 0, 0);
	}

	private boolean dfs(String s, String p, int i, int j) {

	//if we arrive at the end of p, but we have not arrived at the end of s,
	//they will not match since it is impossible to match the rest part of s
	//On the other hand, if we arrive at the end of s, but we have not arrive
	//the end of p, it still may match, if next is [a-z]*, if not they are
	//still not match. So when we arrive at the end of s, we must judge it
	//to be false, when next char of p is not *. Namely, when we arrive at the
	//end of s, we will still keep the program on going
	if (j == p.length())
	return i == s.length();


	//when j + 1 is not '*'
	//j cannot be * since we will skip '*'
	//when j is at last, the next cannot be '*'
	if (j == p.length() - 1 \|\| p.charAt(j + 1) != '*') {
	if ( i < s.length() && (s.charAt(i) == p.charAt(j) \|\| p.charAt(j) == '.'))
	return dfs(s,p, i + 1, j + 1);
	else // if we arrive at the end of s and we find they are not match, namely the next of p is not * and i doesn't match j
	return false;
	}

	//when j + 1 is '*'
	while (i < s.length() && (s.charAt(i) == p.charAt(j) \|\| p.charAt(j) == '.')) {
	//if one match, then it match. Using or is not convenient, use if
	if (dfs(s, p, i, j + 2))
	return true;
	i++;//i match, i + 1 match, i + 2 match....
	}

	//we should also dfs the one which doest't match the jth character in p or i
	//reaches s end, for example, "aab", "cab" and "aa", "ab"
	//if we still not arrive at the end of p, but we have arrived at the end of s,
	//we keep going and find if the the next one can still matching, until we arrive
	//at the end of p, or we find that they are not match
	return dfs(s, p, i, j + 2);
	}
	}

view raw Regular Expression Matching DFS.java hosted with ❤ by GitHub

接下来是DP的做法，有了上面的分析，我们的思路就更加明确了我们用[i, j]来表示s的前i个char和p的前j个char是否匹配，那么对于以上三种情况：

对于第一种情况，赋值false即可。
对于第二种情况，[i, j] = [i - 1, j - 1]
对于第三种情况，首先如果[i, j - 1]匹配，[i, j]自然也匹配(对应(char*)匹配一个字符的情况)。如果[i , j - 2]匹配，那么[i , j]也匹配(对应(char*)匹配0个字符的情况)。再者，如果[i - 1, j]匹配，[i, j]也匹配(对应(char*)匹配s[0, i]末尾多个字符的情况)。

代码如下：

	public class Solution {
	public boolean isMatch(String s, String p) {
	if (s == null \|\| p == null)
	return false;
	int len2 = s.length();
	int len1 = p.length();
	boolean[][] arr = new boolean[len1 + 1][len2 + 1];
	arr[0][0] = true;
	for (int i = 0; i < len1; i++) {
	if (p.charAt(i) != '*')
	arr[i + 1][0] = false;
	else
	arr[i + 1][0] = arr[i - 1][0];
	}
	for (int i = 0; i < len1; i++) {
	for (int j = 0; j < len2; j++) {
	if (p.charAt(i) == s.charAt(j) \|\| p.charAt(i) == '.')
	arr[i + 1][j + 1] = arr[i][j];
	else if (p.charAt(i) == '*') {
	arr[i + 1][j + 1] = arr[i - 1][j + 1] \|\| arr[i][j + 1] \|\| (arr[i + 1][j] && (p.charAt(i - 1) == s.charAt(j) \|\| p.charAt(i - 1) == '.'));
	} else
	arr[i + 1][j + 1] = false;
	}
	}
	return arr[len1][len2];
	}
	}

view raw Regular Expression Matching.java hosted with ❤ by GitHub

最后给出Regular Expression的普遍解法，具体的详解可以参考Princeton Algorithms一书的相关章节。简单的概括就是根据p字符串建立NFA，NFA和DFA的区别就是根据输入DFA每次都要相应的转换状态，而NFA没有输入也可以转换状态，也就是说NFA维护的是一个状态的集合，当有输入导致状态转移的时候，所有的维护的状态都要看是否可以转换到下一个状态，转换完之后再看没有输入的话还可以到哪哪些状态，再把这些状态维护起来。继续下一个输入。这里我们用的是Graph来模拟NFA，注意我们有两种edge，但是我们只需要连接epsilon-transition（也就是不需要输入就能进行的状态转移）的edge就行。所有的非epsilon-trainsition都是普通char到char的下一个，这个edge可以在我们每次输入字符的时候把他视为连接。如果有可以进行的转移。我们去到char的下一个字符表示的node。注意我们一共有len + 1个nodes，len是p字符串的长度。额外的node放在末尾表示success，到达sucess才表示我们成功匹配。简而言之，NFA的做法可以分为以下几步：

根据p string建立图，连上所有的epsilon-transition代表的edge，不同的符号要连接的epsilon edge不同，具体参考上述书籍。
dfs第一个字符，找到所有能到的状态。
输入s的第一个char，输入char之后找到可以进行的状态转移（对维护的所有状态），去到下一个状态。
对所有新找到的状态进行dfs，维护新的所有可以到达的状态。
重复第三步直到读完s string
在维护的状态中check是否有success，有返回true，否则返回false。

代码如下：

	public class Solution {

	private class Graph {

	private int V;
	private int E;
	private ArrayList<Integer>[] adj;
	private boolean[] visited;

	public Graph(int V) {
	adj = (ArrayList<Integer>[])new ArrayList[V];
	visited = new boolean[V];
	this.V = V;
	E = 0;

	for(int i = 0; i < V; ++i){
	adj[i] = new ArrayList<Integer>();
	}
	}

	public void addEdgeFromTo(int v, int w) {
	adj[v].add(w);
	E++;
	}

	public boolean visited(int v) {
	return visited[v];
	}

	public void dfs(int v) {
	visited[v] = true;
	for(int w : adj(v)) {
	if(!visited[w])
	dfs(w);
	}
	}

	public void reset() {
	for(int i = 0; i < V; ++i){
	visited[i] = false;
	}
	}

	public ArrayList<Integer> adj(int v) {
	return adj[v];
	}

	public int V() {
	return V;
	}

	public int E() {
	return E;
	}
	}

	private char[] reg;
	private int M;
	private Graph G;


	public boolean isMatch(String s, String p) {

	int len = p.length();
	reg = p.toCharArray();
	G = new Graph(len + 1);
	M = len;

	for(int i = 0; i < M; ++i){
	if(i < M - 1 && reg[i + 1] == '*'){
	G.addEdgeFromTo(i, i + 1);
	G.addEdgeFromTo(i + 1, i);
	}
	if(reg[i] == '*'){
	G.addEdgeFromTo(i, i + 1);
	}
	}

	ArrayList<Integer> pc = new ArrayList<Integer>();
	G.dfs(0);
	for(int i = 0; i < G.V(); ++i){
	if(G.visited(i))
	pc.add(i);
	}

	for(int i = 0; i < s.length(); ++i){
	ArrayList<Integer> newpc = new ArrayList<Integer>();
	for(int v : pc) {
	if(v < M && (reg[v] == s.charAt(i) \|\| reg[v] == '.')){
	newpc.add(v + 1);
	}
	}

	G.reset();
	for(int v : newpc) {
	G.dfs(v);
	}

	for(int j = 0; j < G.V(); ++j){
	if(G.visited(j))
	newpc.add(j);
	}

	pc.clear();
	pc = newpc;
	}

	for(int v : pc) {
	if(v == M)
	return true;
	}

	return false;
	}
	}

view raw Regular Expression Matching NFA.java hosted with ❤ by GitHub

今際の国の呵呵君

Tuesday, November 4, 2014

[LeetCode]Regular Expression Matching

No comments:

Post a Comment