Reynald Oliveria
June 6, 2023 (rev. February 2, 2024)
The applet can be found here.
The problem of checking whether two regular expressions (regexes) are equivalent is probably the problem over which I've toiled the longest, albeit not continuously. Because of the circumstances in which this problem has recurringly presented itself to me, it seems that every year, for the past five years, I have put significant effort into it around this time of year.
I plan to compile everything I have learned about regexes and determining their equivalence in a series of articles. Stylistically, these articles will be a cross between a blog and lecture notes. And hopefully, I can describe an interesting (at least to me) journey starting from computational theory to the development of the algorithm I currently employ, highlighting the pros and cons of other approaches along the way. This naturally leads to a discussion of computational complexity in a surprisingly broad scope.
But for now, I plan to give a brief summary of this project, at the cost of assuming that readers have sufficient background to follow along.
I came across regular expressions, I think, a little late in my programming journey. I had already learned Java and Python, yet I hadn't really used regexes at all. My first exposure to the subject was actually in a Discrete Math course. It was the spring semester of 2018, and I was a junior in high school. The last unit of the course was an introduction to the theory of computation. We started with Finite State Machines and examined their expressive equivalence to Regular Grammars, Regular Languages, and Regular Expressions.
And as part of the last exam, we were required to convert a Deterministic Finite State Automaton (DFA) to a regular expression. We were taught an algorithm to perform such a conversion, but the algorithm involves choices: one can take different paths through it, since the regular expression corresponding to a DFA is not unique. Some students can also simply intuit the regular expression corresponding to a DFA. This left a problem for the teacher who had to grade the exam.
The teacher needed a tool to determine whether two regular expressions are equivalent, so that he need not understand each student's work, but could simply compare the student's final answer to his own. Before the exam was given, he lamented that such a tool had existed, but was no longer supported by browsers. So, I began working on this tool. And it was to my detriment, as he initially thought my answer on the exam was correct, but the tool I created correctly disagreed.
In lamenting the lack of such a tool, the teacher provided a sketch of the "standard" (or so he claims) algorithm to determine the equivalence of regular expressions. It's as follows:
Starting with two regexes R1 and R2:
1. Convert R1 and R2 into NFAs N1 and N2 (e.g., via Thompson's Algorithm).
2. Convert N1 and N2 into DFAs D1 and D2 via the subset construction.
3. Minimize D1 and D2 into minimal DFAs M1 and M2.
4. Check whether M1 and M2 are isomorphic; R1 and R2 are equivalent if and only if they are.
A quick note on step 4, as DFA isomorphism isn't trivially obvious: we can treat M1 and M2 as digraphs whose edges are weighted (labeled) by characters of the alphabet. Then M1 and M2 are isomorphic if there exists a graph isomorphism between them that preserves these labels, maps the start state to the start state, and preserves whether each state is an accept state.
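To make the cost of step 4 concrete, here is a naive Python sketch in that framing. The DFA representation (a transition dictionary, a start state, an accept set) is an assumption of mine for illustration, not anything from the applet; trying every bijection between the state sets is exactly what makes this step exponential in time.

```python
from itertools import permutations

# Hypothetical DFA representation, assumed for illustration (not the applet's):
# states: list of states; delta: dict (state, char) -> state;
# start: start state; accepts: set of accepting states.

def isomorphic(states1, delta1, start1, accepts1,
               states2, delta2, start2, accepts2, alphabet):
    """Naive step 4: try every bijection between the state sets, checking
    that it preserves the start state, accept states, and labeled edges.
    Trying all n! bijections is what makes this exponential in time."""
    if len(states1) != len(states2):
        return False
    for perm in permutations(states2):
        f = dict(zip(states1, perm))  # candidate bijection
        if f[start1] != start2:
            continue  # must map start state to start state
        if any((s in accepts1) != (f[s] in accepts2) for s in states1):
            continue  # must preserve accept/reject status of each state
        if all(f[delta1[(s, c)]] == delta2[(f[s], c)]
               for s in states1 for c in alphabet):
            return True  # every labeled edge is preserved
    return False
```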
So, I never pursued this algorithm, for two reasons. First, I did not know how to minimize DFAs at the time. And second, step 2 is, at worst, exponential in time and space, and step 4 is exponential in time. It did not interest me to learn how to minimize DFAs just to write an algorithm which is doubly exponential. Admittedly, I eventually learned to minimize DFAs, but we'll go over that in the more comprehensive, multi-article coverage of this journey.
In the Discrete Math class, we learned the algorithms necessary for steps 1 and 2 of that first algorithm. But we also learned, in a homework problem, how to construct the complement of a DFA. The construction of the union of two NFAs is part of Thompson's Algorithm. And with complement and union, we can construct the difference of two DFAs using some manipulation and De Morgan's law. But in the current implementation, we use an algorithm that constructs the intersection of two DFAs directly, rather than working only with complements and unions.
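For reference, the complement construction mentioned above is strikingly simple. A minimal sketch in Python, under a complete-DFA representation I'm assuming for illustration (the applet's actual code may differ): since a complete DFA reads every string to exactly one state, flipping the accept set complements the language.

```python
# Hypothetical complete-DFA representation (my own choice, not the applet's):
# states: set of states; delta: dict (state, char) -> state;
# start: start state; accepts: set of accepting states.

def complement(states, delta, start, accepts):
    """Complement a *complete* DFA by flipping which states accept."""
    return states, delta, start, set(states) - set(accepts)

# With union, De Morgan's law then yields intersection:
#   L(A) ∩ L(B) = complement( complement(L(A)) ∪ complement(L(B)) )
# and hence difference: L(A) \ L(B) = L(A) ∩ complement(L(B)).
```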
The algorithm is as follows, starting with two regexes R1 and R2:
1. Convert R1 and R2 into NFAs N1 and N2 via Thompson's Algorithm.
2. Convert N1 and N2 into DFAs D1 and D2 via the subset construction.
3. Construct the difference DFAs D1\2 and D2\1, where D1\2 accepts exactly the strings that D1 accepts and D2 rejects, and vice versa for D2\1.
4. Check whether either D1\2 or D2\1 has a path from its start state to an accept state; R1 and R2 are equivalent if and only if neither does.
This algorithm works because, by construction, there is a path from the start state to an accept state in D1\2 if and only if there is a string that R1 accepts but R2 doesn't. And analogously, the same holds for D2\1.
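Below is a hedged Python sketch of steps 3 and 4 under the same assumed complete-DFA representation (again, not the applet's actual code): the product construction builds D1\2 directly, and a BFS then reads off a shortest witness string, if any exists.

```python
from collections import deque

# Same assumed complete-DFA representation as above:
# delta: dict (state, char) -> state; start: start state; accepts: set.

def difference_dfa(delta1, start1, accepts1,
                   delta2, start2, accepts2, alphabet):
    """Product construction for D1\\2: states are pairs (s1, s2), and a pair
    accepts exactly when s1 accepts in D1 and s2 does not accept in D2."""
    start = (start1, start2)
    delta, accepts, seen = {}, set(), {start}
    stack = [start]
    while stack:
        s1, s2 = pair = stack.pop()
        if s1 in accepts1 and s2 not in accepts2:
            accepts.add(pair)
        for c in alphabet:
            nxt = (delta1[(s1, c)], delta2[(s2, c)])
            delta[(pair, c)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return delta, start, accepts

def find_witness(delta, start, accepts, alphabet):
    """BFS from the start state; the labels along the first path to reach an
    accept state spell a shortest string accepted by one regex but not the
    other. Returns None when no accept state is reachable."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        if s in accepts:
            chars = []
            while parent[s] is not None:  # walk parent links back to start
                s, c = parent[s]
                chars.append(c)
            return ''.join(reversed(chars))
        for c in alphabet:
            nxt = delta[(s, c)]
            if nxt not in parent:
                parent[nxt] = (s, c)
                queue.append(nxt)
    return None
```

Calling find_witness on both D1\2 and D2\1 then decides equivalence: R1 and R2 are equivalent exactly when both calls return None.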
Compared to the first algorithm, we skip the worst-case-exponential-time step that checks for graph isomorphism. There's still step 2, which is, at worst, exponential in both time and space. But can we do better?
Determining the equivalence of regular expressions is in PSPACE (in fact, it's PSPACE-complete). In other words, there exists an alternative algorithm which takes only polynomial space, even in the worst case. So why didn't we use it for this app?
Well, what pushes the current algorithm into exponential space is the subset construction in step 2. But in "most" cases, step 2 doesn't hit the exponential blow-up, and when it doesn't, the algorithm runs well below exponential time, quite quickly even in just your browser. However, as we will see, the theoretically better algorithm always runs in exponential time.
While I somewhat simplify, this section will be the most theoretical and perhaps the least relevant to most of the audience. I will now describe this algorithm. First, step 2 of the current algorithm is exponential in space because the DFAs D1 and D2 have, at worst, a number of states exponential in the combined lengths of R1 and R2. Therefore, the differences D1\2 and D2\1 also have a number of states exponential in the lengths of the regexes.
Suppose R1 and R2 are not equivalent. As discussed, we can find a string on which they differ by performing BFS on either D1\2 or D2\1. The edges along the path found by BFS correspond, via their transition characters, to the characters of a string on which the two regexes differ.
Thus, if R1 and R2 are not equivalent, they differ on some string whose length is at most exponential in the lengths of R1 and R2: a shortest accepting path never repeats a state, so its length is less than the (at most exponential) number of states of the difference DFA. Now, to count up to a number exponential in the lengths of R1 and R2, we need only a counter whose size is polynomial in the lengths of R1 and R2. Think: 10^n has n+1 digits, i.e., displaying every number from 1 to 10^n requires at most n+1 digits.
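A quick spot-check of that digit-counting claim in Python:

```python
# A counter that must reach 10**n fits in n+1 decimal digits;
# spot-check a few values of n rather than every number up to 10**n.
for n in (1, 5, 50):
    assert len(str(10**n)) == n + 1  # 10**n itself has n+1 digits
    assert len(str(10**n - 1)) == n  # everything below it has at most n
```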
Now, step 1 in the current algorithm (Thompson's Algorithm) produces NFAs whose numbers of states grow linearly with the lengths of the regexes. So here is what we'll do. First, we run both NFAs on the string with 0 characters, keeping track of just the set of states each NFA could be in after 0 characters. If one NFA can land on an accept state and the other can't, then we can conclude that the two regexes are not equivalent.
We now have all the states each NFA could be in after 0 characters. With that information, we can take one "step" and obtain all the states each NFA could be in after 1 character. Again, we check for a discrepancy. We continue in this way until we reach the length at which, as discussed, a string on which R1 and R2 differ is guaranteed to exist if they are not equivalent.
At each step, we keep track of at most all the states of each NFA. Since the number of states in each NFA grows linearly with the lengths of the regexes, keeping track of them keeps us within polynomial space. Likewise, as discussed, a counter reaching the necessary exponential length takes only a polynomial amount of space.
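To illustrate, here is a Python sketch of the core subroutine, under a Thompson-style NFA representation I'm assuming for illustration (not the applet's code): simulating both NFAs on a given string while storing only the two current state-sets. This is the part that stays within polynomial space; the full algorithm wraps it in a search over strings up to the exponential length bound.

```python
# Hypothetical Thompson-style NFA representation, assumed for illustration:
# moves: dict (state, char) -> set of states, with char '' for epsilon moves;
# start: start state; accepts: set of accepting states.

def eps_closure(moves, states):
    """All states reachable from `states` using only epsilon ('') moves."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in moves.get((s, ''), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def step(moves, states, char):
    """One character of simulation: follow `char` edges, then epsilon edges."""
    return eps_closure(moves, {t for s in states
                               for t in moves.get((s, char), ())})

def differs_on(moves1, start1, accepts1, moves2, start2, accepts2, string):
    """Run both NFAs on `string` in lockstep, storing only the two current
    state-sets, and report whether exactly one of the NFAs accepts."""
    s1 = eps_closure(moves1, {start1})
    s2 = eps_closure(moves2, {start2})
    for c in string:
        s1 = step(moves1, s1, c)
        s2 = step(moves2, s2, c)
    return bool(s1 & accepts1) != bool(s2 & accepts2)
```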
Now, in terms of time complexity, no algorithm has yet been found that checks the equivalence of regexes in polynomial time. Finding one would imply that PSPACE = P, which in turn implies that P = NP, settling a famously open problem. But as we can see, this algorithm, which requires at most polynomial space, requires exponential time whenever the regexes are in fact equivalent, since the counter must run all the way to the exponential bound before equivalence can be concluded. Thus, as a practical matter, I chose the current algorithm over this theoretically better one.