REGEX Equivalence Checker Documentation

Basic Functionality
Operations

Supported Operations
Order of Operations

The Alphabet

Valid Characters
Implicit Alphabet
Explicit Alphabet

Counterexamples

Specific Difference
Multiple Counterexamples

Errors

Syntax Errors
Alphabet Errors

Basic Functionality

The application can be found here. To get started, all one needs to do is to input two regular expressions in the textboxes labeled REGEX 1 and REGEX 2 then click the "Check" button. In the example below, the user types in two regular expressions that match strings only containing the letters "a" and "b", and the string must contain at least one "a" and one "b".

Then, after the user clicks on the Check button, the application displays "Equivalent" next to the Check button.

The user now changes REGEX 2 to match strings only containing the letters "a" and "b", with at least one instance of "a". After clicking the Check button, the application displays "Not equivalent" and provides an example of a string which only one of the REGEXs matches.

The application, when displaying counterexamples, uses "~" to denote the empty string. In the example below, REGEX 1 matches the empty string but REGEX 2 does not. So, the two REGEXs are not equivalent.

Finally, this documentation page can be accessed by clicking the icon at the top right of the application.

Operations

Supported Operations

Checking the equivalence of two REGEXs is an exercise with primarily theoretical applications. And so, compared to the number of supported operations available in modern implementations of REGEXs used for matching, the ones supported by this application is limited but growing. Below is a table of the currently supported operations.

Operation	Symbol	Description
match-exact	Any character in the alphabet	A REGEX of this form matches only the string containing exactly one character which is the same as the REGEX.
Empty String	~	The REGEX ~ matches the string s if s is empty.
Union (Alternation)	\|	The REGEX R₁\|R₂ matches the string s if s is matched by either the REGEX R₁ or the REGEX R₂.
Concatenation	None or .	The REGEX R₁R₂ (or R₁.R₂) matches the string s if s can be exactly split into two substrings s=s₁s₂ such that s₁ is matched by the REGEX R₁ and s₂ is matched by the REGEX R₂.
Kleene Star (zero-or-more)	*	The REGEX R₁* matches the string s if s can be split into n substrings (where n is a nonnegative integer), like s=s₁s₂s₃...s_n, such that each substring s_i is matched by the REGEX R₁.
one-or-more	+	The REGEX R₁+ matches the string s if s can be split into n substrings (where n is a positive integer), like s=s₁s₂s₃...s_n, such that each substring s_i is matched by the REGEX R₁. R₁+ is equivalent to (R₁)(R₁*).
zero-or-one	?	The REGEX R₁? matches the string s if s is empty or if the REGEX R₁ matches the string s. R₁? is equivalent to R₁\|~.

Order of Operations

In short, the order of precedence for operations is as follows:

Parentheses
Kleene Star, one-or-more, zero-or-one (left to right as it appears in the REGEX)
Concatenation
Union

Parentheses "()" have the highest precedence in the order of operations. As such, it can always be used to explicitly define the order of operations present in a given REGEX. After parentheses, the next order of precedence are the unary operators, namely the Kleene Star (*), one-or-more (+), and zero-or-one (?). These operators have the same level of precedence, but when present in succession, the leftmost operator takes precedence. For example, a+? is interpreted the same as (a+)?.

The Alphabet

Finite-state automata are constructed in order to check for equivalence much like what is needed to check for matching. However, comparing two finite-state automata requires a precisely defined alphabet. As such, the application must maintain a common, well-defined, alphabet between the two REGEXs being compared

Valid Characters

Valid characters are the set of characters that the application allows to be part of the alphabet. They are best defined by considering the invalid characters. The invalid characters are whitespace (as defined by Javascript REGEX's "\s"), parentheses: "(",")", the symbols of operators: "|",".","*","+","?", and the symbol for the empty string "~". UTF-8 characters not previously listed as invalid are valid.

Implicit Alphabet

When the application is set to use an implicit alphabet, the application infers that the alphabet is composed precisely of the valid characters present in the two REGEX inputs. To set the application to use an implicit alphabet, ensure that the Implict Alphabet checkbox in the settings menu is checked. The settings menu can be accessed by clicking the icon at the top right of the application. An image of a checked Implicit Alphabet checkbox is displayed below.

The application has a checked Implicit Alphabet checkbox by default.

Explicit Alphabet

The alphabet can be explicitly set by the user in the settings menu. To access the settings menu, click the icon at the top right of the application. The setting menu looks as below.

By unchecking the Implicit Alphabet checkbox, the options to explicitly adjust the alphabet become available. By default, "a" and "b" are included in the Explicit Alphabet. To add a character into the alphabet, type a valid character into the textbox and click Add button.

After clicking the Add button with a valid character in the textbox, the valid character should appear in the Explicit Alphabet list. Below is what the setting menu looks like after clicking the Add button after the above screenshot.

Attempting to add nothing, an invalid character, or a string that has more than one character into the alphabet will result in no changes and an alert. To remove a character from the alphabet, first type a character in the alphabet into the textbox.

Then, click the remove button. After doing so, the character should be removed from the Explicit Alphabet list as shown below.

Attempting to remove characters not included in the Explicit Alphabet list will not result in any changes.

Counterexamples

Specific Difference

The application first looks for counterexamples that which match REGEX 1 but not REGEX 2. Below, the user compares two REGEXs: "((0|1)*0)?" (which matches the empty string and binary numbers divisible by 2) and "(0|1(01*0)*1)*" (which matches the empty string and binary numbers divisible by 3). The shortest string which "((0|1)*0)?" matches but "(0|1(01*0)*1)*" does not match should be the binary number that represents 2: "10". This can be found by setting "((0|1)*0)?" as REGEX 1 and "(0|1(01*0)*1)*" as REGEX 2, like below:

On the other hand, the shortest string which "(0|1(01*0)*1)*" matches but "((0|1)*0)?" does not match should be the binary number that represents 3: "11". This can be found by setting "(0|1(01*0)*1)*" as REGEX 1 and "((0|1)*0)?" as REGEX 2, like below:

A counterexample that matches REGEX 2 but not REGEX 1 will only be outputted if the set of strings that REGEX 1 matches is strictly a subset of the set of strings REGEX 2 matches. This can be seen by setting REGEX 1 to be the REGEX matching the empty string and binary numbers divisible by 6, and setting REGEX 2 to be the REGEX matching the empty string and binary numbers divisible by 3. The shortest string which represents a binary number divisible by 3 but not 6 is the string representing the binary number 3: "11".

Multiple Counterexamples

The number of counterexamples that is outputted can be set by the user in the settings menu. To access the settings menu, click the icon at the top right of the application. The setting menu looks as below.

To change the number of counterexamples, change the number in the Counterexample textbox as below:

Setting the Counterexample textbox to anything which is not a positive integer resets the number of counterexamples to 1. If the number of counterexamples are set to more than 1, then a More button will appear when counterexamples are available as below.

Clicking the More button will reveal counterexamples shown below. In this example, the user set the application to produce 10 shortest counterexamples. As with one counterexample, the application will only find counterexamples which REGEX 2 matches if there are no strings which REGEX 1 matches but not REGEX 2.

In the case where there are less than the set number of counter examples, the More button will reveal all the counterexamples. Below, the user is comparing "a|b|c" and "a*". The only string which the first REGEX matches but the second does not are "b" and "c". The user has set the number of counterexamples to 10, and below is the output once the user clicks the More button.

Errors

The application can also display error messages next to the check button when an error is detected within the REGEX inputs. These messages may only appear after clicking the Check button when both REGEX inputs are nonempty. Two types of error messages may appear: a Syntax Error message or an Alphabet Error message.

Syntax Errors

A syntax error occurs when the application cannot properly parse a REGEX input due to a malformed expression. The application then displays a message indicating a syntax error and in which REGEX input it occurs. Below is an example where a user has inputted an expression with an unmatched closing parenthesis in REGEX 2. The syntax error message appears after the user clicked on the Check button.

The application searches for errors in both inputs, so it may indicate error in both inputs as shown below.

Alphabet Errors

An alphabet error occurs when otherwise well-formed REGEXs appear in the inputs but contain valid characters outside of an explicitly set alphabet. Thus, alphabet errors only occur when the application is set to use an explicit alphabet. The application then displays a message indicating an alphabet error and in which REGEX input it occurs. Below is an example where a user has inputted a well-formed REGEX expression in REGEX 1 but uses the valid character "c" when the alphabet is explicitly set to be: a,b. The alphabet error message appears after the user clicked on the Check button.