The application can be found here. To get started, all one needs to do is to input two regular expressions in the textboxes labeled REGEX 1 and REGEX 2 then click the "Check" button. In the example below, the user types in two regular expressions that match strings only containing the letters "a" and "b", and the string must contain at least one "a" and one "b".
Then, after the user clicks on the Check button, the application displays "Equivalent" next to the Check button.
The user now changes REGEX 2 to match strings only containing the letters "a" and "b", with at least one instance of "a". After clicking the Check button, the application displays "Not equivalent" and provides an example of a string which only one of the REGEXs matches.
The application, when displaying counterexamples, uses "~" to denote the empty string. In the example below, REGEX 1 matches the empty string but REGEX 2 does not. So, the two REGEXs are not equivalent.
Finally, this documentation page can be accessed by clicking the icon at the top right of the application.
Checking the equivalence of two REGEXs is an exercise with primarily theoretical applications. And so, compared to the number of supported operations available in modern implementations of REGEXs used for matching, the ones supported by this application is limited but growing. Below is a table of the currently supported operations.
Operation | Symbol | Description |
---|---|---|
match-exact | Any character in the alphabet | A REGEX of this form matches only the string containing exactly one character which is the same as the REGEX. |
Empty String |
~ | The REGEX ~ matches the string s if s is empty. |
Union (Alternation) |
| | The REGEX R1|R2 matches the string s if s is matched by either the REGEX R1 or the REGEX R2. |
Concatenation | None or . |
The REGEX R1R2 (or R1.R2) matches the string s if s can be exactly split into two substrings s=s1s2 such that s1 is matched by the REGEX R1 and s2 is matched by the REGEX R2. |
Kleene Star (zero-or-more) |
* | The REGEX R1* matches the string s if s can be split into n substrings (where n is a nonnegative integer), like s=s1s2s3...sn, such that each substring si is matched by the REGEX R1. |
one-or-more | + | The REGEX R1+ matches the string s if s can be split into n substrings (where n is a positive integer), like s=s1s2s3...sn, such that each substring si is matched by the REGEX R1. R1+ is equivalent to (R1)(R1*). |
zero-or-one |
? | The REGEX R1? matches the string s if s is empty or if the REGEX R1 matches the string s. R1? is equivalent to R1|~. |
In short, the order of precedence for operations is as follows:
Parentheses "()" have the highest precedence in the order of operations. As such, it can always be used to explicitly define the order of operations present in a given REGEX. After parentheses, the next order of precedence are the unary operators, namely the Kleene Star (*), one-or-more (+), and zero-or-one (?). These operators have the same level of precedence, but when present in succession, the leftmost operator takes precedence. For example, a+? is interpreted the same as (a+)?.
The next highest order of precedence is concatenation, then followed by union. And so, a|bc is interpreted the same as (a)|(bc). Further, ab|c is interpreted the same as (ab)|(c). Putting it all together, a|bc* is interpreted the same as (a)|(b(c*)).
Finite-state automata are constructed in order to check for equivalence much like what is needed to check for matching. However, comparing two finite-state automata requires a precisely defined alphabet. As such, the application must maintain a common, well-defined, alphabet between the two REGEXs being compared
Valid characters are the set of characters that the application allows to be part of the alphabet. They are best defined by considering the invalid characters. The invalid characters are whitespace (as defined by Javascript REGEX's "\s"), parentheses: "(",")", the symbols of operators: "|",".","*","+","?", and the symbol for the empty string "~". UTF-8 characters not previously listed as invalid are valid.
When the application is set to use an implicit alphabet, the application infers that the alphabet is composed precisely of the valid characters present in the two REGEX inputs. To set the application to use an implicit alphabet, ensure that the Implict Alphabet checkbox in the settings menu is checked. The settings menu can be accessed by clicking the icon at the top right of the application. An image of a checked Implicit Alphabet checkbox is displayed below.
The application has a checked Implicit Alphabet checkbox by default.
The alphabet can be explicitly set by the user in the settings menu. To access the settings menu, click the icon at the top right of the application. The setting menu looks as below.
By unchecking the Implicit Alphabet checkbox, the options to explicitly adjust the alphabet become available. By default, "a" and "b" are included in the Explicit Alphabet. To add a character into the alphabet, type a valid character into the textbox and click Add button.
After clicking the Add button with a valid character in the textbox, the valid character should appear in the Explicit Alphabet list. Below is what the setting menu looks like after clicking the Add button after the above screenshot.
Attempting to add nothing, an invalid character, or a string that has more than one character into the alphabet will result in no changes and an alert. To remove a character from the alphabet, first type a character in the alphabet into the textbox.
Then, click the remove button. After doing so, the character should be removed from the Explicit Alphabet list as shown below.
Attempting to remove characters not included in the Explicit Alphabet list will not result in any changes.
The application first looks for counterexamples that which match REGEX 1 but not REGEX 2. Below, the user compares two REGEXs: "((0|1)*0)?" (which matches the empty string and binary numbers divisible by 2) and "(0|1(01*0)*1)*" (which matches the empty string and binary numbers divisible by 3). The shortest string which "((0|1)*0)?" matches but "(0|1(01*0)*1)*" does not match should be the binary number that represents 2: "10". This can be found by setting "((0|1)*0)?" as REGEX 1 and "(0|1(01*0)*1)*" as REGEX 2, like below:
On the other hand, the shortest string which "(0|1(01*0)*1)*" matches but "((0|1)*0)?" does not match should be the binary number that represents 3: "11". This can be found by setting "(0|1(01*0)*1)*" as REGEX 1 and "((0|1)*0)?" as REGEX 2, like below:
A counterexample that matches REGEX 2 but not REGEX 1 will only be outputted if the set of strings that REGEX 1 matches is strictly a subset of the set of strings REGEX 2 matches. This can be seen by setting REGEX 1 to be the REGEX matching the empty string and binary numbers divisible by 6, and setting REGEX 2 to be the REGEX matching the empty string and binary numbers divisible by 3. The shortest string which represents a binary number divisible by 3 but not 6 is the string representing the binary number 3: "11".
The number of counterexamples that is outputted can be set by the user in the settings menu. To access the settings menu, click the icon at the top right of the application. The setting menu looks as below.
To change the number of counterexamples, change the number in the Counterexample textbox as below:
Setting the Counterexample textbox to anything which is not a positive integer resets the number of counterexamples to 1. If the number of counterexamples are set to more than 1, then a More button will appear when counterexamples are available as below.
Clicking the More button will reveal counterexamples shown below. In this example, the user set the application to produce 10 shortest counterexamples. As with one counterexample, the application will only find counterexamples which REGEX 2 matches if there are no strings which REGEX 1 matches but not REGEX 2.
In the case where there are less than the set number of counter examples, the More button will reveal all the counterexamples. Below, the user is comparing "a|b|c" and "a*". The only string which the first REGEX matches but the second does not are "b" and "c". The user has set the number of counterexamples to 10, and below is the output once the user clicks the More button.
The application can also display error messages next to the check button when an error is detected within the REGEX inputs. These messages may only appear after clicking the Check button when both REGEX inputs are nonempty. Two types of error messages may appear: a Syntax Error message or an Alphabet Error message.
A syntax error occurs when the application cannot properly parse a REGEX input due to a malformed expression. The application then displays a message indicating a syntax error and in which REGEX input it occurs. Below is an example where a user has inputted an expression with an unmatched closing parenthesis in REGEX 2. The syntax error message appears after the user clicked on the Check button.
The application searches for errors in both inputs, so it may indicate error in both inputs as shown below.
An alphabet error occurs when otherwise well-formed REGEXs appear in the inputs but contain valid characters outside of an explicitly set alphabet. Thus, alphabet errors only occur when the application is set to use an explicit alphabet. The application then displays a message indicating an alphabet error and in which REGEX input it occurs. Below is an example where a user has inputted a well-formed REGEX expression in REGEX 1 but uses the valid character "c" when the alphabet is explicitly set to be: a,b. The alphabet error message appears after the user clicked on the Check button.
The application searches for errors in both inputs, so it may indicate error in both inputs as shown below.