Rematch Regex Engine
6 February 2025
Introduction
For the past few weeks, I have been working on and off on an ECMAScript flavoured Regex engine. The code for this can be found here.
But why write a Regex engine at all? There are several reasons. The first one being: because it sounded fun! I chose to write the Regex engine in a language called Jai. This language is in closed beta as of writing. The language really connects with me and I'd like to find any excuse to write something in it.
Since it's in closed beta, there's ample opportunity to be the first to invent something. Though I must acknowledge I am not the first to write a Regex engine in Jai, that would be Uniform, which is a (partial) port of Google's RE2 engine. I can still claim to be the first backtracking Regex engine though!
The Essence of Compiling
The first step is to compile the pattern string into something more usable. This step is actually quite straightforward. Let's take the string "Hell.". This is four literal characters, Hell, followed by one wildcard.
We can represent our Regex as an array of steps it has to match against:
Symbol :: enum {
    WILDCARD;
    LITERAL;
}

Step :: struct {
    symbol: Symbol;

    union {
        literal: struct {
            char: int;
        };
    }
}

Regex :: struct {
    steps: [..] Step;
}
In our compilation, we go character by character and manipulate the steps array as necessary. For example, when we encounter a ., we add a WILDCARD step; if we encounter a +, we update the last step's min/max quantifier; and so on.
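To make the loop concrete, here is a hypothetical sketch in Python (rather than Jai) of this compile pass, handling only literals, the . wildcard, and the + quantifier; the Step fields are simplified stand-ins for the structs above:

```python
from dataclasses import dataclass

WILDCARD, LITERAL = "WILDCARD", "LITERAL"

@dataclass
class Step:
    symbol: str
    char: str = ""
    min_times: int = 1
    max_times: float = 1  # float so it can hold infinity

def compile_pattern(pattern: str) -> list[Step]:
    steps: list[Step] = []
    for c in pattern:
        if c == ".":
            steps.append(Step(WILDCARD))
        elif c == "+":
            # Update the previous step's quantifier: one or more times.
            steps[-1].min_times, steps[-1].max_times = 1, float("inf")
        else:
            steps.append(Step(LITERAL, char=c))
    return steps

steps = compile_pattern("Hell.")
# Four LITERAL steps followed by one WILDCARD step.
```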
The Essence of Matching
Now that we have a compiled Regex pattern, let's get to matching! In principle it's quite simple: you loop over every step in your Regex, and if all of them succeed, you will have properly matched it. If you fail, you try again at the next character in the input text.
At first, my matching procedure looked roughly like this:
try_match :: (regex: *Regex, text: string) -> bool {
    p := *text;

    for step: regex.steps {
        match_times := 0;
        for 1..step.quantifier.max_times {
            if p.count == 0 then break;

            success := try_match_single(regex, step, p);
            if !success then break;

            match_times += 1;
        }

        if match_times < step.quantifier.min_times {
            return false;
        }
    }

    return true;
}
try_match_single :: (regex: *Regex, step: *Regex_Step, p: *string) -> bool {
    if step.symbol == {
        case .LITERAL;
            char, char_size := get_codepoint(p);
            if step.literal.char == char {
                next_codepoint(p, char_size);
                return true;
            }

        case .WILDCARD;
            char, char_size := get_codepoint(p);
            next_codepoint(p, char_size);
            return true;
    }

    return false;
}
As we can see, it loops over every step and tries to match it. It repeats the match as many times as the quantifier allows, then checks whether it reached the minimum amount.
There's a bit more code involved. For example, if the input text is exhausted, it should check that all remaining steps have a min quantifier of 0; otherwise there are still required steps left to match. I have omitted this for brevity.
This is all looking quite good, but we're missing something... That's right, capture groups!
Capture groups let you define sub-patterns whose matched substring gets stored. For instance, you can write "Hello (.+)", which would match a string like "Hello world", where you can query capture group 1 to get just world.
We need to change our previous definitions a bit. I moved the steps field from the base Regex struct into a Group struct, and gave the Regex a root group.
Then when compiling, whenever it encounters a (, it pushes a new group to the stack, compiles as normal, and pops it once it encounters ).
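The group stack can be sketched like this (a hypothetical Python illustration, with plain characters standing in for compiled steps):

```python
def parse_groups(pattern: str) -> list:
    root: list = []
    stack = [root]
    for c in pattern:
        if c == "(":
            group: list = []
            stack[-1].append(group)  # the nested group becomes a step of its parent
            stack.append(group)      # compile into the new group from now on
        elif c == ")":
            stack.pop()              # back to the enclosing group
        else:
            stack[-1].append(c)      # ordinary step, compiled as before
    return root

tree = parse_groups("a(b(c)d)e")
# → ["a", ["b", ["c"], "d"], "e"]
```

Error handling for unbalanced parentheses is omitted here.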
This works quite well, and for our matching, I did the following:
// ...
try_match_group :: (regex: *Regex, group: *Regex_Group, p: *string) -> bool {
    for group.steps {
        // Old matching code here
    }
}

try_match_single :: (regex: *Regex, step: *Regex_Step, p: *string) -> bool {
    // ...
    case .GROUP;
        return try_match_group(regex, *step.group, p);
    // ...
}
With a little extra logic where we give each group a capture index and save it in the Regex while try_match_group() is successful, we are matching groups! Wow, it seems like everything is going very smoothly.
Backtracking
As I added more example patterns to try and match on, I came to a realization. This type of Regex engine is called a backtracking engine, but we aren't actually doing any backtracking!
This means that a pattern like ".+A" with input text "FOOBAR" will currently not match. The wildcard consumes the entire string, and then when it wants to match A, the input is exhausted and it fails.
This makes our Regex next to useless! So let's add backtracking to our try_match_group(). Whenever it successfully matches a step, we record which step we were on and what the input string looked like. If we fail to match a step, instead of returning immediately, we first check whether there is anything on our backtracking stack, and if so, restore that state and continue from there.
With this in place, the matching currently goes like this:
| Input | Description |
|---|---|
| "FOOBAR" | .+ tries to match between 1 and infinity times |
| "" | .+ has greedily matched the entire input, so step 2 has nothing left to match |
| "R" | Backtracks one character; tries to match A, fails |
| "AR" | Backtracks another character; tries to match A, succeeds |
| "R" | All steps matched! |
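The mechanism can be sketched as follows, in hypothetical Python rather than the actual Jai code. The matcher records a choice point for every repeat beyond the quantifier's minimum, and on failure pops the most recent one; the (symbol, char, min, max) tuples are a simplification of the real Step struct, and the match is anchored at start:

```python
def try_match(steps, text, start=0):
    pos = start
    backtrack = []  # saved (step_index, pos) choice points
    i = 0
    while i < len(steps):
        kind, ch, lo, hi = steps[i]
        n = 0
        while n < hi and pos < len(text):
            if kind == "LITERAL" and text[pos] != ch:
                break
            # Past the minimum, each extra repeat is a choice we may undo later.
            if n >= lo:
                backtrack.append((i, pos))
            pos += 1
            n += 1
        if n < lo:
            if not backtrack:
                return False      # nothing left to rewind: the attempt fails
            i, pos = backtrack.pop()
            i += 1                # resume at the step after the one we rewound
            continue
        i += 1
    return True

steps = [("WILDCARD", None, 1, float("inf")), ("LITERAL", "A", 1, 1)]
# try_match(steps, "FOOBAR") → True, try_match(steps, "FOOB") → False
```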
Seems like we have added backtracking! There are some edge cases where we need extra logic, like lazy quantifiers. Once a lazy quantifier has reached its min_times, it first tries to match the next step. If that fails, it matches the quantified step one more time and tries again; otherwise it continues to the next step and considers the lazy quantifier matched.
Recursion and Restructuring
At this point I was pretty happy. I was slowly adding in all the syntax and everything seemed okay. But then I tried to match the pattern "(\w+)a" to "aaaaf", and it was not working...
So what is wrong? Well, the backtracking I had implemented only worked inside a single group. As soon as try_match_group() returned, that group's backtracking state was lost, due to the recursive structure.
So how do we fix this? We need to ditch recursion altogether and switch to using an explicit stack. This was quite a refactor since we now needed to keep track of things like the group stack, current captures, and the input text. Before this was neatly handled by the nature of recursion, but now we have to do it ourselves.
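As a rough illustration (hypothetical Python, not the actual refactor), the state that used to live implicitly on the call stack becomes explicit data, with a single backtracking stack that outlives any individual group:

```python
from dataclasses import dataclass, field

@dataclass
class GroupFrame:
    group_id: int        # which compiled group this frame belongs to
    step_index: int      # where we are inside this group's steps
    capture_start: int   # input position where the group began matching

@dataclass
class MatchState:
    pos: int = 0                                    # position in the input text
    groups: list = field(default_factory=list)      # explicit group stack
    backtracks: list = field(default_factory=list)  # shared; survives group exits

state = MatchState()
state.groups.append(GroupFrame(group_id=0, step_index=0, capture_start=0))
```

Because backtracks is owned by the whole match rather than one recursive call, rewinding can jump back into a group that has already "returned".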
Optimizing the Regex
After getting it all to work as expected, I kept building out the test suite. Currently it contains roughly 150 patterns and 500 inputs in total. Running the entire suite in release mode took around 125ms. In comparison, Node/V8 only took 12ms! Surely we can do better.
Removing redundant state
One very stupid thing that I had done was to create backtracking state on every successful match. With just 3 or so lines of code, I now only generate backtracking state when the min and max quantifiers differ, since that is the only time a different amount of input can be consumed.
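Sketched in Python (hypothetical names), the guard is trivial: a step with a fixed {n,n} quantifier has exactly one way to consume input, so there is never anything to rewind to.

```python
def should_save_backtrack_state(min_times, max_times) -> bool:
    # Only a variable quantifier creates alternative amounts of input to try.
    return min_times != max_times

assert should_save_backtrack_state(1, 1) is False            # plain literal
assert should_save_backtrack_state(1, float("inf")) is True  # e.g. ".+"
```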
With just this simple change, the execution time dropped from 120ms to 58ms.
Changing allocation strategy
When optimizing, the general wisdom is to always profile and not make assumptions about what is fast and what isn't. I had broken this rule: I had chosen a custom allocation strategy without actually checking whether it was any faster.
There is a module shipped with the Jai compiler called Flat_Pool. This module gives you a memory-mapped arena that first reserves a large chunk of memory virtually and, through the power of page faults, automatically expands over time.
This approach can be miles faster in certain situations, and I had assumed that Regex matching was one of them. Just allocate everything in the pool, and when we're done matching, reset the pool. We don't even have to worry about freeing the memory!
But that belief was wrong. I switched back to using the default malloc-style allocator, and the execution time dropped from 70ms to 27ms.
Collapsing repeating state
Most of the time, there are multiple backtracking states generated for a single step. For example, \w{2,8} will create backtracking state for up to 6 successful matches.
The only difference in the state is the input text, which will be moved by 1 character. We can add a repeat_count field to our backtracking state and increment that in cases like this. This way we only have to store the state once in these kinds of situations.
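A hypothetical Python sketch of collapsing those states: each stack entry is (step_index, start_pos, repeat_count), and a repeat of the step already on top just bumps the count instead of pushing a new entry.

```python
def push_or_collapse(stack, step_index, pos):
    if stack and stack[-1][0] == step_index:
        idx, start, count = stack[-1]
        stack[-1] = (idx, start, count + 1)  # same step: just count the repeat
    else:
        stack.append((step_index, pos, 1))   # first saved repeat for this step

stack = []
for pos in range(2, 8):          # e.g. the six savable repeats of \w{2,8}
    push_or_collapse(stack, 0, pos)
# One collapsed entry, (0, 2, 6), instead of six.
```

On a rewind, the matcher would decrement repeat_count and recompute the input position from start_pos, popping the entry once the count hits zero.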
This reduces the execution time from 27ms to 23ms.
Collapsing literals
A large part of a Regex is the literals. Instead of holding only a single Unicode codepoint, a literal step can also store a whole substring and match it all at once! The compiler now merges all consecutive literals that have a {1,1} greedy quantifier.
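A hypothetical Python sketch of that merge pass, over simplified (symbol, text, min, max) step tuples:

```python
def collapse_literals(steps):
    out = []
    for sym, text, lo, hi in steps:
        prev = out[-1] if out else None
        # Merge only plain {1,1} literals into the preceding {1,1} literal.
        if (sym == "LITERAL" and (lo, hi) == (1, 1)
                and prev and prev[0] == "LITERAL" and (prev[2], prev[3]) == (1, 1)):
            out[-1] = ("LITERAL", prev[1] + text, 1, 1)
        else:
            out.append((sym, text, lo, hi))
    return out

steps = [("LITERAL", "H", 1, 1), ("LITERAL", "e", 1, 1),
         ("LITERAL", "l", 1, 1), ("LITERAL", "l", 1, 1),
         ("WILDCARD", "", 1, 1)]
collapsed = collapse_literals(steps)
# → [("LITERAL", "Hell", 1, 1), ("WILDCARD", "", 1, 1)]
```

A quantified literal like l{2,3} keeps its own step, since merging it would change the pattern's meaning.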
While this did have less of an impact than I expected, it did still drop the execution time from 23ms to around 20ms. This mainly impacted compilation times (2x speedup) rather than matching times.
These are all the optimizations I have applied so far. As it currently stands, on small inputs, on my machine, it's about 19ms vs 14ms (~1.3x slower than V8/Node). I don't think that's too bad!
On larger input texts (100s of MiBs) however, it is currently dramatically slower. 40x in fact! So what else can we do to fix this?
Needle searching
I am not sure if this is the right name for it, but the optimization is as follows:
- Identify a string literal in your pattern
- Perform a regular string search over your input text
- When it has found a result, match the left side of your pattern in reverse
- If that is ok, match the right side of your pattern normally
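The steps above could be sketched like this in Python, where prefix_pred and suffix_pred are hypothetical stand-ins for the real reverse and forward step matchers:

```python
def needle_search(text, prefix_pred, needle, suffix_pred):
    start = 0
    while True:
        hit = text.find(needle, start)  # fast plain substring scan
        if hit == -1:
            return None
        # Match the left side of the pattern in reverse from the hit...
        if prefix_pred(text[:hit]):
            # ...then the right side of the pattern normally after the needle.
            if suffix_pred(text[hit + len(needle):]):
                return hit
        start = hit + 1                 # false alarm: keep scanning

# E.g. a pattern whose left side must end in "X" and whose right side
# must start with a digit, around the literal needle "FOO":
pos = needle_search("abXFOO9", lambda s: s.endswith("X"),
                    "FOO", lambda s: s[:1].isdigit())
# → 3
```

The win is that str.find-style scans touch the input far faster than stepping the full Regex machinery at every position.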
As of now, I have not added this optimization yet since I will need to refactor some of the matching code to account for reverse matching. At the same time I am thinking of maybe rewriting it as (at least a partial) NFA/DFA.
Closing thoughts
I think that writing your own feature-complete Regex engine is actually quite doable. When I decided that I wanted to do this, I didn't really expect to get super far, but everything was surprisingly straightforward.
The hardest part was switching from the recursive approach to the explicit stack-based one. This especially impacted the way groups were matched, since they now needed special treatment instead of being handled like normal steps.
The final product is around 1700 sloc. Pretty reasonable, if you ask me.