Limits of patterns based routing using regular expressions

Let's explore regular expressions routing, and compare it to a different new approach

December 26, 2021

This was part of a bigger article, "Ertuo, allegedly the fastest PHP routing library". I got some good feedback that it will be easier to read if I split it into at least two parts, one on the more theoretical part and another for the actual Ertuo implementation. This is the theoretical part, where I want to dive into routing as a concept, pattern based routing with regular expressions and suggest a new different approach that should produce better results.

What is Routing ?

I am not very good at summarizing things, so let's borrow a definition from ASP.net Docs:

URL routing allows you to configure an application to accept request URLs that do not map to physical files.

In the context of web development, routing is the process of routing HTTP requests to the code that should handle them.

History of Routing

In the early days of the internet, the web was composed mostly of websites with static HTML pages. The way static files are delivered to web visitors is for the web-server to ask the file-system to obtain the requested physical file. The contents of the file are then returned by the web-server with the HTTP response. The role of the file-system is important here as we will discuss it further in the second part of this article.

As the web evolved, the websites started having more dynamic content. Unlike static content, the dynamic content is generated by some code on the server side. To connect the HTTP request with the code that generates the response, we need something like the routing process.

As the web evolved even more there were benefits to make the dynamic content URLs look like the static file URLs. For the visitor it is all the same and they don't care who generates the content and how it is served. However, this was the dawn of the search engine optimization. This is when the URLs started having more weight in the search results algorithms, and people started to so URLs that are more expressive of the content they deliver. This is when the routing process evolved as well. It now needed to connect different parts of the URL to code arguments, and it needed to start looking for patterns in the URLs.

Conventional Routing

Currently, the prevalent type of web request routing is pattern based, or more accurately - regular expression based. This concept is used not only by PHP but a lot of other languages as well. This is why I consider this approach to be the conventional type of routing.

The way it works is somewhat similar no matter the language, the library or package, or the framework.

First you have to declare a list of routes (aka "route collection"). Those routes occasionally will have parameters in them, declared using some formats. These formats are based on regular expressions, and allow to extract these parameters from the URLs and optionally validate their values.

Associated with each route is also a "handler", which is going to be called if the URI is a match for the route.

The actual routing happens when we take the current web request URI and starts to compare it against each from the routes from the route collection. It has to to be valid when compared against the regular expression to have a match. If there is a match, then the "handler" is returned, along with any parameters extracted from the URI.

This concept, the regular expression based routing, is very popular. As I wrote earlier, this is the de-facto standard approach and it is very wide-spread. In PHP land, it is used by the two most popular PHP routing libraries, Symfony Routing and FastRoute. While different, if you explore both projects you will find a lot of a similarities in the way they work and in the features they offer.

Issues with conventional routing

I'll try to describe and illustrate the issues with the regular expression based routing. The problems described below are present in both Symfony Routing and FastRoute, and I guess any other project that uses the same pattern based approach.

Route Collection Size

Most of the issues I am going to talk about here are well known in the web development community. This one perhaps is the most common and obvious one.

Considering how the regular expression based routing process works the biggest issue is the size of the routing collection. As the collection of routes grows in size, the routing will have to examine more routes until a match is found. This is a scale issue. People found about this as their web projects started to grow and more routes were added to their routes collection.

This issue was identified early and both Symfony Routing component and FastRoute have some really great improvements to decrease the negative effect of that. You should take a look at their source code and see the amazing work done there to optimize this. There are things like grouping common patterns into more sophisticated regular expressions, separating the routes without parameters that do not require regular expressions, and a lot more.

Even with these optimizations the issue of scale is still there as in essence the way this works is not changed. Even with better organized regular expressions the routing still has to go through each of the routes until a match is found.

The worst case for this iterative approach is when the route you are looking for is the last route on the route collection list. More precisely, when the route has parameters and it is the last on the list, as the "static" routes are checked first before explore the "dynamic" routes that have parameters in them.

It is an equally bad case when there is simply no match even after going through the whole list of declared routes.

Route Declaration

Declaring the routes is not free.

A large route collection is actually a double whammy. It is not only the routing itself that is impacted by the huge size of the route collection. The route collection itself will take more time to have all of the routes declared in it as well.

Let's do a quick back of the envelope math. With 200 routes, the routing library/package/component will have to call the code for declaring a route 200 times, one for each route. After that, when the routing starts, the web app will have to inspect up to 200 routes until a match is found (or not found at all). This is 400 iterations in total.

Think about it. You are declaring all of these routes but in the end you are actually going to use only the one that you find a match for.

I decided to benchmark how much time Symfony Routing and FastRoute each take just to setup their routing collection. I added a new benchmark case called benchSetup that only tracks the time needed to setup the routes and the "matcher/dispatcher" object that is going to use them.

I ran a the quick benchmark to compare the results from benchSetup against another benchmark case, benchAll. That one involves matching every route from the list of routes. You can read more about the benchmark here and here.

+-------------------------+--------------+--------+------------------+-----------------+
| Case                    | Scenario     | Routes | Time             | Per Second      |
+-------------------------+--------------+--------+------------------+-----------------+
| fast_cached_group_pos   | benchSetup   | 300    | 0.145653 seconds | 2059.689677581  |
| fast_cached_char_count  | benchSetup   | 300    | 0.155296 seconds | 1931.7936805971 |
| fast_cached_group_count | benchSetup   | 300    | 0.155662 seconds | 1927.2518969388 |
| fast_cached_mark        | benchSetup   | 300    | 0.156476 seconds | 1917.2295892014 |
| fast_cached_group_count | benchAll     | 364    | 0.201276 seconds | 1808.4614380494 |
| fast_cached_char_count  | benchAll     | 364    | 0.209884 seconds | 1734.2899452243 |
| fast_cached_group_pos   | benchAll     | 364    | 0.212928 seconds | 1709.4975909348 |
| fast_cached_mark        | benchAll     | 364    | 0.236505 seconds | 1539.0793065141 |
| symfony_compiled        | benchAll     | 364    | 0.307934 seconds | 1182.0713070691 |
| symfony_compiled        | benchSetup   | 300    | 0.257987 seconds | 1162.8491898906 |
| symfony                 | benchSetup   | 300    | 0.391358 seconds | 766.56176095387 |
| fast_char_count         | benchSetup   | 300    | 1.126699 seconds | 266.26455493112 |
| fast_group_pos          | benchAll     | 364    | 1.374404 seconds | 264.84203273423 |
| fast_group_pos          | benchSetup   | 300    | 1.152111 seconds | 260.3915647691  |
| fast_mark               | benchSetup   | 300    | 1.154325 seconds | 259.89214287043 |
| fast_mark               | benchAll     | 364    | 1.402848 seconds | 259.47215851336 |
| fast_char_count         | benchAll     | 364    | 1.558076 seconds | 233.62144499387 |
| fast_group_count        | benchAll     | 364    | 1.623828 seconds | 224.16168136973 |
| fast_group_count        | benchSetup   | 300    | 1.389072 seconds | 215.97153542882 |
| symfony                 | benchAll     | 364    | 1.851425 seconds | 196.60529937776 |
+-------------------------+--------------+--------+------------------+-----------------+

It turns out most of the time spent is indeed in "setup". Here's a table with the percentage calculated of how much time is spent in setting up the routes, versus how much time is spent for the whole operation.

Case	`benchSetup`	`benchAll`	%
fast_cached_group_pos	0.145653 seconds	0.212928 seconds	68.40%
fast_cached_char_count	0.155296 seconds	0.209884 seconds	73.99%
fast_cached_group_count	0.155662 seconds	0.201276 seconds	77.34%
fast_cached_mark	0.156476 seconds	0.236505 seconds	66.16%
symfony_compiled	0.257987 seconds	0.307934 seconds	83.78%
symfony	0.391358 seconds	1.851425 seconds	21.14%
fast_char_count	1.126699 seconds	1.558076 seconds	72.31%
fast_group_pos	1.152111 seconds	1.374404 seconds	83.83%
fast_mark	1.154325 seconds	1.402848 seconds	82.28%
fast_group_count	1.389072 seconds	1.623828 seconds	85.54%

This makes the case for me. It is pretty obvious that while all of the regular expression magic done when it comes to matching the routes, most of the time is spent on reading the declaration of the routes.

Sorting and Priority

It is important how you arrange the routes in the routes collection.

Considering the scaling issues described above it would be beneficial if the most common routes are higher in the route collection. In this way they will be matched early without going through the rest of the list. The more "exotic" and obscure and less popular route options will be buried lower on the routes list.

At the same time, the routes with parameters inside them occasionally will have greedy regular expressions inside them. If you put such a route early in the route collection, it might match an URI that it is not meant for it, as the real route for it is lower on the list.

There is more. The most popular way to declare your routes nowadays in Symfony is to do it right inside the controllers using either annotations or attributes (as of PHP8). There are several objects that help to extract those declarations from the controllers, and collect them in the route collection. As the scanning of the controller classes is indiscriminate, you do not know in what order they will be processed. This means that you do not know in what order the routes will be inserted into the routes collection. To solve this issue, the Symfony Routing component introduced a "priority" setting for the annotated routes.

These sorting constraints require developers to be mindful of the way they arrange the routes in the route collections. This is extra requirement is there not because of the actual routes, but how the routing works.

Limited Flexibility

Regular expressions are really powerful. You can do a lot with them. However, they are still limited to pattern matching and validation.

Seriously, think about it.

You can find a pattern in an URL and extract the value from that pattern. You can also do some extra validation with that pattern to make sure that the extracted value follows that format.

There are some operations that are just require more than regular expressions.

Can you check if the extracted value is within a range for example? No, you just extract the value with the regular expression and pass it to the controller, and it will determine if it is within that range.

Can you call some function or method to validate the value? Here's a lame example: why validate the extracted UUID parameter with a regular expression if you already have such a feature in your web app?

Can you make the validation of one URL parameter dependent on another URL parameter? Another lame example: having year, month and day as URL parameters and you wanting to accept only a valid date, excluding Feb 29th on non-leap years?

Regular expressions are great, but there is a lot more you can do beyond their capabilities.

Slashes

Let's start with trailing slashes. Well, they are not so much a thing nowadays. Well sanitized input will remove the extra slash.

Without that, the job of the regular expression gets more difficult. That's because the regular expression is trying to match the whole request URI, and it needs to be aware of potentially having a trailing slash there.

In reality it should not matter as both cases should be treated the same.

Another issue with slashes is when there are slashes in route parameters.

This falls under the limitations of the regular expressions, but it deserves to be mentioned on its own.

The problem with the slashes comes from the way the route parameters are declared in order to detect them. They use a wider .+ regular expression that only works if it is the last parameter in a route. The pattern will match everything. You can put restrictions on the length of the parameter extracted in some cases, but that is not always applicable.

Incomplete Match

This is a bit exotic. It is about what options we have when the provided URI does not match fully a route, but at least some portion of it does has s match.

It's easier to explain this with an example. Imagine that on my website I have these URIs:

    /about/me
    /about/me/social

Trying to access /about/me/test123 will result in "404 Not Found" as there is no such URL.

How would you actually handle that 404 though? Just show the generic "Not Found" page? That is the traditional approach. Is it a good one? I would argue you can do better in this case. How about if instead you serve the /about/me page plus an error message saying that requested address is not found. This page is the incomplete match for the requested URI, and although there is not a full match, there is an incomplete one.

You can do this with pattern based routing, but that will require introducing even more routes in the route collection making the issue of scale even worse.

Using incomplete matches we can offer a slightly better experience for web visitors than regular 404s. Although I realize this is a very obscure and exotic feature, I do see benefit in it.

Back to the Basics

Are you still here? We are half-way in.

Let's go back to the beginning, and let's look back at the definition of routing

URL routing allows you to configure an application to accept request URLs that do not map to physical files.

The whole deal with routing started because dynamic content wanted to pretend to be static and appear is if it is just a static physical file.

With static files, where does the "routing" happen? The web-server asks the file-system for the file and that is. So, as an oversimplification, we can say that the file-system does the routing.

How file finding works ?

Let's have an over-simplified look at how this works.

The file-system is a hierarchical tree-like structure. The file path to a particular file is composed of folders and occasionally a file at the end. The folders are specially type of file that contains list of what other files and folders are nested inside it.

When you want to find a file (ahm, "routing"), the file-system takes the file path and goes folder into folder until the file is found. You can say that it takes one "step" at each folder looking for the file, it does this "step by step".

This type of "routing" gets slower the deeper the file path is as it needs to go into more folders to find the desired file.

Are the same issues present here ?

Now, let's see if the file-system suffers any of the issues outlined above.

Regarding size, delivering files does not get slower if we have more files in the file system. I know it's obvious, but for the sake of argument I must say it.

There is not a one-to-one way to compare how the declaration is done here vs. the regular expression based routing does it. The files and folders are already declared as they are on the file-system. However, if you look this regrading the type of structure used to do the declaration, the file-system uses a tree while the conventional routing uses a list. This is an important difference between the two. The tree offers a step by step exploration into finding a match, and for each match it only will traverse the list of nodes in the file-system that compose the file path. The list on the other hand requires a very iterative process of trying to match the whole path against the members of the list, AND you need to have the whole and complete list before you start.

Next, sorting plays no role in this process. This is because you break up the file path into folders and you go in depth, and as you do this, you are only working with a very limited set of options as your next "step".

The flexibility part is moot here as that is a dynamic feature and the file-system is very static.

Lastly, incomplete matches are easy to do with a file-system as you can find what part of the file path does really exist.

Forgot the slashes. There are no issues here as slashes are part of the file path syntax. You can use trailing slashes for folders, you can omit them, it is all the same. As the slash is considered a separator, it is not really allowed to be used in file and folder names.

Step by Step Routing

I know I have sharp audience, and you already have figured where I am going with this. It is obvious, isn't it?

We can address all of the shortcomings of the regular expression based routing if we use an approach similar to how file-systems deliver files.

That is with a routing process that:

uses a TREE-like structure to declare the routes
looks for a match going STEP BY STEP on the parts of the URL

It is really no rocket surgery. The tree-like structure will allow us to traverse only a series of nodes in the whole tree when we are looking for a match. The tree-like structure will also allow us to potentially declare only a branch of the tree that we are going to explore, saving us all of the time spent in route collection declaration.

Extra bonus points for introducing more flexibility into the process and not relying solely on regular expressions.

I've done some deep dive research into the existing PHP routes and I haven't found anything using an approach like this one. I might be wrong, so if you have already done something similar to this, please contact me and let's compare notes.

The Next Step

Everything on paper sounds better than in reality. To be able to demonstrate the benefits of the "step by step" routing, there must be an implementation that can explored, tested and benchmarked.

This is where Ertuo comes im. You can read the rest at the sibling article, "Ertuo, allegedly the fastest PHP routing library".

Kaloyan K. Tsvetkov