I Love Regular Expressions

And You Should, Too

Regular Expressions (aka regex or regexp) have existed since the 1950s and entered the tech world via Unix text-processing utilities. Perl was one of the first programming languages I learned and you can't seem to learn Perl without learning about regex. I've noticed that many developers with a different upbringing than mine don't use regex and don't seem to like it. I want to try to convert a few non-believers with this blog post.

What exactly is a "regular expression"?

Let me start by just writing a simple one and then I'll break it down. Let's say I have the official Scrabble dictionary as a text file. Each word is on a line by itself and happens to be uppercase. I am using Notepad++ which has regex support built into the search feature. Let's start with some searching examples. How many words in the Scrabble word list are exactly 5 characters long? Answer: 8,938. I found that count by using this expression: ^[A-Z]{5}$

The caret at the front indicates that I only want matches that begin at the start of a line (each word is on its own line here). The [A-Z] is a character set, or range, representing ONE character. In this case, I specified that I am looking for characters from A to Z inclusive, which limits the matching to uppercase letters. The number within the curly braces indicates how many consecutive characters from that set I am looking for. Finally, the dollar sign indicates that the match must also run to the end of the line. If I didn't anchor the start and the end, this regex would match any run of 5 uppercase letters inside words with 5 or more letters.
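If you would rather try this in code than in an editor, here is a minimal Python sketch of the same search. The tiny word list is a made-up stand-in for the real Scrabble file, and I'm assuming the re.MULTILINE flag to get the line-by-line behavior that the editor search gives you.

```python
import re

# Hypothetical stand-in for the Scrabble word list: one uppercase word per line.
words_text = "CAT\nHOUSE\nREGEX\nZEBRAS\nQUIZ\nPERLS\n"

# re.MULTILINE makes ^ and $ match at the start and end of every line,
# not just at the start and end of the whole text.
five_letter = re.compile(r"^[A-Z]{5}$", re.MULTILINE)
matches = five_letter.findall(words_text)
print(matches)       # ['HOUSE', 'REGEX', 'PERLS']
print(len(matches))  # 3

# Without the anchors, the pattern also matches inside longer words.
unanchored = re.compile(r"[A-Z]{5}")
partial = unanchored.findall("ZEBRAS")
print(partial)       # ['ZEBRA']
```

The last two lines show why the anchors matter: ZEBRAS is six letters long, but the unanchored pattern still finds five of them.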

I Don't Play Scrabble, So...

Besides cheating at Scrabble or Wordle, what can you do with regex? In my experience, I find regex most useful outside of application development. I've used regex to create URLs, add parameters to SQL scripts, and extract values from log files.

For example, let's say a QA gives me a list of record IDs related to a problem in production. The list is one integer ID per line and I want to select rows from a database for those values. If it is only a few IDs, it isn't hard to write "select * from records where ID in (1,2,3);" If there are more IDs, though, and if you may need to delete or update those records, crafting the SQL can be a bit tedious and error-prone.

Using regex, I can match those IDs on each line using "(\d+)\r\n" and then replace them with "\1,". The "\r\n" part causes the carriage return and line feed characters to be included in the match. (Those two characters are typical for Windows-based text files.) The "\d" is a shorthand way of matching a single digit. It could also be written as [0-9]. The plus symbol indicates that the preceding element must appear one or more times, so "\d+" matches a run of one or more digits. The special part is the parentheses. They allow me to reference the enclosed part of the match in the replace pattern. So, I can change:

1
2
3
10
12
33
45

To this:

1,2,3,10,12,33,45
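The same find-and-replace can be sketched in Python with re.sub, which also understands "\1" in the replacement. This is just an illustration: the input string is typed in by hand and I'm assuming it has Windows line endings and no line break after the last ID. If the file does end with a line break, you get a trailing comma that needs to be trimmed.

```python
import re

# One ID per line with Windows-style line endings, as in the list above.
ids_text = "1\r\n2\r\n3\r\n10\r\n12\r\n33\r\n45"

# The parentheses capture the digits; \1 in the replacement puts them back,
# followed by a comma in place of the line ending.
joined = re.sub(r"(\d+)\r\n", r"\1,", ids_text)
print(joined)  # 1,2,3,10,12,33,45

# Drop the result into the query from earlier.
query = f"select * from records where ID in ({joined});"
print(query)   # select * from records where ID in (1,2,3,10,12,33,45);
```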

Or, if these records were defective and needed to be removed, I could remove the "\r\n" from the find pattern (because I want to put the SQL commands on their own lines) and update the replace pattern with a partial SQL command, like "delete from records where ID = \1;" When executed in Notepad++, I get this:

delete from records where ID = 1;
delete from records where ID = 2;
delete from records where ID = 3;
delete from records where ID = 10;
delete from records where ID = 12;
delete from records where ID = 33;
delete from records where ID = 45;
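Here is a rough Python equivalent of that second replace, again with a hand-typed input. Because the find pattern no longer includes the line endings, they stay where they are and each statement lands on its own line.

```python
import re

# Same IDs as before, one per line with Windows-style line endings.
ids_text = "1\r\n2\r\n3\r\n10\r\n12\r\n33\r\n45"

# Only the digits are matched, so the line endings are left untouched.
statements = re.sub(r"(\d+)", r"delete from records where ID = \1;", ids_text)

lines = statements.splitlines()
for line in lines:
    print(line)
```

Swapping in a different replacement string is all it takes to produce a different kind of script from the same list.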

As another example, I've used regex to form URLs for cURL with a replace pattern like "curl https://host.myapi.com/resource/\1" and get a result like:

curl https://host.myapi.com/resource/1
curl https://host.myapi.com/resource/2
curl https://host.myapi.com/resource/3
curl https://host.myapi.com/resource/10
curl https://host.myapi.com/resource/12
curl https://host.myapi.com/resource/33
curl https://host.myapi.com/resource/45

Then, I just save the text file as a script and execute it. Easy cheesy!

Within an application, I tend to use regex for validation. The nice thing about using regex for validation is that you often don't need to figure it out on your own. There are plenty of examples online. For example, you shouldn't need to write a regex pattern for a UUID, an email address, or a URL. Those have been done with varying levels of complexity. Regex can allow you to validate data in unique ways for your specific use cases.
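To give a feel for what validation looks like in code, here is a small Python sketch that checks whether a string has the usual 8-4-4-4-12 hexadecimal shape of a UUID. The function name is_uuid is made up for this example, and the pattern only checks the layout, not the version or variant bits, so treat it as a starting point rather than a complete validator.

```python
import re

# Shape-only check for the canonical UUID layout: 8-4-4-4-12 hex digits.
UUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def is_uuid(value: str) -> bool:
    """Return True if the whole string looks like a UUID (hypothetical helper)."""
    # fullmatch requires the entire string to match, so ^ and $ are not needed.
    return UUID_PATTERN.fullmatch(value) is not None


print(is_uuid("123e4567-e89b-12d3-a456-426614174000"))  # True
print(is_uuid("not-a-uuid"))                            # False
```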

A great example of how complicated regex can get is email validation. We all know that an email is just a string with an '@' symbol and then a domain, but there are nuances in the RFC definition that can be mind-bending to implement in regex. This book sample page on the O'Reilly website does a great job of showing the layers of complexity that can be added to a regex related to an email address. It gets deep fast!

Link: 4.1. Validate Email Addresses - Regular Expressions Cookbook, 2nd Edition

While the regex example linked above gets gnarly by the end of the page, you don't have to go that deep to find regex useful. I believe that all software developers, analysts, and QAs can benefit from knowing how to use regex to some degree.

I've only introduced the idea of regular expressions here. There are plenty of resources online that will help you learn the topic properly. However, I do have a couple of links to help you on your journey. I hope regex saves the day for you sometime.

Cheers!

Regular Expressions Cheat Sheet by DaveChild

RegExr: Learn, Build, & Test RegEx