Master regex: From basics to advanced pattern matching - Presentation
Video transcript
We utilized ChatGPT to enhance the grammar and syntax of the transcript.
Hello everybody! Wow, you're really awake. Thank you so much for being here to demystify regular expressions with me. I'm sure many of you already have some knowledge about them. This is very much Regex 101, and I should note that I’ll be referring to it as "regex." Apologies to those who say "regx" or something else. I'm not entering the GIF/JIF pronunciation debate here! We can discuss that after; I have opinions.
You can follow along if you'd like. This session was designed as a workshop where participants would engage and do things as we go. Everything is in the GitHub repo, and I also encourage you to open regex101.com. I’ll be showing you that in just a moment. The slides are available as well, so if you want to review them later, feel free.
A couple of disclaimers: you don't need advanced math to understand regular expressions. Yes, they originate from mathematical theory, but you don’t need to be a math expert to grasp them. It’s more about learning and practicing, so I encourage you to try it out on your own to really get the hang of it.
I’m going to show you a tool. Unfortunately, I don’t have time today to go through every example and show you how it works, but I encourage you to explore it on your own. It's my favorite, and the repo is structured accordingly, with an informative README, slides from my colleague Paul, who is the original author of this talk, and some text examples. There's plenty of content, so feel free to dive in and test it out.
Now, what's this? Oh, I’ve encountered a demo effect—how fun! I'm locked out of my slides, which is fantastic. But don’t worry, I’m a professional—don’t try this at home. Oh, it looks like I’m still sharing my screen—great for security as well.
I’m sure you’re eager to learn about my password, but let’s move on. Sorry about that, folks. If you’ve never seen someone crash live, here I am, happy to oblige.
Regular expressions come from math. Essentially, they are a way to describe language through mathematical equations. While they originate from the mathematical world, you don’t need to be a mathematician to understand them. And now that my screen is back to normal, let’s continue.
Typically, this presentation takes 50 minutes, but I’ve only got 35, and I’ve already lost five minutes. Yay!
So, a bit of history: regular expressions were created by Stephen Cole Kleene in 1951. They describe language, as I’ve mentioned, and they’re essentially just a mathematical way to explain language. However, what you really want to know is, what exactly are regular expressions? By the end of this session, you will understand that they are sequences of characters that specify a search pattern. That’s what we’re doing: creating a pattern to search through vast amounts of text for the specific pieces we’re interested in.
It’s important to note that regular expressions are not a programming language. They are not difficult to learn, and I promise you can do it. I have a degree in English language, and if I can master them, so can you. However, they’re also not a perfect solution to every problem. Sometimes, using regular expressions can be more costly than other methods, like a simple SQL query. They’re a great tool, but you’re responsible for how you use them.
Now, for some fun—here’s the obligatory comic to lighten the mood. I’ll give you a moment to enjoy the artistry.
So, what can you use regular expressions for? The most obvious use is finding text. But did you know you can use regular expressions in Google Docs and Word? Yes, you can! They’re also useful for validating text, like email validation. You can even use them in Excel and Google Sheets.
Now that you know where you can use it, let’s talk about how to use it. There are different types of characters: literal characters, special characters, character classes, shorthand character classes—there are characters everywhere! The syntax is what makes it fun, and we’ll go over all of these, so hang on to your seat because it’s going to go fast.
Literal characters are pretty straightforward. If you’re looking for "foo," that’s a valid regular expression. See, you were already doing it without even realizing it! But if all you need is a basic text search, why use regular expressions? Let’s explore more interesting characters.
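To make that concrete, here is a minimal sketch using Python's built-in re module. Python is my choice for illustration, and the sample sentence is invented. It shows that a literal pattern such as "foo" behaves like a plain substring search.

```python
import re

# Sample sentence invented for this sketch.
text = "I ordered foo, then more food."

# A literal pattern matches exactly the characters it spells out,
# wherever they occur (including inside "food").
literal_hits = re.findall(r"foo", text)

# For a plain literal, an ordinary substring count finds the same occurrences.
substring_count = text.count("foo")

print(literal_hits)      # ['foo', 'foo']
print(substring_count)   # 2
```

Both approaches find the same two occurrences, which is why a literal on its own rarely justifies reaching for a regular expression.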
Delimiters are common; I’m sure you’ve seen them. If you visit regex101.com, this is the default. Delimiters tell the engine where the pattern begins and ends. In PHP, for example, the slash is most common. Different engines will handle them differently, so always check which engine you’re using, as it can affect your results.
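As one illustration of how engines differ, here is a small sketch in Python, which is my choice of language rather than anything from the slides. Python's re module has no delimiters at all: the pattern is an ordinary string and flags are passed separately, so a PHP-style /foo/i would look roughly like this.

```python
import re

# No delimiters in Python: the pattern is a plain (raw) string,
# and the case-insensitive flag is passed as an argument instead of a trailing "i".
pattern = re.compile(r"foo", flags=re.IGNORECASE)
found = pattern.search("Some FOO here")

# Because "/" is not a delimiter here, it is just a literal character.
slash_hits = re.findall(r"/", "a/b/c")

print(found.group())  # FOO
print(slash_hits)     # ['/', '/']
```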
Special characters, also known as metacharacters, are the main reason you’re here today. There are 12 of them, and that’s a lot! Here they all are—we’ll go over each one.
First, we have anchors. The caret (^) is an anchor that signifies the beginning of a string or line. For example, the regular expression "^The" looks for lines that start with "The." The dollar sign ($) is another anchor, but it signifies the end of the string or line. The same pattern gives different results depending on whether you anchor it at the beginning or at the end, so it’s important to be mindful of that. Anchoring helps make your searches more efficient and accurate.
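Here is a rough sketch of both anchors in Python's re module. The three sample lines are made up, and the re.MULTILINE flag is a Python detail, not something from the slides: it is what makes ^ and $ work per line rather than only at the very start and end of the whole string.

```python
import re

# Three invented lines of text in a single string.
lines = "The cat sat.\nIn the end.\nThe end"

# With MULTILINE, ^ matches at the start of every line.
starts = re.findall(r"^The", lines, flags=re.MULTILINE)

# With MULTILINE, $ matches at the end of every line.
# "In the end." does not count because a period follows "end".
ends = re.findall(r"end$", lines, flags=re.MULTILINE)

# Without the flag, ^ only matches at the very start of the string.
string_start_only = re.findall(r"^The", lines)

# Filtering whole lines that start with "The".
matching_lines = [ln for ln in lines.splitlines() if re.search(r"^The", ln)]

print(starts)             # ['The', 'The']
print(ends)               # ['end']
print(string_start_only)  # ['The']
print(matching_lines)     # ['The cat sat.', 'The end']
```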
Next, we have character classes. The opening square bracket ([) defines the beginning of a character class, which lets you specify a set or range of characters to match. For example, "[a-z]" finds any lowercase letter. Inside a character class, you don’t need to escape special characters, except for a few, which are listed here.
Negation is another concept—using the caret inside a character class negates the characters inside. For example, "[^a-z]" would match anything that is not a lowercase letter.
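A quick sketch of a character class and its negation, again assuming Python's re module and an invented sample string. The last line also shows that metacharacters such as the dot and the dollar sign are taken literally inside the brackets.

```python
import re

# Invented sample string.
sample = "Room 42b, Floor 3"

# [a-z] matches one lowercase ASCII letter at a time.
lowercase = re.findall(r"[a-z]", sample)

# [^a-z] matches any single character that is NOT a lowercase ASCII letter,
# including spaces, digits, punctuation, and uppercase letters.
not_lowercase = re.findall(r"[^a-z]", sample)

# Inside a class, "." and "$" lose their special meaning.
literal_metas = re.findall(r"[.$]", "a.b$c")

print("".join(lowercase))      # oombloor
print("".join(not_lowercase))  # R 42, F 3
print(literal_metas)           # ['.', '$']
```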
Shorthand character classes, like "\d" for digits, are useful for simplifying your patterns. There are many shorthand classes, and they can make your expressions more readable.
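As a small illustration, here are three common shorthand classes in Python's re module. The sample text is invented, and \w and \s are shorthands I am adding alongside the \d mentioned above. Note that in Python 3, \d can also match non-ASCII digits, so the comparison with [0-9] holds for this ASCII input but not necessarily for every input.

```python
import re

# Invented sample text.
entry = "Order 66 shipped on 2024-05-01"

# \d matches a digit; with + it grabs runs of digits.
digits = re.findall(r"\d+", entry)

# \w matches a "word" character (letters, digits, underscore).
words = re.findall(r"\w+", entry)

# \s matches a whitespace character.
spaces = re.findall(r"\s", entry)

# For this ASCII-only input, \d+ and [0-9]+ find the same things.
same_as_range = re.findall(r"[0-9]+", entry) == digits

print(digits)       # ['66', '2024', '05', '01']
print(words)        # ['Order', '66', 'shipped', 'on', '2024', '05', '01']
print(len(spaces))  # 4
```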
The dot (.) is a wildcard that matches any character except a line break. For example, "b.r" could match "bar," "bir," "bur," etc. However, use it with caution, as it can lead to unintended matches.
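Here is a sketch of why the dot needs caution, assuming Python's re module and an invented test string. The dot happily matches a space, which may not be what you wanted, and it only matches a line break if you ask for that explicitly (re.DOTALL is the Python flag for it).

```python
import re

# Invented test string: note "b r" (with a space), "b\nr" (with a newline), and "br".
words = "bar bir bur b r b\nr br"

# The dot matches any single character except a line break,
# so "b r" is matched too, but "b\nr" and "br" are not.
hits = re.findall(r"b.r", words)

# With DOTALL, the dot also matches the newline.
hits_dotall = re.findall(r"b.r", words, flags=re.DOTALL)

print(hits)         # ['bar', 'bir', 'bur', 'b r']
print(hits_dotall)  # ['bar', 'bir', 'bur', 'b r', 'b\nr']
```

If only letters should sit between the "b" and the "r", a character class such as "b[a-z]r" is the safer choice.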
The pipe (|) is for alternation, meaning "or." For example, "cat|dog" matches either "cat" or "dog." However, in most engines the alternatives are tried from left to right and the first one that matches wins, so be mindful of the order of your expressions.
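To see why the order matters, here is a minimal sketch in Python's re module. The "cat|catalog" pair is my own example, not one from the slides. Python tries the alternatives from left to right, so a shorter alternative listed first can win over a longer one.

```python
import re

# "cat|dog" matches either word, including the "cat" inside "catalog".
pets = re.findall(r"cat|dog", "cat, dog, catalog, bird")

# Left alternative is tried first: "cat" matches, so "catalog" is never attempted.
short_first = re.search(r"cat|catalog", "catalog").group()

# Putting the longer alternative first changes the result.
long_first = re.search(r"catalog|cat", "catalog").group()

print(pets)         # ['cat', 'dog', 'cat']
print(short_first)  # cat
print(long_first)   # catalog
```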
Quantifiers, like the question mark (?), asterisk (*), and plus sign (+), allow you to specify how many times a character or group should appear. For example, "fo?" matches "f" followed by zero or one "o." The asterisk matches zero or more times, and the plus sign matches one or more times.
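The three quantifiers side by side, as a rough sketch in Python's re module with an invented sample string. Each pattern is an "f" followed by a quantified "o".

```python
import re

# Invented sample: "f" followed by zero, one, two, and three "o" characters.
sample = "f fo foo fooo"

# ? means zero or one "o": at most one "o" is consumed each time.
optional = re.findall(r"fo?", sample)

# * means zero or more "o": a bare "f" still matches.
star = re.findall(r"fo*", sample)

# + means one or more "o": the bare "f" no longer matches.
plus = re.findall(r"fo+", sample)

print(optional)  # ['f', 'fo', 'fo', 'fo']
print(star)      # ['f', 'fo', 'foo', 'fooo']
print(plus)      # ['fo', 'foo', 'fooo']
```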
Finally, we have grouping with parentheses, which allows you to group parts of your pattern and apply quantifiers or alternations to the group. For example, "(foo|bar)" matches either "foo" or "bar."
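A short sketch of grouping, assuming Python's re module and test strings of my own. It shows the group capturing which alternative matched, and how the parentheses change what a quantifier applies to.

```python
import re

# The group captures whichever alternative matched.
either = re.fullmatch(r"(foo|bar)", "bar")
captured = either.group(1)

# With parentheses, + repeats the whole group.
repeated = re.fullmatch(r"(foo|bar)+", "foobarfoo")

# Without parentheses, + applies only to the final "r",
# and the pipe splits the pattern into "foo" or "bar+".
ungrouped_rrr = re.fullmatch(r"foo|bar+", "barrr")
ungrouped_mix = re.fullmatch(r"foo|bar+", "foobar")

print(captured)                    # bar
print(repeated is not None)        # True
print(ungrouped_rrr is not None)   # True
print(ungrouped_mix is not None)   # False
```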
In conclusion, regular expressions are a powerful tool that allows you to perform complex searches and text manipulations. With practice, you can master them and use them effectively in your work.
Thank you so much for your attention. I hope you learned something valuable today.