Turing Talks
Issue #6: A Complete Guide To Regular Expressions

January 09, 2022

^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$

^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]{8,}$

/^\d{2}\/\d{2}\/\d{4}$/

Ever seen these weird symbols before? If you are a developer, you might have seen them used for validating passwords, finding dates in text, and in a few other places. These are called regular expressions.

Regular expressions are a powerful concept in computer science. They help us search and extract data from text. This text can include system logs, documents, PDFs, source code, and whatever you want, really!

As long as it is a piece of text, you can use regular expression patterns to match and extract data.

Let's take a quick look at the regular expression patterns above. Even though it looks complicated, the first one is used to validate an email address.

^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
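To see what the three pairs of parentheses capture, here is a minimal sketch using Python's built-in re module. The names EMAIL_RE and split_email are made up for this illustration, and the pattern is the one from this article, not a complete email validator.

```python
import re

# The email pattern from above, written with plain ASCII hyphens.
EMAIL_RE = re.compile(r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$")


def split_email(address):
    """Return (user, domain, tld) if the address matches, else None."""
    match = EMAIL_RE.match(address)
    return match.groups() if match else None


print(split_email("email321@gmail.com"))  # ('email321', 'gmail', 'com')
print(split_email("not-an-email"))        # None
```

Each group in parentheses becomes one element of the returned tuple, which is how a regex can extract data and not just say yes or no.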

And this is a password regex. It only matches if the password is at least 8 characters long, contains at least one letter and one number, and uses nothing but letters and numbers.

^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]{8,}$
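A quick way to check those rules is to run the pattern against a few candidate passwords. This is a small sketch with Python's re module; PASSWORD_RE and is_valid_password are names invented for the example.

```python
import re

# (?=.*[A-Za-z]) requires a letter somewhere, (?=.*\d) requires a digit somewhere,
# and [A-Za-z\d]{8,} requires 8 or more characters, all of them letters or digits.
PASSWORD_RE = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]{8,}$")


def is_valid_password(password):
    """Return True if the password satisfies the pattern above."""
    return PASSWORD_RE.match(password) is not None


print(is_valid_password("abc12345"))   # True
print(is_valid_password("abcdefgh"))   # False: no digit
print(is_valid_password("abc123"))     # False: shorter than 8 characters
print(is_valid_password("abc12345!"))  # False: '!' is not a letter or digit
```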

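This one is a date matcher. It matches only if the whole string has the format mm/dd/yyyy, i.e. two digits, a slash, two digits, a slash, and four digits. The slashes at the very beginning and end are the delimiters that some languages (such as JavaScript) put around a regex and are not part of the pattern.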

/^\d{2}\/\d{2}\/\d{4}$/

The patterns above are not the only way of writing a regex for a given format. You can write them based on what you need, making the regex as strict or as simple as you like.

For example, the mm/dd/yyyy date regex can be made stronger by limiting months to 01-12, days to 01-31, and years to between 1900 and 2099.

/^(0[1-9]|1[0-2])\/(0[1-9]|1\d|2\d|3[01])\/(19|20)\d{2}$/
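The difference between the simple and the stronger date regex shows up quickly when both are run on the same inputs. Below is a sketch in Python; the names SIMPLE_DATE_RE, STRICT_DATE_RE, and check are made up here, and since Python does not wrap patterns in slashes, the slashes inside the pattern need no escaping.

```python
import re

# Simple version: any two digits / two digits / four digits.
SIMPLE_DATE_RE = re.compile(r"^\d{2}/\d{2}/\d{4}$")
# Stronger version: month 01-12, day 01-31, year 1900-2099.
STRICT_DATE_RE = re.compile(r"^(0[1-9]|1[0-2])/(0[1-9]|1\d|2\d|3[01])/(19|20)\d{2}$")


def check(date_text):
    """Return (simple_ok, strict_ok) for one candidate date string."""
    return (
        SIMPLE_DATE_RE.match(date_text) is not None,
        STRICT_DATE_RE.match(date_text) is not None,
    )


print(check("01/09/2022"))  # (True, True)
print(check("99/99/9999"))  # (True, False)
print(check("01/09/1899"))  # (True, False)
print(check("02/31/2022"))  # (True, True): the regex does not know February has no 31st
```

The last line is a reminder that even the stronger regex only checks the shape of the date, not whether it exists on the calendar.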

I am sure it still looks confusing. Let’s go through some basic regex building blocks and you should be able to construct your own regular expressions in no time.

Building Blocks of Regular Expressions

You can try the following examples as you go through them using RegExr.

Before we start writing regular expressions, you should know what an escape symbol is. The escape symbol (\) changes the meaning of the character that follows it. Placed before certain letters, it turns them into a pattern, e.g. \d is a pattern for a digit and not the character ‘d’. Placed before a special character, it does the opposite, e.g. \. matches a literal dot and not "any character". You can learn more about escaping characters here.
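The backslash works in both directions, which is easiest to see side by side. This is a small sketch using re.findall from Python's standard library; the sample string is arbitrary.

```python
import re

text = "id3.d"

# Without the backslash, d is just the letter d.
letter_d = re.findall(r"d", text)
# With the backslash, \d means "any digit".
any_digit = re.findall(r"\d", text)
# Without the backslash, the dot means "any character".
any_char = re.findall(r".", text)
# With the backslash, \. means a literal dot.
literal_dot = re.findall(r"\.", text)

print(letter_d)     # ['d', 'd']
print(any_digit)    # ['3']
print(any_char)     # ['i', 'd', '3', '.', 'd']
print(literal_dot)  # ['.']
```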

The important thing to remember is that every piece of text is made of characters. These characters can be digits, letters, symbols, or other special characters. In most cases, you will just need to match a combination of digits, letters, and commonly used symbols.

Simple Matches

A simple match is not really a regex pattern; it spells out exactly the characters you want to match.

e.g. app -> matches the app in apple and application

This includes letters, digits, and symbols. It is always recommended to start with simple regex patterns and then move on to more complex ones.

Numeric (\d) and Non-Numeric (\D)

Let's start with numbers. The pattern \d represents a digit. For example, the \d pattern will match the first number in all the listed strings.

axg12ud (matches 1)
hello123 (matches 1)
email321@gmail.com (matches 3)

However, if you use \d+, it matches a complete number. Let's look at the same example from above.

axg12ud (matches 12)
hello123 (matches 123)
email321@gmail.com (matches 321)

If you use the pattern with the capital \D, it will match any single character that is not a digit. So for the same examples, using \D will give us the following first match.

axg12ud (matches a)
hello123 (matches h)
email321@gmail.com (matches e)

And using \D+ will match the string until it finds a digit. Symbols are not digits, so \D+ matches them too.

axg12ud (matches axg)
hello123 (matches hello)
email321@gmail.com (matches email)
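If you would like to reproduce the four results above outside RegExr, here is a short sketch with Python's re module. The helper first_match is a name made up for this example; it returns only the first match, which is what the lists above show.

```python
import re

samples = ["axg12ud", "hello123", "email321@gmail.com"]


def first_match(pattern, text):
    """Return the first substring of text that matches pattern, or None."""
    match = re.search(pattern, text)
    return match.group() if match else None


first_digit = [first_match(r"\d", s) for s in samples]
whole_number = [first_match(r"\d+", s) for s in samples]
first_non_digit = [first_match(r"\D", s) for s in samples]
non_digit_run = [first_match(r"\D+", s) for s in samples]

print(first_digit)      # ['1', '1', '3']
print(whole_number)     # ['12', '123', '321']
print(first_non_digit)  # ['a', 'h', 'e']
print(non_digit_run)    # ['axg', 'hello', 'email']
```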

Alpha Numeric (\w) and Non-Alpha Numeric (\W)

The \w regex matches any alphanumeric character. This includes all characters from a to z, A to Z, and 0 to 9 including the underscore (_) symbol. Similar to the numeric regex, you can add the + symbol to match an entire string of alphanumeric characters.

\w -> hello123&!ab (matches h)

\w+ -> hello123&!ab (matches hello123)

And to match the non-alphanumeric characters, you can use the capital \W.

\W -> hello123&!ab (matches &)

\W+ -> hello123&!ab (matches &!)
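The same string can be split into its alphanumeric and non-alphanumeric parts with a few calls. This is a minimal sketch using Python's re module; the variable names are only for illustration.

```python
import re

text = "hello123&!ab"

first_word_char = re.search(r"\w", text).group()
first_word_run = re.search(r"\w+", text).group()
first_symbol = re.search(r"\W", text).group()
first_symbol_run = re.search(r"\W+", text).group()
# findall returns every match, not just the first one.
all_word_runs = re.findall(r"\w+", text)

print(first_word_char)   # h
print(first_word_run)    # hello123
print(first_symbol)      # &
print(first_symbol_run)  # &!
print(all_word_runs)     # ['hello123', 'ab']
```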

WhiteSpaces, Wildcards, and Optional Characters

Now let's look at some more regular expressions.

To represent whitespace, you can use the \s regex. Conversely, you can use \S to match any non-whitespace character.

\s -> hello 123 (matches the whitespace)

\S -> hello 123 (matches everything other than the whitespace)

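Next is the wildcard. A wildcard matches any single character (in most engines, anything except a line break) and is denoted by the dot (.) symbol. Because the dot is special, you need the escaped form \. to match a literal period.

... -> abc, &12, 123, $43, Amp (three wildcards match any three characters)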

We also have an optional character match, i.e., the pattern matches whether the preceding character is present or absent. It is denoted by the question mark (?) symbol placed after that character.

ab?c -> Matches abc and ac.

hello? -> Matches hello and hell.
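Here is a short sketch that tries whitespace, wildcard, and optional matches in Python. It uses re.fullmatch, which succeeds only when the entire string fits the pattern; the helper name fits is made up for this example.

```python
import re


def fits(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# \s finds the single space; \S+ finds the runs of non-whitespace around it.
spaces = re.findall(r"\s", "hello 123")
non_spaces = re.findall(r"\S+", "hello 123")
print(spaces)      # [' ']
print(non_spaces)  # ['hello', '123']

# Three wildcards accept any three characters.
three_chars = [fits(r"...", s) for s in ["abc", "&12", "123", "$43", "Amp"]]
print(three_chars)        # [True, True, True, True, True]
print(fits(r"...", "ab"))  # False: only two characters

# The ? makes the character just before it optional.
print(fits(r"ab?c", "abc"), fits(r"ab?c", "ac"), fits(r"ab?c", "abbc"))  # True True False
print(fits(r"hello?", "hello"), fits(r"hello?", "hell"))                 # True True
```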

Ranges & Repetitions

You can’t keep lengthening a regular expression to match the length of the string, and in most cases you won’t know the exact length of the string you are looking for. Ranges and repetition matches solve both problems.

You can use square brackets to specify multiple options for a single character. e.g. [abc] will match a, b, or c.

To match any lowercase letter, you can use the range [a-z]. Similarly, to match any uppercase letter, it's [A-Z], and [0-9] for digits.

We have seen the plus (+) symbol to match one or more occurrences of a character. We can also use star (*) to match zero or more repetitions.

\w* -> Matches any number of word characters, including none

e.g. aab* matches aa, aab, and aabbbb (as in aabbbbc)

\w+ -> Matches only if at least one word character is present.

e.g. aab+ matches aab (as in aaaabcc) and aabbbb (as in aabbbbc), but not aa

What if you need a fixed number of repetitions? You can put the number in curly braces right after the character or group to set the count. A range also works: {2,5} means between two and five repetitions, and {8,} means eight or more.

e.g. hel{2}o will match hello. oz{4}y will match exactly ozzzzy.
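The ranges and repetition counts from this section can be tried together in a few lines. This is a sketch using Python's re module; the helper name fits and the sample sentence are made up for the example.

```python
import re


def fits(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


sentence = "Hello World 42"
lower_runs = re.findall(r"[a-z]+", sentence)
upper_letters = re.findall(r"[A-Z]", sentence)
numbers = re.findall(r"[0-9]+", sentence)
print(lower_runs)     # ['ello', 'orld']
print(upper_letters)  # ['H', 'W']
print(numbers)        # ['42']

# * allows zero b's, + needs at least one.
print(fits(r"aab*", "aa"), fits(r"aab+", "aa"))  # True False
star_match = re.search(r"aab*", "aabbbbc").group()
plus_match = re.search(r"aab+", "aaaabcc").group()
print(star_match)  # aabbbb
print(plus_match)  # aab

# {n} asks for exactly n repetitions of the preceding character.
print(fits(r"hel{2}o", "hello"), fits(r"hel{2}o", "helo"))   # True False
print(fits(r"oz{4}y", "ozzzzy"), fits(r"oz{4}y", "ozzzy"))   # True False
```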

Conditional Matching

Conditional matching is another useful option in regular expressions. You can use the or (|) and not (^) symbols to conditionally check for patterns.

To match any character other than the lowercase letters a-z, the regex would be [^a-z].

e.g. [^a-z] -> 123string, Anderson (matches 1, 2, 3, and A)

To use the or (|) character between whole words, group the alternatives in parentheses instead of square brackets.

e.g. I like (green|blue) -> Matches both I like green and I like blue
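Both kinds of conditional matching can be checked with a short script. This is a minimal sketch in Python; the variable names and the extra "I like red" input are only there for illustration.

```python
import re

# [^a-z] matches every single character that is NOT a lowercase letter.
not_lower_1 = re.findall(r"[^a-z]", "123string")
not_lower_2 = re.findall(r"[^a-z]", "Anderson")
print(not_lower_1)  # ['1', '2', '3']
print(not_lower_2)  # ['A']

# (green|blue) matches either word; the parentheses also capture which one matched.
colour_re = re.compile(r"I like (green|blue)")
green = colour_re.fullmatch("I like green")
blue = colour_re.fullmatch("I like blue")
red = colour_re.fullmatch("I like red")
print(green.group(1))  # green
print(blue.group(1))   # blue
print(red)             # None
```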

Start and End

The more specific a regex pattern is, the better the results. You can also anchor a regular expression to the start and end of the string using the hat (^) and dollar ($) symbols.

e.g.

^support -> Matches supportive but doesn't match unsupportive

supportive$ -> Matches unsupportive

^support$ -> Matches exactly support

Note: The same hat (^) symbol is used both for the not operation and for marking the start of the string. It acts as the not operator only when it is the first character inside square brackets.
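The anchors, and the two roles of the hat symbol, can be verified with re.search, which looks for the pattern anywhere in the string unless an anchor pins it down. This is a sketch; the helper name found is made up for the example.

```python
import re


def found(pattern, text):
    """Return True if the pattern occurs somewhere in text."""
    return re.search(pattern, text) is not None


print(found(r"^support", "supportive"))       # True
print(found(r"^support", "unsupportive"))     # False: does not start with support
print(found(r"supportive$", "unsupportive"))  # True: ends with supportive
print(found(r"^support$", "support"))         # True
print(found(r"^support$", "supportive"))      # False: extra characters after support

# Outside brackets ^ means "start of string"; inside brackets it means "not".
# ^[^0-9] therefore reads: the string starts with a character that is not a digit.
print(found(r"^[^0-9]", "abc1"))  # True
print(found(r"^[^0-9]", "1abc"))  # False
```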

Summary

Regular expressions are text strings that help you to extract data from a piece of text. They have a number of use cases ranging from web scraping to search engines. If you want to learn more about regular expressions, here is a video tutorial by FreeCodeCamp.

Hope you had a great time learning about regular expressions and pattern matching. If you want to learn regex by doing, you can try this interactive tutorial here.

Hope you enjoyed this article. If you have any questions, let me know in the comments. See you soon with a new topic.
