How Do You Actually Use Regex? – CloudSavvy IT

Regex, short for regular expression, is often used in programming languages for matching patterns in strings, find and replace, input validation, and reformatting text. Learning how to properly use Regex can make working with text much easier.

Regex Syntax, Explained

Regex has a reputation for having horrendous syntax, but it’s much easier to write than it is to read. For example, here is a general regex for an RFC 5322-compliant email validator:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-
\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")
@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?
:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-
9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|
\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

If you think it looks like someone smashed their face into the keyboard, you’re not alone. But under the hood, all of this mess is actually programming a finite-state machine. This machine runs for each character, chugging along and matching based on rules you’ve set. Plenty of online tools will render railroad diagrams, showing how your Regex machine works. Here’s that same Regex in visual form:

Still very confusing, but it’s a lot more understandable. It’s a machine with moving parts that have rules defining how it all fits together. You can see how someone assembled this; it’s not just a big glob of text.

First Off: Use a Regex Debugger

Before we begin, unless your Regex is particularly short or you’re particularly proficient, you should use an online debugger when writing and testing it. It makes understanding the syntax much easier. We recommend Regex101 and RegExr, both of which offer testing and a built-in syntax reference.

How Does Regex Work?

For now, let’s focus on something much simpler. This is a diagram from Regulex for a very short (and definitely not RFC 5322 compliant) email-matching Regex:

The Regex engine starts at the left and travels down the lines, matching characters as it goes. Group #1 matches any character except a line break, and will continue to match characters until the next block finds a match. In this case, it stops when it reaches an @ symbol, which means Group #1 captures the name of the email address and everything after matches the domain.

The Regex that defines Group #1 in our email example is:

(.+)

The parentheses define a capture group, which tells the Regex engine to include the contents of this group’s match in a special variable. When you run a Regex on a string, the default return is the entire match (in this case, the whole email). But it also returns each capture group, which makes this Regex useful for pulling names out of emails.

The period is the symbol for “Any Character Except Newline.” This matches everything on a line, so if you passed this email Regex an address like:

%$#^&%*#%$#^@gmail.com

It would match %$#^&%*#%$#^ as the name, even though that’s ludicrous.

The plus (+) symbol is a control structure that means “match the preceding character or group one or more times.” It ensures that the whole name is matched, and not just the first character. This is what creates the loop found on the railroad diagram.

The rest of the Regex is fairly simple to decipher:

(.+)@(.+\..+)

The first group stops when it hits the @ symbol. The next group then starts, which again matches multiple characters until it reaches a period character.

Because characters like periods, parentheses, and slashes are used as part of the syntax in Regex, anytime you want to match those characters you need to properly escape them with a backslash. In this example, to match the period we write \. and the parser treats it as one symbol meaning “match a period.”
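As a minimal sketch of how the whole match and the capture groups come back, here is the short email Regex run in JavaScript. The sample address and the variable names are made up for illustration.

```javascript
// The short (and not RFC 5322 compliant) email Regex discussed above.
const emailPattern = /(.+)@(.+\..+)/;

// Hypothetical sample address, used only for illustration.
const emailMatch = emailPattern.exec('jane.doe@example.com');

const wholeMatch = emailMatch[0];  // the entire match
const emailName = emailMatch[1];   // capture group #1: everything before the @
const emailDomain = emailMatch[2]; // capture group #2: everything after the @

console.log(wholeMatch, emailName, emailDomain);
```

Index 0 of the result is always the full match, and the capture groups follow in the order their opening parentheses appear.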

Character Matching

If you have non-control characters in your Regex, the Regex engine will assume those characters will form a matching block. For example, the Regex:

he+llo

Will match the word “hello” with one or more e’s. Any characters that have a special meaning in Regex need to be escaped to be matched literally.

Regex also has character classes, which act as shorthand for a set of characters. These can vary based on the Regex implementation, but these few are standard:

  • . – matches anything except newline.
  • \w – matches any “word” character, including digits and underscores.
  • \d – matches digits.
  • \s – matches whitespace characters (i.e., space, tab, newline).

The last three all have uppercase counterparts that invert their function. For example, \D matches anything that isn’t a digit.
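A quick way to get a feel for these classes is to run them against a short string. This is only a sketch in JavaScript; the sample text and variable names are invented for the example.

```javascript
// Made-up sample text for illustration.
const classSample = 'Order 66, item_7';

// \d matches one digit at a time; the g flag collects every match.
const digitChars = classSample.match(/\d/g);

// \w+ matches runs of letters, digits, and underscores.
const wordRuns = classSample.match(/\w+/g);

// \s matches each whitespace character.
const whitespaceCount = classSample.match(/\s/g).length;

// \D is the inverse of \d, so removing it leaves only the digits.
const digitsOnly = classSample.replace(/\D/g, '');

console.log(digitChars, wordRuns, whitespaceCount, digitsOnly);
```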

Regex also has character-set matching. For example:

[abc]

Will match either a, b, or c. This acts as one block, and the square brackets are just control structures. Alternatively, you can specify a range of characters:

[a-c]

Or negate the set, which will match any character that isn’t in the set:

[^a-c]
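The three set forms can be compared side by side. The snippet below is a small JavaScript sketch with a made-up input string.

```javascript
// Made-up input for illustration.
const setSample = 'abcdef';

const explicitSet = setSample.match(/[abc]/g);  // a, b, or c
const rangeSet = setSample.match(/[a-c]/g);     // same characters, written as a range
const negatedSet = setSample.match(/[^a-c]/g);  // anything that is NOT a, b, or c

console.log(explicitSet, rangeSet, negatedSet);
```

The first two patterns are equivalent, while the negated set picks up everything they skip.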

Quantifiers

Quantifiers are an important part of Regex. They let you match strings where you don’t know the exact format, but you have a pretty good idea.

The + operator from the email example is a quantifier, specifically the “one or more” quantifier. If we don’t know how long a certain string is, but we know it’s made up of alphanumeric characters (and isn’t empty), we can write:

\w+

In addition to +, there’s also:

  • The * operator, which matches “zero or more.” Essentially the same as +, except it has the option of not finding a match.
  • The ? operator, which matches “zero or one.” It has the effect of making a character optional; either it’s there or it isn’t, and it won’t match more than once.
  • Numerical quantifiers. These can be a single number like {3}, which means “exactly 3 times,” or a range like {3,6}. You can leave out the second number to make it unlimited. For example, {3,} means “3 or more times”. Oddly enough, in most engines you can’t leave out the first number, so if you want “3 or fewer times,” you’ll have to use a range like {0,3}.
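The quantifiers above can be sketched as a handful of JavaScript tests. The ^ and $ anchors are an addition for this example (they force the whole string to match), and the test strings are made up.

```javascript
// ^ and $ are added here so each test must match the entire string.
const quantifierResults = {
  // + needs at least one "e"; * is happy with none.
  plusNeedsOne: /^he+llo$/.test('hllo'),
  starAllowsZero: /^he*llo$/.test('hllo'),

  // ? makes the "u" optional, but it cannot repeat.
  optionalAbsent: /^colou?r$/.test('color'),
  optionalPresent: /^colou?r$/.test('colour'),
  optionalTwice: /^colou?r$/.test('colouur'),

  // Numerical quantifiers on digits.
  exactlyThree: /^\d{3}$/.test('1234'),
  threeToSix: /^\d{3,6}$/.test('1234'),
  threeOrMore: /^\d{3,}$/.test('1234567'),
  upToThree: /^\d{0,3}$/.test('12'),
};

console.log(quantifierResults);
```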

Greedy and Lazy Quantifiers

Under the hood, the * and + operators are greedy. They match as much as possible, and give back only what is needed for the next block to match. This can be a massive problem.

Here’s an example: say you’re trying to match HTML, or anything else with closing braces. Your input text is:

<div>Hello World</div>

And you want to match everything within the brackets. You may write something like:

<.*>

This is the right idea, but it fails for one crucial reason: the Regex engine first matches “div>Hello World</div>” for the sequence .*, and then backtracks until the next block matches, in this case, a closing bracket (>). You would expect it to backtrack far enough to match only “div”, and then repeat to match the closing div. But the backtracker works backward from the end of the string and stops at the final closing bracket, so the entire input comes back as a single match.

The solution is to make our quantifier lazy, which means it will match as few characters as possible. Under the hood, a lazy quantifier starts with the smallest match it is allowed and then expands one character at a time until the next block matches, which can also save a lot of backtracking on long inputs.

Making a quantifier lazy is done by adding a question mark directly after the quantifier. This is a bit confusing because ? is already a quantifier (and is actually greedy by default). For our HTML example, the Regex is fixed with this simple addition:

<.*?>

The lazy operator can be tacked on to any quantifier, including +?, {0,3}?, and even ??. The last one makes an optional token prefer to match zero times; it only matches once if the rest of the Regex can’t succeed otherwise.
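The difference between greedy and lazy matching is easy to see by running both patterns on the HTML example. This is a small JavaScript sketch; the variable names are our own, and the final ?? comparison uses a made-up two-letter input.

```javascript
const htmlInput = '<div>Hello World</div>';

// Greedy: .* grabs as much as it can, so both tags and the text are one match.
const greedyMatches = htmlInput.match(/<.*>/g);

// Lazy: .*? stops at the first closing bracket, so each tag is its own match.
const lazyMatches = htmlInput.match(/<.*?>/g);

// ? is greedy and takes the "b"; ?? is lazy and prefers to skip it.
const greedyOptional = 'ab'.match(/ab?/)[0];
const lazyOptional = 'ab'.match(/ab??/)[0];

console.log(greedyMatches, lazyMatches, greedyOptional, lazyOptional);
```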

Grouping and Lookarounds

Groups in Regex have a lot of purposes. At a basic level, they join together multiple tokens into one block. For example, you can create a group, then use a quantifier on the entire group:

ba(na)+

This groups the repeated “na” to match the phrase banana, and banananana, and so on. Without the group, the Regex engine would just match the ending character over and over.

This type of group with two simple parentheses is called a capture group, and its match is included in the output.

If you’d like to avoid this, and simply group tokens together for execution reasons, you can use a non-capturing group:

ba(?:na)

The question mark (a reserved character) defines a non-standard group, and the following character defines what kind of group it is. Starting groups with a question mark is ideal, because otherwise if you wanted to match colons in a group, you’d need to escape them for no good reason. But you always have to escape question marks in Regex.
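To see what “included in the output” means in practice, compare the result arrays for a capturing and a non-capturing group. This is a JavaScript sketch; a + quantifier is added to the non-capturing version here so that both patterns match the same text.

```javascript
// Capturing group: the result holds the full match plus the group's last repetition.
const capturingResult = /ba(na)+/.exec('banana');

// Non-capturing group: same match, but nothing extra is stored.
const nonCapturingResult = /ba(?:na)+/.exec('banana');

console.log(capturingResult.length, capturingResult[0], capturingResult[1]);
console.log(nonCapturingResult.length, nonCapturingResult[0]);
```

Both patterns match the same text; the only difference is whether the group takes up a slot in the result.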

You can also name your groups, for convenience, when working with the output:

(?'group')

You can reference these in your Regex, which makes them work similarly to variables. You can reference non-named groups with a numbered token such as \1, but in many engines only single-digit references are unambiguous, after which it’s easier to start naming groups. The syntax for referencing named groups is:

\k{group}

This references the result of the named group, which can be dynamic. Essentially, a backreference matches the same text the group captured earlier, wherever it appears later in the input. For example, this can be used to match all text between three identical words.
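Here is a rough sketch of a named backreference in JavaScript. Note that JavaScript spells named groups as (?<name>...) and references them with \k<name>, which differs from the PCRE-style syntax shown above. The group name, the sample sentence, and the use of \b word boundaries are choices made for this example.

```javascript
// (?<word>\w+) captures a word; \k<word> must then match that exact same text.
const repeatedWord = /\b(?<word>\w+)\s+\k<word>\b/;

// Made-up sentence with an accidental doubled word.
const backrefMatch = repeatedWord.exec('this is is a test');

const doubledText = backrefMatch[0];           // the whole doubled phrase
const capturedWord = backrefMatch.groups.word; // what the named group captured

// The numbered form \1 works the same way for unnamed groups.
const numberedMatch = /(\w)\1/.exec('hello')[0];

console.log(doubledText, capturedWord, numberedMatch);
```

The backreference does not re-run \w+; it only accepts a second copy of whatever the group matched the first time.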

The group class is where you’ll find most of Regex’s control structures, including lookaheads. A lookahead ensures that an expression matches at the current position, but doesn’t include it in the result. In a way, it’s similar to an if statement, and the overall match fails if it returns false.

The syntax for a positive lookahead is (?=). Here’s an example:

This matches the name part of an email address very cleanly, by stopping execution at the dividing @. Lookaheads don’t consume any characters, so if you wanted to continue running after a lookahead succeeds, you can still match the character used in the lookahead.

In addition to positive lookaheads, there are also:

  • (?!) – Negative lookaheads, which ensure an expression doesn’t match.
  • (?<=) – Positive lookbehinds, which are not supported everywhere due to some technical constraints. These are placed before the expression you want to match, and in many engines they must have a fixed width (i.e., no quantifiers except {number}). In this example, you could use (?<=@)\w+\.\w+ to match the domain part of the email.
  • (?<!) – Negative lookbehinds, which are the same as positive lookbehinds, but negated.
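The four lookarounds can be sketched in JavaScript as follows. Lookbehinds need an engine that supports ES2018 or later; the sample strings, variable names, and the specific patterns other than (?<=@)\w+\.\w+ are assumptions made for this example.

```javascript
// Hypothetical sample address for illustration.
const lookaroundAddress = 'jane.doe@example.com';

// Positive lookahead: match everything that is followed by an @, without consuming it.
const namePart = lookaroundAddress.match(/.+(?=@)/)[0];

// Positive lookbehind: match the domain only where an @ comes right before it.
const domainPart = lookaroundAddress.match(/(?<=@)\w+\.\w+/)[0];

// Negative lookahead: "foo" words, except where "foo" is followed by "bar".
const notFollowedByBar = 'foobar foobaz'.match(/foo(?!bar)\w+/g);

// Negative lookbehind: whole numbers that are not preceded by a dollar sign.
const notAfterDollar = 'price $40 qty 12'.match(/(?<!\$)\b\d+/g);

console.log(namePart, domainPart, notFollowedByBar, notAfterDollar);
```

Because lookarounds consume nothing, the @ never shows up in either the name or the domain.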

Differences Between Regex Engines

Not all Regex is created equal. Most Regex engines don’t follow any specific standard, and some switch things up a bit to suit their language. Some features that work in one language may not work in another.

For example, the versions of sed compiled for macOS and FreeBSD do not support using \t to represent a tab character. You have to manually copy a tab character and paste it into the terminal to use a tab in command line sed.

Most of this tutorial is compatible with PCRE, the default Regex engine used for PHP. But JavaScript’s Regex engine is different: it doesn’t support named capture groups with quotation marks (it wants angle brackets) and can’t do recursion, among other things. Even PCRE isn’t entirely compatible across its own versions, and it has many differences from Perl regex.

There are too many minor differences to list here, so it’s worth consulting a reference table that compares multiple Regex engines. Also, Regex debuggers like Regex101 let you switch Regex engines, so make sure you’re debugging using the correct engine.

How To Run Regex

We’ve been discussing the matching portion of regular expressions, which is the bulk of any Regex. But when you actually want to run your Regex, you’ll need to form it into a full regular expression.

This usually takes the format:

/match/g

Everything inside the forward slashes is our match. The g is a mode modifier. In this case, it tells the engine not to stop running after it finds the first match. For find and replace Regex, you’ll often have to format it like:

/find/replace/g

This replaces every match throughout the file. You can use capture group references when replacing, which makes Regex very good at formatting text. For example, this Regex will match any HTML tags and replace the standard brackets with square brackets:

/<(.+?)>/[\1]/g

When this runs, the engine will match <div> and </div>, allowing you to replace this text (and this text only). As you can see, the inner HTML is unaffected:
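The same replacement can be sketched in JavaScript. One difference to be aware of: JavaScript’s .replace() refers to capture groups in the replacement string as $1 rather than \1. The variable names are our own.

```javascript
const taggedInput = '<div>Hello World</div>';

// Swap the angle brackets around each tag for square brackets.
// $1 inserts whatever capture group #1 matched (the tag name, slash included).
const bracketedOutput = taggedInput.replace(/<(.+?)>/g, '[$1]');

console.log(bracketedOutput);
```

Only the text matched by the Regex is rewritten; the “Hello World” between the tags is left as it was.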

This makes Regex very useful for finding and replacing text. The command line utility to do this is sed, which uses the basic format of:

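sed 's/find/replace/g' file > newfile

This runs on a file, and outputs to STDOUT. Redirecting the output straight back to the same file would truncate it before sed gets to read it, so write to a new file (as shown here) and then move it over the original, or use sed’s in-place option (-i), whose exact syntax differs between GNU and BSD sed.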

Regex is also supported in many text editors, and can really speed up your workflow when doing batch operations. Vim, Atom, and VS Code all have Regex find and replace built in.

Of course, Regex can also be used programmatically, and is built into many languages. The exact implementation will depend on the language, so you’ll need to consult your language’s documentation.

For example, in JavaScript regex can be created literally, or dynamically using the global RegExp object:

var re = new RegExp('abc')

This can be used directly by calling the .exec() method of the newly created regex object, or by using the .replace(), .match(), and .matchAll() methods on strings.
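As a minimal sketch of those methods working together, the snippet below builds the same pattern both ways and runs it against a made-up input string. The g flag is added so that every match is found, and the variable names are our own.

```javascript
// The same pattern created literally and dynamically, both with the g flag.
const literalRegex = /abc/g;
const dynamicRegex = new RegExp('abc', 'g');

// Made-up input for illustration.
const methodInput = 'abc abcabc';

// .exec() returns one match at a time, along with its position.
const firstIndex = dynamicRegex.exec(methodInput).index;

// .match() with the g flag returns every matched string.
const allMatches = methodInput.match(literalRegex);

// .matchAll() returns an iterator of full match objects, positions included.
const matchPositions = [...methodInput.matchAll(literalRegex)].map((m) => m.index);

// .replace() with the g flag swaps out every match.
const replacedText = methodInput.replace(literalRegex, 'X');

console.log(firstIndex, allMatches, matchPositions, replacedText);
```

The RegExp constructor is mostly useful when the pattern is only known at runtime; for fixed patterns the literal form is easier to read.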
