
The Basics of Regular Expressions

As web developers, we manipulate text often. This text is frequently HTML or XML, and in many cases the manipulation is done inefficiently. For text manipulation and searching you should be using Regular Expressions, but many developers do not. Do regular expressions look to you like cartoon characters swearing? I have often run into string manipulation code that contained loops inside of loops inside of loops, all of which could have been avoided with a Regular Expression that would have been more performant and more readable. Wait, a Regular Expression that is readable? Yes. Once you get the basics of Regular Expressions down, they are easier to interpret than stepping through a bunch of control statements trying to figure out what they do.


Simple Searches
Find “jeff” jeff
Find “jeff” as a whole word \bjeff\b
Find “jeff” followed by “mcwherter” \bjeff\b.*\bmcwherter\b
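
A minimal sketch of the three searches above, assuming Python's built-in re module rather than the .NET engine this article targets (for these simple patterns the syntax is the same). The sample sentence is made up for illustration.

```python
import re

# Hypothetical sample text, chosen so that "jeff" also appears inside "jeffrey".
text = "jeff and jeffrey met jeff mcwherter"

# Plain search: finds "jeff" anywhere, including inside "jeffrey".
anywhere = re.findall(r"jeff", text)

# Whole-word search: \b on both sides keeps "jeffrey" from matching.
whole_words = re.findall(r"\bjeff\b", text)

# The word "jeff" followed, later on the same line, by the word "mcwherter".
followed = re.search(r"\bjeff\b.*\bmcwherter\b", text)

print(len(anywhere))      # 3
print(len(whole_words))   # 2
print(followed.group(0))  # jeff and jeffrey met jeff mcwherter
```

Note that the last pattern starts at the first whole-word “jeff” and, because “.*” is greedy, runs all the way to “mcwherter”.
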
Special Characters
Match any character except newline .
Match any alphanumeric character \w
Match any whitespace character \s
Match any digit \d
Match the beginning or end of a word \b
Match the beginning of the string ^
Match the end of the string $
Find words that start with the letter a \ba\w*\b
Find repeated strings of digits \d+
Find six letter words \b\w{6}\b
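
As a rough illustration of the special characters above, here is a short sketch assuming Python's re module (the article itself is written with .NET in mind). The sample text is invented; note that \w also accepts digits and the underscore, so the “six letter words” pattern really finds six word characters.

```python
import re

# Hypothetical sample text for illustration only.
text = "Order 66 shipped to apple alley, anchor abc_12 in 2024"

# Words that start with the letter a.
a_words = re.findall(r"\ba\w*\b", text)

# Runs of one or more digits.
digit_runs = re.findall(r"\d+", text)

# Exactly six word characters between two word boundaries.
six_chars = re.findall(r"\b\w{6}\b", text)

# ^ and $ anchor the pattern to the start and end of the string.
starts_with_order = re.search(r"^Order", text) is not None
ends_with_2024 = re.search(r"2024$", text) is not None

print(a_words)     # ['apple', 'alley', 'anchor', 'abc_12']
print(digit_runs)  # ['66', '12', '2024']
print(six_chars)   # ['anchor', 'abc_12']
print(starts_with_order, ends_with_2024)  # True True
```
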

Sometimes we need to search for a character that is not a member of an easily defined class of characters. The following table shows how this can be specified.

Match any character that is not alphanumeric \W
Match any character that is not whitespace \S
Match any character that is not a digit \D
Match a position that is not the beginning or end of a word \B
Match any character that is not x [^x]
Match any character that is not one of the characters aeiou [^aeiou]
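
The negated classes in the table above can be tried out with a small sketch like the following, again assuming Python's re module and made-up sample strings.

```python
import re

# Hypothetical sample text for illustration only.
text = "Room 42, east wing"

# \W: every character that is not alphanumeric (spaces and the comma here).
non_alnum = re.findall(r"\W", text)

# \S+: runs of non-whitespace characters.
non_space_runs = re.findall(r"\S+", text)

# \D+: runs of characters that are not digits.
non_digit_runs = re.findall(r"\D+", text)

# [^aeiou]: strip everything that is not a lowercase vowel.
vowels_only = re.sub(r"[^aeiou]", "", "east wing")

# \B: "in" only where it sits inside a word, not at a word boundary.
inner_in = re.search(r"\Bin\B", "in wing")

print(non_alnum)         # [' ', ',', ' ', ' ']
print(non_space_runs)    # ['Room', '42,', 'east', 'wing']
print(non_digit_runs)    # ['Room ', ', east wing']
print(vowels_only)       # eai
print(inner_in.start())  # 4
```
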
Positive Lookaround
    Next we look at lookahead and lookbehind assertions. They look for things that come before or after the current match without including them in the match. It is important to understand that these expressions match a position, just as “^” and “\b” do, and never consume any text. For this reason, they are known as “zero-width assertions”. They are best illustrated by example:
    The beginning of words ending with “ing” \b\w+(?=ing\b)
    The end of words starting with “re” (?<=\bre)\w+\b
    Three digits at the end of a word, preceded by a digit (?<=\d)\d{3}\b
    Alphanumeric strings bounded by whitespace (?<=\s)\w+(?=\s)
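
The four lookaround patterns above can be sketched as follows, assuming Python's re module instead of .NET. The sample text is invented. Python only allows fixed-width lookbehind, which all four of these patterns satisfy.

```python
import re

# Hypothetical sample text for illustration only.
text = "reading and rewriting 12345 notes"

# Lookahead: the part of each word that comes before a trailing "ing".
stems = re.findall(r"\b\w+(?=ing\b)", text)

# Lookbehind: the part of each word that comes after a leading "re".
after_re = re.findall(r"(?<=\bre)\w+\b", text)

# Three digits at the end of a word, preceded by another digit.
last_three = re.findall(r"(?<=\d)\d{3}\b", text)

# Words with whitespace on both sides; the whitespace is not consumed,
# so neighbouring words can share the same space.
bounded = re.findall(r"(?<=\s)\w+(?=\s)", text)

print(stems)       # ['read', 'rewrit']
print(after_re)    # ['ading', 'writing']
print(last_three)  # ['345']
print(bounded)     # ['and', 'rewriting', '12345']
```

The first and last words are missing from the final result because they have whitespace on only one side.
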
Negative Lookaround

Earlier, I showed how to search for a character that is not a specific character or a member of a character class. What if we simply want to verify that a character is not present, but don’t want to match anything? Negative lookarounds match a position and do not consume any text. As with positive lookarounds, they can also be used to test for an arbitrarily complex subexpression, rather than just a single character.

Search for words with “q” not followed by “u” \b\w*q(?!u)\w*\b
Three digits not followed by another digit \d{3}(?!\d)
Strings of 7 alphanumeric characters not preceded by a letter or space (?<![a-z ])\w{7}
Text between HTML tags (?<=<(\w+)>).*(?=<\/\1>)
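
Here is a hedged sketch of the patterns above, assuming Python's re module and made-up sample strings. One caveat: the HTML pattern relies on a variable-width lookbehind, which .NET supports but Python's re module does not, so the last example swaps in a plain capturing group to reach the same text. That substitution is an assumption of this sketch, not the pattern from the table.

```python
import re

# Words containing a "q" that is not followed by "u".
q_words = re.findall(r"\b\w*q(?!u)\w*\b", "Iraqi quiz qatar")

# Three digits that are not followed by another digit.
three_digits = re.findall(r"\d{3}(?!\d)", "12345 678")

# Seven word characters not preceded by a lowercase letter or a space.
seven_chars = re.findall(r"(?<![a-z ])\w{7}", "#abcdefg hijklmn")

# Stand-in for the HTML pattern: capture the tag name in group 1,
# the text between the tags in group 2, and require a matching close tag.
html_match = re.search(r"<(\w+)>(.*)</\1>", "<b>bold</b> text")

print(q_words)              # ['Iraqi', 'qatar']
print(three_digits)         # ['345', '678']
print(seven_chars)          # ['abcdefg']
print(html_match.group(2))  # bold
```
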
Greedy and Lazy

When a regular expression has a quantifier that can accept a range of repetitions (like “.*”), the normal behavior is to match as many characters as possible. If the first Regular Expression in the table below is used to search the string “aabab”, it will match the entire string “aabab”. This is called “greedy” matching. Sometimes we prefer “lazy” matching, in which a match using the minimum number of repetitions is found. If we apply the second Regular Expression in the table below to the same string “aabab”, it will first match “aab” and then “ab”.

The longest string starting with a and ending with b a.*b
The shortest string starting with a and ending with b a.*?b
Repeat any number of times, but as few as possible *?
Repeat one or more times, but as few as possible +?
Repeat zero or one time, but as few as possible ??
Repeat at least n, but no more than m times, but as few as possible {n,m}?
Repeat at least n times, but as few as possible {n,}?
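
The “aabab” example and the lazy quantifiers in the table above can be checked with a small sketch, assuming Python's re module (the lazy quantifier syntax is the same as in .NET). The other sample strings are made up.

```python
import re

text = "aabab"

# Greedy: a.*b grabs as much as it can, so there is a single long match.
greedy = re.findall(r"a.*b", text)

# Lazy: a.*?b stops at the first "b" it can, giving two shorter matches.
lazy = re.findall(r"a.*?b", text)

# Lazy forms of the other quantifiers, each taking as little as possible.
lazy_star = re.search(r"a*?", "aaa").group(0)          # zero repetitions
lazy_plus = re.search(r"a+?", "aaa").group(0)          # one repetition
lazy_optional = re.search(r"ab??", "ab").group(0)      # skips the optional b
lazy_range = re.search(r"\d{2,4}?", "12345").group(0)  # the minimum of two
lazy_at_least = re.search(r"a{2,}?", "aaaa").group(0)  # the minimum of two

print(greedy)  # ['aabab']
print(lazy)    # ['aab', 'ab']
print(repr(lazy_star), lazy_plus, lazy_optional, lazy_range, lazy_at_least)
# '' a a 12 aa
```
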
Tip #1 Try Expresso

Expresso is a free tool to aid in Regular Expression development. It was created for .NET Regular Expressions, and I have found it to be a very valuable tool. One of the best features is the Regex Analyzer. In the screenshot below, you will see the feature toward the right of the screen. It has helped me figure out exactly what Regular Expressions written by someone else do, without pulling out my hair.


Tip #2 Start Slow

You are not going to master this skill overnight. Regular Expression syntax takes time to learn, and even longer to master. Start using Regular Expressions whenever you can for simple matches or substitutions. As you become more familiar with the syntax, you will find more places to use Regular Expressions and make your code more efficient.

Reference: Expresso Help Documentation