Breaking It Down: Regular Expressions
Regular expressions, or regex for short, are one of the most powerful tools a software developer can have in their professional toolbox. As powerful as they are, however, regular expressions are often dismissed as incredibly complicated and hopelessly messy. People who are new to them find them a pain to write and to understand. As if this weren’t enough, some developers are even too scared to use them!

It doesn't have to be this way!
Never fear! I intend to change all of that with this post, the first in a series of articles called Breaking It Down. The series will be dedicated to enlightening both amateur and advanced programmers on selected topics in computer science and software engineering.
Please take note: I do most of my software development in Java. Most of what follows should carry over to any language that supports regular expressions, but regex engines differ in their details, and there may be differences I am currently unaware of. That being said, this post is aimed primarily at Java developers.
Regular expressions are, as defined by Wikipedia:
In computing, regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters. Regular expressions (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.
In their most primitive form, regexes simply look like a garbled mess of letters and random characters. For example, the following regular expression is used to find a somewhat basic URL:
^https?://[\d\-a-zA-Z]+(\.[\d\-a-zA-Z]+)*/?$
Looks rather daunting, eh? Well, it doesn’t have to be! Let’s break it down, piece by piece, and figure out exactly what each section means:
- ^https?://
  - This means that the input must begin with “http”. The “^” character anchors the match to the beginning of the input.
  - From there, “http” may be followed by an “s” (a “?” means zero or one of the preceding character).
  - Finally, the input must continue with the characters “://”.
  - This gives us two possibilities, “http://” and “https://”, the two standard URL schemes for HTTP.
- [\d\-a-zA-Z]+
  - This means that any of a set of possible characters may appear, one or more times.
  - Square brackets enclose a character class, which matches a single character from the set listed inside.
  - “\d” represents any decimal digit (0-9).
  - “\-” represents the hyphen character (-). Note: Inside a character class the hyphen normally denotes a range (as in “a-z”), so it is preceded by a backslash (\) here to make it a literal hyphen.
  - “a-z” represents all lower-case letters.
  - “A-Z” represents all upper-case letters.
  - Finally, the “+” is a quantifier, which means that characters from the preceding character class must appear one or more times.
- (\.[\d\-a-zA-Z]+)*
  - This block accounts for the remaining dot-separated parts of the host name, so any number of domains and sub-domains is allowed.
  - First off, each part must be preceded by a period, which is written “\.”. The period is a metacharacter (unescaped, it matches any character), so it must be escaped to match a literal period.
  - Next, we have the same character class (“[\d\-a-zA-Z]”) that was defined in the previous section. Once again, it is followed by a plus “+”, which means we are required to have one or more characters.
  - Finally, the entire block is wrapped in parentheses to group it and followed by a star (*), a quantifier meaning zero or more occurrences of the whole group.
- /?$
  - Once again, a very simple pattern. The “/?” tells the regex that we want to look for a forward slash, but if it isn’t found, that’s okay too!
  - The last character, “$” (a dollar sign), marks the end of the line and, in this case, the end of the input.
Now, that wasn’t so bad, was it?
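To see the pattern in action from Java, here is a minimal sketch. The class name, the method name and the sample URLs are my own choices for illustration; only the pattern itself comes from the example above. Note that every backslash has to be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

public class UrlPatternDemo {
    // The pattern from the example above. In a Java string literal each
    // backslash is doubled, so \d becomes \\d and \. becomes \\.
    static final Pattern BASIC_URL =
            Pattern.compile("^https?://[\\d\\-a-zA-Z]+(\\.[\\d\\-a-zA-Z]+)*/?$");

    // Returns true only if the whole input matches the pattern.
    static boolean isBasicUrl(String input) {
        return BASIC_URL.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com",            // plain host
            "https://www.example.com/",      // optional "s" and trailing slash
            "ftp://example.com",             // wrong scheme
            "http://example.com/index.html"  // has a path, which the pattern does not allow
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isBasicUrl(sample));
        }
    }
}
```

The first two samples should match and the last two should not, which also shows how basic this pattern is: it accepts a scheme and a host name, but nothing after the optional trailing slash.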
To sum everything up, let’s take a look at what we’ve learned about:
- Character Classes: Using square brackets, we can define a custom character class to search for. Examples of what can go inside the brackets include:
  - “a-z” for lowercase letters
  - “A-Z” for uppercase letters
  - “\d” for a decimal digit
  - “\D” for any character that is not a decimal digit
- Quantifiers: Special characters used to denote how many times a pattern should appear in an input string. Examples include:
  - “?” for zero or one occurrence
  - “*” for zero or more occurrences
  - “+” for one or more occurrences
  - “X{n}” for exactly n occurrences
  - “X{n,m}” for at least n and at most m occurrences
- Metacharacters: These are characters that have a special meaning to the regular expression engine instead of matching themselves. In the case of Java, these include quantifiers, the various types of brackets and boundary matchers. To match one literally, escape it with a backslash.
- Boundary Matchers: As seen in the above example, boundary matchers match positions in the input rather than characters:
  - “^” for the beginning of a line
  - “$” for the end of a line
  - “\z” for the end of the input
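The summary above can be tried out with a short Java sketch. The class, the two helper methods and the sample strings are hypothetical and exist only for this illustration. One Java detail worth knowing: by default “^” and “$” only match at the edges of the whole input (with “$” also matching just before a final line break), and they match at every line only when the Pattern.MULTILINE flag is set.

```java
import java.util.regex.Pattern;

public class SummaryDemo {
    // True if the whole input matches the regex.
    static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True if the regex matches somewhere in the input, using the given flags.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // Quantifiers
        System.out.println(whole("colou?r", "color"));    // "?": the u may be absent
        System.out.println(whole("ab*", "a"));            // "*": zero b's is acceptable
        System.out.println(whole("ab+", "a"));            // "+": fails, needs at least one b
        System.out.println(whole("\\d{4}", "2024"));      // {n}: exactly four digits
        System.out.println(whole("\\d{2,3}", "1234"));    // {n,m}: fails, four digits is too many

        // Predefined classes
        System.out.println(whole("\\D+", "abc"));         // \D: anything that is not a digit

        // Boundary matchers
        String lines = "a\nb\nc";
        System.out.println(found("^b$", 0, lines));                  // default: ^ and $ see only the input edges
        System.out.println(found("^b$", Pattern.MULTILINE, lines));  // MULTILINE: ^ and $ see each line
        System.out.println(found("c$", 0, "abc\n"));                 // $ also matches before a final line break
        System.out.println(found("c\\z", 0, "abc\n"));               // \z only matches at the very end
    }
}
```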
Now, this may be all fine and dandy if you have something cool to implement right now. If you’re like me, however, you’ll want to try some of these things out immediately. For this, I recommend you try Regex Util for Eclipse.
I originally picked up this plug-in to quickly test some sample input at work. It is absolutely beautiful. Install it now. You will not regret this decision, especially if you’re a Java developer who works with Eclipse.
Hopefully this article has been somewhat interesting and educational. Just last week, I was also too scared to use regular expressions in my project(s). There’s no reason to avoid using regex any longer. Download Regex Util, try my example, hack around with it, and learn something new. And, in the meantime, here are some more useful resources to help you learn more about regex:
> Regexes are frequently hard to read and hard to reason about. If you have written a regex for a production system, perhaps you should include some comments that detail the expected cases the regex is designed for. *This* is the failure of the crap programmer, they leave no indication that what they have written is correct or has ever been tested. And if you don’t understand regex in the first place, you probably aren’t going to do this — because you got your regex from a forum you found on Google.
This comment was originally posted on Reddit
Jesus Josh, your article just came up in my delicious feed. Nice going!
Thanks man. I’m pretty much the next Mark Brophy. Isn’t that cool? Look out for my podcast series soon too. The first is just an introduction to me for the most part, but I plan on it being interesting!
He doesn’t explain the difference between [] and () though.
This comment was originally posted on Reddit
@Article Author:
Two things to note:
a) depending on your locale, \d might also match exponents. One sure fire way is to explicitly declare 0-9 instead.
b) If you place the dash as the very first or very last character within the character class, you do not need to escape it, as regex will treat it as a literal, and not a range.
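Point (b) is easy to check in Java. The sketch below is only an illustration: the class name, the method names and the test strings are made up, and the two patterns are the article's character class written once with an escaped hyphen and once with the hyphen moved to the end.

```java
import java.util.regex.Pattern;

public class HyphenInClassDemo {
    // Hyphen escaped in the middle of the class, as in the article.
    static final Pattern ESCAPED = Pattern.compile("[\\d\\-a-zA-Z]+");
    // Hyphen placed last in the class, so no escape is needed.
    static final Pattern TRAILING = Pattern.compile("[\\da-zA-Z-]+");

    static boolean escapedMatches(String input) {
        return ESCAPED.matcher(input).matches();
    }

    static boolean trailingMatches(String input) {
        return TRAILING.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Both forms accept letters, digits and hyphens...
        System.out.println(escapedMatches("my-host-01") + " " + trailingMatches("my-host-01"));
        // ...and both reject anything else, such as an underscore.
        System.out.println(escapedMatches("my_host") + " " + trailingMatches("my_host"));
    }
}
```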
@f3nd3r
The difference between [] and () is simple. [] is a character class. It checks for a single character at the current location in a string. So if I have x[ab]z, for example, it means: match an x, followed by either an a or a b (but not both, as this is for a single character), followed by a z. When dealing with (), we are talking about group capturing. This matches whatever is found inside and stores it as a variable (which can be reused in the expression as a backreference). A more detailed explanation can be found here:
http://www.regular-expressions.info/brackets.html
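
The distinction described above can be sketched in Java as follows. The class and method names are hypothetical, and the repeated-word pattern is an extra example chosen to show a backreference, not something taken from the article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClassVsGroupDemo {
    // x[ab]z: an x, then exactly ONE character that is a or b, then a z.
    static boolean classMatch(String input) {
        return Pattern.matches("x[ab]z", input);
    }

    // x(ab)z: an x, then the literal sequence "ab" (captured as group 1), then a z.
    // Returns the captured text, or null when the input does not match.
    static String captured(String input) {
        Matcher m = Pattern.compile("x(ab)z").matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    // (\w+) \1: a word, a space, then the same word again via the backreference \1.
    static boolean repeatedWord(String input) {
        return Pattern.matches("(\\w+) \\1", input);
    }

    public static void main(String[] args) {
        System.out.println(classMatch("xaz"));            // one character from the class
        System.out.println(classMatch("xabz"));           // two characters, so the class does not fit
        System.out.println(captured("xabz"));             // the group captures "ab"
        System.out.println(repeatedWord("hello hello"));  // \1 repeats what group 1 captured
        System.out.println(repeatedWord("hello world"));  // second word differs
    }
}
```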