Breaking It Down: Regular Expressions

July 22nd, 2009

Regular expressions (regex) are one of the most powerful tools a software developer can have in their professional toolbox. Powerful as they are, however, they are often dismissed as incredibly complicated and hopelessly messy. People who are new to regular expressions find them painful to write and to understand, and some developers are even too scared to use them at all!

It doesn't have to be this way!

But, never fear! I intend to change all of that with this post, which will be the first in a series of articles known as Breaking It Down. Breaking It Down will be dedicated to enlightening both amateur and advanced programmers on selected topics in computer science and software engineering.

Please take note: I do most of my software development in Java. While most of this should carry over to any language that supports regular expressions, there may be differences between regex engines that I am currently unaware of. The examples here are aimed first at Java developers.

Regular expressions are, as defined by Wikipedia:

In computing, regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters. Regular expressions (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.

In their most primitive form, regexes simply look like a garbled mess of letters and random characters. For example, the following regular expression is used to find a somewhat basic URL:

^https?://[\d\-a-zA-Z]+(\.[\d\-a-zA-Z]+)*/?$

Looks rather daunting, eh? Well, it doesn’t have to be! Let’s break it down, piece by piece, and figure out exactly what each section means:

  • ^https?://
    • The “^” character anchors the match to the beginning of the input, so the input must start with “http”.
    • From there, “http” may be followed by an “s” (a “?” means zero or one occurrence of the preceding character),
    • finally, the input string needs to continue with the characters “://”.
    • This gives us two possibilities: “http://” and “https://”, the two standard HTTP URL schemes.
  • [\d\-a-zA-Z]+
    • This matches one or more characters, each drawn from a set of allowed characters.
    • Square brackets are used to enclose a defined character class,
    • “\d” represents any decimal digit (0-9),
    • “\-” represents the hyphen character (-). Note: Inside a character class, an unescaped hyphen between two characters denotes a range (as in “a-z”), so here it is preceded by a backslash (\) to make it a literal hyphen,
    • “a-z” represents the lower-case letters a through z,
    • “A-Z” represents the upper-case letters A through Z,
    • Finally, the “+” is a quantifier, which means that characters from the preceding character class must appear one or more times.
  • (\.[\d\-a-zA-Z]+)*
    • This block accounts for any number of additional dot-separated parts of the host name (domains and sub-domains).
    • First off, each part must be preceded by a period, which is denoted by “\.”. The period is a metacharacter (unescaped, it matches any character), so it must be escaped to match a literal period.
    • Next, we have the same character class (“[\d\-a-zA-Z]”) which was defined in the previous section. And, once again, it is followed by a plus “+”, which means we are required to have one or more characters.
    • Finally, the entire block is enclosed in parentheses, which group it into a single unit, and followed by a star (*), a quantifier meaning zero or more occurrences of the whole group.
  • /?$
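    • Once again, a very simple pattern: an optional forward slash. This tells the regex that we want to look for a trailing “/”, but if it isn’t found, that’s okay too!
    • The last character, “$” (a dollar sign), marks the end of the line, and in this case, the end of the input. In Java, unless multi-line mode is enabled, “$” matches only at the end of the input (or just before a final line terminator).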
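
To see the pieces above working together, here is a minimal Java sketch that compiles the same pattern and tries it against a few inputs. The class name `BasicUrlCheck`, the helper `isBasicUrl` and the sample URLs are my own illustrative choices, not part of any library. Note that every backslash in the regex has to be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only java.util.regex.Pattern is a real library type.
final class BasicUrlCheck {
    // The pattern from the article, with backslashes doubled for the Java string literal.
    static final Pattern BASIC_URL =
            Pattern.compile("^https?://[\\d\\-a-zA-Z]+(\\.[\\d\\-a-zA-Z]+)*/?$");

    // Returns true only when the whole input matches the pattern.
    static boolean isBasicUrl(String input) {
        return BASIC_URL.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com",        // plain host
            "https://www.example.com/",  // "s", sub-domain and trailing slash
            "ftp://example.com",         // wrong scheme
            "http://example.com/path"    // the pattern allows nothing after the optional slash
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isBasicUrl(sample));
        }
    }
}
```

The first two samples should be accepted and the last two rejected, which also shows how "basic" this URL pattern really is: it knows nothing about paths, ports or query strings.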

Now, that wasn’t so bad, was it?

To sum everything up, let’s take a look at what we’ve learned about:

  1. Character Classes: Using square brackets, we can define a custom character class to search for. Examples include:
    1. “a-z” for lowercase characters
    2. “A-Z” for uppercase characters
    3. “\d” for any digit (0-9)
    4. “\D” for any character that is not a digit
  2. Quantifiers: Special characters used to denote how many times a pattern should appear in an input string. Examples include:
    1. “?” for 0 or 1 occurrences
    2. “*” for 0 or many occurrences
    3. “+” for 1 or many occurrences
    4. “X{n}” for exactly n occurrences
    5. “X{n,m}” for at least n and at most m occurrences
  3. Metacharacters: These are characters that have a special meaning to the regular expression engine. In the case of Java, these include quantifiers, various types of brackets and boundary matchers. To match one of them literally, escape it with a backslash, as we did with “\.”.
  4. Boundary Matchers: As seen in the above example, boundary matchers match positions in the input rather than characters:
    1. “^” for the beginning of a line
    2. “$” for the end of a line
    3. “\z” for the end of the input
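
The summary above can be tried out in a few lines of Java. This is only a sketch: the class name `RegexSummaryDemo` and the helpers `fullMatch` and `found` are names I made up for illustration, and the sample strings are arbitrary. One Java-specific detail worth knowing is that “^” and “$” only match at line boundaries when multi-line mode is switched on (for example with the embedded flag `(?m)`); by default they refer to the input as a whole.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the helper names are illustrative only.
final class RegexSummaryDemo {
    // True only if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True if the regex matches somewhere inside the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Character classes
        System.out.println(fullMatch("[a-z]+", "regex"));    // true
        System.out.println(fullMatch("[a-z]+", "Regex"));    // false: 'R' is upper-case
        System.out.println(fullMatch("\\D+", "abc"));        // true: no digits present

        // Quantifiers
        System.out.println(fullMatch("colou?r", "color"));   // true: the 'u' is optional
        System.out.println(fullMatch("\\d{3}", "12"));       // false: exactly three digits required
        System.out.println(fullMatch("\\d{2,4}", "12345"));  // false: at most four digits allowed

        // Boundary matchers
        System.out.println(found("^b", "a\nb"));      // false: by default ^ is the start of the input
        System.out.println(found("(?m)^b", "a\nb"));  // true: in multi-line mode ^ is the start of a line
        System.out.println(found("a$", "a\n"));       // true: $ tolerates a final line terminator
        System.out.println(found("a\\z", "a\n"));     // false: \z is strictly the end of the input
    }
}
```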

Now, this may be all fine and dandy if you have something cool to implement right now. If you’re like me, however, you’ll want to try some of these things out immediately. For this, I recommend you try Regex Util for Eclipse:

Regex Util Screenshot

I originally picked up this plug-in to quickly test some sample input at work. It is absolutely beautiful. Install it now. You will not regret this decision, especially if you’re a Java developer who works with Eclipse.

Hopefully this article has been somewhat interesting and educational. Just last week, I was also too scared to use regular expressions in my project(s). There’s no reason to avoid using regex any longer. Download Regex Util, try my example, hack around with it, and learn something new. And, in the meantime, here are some more useful resources to help you learn more about regex:

  1. An example of how to use regular expressions in Java
  2. More examples of regular expressions
  3. A great tutorial on using regex in .NET
  4. Examples of regular expressions in .NET

jlgosse Breaking It Down, Development


  1. July 27th, 2009 at 16:40 | #1

    > Regexes are frequently hard to read and hard to reason about. If you have written a regex for a production system, perhaps you should include some comments that detail the expected cases the regex is designed for. *This* is the failure of the crap programmer, they leave no indication that what they have written is correct or has ever been tested. And if you don’t understand regex in the first place, you probably aren’t going to do this — because you got your regex from a forum you found on Google.

    This comment was originally posted on Reddit

  2. July 27th, 2009 at 17:57 | #2

    Breaking It Down: Regular Expressions – mobilitea.com/blog http://bit.ly/yslKl

    This comment was originally posted on Twitter

  3. July 27th, 2009 at 17:59 | #3

    Breaking It Down: Regular Expressions – mobilitea.com/blog http://is.gd/1Pq4k

    This comment was originally posted on Twitter

  4. July 27th, 2009 at 17:59 | #4

    Breaking It Down: Regular Expressions – mobilitea.com/blog http://is.gd/1Pq4k

    This comment was originally posted on Twitter

  5. July 27th, 2009 at 18:10 | #5

    Breaking It Down: Regular Expressions – mobilitea.com/blog http://bit.ly/5TGDa

    This comment was originally posted on Twitter

  6. July 27th, 2009 at 18:12 | #6

    Breaking It Down: Regular Expressions – mobilitea.com/blog http://bit.ly/yslKl

    This comment was originally posted on Twitter

  7. July 27th, 2009 at 18:41 | #7

    Breaking It Down: Regular Expressions – mobilitea.com/blog http://ff.im/-5NKay

    This comment was originally posted on Twitter

  8. July 27th, 2009 at 18:55 | #8

    Breaking It Down: Regular Expressions – mobilitea.com/blog http://bit.ly/yslKl http://ff.im/-5NLXw

    This comment was originally posted on Twitter

  9. July 27th, 2009 at 19:31 | #9

    Regular Expression magic: Quick string tricks to learn now http://bit.ly/S5qsq + More reasons 2Luv Eclipse

    This comment was originally posted on Twitter

  10. July 27th, 2009 at 19:36 | #10

    Breaking It Down: Regular Expressions – mobilitea.com/blog http://bit.ly/yslKl

    This comment was originally posted on Twitter

  11. Mark B.
    July 27th, 2009 at 19:41 | #11

    Jesus Josh, your article just came up in my delicious feed. Nice going!

  12. July 27th, 2009 at 23:37 | #12

    LifeStream Breaking It Down: Regular Expressions – mobilitea.com/blog http://bit.ly/yslKl: Breaking It….. http://ff.im/-5NYog

    This comment was originally posted on Twitter

  13. July 28th, 2009 at 08:26 | #13

    Thanks man. I’m pretty much the next Mark Brophy. Isn’t that cool? Look out for my podcast series soon too. The first is just an introduction to me for the most part, but I plan on it being interesting!

  14. July 28th, 2009 at 19:51 | #14

    Breaking It Down: Regular Expressions http://bit.ly/p7dru #programming #regex #tutorial

    This comment was originally posted on Twitter

  15. July 28th, 2009 at 23:55 | #15

    He doesn’t explain the difference between [] and () though. :(

    This comment was originally posted on Reddit

  16. July 29th, 2009 at 14:04 | #16

    Breaking It Down: Regular Expressions http://bit.ly/p7dru (via @nicholasdr, @lazorstudios)

    This comment was originally posted on Twitter

  17. August 6th, 2009 at 03:52 | #17

    RegEx for Dummies: http://tinyurl.com/np3sac

    This comment was originally posted on Twitter

  18. nrg_alpha
    August 12th, 2009 at 14:09 | #18

    @Article Author:

    Two things to note:

    a) depending on your locale, \d might also match exponents. One sure fire way is to explicitly declare 0-9 instead.

    b) If you place the dash as the very first or very last character within the character class, you do not need to escape it, as regex will treat it as a literal, and not a range.

    @f3nd3r

    The difference between [] and () is simple. [] is a character class. This checks for a single character at the current location in a string. So if I have x[ab]z for example, what this means is: match an x, followed by either an a or b (but not both, as this is for a single character), followed by a z. When dealing with (), we are talking about group capturing. This matches whatever is found inside and stores it as a variable (which can be reused in the expression as a backreference). A more detailed explanation can be found here:
    http://www.regular-expressions.info/brackets.html
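
The points in the last comment about hyphen placement and about [] versus () can be checked with a short Java sketch. The class name `BracketsVsParens`, its two helpers and the sample strings are hypothetical and exist only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class contrasting character classes with capturing groups.
final class BracketsVsParens {
    // True only if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Returns the text captured by the first group of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // [] is a character class: exactly one character out of the set.
        System.out.println(matches("x[ab]z", "xaz"));   // true
        System.out.println(matches("x[ab]z", "xabz"));  // false: two characters where one is allowed

        // () is a group: the whole sequence "ab", captured for later use.
        System.out.println(matches("x(ab)z", "xabz"));         // true
        System.out.println(firstGroup("x(ab)z", "say xabz"));  // ab

        // A captured group can be reused as a backreference (\1).
        System.out.println(matches("(\\w+) \\1", "hello hello"));  // true
        System.out.println(matches("(\\w+) \\1", "hello world"));  // false

        // A hyphen placed first or last in a character class is literal without escaping.
        System.out.println(matches("[a-z-]+", "well-known"));  // true
        System.out.println(matches("[-a-z]+", "well-known"));  // true
    }
}
```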
