Creating Extracts with REGEX

If you build your own content management system, as I have for ThePCSpy, chances are that somewhere along the line you will want to create extracts, so that you can show the introductory paragraph of an item outside its original context.

One example is an RSS feed where you want to give your users a sneak-peek of the full thing. Another is sending trackbacks or pingbacks to other blogs when you mention them.

Using Regular Expressions, we’re going to take a chunk of the following text that we can use as we like:

<strong>Hello!</strong> My name is <em>Oli</em> and <br />
I like programming <em>Regular Expressions</em>.

Step 1 - Nuke the HTML

This is a simple REGEX Replace that matches anything that could be construed as an HTML tag:

<[^>]*>

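Then just replace every match with an empty string. Because the sample text spans two lines, it is also worth collapsing each run of whitespace into a single space; otherwise the line break survives, and in most REGEX engines the dot used in Step 2 will not match past it.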
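As a minimal sketch of this step in Python (the language and the name `strip_tags` are my own choices for illustration, not part of any particular CMS), the replace looks like this. The sample string is modelled on the text above.

```python
import re

# Matches anything that looks like an HTML tag: "<", any run of non-">" characters, ">".
TAG_RE = re.compile(r"<[^>]*>")

def strip_tags(html: str) -> str:
    """Replace every tag-like match with an empty string (hypothetical helper)."""
    return TAG_RE.sub("", html)

sample = (
    "<strong>Hello!</strong> My name is <em>Oli</em> and <br />\n"
    "I like programming <em>Regular Expressions</em>."
)
plain = strip_tags(sample)
print(repr(plain))
```

Note that the line break from the original markup is still in `plain` after the tags are gone. The pattern is also deliberately naive: a literal `>` inside an attribute value or a comment would end the match early, which is usually acceptable for text you wrote yourself.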

This could be expanded to rip out header tags together with their contents, so that if your text starts with a header, the header does not become part of the extract.
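One way that expansion might look, again as a Python sketch: the pattern below and the name `strip_headers` are assumptions of mine, and it only handles `<h1>` to `<h6>` elements that are properly closed and not nested.

```python
import re

# An opening <h1>..<h6> tag, its contents (matched lazily), and the closing tag
# with the same level. DOTALL lets the contents span several lines.
HEADER_RE = re.compile(r"<h([1-6])\b[^>]*>.*?</h\1\s*>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]*>")

def strip_headers(html: str) -> str:
    """Remove header elements including their text (hypothetical helper)."""
    return HEADER_RE.sub("", html)

page = "<h2>About me</h2><p>Hello! My name is Oli.</p>"
print(TAG_RE.sub("", strip_headers(page)))
```

The header pass has to run before the generic tag replace; once the tags are gone there is nothing left to tell the header text apart from the body.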

Step 2 - Extract your chunk

For this example we’re going to extract the first 35 characters of our HTML-less text using another simple REGEX:

^(.{0,35})

Outputs:

Hello! My name is Oli and I like pr
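A rough Python sketch of this cut, assuming the tags have already been removed (the name `first_chars` is hypothetical). It flattens whitespace first, because in Python's `re` module, as in most engines, `.` does not match a newline unless you ask it to.

```python
import re

def first_chars(text: str, limit: int = 35) -> str:
    """Return at most `limit` leading characters of `text` (hypothetical helper)."""
    # Collapse line breaks and repeated spaces so "." can run across the whole text.
    flat = re.sub(r"\s+", " ", text).strip()
    # {0,limit} can match the empty string, so this match always succeeds.
    return re.match(r"^(.{0,%d})" % limit, flat).group(1)

stripped = "Hello! My name is Oli and \nI like programming Regular Expressions."
print(first_chars(stripped))
```

Without the whitespace step, the same pattern applied to `stripped` would stop at the line break and return only the first line.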

The problem with that output is that we've cut a word in half. That just looks silly, and it's not going to make any sense to people. By asking for the first 35 characters plus all the characters up to the next *space*, we ensure that no words are broken:

^(.{0,35}[^\s]*)

And there we have it. From one long string we now output:

Hello! My name is Oli and I like programming
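Putting both steps together, a minimal end-to-end sketch in Python might look like the following. The function name `make_extract`, the default limit, and the whitespace-collapsing step are my own assumptions; the two patterns are the ones from the steps above.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")   # Step 1: anything that looks like a tag
SPACE_RE = re.compile(r"\s+")     # runs of spaces, tabs and line breaks

def make_extract(html: str, limit: int = 35) -> str:
    """Strip tags, then keep `limit` characters plus the rest of the current word."""
    text = TAG_RE.sub("", html)
    text = SPACE_RE.sub(" ", text).strip()
    # Step 2: up to `limit` characters, then any non-whitespace that follows.
    pattern = r"^(.{0,%d}[^\s]*)" % limit
    return re.match(pattern, text).group(1)

sample = (
    "<strong>Hello!</strong> My name is <em>Oli</em> and <br />\n"
    "I like programming <em>Regular Expressions</em>."
)
print(make_extract(sample))
```

If the text is shorter than the limit, the whole text comes back unchanged. If the cut happens to land on a space, the trailing `[^\s]*` matches nothing, so the extract may end with that space; calling `.rstrip()` on the result tidies that up if it matters.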