JDT

 

John Dixon
Technology
Limited

 
Regular Expressions: Using the Substitution Operation


Regular expressions enable you to find patterns in strings, for example, all the <h1> tags in an HTML file, or all the words beginning with the letter 'p'. Although the use of regular expressions is possible in several programming languages, it is Perl's support for regular expressions that makes it the language of choice for pattern matching.

In Perl, there are three main pattern-matching operations: matching, substitution, and translation. In this tutorial we'll look at the substitution operation, which substitutes one expression for another.

The following Perl script shows a typical Perl template that can be used to process all the HTML files in the current folder or directory, that is, the location where the Perl script and the files to be processed are located.

1   opendir(DIR, ".") or die "can't opendir: $!";
2   @allfiles = grep (/\.html?$/i, readdir DIR);
3   closedir(DIR);
4   foreach $file (@allfiles) {
5       rename $file, "$file.bak" or die "can't rename $file: $!";
6       open (IN, "<$file.bak") or die "can't read $file.bak: $!";
7       open (OUT, ">$file") or die "can't write $file: $!";
8       while ($line = <IN>) {
9          REGULAR EXPRESSION GOES HERE
10         print OUT $line;
11      }
12      close IN;
13      close OUT;
14  }

Here's a very quick description of the above script:

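Lines 1 - 3: Get the names of all the HTML files in the current directory and put them into an array called @allfiles. (In Perl, the '@' symbol is used to identify an array, whereas the '$' symbol is used to identify a scalar variable, that is, a variable that holds a single value.)

Lines 4 - 11: Each file is processed in turn. The original file is renamed with a .bak extension, the backup is opened for reading, a new file with the original name is opened for writing, and each line is read, modified, and written out.

Lines 12 - 14: Close the two files and end the loop.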

Line 9: This is where the regular expression goes.
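The filtering step in lines 1 to 3 can be tried on its own. The following is only a sketch: the list of file names is made up and stands in for what readdir would return, and the pattern shown escapes the dot and accepts both .htm and .html.

```perl
use strict;
use warnings;

# Hypothetical file names standing in for the result of readdir.
my @names = ('index.html', 'about.HTM', 'notes.txt', 'htm', 'script1.pl');

# Keep only the names that end in .htm or .html, ignoring case.
# The dot is escaped so that it matches a literal '.' and not any character.
my @allfiles = grep { /\.html?$/i } @names;

# Print the names that survived the filter, one per line.
print "$_\n" for @allfiles;
```

Only index.html and about.HTM are kept. The name 'htm' is rejected because it contains no dot.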

To run the script you will need to have a Perl interpreter installed on your computer. If the above script were called script1.pl, for example, you could run it by double-clicking it in, say, Windows Explorer. Alternatively, you could open a DOS window, change to the directory where the files to be processed are stored, and type 'perl script1.pl' (without the quotes) at the command line prompt.

The Substitution Operator

As mentioned above, line 9 is where the pattern matching (substitution) operation, performed by using a regular expression, goes.

The substitution operation uses the s/// operator to change strings. So, for example, you could change all occurrences of the word 'hello' to 'goodbye', or all occurrences of the HTML tag '<br>' to '<hr>'.

To change the word 'hello' to 'goodbye', you would use the following regular expression (in place of line 9 in the above script):

    $line =~ s/hello/goodbye/;

The '$line' variable holds the current string being processed, and it is modified only if the substitution succeeds, that is, if the word 'hello' is found, it is changed to 'goodbye'.

The '=~' is called the binding operator; it tells Perl to apply the substitution to the variable on its left. The 's' stands for 'substitution'.
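As a small illustration of how '=~' and s/// work together, the sketch below applies the same substitution to two made-up strings. The variable names other than $line are ours. The s/// operator also returns a true value when it changes something and a false value when it does not, which the example uses to report what happened.

```perl
use strict;
use warnings;

# A string that contains 'hello': the substitution succeeds.
my $line  = "hello world\n";
my $found = ($line =~ s/hello/goodbye/);

# A string without 'hello': the substitution fails and the string is untouched.
my $other  = "good morning\n";
my $missed = ($other =~ s/hello/goodbye/);

print $line;
print "first string was changed\n"      if $found;
print "second string was not changed\n" unless $missed;
```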

Substitution Options

The above regular expression will only change the first occurrence of 'hello' to 'goodbye' in the string being processed. So, for example, if we had the following bit of text, only the first 'hello' would be changed:

I would like to say hello to everyone who knows me, and hello to everyone who does not know me.

The line of text would become:

I would like to say goodbye to everyone who knows me, and hello to everyone who does not know me.

The following regular expression, on the other hand, will change all occurrences:

    $line =~ s/hello/goodbye/g;

The 'g' at the end of the expression means 'global', and ensures that all occurrences of the word 'hello' are changed.
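To see the difference for yourself, the following sketch runs both forms of the substitution on copies of the example sentence. The variable names are ours, chosen for illustration.

```perl
use strict;
use warnings;

my $text = "I would like to say hello to everyone who knows me, "
         . "and hello to everyone who does not know me.";

# Without 'g': only the first 'hello' is changed.
my $first_only = $text;
$first_only =~ s/hello/goodbye/;

# With 'g': every 'hello' is changed. The operator returns the number
# of substitutions it made.
my $all   = $text;
my $count = ($all =~ s/hello/goodbye/g);

print "$first_only\n";
print "$all\n";
print "substitutions made with g: $count\n";
```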

There are several other substitution options. For example, 'i' (for insensitive) ignores the case of letters, so the word 'hello' would be substituted irrespective of the upper and lower case letters used: HeLlo, hELLo, hellO, hello, and HELLO would all be substituted.
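A quick way to check the 'i' option is to run it over the spellings listed above. This is only a sketch, and the array names are ours.

```perl
use strict;
use warnings;

my @greetings = ('HeLlo', 'hELLo', 'hellO', 'hello', 'HELLO');
my @changed;

for my $word (@greetings) {
    # Copy the word, then substitute in the copy, ignoring case.
    (my $copy = $word) =~ s/hello/goodbye/i;
    push @changed, $copy;
}

print join(', ', @changed), "\n";
```

Note that the replacement text is inserted exactly as written: every spelling becomes the lower-case 'goodbye'.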

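Other substitution options include:

s, which causes the string to be treated as a single line, so that the '.' meta character will also match the end-of-line character (\n), and

m, which causes the string to be treated as multiple lines, so that '^' and '$' match at the start and end of each line within the string rather than only at the start and end of the whole string.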
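The effect of 's' and 'm' only shows up when a string holds more than one line. The sketch below uses a made-up two-line string; it relies on '.' (any character except a newline) and '^' (start of string or line), and the variable names are ours.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without 's', '.' stops at the newline, so only the first line's text goes.
(my $no_s = $text) =~ s/first.*//;      # leaves "\nsecond line\n"

# With 's', '.' also matches "\n", so everything from 'first' onwards goes.
(my $with_s = $text) =~ s/first.*//s;   # leaves an empty string

# Without 'm', '^' matches only at the very start of the string.
(my $no_m = $text) =~ s/^/> /g;         # "> first line\nsecond line\n"

# With 'm', '^' matches at the start of every line.
(my $with_m = $text) =~ s/^/> /mg;      # "> first line\n> second line\n"

print $no_m, $with_m;
```

In the template script each $line holds a single line of the file, so these two options matter mainly when you read a whole file into one string.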

Meta Characters

Meta characters are characters that have a special meaning when creating search patterns. For example, the '|' character is the alternation meta character, and lets you specify two values, either of which can be matched for the substitution to succeed. For example, the following regular expression will succeed if either 'hello' or 'hi' is found; both will be changed to 'goodbye'.

    $line =~ s/hello|hi/goodbye/g;

Note: if the 'g' (for global - see above) was omitted, only the first occurrence found of either 'hello' or 'hi' would be substituted.
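One thing to watch for with short alternatives such as 'hi' is that they also match inside longer words. The sketch below shows the problem on a made-up sentence, together with one possible remedy that goes beyond what this tutorial covers: the '\b' word-boundary sequence and the non-remembering group '(?:...)'.

```perl
use strict;
use warnings;

my $line = "hello and hi to this group";

# Plain alternation: the 'hi' inside 'this' is replaced as well.
(my $plain = $line) =~ s/hello|hi/goodbye/g;

# With word boundaries, only the whole words 'hello' and 'hi' are replaced.
(my $bounded = $line) =~ s/\b(?:hello|hi)\b/goodbye/g;

print "$plain\n";     # goodbye and goodbye to tgoodbyes group
print "$bounded\n";   # goodbye and goodbye to this group
```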

A couple of other commonly used meta characters are '*' and '+'. The '*' matches the character immediately to its left zero or more times, whereas the '+' matches the character immediately to its left one or more times. These meta characters are often used in conjunction with meta sequences (see below).

If you want a meta character to be treated as a normal character, for example, if you want a '+' to be treated as a plus sign rather than a meta character, you need to place a backslash '\' in front of it: '\+'. By doing this, the regular expression won't try to process the '+' but will instead treat it as a normal character.
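The following sketch compares '*' and '+' on a made-up string, and then shows an escaped '\+' matching a real plus sign. The variable names are ours.

```perl
use strict;
use warnings;

my $line = "ac abc abbc";

# 'b*' allows zero or more b's, so 'ac', 'abc' and 'abbc' all match.
(my $star = $line) =~ s/ab*c/X/g;      # "X X X"

# 'b+' needs at least one b, so 'ac' is left alone.
(my $plus = $line) =~ s/ab+c/X/g;      # "ac X X"

# An escaped '+' is treated as an ordinary plus sign.
my $sum = "1+1=2";
(my $escaped = $sum) =~ s/\+/ plus /;  # "1 plus 1=2"

print "$star\n$plus\n$escaped\n";
```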

Meta Sequences

Meta sequences are characters that are given special meaning by virtue of having a backslash '\' placed in front of them. For example, '\d' means a single digit. So, to match (and substitute) one or more digits you would use '\d+':

    $line =~ s/\d+//g;

The above regular expression would find every run of one or more digits and replace it with nothing, that is, delete it.

Other meta sequences include '\s' (a single whitespace character, such as a space, tab, or newline), '\t' (tab), and '\r' (carriage return).
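Here is a short sketch of '\d' and '\s' at work on a made-up line that contains a tab, a double space, and a Windows-style line ending. The variable names are ours.

```perl
use strict;
use warnings;

my $line = "Room 101,\tfloor  3\r\n";

# Delete every run of digits.
(my $no_digits = $line) =~ s/\d+//g;   # "Room ,\tfloor  \r\n"

# Replace every run of whitespace (spaces, tab, \r, \n) with a single space.
(my $squeezed = $line) =~ s/\s+/ /g;   # "Room 101, floor 3 "

print "[$squeezed]\n";
```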

Meta Brackets

Various types of brackets can be used as part of a regular expression. For example, square brackets [...] create a character class, which matches any one of the characters listed inside it. For instance, the following substitution succeeds if a, b, or c is found, and deletes every occurrence of those letters.

    $line =~ s/[abc]//g;

Note: [abc] means the same as a|b|c.
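The equivalence can be checked with a short sketch on a made-up string; the variable names are ours.

```perl
use strict;
use warnings;

my $line = "a cab is back";

# Character class: delete every a, b, or c.
(my $class = $line) =~ s/[abc]//g;

# Alternation written out in full: the same result.
(my $alt = $line) =~ s/a|b|c//g;

print "[$class]\n";
print "same result\n" if $class eq $alt;
```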

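The (...) bracket sequence can be used to remember (capture) the part of the string matched by the pattern inside it. For example, the following substitution matches and remembers everything on the current line up to the end-of-line character, and then deletes it. Applied to every line in the template script, it empties each line of the file.

    $line =~ s/(.*)//;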

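The '.', '*', and '?' are all meta characters. The '.' matches any single character except the end-of-line character, the '*' repeats it zero or more times, and a '?' placed after the '*' makes the match stop as early as possible rather than as late as possible. Used together between a starting point and an ending point, they match everything in between, as here: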

    $line =~ s/<h1>(.*?)<\/h1>/<h2>$1<\/h2>/g;

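This substitution will change all level-one headings to level-two headings in an HTML file, provided the opening and closing tags of each heading are on the same line. The '$1' is a special variable that holds the text matched by the first bracketed group, here '(.*?)'.

Notice that in the closing heading tags, a backslash '\' is placed in front of the slash '/'. This is because '/' is the character that separates the parts of the s/// operator; without the backslash, Perl would treat the slash in '</h1>' as the end of the search pattern.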
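The sketch below runs the heading substitution on a made-up line with two headings. It also shows two variations that are not part of the example above: the same substitution written with braces as delimiters, which Perl allows and which removes the need for the backslashes, and a version without the '?', to show why the '?' matters.

```perl
use strict;
use warnings;

my $line = "<h1>Welcome</h1> text <h1>News</h1>\n";

# The substitution from the tutorial: each heading is converted separately.
(my $converted = $line) =~ s/<h1>(.*?)<\/h1>/<h2>$1<\/h2>/g;

# The same substitution with braces as delimiters, so '/' needs no backslash.
(my $braces = $line) =~ s{<h1>(.*?)</h1>}{<h2>$1</h2>}g;

# Without the '?', '.*' runs on to the LAST '</h1>' on the line,
# so the two headings are treated as one.
(my $greedy = $line) =~ s/<h1>(.*)<\/h1>/<h2>$1<\/h2>/g;

print $converted;   # <h2>Welcome</h2> text <h2>News</h2>
print $greedy;      # <h2>Welcome</h1> text <h1>News</h2>
```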


Author: John Dixon
John Dixon Technology Ltd








© 2007-2009 - John Dixon Technology Ltd
