JDT

 

John Dixon
Technology
Limited

 

Using Perl and Regular Expressions to Process ASCII Files - Part 3


In Part 1 we had a quick look at what Perl and regular expressions are, and introduced the idea of using them to process HTML files. In Part 2 we developed a simple Perl script to process a single HTML file. In this part we'll look at how to process multiple files, which is often where this kind of processing comes into its own.

The script we looked at in Part 2 (script1.pl - repeated below for convenience) has one major drawback, making it virtually unusable in real terms: the name of the web page (HTML file) that the script processes is hard coded into the script itself. For the script to be useful, we need to be able to run it on any web page. Changing the script so that it can do this is fairly straightforward.

Below, I've given two scripts: script1.pl, which was our original script from Part 2, and script2.pl, which is a new script that will process multiple files.

script1.pl

1 open (IN, "file1.htm");
2 open (OUT, ">new_file1.htm");
3 while ($line = <IN>) {
4     $line =~ s/<h1>/<h1 class="big">/;
5     (print OUT $line);
6 }
7 close (IN);
8 close (OUT);

script2.pl

1 foreach $file (@ARGV) {
2     rename $file, "$file.bak";
3     open (IN, "<$file.bak");
4     open (OUT, ">$file");
5     while ($line = <IN>) {
6         $line =~ s/<h1>/<h1 class="big">/;
7         print OUT $line;
8     }
9     close IN;
10     close OUT;
11 }

Before looking at each line of the script in detail, let's quickly establish what script2.pl does. It processes one or more files named on the command line (for example, at the MS-DOS prompt). For each file, the script first keeps the original as a backup and then changes the first occurrence of <h1> on each line to <h1 class="big">.

A couple of definitions:

Variable
A temporary storage place for a value. In the above script, $file is a variable. The filename file1.htm, which will be entered at the command line prompt, is a value that will be temporarily stored in that variable when the script is run.

Array
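A storage place for a list of values. In the above script, @ARGV is an array. Perl fills it automatically with whatever is typed after the script name on the command line, which in our case is the list of filenames to process.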
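To make the two definitions concrete, here is a minimal sketch of a variable and an array side by side. The names $file, @files and $count, and the filenames themselves, are made up for illustration; only @ARGV is built into Perl.

```perl
use strict;
use warnings;

# A variable holds a single value; its name starts with $.
my $file = "file1.htm";
print "One file: $file\n";

# An array holds a list of values; its name starts with @.
# These filenames are invented for the example.
my @files = ("file1.htm", "file2.htm", "file3.htm");

# A single item is read with $ and an index that counts from 0.
print "First file: $files[0]\n";

# Assigning an array to a variable gives the number of items in it.
my $count = @files;
print "Number of files: $count\n";

# @ARGV is the built-in array of command line arguments.
print "Arguments given: @ARGV\n";
```

Run with no arguments, the last line prints an empty list; run as "perl example.pl file1.htm" (example.pl being whatever name you save it under), it prints file1.htm.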

Let's take a look at each line of script2.pl.

Line 1
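This line starts a 'foreach' loop over @ARGV, the array that holds the filenames typed after the script name on the command line. Each filename in turn is stored in $file, so one or more files can be processed in a single run. We only have one file, file1.htm, so when we run the script we'll only enter one file to be processed.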

Line 2
This line renames each file by adding .bak to its name, so the original contents are kept as a backup before any processing happens. So, for file1.htm, the backup file would be file1.htm.bak.
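One thing to be aware of: on most systems, rename will quietly replace an existing file1.htm.bak, so running the script twice on the same file would lose the original backup. The following is only a sketch of one way to guard against that. The helper name make_backup is hypothetical and is not part of script2.pl.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of script2.pl): rename a file to
# "<name>.bak", but refuse to replace a backup that already exists.
sub make_backup {
    my ($file) = @_;
    my $backup = "$file.bak";

    # Stop rather than overwrite an earlier backup.
    die "Backup $backup already exists\n" if -e $backup;

    # rename returns false on failure; $! holds the reason.
    rename $file, $backup
        or die "Cannot rename $file to $backup: $!\n";

    return $backup;
}
```

Calling make_backup("file1.htm") would leave file1.htm.bak in place of file1.htm, ready for lines 3 and 4 of the script to read from one and write to the other.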

Line 3
This line opens a filehandle (IN) for reading the backup file, which now holds the original contents of the file being processed. See Part 2 for more information about filehandles.

Line 4
This line opens another filehandle, but this time for the output from the script.

Note: file1.htm.bak will contain the contents of the file from before the script is run. file1.htm will contain the updated contents, that's to say, the output from the script.

Line 5
This line sets up a loop in which each line in the input file (the file being processed) will be examined individually.

Line 6
This is the regular expression. It searches for one occurrence of <h1> on each line of the input file and, if it finds one, changes it to <h1 class="big">.

See Part 2 for a full description of the actual regular expression.
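Because the substitution on line 6 stops after the first match, a line containing two <h1> tags would only have the first one changed. The short sketch below compares that behaviour with the /g modifier, which replaces every match on the line. The sample line and the variable names are made up for the example.

```perl
use strict;
use warnings;

# A made-up line containing two <h1> tags.
my $line = "<h1>One</h1><h1>Two</h1>\n";

# Without /g, only the first match on the line is replaced.
(my $first_only = $line) =~ s/<h1>/<h1 class="big">/;

# With /g, every match on the line is replaced.
(my $all = $line) =~ s/<h1>/<h1 class="big">/g;

print $first_only;   # <h1 class="big">One</h1><h1>Two</h1>
print $all;          # <h1 class="big">One</h1><h1 class="big">Two</h1>
```

Most web pages only have one <h1> per line, so the script as written is usually enough, but /g is worth knowing about if yours don't.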

Line 7
This line takes the contents of the $line variable and, via the OUT filehandle, writes the line to the output file.

Line 8
This line closes the 'while' loop. The loop is repeated until all the lines in the file currently being processed have been examined.

Lines 9 and 10
These two lines close the two filehandles that have been used in the script.

Line 11
This line closes the 'foreach' loop. The loop is repeated until all the files entered at the command line prompt have been processed.
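script2.pl assumes that every rename and open succeeds. If a filename is mistyped, it will carry on silently and produce nothing useful. As a rough sketch only, here is the same logic with some checking added. The sub name process_file and the wording of the messages are my own choices, and it uses "use strict" and "my" variables, which this series hasn't covered.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): applies the same
# steps as script2.pl to one file, but stops with a message on failure.
sub process_file {
    my ($file) = @_;
    my $backup = "$file.bak";

    # Keep the original contents under the .bak name.
    rename $file, $backup
        or die "Cannot rename $file to $backup: $!\n";

    # Read from the backup, write a new file under the original name.
    open(my $in, '<', $backup)
        or die "Cannot read $backup: $!\n";
    open(my $out, '>', $file)
        or die "Cannot write $file: $!\n";

    while (my $line = <$in>) {
        # Same substitution as script2.pl: first <h1> on each line.
        $line =~ s/<h1>/<h1 class="big">/;
        print {$out} $line;
    }

    close $in;
    close $out or die "Cannot close $file: $!\n";
}

# Process every file named on the command line.
process_file($_) foreach @ARGV;
```

It is run in exactly the same way as script2.pl. The only visible difference is that a missing or unreadable file produces an error message instead of being silently skipped.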

Running the Script

To run the script, at the command line type:

     C:\>perl script2.pl file1.htm

If the script executes successfully, a new file called file1.htm.bak should be created, which is a backup of the original file (i.e. as it was before processing). A new version of file1.htm should also have been produced, containing the modified <h1> tag.
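To process several files at once, list them after the script name, separated by spaces. Note that the Windows command prompt does not expand wildcards such as *.htm itself, so the script would receive the literal text *.htm as a filename. One possible way round this, shown here purely as a sketch, is to let Perl's built-in glob function do the expansion. The helper name expand_args is hypothetical, and the sketch assumes the filenames contain no spaces.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of script2.pl): expand any wildcard
# patterns among the arguments into the filenames that match them.
# Assumes the names contain no spaces. A plain filename with no
# wildcard in it is passed through unchanged.
sub expand_args {
    my @args = @_;
    return map { glob($_) } @args;
}

# Show which files would be processed, without changing anything.
my @files = expand_args(@ARGV);
print "Would process: $_\n" foreach @files;
```

Using the expanded list in place of @ARGV in the 'foreach' loop would then let a pattern such as *.htm stand for every matching file in the current directory.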


Author: John Dixon
John Dixon Technology Ltd








© 2007-2009 - John Dixon Technology Ltd
