Episode 3.

Meanwhile, on Tatooine ...

Something I overlooked, as I scrawled in my near-stupor
last time, is that most of the packaging for data traveling
on the internet is text.  E-mail is sent as text, including
the headings, content, etc.  Web pages are composed
using HTML (text), JavaScript (text), and other scripting
types (also text).  Data packaged for transport rather than
display may be sent using XML (more text).

You get the point.  And what is Perl REALLY good at?
Duh ... text?  Why yes, how did you guess?

Today I was asked to write a program to scan some
rather large files, summarizing the counts of things
like transaction records by property code, date, brand,
and that sort of thing.  The resulting program, of less
than 350 lines of code, scanned and summarized
a 250MB file (230,000 lines) in just 30 seconds.  It
also did the same for a 2.2GB (2,200MB) file of more
than 2 million lines in under 5 minutes.

Now, granted, this is running on a Sun Unix box, so
your mileage may vary, but it ran at speeds comparable
to compiled C code -- and this is interpreted.

I'm not going to include the whole thing here, but
a few lines of it may provide some insight into the
kind of thing you might do with Perl.

(Note:  the lines beginning with '#' are comments,
 as are the line-ends which begin with '#' -- Perl has
 no multi-line comment syntax, but its built-in
 documentation feature, POD, achieves that nicely.)
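
As a minimal sketch of that documentation feature: perl skips everything
from a line starting with "=pod" down to the matching "=cut", so the pair
works as a multi-line comment.  The variable name and the arithmetic are
made up for illustration and are not from the actual program.

```perl
use strict;
use warnings;

=pod

Everything between =pod and =cut is skipped by the perl interpreter,
so it can serve as a multi-line comment or as real documentation.

=cut

my $answer = 6 * 7;    # an end-of-line comment
print "$answer\n";
```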

Near the top of the program, I have a little segment
that prints the name of the program (including the
path to the perl interpreter) along with its arguments.
----------
# reserved variables:
# $^X   = name used to invoke the perl binary (often the full path)
# $0    = name of actual program or script
# @ARGV = list of command line args, indexed 0 to n-1 (script name not included)
print "Now running: (#! $^X) $0 @ARGV\n"; # Perl binary, program name, argv list
----------
This printed:  Now running: (#! /usr/bin/perl) fxvu -f G2M080120002711.DAT
----------
(Those of you familiar with Unix will recognize the "shebang" notation,
 that little "#! /usr/bin/perl" line, placed at the very beginning of
 scripts (the Unix counterpart of DOS "batch" files) to tell the OS which
 command interpreter will be used to run this list of commands.  Unix
 supports several shells, so when writing a script, this is important.)
The @ARGV argument list contained everything after "fxvu" (that's "-f" and
"G2M080120002711.DAT"), but printing it out only required a single
reference to @ARGV.
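
Here is a small sketch of the same idea that can be run on its own.  The
helper name describe_run() is hypothetical, used only for this example.
It also shows that @ARGV is an ordinary array whose indexes start at 0.

```perl
use strict;
use warnings;

# Build the banner line from the interpreter name, script name, and args.
# describe_run() is a hypothetical helper, not part of the original program.
sub describe_run {
    my ( $interp, $script, @args ) = @_;
    # An array inside double quotes is joined with single spaces.
    return "Now running: (#! $interp) $script @args";
}

print describe_run( $^X, $0, @ARGV ), "\n";

# Walk the argument list by index; the script name is not part of @ARGV.
for my $i ( 0 .. $#ARGV ) {
    print "  arg $i = $ARGV[$i]\n";
}
```

Run it as, say, "perl banner.pl -f somefile" (the file names are made up)
to see the banner followed by one line per argument.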

A little later, I declare a formatting mask for the columns I
will output for the summary report.
----------
format OUTPUT =
@>>>>>>>>>>>>>>>>>>>> = @>>>>>>>>>
$label, $value
.
----------
This declares a format named OUTPUT, which expects
two variables ($label and $value), which it places in the
field areas marked with ">>>>" (normally would be "<<<<"
but I want this right-justified).
This format makes it possible later in the program to say
stuff like $label = "blah"; $value = a_number; write;
to produce a line in the report.
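
A self-contained sketch of that format in use, assuming strict mode (so
$label and $value are declared with "our", which the original excerpt
does not show).  The sample rows are made up.

```perl
use strict;
use warnings;

our ( $label, $value );    # the format reads these each time write runs

format OUTPUT =
@>>>>>>>>>>>>>>>>>>>> = @>>>>>>>>>
$label, $value
.

$~ = "OUTPUT";    # use the OUTPUT format for the selected handle (STDOUT)

# Made-up rows, standing in for the real tallies.
for my $row ( [ "blah", 42 ], [ "Total transactions", 1234 ] ) {
    ( $label, $value ) = @$row;
    write;        # emits one right-justified report line
}
```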

Some way farther down, I have a routine that reads the
lines from the input file and does stuff with them.
----------
sub scan_file {
  my ( $FH, $fname ) = @_;  # fetch passed parameters
  while ( <$FH> ) {    # fetch an input line from FH
    chomp;    # strip the trailing newline from the record
    ( $prop, $brand, $txcode, $txdate ) = /(.{6})(.{5}).{1113}(.)(.{8})/;
----------
This last line means, from a line of text that's 1151 bytes
long, grab the first six characters, the next five, skip 1113, grab
one and then eight, and assign the line segments to the
four variables enclosed in parens at the left end of the
line (prop, brand, txcode, txdate).
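
To try the idea without a 1151-byte record, here is a sketch on a
scaled-down, made-up layout (6, 5, skip 3, 1, 8).  It also shows
unpack, which is an alternative for fixed-width fields and not what the
program above uses.  One difference to keep in mind: unpack's "A" fields
drop trailing spaces, while the regex keeps them.

```perl
use strict;
use warnings;

# Made-up record: 6-char property, 5-char brand, 3 filler characters,
# 1-char transaction code, 8-char date, then data we do not care about.
my $line = "PROP01BRNDAxxxS20020801trailing data";

# Regex version, same shape as the pattern in the program.
my ( $prop, $brand, $txcode, $txdate ) =
    $line =~ /^(.{6})(.{5}).{3}(.)(.{8})/;

# unpack version: A<n> takes an n-character text field, x<n> skips n bytes.
my @fields = unpack "A6 A5 x3 A1 A8", $line;

print "regex:  $prop $brand $txcode $txdate\n";
print "unpack: @fields\n";
```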
----------
    $proptally{ $prop } += 1;  # total transactions for a property
    $brandtally{ $brand } += 1;  # total transactions for a brand
----------
which creates an array called "proptally" which uses the
WORD contained in "prop" as an index, and accumulates
a count of each of the values for "prop".  Same for brand.
Such an array is called an "associative" array or "hash,"
a very powerful way of handling lists of data.
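
A tiny stand-alone sketch of the tallying idiom, using made-up property
codes in place of the values parsed from the file:

```perl
use strict;
use warnings;

# Made-up property codes, standing in for values read from the input.
my @props = qw( NYC001 LAX002 NYC001 CHI003 NYC001 LAX002 );

my %proptally;
# A key that does not exist yet is created on first use and counts up from 1.
$proptally{$_} += 1 for @props;

for my $prop ( sort keys %proptally ) {
    print "$prop = $proptally{$prop}\n";
}
```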
----------
  $~ = "OUTPUT";    # use the OUTPUT format for the selected handle (STDOUT)
  $tottrx = 0;            # zero the local total
  $nkeys = keys(%proptally);    # count number of keys in list
  print "\nTransaction counts for Properties: ($nkeys rows)\n";
  # now sort the proptally index keywords into another array
  @ordprop = sort { $a cmp $b } keys (%proptally);
  foreach ( @ordprop ) {        # walk through the whole list
    $label = $_;                        # current keyword is the label
    $value = $proptally{ $_ };    # which points to the counter
    write;                        # print formatted output line
    $tottrx += $value;    # accumulate the total for counts
  }
  # format the total line same as above
  $label = "Total transactions"; $value = $tottrx;
  write;
----------
and this process is carried out for each of 11 different
totals groups:
----------
    $x = ":";
    $proptally{ $prop } += 1;  # total transactions for a property
    $brandtally{ $brand } += 1;  # total transactions for a brand
    $trxtally{ $txcode } += 1;  # total transactions for a tx code
    $datetally{ $txdate } += 1;  # total transactions for a date
    $proptrxcnt{ $prop . $x . $txcode } += 1; # total by property+txcode
    $brandtrxcnt{ $brand . $x . $txcode } += 1; # total by brand+txcode
    $trxbydate{ $txdate . $x . $txcode } += 1; # total by date+txcode
    $propdatetrx{ $prop . $x . $txdate . $x . $txcode } += 1; # total by property+date+txcode
    $branddatetrx{ $brand . $x . $txdate . $x . $txcode } += 1; # total by brand+date+txcode
    $dateproptrx{ $txdate . $x . $prop . $x . $txcode } += 1; # total by date+property+txcode
    $datebrandtrx{ $txdate . $x . $brand . $x . $txcode } += 1; # total by date+brand+txcode
----------
Note the "." period used to concatenate (join) strings
together to form compound keys for the different
totalling lists.
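
A short sketch of building such a compound key and taking it apart again
at report time.  The records are made up, and the split only works if
the separator never appears inside the fields themselves, which is an
assumption about the data.

```perl
use strict;
use warnings;

my $x = ":";
my %proptrxcnt;

# Made-up (property, txcode) pairs standing in for parsed records.
for my $rec ( [ "NYC001", "S" ], [ "NYC001", "R" ], [ "NYC001", "S" ] ) {
    my ( $prop, $txcode ) = @$rec;
    # join() builds the same key as $prop . $x . $txcode
    $proptrxcnt{ join( $x, $prop, $txcode ) } += 1;
}

for my $key ( sort keys %proptrxcnt ) {
    # Recover the parts; \Q...\E treats the separator as literal text.
    my ( $prop, $txcode ) = split /\Q$x\E/, $key;
    print "$prop / $txcode = $proptrxcnt{$key}\n";
}
```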

Now, this may not be the most elegant Perl code,
but it achieved the desired result, ran like a bat,
and took less than a day to write, including the
time I spent looking stuff up in the O'Reilly books.

Remember, from a business perspective, the most
expensive part of a software project is what you're
paying your programmers, so time saved getting
something running is always valuable.

Always,
Garry

P.S.
  The complete program is available for the asking
  for anyone who wants it directly (off list).
