Devin Weaver wrote:
I don't fully understand what you mean by a cvs file: whether that refers to a concurrent versioning file or whether you meant a comma-separated values file. Based on the sample output, I'm assuming a CSV file using semicolons.

I chose Perl as the Swiss Army knife of scripting languages and was able to whip up a parser in about fifteen minutes. Attached is what I came up with.

I left the loading of multiple files to the student. I used mainly regular expressions, so in theory it could be ported to Vim script, but this type of parsing is better suited to a scripting language than to an editor.

Hope this gives some inspiration.

On Sep 6, 2006, at 06:14, Nikolaos A. Patsopoulos wrote:
I have a huge batch of HTML files (>1000) and I want to extract some info into cvs files.

------------------------------------------------------------------------

#!/usr/bin/perl

# Very simple script to parse a specifically styled HTML document and write the
# extracted fields to a delimited output file.
#
# The following are the settings. Pick what you need. Handling command line
# arguments is left for the student.

$file = "portal_002.htm";
$output = "out.csv";
$csv_delim = ';';
$quiet = 0; # set this to 1 to stop debug output

$months_pat = "(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)";

######
sub msg
{
    my $str = shift;
    my $line_no = shift;

    if (!$quiet)
    {
        print $str;
        if ($line_no ne "")
        {
            print " (line: $line_no)";
        }
        print "\n";
    }
}

$line_no = 0; # used to track the line number.
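# Use low-precedence "or" here: with "||" the die binds to the filename string
# and never fires when open fails.
open FD, "<$file" or die "Could not open file '$file': $!";
open OUT, ">$output" or die "Unable to open output file '$output': $!";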
while ($line = <FD>)
{
    $line_no++;
    if ($line =~ /Source:/i)
    {
        $line =~ /$months_pat\s+[0-9]+\s+([0-9]+)/i;
        $year = $2; # $1 is the month captured by $months_pat; $2 is the year
        msg ("Found 'Source:'; Year = $year", $line_no);
    }
    elsif ($line =~ /Addresses:/i)
    {
        $line =~ /<a(\s.+?)?>(.+?)<\/a>/i;
        $univ = $2; # $1 holds the <a> tag attributes; $2 is the link text
        $univ =~ s/^\s+//;
        $univ =~ s/(\s+|[,;])$//;
        # convert the HTML entity &amp; back to a plain ampersand
        $univ =~ s/&amp;/&/gi;
        msg ("  Child Found 'Addresses:'; Univ = $univ", $line_no);
        # Since this should be the end of the record write to file.
        print OUT "$year$csv_delim$univ$csv_delim\n";
    }
}
close OUT;
close FD;
msg ("Done. (Parsed $line_no lines) CSV output to $output", "");
------------------------------------------------------------------------
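
The script above handles one hard-coded file and leaves the loading of multiple files as an exercise. The following is only a rough sketch of one way that step might look, not part of the original attachment. It reuses the same regular expressions; the helper names extract_records and process_files are hypothetical, and it assumes every input file follows the same "Source:" / "Addresses:" layout as the sample.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same settings and month pattern as the single-file script.
# The month group is capture $1, so the year ends up in $2.
my $months_pat = "(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)";
my $csv_delim  = ';';

# Hypothetical helper: scan the lines of one document and return a list of
# [year, university] pairs, one per matching "Addresses:" line.
sub extract_records
{
    my @lines = @_;
    my $year  = "";
    my @records;

    foreach my $line (@lines)
    {
        if ($line =~ /Source:/i)
        {
            # Only update the year when the date pattern really matches.
            if ($line =~ /$months_pat\s+[0-9]+\s+([0-9]+)/i)
            {
                $year = $2;
            }
        }
        elsif ($line =~ /Addresses:/i)
        {
            next unless $line =~ /<a(\s.+?)?>(.+?)<\/a>/i;
            my $univ = $2;
            $univ =~ s/^\s+//;
            $univ =~ s/(\s+|[,;])$//;
            $univ =~ s/&amp;/&/gi;
            push @records, [$year, $univ];
        }
    }
    return @records;
}

# Hypothetical helper: parse every file in @files and append the records to
# a single output file, in the same "year;univ;" format as the original.
sub process_files
{
    my ($output, @files) = @_;
    my $count = 0;

    open(my $out, '>', $output) or die "Unable to open '$output': $!";
    foreach my $file (@files)
    {
        open(my $in, '<', $file) or die "Could not open '$file': $!";
        my @lines = <$in>;
        close $in;
        foreach my $rec (extract_records(@lines))
        {
            print $out "$rec->[0]$csv_delim$rec->[1]$csv_delim\n";
            $count++;
        }
    }
    close $out;
    return $count;
}

# Expand wildcards inside Perl so that a pattern such as *.htm also works
# from the Windows command prompt, which passes it through unexpanded.
my @files = map { glob($_) } @ARGV;
if (@files)
{
    my $n = process_files("out.csv", @files);
    print "Wrote $n records from ", scalar(@files), " files to out.csv\n";
}
```

Saved under a name of your choosing (say parse_all.pl, also hypothetical), it would be run as "perl parse_all.pl *.htm". Note that glob splits its pattern on whitespace, so file names containing spaces would need extra care.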


Thanks for the time and effort. I work on a WinXP machine and cannot boast of my Perl knowledge. From the little code I can understand, it seems that you are close to what I want to do, but much is missing. I'm sorry, but I'm unable to follow a Perl script.

Thanks,


Nikos
