I don't fully understand what you mean by a cvs file: whether that refers to a CVS (Concurrent Versions System) file or whether you meant a comma-separated values (CSV) file. Based on the sample output, I'm assuming a CSV file that uses semicolons as the delimiter.

I chose Perl as the Swiss Army knife of scripting languages and was able to whip up a parser in about fifteen minutes. Attached is what I came up with.

I left the loading of multiple files as an exercise for the student. I used mainly regular expressions, so in theory it could be ported to Vim script, but this type of parsing is better suited to a scripting language than to an editor.
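
For what it's worth, here is a minimal sketch of how the multi-file loop might look. It is not part of the attached script: the helper name parse_file and the glob pattern portal_*.htm are assumptions made for illustration, and the regular expressions simply mirror the ones in the attachment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (the name parse_file is an assumption for this sketch):
# parse one HTML file and return its CSV rows as a list of strings.
sub parse_file
{
    my ($file, $delim) = @_;
    my @rows;
    my $year = "";

    open(my $fd, "<", $file) or die "Could not open $file: $!";
    while (my $line = <$fd>)
    {
        if ($line =~ /Source:/i)
        {
            # $1 is the month name, $2 is the year.
            if ($line =~ /(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s+[0-9]+\s+([0-9]+)/i)
            {
                $year = $2;
            }
        }
        elsif ($line =~ /Addresses:/i && $line =~ /<a(?:\s.+?)?>(.+?)<\/a>/i)
        {
            my $univ = $1;
            $univ =~ s/^\s+//;            # strip leading whitespace
            $univ =~ s/(\s+|[,;])$//;     # strip trailing whitespace or punctuation
            $univ =~ s/&amp;/&/gi;        # decode the HTML ampersand entity
            push @rows, "$year$delim$univ$delim";
        }
    }
    close($fd);
    return @rows;
}

# Assumed file naming: every portal_*.htm in the current directory.
open(my $out, ">", "out.csv") or die "Unable to open out.csv: $!";
foreach my $file (sort glob("portal_*.htm"))
{
    print $out "$_\n" foreach parse_file($file, ";");
}
close($out);
```

Moving the parsing into a subroutine keeps the per-file state (the current year) from leaking between files, and all rows end up in a single output file.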

Hope this gives some inspiration.

On Sep 6, 2006, at 06:14, Nikolaos A. Patsopoulos wrote:
I have a huge pack of html files (>1000) and I want to extract some info on cvs files.

#!/usr/bin/perl

# Very simple script to parse a specifically styled HTML document and write
# the extracted fields to a delimited output file.
#
# The following are the settings. Pick what you need. Handling command-line
# arguments is left as an exercise for the student.

$file = "portal_002.htm";
$output = "out.csv";
$csv_delim = ';';
$quiet = 0; # set this to 1 to stop debug output

$months_pat = "(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)";

######
sub msg
{
    my $str = shift;
    my $line_no = shift;

    if (!$quiet)
    {
	print $str;
	if ($line_no ne "")
	{
	    print " (line: $line_no)";
	}
	print "\n";
    }
}

$line_no = 0; # used to track the line number.
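# Parenthesize the arguments: without them, || binds to the filename string
# and die is never reached when open fails.
open(FD, "<$file") || die "Could not open file '$file': $!";
open(OUT, ">$output") || die "Unable to open output file '$output': $!";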
while ($line = <FD>)
{
    $line_no++;
    if ($line =~ /Source:/i)
    {
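	# $1 is the month (captured by $months_pat), $2 is the year.
	# Only assign on a successful match so a stale $2 is never reused.
	if ($line =~ /$months_pat\s+[0-9]+\s+([0-9]+)/i) { $year = $2; }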
	msg ("Found 'Source:'; Year = $year", $line_no);
    }
    elsif ($line =~ /Addresses:/i)
    {
	$line =~ /<a(\s.+?)?>(.+?)<\/a>/i;
	$univ = $2;
	$univ =~ s/^\s+//;
	$univ =~ s/(\s+|[,;])$//;
	# pull out the HTML &amp;
	$univ =~ s/&amp;/&/gi;
	msg ("  Child Found 'Addresses:'; Univ = $univ", $line_no);
	# Since this should be the end of the record write to file.
	print OUT "$year$csv_delim$univ$csv_delim\n";
    }
}
close OUT;
close FD;
msg ("Done. (Parsed $line_no lines) CSV output to $output", "");
