I don't fully understand what you mean by a "cvs file": whether that
refers to a concurrent versioning file or whether you meant a
comma-separated values file. Based on the sample output, I'm assuming
a CSV file that uses semicolons as the delimiter.

I chose Perl as the Swiss Army knife of scripting languages and was
able to whip up a parser in about fifteen minutes. Attached is what I
came up with. I left the loading of multiple files to the student. I
used mainly regular expressions, so in theory it could be ported to
Vim script, but this type of parsing is better suited to a scripting
language than to an editor.

Hope this gives some inspiration.
On Sep 6, 2006, at 06:14, Nikolaos A. Patsopoulos wrote:
I have a huge pack of html files (1000) and I want to extract some
info on cvs files.
#!/usr/bin/perl
# Very simple script to parse a specifically styled HTML document and write
# the extracted fields to a delimited file.
#
# The following are the settings. Pick what you need. Using command line
# arguments is left for the student.
$file = "portal_002.htm";
$output = "out.csv";
$csv_delim = ';';
$quiet = 0; # set this to 1 to stop debug output
$months_pat = "(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)";
##
sub msg
{
    my $str = shift;
    my $line_no = shift;
    if (!$quiet)
    {
        print $str;
        if ($line_no ne "")
        {
            print " (line: $line_no)";
        }
        print "\n";
    }
}
$line_no = 0; # used to track the line number.
# Use the low-precedence 'or' so that die fires when open itself fails.
open(FD, '<', $file) or die "Could not open file";
open(OUT, '>', $output) or die "Unable to open output file";
while ($line = <FD>)
{
    $line_no++;
    if ($line =~ /Source:/i)
    {
        # $1 is the month (from $months_pat), $2 is the year.
        $line =~ /$months_pat\s+[0-9]+\s+([0-9]+)/i;
        $year = $2;
        msg ("Found 'Source:'; Year = $year", $line_no);
    }
    elsif ($line =~ /Addresses:/i)
    {
        # $2 is the text between <a ...> and </a>.
        $line =~ /<a(\s.+?)?>(.+?)<\/a>/i;
        $univ = $2;
        $univ =~ s/^\s+//;
        # strip trailing whitespace and separators
        $univ =~ s/[\s,;]+$//;
        # turn the HTML &amp; entity back into a plain &
        $univ =~ s/&amp;/&/gi;
        msg ("  Child Found 'Addresses:'; Univ = $univ", $line_no);
        # Since this should be the end of the record write to file.
        print OUT "$year$csv_delim$univ$csv_delim\n";
    }
}
close OUT;
close FD;
msg ("Done. (Parsed $line_no lines) CSV output to $output", "");
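
For the multi-file loading left as an exercise, a minimal sketch might look like the following. It assumes every file follows the same 'Source:' / 'Addresses:' layout as the sample, that the file names are passed on the command line, and that all rows go into a single out.csv. The parse_file helper is a name introduced here for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $csv_delim  = ';';
my $months_pat = '(?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)';

# Hypothetical helper: parse one HTML file and return its CSV rows
# (without the trailing newline).
sub parse_file
{
    my ($file) = @_;
    my @rows;
    my $year = '';

    open(my $fd, '<', $file) or die "Could not open $file: $!";
    while (my $line = <$fd>)
    {
        if ($line =~ /Source:/i)
        {
            # Remember the year until the next 'Addresses:' line.
            $year = $1 if $line =~ /$months_pat\s+[0-9]+\s+([0-9]+)/i;
        }
        elsif ($line =~ /Addresses:/i
               && $line =~ /<a(?:\s.+?)?>(.+?)<\/a>/i)
        {
            my $univ = $1;
            $univ =~ s/^\s+//;       # leading whitespace
            $univ =~ s/[\s,;]+$//;   # trailing whitespace and separators
            $univ =~ s/&amp;/&/gi;   # HTML entity back to a plain &
            push @rows, "$year$csv_delim$univ$csv_delim";
        }
    }
    close $fd;
    return @rows;
}

# Only do any work when file names were given on the command line.
if (@ARGV)
{
    open(my $out, '>', 'out.csv') or die "Unable to open out.csv: $!";
    for my $file (@ARGV)
    {
        print $out "$_\n" for parse_file($file);
    }
    close $out;
}
```

Saved under a name of your choosing (say parse_all.pl), it could be run as `perl parse_all.pl portal_*.htm`, letting the shell expand the list of files.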