Devin Weaver wrote:
I don't fully understand what you mean by a cvs file: whether that
refers to a concurrent versioning (CVS) file or whether you meant a
comma-separated values file. Based on the sample output, I'm assuming a
CSV file using semicolons.
I chose Perl as the Swiss Army knife of scripting and was able to whip
up a parser in about fifteen minutes. Attached is what I came up with.
I left the loading of multiple files to the student. I used mainly
regular expressions, so in theory it could be ported to Vim script, but
this type of parsing is better suited to a scripting language than to
an editor.
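For the multi-file part, here is a rough sketch of one way it might look.
It assumes all the HTML files sit in one directory and match *.htm, and
that each file follows the same 'Source:' / 'Addresses:' layout as the
sample. The names parse_file and parse_all are made up for illustration;
they are not part of the attached script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Month abbreviations, non-capturing so the year is capture group 1.
my $months = qr/(?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)/i;

# Hypothetical helper: parse one file and return its "year;univ;" rows.
sub parse_file {
    my ($path, $delim) = @_;
    my $year = '';
    my @rows;
    open(my $in, '<', $path) or die "Could not open $path: $!";
    while (my $line = <$in>) {
        if ($line =~ /Source:/i) {
            # Keep the previous year if this line has no recognizable date.
            $year = $1 if $line =~ /$months\s+[0-9]+\s+([0-9]+)/;
        }
        elsif ($line =~ /Addresses:/i
               && $line =~ /<a(?:\s.+?)?>(.+?)<\/a>/i) {
            my $univ = $1;
            $univ =~ s/^\s+//;
            $univ =~ s/(\s+|[,;])$//;
            $univ =~ s/&amp;/&/gi;    # HTML entity back to a plain &
            push @rows, "$year$delim$univ$delim";
        }
    }
    close($in);
    return @rows;
}

# Hypothetical driver: parse every file matching $pattern into one CSV.
# Returns the number of rows written.
sub parse_all {
    my ($pattern, $output, $delim) = @_;
    open(my $out, '>', $output) or die "Unable to open $output: $!";
    my $count = 0;
    my @files = glob($pattern);    # assumes no spaces in the pattern
    for my $path (sort @files) {
        for my $row (parse_file($path, $delim)) {
            print {$out} "$row\n";
            $count++;
        }
    }
    close($out) or die "Unable to close $output: $!";
    return $count;
}

my $rows = parse_all('*.htm', 'out.csv', ';');
print "Wrote $rows rows to out.csv\n";
```

Run it from the directory that holds the HTML files; every record from
every file ends up in a single out.csv.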
Hope this gives some inspiration.
On Sep 6, 2006, at 06:14, Nikolaos A. Patsopoulos wrote:
I have a huge pack of html files (>1000) and I want to extract some
info on cvs files.
------------------------------------------------------------------------
#!/usr/bin/perl
# Very simple script to parse a specific styled HTML document and output a file
# parsed with a delimiter.
#
# The following are the settings. Pick what you need. Using command line
# arguments is left for the student.
$file = "portal_002.htm";
$output = "out.csv";
$csv_delim = ';';
$quiet = 0; # set this to 1 to stop debug output
$months_pat = "(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)";
######
sub msg
{
    my $str = shift;
    my $line_no = shift;
    if (!$quiet)
    {
        print $str;
        if ($line_no ne "")
        {
            print " (line: $line_no)";
        }
        print "\n";
    }
}
$line_no = 0; # used to track the line number.
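# Use 'or' rather than '||': '||' binds to the filename string, so die never ran.
open(FD, "<$file") or die "Could not open file $file: $!";
open(OUT, ">$output") or die "Unable to open output file $output: $!";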
while ($line = <FD>)
{
    $line_no++;
    if ($line =~ /Source:/i)
    {
        # $1 is the month (from $months_pat), $2 is the year.
        $line =~ /$months_pat\s+[0-9]+\s+([0-9]+)/i;
        $year = $2;
        msg ("Found 'Source:'; Year = $year", $line_no);
    }
    elsif ($line =~ /Addresses:/i)
    {
        # $2 is the text of the first link on the line.
        $line =~ /<a(\s.+?)?>(.+?)<\/a>/i;
        $univ = $2;
        $univ =~ s/^\s+//;
        $univ =~ s/(\s+|[,;])$//;
        # convert the HTML entity &amp; back to a plain &
        $univ =~ s/&amp;/&/gi;
        msg (" Child Found 'Addresses:'; Univ = $univ", $line_no);
        # Since this should be the end of the record write to file.
        print OUT "$year$csv_delim$univ$csv_delim\n";
    }
}
close OUT;
close FD;
msg ("Done. (Parsed $line_no lines) CSV output to $output", "");
------------------------------------------------------------------------
Thanks for the time and effort. I work on a WinXP machine and cannot brag
about my Perl knowledge. From the little code I can understand, it seems
that you are close to what I want to do, but much is missing. I'm sorry,
but I'm unable to follow a Perl script.
Thanks,
Nikos