Re: OT: how to strip out SGML tags?
On Sat, Sep 02, 2000 at 06:53:46PM -0500, Will Trillich wrote:
> > So zero makes the condition '$char = getc(INPUT)' evaluate to false,
> > dumping the flow down to closing the file. What's the perl equivalent
> > of WHILE NOT EOF? <g>
>
>     while (<FILEHANDLE>) { ... }
>
> i.e.
>
>     while ($_ = <FILEHANDLE>) { munge $_; }

The truly lazy would write

    while (<FILEHANDLE>) { munge $_; }

;)

--
Nathan E Norman
Eschew Obfuscation
email:[EMAIL PROTECTED]
http://incanus.net/~nnorman
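For anyone wondering why the line-reading form does not trip over a "0" the way the getc loop did: Perl wraps a bare readline in a while condition in an implicit defined() test. A minimal sketch of that behaviour follows; count_lines and the in-memory filehandle are illustrative names of mine, not something from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the lines in a string by reading it through an in-memory filehandle.
# count_lines is a hypothetical helper used only for this illustration.
sub count_lines {
    my ($text) = @_;
    open(my $fh, '<', \$text) or die "cannot open in-memory handle: $!";
    my $count = 0;
    local $_;
    # Perl treats this as while (defined($_ = <$fh>)), so a line
    # consisting of "0" does not end the loop early.
    while (<$fh>) {
        $count++;
    }
    close($fh);
    return $count;
}

# The last line is a bare "0" with no newline, which is a false string.
print count_lines("10\n0\n0"), "\n";
```

This prints 3, so the false-looking final line is still read.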
OT: how to strip out SGML tags?
I have found a perl script to do this but it doesn't seem to work. Does
anyone know of something that does?

Just for conversation's sake, here's the pl that does _not_ seem to handle
my DocBook SGML:

    #!/usr/bin/perl
    ##
    ## sgmlstripper - Strip SGML markup from input.
    ##
    ## by Robert J Seymour [EMAIL PROTECTED]
    ## Copyright 1995, 1996, Robert Seymour and Springer-Verlag.
    ## All rights reserved. This program may be distributed and/or
    ## modified in electronic form under the same terms as Perl
    ## itself.
    ##
    ## CPAN menu:
    #
    # File Name: sgmlstripper
    # File Size in BYTES: 1469
    # Sender/Author/Poster: Robert J. Seymour [EMAIL PROTECTED]
    # Subject: sgmlstripper - Strip SGML markup from input.
    #
    # sgmlstripper removes SGML markup tags from input (taken through
    # specified files or STDIN). sgmlstripper uses a
    # character-by-character read mode which, though not as fast as a
    # regexp, is guaranteed to strip tags which fall across line or
    # paragraph boundaries and preserves whitespace so that line numbers
    # will be the same (the latter is useful for search engines which
    # don't want to index markup, but want line numbers to be preserved).

    ## Use STDIN if no files are given
    $ARGV[0] = "-" unless @ARGV;

    ## Strip out anything contained in an SGML markup tag. This is not
    ## very pretty and rather inefficient, but it does take care of tags
    ## which cross line or paragraph boundaries.
    foreach $file (@ARGV) {
      open(INPUT,$file);
      while($char = getc(INPUT)) {
        if($char eq "<") {
          IGNORE: for(;;) {
            last IGNORE if (getc(INPUT) eq ">");
          }
        } else {
          print $char;
        }
      }
      close(INPUT);
    }

TIA, and sorry to be spamming the list with this OT post.

--
Bob Bernstein
http://www.ruptured-duck.com
Re[2]: OT: how to strip out SGML tags?
erik [EMAIL PROTECTED] wrote:
> > ## Use STDIN if no files are given
> > $ARGV[0] = "-" unless @ARGV;
> >
> > ## Strip out anything contained in an SGML markup tag. This is not
> > ## very pretty and rather inefficient, but it does take care of tags
> > ## which cross line or paragraph boundaries.
> > foreach $file (@ARGV) {
> >   open(INPUT,$file);
> >   while($char = getc(INPUT)) {
> >     if($char eq "<") {
> >       IGNORE: for(;;) {
> >         last IGNORE if (getc(INPUT) eq ">");
>
> ... not sure why the IGNORE thing is in here; it seems like this should
> work but I would have simply done :
>
>     if($char eq "<") {
>       while(getc(INPUT) ne ">") { ; }
>     }

I had trouble with your idea, but I went back to the original script I
posted and discovered that the problem is it dies whenever a numerical '0'
is encountered! Apart from that it works fine. It just so happened I had a
'0' in the first few lines of my SGML, but I didn't get the implication.

So zero makes the condition '$char = getc(INPUT)' evaluate to false,
dumping the flow down to closing the file. What's the perl equivalent of
WHILE NOT EOF? <g>

> Look reasonable?

--
Bob Bernstein
http://www.ruptured-duck.com
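The diagnosis above can be reproduced in a few lines. This is a minimal sketch, not part of the thread: read_chars and the in-memory filehandle are hypothetical names used only to show the difference between testing the character for truth and testing it with defined(). getc returns undef at end of file, and the string "0" is false but defined.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read every character of $text through an in-memory filehandle.
# $use_defined selects the loop test: plain truth or defined().
# read_chars is a hypothetical helper for this illustration.
sub read_chars {
    my ($text, $use_defined) = @_;
    open(my $in, '<', \$text) or die "cannot open in-memory handle: $!";
    my $out = '';
    my $char;
    if ($use_defined) {
        # Stops only at end of file, where getc returns undef.
        while (defined($char = getc($in))) { $out .= $char; }
    } else {
        # Stops at end of file OR at the first "0", because "0" is false.
        while ($char = getc($in)) { $out .= $char; }
    }
    close($in);
    return $out;
}

print read_chars("year 2000", 0), "\n";   # loop tested for truth
print read_chars("year 2000", 1), "\n";   # loop tested with defined()
```

The first call prints "year 2" and the second prints "year 2000", which matches the behaviour described: everything after the first zero is lost when the loop tests the character itself.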
Re[2]: OT: how to strip out SGML tags?
Ok. My last post in this thread! Here is what does work: (thanks list)

    #!/usr/bin/perl -w
    ##
    ## sgmlstripper - Strip SGML markup from input.
    ##
    ## by Robert J Seymour [EMAIL PROTECTED]
    ## Copyright 1995, 1996, Robert Seymour and Springer-Verlag.
    ## Fixed by Bob Bernstein to handle zeros, 9/2/2000
    ## All rights reserved. This program may be distributed and/or
    ## modified in electronic form under the same terms as Perl
    ## itself.
    ##
    ## CPAN menu:
    #
    # File Name: sgmlstripper
    # File Size in BYTES: 1469
    # Sender/Author/Poster: Robert J. Seymour [EMAIL PROTECTED]
    # Subject: sgmlstripper - Strip SGML markup from input.
    #
    # sgmlstripper removes SGML markup tags from input (taken through
    # specified files or STDIN). sgmlstripper uses a
    # character-by-character read mode which, though not as fast as a
    # regexp, is guaranteed to strip tags which fall across line or
    # paragraph boundaries and preserves whitespace so that line numbers
    # will be the same (the latter is useful for search engines which
    # don't want to index markup, but want line numbers to be preserved).

    ## Use STDIN if no files are given
    $ARGV[0] = "-" unless @ARGV;

    ## Strip out anything contained in an SGML markup tag. This is not
    ## very pretty and rather inefficient, but it does take care of tags
    ## which cross line or paragraph boundaries.
    foreach $file (@ARGV) {
      open(INPUT,$file);
      while(!eof(INPUT)) {
        $char = getc(INPUT);
        if($char eq "<") {
          IGNORE: for(;;) {
            last IGNORE if (getc(INPUT) eq ">");
          }
        } else {
          print $char;
        }
      }
      close(INPUT);
    }

--
Bob Bernstein
http://www.ruptured-duck.com
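One thing the fixed script still appears to leave open: if the input ends in the middle of a tag, the inner for(;;) loop never sees a ">" and would keep calling getc at end of file. Below is a hedged sketch of the same character-by-character idea with a flag instead of the inner loop, so end of file always ends the run. It is a variation for illustration, not the script from the thread; strip_tags, strip_string and the sample input are names and data I made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy characters from $in to $out, dropping everything from "<" through ">".
# A flag replaces the inner for(;;) loop, so reaching end of file inside an
# unterminated tag simply ends the outer loop.
sub strip_tags {
    my ($in, $out) = @_;
    my $in_tag = 0;
    while (defined(my $char = getc($in))) {   # undef only at EOF, so "0" is safe
        if ($in_tag) {
            $in_tag = 0 if $char eq '>';
        } elsif ($char eq '<') {
            $in_tag = 1;
        } else {
            print {$out} $char;
        }
    }
    return;
}

# Hypothetical convenience wrapper for trying the routine on a string.
sub strip_string {
    my ($sgml) = @_;
    my $text = '';
    open(my $in,  '<', \$sgml) or die "cannot read string: $!";
    open(my $out, '>', \$text) or die "cannot write string: $!";
    strip_tags($in, $out);
    close($in);
    close($out);
    return $text;
}

# Sample input with a zero and a tag that spans two lines.
print strip_string("<para>Price: 0 or <emphasis\nrole=\"b\">10</emphasis></para>\n");
```

This prints "Price: 0 or 10". To run it on real files, one could open each file and call strip_tags($fh, \*STDOUT) in place of the string wrapper.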
Re: OT: how to strip out SGML tags?
On Sat, Sep 02, 2000 at 05:27:49PM -0400, Bob Bernstein wrote:
> erik [EMAIL PROTECTED] wrote:
> > > ## Use STDIN if no files are given
> > > $ARGV[0] = "-" unless @ARGV;
> > >
> > > ## Strip out anything contained in an SGML markup tag. This is not
> > > ## very pretty and rather inefficient, but it does take care of tags
> > > ## which cross line or paragraph boundaries.
> > > foreach $file (@ARGV) {
> > >   open(INPUT,$file);

    # while there's text to get
    while(<INPUT>) {
      # while there's a starting (maybe complete) tag
      while (s/<[^>]*(>?)//) {
        # if not complete (start but no finish)
        if ( ! $1) {
          my $tag;
          while($tag = <INPUT>) {
            # keep going until we find the end-of-tag
            last if $tag =~ s/.*?>//;
          }
          # maybe add a space wherever tags were ripped out? up 2 u
          $_ .= $tag;
        }
      }
      munge $_;
    }

note -- this ain't tested, but it looks to me like it's workable; plus it
reads lines at a time and uses the powerful perl muscles to help you do
your job... of course, tmtowtdi...

> I had trouble with your idea, but I went back to the original script I
> posted and discovered that the problem is it dies whenever a numerical
> '0' is encountered! Apart from that it works fine. It just so happened I
> had a '0' in the first few lines of my SGML, but I didn't get the
> implication.
>
> So zero makes the condition '$char = getc(INPUT)' evaluate to false,
> dumping the flow down to closing the file. What's the perl equivalent of
> WHILE NOT EOF? <g>

    while (<FILEHANDLE>) { ... }

i.e.

    while ($_ = <FILEHANDLE>) { munge $_; }
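Since the line-at-a-time version above is explicitly untested, here is a small self-contained sketch of the same idea that can be run as is. It is an assumption-laden variation, not the code from the message: strip_lines is a hypothetical name, it returns the stripped text instead of calling munge, and it tracks an open tag with a flag instead of a nested read loop. It also assumes any "<" left on a line starts a tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Strip tags line by line from the filehandle $in and return the text.
# strip_lines is a hypothetical helper for this illustration.
sub strip_lines {
    my ($in) = @_;
    my $out    = '';
    my $in_tag = 0;    # true while inside a tag left open by an earlier line
    while (defined(my $line = <$in>)) {
        if ($in_tag) {
            # Drop everything up to the closing ">"; with none, skip the line.
            next unless $line =~ s/^[^>]*>//;
            $in_tag = 0;
        }
        $line =~ s/<[^>]*>//g;                  # tags complete on this line
        $in_tag = 1 if $line =~ s/<[^>]*\z//;   # tag opened here, closed later
        $out .= $line;
    }
    return $out;
}

# Sample input with a zero and a tag that spans two lines.
my $sgml = "<para>Price: 0 or <emphasis\nrole=\"b\">10</emphasis></para>\n";
open(my $in, '<', \$sgml) or die "cannot read string: $!";
print strip_lines($in);
close($in);
```

This prints "Price: 0 or 10". Like the character-by-character script, it drops the newline inside a multi-line tag, so line numbers are not preserved across such tags.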