Re: OT: how to strip out SGML tags?

2000-09-05 Thread Nathan E Norman
On Sat, Sep 02, 2000 at 06:53:46PM -0500, Will Trillich wrote:
> > So zero makes the condition '$char = getc(INPUT)' evaluate to false, dumping
> > the flow down to closing the file. What's the perl equivalent of WHILE NOT
> > EOF? <g>
> 
>   while (<FILEHANDLE>) { ... }
> i.e.
>   while ($_ = <FILEHANDLE>) { munge $_; }

The truly lazy would write

  while (<FILEHANDLE>) { munge $_; }

;)
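
For anyone following along, here is a minimal sketch of why the short form is safe even when a line is just "0". As far as I know, Perl wraps a bare readline in a while condition in an implicit defined() test, so the loop ends only at end of file. The name collect_lines is made up for illustration, the in-memory filehandle stands in for a real file, and munge in the posts above is only a placeholder for whatever processing you want.

```perl
use strict;
use warnings;

# Hypothetical helper: read a string line by line through an
# in-memory filehandle and return the chomped lines.
sub collect_lines {
    my ($text) = @_;
    open(my $fh, '<', \$text) or die "cannot open in-memory handle: $!";
    my @lines;
    local $_;                  # do not clobber the caller's $_
    while (<$fh>) {            # same as: while (defined($_ = <$fh>))
        chomp;
        push @lines, $_;       # a line holding just "0" still gets here
    }
    close($fh);
    return @lines;
}
```

Calling collect_lines("a\n0") should hand back both lines, including the final "0" that has no trailing newline.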

-- 
Nathan E Norman   Eschew Obfuscation
email:[EMAIL PROTECTED]  http://incanus.net/~nnorman




OT: how to strip out SGML tags?

2000-09-02 Thread Bob Bernstein
I have found a perl script to do this but it doesn't seem to work. 

Does anyone know of something that does? 

Just for conversation's sake, here's the pl that does _not_ seem to handle my
DocBook SGML:

#!/usr/bin/perl
##
##  sgmlstripper - Strip SGML markup from input.
##
##  by Robert J Seymour [EMAIL PROTECTED]
## Copyright 1995, 1996, Robert Seymour and Springer-Verlag.
## All rights reserved.  This program may be distributed and/or
## modified in electronic form under the same terms as Perl
## itself.
##
##  CPAN menu:
#
# File Name: sgmlstripper
# File Size in BYTES: 1469
# Sender/Author/Poster: Robert J. Seymour [EMAIL PROTECTED]
# Subject: sgmlstripper - Strip SGML markup from input.
#
# sgmlstripper removes SGML markup tags from input (taken through
# specified files or STDIN).  sgmlstripper uses a 
# character-by-character read mode which, though not as fast as a
# regexp, is guaranteed to strip tags which fall across line or
# paragraph boundaries and preserves whitespace so that line numbers
# will be the same (the latter is useful for search engines which
# don't want to index markup, but want line numbers to be preserved).


##  Use STDIN if no files are given
$ARGV[0] = "-" unless @ARGV;

##  Strip out anything contained in an SGML markup tag.  This is not
##  very pretty and rather inefficient, but it does take care of tags
##  which cross line or paragraph boundaries.
foreach $file (@ARGV) {
  open(INPUT,$file);
  while($char = getc(INPUT)) {
    if($char eq "<") {
      IGNORE: for(;;) {
        last IGNORE if (getc(INPUT) eq ">");
      }
    } else {
      print $char;
    }
  }
  close(INPUT);
}

TIA, and sorry to be spamming the list with this OT post.


--
Bob Bernstein  http://www.ruptured-duck.com





Re[2]: OT: how to strip out SGML tags?

2000-09-02 Thread Bob Bernstein
erik [EMAIL PROTECTED] wrote:

> > ##  Use STDIN if no files are given
> > $ARGV[0] = "-" unless @ARGV;
> > 
> > ##  Strip out anything contained in an SGML markup tag.  This is not
> > ##  very pretty and rather inefficient, but it does take care of tags
> > ##  which cross line or paragraph boundaries.
> > foreach $file (@ARGV) {
> >   open(INPUT,$file);
> >   while($char = getc(INPUT)) {
> >     if($char eq "<") {
> >       IGNORE: for(;;) {
> >         last IGNORE if (getc(INPUT) eq ">");
> 
> ... not sure why the IGNORE thing is in here; it seems like this should
> work but I would have simply done:
>    if($char eq "<") {
>       while(getc(INPUT) ne ">") {
>          ;
>       }
>    }
 

I had trouble with your idea, but I went back to the original script I posted
and discovered that the problem is it dies whenever a numerical '0' is
encountered! Apart from that it works fine. It just so happened I had a '0' in
the first few lines of my SGML, but I didn't get the implication.

So zero makes the condition '$char = getc(INPUT)' evaluate to false, dumping
the flow down to closing the file. What's the perl equivalent of WHILE NOT
EOF? <g>

> Look reasonable?
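
To make the zero problem concrete, here is a small sketch that compares the two loop conditions. It assumes an in-memory filehandle in place of a real file, and read_chars is a name invented for this example. The point is that getc returns undef at end of file, so testing with defined() separates "no more input" from a character such as "0" that merely happens to be false.

```perl
use strict;
use warnings;

# Hypothetical helper: return the characters a getc loop actually sees.
# With $use_defined false it uses the original truth test; with it true
# it tests defined(), which only fails at end of file.
sub read_chars {
    my ($text, $use_defined) = @_;
    open(my $fh, '<', \$text) or die "cannot open in-memory handle: $!";
    my $seen = '';
    if ($use_defined) {
        while (defined(my $char = getc($fh))) {
            $seen .= $char;
        }
    } else {
        while (my $char = getc($fh)) {   # stops early on "0"
            $seen .= $char;
        }
    }
    close($fh);
    return $seen;
}
```

With the input "a0b", the truth-test version should stop after "a", while the defined() version should see all three characters.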


--
Bob Bernstein  http://www.ruptured-duck.com





Re[2]: OT: how to strip out SGML tags?

2000-09-02 Thread Bob Bernstein
Ok. My last post in this thread! Here is what does work: (thanks list)

#!/usr/bin/perl -w
##
##  sgmlstripper - Strip SGML markup from input.
##
##  by Robert J Seymour [EMAIL PROTECTED]
## Copyright 1995, 1996, Robert Seymour and Springer-Verlag.

## Fixed by Bob Bernstein to handle zeros, 9/2/2000

## All rights reserved.  This program may be distributed and/or
## modified in electronic form under the same terms as Perl
## itself.
##
##  CPAN menu:
#
# File Name: sgmlstripper
# File Size in BYTES: 1469
# Sender/Author/Poster: Robert J. Seymour [EMAIL PROTECTED]
# Subject: sgmlstripper - Strip SGML markup from input.
#
# sgmlstripper removes SGML markup tags from input (taken through
# specified files or STDIN).  sgmlstripper uses a 
# character-by-character read mode which, though not as fast as a
# regexp, is guaranteed to strip tags which fall across line or
# paragraph boundaries and preserves whitespace so that line numbers
# will be the same (the latter is useful for search engines which
# don't want to index markup, but want line numbers to be preserved).


##  Use STDIN if no files are given
$ARGV[0] = "-" unless @ARGV;

##  Strip out anything contained in an SGML markup tag.  This is not
##  very pretty and rather inefficient, but it does take care of tags
##  which cross line or paragraph boundaries.
foreach $file (@ARGV) {
  open(INPUT,$file);
  while(!eof(INPUT)) {
    $char = getc(INPUT);
    if($char eq "<") {
      IGNORE: for(;;) {
        last IGNORE if (getc(INPUT) eq ">");
      }
    } else {
      print $char;
    }
  }
  close(INPUT);
}
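
One more thing that might be worth guarding against: the inner IGNORE loop has no end-of-file check, so a tag that is never closed would presumably keep it spinning, since getc returns undef at EOF and undef never equals ">". The loop also throws away any newline that falls inside a tag, which works against the goal of keeping line numbers the same. Below is a rough sketch of a variant that handles both. It is not the original author's code; strip_tags is a made-up name, and it works on a string through an in-memory filehandle purely so it is easy to try out.

```perl
use strict;
use warnings;

# Hypothetical variant of the stripper: returns the input with
# everything between "<" and ">" removed, keeping newlines that
# occur inside tags so the line count does not change.
sub strip_tags {
    my ($text) = @_;
    open(my $in, '<', \$text) or die "cannot open in-memory handle: $!";
    my $out = '';
    while (defined(my $char = getc($in))) {
        if ($char ne '<') {
            $out .= $char;             # ordinary text, "0" included
            next;
        }
        # Inside a tag: discard up to ">" but give up cleanly at EOF.
        while (defined(my $c = getc($in))) {
            last if $c eq '>';
            $out .= "\n" if $c eq "\n";   # preserve line numbering
        }
    }
    close($in);
    return $out;
}
```

For example, strip_tags("<a\nb>x0<unclosed") should return "\nx0" rather than hanging on the unterminated tag.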


--
Bob Bernstein  http://www.ruptured-duck.com





Re: OT: how to strip out SGML tags?

2000-09-02 Thread Will Trillich
On Sat, Sep 02, 2000 at 05:27:49PM -0400, Bob Bernstein wrote:
> erik [EMAIL PROTECTED] wrote:
> 
> > > ##  Use STDIN if no files are given
> > > $ARGV[0] = "-" unless @ARGV;
> > > 
> > > ##  Strip out anything contained in an SGML markup tag.  This is not
> > > ##  very pretty and rather inefficient, but it does take care of tags
> > > ##  which cross line or paragraph boundaries.
> > > foreach $file (@ARGV) {
> > >   open(INPUT,$file);

  # while there's text to get
  while(<INPUT>) {
    # while there's a starting (maybe complete) tag
    while (s/<[^>]*(>?)//) {
      # if not complete (start but no finish)
      if ( ! $1) {
        my $tag;
        while($tag = <INPUT>) {
          # keep going until we find the end-of-tag
          last if $tag =~ s/.*?>//;
        }
        # maybe add a space wherever tags were ripped out? up 2 u
        $_ .= $tag;
      }
    }
    munge $_;
  }

note -- this ain't tested, but it looks to me like it's workable;
plus it reads lines at a time and uses the powerful perl muscles
to help you do your job... of course, tmtowtdi...
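
Since the code above is untested, here is one possible line-at-a-time sketch that has at least been reasoned through. It takes a slightly different route: instead of reading ahead when a tag is left open, it remembers in a flag that it is inside a tag and carries that over to the next line. The name strip_lines is invented for this example, and returning a string stands in for the munge placeholder.

```perl
use strict;
use warnings;

# Hypothetical helper: strip tags from an open filehandle one line
# at a time, coping with tags that continue onto later lines.
sub strip_lines {
    my ($fh) = @_;
    my $out    = '';
    my $in_tag = 0;                       # true while a tag is still open
    while (defined(my $line = <$fh>)) {
        if ($in_tag) {
            # Drop everything up to the ">" that closes the open tag.
            if ($line =~ s/^[^>]*>//) {
                $in_tag = 0;
            } else {
                next;                     # whole line is inside the tag
            }
        }
        $line =~ s/<[^>]*>//g;            # tags complete on this line
        $in_tag = 1 if $line =~ s/<[^>]*\z//;   # tag left open at line end
        $out .= $line;
    }
    return $out;
}
```

Note that, unlike a character-by-character version, this drops the newline inside a tag that spans lines, so line numbers shift by however many lines the tag covered.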

> I had trouble with your idea, but I went back to the original script I posted
> and discovered that the problem is it dies whenever a numerical '0' is
> encountered! Apart from that it works fine. It just so happened I had a '0' in
> the first few lines of my SGML, but I didn't get the implication.
> 
> So zero makes the condition '$char = getc(INPUT)' evaluate to false, dumping
> the flow down to closing the file. What's the perl equivalent of WHILE NOT
> EOF? <g>

  while (<FILEHANDLE>) { ... }
i.e.
  while ($_ = <FILEHANDLE>) { munge $_; }