Re: what's the easiest way to de-html-ize files?

2007-05-15 Thread Bram Schoenmakers
On Saturday 12 May 2007, Gary Kline wrote:
> This is for those of us who appreciate ASCII or straight
> ISO_8859-15 rather than marked up files.  I have slapped together
> a crude C program that does scotch (or *cleanse*) text of
> <B></B> and so on.   Still... is there some standalone converter
> that gets rid of markup more elegantly?   Something where I
> can say
>
> % cmd file_1.html ... file_N.html and output file_1.text ...
> file_N.text?
>
> thanks, gents,
>
> gary

textproc/html2text

Kind regards,

-- 
Bram Schoenmakers

What is mind? No matter. What is matter? Never mind.
(Punch, 1855)
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: what's the easiest way to de-html-ize files?

2007-05-15 Thread Gary Kline
On Tue, May 15, 2007 at 03:34:14PM +1000, Ian Smith wrote:
> On Sat, 12 May 2007 14:34:52 -0700 Gary Kline [EMAIL PROTECTED] wrote:
> > On Mon, May 14, 2007 at 12:09:07PM -0700, Chuck Swiger wrote:
> > > On May 12, 2007, at 12:54 PM, Gary Kline wrote:
> > > > This is for those of us who appreciate ASCII or straight
> > > > ISO_8859-15 rather than marked up files.  I have slapped together
> > > > a crude C program that does scotch (or *cleanse*) text of
> > > > <B></B> and so on.   Still... is there some standalone converter
> > > > that gets rid of markup more elegantly?   Something where I
> > > > can say
> > > >
> > > > % cmd file_1.html ... file_N.html and output file_1.text ...
> > > > file_N.text?
> > >
> > > Perhaps:
> > >
> > >   lynx -dump file1.html ... > file.text
> > >
> > > ...?
> >
> > Hm, maybe I need Bill Campbell's -force_html switch.
> >
> > Yes, seems that way.  Using just -dump got most of them, but
> > using the -force_html caught all.  Need to script something to
> > reformat, but the worst of it's done!
>
> Also, if using Mozilla (so, I would assume, Firefox) the 'Save Page As'
> dialog offers a picklist for 'Files of Type' that includes 'Text Files'.
>
> This does a pretty decent job of producing text from HTML files, and is
> quicker than firing up lynx (or links) if you're already viewing a page.


Oh sure; I've been saving html in text, ascii/8859-1 for years.
But what I've got, and there are more saved **somewhere**, are
files that are saved by default in markup.  I have a slew of
these on different boxen and have been moving them to one place.
Problem is: how to de-html the bunch.

I'm too lazy to write something that would automate what can be
automated--markup like &foo; is problematic.  So probably the
easiest way would be to create a dehtml.sh script that is just a
wrapper around lynx.
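
The dehtml.sh wrapper suggested above could look roughly like the following. This is only a sketch, assuming lynx is installed and on the PATH; the function name dehtml and the error messages are made up for illustration, while -dump and -force_html are the switches mentioned earlier in the thread. Each file_N.html argument produces a file_N.text next to it.

```shell
#!/bin/sh
# dehtml.sh: hypothetical wrapper around lynx (a sketch, not a tested tool).
# For every argument NAME.html, write the rendered text to NAME.text.

dehtml() {
    status=0
    for src in "$@"; do
        if [ ! -r "$src" ]; then
            echo "dehtml: cannot read $src" >&2
            status=1
            continue
        fi
        # Drop a trailing .html if there is one, then append .text.
        dst="${src%.html}.text"
        # -dump writes the rendered page to stdout; -force_html makes lynx
        # treat the input as HTML whatever the file is called.
        if ! lynx -dump -force_html "$src" > "$dst"; then
            echo "dehtml: lynx failed on $src" >&2
            status=1
        fi
    done
    return "$status"
}

# With no arguments the loop body never runs and nothing is written.
dehtml "$@"
```

Saved as dehtml.sh, it would be run as `sh dehtml.sh file_1.html ... file_N.html`. The output keeps whatever line wrapping lynx chooses, so any further reformatting is a separate step.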

I don't think I'm the only hacker who wants just-plain-ascii, so
this might make a good project for somebody who's new to C or
perl.   That's my two pennies' worth!

gary

 
> Cheers, Ian
 

-- 
  Gary Kline  [EMAIL PROTECTED]   www.thought.org  Public Service Unix



Re: what's the easiest way to de-html-ize files?

2007-05-15 Thread Garrett Cooper

Gary Kline wrote:

> On Tue, May 15, 2007 at 03:34:14PM +1000, Ian Smith wrote:
> > On Sat, 12 May 2007 14:34:52 -0700 Gary Kline [EMAIL PROTECTED] wrote:
> > > On Mon, May 14, 2007 at 12:09:07PM -0700, Chuck Swiger wrote:
> > > > On May 12, 2007, at 12:54 PM, Gary Kline wrote:
> > > > > This is for those of us who appreciate ASCII or straight
> > > > > ISO_8859-15 rather than marked up files.  I have slapped together
> > > > > a crude C program that does scotch (or *cleanse*) text of
> > > > > <B></B> and so on.   Still... is there some standalone converter
> > > > > that gets rid of markup more elegantly?   Something where I
> > > > > can say
> > > > >
> > > > > % cmd file_1.html ... file_N.html and output file_1.text ...
> > > > > file_N.text?
> > > >
> > > > Perhaps:
> > > >
> > > >   lynx -dump file1.html ... > file.text
> > > >
> > > > ...?
> > >
> > > Hm, maybe I need Bill Campbell's -force_html switch.
> > >
> > > Yes, seems that way.  Using just -dump got most of them, but
> > > using the -force_html caught all.  Need to script something to
> > > reformat, but the worst of it's done!
> >
> > Also, if using Mozilla (so, I would assume, Firefox) the 'Save Page As'
> > dialog offers a picklist for 'Files of Type' that includes 'Text Files'.
> >
> > This does a pretty decent job of producing text from HTML files, and is
> > quicker than firing up lynx (or links) if you're already viewing a page.
>
> Oh sure; I've been saving html in text, ascii/8859-1 for years.
> But what I've got, and there are more saved **somewhere**, are
> files that are saved by default in markup.  I have a slew of
> these on different boxen and have been moving them to one place.
> Problem is: how to de-html the bunch.
>
> I'm too lazy to write something that would automate what can be
> automated--markup like &foo; is problematic.  So probably the
> easiest way would be to create a dehtml.sh script that is just a
> wrapper around lynx.
>
> I don't think I'm the only hacker who wants just-plain-ascii, so
> this might make a good project for somebody who's new to C or
> perl.   That's my two pennies' worth!
>
> gary
>
> > Cheers, Ian

If you don't want formatting and the number of tags is trivial, the 
solution is fairly simple in Perl (less than 150 lines, if even that).
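
As a rough illustration of the kind of trivial tag stripping described above, here is a sketch in shell with sed rather than Perl. It is not Garrett's code; the function name strip_tags and the short list of entities are assumptions chosen for the example. It only handles tags that open and close on the same line, and it knows nothing about comments, scripts, or entities beyond the five listed.

```shell
#!/bin/sh
# strip_tags: hypothetical, deliberately crude tag stripper.
# Reads HTML on stdin and writes plain text on stdout.

strip_tags() {
    # 1. Delete anything that looks like <...> within a single line.
    # 2. Decode a handful of common entities.  &amp; is decoded last so
    #    that text such as "&amp;lt;" ends up as "&lt;" and not "<".
    sed -e 's/<[^>]*>//g' \
        -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e 's/&nbsp;/ /g' \
        -e 's/&amp;/\&/g'
}
```

For example, `strip_tags < file_1.html > file_1.text` would convert one file. Anything less trivial (tags split across lines, tables, character sets) is where lynx or html2text earn their keep.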


-Garrett



Re: what's the easiest way to de-html-ize files?

2007-05-15 Thread Gary Kline
On Tue, May 15, 2007 at 09:08:32AM +0200, Bram Schoenmakers wrote:
> On Saturday 12 May 2007, Gary Kline wrote:
> > This is for those of us who appreciate ASCII or straight
> > ISO_8859-15 rather than marked up files.  I have slapped together
> > a crude C program that does scotch (or *cleanse*) text of
> > <B></B> and so on.   Still... is there some standalone converter
> > that gets rid of markup more elegantly?   Something where I
> > can say
> >
> > % cmd file_1.html ... file_N.html and output file_1.text ...
> > file_N.text?
> >
> > thanks, gents,
> >
> > gary
>
> textproc/html2text


So!  This I'll check out.  Bedankt (thanks) :-)

gary

PS: Ask and thou shalt receive. If you're lucky.


 

-- 
  Gary Kline  [EMAIL PROTECTED]   www.thought.org  Public Service Unix



Re: what's the easiest way to de-html-ize files?

2007-05-14 Thread Bill Campbell
On Sat, May 12, 2007, Gary Kline wrote:


> This is for those of us who appreciate ASCII or straight
> ISO_8859-15 rather than marked up files.  I have slapped together
> a crude C program that does scotch (or *cleanse*) text of
> <B></B> and so on.   Still... is there some standalone converter
> that gets rid of markup more elegantly?   Something where I
> can say

The ``lynx'' text browser can generate plain text from HTML.

lynx -dump -force_html filename

Bill
--
INTERNET:   [EMAIL PROTECTED]  Bill Campbell; Celestial Software LLC
URL: http://www.celestial.com/  PO Box 820; 6641 E. Mercer Way
FAX:(206) 232-9186  Mercer Island, WA 98040-0820; (206) 236-1676

In Germany they first came for the Communists and I didn't speak up because
I wasn't a Communist.  Then they came for the Jews, and I didn't speak up
because I wasn't a Jew.  Then they came for the trade unionists, and I
didn't speak up because I wasn't a trade unionist.  Then they came for the
Catholics, and I didn't speak up because I was a Protestant.  Then they came
for me -- and by that time no one was left to speak up.
-- Pastor Martin Niemoller


Re: what's the easiest way to de-html-ize files?

2007-05-14 Thread Chuck Swiger

On May 12, 2007, at 12:54 PM, Gary Kline wrote:

> This is for those of us who appreciate ASCII or straight
> ISO_8859-15 rather than marked up files.  I have slapped together
> a crude C program that does scotch (or *cleanse*) text of
> <B></B> and so on.   Still... is there some standalone converter
> that gets rid of markup more elegantly?   Something where I
> can say
>
> % cmd file_1.html ... file_N.html and output file_1.text ...
> file_N.text?


Perhaps:

  lynx -dump file1.html ... > file.text

...?

--
-Chuck



Re: what's the easiest way to de-html-ize files?

2007-05-14 Thread Gary Kline
On Mon, May 14, 2007 at 12:09:07PM -0700, Chuck Swiger wrote:
> On May 12, 2007, at 12:54 PM, Gary Kline wrote:
> > This is for those of us who appreciate ASCII or straight
> > ISO_8859-15 rather than marked up files.  I have slapped together
> > a crude C program that does scotch (or *cleanse*) text of
> > <B></B> and so on.   Still... is there some standalone converter
> > that gets rid of markup more elegantly?   Something where I
> > can say
> >
> > % cmd file_1.html ... file_N.html and output file_1.text ...
> > file_N.text?
>
> Perhaps:
>
>   lynx -dump file1.html ... > file.text
>
> ...?


Hm, maybe I need Bill Campbell's -force_html switch.


Yes, seems that way.  Using just -dump got most of them, but
using the -force_html caught all.  Need to script something to
reformat, but the worst of it's done!

thanks, guys,

gary


 
> --
> -Chuck
 

-- 
  Gary Kline  [EMAIL PROTECTED]   www.thought.org  Public Service Unix



Re: what's the easiest way to de-html-ize files?

2007-05-14 Thread Ian Smith
On Sat, 12 May 2007 14:34:52 -0700 Gary Kline [EMAIL PROTECTED] wrote:
> On Mon, May 14, 2007 at 12:09:07PM -0700, Chuck Swiger wrote:
> > On May 12, 2007, at 12:54 PM, Gary Kline wrote:
> > > This is for those of us who appreciate ASCII or straight
> > > ISO_8859-15 rather than marked up files.  I have slapped together
> > > a crude C program that does scotch (or *cleanse*) text of
> > > <B></B> and so on.   Still... is there some standalone converter
> > > that gets rid of markup more elegantly?   Something where I
> > > can say
> > >
> > > % cmd file_1.html ... file_N.html and output file_1.text ...
> > > file_N.text?
> >
> > Perhaps:
> >
> >   lynx -dump file1.html ... > file.text
> >
> > ...?
>
> Hm, maybe I need Bill Campbell's -force_html switch.
>
> Yes, seems that way.  Using just -dump got most of them, but
> using the -force_html caught all.  Need to script something to
> reformat, but the worst of it's done!

Also, if using Mozilla (so, I would assume, Firefox) the 'Save Page As'
dialog offers a picklist for 'Files of Type' that includes 'Text Files'.

This does a pretty decent job of producing text from HTML files, and is
quicker than firing up lynx (or links) if you're already viewing a page.

Cheers, Ian
