Re: what's the easiest way to de-html-ize files?
On Tue, May 15, 2007 at 09:08:32AM +0200, Bram Schoenmakers wrote:
> On Saturday 12 May 2007, Gary Kline wrote:
> > This is for those of us who appreciate ASCII or straight
> > ISO_8859-15 rather than marked up files. I have slapped together
> > a crude C program that does scotch (or *cleanse*) text of
> > and so on. Still... is there some standalone converter
> > that gets rid of markup more elegantly? Something where I
> > can say
> >
> > % cmd file_1.html ... file_N.html and output file_1.text ...
> > file_N.text?
> >
> > thanks, gents,
> >
> > gary
>
> textproc/html2text

So! This I'll check out. bedankt (thanks) :-)

gary

PS: "Ask and thou shall receive." If you're lucky.

> Kind regards,
>
> --
> Bram Schoenmakers
>
> What is mind? No matter. What is matter? Never mind.
> (Punch, 1855)

--
Gary Kline [EMAIL PROTECTED] www.thought.org Public Service Unix
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: what's the easiest way to de-html-ize files?
Gary Kline wrote:
> On Tue, May 15, 2007 at 03:34:14PM +1000, Ian Smith wrote:
> > On Sat, 12 May 2007 14:34:52 -0700 Gary Kline <[EMAIL PROTECTED]> wrote:
> > > On Mon, May 14, 2007 at 12:09:07PM -0700, Chuck Swiger wrote:
> > > > On May 12, 2007, at 12:54 PM, Gary Kline wrote:
> > > > > This is for those of us who appreciate ASCII or straight
> > > > > ISO_8859-15 rather than marked up files. I have slapped together
> > > > > a crude C program that does scotch (or *cleanse*) text of
> > > > > and so on. Still... is there some standalone converter
> > > > > that gets rid of markup more elegantly? Something where I
> > > > > can say
> > > > >
> > > > > % cmd file_1.html ... file_N.html and output file_1.text ...
> > > > > file_N.text?
> > > >
> > > > Perhaps:
> > > >
> > > > lynx -dump file1.html ... > file.text
> > > >
> > > > ...?
> > >
> > > Hm, maybe I need Bill Campbell's -force_html switch.
> > >
> > > Yes, seems that way. Using just -dump got most of them, but
> > > using the -force_html caught all. Need to script something to
> > > reformat, but the worst of it's done!
> >
> > Also, if using Mozilla (so, I would assume, Firefox) the 'Save Page As'
> > dialog offers a picklist for 'Files of Type' that includes 'Text Files'.
> >
> > This does a pretty decent job of producing text from HTML files, and is
> > quicker than firing up lynx (or links) if you're already viewing a page.
>
> Oh sure; I've been saving html in text, ascii/8859-1 for years.
> But what I've got, and there are more saved **somewhere**, are
> files that are saved by default in markup. I have a slew of these
> on different boxen and have been moving them to one place.
> Problem is: how to de-html the bunch. I'm too lazy to write
> something that would automate what can be automated--markup like
> "&foo;" is problematic. So probably the easiest way would be to
> create a dehtml.sh script that is just a wrapper around lynx.
>
> I don't think I'm the only hacker who wants just-plain-ascii, so
> this might make a good project for somebody who's new to C or Perl.
> That's my two pennies' worth!
>
> gary
>
> > Cheers, Ian

If you don't want formatting and the number of tags is trivial, the
solution is fairly simple in Perl (less than 150 lines, if even that).

-Garrett
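
For the trivial case Garrett describes (no formatting wanted, only a handful of tags), a crude filter can be quite short. The sketch below swaps Perl for sed inside a shell function; the name strip_tags is made up for illustration. It assumes every tag opens and closes on the same line and decodes only a few common entities, so it is a rough illustration of the idea rather than a real HTML parser, and it does nothing for the general "&foo;" problem Gary mentions.

```shell
#!/bin/sh
# strip_tags: hypothetical helper name, not from the thread.
# Reads HTML on stdin, writes roughly de-tagged text on stdout.
# Assumptions: tags never span lines; only five entities are decoded.
strip_tags() {
    # Tags are removed first, so a decoded "<" or ">" is never
    # mistaken for markup. "&amp;" is decoded last so that text such
    # as "&amp;lt;" ends up as "&lt;" instead of "<".
    sed -e 's/<[^>]*>//g' \
        -e 's/&nbsp;/ /g' \
        -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e 's/&amp;/\&/g'
}

# Example use:
#   strip_tags < file_1.html > file_1.text
```

Anything beyond this (multi-line tags, scripts, tables, the full entity list) is where a real renderer such as lynx earns its keep.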
Re: what's the easiest way to de-html-ize files?
On Tue, May 15, 2007 at 03:34:14PM +1000, Ian Smith wrote:
> On Sat, 12 May 2007 14:34:52 -0700 Gary Kline <[EMAIL PROTECTED]> wrote:
> > On Mon, May 14, 2007 at 12:09:07PM -0700, Chuck Swiger wrote:
> > > On May 12, 2007, at 12:54 PM, Gary Kline wrote:
> > > > This is for those of us who appreciate ASCII or straight
> > > > ISO_8859-15 rather than marked up files. I have slapped together
> > > > a crude C program that does scotch (or *cleanse*) text of
> > > > and so on. Still... is there some standalone converter
> > > > that gets rid of markup more elegantly? Something where I
> > > > can say
> > > >
> > > > % cmd file_1.html ... file_N.html and output file_1.text ...
> > > > file_N.text?
> > >
> > > Perhaps:
> > >
> > > lynx -dump file1.html ... > file.text
> > >
> > > ...?
> >
> > Hm, maybe I need Bill Campbell's -force_html switch.
> >
> > Yes, seems that way. Using just -dump got most of them, but
> > using the -force_html caught all. Need to script something to
> > reformat, but the worst of it's done!
>
> Also, if using Mozilla (so, I would assume, Firefox) the 'Save Page As'
> dialog offers a picklist for 'Files of Type' that includes 'Text Files'.
>
> This does a pretty decent job of producing text from HTML files, and is
> quicker than firing up lynx (or links) if you're already viewing a page.

Oh sure; I've been saving html in text, ascii/8859-1 for years.
But what I've got, and there are more saved **somewhere**, are
files that are saved by default in markup. I have a slew of these
on different boxen and have been moving them to one place.
Problem is: how to de-html the bunch. I'm too lazy to write
something that would automate what can be automated--markup like
"&foo;" is problematic. So probably the easiest way would be to
create a dehtml.sh script that is just a wrapper around lynx.

I don't think I'm the only hacker who wants just-plain-ascii, so
this might make a good project for somebody who's new to C or Perl.
That's my two pennies' worth!

gary

> Cheers, Ian

--
Gary Kline [EMAIL PROTECTED] www.thought.org Public Service Unix
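
A minimal sketch of the dehtml.sh wrapper Gary has in mind might look like the following. It uses only the two lynx switches named in this thread (-dump and -force_html). The helper name text_name and the rule for deriving the .text file name are assumptions made for illustration, and nothing here reformats the output afterwards.

```shell
#!/bin/sh
# dehtml.sh: sketch of a wrapper around lynx (script name from the thread).
# Usage: dehtml.sh file_1.html ... file_N.html
# Writes file_1.text ... file_N.text next to the inputs.

# text_name (hypothetical helper): map an input name to its output name.
# foo.html and foo.htm both become foo.text; any other name just gets
# ".text" appended.
text_name() {
    case "$1" in
        *.html) printf '%s.text\n' "${1%.html}" ;;
        *.htm)  printf '%s.text\n' "${1%.htm}" ;;
        *)      printf '%s.text\n' "$1" ;;
    esac
}

# Run lynx once per file so that each input gets its own output file.
for f in "$@"; do
    out=$(text_name "$f")
    # -dump prints the rendered page on stdout; -force_html makes lynx
    # treat the file as HTML whatever its name suggests.
    lynx -dump -force_html "$f" > "$out" ||
        echo "dehtml: lynx failed on $f" >&2
done
```

Invoked as `sh dehtml.sh file_1.html file_2.html`, this would leave file_1.text and file_2.text beside the originals, provided lynx is installed and on the PATH.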
Re: what's the easiest way to de-html-ize files?
On Saturday 12 May 2007, Gary Kline wrote:
> This is for those of us who appreciate ASCII or straight
> ISO_8859-15 rather than marked up files. I have slapped together
> a crude C program that does scotch (or *cleanse*) text of
> and so on. Still... is there some standalone converter
> that gets rid of markup more elegantly? Something where I
> can say
>
> % cmd file_1.html ... file_N.html and output file_1.text ...
> file_N.text?
>
> thanks, gents,
>
> gary

textproc/html2text

Kind regards,

--
Bram Schoenmakers

What is mind? No matter. What is matter? Never mind.
(Punch, 1855)
Re: what's the easiest way to de-html-ize files?
On Sat, 12 May 2007 14:34:52 -0700 Gary Kline <[EMAIL PROTECTED]> wrote:
> On Mon, May 14, 2007 at 12:09:07PM -0700, Chuck Swiger wrote:
> > On May 12, 2007, at 12:54 PM, Gary Kline wrote:
> > > This is for those of us who appreciate ASCII or straight
> > > ISO_8859-15 rather than marked up files. I have slapped together
> > > a crude C program that does scotch (or *cleanse*) text of
> > > and so on. Still... is there some standalone converter
> > > that gets rid of markup more elegantly? Something where I
> > > can say
> > >
> > > % cmd file_1.html ... file_N.html and output file_1.text ...
> > > file_N.text?
> >
> > Perhaps:
> >
> > lynx -dump file1.html ... > file.text
> >
> > ...?
>
> Hm, maybe I need Bill Campbell's -force_html switch.
>
> Yes, seems that way. Using just -dump got most of them, but
> using the -force_html caught all. Need to script something to
> reformat, but the worst of it's done!

Also, if using Mozilla (so, I would assume, Firefox) the 'Save Page As'
dialog offers a picklist for 'Files of Type' that includes 'Text Files'.

This does a pretty decent job of producing text from HTML files, and is
quicker than firing up lynx (or links) if you're already viewing a page.

Cheers, Ian
Re: what's the easiest way to de-html-ize files?
On Mon, May 14, 2007 at 12:09:07PM -0700, Chuck Swiger wrote:
> On May 12, 2007, at 12:54 PM, Gary Kline wrote:
> > This is for those of us who appreciate ASCII or straight
> > ISO_8859-15 rather than marked up files. I have slapped together
> > a crude C program that does scotch (or *cleanse*) text of
> > and so on. Still... is there some standalone converter
> > that gets rid of markup more elegantly? Something where I
> > can say
> >
> > % cmd file_1.html ... file_N.html and output file_1.text ...
> > file_N.text?
>
> Perhaps:
>
> lynx -dump file1.html ... > file.text
>
> ...?

Hm, maybe I need Bill Campbell's -force_html switch.

Yes, seems that way. Using just -dump got most of them, but
using the -force_html caught all. Need to script something to
reformat, but the worst of it's done!

thanks, guys,

gary

> --
> -Chuck

--
Gary Kline [EMAIL PROTECTED] www.thought.org Public Service Unix
Re: what's the easiest way to de-html-ize files?
On May 12, 2007, at 12:54 PM, Gary Kline wrote:
> This is for those of us who appreciate ASCII or straight
> ISO_8859-15 rather than marked up files. I have slapped together
> a crude C program that does scotch (or *cleanse*) text of
> and so on. Still... is there some standalone converter
> that gets rid of markup more elegantly? Something where I
> can say
>
> % cmd file_1.html ... file_N.html and output file_1.text ...
> file_N.text?

Perhaps:

lynx -dump file1.html ... > file.text

...?

--
-Chuck
Re: what's the easiest way to de-html-ize files?
On Sat, May 12, 2007, Gary Kline wrote:
> This is for those of us who appreciate ASCII or straight
> ISO_8859-15 rather than marked up files. I have slapped together
> a crude C program that does scotch (or *cleanse*) text of
> and so on. Still... is there some standalone converter
> that gets rid of markup more elegantly? Something where I
> can say

The ``lynx'' text browser can generate plain text from HTML.

lynx -dump -force_html filename

Bill
--
INTERNET: [EMAIL PROTECTED] Bill Campbell; Celestial Software LLC
URL: http://www.celestial.com/ PO Box 820; 6641 E. Mercer Way
FAX: (206) 232-9186 Mercer Island, WA 98040-0820; (206) 236-1676

In Germany they first came for the Communists and I didn't speak up
because I wasn't a Communist. Then they came for the Jews, and I
didn't speak up because I wasn't a Jew. Then they came for the trade
unionists, and I didn't speak up because I wasn't a trade unionist.
Then they came for the Catholics, and I didn't speak up because I was
a Protestant. Then they came for me -- and by that time no one was
left to speak up.
-- Pastor Martin Niemoller
what's the easiest way to de-html-ize files?
This is for those of us who appreciate ASCII or straight
ISO_8859-15 rather than marked up files. I have slapped together
a crude C program that does scotch (or *cleanse*) text of
and so on. Still... is there some standalone converter
that gets rid of markup more elegantly? Something where I
can say

% cmd file_1.html ... file_N.html and output file_1.text ...
file_N.text?

thanks, gents,

gary

--
Gary Kline [EMAIL PROTECTED] www.thought.org Public Service Unix