Re: which HTML parser is better?

2005-02-04 Thread Karl Koch
The link does not work. One which we've been using can be found at: http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/ We absolutely need to be able to recover gracefully from malformed HTML and/or SGML. Most of the nicer SAX/DOM/TLA parsers out there failed this criterion when we

Re: which HTML parser is better?

2005-02-04 Thread Ian Soboroff
Oops. It's in the Google cache and also the Internet Archive Wayback machine. I'll drop the original author a note to let him know that his links are stale. http://web.archive.org/web/20040208014740/http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/ Ian Karl Koch [EMAIL PROTECTED]

Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
- From: Jingkang Zhang [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 1:15 AM To: lucene-user@jakarta.apache.org Subject: which HTML parser is better? Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser,JTidy) are mentioned in Lucene FAQ 1.3.27.Which

Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
-Original Message- From: Jingkang Zhang [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 1:15 AM To: lucene-user@jakarta.apache.org Subject: which HTML parser is better? Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser,JTidy

Re: which HTML parser is better?

2005-02-03 Thread sergiu gordea
Subject: which HTML parser is better? Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser,JTidy) are mentioned in Lucene FAQ 1.3.27.Which is the best?Can it filter tags that are auto-created by MS-word 'Save As HTML files' function

Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
I appologise in advance, if some of my writing here has been said before. The last three answers to my question have been suggesting pattern matching solutions and Swing. Pattern matching was introduced in Java 1.4 and Swing is something I cannot use since I work with Java 1.1 on a PDA. I am

Re: which HTML parser is better?

2005-02-03 Thread sergiu gordea
Zhang [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 1:15 AM To: lucene-user@jakarta.apache.org Subject: which HTML parser is better? Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser,JTidy) are mentioned in Lucene FAQ 1.3.27.Which is the best?Can it filter tags

Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
- From: Jingkang Zhang [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 1:15 AM To: lucene-user@jakarta.apache.org Subject: which HTML parser is better? Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser,JTidy) are mentioned in Lucene FAQ 1.3.27.Which is the best

Re: which HTML parser is better?

2005-02-03 Thread sergiu gordea
Karl Koch wrote: I appologise in advance, if some of my writing here has been said before. The last three answers to my question have been suggesting pattern matching solutions and Swing. Pattern matching was introduced in Java 1.4 and Swing is something I cannot use since I work with Java 1.1 on

Re: which HTML parser is better?

2005-02-03 Thread Dawid Weiss
Karl, Two things, try to experiment with both: 1) I would try to write a lexical scanner that strips HTML tags, much like the regular expression does. Java lexical scanner packages produce nice pure Java classes that seldom use any advanced API, so they should work on Java 1.1. They are simple

Re: which HTML parser is better? - Thread closed

2005-02-03 Thread Karl Koch
Thank you, I will do that. Karl Koch wrote: I appologise in advance, if some of my writing here has been said before. The last three answers to my question have been suggesting pattern matching solutions and Swing. Pattern matching was introduced in Java 1.4 and Swing is something I

Re: which HTML parser is better?

2005-02-03 Thread aurora
For all parser suggestion I think there is one important attribute. Some parsers returns data provide that the input HTML is sensible. Some parsers is designed to be most flexible as tolerant as it can be. If the input is clean and controlled the former class is sufficient. Even some regular

Re: which HTML parser is better?

2005-02-03 Thread Ian Soboroff
One which we've been using can be found at: http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/ We absolutely need to be able to recover gracefully from malformed HTML and/or SGML. Most of the nicer SAX/DOM/TLA parsers out there failed this criterion when we started our effort. The above

RE: which HTML parser is better?

2005-02-02 Thread Karl Koch
-Original Message- From: Jingkang Zhang [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 1:15 AM To: lucene-user@jakarta.apache.org Subject: which HTML parser is better? Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser,JTidy) are mentioned

Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
Subject: which HTML parser is better? Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser,JTidy) are mentioned in Lucene FAQ 1.3.27.Which is the best?Can it filter tags that are auto-created by MS-word 'Save As HTML files' function

Re: which HTML parser is better?

2005-02-02 Thread Erik Hatcher
On Feb 2, 2005, at 6:17 AM, Karl Koch wrote: Hello, I have been following this thread and have another question. Is there a piece of sourcecode (which is preferably very short and simple (KISS)) which allows to remove all HTML tags from HTML content? HTML 3.2 would be enough...also no frames,

Re: which HTML parser is better?

2005-02-02 Thread Karl Koch
that go beyond indexing them in Lucene, and really like it. It has been robust for me so far. Chuck -Original Message- From: Jingkang Zhang [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 1:15 AM To: lucene-user@jakarta.apache.org Subject: which HTML

Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
: Tuesday, February 01, 2005 1:15 AM To: lucene-user@jakarta.apache.org Subject: which HTML parser is better? Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser,JTidy) are mentioned in Lucene FAQ 1.3.27.Which is the best?Can it filter tags that are auto-created by MS-word 'Save

Re: which HTML parser is better?

2005-02-02 Thread Karl Koch
far. Chuck -Original Message- From: Jingkang Zhang [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 1:15 AM To: lucene-user@jakarta.apache.org Subject: which HTML parser is better? Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser

Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
so far. Chuck -Original Message- From: Jingkang Zhang [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 1:15 AM To: lucene-user@jakarta.apache.org Subject: which HTML parser is better? Three HTML parsers(Lucene web application demo

Re: which HTML parser is better?

2005-02-02 Thread Otis Gospodnetic
Message- From: Jingkang Zhang [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 1:15 AM To: lucene-user@jakarta.apache.org Subject: which HTML parser is better? Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser,JTidy) are mentioned in Lucene FAQ 1.3.27.Which

Re: which HTML parser is better?

2005-02-02 Thread Luke Shannon
)*?\\s?; HTH Luke - Original Message - From: sergiu gordea [EMAIL PROTECTED] To: Lucene Users List lucene-user@jakarta.apache.org Sent: Wednesday, February 02, 2005 1:23 PM Subject: Re: which HTML parser is better? Karl Koch wrote: I am in control of the html, which means

RE: which HTML parser is better?

2005-02-02 Thread Kauler, Leto S
, 2005 1:15 AM To: lucene-user@jakarta.apache.org Subject: which HTML parser is better? Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser,JTidy) are mentioned in Lucene FAQ 1.3.27.Which is the best?Can it filter tags that are auto-created by MS

Re: which HTML parser is better?

2005-02-02 Thread Bill Tschumy
No one has yet mentioned using ParserDelegator and ParserCallback that are part of HTMLEditorKit in Swing. I have been successfully using these classes to parse out the text of an HTML file. You just need to extend HTMLEditorKit.ParserCallback and override the various methods that are called

Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
@jakarta.apache.org Subject: which HTML parser is better? Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser,JTidy) are mentioned in Lucene FAQ 1.3.27.Which is the best?Can it filter tags that are auto-created by MS-word 'Save As HTML files' function? CONFIDENTIALITY

which HTML parser is better?

2005-02-01 Thread Jingkang Zhang
Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser,JTidy) are mentioned in Lucene FAQ 1.3.27.Which is the best?Can it filter tags that are auto-created by MS-word 'Save As HTML files' function? _ Do You Yahoo!? 150MP3

Re: which HTML parser is better?

2005-02-01 Thread sergiu gordea
Jingkang Zhang wrote: Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser,JTidy) are mentioned in Lucene FAQ 1.3.27.Which is the best?Can it filter tags that are auto-created by MS-word 'Save As HTML files' function? maybe you can try this library...

Re: which HTML parser is better?

2005-02-01 Thread Michael Giles
When I tested parsers a year or so ago for intensive use in Furl, the best (tolerant of bad HTML) and fastest (tested on a 1.5M HTML page) parser by far was TagSoup ( http://www.tagsoup.info ). It is actively maintained and improved and I have never had any problems with it. -Mike Jingkang Zhang

RE: which HTML parser is better?

2005-02-01 Thread Chuck Williams
: which HTML parser is better? Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser,JTidy) are mentioned in Lucene FAQ 1.3.27.Which is the best?Can it filter tags that are auto-created by MS-word 'Save As HTML files' function