Re: HTML Parsing problems...

2003-09-22 Thread Michael Giles
Yeah, I was using HTMLParser for a few days until I tried to parse a 400K 
document and it spun at 100% CPU for a very long time.  It is tolerant of 
bad HTML, but does not appear to scale.  TagSoup processed the same 
document in a second or less at <25% CPU.

-Mike

At 02:42 PM 9/22/2003 +0200, you wrote:

> TagSoup is great - however, it is not maintained nor developed (the same 
> could be said about JTidy as well, but TagSoup's history is much 
> shorter...). I'm using HTMLParser (http://htmlparser.sourceforge.net) for 
> my application, and it also works very well, even for ill-formed input. 
> It's also very actively developed.
>
> --
> Best regards,
> Andrzej Bialecki
> -
> Software Architect, System Integration Specialist
> CEN/ISSS EC Workshop, ECIMF project chair
> EU FP6 E-Commerce Expert/Evaluator
> -


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: HTML Parsing problems...

2003-09-22 Thread Andrzej Bialecki
Michael Giles wrote:
> Erik,
>
> Probably a good idea to swap something else in, although Neko introduces 
> a dependency on Xerces.  I didn't play with Neko because I am currently 
> using a different XML parser and didn't want to deal with the conflicts 
> (and also find dependencies on specific parsers annoying).  However, 
> yesterday I downloaded 
> TagSoup (http://mercury.ccil.org/~cowan/XML/tagsoup/) and it is great!  
> It is small and fast and so far has parsed every page I've thrown at 
> it.  I wrote a SAX ContentHandler that only grabs the text and does a 
> few other little things (like inserting spaces, removing tabs/line 
> feeds, grabbing title) and it seems to be a perfect fit for the job.  It 
> requires the SAX framework, but is parser independent.  The only tweak I 
> made to the TagSoup code was to add an "else" to deal with a bug where 
> it was consuming ";" after entities that it did not deal with.

TagSoup is great - however, it is not maintained nor developed (the same 
could be said about JTidy as well, but TagSoup's history is much 
shorter...). I'm using HTMLParser (http://htmlparser.sourceforge.net) 
for my application, and it also works very well, even for ill-formed 
input. It's also very actively developed.

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)




Re: HTML Parsing problems...

2003-09-20 Thread Michael Giles
Erik,

Probably a good idea to swap something else in, although Neko introduces a 
dependency on Xerces.  I didn't play with Neko because I am currently using 
a different XML parser and didn't want to deal with the conflicts (and also 
find dependencies on specific parsers annoying).  However, yesterday I 
downloaded TagSoup(http://mercury.ccil.org/~cowan/XML/tagsoup/) and it is 
great!  It is small and fast and so far has parsed every page I've thrown 
at it.  I wrote a SAX ContentHandler that only grabs the text and does a 
few other little things (like inserting spaces, removing tabs/line feeds, 
grabbing title) and it seems to be a perfect fit for the job.  It requires 
the SAX framework, but is parser independent.  The only tweak I made to the 
TagSoup code was to add an "else" to deal with a bug where it was consuming 
";" after entities that it did not deal with.

If Neko is potentially headed into the Apache fold, that probably makes 
sense.  But if you are interested in my TagSoup ContentHandler for testing 
it out, just let me know.
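
For anyone who wants to try the approach, here is a minimal sketch of a text-grabbing SAX handler along the lines described above. It is not the handler from this message: the class name TextExtractor, its getTitle()/getText() accessors, and the choice to skip script and style content are all assumptions made for illustration. It uses only the SAX classes that ship with the JDK.

```java
import java.util.Locale;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical text-extracting SAX handler (illustrative, not the original code). */
class TextExtractor extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final StringBuilder title = new StringBuilder();
    private boolean inTitle = false;
    private int skipDepth = 0; // > 0 while inside <script> or <style> (assumed behavior)

    // Namespace-aware readers fill localName; others may only fill qName.
    private static String name(String localName, String qName) {
        String n = (localName == null || localName.isEmpty()) ? qName : localName;
        return n.toLowerCase(Locale.ROOT);
    }

    private static boolean isSkipped(String n) {
        return n.equals("script") || n.equals("style");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String n = name(localName, qName);
        if (isSkipped(n)) {
            skipDepth++;
        } else if (n.equals("title")) {
            inTitle = true;
        } else {
            text.append(' '); // insert a space so words in adjacent elements do not fuse
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String n = name(localName, qName);
        if (isSkipped(n)) {
            if (skipDepth > 0) skipDepth--;
        } else if (n.equals("title")) {
            inTitle = false;
        } else {
            text.append(' ');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth > 0) return; // ignore script/style content
        (inTitle ? title : text).append(ch, start, length);
    }

    /** Title text with tabs, line feeds and repeated spaces collapsed. */
    public String getTitle() { return normalize(title); }

    /** Body text with tabs, line feeds and repeated spaces collapsed. */
    public String getText() { return normalize(text); }

    private static String normalize(CharSequence s) {
        return s.toString().replaceAll("\\s+", " ").trim();
    }
}
```

Because the handler depends only on the SAX interfaces, it can be handed to any SAX parser: the JDK's own SAXParser for well-formed input, or an HTML-tolerant XMLReader via setContentHandler() followed by parse().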

-Mike

At 08:08 PM 9/19/2003 -0400, you wrote:
> I'm going to swap in the neko HTML parser for the demo refactorings I'm
> doing.  I would be all for replacing the demo HTML parser with this.
> If you look at the Ant  task in the sandbox, you'll see that I
> used JTidy for it and it works well, but I've heard that neko is faster
> and better so I'll give it a try.
> Erik





Re: HTML Parsing problems...

2003-09-19 Thread Erik Hatcher
I'm going to swap in the neko HTML parser for the demo refactorings I'm  
doing.  I would be all for replacing the demo HTML parser with this.

If you look at the Ant  task in the sandbox, you'll see that I  
used JTidy for it and it works well, but I've heard that neko is faster  
and better so I'll give it a try.

	Erik

On Thursday, September 18, 2003, at 04:50  PM, Michael Giles wrote:

I know, I know, the HTML Parser in the demo is just that (i.e. a  
demo), but I also know that it is updated from time to time and  
performs much better than the other ones that I have tested.   
Frustratingly, the very first page I tried to parse failed  
(http://www.theregister.co.uk/content/54/32593.html).
It seems to be choking  
on tags that are being written inside of JavaScript code (i.e.  
document.write('');.  Obviously, the simple solution  
(that I am using with another parser) is to just ignore everything  
inside of 

Re: HTML Parsing problems...

2003-09-19 Thread Michael Giles
Tatu,

Thanks for the reply.  See below for comments.

> just ignore everything inside of 

Re: HTML Parsing problems...

2003-09-18 Thread Peter Becker
Tatu Saloranta wrote:

On Thursday 18 September 2003 14:50, Michael Giles wrote:
 

I know, I know, the HTML Parser in the demo is just that (i.e. a demo), but
I also know that it is updated from time to time and performs much better
than the other ones that I have tested.  Frustratingly, the very first page
I tried to parse failed
(http://www.theregister.co.uk/content/54/32593.html).
It seems to be choking on tags that are being
written inside of JavaScript code (i.e. document.write('');. 
Obviously, the simple solution (that I am using with another parser) is to
just ignore everything inside of 

Re: HTML Parsing problems...

2003-09-18 Thread Tatu Saloranta
On Thursday 18 September 2003 14:50, Michael Giles wrote:
> I know, I know, the HTML Parser in the demo is just that (i.e. a demo), but
> I also know that it is updated from time to time and performs much better
> than the other ones that I have tested.  Frustratingly, the very first page
> I tried to parse failed
> (http://www.theregister.co.uk/content/54/32593.html).
> It seems to be choking on tags that are being
> written inside of JavaScript code (i.e. document.write('');. 
> Obviously, the simple solution (that I am using with another parser) is to
> just ignore everything inside of 

HTML Parsing problems...

2003-09-18 Thread Michael Giles
I know, I know, the HTML Parser in the demo is just that (i.e. a demo), but 
I also know that it is updated from time to time and performs much better 
than the other ones that I have tested.  Frustratingly, the very first page 
I tried to parse failed 
(http://www.theregister.co.uk/content/54/32593.html). 
It seems to be choking on tags that are being written inside of JavaScript 
code (i.e. document.write('');.  Obviously, the simple 
solution (that I am using with another parser) is to just ignore everything 
inside of