Hi Talat,

... and there's also parse-tika which uses tagsoup.

There are subtle differences, e.g., regarding upper/lower case of element
and attribute names in the DOM, see NUTCH-1592.

There has been a discussion [1] on the user list about parser benchmarks,
and there is the more general o.a.n.tools.Benchmark class [2].
But I don't know of a reliable HTML parser benchmark.
Would be nice to have one including
- all 3 possible parsers (parse-html with neko or tagsoup, parse-tika)
- quality/correctness (e.g., when parsing HTML5)
- speed
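
For anyone who wants to compare the two parse-html variants by hand,
switching between them should be a one-property change in
conf/nutch-site.xml. A minimal sketch, assuming the property is still
named parser.html.impl as in nutch-default.xml (please check your
version); it does not cover routing HTML to parse-tika, which is
configured separately:

```xml
<!-- conf/nutch-site.xml: choose the library used by the parse-html plugin.
     Assumption: property name and values as found in nutch-default.xml. -->
<property>
  <name>parser.html.impl</name>
  <!-- "neko" (NekoHTML) or "tagsoup" (TagSoup) -->
  <value>tagsoup</value>
</property>
```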

Sebastian

[1] http://lucene.472066.n3.nabble.com/Parse-benchmark-performance-tt4045827.html
[2] http://lucene.472066.n3.nabble.com/Benchmark-of-Nutch-trunk-td1010283.html

On 03/20/2014 09:12 AM, Talat Uyarer wrote:
> Hi all,
> We have two parser libraries to parse HTML content: neko and tagsoup. Could
> you explain which one should be preferred over the other, and why? Or isn't
> there any difference at all?
> 
> Do we have benchmarks for each one?
> 
> Thanks
> 
