Hi Talat,

... and there's also parse-tika, which uses tagsoup.
There are subtle differences, e.g., regarding upper/lower case of element
and attribute names in the DOM; see NUTCH-1592.

There has been a discussion [1] on @user about parser benchmarks, and there
is the more general o.a.n.tools.Benchmark class [2]. But I don't know of a
reliable HTML parser benchmark. It would be nice to have one covering:
- all 3 possible parsers (parse-html with neko or tagsoup, parse-tika)
- quality/correctness (e.g., when parsing HTML5)
- speed

Sebastian

[1] http://lucene.472066.n3.nabble.com/Parse-benchmark-performance-tt4045827.html
[2] http://lucene.472066.n3.nabble.com/Benchmark-of-Nutch-trunk-td1010283.html

On 03/20/2014 09:12 AM, Talat Uyarer wrote:
> Hi all,
> We have two parsers library to parse HTML content: neko and tagsoup, could you
> explain which one should be preferred to the other and why? Or isn't there
> any difference at all?
>
> Do we have benchmarks each one ?
>
> Thanks

