Hi Jukka,
I've heard anecdotally that NekoHTML is better at extracting outlinks
than TagSoup, but in my experience they are roughly equivalent - some
broken docs are handled better by TagSoup, and some by NekoHTML.
I wish I'd saved the results of a recent crawl/parse that I did with
the previous version of Tika, as that would have been useful for comparison.
-- Ken
On Oct 14, 2009, at 12:57pm, Jukka Zitting wrote:
Hi,
As noted in TIKA-310, I've replaced the NekoHTML dependency (and the
transitive Xerces one) with the TagSoup library.
Based on quick testing, TagSoup works at least as well as NekoHTML for
our needs, and the dependency change helped cut the tika-app jar size
from 27MB to 25MB. Most notably, this change removes the Xerces
dependency, which is troublesome in many environments that depend on
some specific XML parser being picked up by JAXP.
However, since this is a pretty notable change to a core feature, if
you use Tika for parsing lots of HTML, please try out the latest trunk
and report any problems.
BR,
Jukka Zitting
--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378