Re: FYI: NekoHTML/Xerces dependency replaced with TagSoup

Ken Krugler Wed, 14 Oct 2009 13:27:00 -0700

Hi Jukka,

I've heard anecdotally that NekoHTML is better at extracting outlinks than TagSoup, but in my experience they are roughly equivalent - some broken docs are handled better by TagSoup, and some by NekoHTML.

I wish I'd saved the results of a recent crawl/parse that I did with the previous version, as that would have been useful for comparison.


-- Ken

On Oct 14, 2009, at 12:57pm, Jukka Zitting wrote:

Hi,

As noted in TIKA-310, I've replaced the NekoHTML dependency (and the
transitive Xerces one) with the TagSoup library.

Based on quick testing, TagSoup works just as well as NekoHTML (if not
better) for our needs, and the dependency change helped cut the
tika-app jar size from 27MB to 25MB. Most notably this change removes
the Xerces dependency that is troublesome for many environments that
depend on some specific XML parser being picked up by JAXP.

However, since this is a pretty notable change to a core feature,
please try out the latest trunk and report any problems if you use
Tika for parsing lots of HTML.

BR,

Jukka Zitting


--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
