Hi,

As noted in TIKA-310, I've replaced the NekoHTML dependency (and its transitive Xerces dependency) with the TagSoup library.
Based on quick testing, TagSoup works at least as well as NekoHTML for our needs, and the dependency change cut the tika-app jar size from 27MB to 25MB. Most notably, this change removes the Xerces dependency, which is troublesome in many environments that depend on a specific XML parser being picked up by JAXP.

However, since this is a pretty notable change to a core feature, if you use Tika for parsing lots of HTML, please try out the latest trunk and report any problems.

BR,

Jukka Zitting
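
For anyone wanting to check which XML parser JAXP actually picks up in their environment, before and after the dependency change, something like the following minimal sketch may help. It uses only the standard JAXP factory lookups. The class name JaxpProbe and its helper methods are hypothetical and not part of Tika.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper, not part of Tika: reports which JAXP
// implementations are resolved on the current classpath.
class JaxpProbe {

    // Class name of the DOM parser factory that JAXP resolves.
    static String domFactoryName() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    // Class name of the SAX parser factory that JAXP resolves.
    static String saxFactoryName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("DOM factory: " + domFactoryName());
        System.out.println("SAX factory: " + saxFactoryName());
    }
}
```

With a Xerces jar on the classpath, the reported names would typically start with org.apache.xerces. Without it, the JDK's built-in implementation should be reported instead.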