[ https://issues.apache.org/jira/browse/TIKA-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-310.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5

Replaced NekoHTML with TagSoup in revision 825239.

> Use TagSoup to parse HTML
> -------------------------
>
>                 Key: TIKA-310
>                 URL: https://issues.apache.org/jira/browse/TIKA-310
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>             Fix For: 0.5
>
>
> The NekoHTML library we currently use for parsing HTML has a transitive
> dependency on Apache Xerces. The Xerces library is pretty big (1.2MB) and is
> known to cause various problems when included in the classpath of an
> application or a container that expects some other XML parser library.
> The TagSoup library (http://home.ccil.org/~cowan/XML/tagsoup/) provides an
> alternative HTML parsing library that works pretty much like NekoHTML but
> doesn't depend on Xerces. I suggest we switch from NekoHTML to TagSoup unless
> this change causes major regressions in HTML parsing.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.