Hi,

On Mon, Aug 2, 2010 at 8:55 AM, Kaspar Fischer <[email protected]> wrote:
> I am experiencing very slow performance with Tika 0.7 on some large HTML
> documents.
> [...]
>
> com.bluecast.xml.JAXPSAXParserFactory$JAXPSAXParser(javax.xml.parsers.SAXParser).parse(java.io.InputStream, org.xml.sax.helpers.DefaultHandler) line: 198
> org.apache.tika.detect.XmlRootExtractor.extractRootElement(byte[]) line: 60

Tika uses the default XML parser in the classpath, which in your case seems to be Piccolo [1]. It looks like Piccolo is having trouble with the way Tika feeds just the beginning of the input file to the XML parser when trying to parse only the root element of the document.

> Is there anything I can do to speed this up?

If you don't need Piccolo specifically, you may want to try switching to another XML parser library. Even the default one included in your Java installation should work just fine for Tika.

[1] http://piccolo.sourceforge.net/

BR,

Jukka Zitting
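
One possible way to try the parser bundled with Java without removing Piccolo from the classpath is to point the standard JAXP system property `javax.xml.parsers.SAXParserFactory` at the JDK's built-in factory before Tika runs. The following is only a sketch: the class name `JdkSaxParserSwitch` and its methods are made up for illustration, it assumes Java 9 or later (for `SAXParserFactory.newDefaultInstance()`), and it assumes that Tika obtains its parser through the regular JAXP lookup, as the stack trace above suggests.

```java
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper, not part of Tika or Piccolo.
public class JdkSaxParserSwitch {

    /** Returns the class name of the factory that the JAXP lookup currently resolves to. */
    public static String currentFactory() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    /**
     * Makes later SAXParserFactory.newInstance() calls use the JDK's built-in
     * implementation by setting the JAXP system property. Requires Java 9+.
     */
    public static String useJdkDefault() {
        // newDefaultInstance() bypasses the lookup and returns the built-in factory.
        String builtin = SAXParserFactory.newDefaultInstance().getClass().getName();
        // The system property is consulted before any parser found on the classpath.
        System.setProperty("javax.xml.parsers.SAXParserFactory", builtin);
        return builtin;
    }

    public static void main(String[] args) {
        System.out.println("before: " + currentFactory());
        System.out.println("switched to: " + useJdkDefault());
        System.out.println("after: " + currentFactory());
    }
}
```

Calling `JdkSaxParserSwitch.useJdkDefault()` early in application startup, or passing the same property with `-D` on the command line, should have the same effect. On Java versions older than 9 the factory class name would have to be given explicitly, and it depends on the JDK vendor.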
