Dear Jukka,

Thanks for the quick response!

On 02.08.2010, at 09:42, Jukka Zitting wrote:

> On Mon, Aug 2, 2010 at 8:55 AM, Kaspar Fischer
> <[email protected]> wrote:
>> I am experiencing very slow performance with Tika 0.7 on some large HTML
>> documents.
>> [...]
>>
>> com.bluecast.xml.JAXPSAXParserFactory$JAXPSAXParser(javax.xml.parsers.SAXParser).parse(java.io.InputStream, org.xml.sax.helpers.DefaultHandler) line: 198
>> org.apache.tika.detect.XmlRootExtractor.extractRootElement(byte[]) line: 60
>
> Tika uses the default XML parser in the classpath, which in your case
> seems to be Piccolo [1]. It looks like Piccolo is having trouble with
> the way Tika feeds just the beginning of the input file to the XML
> parser when trying to parse only the root element of the document.

Ah, right. (I just tried with Piccolo 1.0.4 and there it works.)

>> Is there anything I can do to speed this up?
>
> If you don't need Piccolo specifically, you may want to try switching
> to another XML parser library. Even the default one included in your
> Java installation should work just fine for Tika.

I will do that as I indeed do not have a particular need for Piccolo.

Again, thanks for your help.

Kaspar
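
For anyone finding this thread later, here is a minimal sketch of one way to make the switch without touching the classpath. It assumes that Tika obtains its parser through the standard JAXP lookup (SAXParserFactory.newInstance()), and that the JVM is a Sun/Oracle/OpenJDK runtime where the bundled factory class has the name used below. The class ParserSwitch and its methods are made up for illustration and are not part of Tika. Removing the Piccolo jar from the classpath should have a similar effect.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika.
final class ParserSwitch {
    // Assumption: factory class bundled with Sun/Oracle/OpenJDK runtimes.
    static final String JDK_FACTORY =
        "com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl";

    // Tell the JAXP lookup to prefer the JDK parser over whatever is on the
    // classpath (e.g. Piccolo), then obtain a factory the default way.
    static SAXParserFactory jdkFactory() {
        System.setProperty("javax.xml.parsers.SAXParserFactory", JDK_FACTORY);
        return SAXParserFactory.newInstance();
    }

    // Small check that the selected parser works: return the name of the
    // first element it reports for the given document.
    static String rootElement(byte[] xml) {
        final String[] root = new String[1];
        try {
            SAXParserFactory factory = jdkFactory();
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            parser.parse(new ByteArrayInputStream(xml), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    if (root[0] == null) {
                        root[0] = qName;
                    }
                }
            });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return root[0];
    }

    public static void main(String[] args) {
        // Prints the factory class actually selected, then the root element name.
        System.out.println(jdkFactory().getClass().getName());
        byte[] doc = "<html><body/></html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(rootElement(doc));
    }
}
```

The same property can also be passed on the command line as -Djavax.xml.parsers.SAXParserFactory=..., which avoids any code change; it just has to be set before the first parser lookup happens.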
