Hi,

On Mon, Aug 2, 2010 at 8:55 AM, Kaspar Fischer
<[email protected]> wrote:
> I am experiencing very slow performance with Tika 0.7 on some large HTML 
> documents.
> [...]
>        com.bluecast.xml.JAXPSAXParserFactory$JAXPSAXParser(javax.xml.parsers.SAXParser).parse(java.io.InputStream, org.xml.sax.helpers.DefaultHandler) line: 198
>        org.apache.tika.detect.XmlRootExtractor.extractRootElement(byte[]) line: 60

Tika uses the default XML parser in the classpath, which in your case
seems to be Piccolo [1]. It looks like Piccolo is having trouble with
the way Tika feeds just the beginning of the input file to the XML
parser when trying to parse only the root element of the document.
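
To illustrate the general idea, here is a rough sketch of root-element
sniffing on a truncated buffer. It is not Tika's actual code: the class
RootElementSketch, the method rootElementOf and the RootFound exception
are hypothetical names, and the sketch simply assumes the handler can
abort the parse as soon as the first start tag is reported.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class RootElementSketch {

    // Hypothetical marker exception used to stop parsing early.
    static class RootFound extends SAXException {
        final String name;
        RootFound(String name) {
            super("root element found: " + name);
            this.name = name;
        }
    }

    // Returns the qualified name of the root element, or null if the
    // prefix is not XML or ends before the first start tag is complete.
    static String rootElementOf(byte[] prefix) {
        try {
            // Picks up whichever JAXP implementation is configured.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            parser.parse(new ByteArrayInputStream(prefix), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                        String qName, Attributes attributes) throws SAXException {
                    // The first start tag is the root; abort right away.
                    throw new RootFound(qName);
                }
            });
        } catch (RootFound found) {
            return found.name;
        } catch (Exception e) {
            // Truncated or non-XML input: no root element detected.
        }
        return null;
    }
}
```

A parser that handles the early abort or the premature end of input
badly would show up as slow in exactly this kind of call.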

> Is there anything I can do to speed this up?

If you don't need Piccolo specifically, you may want to try switching
to another XML parser library. Even the default one included in your
Java installation should work just fine for Tika.
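
If removing the Piccolo jar from the classpath is not an option, one
way to force the JDK's own parser is the standard JAXP system property
javax.xml.parsers.SAXParserFactory, which takes precedence over
factories discovered on the classpath. The sketch below assumes a
Sun/Oracle or OpenJDK runtime, where the bundled factory is the
internal class com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl;
other JVMs may use a different name. The class UseJdkSaxParser is a
hypothetical name for illustration.

```java
import javax.xml.parsers.SAXParserFactory;

public class UseJdkSaxParser {

    // Assumption: class name of the SAX factory bundled with Sun/Oracle
    // and OpenJDK runtimes; adjust for other JVM vendors.
    static final String JDK_FACTORY =
            "com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl";

    // Points JAXP at the JDK's bundled factory and returns a new instance.
    static SAXParserFactory useJdkSaxFactory() {
        // Must run before the code that looks up a SAX parser.
        System.setProperty("javax.xml.parsers.SAXParserFactory", JDK_FACTORY);
        return SAXParserFactory.newInstance();
    }

    public static void main(String[] args) {
        SAXParserFactory factory = useJdkSaxFactory();
        // Shows which implementation JAXP actually selected.
        System.out.println(factory.getClass().getName());
    }
}
```

The same property can be passed on the command line with
-Djavax.xml.parsers.SAXParserFactory=... instead of being set in code.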

[1] http://piccolo.sourceforge.net/

BR,

Jukka Zitting
