Dear Jukka,

Thanks for the quick response!

On 02.08.2010, at 09:42, Jukka Zitting wrote:

> On Mon, Aug 2, 2010 at 8:55 AM, Kaspar Fischer
> <[email protected]> wrote:
>> I am experiencing very slow performance with Tika 0.7 on some large HTML 
>> documents.
>> [...]
>>        com.bluecast.xml.JAXPSAXParserFactory$JAXPSAXParser(javax.xml.parsers.SAXParser).parse(java.io.InputStream, org.xml.sax.helpers.DefaultHandler) line: 198
>>        org.apache.tika.detect.XmlRootExtractor.extractRootElement(byte[]) line: 60
> 
> Tika uses the default XML parser in the classpath, which in your case
> seems to be Piccolo [1]. It looks like Piccolo is having trouble with
> the way Tika feeds just the beginning of the input file to the XML
> parser when trying to parse only the root element of the document.

Ah, right. (I just tried Piccolo 1.0.4, and it works there.)
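
For anyone reading this in the archive, here is a rough sketch of the general technique described above: hand the XML parser only the first bytes of a file and stop as soon as the root element is seen. This is not Tika's actual code; the class name RootSniffer and the method rootElement are made up for illustration, and the sketch uses whatever SAX parser JAXP finds on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika.
public class RootSniffer {

    // Returns the local name of the root element found in the given
    // (possibly truncated) prefix of a document, or null if none was seen.
    static String rootElement(byte[] prefix) {
        final String[] root = new String[1];
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new ByteArrayInputStream(prefix), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName,
                                         Attributes attributes) throws SAXException {
                    root[0] = localName;
                    // Abort parsing: only the first element matters here.
                    throw new SAXException("root element found");
                }
            });
        } catch (Exception e) {
            // Expected: either the deliberate abort above or an error
            // caused by the input being truncated.
        }
        return root[0];
    }

    public static void main(String[] args) {
        byte[] prefix = "<html><body>truncated".getBytes(StandardCharsets.UTF_8);
        System.out.println(rootElement(prefix));
    }
}
```

A parser that copes poorly with truncated input could plausibly spend a long time in the parse call here, which would be consistent with the stack trace quoted above.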

>> Is there anything I can do to speed this up?
> 
> If you don't need Piccolo specifically, you may want to try switching
> to another XML parser library. Even the default one included in your
> Java installation should work just fine for Tika.

I will do that, as I do not have any particular need for Piccolo.
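
In case it helps others, a minimal sketch of how one might check which SAX parser JAXP picks up and then point it at the JDK's built-in one. It assumes the code in question obtains its parser through SAXParserFactory.newInstance() (the stack trace suggests JAXP, but I have not verified this in the Tika source) and that the JVM is an Oracle/OpenJDK build, where the built-in factory has the class name shown. The class name ParserCheck is made up for illustration.

```java
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper for illustration only.
public class ParserCheck {

    // Built-in JAXP SAX factory on Oracle/OpenJDK builds (assumption:
    // other JVM vendors may ship a different default).
    static final String JDK_FACTORY =
            "com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl";

    // Name of the factory class that JAXP currently resolves.
    static String resolvedFactory() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("before: " + resolvedFactory());

        // The standard JAXP system property takes precedence over
        // factories discovered from jars on the classpath.
        System.setProperty("javax.xml.parsers.SAXParserFactory", JDK_FACTORY);
        System.out.println("after:  " + resolvedFactory());
    }
}
```

The same property can be passed on the command line with -Djavax.xml.parsers.SAXParserFactory=... instead of being set in code. Simply removing the Piccolo jar from the classpath should have the same effect if nothing else depends on it.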

Again, thanks for your help.

Kaspar
