Actually, after 15 minutes it seems that Tika is in an infinite loop: The call

 com.bluecast.xml.XMLStreamReader.read(char[], int, int)

does not seem to complete. In Eclipse's debugger I see that the first 
argument is an array of size 16384, the second is 0, and the third is 16384. 
Could the problem be an overflow?
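
For reference, those arguments by themselves look like an ordinary request to 
refill a whole buffer starting at offset 0. Below is a minimal sketch of such a 
refill loop using only java.io. The class ReadLoopSketch, the method drain and 
the maxStalls guard are made up for illustration; this is not Piccolo's actual 
code, which I have not inspected.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class ReadLoopSketch {

    /**
     * Hypothetical helper: drains a Reader through a fixed-size buffer and
     * returns the total number of chars read. Gives up if the reader keeps
     * returning 0 chars more than maxStalls times in a row.
     */
    static long drain(Reader in, int bufferSize, int maxStalls) {
        char[] buf = new char[bufferSize];
        long total = 0;
        int stalls = 0;
        try {
            while (true) {
                // Same argument shape as seen in the debugger:
                // (buffer, offset 0, length = buffer size).
                int n = in.read(buf, 0, buf.length);
                if (n == -1) {
                    return total; // end of stream reached
                }
                if (n == 0) {
                    // No progress; a loop without this guard would spin forever.
                    if (++stalls > maxStalls) {
                        throw new IllegalStateException("reader made no progress");
                    }
                } else {
                    stalls = 0;
                    total += n;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Roughly 500k of filler between two tags, similar in size to the page.
        char[] filler = new char[500_000];
        Arrays.fill(filler, 'x');
        String doc = "<html>" + new String(filler) + "</html>";
        System.out.println(drain(new StringReader(doc), 16384, 3));
    }
}
```

If the real loop never sees -1, or the underlying reader never returns, the 
caller would hang with exactly this kind of stack, so the argument values alone 
do not point to an overflow.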

Kaspar

On 02.08.2010, at 08:55, Kaspar Fischer wrote:

> Dear list,
> 
> I am experiencing very slow performance with Tika 0.7 on some large HTML 
> documents. For example, when calling Tika.detect() on
> 
>  http://www.lumerias.com/browse/2220001
> 
> which is 500k of HTML, it takes several minutes for the call to complete.
> 
> I have suspended the VM several times and every time, the stack trace (see 
> below) shows Tika is within MimeTypes.getMimeType()'s call to 
> XmlRootExtractor.extractRootElement(). So it seems Tika is busy parsing the 
> file.
> 
> Is there anything I can do to speed this up?
> 
> Many thanks,
> Kaspar
> 
> --
> Thread [pool-1-thread-99] (Suspended)
>       com.bluecast.xml.XMLStreamReader.read(char[], int, int) line: not available
>       com.bluecast.xml.PiccoloLexer.yy_refill() line: not available
>       com.bluecast.xml.PiccoloLexer.yylex() line: not available
>       com.bluecast.xml.Piccolo.yylex() line: not available
>       com.bluecast.xml.Piccolo.yyparse() line: not available
>       com.bluecast.xml.Piccolo.parse(org.xml.sax.InputSource) line: not available
>       com.bluecast.xml.JAXPSAXParserFactory$JAXPSAXParser(javax.xml.parsers.SAXParser).parse(org.xml.sax.InputSource, org.xml.sax.helpers.DefaultHandler) line: 395
>       com.bluecast.xml.JAXPSAXParserFactory$JAXPSAXParser(javax.xml.parsers.SAXParser).parse(java.io.InputStream, org.xml.sax.helpers.DefaultHandler) line: 198
>       org.apache.tika.detect.XmlRootExtractor.extractRootElement(byte[]) line: 60
>       org.apache.tika.mime.MimeTypes.getMimeType(byte[]) line: 232
>       org.apache.tika.mime.MimeTypes.detect(java.io.InputStream, org.apache.tika.metadata.Metadata) line: 530
>       org.apache.tika.Tika.detect(java.io.InputStream, org.apache.tika.metadata.Metadata) line: 109
>       org.myorg.is.document.TikaMetadataExtractor.extractMetadata(java.io.Serializable, org.myorg.is.engine.api.document.Metadata, java.io.InputStream) line: 73
>       org.myorg.is.document.MemoryDocumentStore.add(org.myorg.is.engine.api.document.Link, org.myorg.is.engine.api.document.MetadataExtractor, java.util.Map<java.lang.String,java.lang.Object>) line: 149
>       ...
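
Until the cause is known, one possible stopgap is to bound the time spent in 
the detection call shown in the trace above. The sketch below uses only 
java.util.concurrent. The class BoundedDetect, the method detectWithTimeout and 
the fallback value are hypothetical; in real use the Callable would wrap the 
Tika.detect(...) call from the trace. Note that cancelling only interrupts the 
worker thread: a parser that spins without checking for interruption keeps 
running in the background, which is why the worker is a daemon thread here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedDetect {

    /**
     * Hypothetical helper: runs the given detection task on a separate daemon
     * thread and returns its result, or the fallback if the task fails or does
     * not finish within timeoutMillis.
     */
    static String detectWithTimeout(Callable<String> detection,
                                    long timeoutMillis,
                                    String fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-detect");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<String> future = executor.submit(detection);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: interrupts the worker thread
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the detection itself threw an exception
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-ins for a fast and a hanging detection call.
        String fast = detectWithTimeout(() -> "text/html", 1000, "application/octet-stream");
        String slow = detectWithTimeout(() -> {
            Thread.sleep(60_000);
            return "text/html";
        }, 100, "application/octet-stream");
        System.out.println(fast);
        System.out.println(slow);
    }
}
```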
