Actually, after 15 minutes it seems that Tika is in an infinite loop: the call
com.bluecast.xml.XMLStreamReader.read(char[], int, int) does not seem to
complete. In Eclipse's debugger I see that the first argument is an array of
size 16384, the second is 0, and the third is 16384. Could the problem be an
overflow?

Kaspar

On 02.08.2010, at 08:55, Kaspar Fischer wrote:

> Dear list,
>
> I am experiencing very slow performance with Tika 0.7 on some large HTML
> documents. For example, when calling Tika.detect() on
>
> http://www.lumerias.com/browse/2220001
>
> which is 500k of HTML, it takes several minutes for the call to complete.
>
> I have suspended the VM several times and every time, the stack trace (see
> below) shows Tika is within MimeTypes.getMimeType()'s call to
> XmlRootExtractor.extractRootElement(). So it seems Tika is busy parsing the
> file.
>
> Is there anything I can do to speed this up?
>
> Many thanks,
> Kaspar
>
> --
> Thread [pool-1-thread-99] (Suspended)
>   com.bluecast.xml.XMLStreamReader.read(char[], int, int) line: not available
>   com.bluecast.xml.PiccoloLexer.yy_refill() line: not available
>   com.bluecast.xml.PiccoloLexer.yylex() line: not available
>   com.bluecast.xml.Piccolo.yylex() line: not available
>   com.bluecast.xml.Piccolo.yyparse() line: not available
>   com.bluecast.xml.Piccolo.parse(org.xml.sax.InputSource) line: not available
>   com.bluecast.xml.JAXPSAXParserFactory$JAXPSAXParser(javax.xml.parsers.SAXParser).parse(org.xml.sax.InputSource, org.xml.sax.helpers.DefaultHandler) line: 395
>   com.bluecast.xml.JAXPSAXParserFactory$JAXPSAXParser(javax.xml.parsers.SAXParser).parse(java.io.InputStream, org.xml.sax.helpers.DefaultHandler) line: 198
>   org.apache.tika.detect.XmlRootExtractor.extractRootElement(byte[]) line: 60
>   org.apache.tika.mime.MimeTypes.getMimeType(byte[]) line: 232
>   org.apache.tika.mime.MimeTypes.detect(java.io.InputStream, org.apache.tika.metadata.Metadata) line: 530
>   org.apache.tika.Tika.detect(java.io.InputStream, org.apache.tika.metadata.Metadata) line: 109
>   org.myorg.is.document.TikaMetadataExtractor.extractMetadata(java.io.Serializable, org.myorg.is.engine.api.document.Metadata, java.io.InputStream) line: 73
>   org.myorg.is.document.MemoryDocumentStore.add(org.myorg.is.engine.api.document.Link, org.myorg.is.engine.api.document.MetadataExtractor, java.util.Map<java.lang.String,java.lang.Object>) line: 149
>   ...
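
Until the cause of the hang in the trace above is understood, one possible stopgap is to run the detection call on a worker thread and give up after a deadline. The following is only a minimal sketch using java.util.concurrent; the class name BoundedDetect, the helper callWithTimeout, the 200 ms deadline, and the fallback type are all hypothetical, and the sleeping task merely stands in for the Tika.detect() call, which is not shown. It does not fix the parser: if the stuck code never checks for interruption, the worker thread keeps running and using CPU after the caller has moved on.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedDetect {

    /**
     * Runs the task on a separate daemon thread and returns its result,
     * or the given fallback if the task has not finished within the timeout.
     * (Hypothetical helper, not part of Tika.)
     */
    public static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-detect");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Requests interruption; code that ignores interrupts keeps running.
            future.cancel(true);
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the detection call: a task that takes far too long.
        String type = callWithTimeout(() -> {
            Thread.sleep(60_000);
            return "text/html";
        }, 200, "application/octet-stream");
        System.out.println(type); // prints "application/octet-stream"
    }
}
```

In real use the lambda would wrap the actual detection call, and the deadline would be chosen from how long detection normally takes on large documents.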
