Dear list,

I am experiencing very slow performance with Tika 0.7 on some large HTML documents. For example, when calling Tika.detect() on http://www.lumerias.com/browse/2220001 which is 500k of HTML, it takes several minutes for the call to complete.

I have suspended the VM several times, and every time the stack trace (see below) shows Tika is within MimeTypes.getMimeType()'s call to XmlRootExtractor.extractRootElement(). So it seems Tika is busy parsing the file.

Is there anything I can do to speed this up?

Many thanks,
Kaspar

--
Thread [pool-1-thread-99] (Suspended)
    com.bluecast.xml.XMLStreamReader.read(char[], int, int) line: not available
    com.bluecast.xml.PiccoloLexer.yy_refill() line: not available
    com.bluecast.xml.PiccoloLexer.yylex() line: not available
    com.bluecast.xml.Piccolo.yylex() line: not available
    com.bluecast.xml.Piccolo.yyparse() line: not available
    com.bluecast.xml.Piccolo.parse(org.xml.sax.InputSource) line: not available
    com.bluecast.xml.JAXPSAXParserFactory$JAXPSAXParser(javax.xml.parsers.SAXParser).parse(org.xml.sax.InputSource, org.xml.sax.helpers.DefaultHandler) line: 395
    com.bluecast.xml.JAXPSAXParserFactory$JAXPSAXParser(javax.xml.parsers.SAXParser).parse(java.io.InputStream, org.xml.sax.helpers.DefaultHandler) line: 198
    org.apache.tika.detect.XmlRootExtractor.extractRootElement(byte[]) line: 60
    org.apache.tika.mime.MimeTypes.getMimeType(byte[]) line: 232
    org.apache.tika.mime.MimeTypes.detect(java.io.InputStream, org.apache.tika.metadata.Metadata) line: 530
    org.apache.tika.Tika.detect(java.io.InputStream, org.apache.tika.metadata.Metadata) line: 109
    org.myorg.is.document.TikaMetadataExtractor.extractMetadata(java.io.Serializable, org.myorg.is.engine.api.document.Metadata, java.io.InputStream) line: 73
    org.myorg.is.document.MemoryDocumentStore.add(org.myorg.is.engine.api.document.Link, org.myorg.is.engine.api.document.MetadataExtractor, java.util.Map<java.lang.String,java.lang.Object>) line: 149
    ...
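
For anyone who wants to take the same kind of stack sample without suspending the VM by hand, here is a minimal sketch that uses only the JDK. It runs a call on a worker thread and, if the call has not returned within a timeout, records where that thread currently is. The names SlowCallSampler and sampleIfSlow are hypothetical and not part of Tika, and the sleeping task in main() is only a stand-in for the Tika.detect() call shown in the trace above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SlowCallSampler {

    /**
     * Runs the task on a dedicated worker thread. Returns null if the task
     * finishes within timeoutMillis; otherwise returns the worker's current
     * stack trace as text and cancels the task.
     */
    static String sampleIfSlow(Callable<?> task, long timeoutMillis) throws Exception {
        final Thread[] worker = new Thread[1];
        // The thread factory lets us keep a reference to the worker thread.
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "sampled-worker");
            t.setDaemon(true);
            worker[0] = t;
            return t;
        });
        try {
            Future<?> future = pool.submit(task);
            try {
                future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return null; // finished in time, nothing to report
            } catch (TimeoutException e) {
                // Still running: snapshot the frames the worker is executing.
                StringBuilder sb = new StringBuilder();
                for (StackTraceElement frame : worker[0].getStackTrace()) {
                    sb.append(frame).append('\n');
                }
                future.cancel(true); // interrupt the slow call
                return sb.toString();
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the slow call; the real case would wrap something like
        // tika.detect(stream, metadata) in the Callable instead.
        String trace = sampleIfSlow(() -> { Thread.sleep(5000); return null; }, 200);
        System.out.println(trace);
    }
}
```

Sampling a few times with different timeouts would show whether the thread is always in the same frames, which is the same observation made above by suspending the VM manually.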
