Hi,

I have two questions about thread-safety and locking in Tika's MIME type detectors. They came up while investigating global locks in NUTCH-2578 (multi-threaded fetcher) [1].

First, are the methods Tika.detect(...) and MimeTypes.detect(...) thread-safe? I've found an answer from 2011 about Tika.detect(...)
  https://www.mail-archive.com/[email protected]/msg00296.html
but I want to make sure that it is still true and that it also applies to MimeTypes.detect(...).

Second, there is a lock (on the jar file) when detecting the MIME type of XML or HTML documents:

"FetcherThread" #146 daemon ... waiting for monitor entry [0x00007f21b3f45000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.zip.ZipFile.getEntry(ZipFile.java:315)
        - waiting to lock <0x00000005e03245b8> (a java.util.jar.JarFile)
        at java.util.jar.JarFile.getEntry(JarFile.java:240)
        at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
        at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042)
        ...
        at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232)
        at org.apache.xerces.parsers.SecuritySupport$6.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.xerces.parsers.SecuritySupport.getResourceAsStream(Unknown Source)
        at org.apache.xerces.parsers.ObjectFactory.findJarServiceProvider(Unknown Source)
        at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
        at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
        at org.apache.xerces.parsers.SAXParser.<init>(Unknown Source)
        at org.apache.xerces.parsers.SAXParser.<init>(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.<init>(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl.<init>(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserFactoryImpl.newSAXParser(Unknown Source)
        at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:62)
        at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:42)
        at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:212)
        at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:494)
        at org.apache.nutch.util.MimeUtil.autoResolveContentType(MimeUtil.java:193)
        at org.apache.nutch.protocol.Content.getContentType(Content.java:310)
        at org.apache.nutch.protocol.Content.<init>(Content.java:107)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:321)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:341)

Out of 120 threads, I've found up to 30 waiting for this lock. The trace shows that every call to XmlRootExtractor.extractRootElement(...) creates a new SAX parser, and that Xerces then looks up its parser configuration as a resource in the jar file, which is where the threads block.

For the stack line o.a.xerces.parsers.ObjectFactory.createObject(...) I've found the following discussion
  https://www.mail-archive.com/[email protected]/msg03825.html
which recommends either reusing the parser (probably hard to make thread-safe) or explicitly setting the property "org.apache.xerces.xni.parser.XMLParserConfiguration".

Has anyone seen a similar problem?

Thanks,
Sebastian

[1] https://issues.apache.org/jira/browse/NUTCH-2578
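
For reference, here is a minimal sketch of the two workarounds from the Xerces discussion. It is not what Tika or Nutch currently do. The class name XercesWorkaroundSketch and its methods are hypothetical. The configuration class name is an assumption (I believe it is the Xerces default, but it should be checked against the Xerces version in use). Unlike XmlRootExtractor, the helper parses the whole document and is not hardened against untrusted input.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class XercesWorkaroundSketch {

    // Property name as quoted in the message above.
    static final String CONFIG_PROPERTY =
        "org.apache.xerces.xni.parser.XMLParserConfiguration";

    // Assumption: the configuration class to pin. Verify it against
    // the Xerces version on the classpath before relying on it.
    static final String CONFIG_CLASS =
        "org.apache.xerces.parsers.XIncludeAwareParserConfiguration";

    // Workaround 1: set the property once at startup (or pass it as -D...)
    // so that Xerces can take the class name from the system property
    // instead of searching the jar file for each new parser.
    static void pinParserConfiguration() {
        if (System.getProperty(CONFIG_PROPERTY) == null) {
            System.setProperty(CONFIG_PROPERTY, CONFIG_CLASS);
        }
    }

    // Workaround 2: keep one parser per thread. A SAXParser is not
    // thread-safe, but each thread only ever touches its own instance.
    private static final ThreadLocal<SAXParser> PARSER =
        ThreadLocal.withInitial(() -> {
            try {
                SAXParserFactory factory = SAXParserFactory.newInstance();
                factory.setNamespaceAware(true);
                return factory.newSAXParser();
            } catch (Exception e) {
                throw new IllegalStateException("cannot create SAX parser", e);
            }
        });

    // Returns the local name of the root element, reusing this thread's parser.
    static String rootElement(byte[] xml) throws Exception {
        final String[] root = new String[1];
        PARSER.get().parse(new ByteArrayInputStream(xml), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if (root[0] == null) {
                    root[0] = localName; // remember only the first element
                }
            }
        });
        return root[0];
    }
}
```

With the first variant the parser is still created per call, only the jar lookup should go away. With the second variant each fetcher thread pays the parser construction cost once, at the price of holding one parser per thread.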
