Hi,

Based on the Xerces discussion, it sounds like using a pool of parsers would be the best approach.
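
A minimal sketch of what such a pool could look like, using only standard JAXP calls. The class name SAXParserPool and its methods are hypothetical and not existing Tika API. The idea is that the expensive newSAXParser() call (the one that ends up waiting on the JarFile lock in the stack trace below) happens only when no idle parser is available, and each thread holds a parser exclusively while it uses it:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

/** Hypothetical helper (not part of Tika): a bounded pool of reusable SAXParser instances. */
final class SAXParserPool {
    private final BlockingQueue<SAXParser> idle;
    private final SAXParserFactory factory;

    SAXParserPool(int capacity) {
        this.idle = new ArrayBlockingQueue<>(capacity);
        this.factory = SAXParserFactory.newInstance();
        this.factory.setNamespaceAware(true);
    }

    /** Returns an idle parser if one exists, otherwise creates a new one (never blocks). */
    SAXParser acquire() {
        SAXParser parser = idle.poll();
        return parser != null ? parser : create();
    }

    /** Resets the parser and returns it to the pool; it is dropped if the pool is full. */
    void release(SAXParser parser) {
        try {
            parser.reset();
        } catch (UnsupportedOperationException e) {
            return; // this implementation cannot be reset, so do not reuse it
        }
        idle.offer(parser);
    }

    /** Number of parsers currently waiting in the pool. */
    int idleCount() {
        return idle.size();
    }

    private SAXParser create() {
        // SAXParserFactory is not guaranteed to be thread-safe, so serialize creation.
        synchronized (factory) {
            try {
                return factory.newSAXParser();
            } catch (ParserConfigurationException | SAXException e) {
                throw new IllegalStateException("Cannot create SAX parser", e);
            }
        }
    }
}
```

A caller would acquire() a parser, run parse(...) inside a try block, and release() it in the finally block, so that a parser is never shared between two threads at the same time.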

Best,

Jukka

On Thu, May 17, 2018 at 11:51 AM, Sebastian Nagel <[email protected]> wrote:
> Hi,
>
> two questions regarding thread-safety and locking in Tika's MIME type
> detectors while investigating global locks in NUTCH-2578 (multi-threaded
> fetcher) [1].
>
> First, are the methods Tika.detect(...) and MimeType.detect(...) thread-safe?
> I've found an answer from 2011 about Tika.detect(...)
>   https://www.mail-archive.com/[email protected]/msg00296.html
> but want to make sure whether this is still true and also applies to
> MimeType.detect(...)?
>
>
> Second, there is a lock (on the jar file) when detecting the MIME type
> of XML or HTML documents:
>
> "FetcherThread" #146 daemon ... waiting for monitor entry [0x00007f21b3f45000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at java.util.zip.ZipFile.getEntry(ZipFile.java:315)
>         - waiting to lock <0x00000005e03245b8> (a java.util.jar.JarFile)
>         at java.util.jar.JarFile.getEntry(JarFile.java:240)
>         at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
>         at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042)
>         ...
>         at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232)
>         at org.apache.xerces.parsers.SecuritySupport$6.run(Unknown Source)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at org.apache.xerces.parsers.SecuritySupport.getResourceAsStream(Unknown Source)
>         at org.apache.xerces.parsers.ObjectFactory.findJarServiceProvider(Unknown Source)
>         at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
>         at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
>         at org.apache.xerces.parsers.SAXParser.<init>(Unknown Source)
>         at org.apache.xerces.parsers.SAXParser.<init>(Unknown Source)
>         at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.<init>(Unknown Source)
>         at org.apache.xerces.jaxp.SAXParserImpl.<init>(Unknown Source)
>         at org.apache.xerces.jaxp.SAXParserFactoryImpl.newSAXParser(Unknown Source)
>         at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:62)
>         at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:42)
>         at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:212)
>         at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:494)
>         at org.apache.nutch.util.MimeUtil.autoResolveContentType(MimeUtil.java:193)
>         at org.apache.nutch.protocol.Content.getContentType(Content.java:310)
>         at org.apache.nutch.protocol.Content.<init>(Content.java:107)
>         at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:321)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:341)
>
> From 120 threads I've found up to 30 waiting for this lock.
> For the stack line
>   o.a.xerces.parsers.ObjectFactory.createObject(...)
> I've found the following discussion
>   https://www.mail-archive.com/[email protected]/msg03825.html
> which recommends either to reuse the parser (probably hard to get it
> thread-safe) or to explicitly set the property
>   "org.apache.xerces.xni.parser.XMLParserConfiguration".
>
> Did anyone see a similar problem?
>
>
> Thanks,
> Sebastian
>
>
> [1] https://issues.apache.org/jira/browse/NUTCH-2578
