AutoDetectParser not thread-safe?
---------------------------------

                 Key: TIKA-374
                 URL: https://issues.apache.org/jira/browse/TIKA-374
             Project: Tika
          Issue Type: Bug
          Components: mime
    Affects Versions: 0.5
         Environment: Dell E6400 (dual-core) running 64-bit Windows 7.  Also 
reproduced on an 8-processor Mac OS X server.
            Reporter: Adam Rauch


We are using Tika 0.5 to parse files that are added to a Lucene index.  When we 
assign multiple threads to the parsing task, the AutoDetectParser.parse() 
method occasionally fails to return.  In our case, it appears that a HashMap 
inside Xerces gets corrupted, causing an infinite loop inside HashMap.get().  
This looks like a concurrency problem; we have not seen the issue when running 
single-threaded.

Other posts have stated that AutoDetectParser is thread-safe.  A quick look at 
the source code shows that an AutoDetectParser holds a MimeTypes which holds an 
XmlRootExtractor which holds a SAXParser.  As a result, a single SAXParser 
instance can end up simultaneously parsing documents in multiple threads.  The 
Java 1.4 SAXParser JavaDoc clearly states that "An implementation of SAXParser 
is NOT guaranteed to behave as per the specification if it is used concurrently 
by two or more threads."  More recent versions of the JavaDoc have removed the 
warning, though the presence of "setProperty()" certainly means that a 
SAXParser is not immutable.  As you can see from the stack trace below, 
properties seem to be the issue in this case.
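
One way to avoid sharing a SAXParser between threads is to create a parser for 
each call instead of holding one in a field.  The following is only a minimal 
sketch using the standard JAXP API, not Tika's actual XmlRootExtractor code; 
the class name PerCallRootExtractor and the method extractRootLocalName are 
hypothetical names chosen for illustration.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: not part of Tika.
final class PerCallRootExtractor {

    /**
     * Returns the local name of the root element, or null if the stream
     * could not be parsed far enough to see one.
     */
    static String extractRootLocalName(InputStream in) {
        final String[] root = new String[1];
        try {
            // The factory and parser are local to this call, so no parser
            // state is ever visible to another thread.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            parser.parse(in, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attrs)
                        throws SAXException {
                    root[0] = localName;
                    // Abort: only the first element is of interest.
                    throw new SAXException("root element found");
                }
            });
        } catch (Exception e) {
            // Either the deliberate abort above or a genuine parse or I/O
            // failure; in both cases root[0] holds whatever was seen.
        }
        return root[0];
    }
}
```

Creating a parser per call costs more than reusing one, so a per-thread parser 
(for example via ThreadLocal) may be a reasonable alternative if that overhead 
turns out to matter.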

We've tried to work around the issue by constructing a new AutoDetectParser for 
each file we parse, but this doesn't solve the problem.  Multiple 
AutoDetectParsers can still end up sharing a single instance of MimeTypes, 
because TikaConfig appears to hold a MimeTypes instance statically and to 
update it without synchronization.
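
Until the shared state is fixed, one stopgap on the caller's side is to route 
every parse through a single process-wide lock, so that at most one thread is 
inside the shared parser at a time.  This is a sketch under that assumption 
only; SerializedParsing is a hypothetical helper, not part of Tika, and it 
gives up parallel parsing in exchange for safety.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: not part of Tika.
final class SerializedParsing {

    // One lock for the whole JVM, because the suspect state is static.
    private static final Object LOCK = new Object();

    /** Runs the task while holding the global lock and returns its result. */
    static <T> T call(Callable<T> task) throws Exception {
        synchronized (LOCK) {
            return task.call();
        }
    }
}
```

A caller would wrap each AutoDetectParser.parse() invocation in a Callable and 
pass it to SerializedParsing.call().  Work that does not touch the parser, 
such as adding the result to the index, can stay outside the lock.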

java.lang.Thread.State: RUNNABLE
             at java.util.HashMap.get(HashMap.java:303)
             at org.apache.xerces.util.ParserConfigurationSettings.getProperty(ParserConfigurationSettings.java:224)
             at org.apache.xerces.impl.dtd.XMLDTDProcessor.reset(XMLDTDProcessor.java:344)
             at org.apache.xerces.parsers.XML11Configuration.reset(XML11Configuration.java:984)
             at org.apache.xerces.parsers.XML11Configuration.parse(XML11Configuration.java:806)
             at org.apache.xerces.parsers.XML11Configuration.parse(XML11Configuration.java:768)
             at org.apache.xerces.parsers.XMLParser.parse(XMLParser.java:108)
             at org.apache.xerces.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1196)
             at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:555)
             at org.apache.xerces.jaxp.SAXParserImpl.parse(SAXParserImpl.java:289)
             at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
             at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:63)
             at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:237)
             at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:534)
             at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:92)
             at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:114)
             at org.labkey.search.model.LuceneSearchServiceImpl.preprocess(LuceneSearchServiceImpl.java:170)
             at org.labkey.search.model.AbstractSearchService.preprocess(AbstractSearchService.java:664)
             at org.labkey.search.model.AbstractSearchService.getPreprocessedItem(AbstractSearchService.java:737)
             at org.labkey.search.model.AbstractSearchService$7.run(AbstractSearchService.java:773)
             at java.lang.Thread.run(Thread.java:637)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.