[ 
https://issues.apache.org/jira/browse/TIKA-374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-374.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.7
         Assignee: Jukka Zitting

Thanks for the accurate analysis of the problem. I fixed this in revision 
903775 by making each call to XmlRootExtractor.extractRootElement() use a new 
SAXParser instance.

> AutoDetectParser not thread-safe?
> ---------------------------------
>
>                 Key: TIKA-374
>                 URL: https://issues.apache.org/jira/browse/TIKA-374
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 0.5
>         Environment: Dell E6400 (dual-core) running 64-bit Windows 7.  Also 
> reproduced on an 8-processor Mac OS/X server.
>            Reporter: Adam Rauch
>            Assignee: Jukka Zitting
>             Fix For: 0.7
>
>
> We are using Tika 0.5 to parse files that are added to a Lucene index.  If we 
> assign multiple threads to the parsing task we find that the 
> AutoDetectParser.parse() method occasionally fails to return.  In our case, 
> it appears that a HashMap inside Xerces gets corrupted, causing an infinite 
> loop inside HashMap.get().  This seems to be a concurrency problem; we have 
> not seen the issue when running single threaded.
> Other posts have stated that AutoDetectParser is thread-safe.  A quick look 
> at the source code shows that an AutoDetectParser holds a MimeTypes which 
> holds an XmlRootExtractor which holds a SAXParser.  As a result, a single 
> SAXParser instance can end up simultaneously parsing documents in multiple 
> threads.  The Java 1.4 SAXParser JavaDoc clearly states that "An 
> implementation of SAXParser is NOT guaranteed to behave as per the 
> specification if it is used concurrently by two or more threads."  More 
> recent versions of the JavaDoc have removed the warning, though the presence 
> of "setProperty()" certainly means that a SAXParser is not immutable.  As you 
> can see from the stack trace below, properties seem to be the issue in this 
> case.
> We've tried to work around the issue by constructing a new AutoDetectParser 
> for each file we parse, but this doesn't solve the problem.  Multiple 
> AutoDectectParsers can still end up sharing a single instance of MimeTypes, 
> because TikaConfig holds a MimeTypes instance statically (??) and updates it 
> without synchronization (??).
> java.lang.Thread.State: RUNNABLE
>              at java.util.HashMap.get(HashMap.java:303)
>              at 
> org.apache.xerces.util.ParserConfigurationSettings.getProperty(ParserConfigurationSettings.java:224)
>              at 
> org.apache.xerces.impl.dtd.XMLDTDProcessor.reset(XMLDTDProcessor.java:344)
>              at 
> org.apache.xerces.parsers.XML11Configuration.reset(XML11Configuration.java:984)
>              at 
> org.apache.xerces.parsers.XML11Configuration.parse(XML11Configuration.java:806)
>              at 
> org.apache.xerces.parsers.XML11Configuration.parse(XML11Configuration.java:768)
>              at org.apache.xerces.parsers.XMLParser.parse(XMLParser.java:108)
>              at 
> org.apache.xerces.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1196)
>              at 
> org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:555)
>              at 
> org.apache.xerces.jaxp.SAXParserImpl.parse(SAXParserImpl.java:289)
>              at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>              at 
> org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:63)
>              at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:237)
>              at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:534)
>              at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:92)
>              at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:114)
>              at 
> org.labkey.search.model.LuceneSearchServiceImpl.preprocess(LuceneSearchServiceImpl.java:170)
>              at 
> org.labkey.search.model.AbstractSearchService.preprocess(AbstractSearchService.java:664)
>              at 
> org.labkey.search.model.AbstractSearchService.getPreprocessedItem(AbstractSearchService.java:737)
>              at 
> org.labkey.search.model.AbstractSearchService$7.run(AbstractSearchService.java:773)
>              at java.lang.Thread.run(Thread.java:637)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to