On Thu, 8 Dec 2011, Kevin Krouse wrote:
Anyone?
I'd suggest you ignore them for now.
The issue is that they have the same name as the real file (plus a "._"
prefix), so the extension suggests a different type from what the file
actually contains.
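In case it helps, ignoring them could be as simple as a name check before the file is handed to Tika. This is only a sketch: the class and method names are made up for illustration and are not part of Tika or LabKey, and it assumes that anything whose name starts with "._" is safe to skip.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class AppleDoubleNameFilter {
    // Hypothetical helper: true when the last path element starts with "._".
    static boolean isAppleDoubleName(Path path) {
        Path name = path.getFileName();
        return name != null && name.toString().startsWith("._");
    }

    public static void main(String[] args) {
        Path hidden = Paths.get("/Users/kevink/data/._somefile.xml");
        Path real = Paths.get("/Users/kevink/data/somefile.xml");
        System.out.println(isAppleDoubleName(hidden)); // true
        System.out.println(isAppleDoubleName(real));   // false
    }
}
```

The check looks only at the file name, so it would also skip any genuine document that happens to be named with a "._" prefix.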
We should probably add mime magic to detect them, if anyone knows which
bits of the header are stable? Looking at a few files I have to hand,
they all seem to start with:
00000000  00 05 16 07 00 02 00 00  4d 61 63 20 4f 53 20 58  |........Mac OS X|
00000010  20 20 20 20 20 20 20 20  00 02 00 00 00 09 00 00  |        ........|
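If the first four bytes (00 05 16 07) turn out to be the stable part, a check along these lines would catch the files regardless of their name. It is a rough sketch in plain Java rather than a real Tika mime magic entry; the class and method names are hypothetical, and treating those four bytes as sufficient is an assumption based only on the dump above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class AppleDoubleMagic {
    // Assumed magic: the first four bytes from the hex dump above.
    private static final byte[] MAGIC = {0x00, 0x05, 0x16, 0x07};

    // Peeks at the start of the stream and resets it, so a parser can
    // still read the stream from the beginning afterwards.
    static boolean looksLikeAppleDouble(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(MAGIC.length);
        try {
            byte[] head = new byte[MAGIC.length];
            int read = 0;
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    break; // stream ended before four bytes were available
                }
                read += n;
            }
            return read == MAGIC.length && Arrays.equals(head, MAGIC);
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = {0x00, 0x05, 0x16, 0x07, 0x00, 0x02, 0x00, 0x00};
        System.out.println(looksLikeAppleDouble(new ByteArrayInputStream(sample))); // true
        byte[] xml = "<?xml version=\"1.0\"?>".getBytes("US-ASCII");
        System.out.println(looksLikeAppleDouble(new ByteArrayInputStream(xml)));    // false
    }
}
```

A plain FileInputStream does not support mark/reset, so it would need wrapping in a BufferedInputStream before being passed in.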
Nick
On Fri, Dec 2, 2011 at 11:34 AM, Kevin Krouse wrote:
Hello Tikas,
We are getting XML parse exceptions when Tika tries to index Mac hidden
metadata files that start with a "._" prefix. I don't know much about
these hidden files, but they are binary files and won't parse as XML.
Should we be filtering these out before Tika tries to process them or
is it a bug in the AutoDetectParser?
org.labkey.search.model.LuceneSearchServiceImpl$PreProcessingException:/Users/kevink/data/._somefile.xml
    at org.labkey.search.model.LuceneSearchServiceImpl.logAsPreProcessingException(LuceneSearchServiceImpl.java:701)
    at org.labkey.search.model.LuceneSearchServiceImpl.preprocess(LuceneSearchServiceImpl.java:499)
    at org.labkey.search.model.AbstractSearchService.preprocess(AbstractSearchService.java:883)
    at org.labkey.search.model.AbstractSearchService.getPreprocessedItem(AbstractSearchService.java:967)
    at org.labkey.search.model.AbstractSearchService$7.run(AbstractSearchService.java:1003)
    at java.lang.Thread.run(Thread.java:680)
org.apache.tika.exception.TikaException: XML parse error
    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:71)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:145)
    at org.labkey.search.model.LuceneSearchServiceImpl.parse(LuceneSearchServiceImpl.java:575)
    at org.labkey.search.model.LuceneSearchServiceImpl.preprocess(LuceneSearchServiceImpl.java:339)
    ... 4 more
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
    at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:196)
    at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:175)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:394)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:322)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:281)
    at org.apache.xerces.impl.XMLScanner.reportFatalError(XMLScanner.java:1459)
    at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(XMLDocumentScannerImpl.java:870)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:324)
    at org.apache.xerces.parsers.XML11Configuration.parse(XML11Configuration.java:845)
    at org.apache.xerces.parsers.XML11Configuration.parse(XML11Configuration.java:768)
    at org.apache.xerces.parsers.XMLParser.parse(XMLParser.java:108)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1196)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:555)
    at org.apache.xerces.jaxp.SAXParserImpl.parse(SAXParserImpl.java:289)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:65)
    ... 10 more
Kevin