Michael McCandless created TIKA-1033:
----------------------------------------

             Summary: Tika doesn't parse embedded OLE Chart/Graph objects
                 Key: TIKA-1033
                 URL: https://issues.apache.org/jira/browse/TIKA-1033
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Michael McCandless
            Priority: Minor
         Attachments: emb.ppt

I have an example ppt that embeds a chart, but Tika mis-identifies the
embedded chart as an XLS document.

The progID (oleShape.getProgID() in
HSLFExtractor.handleSlideEmbeddedResources) is MSGraph.Chart.8 ... and
we seem to detect it as Excel (application/vnd.ms-excel) but then the
ExcelExtractor hits this exception:

{noformat}
org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
        at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
        at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301)
        at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285)
        at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251)
        at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143)
        at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
        at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:302)
        at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:147)
{noformat}
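
One possible direction, sketched here only as an illustration, is to
consult the progID before falling back to Excel detection. The class
below is hypothetical and is not Tika's actual code: the name
ProgIdSketch, the guessType method, the prefix table, and the use of
application/vnd.ms-graph as the chart's media type are all assumptions
made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, NOT Tika's implementation: maps an OLE progID
// prefix to a media type so that a chart object is not treated as a
// regular Excel workbook.
public class ProgIdSketch {

    // Assumed prefix table; the media type used for MSGraph charts
    // is a placeholder chosen for this example.
    private static final Map<String, String> PREFIX_TO_TYPE = new LinkedHashMap<>();
    static {
        PREFIX_TO_TYPE.put("MSGraph.Chart", "application/vnd.ms-graph");
        PREFIX_TO_TYPE.put("Excel.Sheet", "application/vnd.ms-excel");
    }

    // Returns the media type of the first matching prefix, or a
    // generic binary type when the progID is null or unknown.
    public static String guessType(String progId) {
        if (progId != null) {
            for (Map.Entry<String, String> e : PREFIX_TO_TYPE.entrySet()) {
                if (progId.startsWith(e.getKey())) {
                    return e.getValue();
                }
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        // The progID reported for the chart embedded in emb.ppt.
        System.out.println(guessType("MSGraph.Chart.8"));
    }
}
```

With a check along these lines, the embedded chart would no longer be
handed to ExcelExtractor as though it were a spreadsheet; whether the
chart stream can then be parsed at all is a separate question.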

Since DelegatingParser silently suppresses all exceptions, running
TikaCLI on the ppt shows neither an exception nor any extracted text.
If you run with -z, however, it saves the embedded object as 1.xls,
and parsing that file with TikaCLI hits the above exception.

