Nick,

Thanks for the suggestion to try it from tika-app. Having come from Solr, I was 
unaware of this (great) tool. I opened a .vsd file that I knew parsed fine, and 
it loaded as expected. I then opened the problem file and finally saw the 
stack trace. As you suspected, it looks more like POI is the offender: the root 
cause is an ArrayIndexOutOfBoundsException thrown from POI's StringUtil.

Apache Tika was unable to parse the document
at ...\myfile.vsd.

The full exception stack trace is included below:

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@5b202f4d
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
        at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
        at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:238)
        at javax.swing.AbstractButton.fireActionPerformed(Unknown Source)
        at javax.swing.AbstractButton$Handler.actionPerformed(Unknown Source)
        at javax.swing.DefaultButtonModel.fireActionPerformed(Unknown Source)
        at javax.swing.DefaultButtonModel.setPressed(Unknown Source)
        at javax.swing.AbstractButton.doClick(Unknown Source)
        at javax.swing.plaf.basic.BasicMenuItemUI.doClick(Unknown Source)
        at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(Unknown Source)
        at java.awt.Component.processMouseEvent(Unknown Source)
        at javax.swing.JComponent.processMouseEvent(Unknown Source)
        at java.awt.Component.processEvent(Unknown Source)
        at java.awt.Container.processEvent(Unknown Source)
        at java.awt.Component.dispatchEventImpl(Unknown Source)
        at java.awt.Container.dispatchEventImpl(Unknown Source)
        at java.awt.Component.dispatchEvent(Unknown Source)
        at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
        at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source)
        at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
        at java.awt.Container.dispatchEventImpl(Unknown Source)
        at java.awt.Window.dispatchEventImpl(Unknown Source)
        at java.awt.Component.dispatchEvent(Unknown Source)
        at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
        at java.awt.EventQueue.access$000(Unknown Source)
        at java.awt.EventQueue$1.run(Unknown Source)
        at java.awt.EventQueue$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.security.AccessControlContext$1.doIntersectionPrivilege(Unknown Source)
        at java.security.AccessControlContext$1.doIntersectionPrivilege(Unknown Source)
        at java.awt.EventQueue$2.run(Unknown Source)
        at java.awt.EventQueue$2.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.security.AccessControlContext$1.doIntersectionPrivilege(Unknown Source)
        at java.awt.EventQueue.dispatchEvent(Unknown Source)
        at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
        at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
        at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
        at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
        at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
        at java.awt.EventDispatchThread.run(Unknown Source)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Illegal offset 8 (String data is of length 8)
        at org.apache.poi.util.StringUtil.getFromUnicodeLE(StringUtil.java:70)
        at org.apache.poi.hdgf.chunks.Chunk.processCommands(Chunk.java:203)
        at org.apache.poi.hdgf.chunks.ChunkFactory.createChunk(ChunkFactory.java:180)
        at org.apache.poi.hdgf.streams.ChunkStream.findChunks(ChunkStream.java:59)
        at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:93)
        at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:100)
        at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:100)
        at org.apache.poi.hdgf.HDGFDiagram.<init>(HDGFDiagram.java:106)
        at org.apache.poi.hdgf.extractor.VisioTextExtractor.<init>(VisioTextExtractor.java:55)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:214)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:177)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        ... 43 more
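
My reading of the "Caused by" message is that a string was requested starting 
at offset 8 of a buffer only 8 bytes long, so the start offset sits exactly one 
past the last valid index. Below is a minimal stand-alone sketch of that kind of 
bounds check. This is not POI's actual source: the class OffsetCheckSketch and 
the method decodeUtf16Le are hypothetical names used only for illustration, and 
the 8-byte buffer is made-up data, not taken from my file.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the POI implementation.
public class OffsetCheckSketch {

    // Decodes charCount UTF-16LE characters starting at byte offset `offset`.
    static String decodeUtf16Le(byte[] data, int offset, int charCount) {
        // A start offset equal to data.length is already out of range.
        if (offset < 0 || offset >= data.length) {
            throw new ArrayIndexOutOfBoundsException(
                "Illegal offset " + offset
                + " (String data is of length " + data.length + ")");
        }
        // Each UTF-16LE character here takes two bytes.
        if (charCount < 0 || (data.length - offset) / 2 < charCount) {
            throw new IllegalArgumentException("Illegal length " + charCount);
        }
        return new String(data, offset, charCount * 2, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // Made-up 8-byte buffer: four characters, two bytes each, no BOM.
        byte[] chunk = "Visi".getBytes(StandardCharsets.UTF_16LE);

        // In-range read succeeds.
        System.out.println(decodeUtf16Le(chunk, 0, 4));

        // Reading from offset 8 of an 8-byte buffer fails the bounds check.
        try {
            decodeUtf16Le(chunk, 8, 0);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If that reading is right, the chunk command in my file points at a string 
position at the very end of the chunk data, which would explain why only this 
one file fails.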
