Nick,
Thanks for the idea to try it from tika-app, having come from Solr, I was
unaware of this (great) tool. I opened a vsd file I knew was parsing fine and
it loaded fine, as expected. I then opened the problem file and finally saw the
stack trace occurring. As you suspected, it looks perhaps more like POI is the
offender.
Apache Tika was unable to parse the document
at ...\myfile.vsd.
The full exception stack trace is included below:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser@5b202f4d
at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:238)
at javax.swing.AbstractButton.fireActionPerformed(Unknown Source)
at javax.swing.AbstractButton$Handler.actionPerformed(Unknown Source)
at javax.swing.DefaultButtonModel.fireActionPerformed(Unknown Source)
at javax.swing.DefaultButtonModel.setPressed(Unknown Source)
at javax.swing.AbstractButton.doClick(Unknown Source)
at javax.swing.plaf.basic.BasicMenuItemUI.doClick(Unknown Source)
at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(Unknown
Source)
at java.awt.Component.processMouseEvent(Unknown Source)
at javax.swing.JComponent.processMouseEvent(Unknown Source)
at java.awt.Component.processEvent(Unknown Source)
at java.awt.Container.processEvent(Unknown Source)
at java.awt.Component.dispatchEventImpl(Unknown Source)
at java.awt.Container.dispatchEventImpl(Unknown Source)
at java.awt.Component.dispatchEvent(Unknown Source)
at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source)
at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
at java.awt.Container.dispatchEventImpl(Unknown Source)
at java.awt.Window.dispatchEventImpl(Unknown Source)
at java.awt.Component.dispatchEvent(Unknown Source)
at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
at java.awt.EventQueue.access$000(Unknown Source)
at java.awt.EventQueue$1.run(Unknown Source)
at java.awt.EventQueue$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.security.AccessControlContext$1.doIntersectionPrivilege(Unknown
Source)
at java.security.AccessControlContext$1.doIntersectionPrivilege(Unknown
Source)
at java.awt.EventQueue$2.run(Unknown Source)
at java.awt.EventQueue$2.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.security.AccessControlContext$1.doIntersectionPrivilege(Unknown
Source)
at java.awt.EventQueue.dispatchEvent(Unknown Source)
at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
at java.awt.EventDispatchThread.run(Unknown Source)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Illegal offset 8 (String
data is of length 8)
at org.apache.poi.util.StringUtil.getFromUnicodeLE(StringUtil.java:70)
at org.apache.poi.hdgf.chunks.Chunk.processCommands(Chunk.java:203)
at
org.apache.poi.hdgf.chunks.ChunkFactory.createChunk(ChunkFactory.java:180)
at
org.apache.poi.hdgf.streams.ChunkStream.findChunks(ChunkStream.java:59)
at
org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:93)
at
org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:100)
at
org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:100)
at org.apache.poi.hdgf.HDGFDiagram.<init>(HDGFDiagram.java:106)
at
org.apache.poi.hdgf.extractor.VisioTextExtractor.<init>(VisioTextExtractor.java:55)
at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:214)
at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:177)
at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
... 43 more