Hello,
When converting a batch of Microsoft Word documents with the command

    java -jar tika-app-1.1-SNAPSHOT.jar -v -t

I get the following exception:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@5d3ac0
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
    at org.apache.tika.cli.TikaCLI$TikaServer$1.run(TikaCLI.java:735)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 487
    at org.apache.poi.hwpf.sprm.SprmOperation.initSize(SprmOperation.java:174)
    at org.apache.poi.hwpf.sprm.SprmOperation.<init>(SprmOperation.java:80)
    at org.apache.poi.hwpf.sprm.SprmIterator.next(SprmIterator.java:48)
    at org.apache.poi.hwpf.sprm.ParagraphSprmUncompressor.uncompressPAP(ParagraphSprmUncompressor.java:67)
    at org.apache.poi.hwpf.usermodel.Paragraph.newParagraph(Paragraph.java:103)
    at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:943)
    at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:146)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:97)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:185)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 4 more
Any idea how to avoid this error?
Because these are internal business documents, I may not be able to share them,
so I would greatly appreciate a fix or a workaround.
I noticed that with 'tika-app-1.0.jar' an even greater number of files failed
to convert, so things do seem to have improved with the 1.1 snapshot.
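As a stopgap until there is a fix, one option is to convert each document in its own run, so that a single bad file is recorded and skipped instead of interrupting the batch. Below is a minimal sketch of that idea in Python, not a tested solution. It assumes `java` is on the PATH, that the jar sits in the working directory, and that the Tika CLI exits with a non-zero status when parsing fails. The names `tika_text` and `convert_all` are made up for illustration.

```python
import subprocess
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Tuple


def tika_text(path: Path, jar: str = "tika-app-1.1-SNAPSHOT.jar") -> str:
    """Run the Tika CLI on one file and return the extracted plain text.

    Assumption: the CLI exits non-zero when parsing fails, so check=True
    raises subprocess.CalledProcessError for a document that cannot be parsed.
    """
    result = subprocess.run(
        ["java", "-jar", jar, "-t", str(path)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def convert_all(
    paths: Iterable[Path],
    convert: Callable[[Path], str] = tika_text,
) -> Tuple[Dict[Path, str], List[Tuple[Path, str]]]:
    """Convert each file independently and collect failures instead of aborting."""
    texts: Dict[Path, str] = {}
    failures: List[Tuple[Path, str]] = []
    for path in paths:
        try:
            texts[path] = convert(path)
        except Exception as exc:  # one bad document should not stop the batch
            failures.append((path, f"{type(exc).__name__}: {exc}"))
    return texts, failures
```

Calling `convert_all(Path("docs").glob("*.doc"))`, where `docs` is a placeholder directory, would return the extracted text per file along with a list of the files that failed. That list would also make it easier to narrow down which documents trigger the exception.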
Regards,
/HS