On Dec 19, 2007 7:43 PM, Mats Norén <[EMAIL PROTECTED]> wrote:
> Hello,
> I've been trying to extract text from a couple of different MS-Word
> files and I'm getting mixed results.
> Almost by random (as I see it) I get this error:
> java.lang.StringIndexOutOfBoundsException: String index out of range: -21047
>     at java.lang.AbstractStringBuilder.substring(AbstractStringBuilder.java:886)
>     at java.lang.StringBuffer.substring(StringBuffer.java:417)
>     at org.apache.poi.hwpf.model.TextPiece.substring(TextPiece.java:88)
>     at org.apache.tika.parser.microsoft.WordParser.extractText(WordParser.java:163)
>
> Looking at the TextPiece in POI I can see that the substring method is
> called with a negative value for end
>
> public String substring(int start, int end)
> {
>     int denominator = _usesUnicode ? 2 : 1;
>
>     return ((StringBuffer)_buf).substring(start/denominator, end/denominator);
> }
>
> I just can't see why / how runEnd - currentTextStart can end up being
> a negative value.

From my reading of the code I can't see how it can be anything other
than zero or negative if/when it gets to line 163 of Tika's WordParser,
since before that it loops until runEnd is less than or equal to
currentTextEnd:

    while (runEnd > currentTextEnd) {
        ...
    }
    String str = currentPiece.substring(0, runEnd - currentTextStart);

IMO this is a Tika bug and you should file a bug report (preferably with
an attached example Word document that causes the issue):

    https://issues.apache.org/jira/browse/TIKA

Niall

> String str = currentPiece.substring(0, runEnd - currentTextStart);
>
> Any ideas?
>
> Regards Mats
>
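
For anyone wanting to reproduce the failure mode in isolation, here is a minimal sketch. It is not the actual POI or Tika code: the class SubstringBoundsDemo, both helper methods, and the sample values for runEnd and currentTextStart are made up for illustration. It shows that StringBuffer.substring rejects a negative end index, as in the stack trace, and how a range check placed before the call could report the bad offsets instead.

```java
public class SubstringBoundsDemo {

    // Hypothetical helper, not part of POI or Tika. It applies the same
    // index scaling as the quoted TextPiece.substring, but validates the
    // range before handing it to StringBuffer.substring.
    public static String checkedSubstring(StringBuffer buf, boolean usesUnicode,
                                          int start, int end) {
        int denominator = usesUnicode ? 2 : 1;
        int from = start / denominator;
        int to = end / denominator;
        if (from < 0 || to < from || to > buf.length()) {
            throw new IllegalArgumentException(
                "Bad range [" + start + ", " + end + ") for piece of length "
                + buf.length());
        }
        return buf.substring(from, to);
    }

    // Returns true when the unguarded call throws the exception type
    // seen in the reported stack trace.
    public static boolean failsUnguarded(StringBuffer buf, int start, int end) {
        try {
            buf.substring(start, end);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        StringBuffer piece = new StringBuffer("Hello, Word document");

        // Assumed values: the run ends before the current piece starts,
        // so the computed end offset is negative.
        int runEnd = 5;
        int currentTextStart = 12;

        System.out.println(failsUnguarded(piece, 0, runEnd - currentTextStart));
        System.out.println(checkedSubstring(piece, false, 0, 5));
    }
}
```

A check like this would not fix the offset calculation in WordParser, but it would make the failing run and piece offsets visible in the error message, which may be useful to include in the bug report.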