Done. It seems that the error, if it occurs at all, occurs on the last TextPiece.
On Dec 20, 2007 8:11 PM, Jukka Zitting <[EMAIL PROTECTED]> wrote:
> Hi,
>
> On Dec 20, 2007 1:54 PM, Mats Norén <[EMAIL PROTECTED]> wrote:
> > On Dec 20, 2007 11:03 AM, Jukka Zitting <[EMAIL PROTECTED]> wrote:
> > > You may want to contact the POI mailing lists, as I don't think many
> > > of us have too much experience with POI internals.
> >
> > The thing is that it's the WordParser.java that calls
> > TextPiece.substring with a negative value, so my guess is that it's
> > the algorithm itself that for some corner case does the wrong thing.
>
> I'm sorry, you're of course right. I somehow misunderstood you
> referring to a class within POI. Too much on my mind lately...
>
> As Niall already pointed out, this seems like a bug in our parser code.
>
> > It's my understanding that the text extraction in Jackrabbit is based
> > on textmining.org which is the basis for the WordParser in Tika, is
> > that correct or have I got it wrong?
>
> Yes. The code actually ended in Tika through Nutch and Lius, but
> originates from the same textmining.org codebase that also Jackrabbit
> is currently using. Unfortunately Ryan Ackley is no longer maintaining
> the code, which leaves us with few options other than embedding the
> code in Tika. It would IMHO be best if we could push all the complex
> file format logic out to separate parser libraries (preferably POI in
> this case), where there would likely be people with much better
> understanding of that specific format and the related parsing code.
> Anyway, until we get rid of the code we should try to maintain it the
> best we can, so please file a bug report for this issue. :-)
>
> BR,
>
> Jukka Zitting
>