Hi,

On Dec 20, 2007 1:54 PM, Mats Norén <[EMAIL PROTECTED]> wrote:
> On Dec 20, 2007 11:03 AM, Jukka Zitting <[EMAIL PROTECTED]> wrote:
> > You may want to contact the POI mailing lists, as I don't think many
> > of us have too much experience with POI internals.
>
> The thing is that it's the WordParser.java that calls
> TextPiece.substring with a negative value, so my guess is that it's
> the algorithm itself that for some corner case does the wrong thing.

I'm sorry, you're of course right. I somehow misunderstood you as
referring to a class within POI. Too much on my mind lately... As Niall
already pointed out, this seems like a bug in our parser code.

> It's my understanding that the text extraction in Jackrabbit is based
> on textmining.org which is the basis for the WordParser in Tika, is
> that correct or have I got it wrong?

Yes. The code actually ended up in Tika through Nutch and Lius, but it
originates from the same textmining.org codebase that Jackrabbit is also
currently using. Unfortunately Ryan Ackley is no longer maintaining the
code, which leaves us with few options other than embedding the code in
Tika.

It would IMHO be best if we could push all the complex file format logic
out to separate parser libraries (preferably POI in this case), where
there would likely be people with a much better understanding of that
specific format and the related parsing code.

Anyway, until we get rid of the code we should try to maintain it the
best we can, so please file a bug report for this issue. :-)

BR,

Jukka Zitting