Re: Problem with WordParser

Jukka Zitting Thu, 20 Dec 2007 11:11:40 -0800

Hi,

On Dec 20, 2007 1:54 PM, Mats Norén <[EMAIL PROTECTED]> wrote:
> On Dec 20, 2007 11:03 AM, Jukka Zitting <[EMAIL PROTECTED]> wrote:
> > You may want to contact the POI mailing lists, as I don't think many
> > of us have too much experience with POI internals.
>
> The thing is that it's the WordParser.java that calls
> TextPiece.substring with a negative value, so my guess is that it's
> the algorithm itself that for some corner case does the wrong thing.


I'm sorry, you're of course right. I somehow misread your message as
referring to a class within POI. Too much on my mind lately...

As Niall already pointed out, this seems like a bug in our parser code.

> It's my understanding that the text extraction in Jackrabbit is based
> on textmining.org which is the basis for the WordParser in Tika, is
> that correct or have I got it wrong?

Yes. The code actually ended up in Tika through Nutch and Lius, but
originates from the same textmining.org codebase that Jackrabbit
is also currently using. Unfortunately Ryan Ackley is no longer maintaining
the code, which leaves us with few options other than embedding the
code in Tika. It would IMHO be best if we could push all the complex
file format logic out to separate parser libraries (preferably POI in
this case), where there would likely be people with much better
understanding of that specific format and the related parsing code.
Anyway, until we get rid of the code we should try to maintain it the
best we can, so please file a bug report for this issue. :-)

BR,

Jukka Zitting
