Re: PDFTextStripper.processTextPosition

Aaron Kaplan Wed, 03 Mar 2010 05:28:40 -0800

The 1.0 API change has moved further away from a user-based API to a
functional API, which is a very bad thing to do. And that is why
there are a lot of complaints about the API now being "broken". From a
use-case point of view, the API has suffered a very serious
regression.

Your argument about how the API ought to work is well-reasoned, and I
don't take issue with it. However, you're wrong to say that there has
been a regression in pdfbox. The pdfbox API never promised that
processTextPosition() would be called once per word. It sounds like you
and others observed empirically, on particular documents, that the
callback was called once per word (or once per table cell in someone
else's case), and you incorrectly inferred that this was guaranteed.
But in fact, even with older versions of pdfbox there are documents for
which it is called with one character at a time. It depends on the
software that created the PDF.

In other words, software that expected processTextPosition to be called
once per word was always broken. Pdfbox 1.0 just makes the breakage
apparent on a wider range of documents.

You can certainly request an improvement to make it work the way you
previously thought it worked. But the correct implementation of that
feature would be to calculate the average inter-character spacing, and
infer a word break when a spacing significantly larger than the average
is observed. That's not what pdfbox 0.8 did.
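
To make the spacing heuristic concrete, here is a rough sketch in plain
Java. It does not use the pdfbox API and is not how any pdfbox version
works: the Glyph class, the splitWords method, and the "factor"
threshold are all hypothetical names invented for illustration. It
assumes the glyphs of a single line are already sorted left to right.

```java
import java.util.ArrayList;
import java.util.List;

public class WordBreakSketch {

    /** Hypothetical stand-in for one positioned glyph; not a pdfbox class. */
    public static final class Glyph {
        final String text;
        final double x;      // left edge of the glyph
        final double width;  // horizontal extent of the glyph

        public Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Groups the glyphs of one line into words. A word break is inferred
     * wherever the gap to the next glyph exceeds factor * (average gap).
     * The value of factor is an assumption to be tuned, not a known constant.
     */
    public static List<String> splitWords(List<Glyph> line, double factor) {
        List<String> words = new ArrayList<>();
        if (line.isEmpty()) {
            return words;
        }

        // Gap between the right edge of each glyph and the left edge of the next.
        double[] gaps = new double[line.size() - 1];
        double sum = 0;
        for (int i = 0; i < gaps.length; i++) {
            Glyph a = line.get(i);
            Glyph b = line.get(i + 1);
            gaps[i] = Math.max(0, b.x - (a.x + a.width));
            sum += gaps[i];
        }
        double average = gaps.length == 0 ? 0 : sum / gaps.length;

        StringBuilder current = new StringBuilder(line.get(0).text);
        for (int i = 0; i < gaps.length; i++) {
            // Start a new word when this gap is well above the average.
            if (average > 0 && gaps[i] > factor * average) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(line.get(i + 1).text);
        }
        words.add(current.toString());
        return words;
    }
}
```

Because the average here includes the word gaps themselves, a line made
up entirely of one-letter words would not be split. A real
implementation would probably need a more robust baseline, such as the
median gap or a fraction of the font's space width.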


-Aaron
