Thanks for the clarification. So, if there is no good way to do word
grouping, then what good would text extraction be at all if the output
were just a stream of nonblank characters? Also, Google
seems to be able to extract words from PDFs properly for search
indexes. So, there must be some fairly robust method of word
grouping possible.

While each pdfbox client could write a method to
implement word grouping, the implementations would likely vary
greatly in quality. It would be nice if someone very familiar with PDF
formats wrote an implementation with the documented caveat that
it is error-prone. It would probably be far less error-prone than
one written by me.

-George


----- Original Message ----
From: "[email protected]" <[email protected]>
To: [email protected]
Sent: Wed, March 3, 2010 10:21:41 AM
Subject: Re: PDFTextStripper.processTextPosition

>> The 1.0 API change has moved further away from a user-based API to a
>> functional API, which is a very bad thing to do. And that is why
>> there are a lot of complaints about the API now being "broken". From a
>> use-case point of view, the API has suffered a very serious
>> regression.
>
>Your argument about how the API ought to work is well-reasoned, and I 
>don't take issue with it.  However, you're wrong to say that there has 
>been a regression in pdfbox.  The pdfbox API never promised that 
>processTextPosition() would be called once per word.  It sounds like you 
>and others observed empirically, on particular documents, that the 
>callback was called once per word (or once per table cell in someone 
>else's case), and you incorrectly inferred that this was guaranteed. 
>But in fact, even with older versions of pdfbox there are documents for 
>which it is called with one character at a time.  It depends on the 
>software that created the PDF.
>
>In other words, software that expected processTextPosition to be called 
>once per word was always broken.  Pdfbox 1.0 just makes the breakage 
>apparent on a wider range of documents.
>
>You can certainly request an improvement to make it work the way you 
>previously thought it worked.  But the correct implementation of that 
>feature would be to calculate the average inter-character spacing, and 
>infer a word break when a spacing significantly larger than the average 
>is observed.  That's not what pdfbox 0.8 did.
>
>-Aaron
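
For illustration, the spacing heuristic Aaron describes (average the
inter-character spacing, then infer a word break wherever a gap is
significantly larger than that average) could be sketched roughly as
below.  This is only a sketch under assumptions: WordGrouper, Glyph and
the "factor" parameter are made-up names, not part of the pdfbox API,
and the glyphs are assumed to be already sorted left to right on a
single line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names exist in pdfbox.
class WordGrouper {

    /** One positioned character on a single text line (illustrative type). */
    static final class Glyph {
        final String text;
        final double x;      // left edge of the glyph
        final double width;  // horizontal extent of the glyph

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** Horizontal gap between the right edge of a and the left edge of b. */
    static double gap(Glyph a, Glyph b) {
        return b.x - (a.x + a.width);
    }

    /**
     * Groups glyphs into words. A word break is inferred when a gap exceeds
     * factor times the average gap on the line ("factor" is an assumed knob).
     */
    static List<String> groupWords(List<Glyph> glyphs, double factor) {
        List<String> words = new ArrayList<>();
        if (glyphs.isEmpty()) {
            return words;
        }

        // Pass 1: average spacing between consecutive glyphs.
        double total = 0;
        for (int i = 1; i < glyphs.size(); i++) {
            total += gap(glyphs.get(i - 1), glyphs.get(i));
        }
        double average = glyphs.size() > 1 ? total / (glyphs.size() - 1) : 0;
        // Clamp at zero so that negative (kerned) averages do not split everything.
        double threshold = factor * Math.max(average, 0);

        // Pass 2: start a new word wherever the gap exceeds the threshold.
        StringBuilder current = new StringBuilder(glyphs.get(0).text);
        for (int i = 1; i < glyphs.size(); i++) {
            if (gap(glyphs.get(i - 1), glyphs.get(i)) > threshold) {
                words.add(current.toString());
                current = new StringBuilder();
            }
            current.append(glyphs.get(i).text);
        }
        words.add(current.toString());
        return words;
    }
}
```

A per-line average is crude: a line with only one or two gaps, or
with justified text, will probably need a different statistic or a
minimum threshold.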

I agree it really comes down to how the PDF was created.  I have run into some 
PDFs where the text was constructed as one block, "Uniform Residential 
Appraisal Report"; I'll call that the "good way".  Others are 
constructed as multiple blocks like "Uniform Res", "idential Ap", 
"praisa", "l Report"; I'll call that the "bad way".  With PDFs that are 
constructed the good way, I can easily pull data out and have high 
confidence the data is correct.  With PDFs that are constructed the bad 
way, I have to "glue" the blocks together by looking at the spacing between 
words.  Unfortunately these bad PDFs don't always have correct spacing, and 
when I glue I will sometimes over-select and join blocks that shouldn't be 
joined.  Forcing everything down to the character level throws away 
intrinsic data, making every PDF a bad PDF.
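
To make the "glue" step concrete, here is a rough sketch of the kind of
thing I mean.  It is not my production code and not anything from
pdfbox: FragmentGluer, Fragment and maxGap are hypothetical names, and
it assumes the blocks are already sorted left to right on one line.
Adjacent blocks are joined whenever the horizontal gap between them is
at most maxGap.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names exist in pdfbox.
class FragmentGluer {

    /** One text block as emitted by the PDF producer (illustrative type). */
    static final class Fragment {
        final String text;
        final double x;      // left edge of the block
        final double width;  // horizontal extent of the block

        Fragment(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** Joins neighbouring fragments whose horizontal gap is at most maxGap. */
    static List<String> glue(List<Fragment> fragments, double maxGap) {
        List<String> runs = new ArrayList<>();
        if (fragments.isEmpty()) {
            return runs;
        }
        StringBuilder current = new StringBuilder(fragments.get(0).text);
        double rightEdge = fragments.get(0).x + fragments.get(0).width;
        for (int i = 1; i < fragments.size(); i++) {
            Fragment f = fragments.get(i);
            if (f.x - rightEdge > maxGap) {
                // Gap too wide: close the current run and start a new one.
                runs.add(current.toString());
                current = new StringBuilder();
            }
            current.append(f.text);
            rightEdge = f.x + f.width;
        }
        runs.add(current.toString());
        return runs;
    }
}
```

The over-selection problem shows up directly in maxGap: set it too
high and unrelated blocks get joined, set it too low and a phrase
stays split.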

So yes, returning words instead of characters isn't going to be a magic 
bullet, but it does make some PDFs a lot easier to deal with.  The remainder 
of the PDFs can be sent to a person for a quick review to check for any 
misjoins.
Andrew

PS.  How many people are working on CDD?
