Re: PDFTextStripper.processTextPosition

Thomas Fischer Thu, 04 Mar 2010 03:55:03 -0800

Hello,

Actually, it seems that PDFBox 1.0.1 does a much better job of separating words 
than any previous version (I admit I only checked 0.8 and 0.7.3).
I base this on the text output of org.pdfbox.ExtractText, though. 
Still, if the spacing of the words in the extracted text is correct, there should 
be a way for PDFTextStripper to get it right as well.
Or am I mistaken?


Thomas Fischer


On 04.03.2010 at 01:21, George Van Treeck wrote:

> Thanks for the clarification. So, if there is no good way to do word
> grouping, then what good would text extraction be at all if the output 
> were just a stream of nonblank characters? Also, Google
> seems to be able to extract words from PDFs properly for search
> indexes. So, there must be some fairly robust method of word
> grouping possible.
> 
> While each pdfbox client could write a method to
> implement word grouping, the implementations would likely vary
> greatly in quality. It would be nice if someone very familiar with PDF
> formats wrote an implementation with the documented caveat that
> it is error-prone. It would probably be far less error-prone than
> one written by me.
> 
> -George
> 
> 
> ----- Original Message ----
> From: "[email protected]" <[email protected]>
> To: [email protected]
> Sent: Wed, March 3, 2010 10:21:41 AM
> Subject: Re: PDFTextStripper.processTextPosition
> 
>>> The 1.0 API change has moved further away from a user-based API to a
>>> functional API, which is a very bad thing to do. And that is why
>>> there are a lot of complaints about the API now being "broken". From a
>>> use-case point of view, the API has suffered a very serious
>>> regression.
>> 
>> Your argument about how the API ought to work is well-reasoned, and I 
>> don't take issue with it.  However, you're wrong to say that there has 
>> been a regression in pdfbox.  The pdfbox API never promised that 
>> processTextPosition() would be called once per word.  It sounds like you 
>> and others observed empirically, on particular documents, that the 
>> callback was called once per word (or once per table cell in someone 
>> else's case), and you incorrectly inferred that this was guaranteed. 
>> But in fact, even with older versions of pdfbox there are documents for 
>> which it is called with one character at a time.  It depends on the 
>> software that created the PDF.
>> 
>> In other words, software that expected processTextPosition to be called 
>> once per word was always broken.  Pdfbox 1.0 just makes the breakage 
>> apparent on a wider range of documents.
>> 
>> You can certainly request an improvement to make it work the way you 
>> previously thought it worked.  But the correct implementation of that 
>> feature would be to calculate the average inter-character spacing, and 
>> infer a word break when a spacing significantly larger than the average 
>> is observed.  That's not what pdfbox 0.8 did.
>> 
>> -Aaron
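
The spacing heuristic described in the quoted paragraph above (compute the average inter-character spacing, then infer a word break at any gap significantly larger than it) could be sketched roughly as follows. This is only an illustration under assumptions: `Glyph`, `WordBreakSketch`, `groupWords` and the `factor` parameter are hypothetical names, not part of the pdfbox API, and the input is assumed to be the glyphs of a single line, already sorted left to right.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical glyph record (not a pdfbox class): one character and its horizontal extent. */
final class Glyph {
    final char ch;
    final float x;      // left edge of the glyph
    final float width;  // horizontal width of the glyph

    Glyph(char ch, float x, float width) {
        this.ch = ch;
        this.x = x;
        this.width = width;
    }
}

final class WordBreakSketch {

    /** Horizontal gap between two consecutive glyphs; overlaps (kerning) count as zero. */
    static float gap(Glyph left, Glyph right) {
        return Math.max(0f, right.x - (left.x + left.width));
    }

    /**
     * Splits one line of glyphs into words. A word break is inferred wherever
     * the gap to the previous glyph exceeds factor * (average gap of the line).
     * The factor is an assumed tuning knob and should be greater than 1.
     */
    static List<String> groupWords(List<Glyph> line, float factor) {
        List<String> words = new ArrayList<>();
        if (line.isEmpty()) {
            return words;
        }

        // Pass 1: average gap between consecutive glyphs on this line.
        float totalGap = 0f;
        for (int i = 1; i < line.size(); i++) {
            totalGap += gap(line.get(i - 1), line.get(i));
        }
        float averageGap = line.size() > 1 ? totalGap / (line.size() - 1) : 0f;
        float threshold = averageGap * factor;

        // Pass 2: start a new word at every gap larger than the threshold.
        StringBuilder word = new StringBuilder();
        word.append(line.get(0).ch);
        for (int i = 1; i < line.size(); i++) {
            if (gap(line.get(i - 1), line.get(i)) > threshold) {
                words.add(word.toString());
                word.setLength(0);
            }
            word.append(line.get(i).ch);
        }
        words.add(word.toString());
        return words;
    }
}
```

Because the average is taken per line, a line containing a single word with uniform spacing yields no break at all. A line with unusual spacing can still be split wrongly, so the result remains a guess.
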
> 
> I agree it is really up to how the PDF was created.  I have run into some 
> PDFs where the text was constructed as one block, "Uniform Residential 
> Appraisal Report"; I'll call that the "good way".  Others are 
> constructed as multiple blocks like "Uniform Res", "idential Ap", 
> "praisa", "l Report"; I'll call that the "bad way".  The PDFs that are 
> constructed the good way, I can easily pull data out and have a high 
> confidence the data is correct.  The PDFs that are constructed the bad 
> way, I have to "glue" the blocks together looking at the spacing between 
> words.  Unfortunately, these bad PDFs don't always have correct spacing, and 
> when I glue I will sometimes over-select and join blocks that shouldn't be 
> joined.  Forcing everything down to the character level throws away 
> intrinsic data, making every PDF a bad PDF.
> 
> So yes, returning words instead of characters isn't going to be a magic 
> bullet, but it does make some PDFs a lot easier to deal with.  The remainder 
> of the PDFs can be sent to a person for a quick review to check for any 
> misjoins.
> Andrew
> 
> PS.  How many people are working on CDD?
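
The "gluing" of blocks that Andrew describes in the quoted message above might look something like the sketch below. It is only a sketch under assumptions: `TextBlock`, `BlockGlueSketch`, `glue` and `maxGap` are hypothetical names, not pdfbox classes, and the blocks are assumed to lie on one line and to arrive sorted left to right.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical text block (not a pdfbox class): a run of text and its horizontal extent. */
final class TextBlock {
    final String text;
    final float x;      // left edge of the block
    final float width;  // total width of the block

    TextBlock(String text, float x, float width) {
        this.text = text;
        this.x = x;
        this.width = width;
    }

    float right() {
        return x + width;
    }
}

final class BlockGlueSketch {

    /**
     * Joins consecutive blocks whose horizontal gap is at most maxGap and
     * starts a new output string whenever the gap is larger.
     * maxGap is an assumed tuning value in the same units as x and width.
     */
    static List<String> glue(List<TextBlock> blocks, float maxGap) {
        List<String> joined = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        TextBlock previous = null;

        for (TextBlock block : blocks) {
            if (previous != null && block.x - previous.right() > maxGap) {
                // Gap too large: close the current run and start a new one.
                joined.add(current.toString());
                current.setLength(0);
            }
            current.append(block.text);
            previous = block;
        }
        if (previous != null) {
            joined.add(current.toString());
        }
        return joined;
    }
}
```

With a small `maxGap`, the four fragments "Uniform Res", "idential Ap", "praisa" and "l Report" placed edge to edge come back as one string, while a block far to the right stays separate. When the spacing in the PDF is wrong, a fixed threshold like this will over-join exactly as described above, so a manual review step is still needed.
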
