Hello,

actually, it seems that PDFBox 1.0.1 does a much better job of separating words than any previous version (I admit I only checked 0.8 and 0.7.3). I base this only on the text output of org.pdfbox.ExtractText, though. But if the spacing of the words in the extracted text is correct, there should be a way for TextStripper to get it right as well. Or am I mistaken?
Thomas Fischer

On 04.03.2010 at 01:21, George Van Treeck wrote:

> Thanks for the clarification. So, if there is no good way to do word
> grouping, then what good would text extraction be at all if the output
> were just a stream of nonblank characters? Also, Google
> seems to be able to extract words from PDFs properly for search
> indexes. So, there must be some fairly robust method of word
> grouping possible.
>
> While each pdfbox client could write a method to
> implement word grouping, the implementations would likely vary
> greatly in quality. It would be nice if someone very familiar with PDF
> formats wrote an implementation with the documented caveat that
> it is error-prone. It would probably be far less error-prone than
> one written by me.
>
> -George
>
>
> ----- Original Message ----
> From: "[email protected]" <[email protected]>
> To: [email protected]
> Sent: Wed, March 3, 2010 10:21:41 AM
> Subject: Re: PDFTextStripper.processTextPosition
>
>>> The 1.0 API change has moved further away from a user-based API to a
>>> functional API, which is a very bad thing to do. And that is why
>>> there are a lot of complaints about the API now being "broken". From a
>>> use-case point of view, the API has suffered a very serious
>>> regression.
>>
>> Your argument about how the API ought to work is well-reasoned, and I
>> don't take issue with it. However, you're wrong to say that there has
>> been a regression in pdfbox. The pdfbox API never promised that
>> processTextPosition() would be called once per word. It sounds like you
>> and others observed empirically, on particular documents, that the
>> callback was called once per word (or once per table cell in someone
>> else's case), and you incorrectly inferred that this was guaranteed.
>> But in fact, even with older versions of pdfbox there are documents for
>> which it is called with one character at a time. It depends on the
>> software that created the PDF.
>>
>> In other words, software that expected processTextPosition to be called
>> once per word was always broken. Pdfbox 1.0 just makes the breakage
>> apparent on a wider range of documents.
>>
>> You can certainly request an improvement to make it work the way you
>> previously thought it worked. But the correct implementation of that
>> feature would be to calculate the average inter-character spacing, and
>> infer a word break when a spacing significantly larger than the average
>> is observed. That's not what pdfbox 0.8 did.
>>
>> -Aaron
>
> I agree it is really up to how the PDF was created. I have run into some
> PDFs where the text was constructed as one block, "Uniform Residential
> Appraisal Report", I'll call that the "good way", and others that are
> constructed as multiple blocks like "Uniform Res", "idential Ap",
> "praisa", "l Report", and I'll call that the bad way. The PDFs that are
> constructed the good way, I can easily pull data out of and have high
> confidence the data is correct. The PDFs that are constructed the bad
> way, I have to "glue" the blocks together by looking at the spacing between
> words. Unfortunately these bad PDFs don't always have correct spacing, and
> when I glue I will sometimes over-select and join blocks that shouldn't be
> joined. Forcing everything down to a character level throws away
> intrinsic data, making every PDF a bad PDF.
>
> So yes, returning words instead of characters isn't going to be a magic
> bullet, but it does make some PDFs a lot easier to deal with. The remainder
> of the PDFs can be sent to a person for quick review to check for any
> mis-joins.
> Andrew
>
> PS. How many people are working on CDD?
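
For reference, the approach Aaron describes in the quoted text (compute the average inter-character spacing, then infer a word break wherever a gap is significantly larger than that average) could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not how pdfbox does it: WordGrouper, Glyph and the factor parameter are made-up names for illustration, the glyphs are assumed to belong to one text line and to be sorted left to right, and as George notes, any such heuristic is error-prone.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: groups positioned characters into words.
class WordGrouper {

    /** Hypothetical stand-in for one positioned character (not a PDFBox class). */
    static final class Glyph {
        final char ch;       // the character itself
        final double x;      // left edge on the line
        final double width;  // advance width of the character

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Splits one line of glyphs (assumed sorted by x) into words.
     * A word break is inferred where the gap to the next glyph exceeds
     * factor times the average gap on the line. The factor is an
     * assumption to be tuned per document, e.g. 1.5.
     */
    static List<String> groupWords(List<Glyph> line, double factor) {
        List<String> words = new ArrayList<>();
        if (line.isEmpty()) {
            return words;
        }

        // Gap between the right edge of each glyph and the left edge of the next.
        // Negative gaps (overlap, kerning) are clamped to zero.
        double[] gaps = new double[line.size() - 1];
        double sum = 0.0;
        for (int i = 0; i < gaps.length; i++) {
            Glyph a = line.get(i);
            Glyph b = line.get(i + 1);
            gaps[i] = Math.max(0.0, b.x - (a.x + a.width));
            sum += gaps[i];
        }
        double threshold = gaps.length == 0 ? 0.0 : factor * sum / gaps.length;

        StringBuilder current = new StringBuilder();
        current.append(line.get(0).ch);
        for (int i = 0; i < gaps.length; i++) {
            if (gaps[i] > threshold) {
                // Gap is significantly larger than average: close the current word.
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(line.get(i + 1).ch);
        }
        words.add(current.toString());
        return words;
    }
}
```

Because the average here also includes the word gaps themselves, a line with very tight character spacing and only tiny kerning differences can still be split wrongly; a real implementation would probably also compare the gap against the glyph widths.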

