On 03/04/2010 01:21 AM, George Van Treeck wrote:
So, if there is no good way to do word grouping, then what good would
text extraction be at all if the output were just a stream of
nonblank characters?

Take a look at what org.apache.pdfbox.ExtractText generates for your
documents.  Is it just a stream of nonblank characters, or does it
recover the correct word boundaries?

Rereading my previous message, I see that I said something confusing:

the correct implementation of that feature would be to calculate
the average inter-character spacing, and infer a word break when a
spacing significantly larger than the average is observed.  That's
not what pdfbox 0.8 did.

What I meant was that pdfbox 0.8 didn't take spacing into account when
deciding how much text to pass to processTextPosition.  It actually
*does* take spacing into account later in the process, namely in
PDFTextStripper.writePage().
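
To make the spacing heuristic concrete, here is a minimal sketch of the
idea in plain Java.  It is not the code in PDFTextStripper.writePage();
the GapWordSplitter and Glyph names, and the "factor" threshold, are
made up for illustration.  It assumes the glyphs of one line arrive in
left-to-right order with a known x position and width.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of PDFBox. */
final class GapWordSplitter {

    /** One character with its horizontal position and width. */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /** Horizontal space between the end of prev and the start of cur. */
    private static double gap(Glyph prev, Glyph cur) {
        return Math.max(0.0, cur.x - (prev.x + prev.width));
    }

    /**
     * Splits one line of glyphs into words, inferring a word break
     * wherever a gap exceeds factor times the average gap of the line.
     */
    static List<String> split(List<Glyph> line, double factor) {
        List<String> words = new ArrayList<>();
        if (line.isEmpty()) {
            return words;
        }

        // First pass: average inter-character gap over the whole line.
        double total = 0.0;
        for (int i = 1; i < line.size(); i++) {
            total += gap(line.get(i - 1), line.get(i));
        }
        double average = line.size() > 1 ? total / (line.size() - 1) : 0.0;

        // Second pass: start a new word at every unusually large gap.
        StringBuilder word = new StringBuilder();
        word.append(line.get(0).ch);
        for (int i = 1; i < line.size(); i++) {
            if (gap(line.get(i - 1), line.get(i)) > factor * average) {
                words.add(word.toString());
                word.setLength(0);
            }
            word.append(line.get(i).ch);
        }
        words.add(word.toString());
        return words;
    }
}
```

Note that the average here includes the word gaps themselves, so on
very short lines the threshold can end up too high; a real
implementation would need something more robust.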

I don't know why you chose processTextPosition as the place to add your
application-specific functionality, but if you don't need to know the
x,y coordinates of the characters then it's probably not the right
choice.  I would suggest that you try overriding writeString instead.
The string it receives has word separator characters (as defined by
getWordSeparator()) between words.
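
As a rough sketch of what such an override could do with the string it
is handed, the plain-Java helper below splits text on a given separator
and drops empty pieces.  SeparatedWords is a made-up name, not a PDFBox
class; in a real override you would pass in the value returned by
getWordSeparator(), and you should check the exact writeString
signature in the PDFBox version you are using.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of PDFBox. */
final class SeparatedWords {

    /**
     * Returns the non-empty words in text, where words are delimited by
     * the literal wordSeparator string (not treated as a regex).
     */
    static List<String> words(String text, String wordSeparator) {
        List<String> out = new ArrayList<>();
        if (wordSeparator.isEmpty()) {
            // No separator configured: treat the whole string as one word.
            if (!text.isEmpty()) {
                out.add(text);
            }
            return out;
        }
        int start = 0;
        while (true) {
            int end = text.indexOf(wordSeparator, start);
            String word = end < 0 ? text.substring(start)
                                  : text.substring(start, end);
            if (!word.isEmpty()) {
                out.add(word);
            }
            if (end < 0) {
                break;
            }
            start = end + wordSeparator.length();
        }
        return out;
    }
}
```

An overriding writeString could call
SeparatedWords.words(text, getWordSeparator()) and hand each word to
the application-specific code.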

-Aaron
