On 03/04/2010 01:21 AM, George Van Treeck wrote:
> So, if there is no good way to do word grouping, then what good would
> text extraction be at all if the output were just a stream of
> nonblank characters?
Take a look at what org.apache.pdfbox.ExtractText generates for your
documents. Is it just a stream of nonblank characters, or does it
recover the correct word boundaries?
Rereading my previous message, I see that I said something confusing:
> the correct implementation of that feature would be to calculate
> the average inter-character spacing, and infer a word break when a
> spacing significantly larger than the average is observed. That's
> not what pdfbox 0.8 did.
What I meant was that it didn't take spacing into account when deciding
how much text to pass to processTextPosition. It actually *does* take
spacing into account later on in the process, namely in
PDFTextStripper.writePage().
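
To make the spacing heuristic concrete, here is a minimal standalone
sketch of the "gap larger than average" idea. It is not the code in
PDFTextStripper.writePage(); the Glyph class, groupIntoWords and the
factor parameter are hypothetical names invented for illustration, and
the glyph positions are assumed to be already sorted left to right on
one line.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: NOT pdfbox code. All names here are hypothetical.
class WordBreakSketch {

    /** One glyph on a line: the character, its left edge, and its width. */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Splits one line of glyphs into words. A word break is inferred when
     * the gap before a glyph exceeds factor times the average gap on the line.
     */
    static List<String> groupIntoWords(List<Glyph> line, double factor) {
        List<String> words = new ArrayList<>();
        if (line.isEmpty()) {
            return words;
        }

        // First pass: average gap between the right edge of one glyph
        // and the left edge of the next (negative gaps count as zero).
        double total = 0;
        for (int i = 1; i < line.size(); i++) {
            Glyph prev = line.get(i - 1);
            total += Math.max(0, line.get(i).x - (prev.x + prev.width));
        }
        double average = line.size() > 1 ? total / (line.size() - 1) : 0;

        // Second pass: start a new word whenever a gap is well above average.
        StringBuilder current = new StringBuilder();
        current.append(line.get(0).ch);
        for (int i = 1; i < line.size(); i++) {
            Glyph prev = line.get(i - 1);
            double gap = Math.max(0, line.get(i).x - (prev.x + prev.width));
            if (average > 0 && gap > factor * average) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(line.get(i).ch);
        }
        words.add(current.toString());
        return words;
    }
}
```

Note that the average here includes the word gaps themselves, so the
factor would need tuning on real documents; a median or a fraction of
the font's space width might be a more robust baseline.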
I don't know why you chose processTextPosition as the place to add your
application-specific functionality, but if you don't need to know the
x,y coordinates of the characters then it's probably not the right
choice. I would suggest that you try overriding writeString instead.
The string it receives has word separator characters (as defined by
getWordSeparator()) between words.
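
As a rough sketch of what you could then do with that string, the helper
below splits it back into words on the separator. It is plain Java with
no pdfbox dependency; WordSplitSketch and splitWords are hypothetical
names, and the idea is that you would call something like it from your
writeString override, passing in whatever getWordSeparator() returns
(I believe that is a single space by default, but check your version).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, not part of pdfbox. Names are hypothetical.
class WordSplitSketch {

    /**
     * Splits text on a literal separator string and drops empty pieces,
     * so consecutive separators do not produce empty words.
     */
    static List<String> splitWords(String text, String wordSeparator) {
        List<String> words = new ArrayList<>();
        if (wordSeparator.isEmpty()) {
            // No separator configured: treat the whole string as one word.
            if (!text.isEmpty()) {
                words.add(text);
            }
            return words;
        }
        int start = 0;
        int idx;
        while ((idx = text.indexOf(wordSeparator, start)) >= 0) {
            if (idx > start) {
                words.add(text.substring(start, idx));
            }
            start = idx + wordSeparator.length();
        }
        if (start < text.length()) {
            words.add(text.substring(start));
        }
        return words;
    }
}
```

Using indexOf rather than String.split avoids treating the separator as
a regular expression, which matters if you set it to something other
than a space.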
-Aaron