If I understand this correctly, the 0.8 API invoked
processTextPosition() either once-per-word or at some other frequency,
but often enough to create the unadvertised expectation that it would be
called once-per-word. The change to the 1.0 API brought this unfortunate
misunderstanding to a head, because processTextPosition() is now called
once-per-character. If processTextPosition() was ever advertised as
once-per-word, then it isn't cricket to change its semantics in a later
version.
If this were a simple matter, one could create a facade function that
gathers up multiple processTextPosition() calls, one letter at a time,
and invoke a new function once-per-word. However, this does not seem to
be a simple matter. It appears that the "is this a complete word"
decision needs to be taken upon consideration of the placement of the
next letter with respect to the last letter(s). This is likely to be a
decision that cannot be taken without the context of the typefaces and
sizes of the letters involved. Or worse.
Perhaps the API could be expanded to accept a callback that settles this
question. The client programmer might register with pdfbox a callback
function that takes as its input the context of previously-grouped
letters and the current letter. This callback would return true/false
depending upon whether the latest letter should be classified as grouped
with previous letters or not. If this callback always returns false, we
get the 1.0 API behavior. If this callback can solve the problem of
grouping letters into words, we get the hoped-for behavior. (A clever
enough callback might even un-hyphenate words!)
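To make the callback idea concrete, here is a rough sketch of what I have
in mind. Nothing in it is pdfbox API: Letter, GroupingCallback and
LetterGrouper are names I made up for illustration, and Letter carries only
the left edge and width of a character, which is surely less than a real
callback would need.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the per-character data; NOT a pdfbox class.
final class Letter {
    final String text;
    final float x;      // left edge of the character
    final float width;  // advance width of the character

    Letter(String text, float x, float width) {
        this.text = text;
        this.x = x;
        this.width = width;
    }
}

// Hypothetical callback the client would register.
interface GroupingCallback {
    // Return true if 'current' belongs with the letters already in 'group'.
    boolean belongsWithPrevious(List<Letter> group, Letter current);
}

// Hypothetical facade: fed one letter at a time, emits one string per group.
final class LetterGrouper {
    private final GroupingCallback callback;
    private final List<Letter> current = new ArrayList<>();
    private final List<String> groups = new ArrayList<>();

    LetterGrouper(GroupingCallback callback) {
        this.callback = callback;
    }

    // Would be driven by the once-per-character calls.
    void accept(Letter letter) {
        if (!current.isEmpty()
                && !callback.belongsWithPrevious(
                        Collections.unmodifiableList(current), letter)) {
            flush();
        }
        current.add(letter);
    }

    // Call once after the last letter; returns every group seen so far.
    List<String> finish() {
        flush();
        return groups;
    }

    private void flush() {
        if (current.isEmpty()) {
            return;
        }
        StringBuilder sb = new StringBuilder();
        for (Letter l : current) {
            sb.append(l.text);
        }
        groups.add(sb.toString());
        current.clear();
    }
}
```

A client might pass a lambda that compares the gap between the current
letter and the end of the previous one against some threshold; the
threshold itself is exactly the hard part discussed above.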
However, letters may also be grouped into clauses, or sentences. Such
considerations seem (to me) to be outside the scope of pdfbox. Adopting
the suggestion in my previous paragraph may put us on a slippery slope of
scope creep. Perhaps a better solution is for processTextPosition() to
provide all the information about letter-placement that the putative
callback function would have, and let the letter/word/etc. grouping be
done in client code.
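As a sketch of what client-side grouping could look like if the placement
information were handed over, consider the following. GlyphInfo and
ClientWordSplitter are invented for this example and are not pdfbox
classes; the fields (position, width, width of a space in the current
font) and the half-a-space threshold are my assumptions about what a
client would want, not a statement of what pdfbox provides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical placement record for one character; NOT a pdfbox class.
final class GlyphInfo {
    final String text;
    final float x;           // left edge
    final float y;           // baseline
    final float width;       // advance width
    final float spaceWidth;  // width of a space in this font and size

    GlyphInfo(String text, float x, float y, float width, float spaceWidth) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
        this.spaceWidth = spaceWidth;
    }
}

final class ClientWordSplitter {
    // Assumed heuristic: a gap wider than half a space, or a baseline
    // change, starts a new word. Real documents may need something smarter.
    static boolean startsNewWord(GlyphInfo prev, GlyphInfo cur) {
        boolean newLine = Math.abs(cur.y - prev.y) > 0.5f;
        float gap = cur.x - (prev.x + prev.width);
        return newLine || gap > 0.5f * prev.spaceWidth;
    }

    // Groups glyphs, in the order received, into words.
    static List<String> split(List<GlyphInfo> glyphs) {
        List<String> words = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        GlyphInfo prev = null;
        for (GlyphInfo g : glyphs) {
            if (prev != null && startsNewWord(prev, g) && word.length() > 0) {
                words.add(word.toString());
                word.setLength(0);
            }
            word.append(g.text);
            prev = g;
        }
        if (word.length() > 0) {
            words.add(word.toString());
        }
        return words;
    }
}
```

The point of the sketch is only that the decision lives entirely in client
code: a client that wants clauses or sentences instead of words replaces
startsNewWord() and pdfbox need not know about it.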
Cordially,
steve