If I understand this correctly, the 0.8 API invoked processTextPosition() either once-per-word or at some other frequency, but often enough to create the unadvertised expectation that it would be called once-per-word. In the change to the 1.0 API, this unfortunate misunderstanding came to a head when processTextPosition() became a once-per-character phenomenon. If processTextPosition() was ever advertised as once-per-word, then it isn't cricket to change its semantics in a later version.

If this were a simple matter, one could create a facade function that gathers up multiple processTextPosition() calls, one letter at a time, and invoke a new function once-per-word. However, this does not seem to be a simple matter. It appears that the "is this a complete word" decision needs to be taken upon consideration of the placement of the next letter with respect to the last letter(s). This is likely to be a decision that cannot be taken without the context of the typefaces and sizes of the letters involved. Or worse.

Perhaps the API could be expanded to accept a callback that settles this question. The client programmer might register with pdfbox a callback function that takes as its input the context of previously-grouped letters and the current letter. This callback would return true if the latest letter should be grouped with the previous letters, and false if it should start a new group. If this callback always returns false, every letter forms its own group and we get the 1.0 API behavior. If this callback can solve the problem of grouping letters into words, we get the hoped-for behavior. (A clever enough callback might even un-hyphenate words!)
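
To make the callback idea concrete, here is a rough sketch in Java. Nothing in it is pdfbox API: Glyph, GroupingCallback, WordGrouper and the gap threshold are all names and assumptions I made up for illustration. The real per-character object would carry whatever placement data pdfbox actually exposes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for one positioned character; NOT a pdfbox type. */
final class Glyph {
    final String text;
    final float x;        // left edge of the glyph
    final float width;    // advance width of the glyph
    final float fontSize; // font size, in the same units as x

    Glyph(String text, float x, float width, float fontSize) {
        this.text = text;
        this.x = x;
        this.width = width;
        this.fontSize = fontSize;
    }
}

/** Hypothetical callback: true if 'next' belongs with the letters grouped so far. */
interface GroupingCallback {
    boolean belongsWith(List<Glyph> group, Glyph next);
}

/** Facade that receives one glyph at a time and emits one string per group. */
final class WordGrouper {
    private final GroupingCallback callback;
    private final List<Glyph> current = new ArrayList<>();
    private final List<String> words = new ArrayList<>();

    WordGrouper(GroupingCallback callback) {
        this.callback = callback;
    }

    /** Feed one character; intended to be driven by a once-per-character hook. */
    void accept(Glyph g) {
        boolean startsNewGroup = !current.isEmpty()
                && !callback.belongsWith(Collections.unmodifiableList(current), g);
        if (startsNewGroup) {
            flush();
        }
        current.add(g);
    }

    /** Emit any pending group and return everything collected so far. */
    List<String> finish() {
        flush();
        return words;
    }

    private void flush() {
        if (current.isEmpty()) {
            return;
        }
        StringBuilder sb = new StringBuilder();
        for (Glyph g : current) {
            sb.append(g.text);
        }
        words.add(sb.toString());
        current.clear();
    }

    /**
     * Example callback (an assumed heuristic, not a pdfbox rule): keep the
     * letter in the group when the horizontal gap to the previous letter is
     * smaller than a given fraction of that letter's font size.
     */
    static GroupingCallback gapThreshold(float fraction) {
        return (group, next) -> {
            Glyph last = group.get(group.size() - 1);
            float gap = next.x - (last.x + last.width);
            return gap < fraction * last.fontSize;
        };
    }

    /** Convenience: run a whole list of glyphs through a grouper. */
    static List<String> group(List<Glyph> glyphs, GroupingCallback cb) {
        WordGrouper grouper = new WordGrouper(cb);
        for (Glyph g : glyphs) {
            grouper.accept(g);
        }
        return grouper.finish();
    }
}
```

A client would call accept() from its per-character hook and finish() at the end of the page. The gap heuristic only looks at the last letter in the group, but the callback receives the whole group, so something smarter (baseline changes, trailing hyphens) could be plugged in without touching the facade.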

However, letters may also be grouped into clauses, or sentences. Such considerations seem (to me) to be outside the scope of pdfbox. Adopting the suggestion in my previous paragraph may put us on a slippery slope of scope creep. Perhaps a better solution is for processTextPosition() to provide all the information about letter placement that the putative callback function would have, and let the letter/word/etc. grouping be done in client code.

Cordially,

steve
