Dear Joël,

That’s a much-needed example. Thank you very much for your time and 
effort on this.

What do you think about integrating your example into the pdfbox example 
package and including it in the documentation, which still needs to be 
extended? It might also provide a base for a new helper utility.

With kind regards

Maruan


On 10.09.2014 at 00:56, Joël Kuiper <[email protected]> wrote:

> So I figured it out. Those were not a pleasant 6 hours ;-) 
> 
> I’ve subclassed the PDFTextStripper to build a cache (called textCache) that 
> maintains (per page) a mapping between the characters and the TextPositions, 
> instead of just returning the final string. 
> Using a regular expression you can then find the TextPositions in the cache 
> that match the pattern. 
> From that list of TextPositions the bounding boxes can then be calculated 
> and added to the page as PDAnnotationTextMarkup annotations. 
> 
> The code is not pretty (haven’t done Java in a while and it was a rush job) 
> but it may provide a nice starting point for more serious stuff! 
> 
> https://gist.github.com/joelkuiper/9eb52555e02edb653dcf
> 
> Hopefully this is useful to someone else as well! 
> 
> Joël
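
For readers who want the shape of the approach without opening the gist, 
here is a minimal sketch of the idea described above: keep a per-page cache 
in which character i of the text corresponds to entry i of a position list, 
run a regular expression over the text, and compute one bounding box per 
match. It is not Joël's code. To stay runnable without PDFBox on the 
classpath it uses a hypothetical Glyph class as a stand-in for TextPosition, 
and the names TextCacheSketch, PageCache and Box are likewise invented for 
illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextCacheSketch {

    /** Hypothetical stand-in for a PDFBox TextPosition: one character and its box. */
    static final class Glyph {
        final char ch;
        final float x, y, width, height;

        Glyph(char ch, float x, float y, float width, float height) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Axis-aligned bounding box around one regex match. */
    static final class Box {
        final float minX, minY, maxX, maxY;

        Box(float minX, float minY, float maxX, float maxY) {
            this.minX = minX;
            this.minY = minY;
            this.maxX = maxX;
            this.maxY = maxY;
        }

        @Override
        public String toString() {
            return "Box[" + minX + ", " + minY + ", " + maxX + ", " + maxY + "]";
        }
    }

    /** Per-page cache: text.charAt(i) was produced by glyphs.get(i). */
    static final class PageCache {
        private final StringBuilder text = new StringBuilder();
        private final List<Glyph> glyphs = new ArrayList<>();

        /** Appends one character and its position, keeping both indexes aligned. */
        void append(Glyph g) {
            text.append(g.ch);
            glyphs.add(g);
        }

        /** Returns one bounding box per non-empty match of the pattern. */
        List<Box> find(Pattern pattern) {
            List<Box> boxes = new ArrayList<>();
            Matcher m = pattern.matcher(text);
            while (m.find()) {
                if (m.start() == m.end()) {
                    continue; // an empty match covers no glyphs
                }
                boxes.add(boundingBox(glyphs.subList(m.start(), m.end())));
            }
            return boxes;
        }
    }

    /** Smallest box that encloses every glyph in the (non-empty) list. */
    static Box boundingBox(List<Glyph> gs) {
        float minX = Float.MAX_VALUE, minY = Float.MAX_VALUE;
        float maxX = -Float.MAX_VALUE, maxY = -Float.MAX_VALUE;
        for (Glyph g : gs) {
            minX = Math.min(minX, g.x);
            minY = Math.min(minY, g.y);
            maxX = Math.max(maxX, g.x + g.width);
            maxY = Math.max(maxY, g.y + g.height);
        }
        return new Box(minX, minY, maxX, maxY);
    }

    public static void main(String[] args) {
        // Made-up layout: a single line of 6pt-wide, 12pt-high glyphs starting at x = 10.
        PageCache page = new PageCache();
        String line = "Hello PDFBox";
        for (int i = 0; i < line.length(); i++) {
            page.append(new Glyph(line.charAt(i), 10f + 6f * i, 700f, 6f, 12f));
        }
        for (Box b : page.find(Pattern.compile("PDF\\w+"))) {
            System.out.println(b);
        }
    }
}
```

In a real PDFTextStripper subclass the cache would be filled from the 
TextPositions the stripper reports for each page rather than from made-up 
coordinates. Note also that this sketch produces a single box per match, so 
a match that wraps across lines would need to be split per line before the 
boxes are turned into markup annotations.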
