Dear Joël, that’s a very much needed example - thank you very much for your time and effort on this.
WDYT about integrating your example into the PDFBox examples package and including it in the (still to be extended) documentation? It might also provide a base for a new helper utility.

With kind regards

Maruan

On 10.09.2014 at 00:56, Joël Kuiper <[email protected]> wrote:

> So I figured it out. Those were not a pleasant 6 hours ;-)
>
> I’ve subclassed the PDFTextStripper to build a cache (called textCache) that
> maintains (per page) a mapping between the characters and the TextPositions,
> instead of just returning the final string.
> Using a regular expression you can then find the TextPositions in the cache
> that match the pattern.
> From that list of TextPositions the bounding boxes can then be calculated
> which can be put in as PDAnnotationTextMarkup's.
>
> The code is not pretty (haven’t done Java in a while and it was a rush job)
> but it may provide a nice starting point for more serious stuff!
>
> https://gist.github.com/joelkuiper/9eb52555e02edb653dcf
>
> Hopefully this is useful to someone else as well!
>
> Joël
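
For readers who want the gist of the approach without opening the link, here is a minimal sketch of the idea described above. It is not the code from the gist and it does not use PDFBox: Glyph, Box, HighlightSketch and findBoxes are hypothetical names, and Glyph stands in for a TextPosition under the simplifying assumption that each entry holds exactly one character. In a real PDFTextStripper subclass the per-page cache would presumably be filled while the stripper processes the page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HighlightSketch {

    /** Hypothetical stand-in for a PDFBox TextPosition: one character and its box. */
    static final class Glyph {
        final char ch;
        final float x, y, width, height;

        Glyph(char ch, float x, float y, float width, float height) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Hypothetical axis-aligned bounding box around one regex match. */
    static final class Box {
        final float minX, minY, maxX, maxY;

        Box(float minX, float minY, float maxX, float maxY) {
            this.minX = minX;
            this.minY = minY;
            this.maxX = maxX;
            this.maxY = maxY;
        }

        @Override
        public String toString() {
            return "Box[" + minX + ", " + minY + ", " + maxX + ", " + maxY + "]";
        }
    }

    /** Returns one bounding box per non-empty regex match over a page's cached glyphs. */
    static List<Box> findBoxes(List<Glyph> pageCache, Pattern pattern) {
        // Build the page text so that character i of the text is glyph i of the cache.
        StringBuilder text = new StringBuilder(pageCache.size());
        for (Glyph g : pageCache) {
            text.append(g.ch);
        }

        List<Box> boxes = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            if (m.start() == m.end()) {
                continue; // an empty match covers no glyphs
            }
            float minX = Float.POSITIVE_INFINITY, minY = Float.POSITIVE_INFINITY;
            float maxX = Float.NEGATIVE_INFINITY, maxY = Float.NEGATIVE_INFINITY;
            // The match offsets index directly into the glyph cache.
            for (Glyph g : pageCache.subList(m.start(), m.end())) {
                minX = Math.min(minX, g.x);
                minY = Math.min(minY, g.y);
                maxX = Math.max(maxX, g.x + g.width);
                maxY = Math.max(maxY, g.y + g.height);
            }
            boxes.add(new Box(minX, minY, maxX, maxY));
        }
        return boxes;
    }

    public static void main(String[] args) {
        // Fake single-line page: every glyph is 10 units wide and 12 units high.
        List<Glyph> page = new ArrayList<>();
        String line = "PDF box";
        for (int i = 0; i < line.length(); i++) {
            page.add(new Glyph(line.charAt(i), i * 10f, 100f, 10f, 12f));
        }
        for (Box b : findBoxes(page, Pattern.compile("box"))) {
            System.out.println(b);
        }
    }
}
```

Note that a match spanning several lines would collapse into one large box here; a real implementation would likely split matches per line before turning each box into the coordinates of a PDAnnotationTextMarkup.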

