So I figured it out. Those were not a pleasant 6 hours ;-) 

I’ve subclassed the PDFTextStripper to build a cache (called textCache) that 
maintains (per page) a mapping between the characters and the TextPositions, 
instead of just returning the final string. 
Using a regular expression you can then find the TextPositions in the cache 
that match the pattern. 
From that list of TextPositions the bounding boxes can then be calculated and 
added to the page as PDAnnotationTextMarkup annotations. 
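
To illustrate the idea without pulling in PDFBox, here is a minimal sketch of 
the cache, the regex lookup and the bounding-box step. It is not the code from 
the gist: Glyph is a hypothetical stand-in for TextPosition (just the character 
and its box), and PageCache, find and boundingBox are names made up for this 
example. It also assumes one cached glyph per character of the page text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: none of these types come from PDFBox.
class HighlightSketch {

    // Hypothetical stand-in for TextPosition: one character and its box.
    static final class Glyph {
        final char ch;
        final float x, y, width, height;

        Glyph(char ch, float x, float y, float width, float height) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    // Hypothetical per-page cache: glyphs.get(i) produced text.charAt(i).
    static final class PageCache {
        private final StringBuilder text = new StringBuilder();
        private final List<Glyph> glyphs = new ArrayList<>();

        // Called once per character while the page is being stripped.
        void append(Glyph g) {
            text.append(g.ch);
            glyphs.add(g);
        }

        // For each non-empty regex match, return the glyphs it covers.
        List<List<Glyph>> find(Pattern pattern) {
            List<List<Glyph>> hits = new ArrayList<>();
            Matcher m = pattern.matcher(text);
            while (m.find()) {
                if (m.end() > m.start()) {
                    hits.add(new ArrayList<>(glyphs.subList(m.start(), m.end())));
                }
            }
            return hits;
        }
    }

    // Smallest box {minX, minY, maxX, maxY} enclosing all given glyphs.
    static float[] boundingBox(List<Glyph> glyphs) {
        float minX = Float.POSITIVE_INFINITY, minY = Float.POSITIVE_INFINITY;
        float maxX = Float.NEGATIVE_INFINITY, maxY = Float.NEGATIVE_INFINITY;
        for (Glyph g : glyphs) {
            minX = Math.min(minX, g.x);
            minY = Math.min(minY, g.y);
            maxX = Math.max(maxX, g.x + g.width);
            maxY = Math.max(maxY, g.y + g.height);
        }
        return new float[] {minX, minY, maxX, maxY};
    }
}
```

In a real PDFTextStripper subclass the glyphs would be the TextPositions 
collected while the page is processed, and each box would still have to be 
converted to page coordinates and quad points before it goes into the 
annotation. A match that wraps across lines probably wants one box per line 
rather than a single box around everything. 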

The code is not pretty (haven’t done Java in a while and it was a rush job) but 
it may provide a nice starting point for more serious stuff! 

https://gist.github.com/joelkuiper/9eb52555e02edb653dcf

Hopefully this is useful to someone else as well! 

Joël
