So I figured it out. Those were not a pleasant 6 hours ;-) I subclassed PDFTextStripper to build a cache (called textCache) that maintains, per page, a mapping between the characters and their TextPositions, instead of just returning the final string. Using a regular expression, you can then find the TextPositions in the cache that match the pattern. From that list of TextPositions you can calculate the bounding boxes, which can then be added as PDAnnotationTextMarkup annotations.
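To make the idea concrete, here is a stripped-down sketch of the cache-and-match step (not my actual code, which is linked below). It deliberately avoids PDFBox so it runs on its own: `Glyph` is a hypothetical stand-in for `TextPosition`, and the names `TextCacheSketch`, `append`, `appendSeparator` and `find` are made up for illustration. The assumption is that every character of the extracted text gets exactly one cache entry, with `null` for separators that the stripper inserts without a position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: a per-page cache that keeps text and glyph positions index-aligned.
class TextCacheSketch {

    /** Hypothetical stand-in for PDFBox's TextPosition: one character and its rectangle. */
    static final class Glyph {
        final char ch;
        final float x, y, width, height;

        Glyph(char ch, float x, float y, float width, float height) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    private final StringBuilder text = new StringBuilder();
    private final List<Glyph> glyphs = new ArrayList<>();

    /** Record a character that has a position on the page. */
    void append(Glyph g) {
        text.append(g.ch);
        glyphs.add(g);
    }

    /** Record a separator (space, newline) that has no position of its own. */
    void appendSeparator(char c) {
        text.append(c);
        glyphs.add(null); // keeps text index i aligned with glyphs.get(i)
    }

    /**
     * Returns one box per regex match as {minX, minY, maxX, maxY},
     * computed over the positioned glyphs inside the match.
     */
    List<float[]> find(Pattern pattern) {
        List<float[]> boxes = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            float minX = Float.POSITIVE_INFINITY, minY = Float.POSITIVE_INFINITY;
            float maxX = Float.NEGATIVE_INFINITY, maxY = Float.NEGATIVE_INFINITY;
            boolean any = false;
            for (int i = m.start(); i < m.end(); i++) {
                Glyph g = glyphs.get(i);
                if (g == null) {
                    continue; // separator: nothing to measure
                }
                any = true;
                minX = Math.min(minX, g.x);
                minY = Math.min(minY, g.y);
                maxX = Math.max(maxX, g.x + g.width);
                maxY = Math.max(maxY, g.y + g.height);
            }
            if (any) {
                boxes.add(new float[] {minX, minY, maxX, maxY});
            }
        }
        return boxes;
    }
}
```

In a real stripper subclass the glyphs would come from the TextPosition objects PDFBox hands you during extraction, and each box would still have to be converted into whatever the annotation expects. Note that this sketch produces a single box per match, so a match that wraps across lines gets one oversized rectangle; splitting per line is left out here.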
The code in the gist is not pretty (I haven’t done Java in a while and it was a rush job), but it may provide a nice starting point for more serious stuff: https://gist.github.com/joelkuiper/9eb52555e02edb653dcf Hopefully this is useful to someone else as well! Joël

