Hello there,

> To get it right one would have to use a general replacement of non-combining
> into combining diacritics (and probably a normalisation process for unicode
> to replace combinations by single characters). By the way, you might also
> have to look out for ligatures (e.g. ff ffi fi fl).

Whether you need to post-process the text depends on the class you use for the job.

org.apache.pdfbox.util.PDFTextStripper does it for you: all text is filtered through org.apache.pdfbox.util.TextNormalize#normalizeDiac(String) and #normalizePres(String) before it is exposed to the application programmer via methods like PDFTextStripper#writeString(String). Bear in mind, however, that TextNormalize relies on the external ICU4J library; if it is not properly installed, the original string is returned unchanged.

Other classes, such as org.apache.pdfbox.pdfviewer.PageDrawer, do not do it for you. For example, when overriding PageDrawer#processTextPosition(TextPosition) to capture the text before it is painted, you must filter it through TextNormalize yourself to get the "correct" characters.

VR
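
P.S. To illustrate what this kind of normalisation amounts to, here is a minimal sketch that uses only the JDK's java.text.Normalizer. It is a stand-in, not PDFBox's TextNormalize implementation: the class name NormalizeSketch, the helper toCombining and its small mapping table are all hypothetical, and the rule "a spacing diacritic directly after a letter belongs to that letter" is an assumption that real PDF text will not always satisfy.

```java
import java.text.Normalizer;

final class NormalizeSketch {

    // Hypothetical helper: maps a few spacing (non-combining) diacritics
    // to their combining counterparts. The table is deliberately tiny.
    static char toCombining(char c) {
        switch (c) {
            case '\u00B4': return '\u0301'; // U+00B4 acute accent -> combining acute
            case '\u00A8': return '\u0308'; // U+00A8 diaeresis -> combining diaeresis
            case '\u02C6': return '\u0302'; // U+02C6 circumflex -> combining circumflex
            case '\u02DC': return '\u0303'; // U+02DC small tilde -> combining tilde
            default:       return c;
        }
    }

    static String normalize(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            // Assumption: only convert a diacritic that directly follows a letter.
            boolean afterLetter = sb.length() > 0
                    && Character.isLetter(sb.charAt(sb.length() - 1));
            sb.append(afterLetter ? toCombining(c) : c);
        }
        // NFKC composes base + combining mark into a single character where one
        // exists, and expands compatibility characters such as the fi ligature.
        return Normalizer.normalize(sb.toString(), Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Input: o + spacing diaeresis, e + spacing acute, fi ligature (U+FB01)
        System.out.println(normalize("o\u00A8 e\u00B4 \uFB01"));
    }
}
```

With the input in main, normalize returns "ö é fi": the two diacritics are merged into their base letters and the ligature is expanded into plain "fi".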

