Hello there,

> To get it right one would have to use a general replacement of non-combining 
> into combining diacritics (and probably a normalisation process for unicode 
> to replace combinations by single characters). By the way, you might also 
> have to look out for ligatures (e.g. ff ffi fi fl).

The need for text post-processing depends on the class you're using for the job.

Class org.apache.pdfbox.util.PDFTextStripper does it for you, because
all text is filtered through
org.apache.pdfbox.util.TextNormalize#normalizeDiac(String) and
#normalizePres(String) before it is exposed to the application
programmer via methods like PDFTextStripper#writeString(String).
Bear in mind, however, that TextNormalize relies on the external ICU4J
library: if it is not properly installed, the original string is
returned unchanged.

Other classes such as org.apache.pdfbox.pdfviewer.PageDrawer do not do
it for you. For example, when overriding
PageDrawer#processTextPosition(TextPosition) with the intent of
capturing the text before it is painted, you must filter it through
TextNormalize manually to get the "correct" characters.
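
To illustrate what such a normalisation pass does, here is a minimal
sketch that uses only the JDK's java.text.Normalizer. It is not the
TextNormalize implementation and does not call PDFBox at all; the class
NormalizeSketch, its methods and the small mapping table are
hypothetical names of mine. It turns a few spacing diacritics that
follow a letter into combining ones, then applies NFKC, which expands
ligatures such as fi/ffi and composes letter + combining mark into a
single character. Note that NFKC also folds other compatibility
characters, which may or may not be what you want.

```java
import java.text.Normalizer;

public class NormalizeSketch {

    // Hypothetical helper: maps a few spacing (non-combining) diacritics
    // to their combining counterparts. Illustrative, not complete.
    static char toCombining(char c) {
        switch (c) {
            case '\u00B4': return '\u0301'; // acute accent
            case '\u00A8': return '\u0308'; // diaeresis
            case '\u02C6': return '\u0302'; // modifier circumflex
            case '\u02DC': return '\u0303'; // small tilde
            default:       return c;
        }
    }

    // Replaces a spacing diacritic that directly follows a letter with
    // its combining form, then normalises the whole string to NFKC.
    static String normalize(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            boolean afterLetter = i > 0 && Character.isLetter(raw.charAt(i - 1));
            sb.append(afterLetter ? toCombining(c) : c);
        }
        // NFKC expands compatibility characters (e.g. ligatures) and
        // composes base letter + combining mark where a composite exists.
        return Normalizer.normalize(sb, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "e" + spacing acute, then the "ffi" ligature followed by "x"
        System.out.println(normalize("e\u00B4 \uFB03x"));
    }
}
```

In a PageDrawer subclass you would apply something like this (or
TextNormalize itself) to the string obtained from each TextPosition
before storing it.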


VR
