On 12.08.2023 16:03, t...@cid.is wrote:
Hi all,
[PDFBOX-371] was about the treatment of soft hyphens by PDFBox in the
context of extracting text from PDFs.
It looks like there is _no_ treatment of soft hyphens by PDFBox; at
least I did not find any information about it.
Please prove me wrong, or give me a hint on how to get soft hyphens out of
a PDF as soft hyphens (that is, as an "excentric" Unicode character or an
"excentric" string).
Thanks
Walter Claassen
There have been some issues over the years, see
https://issues.apache.org/jira/browse/TIKA-3314 (which I have just marked
as resolved, although it was fixed long ago)
and
https://issues.apache.org/jira/browse/PDFBOX-5115
Please test with the file attached there or with your own; if you are not
satisfied with the result, upload your file to a file-sharing service and
post the URL.
Tilman
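
For testing as suggested above, here is a minimal sketch (not part of the
original reply) of how one might check whether extracted text still
contains soft hyphens. It assumes the text has already been obtained as a
Java String, for example from PDFTextStripper.getText(), and that a soft
hyphen would appear as the Unicode character U+00AD. The class and method
names (SoftHyphenCheck, softHyphenOffsets, makeVisible) and the "<SHY>"
marker are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SoftHyphenCheck {
    // U+00AD SOFT HYPHEN; normally invisible when printed to a console.
    static final char SOFT_HYPHEN = '\u00AD';

    // Returns the char offsets of every soft hyphen in the given text.
    static List<Integer> softHyphenOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = text.indexOf(SOFT_HYPHEN); i >= 0;
                 i = text.indexOf(SOFT_HYPHEN, i + 1)) {
            offsets.add(i);
        }
        return offsets;
    }

    // Replaces each soft hyphen with a visible marker for inspection.
    static String makeVisible(String text) {
        return text.replace(String.valueOf(SOFT_HYPHEN), "<SHY>");
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a PDF (assumption: in real use
        // this string would come from the text extraction step).
        String extracted = "hy\u00ADphen\u00ADation";
        System.out.println(softHyphenOffsets(extracted));
        System.out.println(makeVisible(extracted));
    }
}
```

If the offset list comes back empty for a PDF known to contain soft
hyphens, that would suggest they were dropped or mapped to another
character during extraction, which is the kind of result worth reporting
together with the file.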