On 12.08.2023 16:03, t...@cid.is wrote:
Hi all,

[PDFBOX-371] was about the treatment of soft hyphens by PDFBox when extracting text from a PDF. It looks like PDFBox does _not_ treat soft hyphens at all; at least I did not find any information about it. Please prove me wrong, or give me a hint on how to get soft hyphens out of a PDF as soft hyphens (that is, as an "eccentric" Unicode character or an "eccentric" string).
Thanks
Walter Claassen


There were some issues over the years, see

https://issues.apache.org/jira/browse/TIKA-3314 (which I only just marked as resolved, although it was fixed long ago)

and

https://issues.apache.org/jira/browse/PDFBOX-5115

Please test with the file attached there or with your own. If you are not satisfied with the result, upload your file to a file-sharing service and post the URL.
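
For such a test, a small check along these lines might help. It is only a sketch: it assumes the extracted text is already available as a String (for example from PDFTextStripper.getText(PDDocument)) and that a soft hyphen, if preserved, would appear as U+00AD. The class and method names are made up for illustration, and the sample string merely stands in for real extractor output.

```java
public class SoftHyphenCheck {
    // U+00AD SOFT HYPHEN, normally invisible when printed.
    static final char SOFT_HYPHEN = '\u00AD';

    // Counts how many soft hyphens the extracted text contains.
    static int countSoftHyphens(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == SOFT_HYPHEN) {
                count++;
            }
        }
        return count;
    }

    // Replaces each soft hyphen with a visible marker so it shows up in logs.
    static String markSoftHyphens(String text) {
        return text.replace(String.valueOf(SOFT_HYPHEN), "[SHY]");
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a PDF (assumption, not real output).
        String extracted = "hy\u00ADphen\u00ADation";
        System.out.println(countSoftHyphens(extracted)); // prints 2
        System.out.println(markSoftHyphens(extracted));  // prints hy[SHY]phen[SHY]ation
    }
}
```

A count of zero on a file known to contain soft hyphens would suggest they are dropped or mapped to something else during extraction.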

Tilman
