Hello, we use Solr 3.5 and Tika to index a large number of PDFs. The content of those PDFs is searchable via full-text search, and the indexed terms are also used to generate search suggestions.
Unfortunately, PDFBox seems to insert a space character wherever there is a soft hyphen in the content of a PDF, so the extracted text is sometimes very fragmented. For example, the word "Medizin" is extracted as "Me di zin". As a consequence, the suggestions are often unusable and the search does not work as expected.
Does anyone have a suggestion for how to extract the content of PDFs containing soft hyphens without fragmenting it?
Best,
Dirk