Hi,

On 20.08.11 18:17, James Wilson wrote:
When I run PDFTextStripper on some PDFs created by a certain PDF writer, I get non-printable characters for the spaces.

This�is�the�Main�Document�to�be�filed�in�the�TEST�database.�
Try�finding�Nelson�Mandela�likes�*apples*�
Fruit�names:�
Pineapples�
Grapes�
Bing�Cherries�
Pears�
Peaches�

Does anybody know why this is happening? To me it looks like an encoding problem. Maybe the encoding of the text within the PDF is slightly different from the default encoding on the server that is running PDFTextStripper against it? I have verified that the problematic PDFs are being created on a Windows machine and that the PDF is having its text extracted on a Linux machine.

Any ideas how to fix this? Is there a pdfbox resource file I can modify in order to teach pdfbox to use a UTF space instead of � ?
Did you override the default encoding? Can you copy and paste the text using Acrobat Reader? If not, the PDF most likely uses fonts that don't provide any mapping from which the text can be extracted. If copy and paste works, there may be an issue with PDFBox.
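
Whichever of the two cases applies, it may help to find out which code point actually stands in for the spaces, and as a stopgap to normalize the extracted string afterwards. The following is only a minimal sketch in plain Java. It assumes the stray character is something like U+00A0 (no-break space) or U+FFFD (replacement character), which has not been confirmed for these PDFs. The class `SpaceNormalizer` and its methods are hypothetical helpers, not part of PDFBox.

```java
import java.util.StringJoiner;

// Hypothetical helper class for illustration only; not part of PDFBox.
public class SpaceNormalizer {

    // Lists every code point outside printable ASCII as U+XXXX,
    // so the character that replaces the spaces can be identified.
    static String hexCodePoints(String s) {
        StringJoiner out = new StringJoiner(" ");
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 0x20 || cp > 0x7E) {
                out.add(String.format("U+%04X", cp));
            }
        }
        return out.toString();
    }

    // Replaces space-like, control and replacement characters with an
    // ordinary space (U+0020). Line breaks (\n, \r) are kept as they are;
    // tabs are turned into spaces as well.
    static String normalizeSpaces(String extracted) {
        StringBuilder sb = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); ) {
            int cp = extracted.codePointAt(i);
            i += Character.charCount(cp);
            boolean lineBreak = cp == '\n' || cp == '\r';
            boolean suspicious = cp == 0xFFFD            // assumed candidate
                    || Character.isSpaceChar(cp)         // includes U+00A0
                    || (Character.isISOControl(cp) && !lineBreak);
            sb.appendCodePoint(suspicious ? ' ' : cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the text returned by the stripper.
        String extracted = "Fruit\u00A0names:\u00A0\nBing\u00A0Cherries\u00A0";
        System.out.println(hexCodePoints(extracted));
        System.out.println(normalizeSpaces(extracted));
    }
}
```

This only hides the symptom. If the font in the PDF really has no usable mapping, characters other than the space may come out wrong too, and post-processing cannot repair those.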
Thanks in advance! James
BR Andreas Lehmkühler

