Re: Text extraction problems...

Andreas Lehmkühler Sun, 21 Aug 2011 01:00:41 -0700

Hi,

On 20.08.11 18:17, James Wilson wrote:


When I run PDFTextStripper on some PDFs created by a certain PDF writer
I get non-printable characters for the spaces.

Thisï¿½isï¿½theï¿½Mainï¿½Documentï¿½toï¿½beï¿½filedï¿½inï¿½theï¿½TESTï¿½database.ï¿½

Tryï¿½findingï¿½Nelsonï¿½Mandelaï¿½likesï¿½*apples*ï¿½
Fruitï¿½names:ï¿½
Pineapplesï¿½
Grapesï¿½
Bingï¿½Cherriesï¿½
Pearsï¿½
Peachesï¿½

Does anybody know why this is happening? To me it looks like an encoding
problem. Maybe the encoding of the text within the PDF is slightly
different than the default encoding on the server that is running
PDFTextStripper against it? I have verified that the problematic PDFs
are being created on a Windows machine and that the PDF is having its
text extracted on a Linux machine.

Any ideas how to fix this? Is there a pdfbox resource file I can modify
in order to teach pdfbox to use a UTF space instead of ï¿½ ?

Did you override the default encoding?

Can you copy&paste the text using Acrobat Reader? If not, the PDF most
likely uses some fonts which don't provide any mapping to extract the
text. If copy&paste works, there may be an issue with PDFBox.
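
To illustrate the encoding question, here is a minimal sketch in plain
Java. It is not PDFBox code. It assumes, without confirmation from the
thread, that the space-like character is a no-break space (U+00A0); the
class and method names are made up for this example. It shows how such
a character turns into the replacement character U+FFFD when bytes
written in one charset are read back in another, and one possible
post-processing workaround.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SpaceEncodingSketch {

    // Simulates writing text with one charset and reading it back with another.
    static String roundTrip(String text, Charset writeAs, Charset readAs) {
        return new String(text.getBytes(writeAs), readAs);
    }

    // Possible workaround: map no-break spaces to plain ASCII spaces
    // before the text is written out or indexed.
    static String normalizeSpaces(String text) {
        return text.replace('\u00A0', ' ');
    }

    public static void main(String[] args) {
        // Assumed shape of the extracted text: no-break spaces instead of spaces.
        String extracted = "Bing\u00A0Cherries\u00A0";

        // Written as ISO-8859-1 but read as UTF-8: the lone byte 0xA0 is
        // not valid UTF-8, so the decoder substitutes U+FFFD for it.
        String garbled = roundTrip(extracted,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(garbled.indexOf('\uFFFD') >= 0);

        // Same charset on both sides: the text survives unchanged.
        String intact = roundTrip(extracted,
                StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(intact.equals(extracted));

        // Normalizing first sidesteps the problem for this character.
        System.out.println(normalizeSpaces(extracted));
    }
}
```

If the charsets really differ between the Windows and Linux machines,
naming the charset explicitly on both the writing and the reading side
is usually more robust than relying on the platform default.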

Thanks in advance!

James


BR
Andreas Lehmkühler
