When I run PDFTextStripper on some PDFs created by a certain PDF writer, I get non-printable characters in place of the spaces:

This�is�the�Main�Document�to�be�filed�in�the�TEST�database.�
Try�finding�Nelson�Mandela�likes�*apples*�
Fruit�names:�
Pineapples�
Grapes�
Bing�Cherries�
Pears�
Peaches�

Does anybody know why this is happening? To me it looks like an encoding problem. Maybe the encoding of the text within the PDF differs slightly from the default encoding on the server that runs PDFTextStripper against it? I have verified that the problematic PDFs are created on a Windows machine and that their text is extracted on a Linux machine.
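To narrow this down, it may help to look at which code points actually come back from the extraction. The following is only a rough sketch: the class and method names are made up for illustration, and the sample string assumes the odd character is a non-breaking space (U+00A0), which I have not confirmed. In practice the input would be the string returned by PDFTextStripper.

```java
public class CodePointDump {

    // Lists each distinct code point that is not plain printable ASCII
    // (tab, CR and LF are ignored), formatted as "U+XXXX".
    static String describeNonAscii(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints()
            .filter(cp -> cp > 0x7E
                    || (cp < 0x20 && cp != '\n' && cp != '\r' && cp != '\t'))
            .distinct()
            .forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Hypothetical sample: assumes the "spaces" are really U+00A0.
        String extracted = "This\u00A0is\u00A0the\u00A0Main\u00A0Document";
        System.out.println(describeNonAscii(extracted));
    }
}
```

If this reports U+00A0, the extracted text itself is probably fine and the � only appears when the output is written or displayed with a different charset. If it reports U+FFFD, the character was already lost before it reached the string.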

Any ideas how to fix this? Is there a PDFBox resource file I can modify in order to teach PDFBox to emit a regular space (U+0020) instead of � ?
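As a stopgap, the kind of workaround I have in mind is cleaning up the string after extraction. This is a minimal sketch, not a PDFBox feature: SpaceNormalizer and normalizeSpaces are hypothetical names, and mapping U+FFFD to a space is an assumption that only holds if every � in the output really stands for a space, as it appears to in the sample above.

```java
public class SpaceNormalizer {

    // Replaces every Unicode space separator (category Zs, which includes
    // the non-breaking space U+00A0) with a regular space (U+0020).
    // Assumption: U+FFFD is also mapped to a space, which is lossy and only
    // reasonable when the replacement character always stands for a space.
    static String normalizeSpaces(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp == 0xFFFD || Character.getType(cp) == Character.SPACE_SEPARATOR) {
                sb.append(' ');
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical input standing in for the extracted text.
        String extracted = "Bing\u00A0Cherries\u00A0";
        System.out.println("[" + normalizeSpaces(extracted) + "]");
    }
}
```

Line breaks and other control characters pass through untouched, so the layout of the extracted text should be preserved.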

Thanks in advance!

James

--
James J. Wilson II
Systems Engineer
U.S. District Court
District of New Mexico
333 Lomas Blvd., NW
Albuquerque, NM 87102
Phone:  (505) 348-2081
Fax:    (505) 348-2028
