When I run PDFTextStripper on some PDFs created by a certain PDF writer,
I get non-printable characters in place of the spaces, for example:
This�is�the�Main�Document�to�be�filed�in�the�TEST�database.�
Try�finding�Nelson�Mandela�likes�*apples*�
Fruit�names:�
Pineapples�
Grapes�
Bing�Cherries�
Pears�
Peaches�
Does anybody know why this is happening? To me it looks like an
encoding problem. Maybe the encoding of the text within the PDF is
slightly different from the default encoding on the server that is
running PDFTextStripper against it? I have verified that the
problematic PDFs are created on a Windows machine and that the text
is extracted on a Linux machine.
Any ideas how to fix this? Is there a PDFBox resource file I can modify
in order to teach PDFBox to use a regular Unicode space instead of � ?
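
In case it helps narrow this down, here is a minimal sketch (plain Java,
no PDFBox calls) of how the extracted string could be inspected and, as
a stopgap, cleaned up after extraction. The class and method names are
made up for illustration. I am also only assuming that the odd character
is a non-breaking space (U+00A0) or the Unicode replacement character
(U+FFFD); I have not confirmed which code point it actually is.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of PDFBox.
public final class SpaceDiagnostics {

    private SpaceDiagnostics() {}

    // Lists the distinct code points outside printable ASCII
    // (ignoring tab, CR and LF), formatted as U+XXXX, so the
    // character standing in for the space can be identified.
    public static Set<String> suspiciousCodePoints(String text) {
        Set<String> found = new LinkedHashSet<>();
        text.codePoints()
            .filter(cp -> cp != '\n' && cp != '\r' && cp != '\t')
            .filter(cp -> cp < 0x20 || cp > 0x7E)
            .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }

    // Replaces Unicode space separators (including U+00A0) and
    // U+FFFD with an ASCII space. Mapping U+FFFD to a space is an
    // assumption: it would also blank out any other character that
    // failed to decode.
    public static String normalizeSpaces(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            boolean spaceLike = cp == 0x00A0 || cp == 0xFFFD
                    || Character.getType(cp) == Character.SPACE_SEPARATOR;
            out.appendCodePoint(spaceLike ? ' ' : cp);
        });
        return out.toString();
    }
}
```

The idea would be to pass the string returned by the text stripper to
suspiciousCodePoints() first, and only apply normalizeSpaces() once the
offending code point is known. This only papers over the symptom; it
does not explain why the space is extracted this way.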
Thanks in advance!
James
--
James J. Wilson II
Systems Engineer
U.S. District Court
District of New Mexico
333 Lomas Blvd., NW
Albuquerque, NM 87102
Phone: (505) 348-2081
Fax: (505) 348-2028