Text extraction problems...

James Wilson Sat, 20 Aug 2011 09:17:57 -0700

When I run PDFTextStripper on some PDFs created by a certain PDF writer, I get non-printable characters for the spaces.


Thisï¿½isï¿½theï¿½Mainï¿½Documentï¿½toï¿½beï¿½filedï¿½inï¿½theï¿½TESTï¿½database.ï¿½
Tryï¿½findingï¿½Nelsonï¿½Mandelaï¿½likesï¿½*apples*ï¿½
Fruitï¿½names:ï¿½
Pineapplesï¿½
Grapesï¿½
Bingï¿½Cherriesï¿½
Pearsï¿½
Peachesï¿½

Does anybody know why this is happening? To me it looks like anencoding problem. Maybe the encoding of the text within the PDF isslightly different than the default encoding on the server that isrunning PDFTextStripper against it? I have verified that theproblematic PDFs are being created on a Windows machine and that the PDFis having its text extracted on a Linux machine.

Any ideas how to fix this? Is there a pdfbox resource file I can modifyin order to teach pdfbox to use a UTF space instead of ï¿½ ?
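
One way to narrow this down is sketched below in Java, under a few assumptions. It inspects the string returned by PDFTextStripper.getText() before that string is written anywhere, so the actual code points are visible. If the character turns out to be a no-break space (U+00A0) or another Unicode space separator, it can be mapped to a plain space. The class and method names are hypothetical and not part of pdfbox, and the U+00A0 guess is unconfirmed for these PDFs.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of pdfbox. It only operates on a String,
// e.g. the value returned by PDFTextStripper.getText(document).
final class ExtractedTextInspector {
    private ExtractedTextInspector() {}

    // Lists each distinct code point outside printable ASCII (ignoring
    // tab, CR and LF), one per line, as "U+XXXX NAME".
    static String describeSuspectCodePoints(String text) {
        Set<Integer> seen = new LinkedHashSet<>();
        text.codePoints()
            .filter(cp -> cp > 0x7E || (cp < 0x20 && cp != '\n' && cp != '\r' && cp != '\t'))
            .boxed()
            .forEach(seen::add);
        StringBuilder report = new StringBuilder();
        for (int cp : seen) {
            report.append(String.format("U+%04X %s\n", cp, Character.getName(cp)));
        }
        return report.toString();
    }

    // Replaces every Unicode space separator (category Zs, which includes
    // U+00A0 NO-BREAK SPACE) with a plain U+0020 space.
    static String normalizeSpaces(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp ->
            out.appendCodePoint(Character.getType(cp) == Character.SPACE_SEPARATOR ? ' ' : cp));
        return out.toString();
    }
}
```

If the report already shows U+FFFD, the original character was lost before this point and no replacement can restore it. In that case the platform default encoding used when the text is written out is the more likely culprit, and passing a Writer built with an explicit charset (for example UTF-8) to PDFTextStripper.writeText may be worth trying.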


Thanks in advance!

James

--
James J. Wilson II
Systems Engineer
U.S. District Court
District of New Mexico
333 Lomas Blvd., NW
Albuquerque, NM 87102
Phone:  (505) 348-2081
Fax:    (505) 348-2028
