Hmmm.  Looks like attachments are not allowed on this newsgroup.

Could it be that this is an Identity-H issue?  Both of the PDFs
in question include at least one font with Identity-H encoding.

James Wilson wrote:
Andreas Lehmkühler wrote:
Hi,

On 20.08.11 at 18:17, James Wilson wrote:

When I run PDFTextStripper on some PDFs created by a certain PDF writer
I get non-printable characters for the spaces.

This�is�the�Main�Document�to�be�filed�in�the�TEST�database.�

Try�finding�Nelson�Mandela�likes�*apples*�
Fruit�names:�
Pineapples�
Grapes�
Bing�Cherries�
Pears�
Peaches�

Does anybody know why this is happening? To me it looks like an encoding
problem. Maybe the encoding of the text within the PDF is slightly
different than the default encoding on the server that is running
PDFTextStripper against it? I have verified that the problematic PDFs
are being created on a Windows machine and that the PDF is having its
text extracted on a Linux machine.

Any ideas how to fix this? Is there a PDFBox resource file I can modify
in order to teach PDFBox to emit a regular space instead of � ?
Did you override the default encoding?

No, I use the PDFTextStripper class with the empty constructor.

Can you copy and paste the text using Acrobat Reader? If not, the PDF most
likely uses some fonts which don't provide any mapping to extract the
text. If copy and paste works, there may be an issue with PDFBox.

Interesting, copy and paste works perfectly.  Which is to say, no weird
space characters. However, PDFTextStripper does produce weird space
characters. I use od (octal dump) to view the weird space characters,
as sometimes you can't see them.

od for copy and paste:

lennon:~/tmp # od -t c CopyAndPaste-wp.txt
0000000   T   h   i   s       i   s       t   h   e       M   a   i   n
0000020       D   o   c   u   m   e   n   t       t   o       b   e
0000040   f   i   l   e   d       i   n       t   h   e       T   E   S
0000060   T       d   a   t   a   b   a   s   e   .  \n   T   r   y
0000100   f   i   n   d   i   n   g       N   e   l   s   o   n       M
0000120   a   n   d   e   l   a       l   i   k   e   s       *   a   p
0000140   p   l   e   s   *  \n   F   r   u   i   t       n   a   m   e
0000160   s   :  \n   P   i   n   e   a   p   p   l   e   s  \n   G   r
0000200   a   p   e   s  \n   B   i   n   g       C   h   e   r   r   i
0000220   e   s  \n   P   e   a   r   s  \n   P   e   a   c   h   e   s
0000240  \n   t   h   i   s       w   a   s       p   r   i   n   t   e
0000260   d       t   o       P   D   F       w   i   t   h   i   n
0000300   W   o   r   d   P   e   r   f   e   c   t  \n
0000314


od for PDFTextStripper:
[james@localhost tmp]$ od -t c PDFTextStripper-wp.txt
0000000   T   h   i   s 302 240   i   s 302 240   t   h   e 302 240   M
0000020   a   i   n 302 240   D   o   c   u   m   e   n   t 302 240   t
0000040   o 302 240   b   e 302 240   f   i   l   e   d 302 240   i   n
0000060 302 240   t   h   e 302 240   T   E   S   T 302 240   d   a   t
0000100   a   b   a   s   e   .  \n   T   r   y 302 240   f   i   n   d
0000120   i   n   g 302 240   N   e   l   s   o   n 302 240   M   a   n
0000140   d   e   l   a 302 240   l   i   k   e   s 302 240   *   a   p
0000160   p   l   e   s   *  \n   F   r   u   i   t 302 240   n   a   m
0000200   e   s   :  \n   P   i   n   e   a   p   p   l   e   s  \n   G
0000220   r   a   p   e   s  \n   B   i   n   g 302 240   C   h   e   r
0000240   r   i   e   s  \n   P   e   a   r   s  \n   P   e   a   c   h
0000260   e   s  \n   t   h   i   s       w   a   s       p   r   i   n
0000300   t   e   d       t   o       P   D   F       w   i   t   h   i
0000320   n       W   o   r   d   P   e   r   f   e   c   t  \n
0000336
[james@localhost tmp]$
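
For reference, the octal pair 302 240 in the PDFTextStripper dump is hex C2 A0, which is the UTF-8 encoding of U+00A0 (NO-BREAK SPACE), whereas the copy-and-paste dump has a plain ASCII space in the same positions. A minimal sketch to confirm the decoding; the class name NbspBytes and the helper describe are made up for illustration and are not part of PDFBox:

```java
import java.nio.charset.StandardCharsets;

public class NbspBytes {
    // Decode a UTF-8 byte sequence and list each code point as U+XXXX.
    static String describe(byte[] utf8) {
        String s = new String(utf8, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Octal 302 240 as shown by od, i.e. hex C2 A0.
        byte[] fromDump = { (byte) 0302, (byte) 0240 };
        System.out.println(describe(fromDump)); // prints "U+00A0"
    }
}
```

So the extracted text is valid UTF-8; it just contains no-break spaces where ordinary spaces were expected.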


Attached are the PDFs in question.

FYI -- I have tested this with the following versions of PDFBox (0.8 and 1.6.0). Both have the same problem.
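
As a possible workaround until the cause is found, the extracted string could be post-processed to replace U+00A0 (the character behind the 302 240 bytes in the od output) with an ordinary space. This is only a sketch: the class name NbspNormalizer and the method normalizeSpaces are hypothetical, and it assumes the input is the String returned by PDFTextStripper.getText(document). It does not explain why PDFBox emits the no-break space in the first place.

```java
public class NbspNormalizer {
    // Replace every NO-BREAK SPACE (U+00A0) with a regular ASCII space.
    static String normalizeSpaces(String extracted) {
        return extracted.replace('\u00A0', ' ');
    }

    public static void main(String[] args) {
        // Stand-in for text returned by PDFTextStripper.getText(document).
        String extracted = "Bing\u00A0Cherries\n";
        String cleaned = normalizeSpaces(extracted);
        System.out.println(cleaned.equals("Bing Cherries\n")); // prints "true"
    }
}
```

Note that this also flattens any no-break spaces the PDF author put there on purpose, which may or may not matter for indexing.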


Thanks in advance!

James


BR
Andreas Lehmkühler
