A page of the file that shows the issue is attached.
Most of the text is OK; the extra spaces show up in the '3117'
on the transaction on 04/08 and the '1273' on the transaction on 04/12. They
are being inserted by PDFTextStripper.writePage and show up in
PDFTextStripper.writeString.

PDFTextStripper was run with setSpacingTolerance(.75) and
setAverageCharTolerance(.45) in order for the rest of the text to be
well behaved.

If I am doing something wrong I'd be happy to find out, but the more I
think about it, based on my experience with graphics pipelines, the more I
think that values like spacing ought to be multiplied by the matrix in
effect at the time they are encountered in the stream, and that the
resulting value should be saved and then used without further transformation.

For the problematic text in this file, the Tc values are 1.3141 for the
'3117' and 0.8731 for the '1273', and in both cases the following Tm has a
scale of 8. PDFStreamEngine calculates a character width of 4.4 with a
character height of 5.6, but looking at the file in Acrobat, that aspect
ratio does not appear to be what Acrobat is using.
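
To make those numbers concrete, here is a minimal sketch of the arithmetic.
It is not PDFBox code: the class and the scaledSpacing helper are
hypothetical, and it assumes the reported character width of 4.4 is in the
same units as the spacing after it has been multiplied by the Tm scale.

```java
import java.util.Locale;

// Sketch only: multiplies the Tc values from this file by the Tm scale of 8
// and relates both the raw and the scaled spacing to the character width.
class TcScalingSketch {
    // Hypothetical helper, not a PDFBox method.
    static double scaledSpacing(double tc, double tmScale) {
        return tc * tmScale;
    }

    public static void main(String[] args) {
        double tmScale = 8.0;    // scale of the Tm that follows each Tc
        double charWidth = 4.4;  // width reported by PDFStreamEngine (assumed same units)
        double[] tcValues = {1.3141, 0.8731};
        for (double tc : tcValues) {
            double scaled = scaledSpacing(tc, tmScale);
            System.out.printf(Locale.ROOT,
                "Tc=%.4f  raw/width=%.2f  scaled=%.4f  scaled/width=%.2f%n",
                tc, tc / charWidth, scaled, scaled / charWidth);
        }
    }
}
```

After scaling, both spacings are larger than the 4.4 character width; without
scaling, both are well below it.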

Thanks for whatever you can suggest.

JH

Hi,

> Joel Hirsh <[email protected]> wrote on 4 May 2014 at 21:03:
>
>
>  I am using PDFTextStripper and getting odd results on some strings that I
> tracked down to something that I think may be a bug in PDFStreamEngine.
>
> The PDF file has some text that looks like "1234" in Acrobat, but comes
> through as "1 2 3 4" from PDFTextStripper.  The logic in PDFTextStripper is
> putting in spaces because of a large inter-character spacing.
>
> Tracing it down, the PDF file has a 'Tc' (spacing operator) followed by a
> 'Tm' (matrix operator) with a scale of 8.  Other PDF files that I could
> find with 'Tc' operators had the 'Tc' after the matrix operator.
Both operators are optional, so their usage may be completely different
from one PDF to the next.

> What strikes me as incorrect is that PDFStreamEngine does not distinguish
> between a 'Tc' followed by 'Tm' versus a 'Tm' followed by 'Tc' .  In either
> case the spacing in the 'Tc' is multiplied by the scale factor in the
> matrix.   There is nothing in the Adobe PDF spec that specifically
> addresses order of transforms, but in normal mathematics there is big
> difference.  And in the case that looks incorrect, the spacing is being
> multiplied by the scale in the matrix, and the results would be more like
> Acrobat if it didn't.
I guess there is a misunderstanding. Neither operator does any calculation;
they just set/replace some values. Other operators like 'Tj' use those values
for their calculations, so the order of those operators isn't relevant.
Furthermore, in your case it's a simple scaling using scalar values, which is
a commutative operation, so the order of the operands doesn't matter.
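
A minimal sketch of that idea, as a simplified model rather than the actual
PDFStreamEngine code: the class, its method names, and the glyph width of
0.55 are hypothetical, and font size, word spacing and horizontal scaling are
left out. 'Tc' and 'Tm' only store a value; the values are combined when text
is shown.

```java
// Simplified model of the text state: Tc and Tm are plain setters,
// and only the show-text step (Tj) combines the stored values.
class TextStateSketch {
    private double charSpacing = 0.0; // last value set by Tc
    private double scale = 1.0;       // horizontal scale of the last Tm

    void tc(double spacing) { charSpacing = spacing; }
    void tm(double newScale) { scale = newScale; }

    // Advance for one glyph when text is shown; this is the only
    // place where the spacing and the scale meet.
    double advance(double glyphWidth) {
        return (glyphWidth + charSpacing) * scale;
    }

    public static void main(String[] args) {
        TextStateSketch tcFirst = new TextStateSketch();
        tcFirst.tc(1.3141);
        tcFirst.tm(8.0);

        TextStateSketch tmFirst = new TextStateSketch();
        tmFirst.tm(8.0);
        tmFirst.tc(1.3141);

        // Same stored state either way, so the same advance: prints true
        System.out.println(tcFirst.advance(0.55) == tmFirst.advance(0.55));
    }
}
```

In this model the spacing is always multiplied by the scale that is current
when the text is shown, whichever operator came first in the stream.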

> Can someone who might have more knowledge about PDFStreamEngine/
> PDFTextStripper comment on this?  The code that does the multiply is in
> PDFStreamEngine.processEncodedText when it is operating on the value in
> characterSpacingText.
Can you share the PDF with us, so that we can have a look at what might be
wrong?

> Thanks

BR
Andreas Lehmkühler
