I am using PDFTextStripper and getting odd results on some strings, which I tracked down to what I think may be a bug in PDFStreamEngine.
The PDF file has some text that looks like "1234" in Acrobat but comes through as "1 2 3 4" from PDFTextStripper. PDFTextStripper inserts the spaces because it sees a large inter-character spacing. Tracing it down, the PDF file has a 'Tc' (character spacing operator) followed by a 'Tm' (text matrix operator) with a scale of 8. The other PDF files I could find that use 'Tc' had the 'Tc' after the matrix operator.

What strikes me as incorrect is that PDFStreamEngine does not distinguish between a 'Tc' followed by a 'Tm' and a 'Tm' followed by a 'Tc'. In either case the spacing from the 'Tc' is multiplied by the scale factor in the matrix. As far as I can tell, nothing in the Adobe PDF spec specifically addresses the order of these operations, but in ordinary mathematics the order makes a big difference. In the case that looks incorrect, the spacing is multiplied by the scale in the matrix, and the results would be closer to Acrobat's if it were not.

Can someone with more knowledge of PDFStreamEngine/PDFTextStripper comment on this? The code that does the multiplication is in PDFStreamEngine.processEncodedText, where it operates on the value in characterSpacingText.

Thanks
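
P.S. To make the two interpretations concrete, here is a minimal sketch of the arithmetic. It is not the PDFBox code: the class name, the method names, the glyph width and the 'Tc' value are all made up for illustration. Only the scale of 8 comes from the file described above.

```java
// Illustration only: compares the horizontal advance of one glyph when the
// character spacing ('Tc') is multiplied by the text matrix scale and when
// it is not. All names and numbers here are hypothetical.
class TcSpacingSketch {

    // Spacing is added in text space, then everything is scaled by the matrix.
    // This is the behaviour described above for PDFStreamEngine.
    static double advanceSpacingScaled(double glyphWidth, double tc, double scale) {
        return (glyphWidth + tc) * scale;
    }

    // Only the glyph width is scaled; the spacing is added afterwards, unscaled.
    static double advanceSpacingUnscaled(double glyphWidth, double tc, double scale) {
        return glyphWidth * scale + tc;
    }

    // Gap left between two glyphs: the advance minus the scaled glyph width.
    static double gap(double advance, double glyphWidth, double scale) {
        return advance - glyphWidth * scale;
    }

    public static void main(String[] args) {
        double glyphWidth = 0.5; // assumed glyph width in text space units
        double tc = 0.25;        // assumed 'Tc' value
        double scale = 8.0;      // scale from the 'Tm' in the problem file

        double scaled = advanceSpacingScaled(glyphWidth, tc, scale);
        double unscaled = advanceSpacingUnscaled(glyphWidth, tc, scale);

        System.out.println("gap with scaled spacing:   " + gap(scaled, glyphWidth, scale));
        System.out.println("gap with unscaled spacing: " + gap(unscaled, glyphWidth, scale));
    }
}
```

With these made-up numbers the gap between glyphs is eight times larger when the spacing is scaled. A difference of that size could push a gap past whatever threshold PDFTextStripper uses to decide that a space belongs between two characters.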

