On 27.08.2015 at 00:39, Roberto Nibali wrote:
Hi

I'm looking at a PDF in PDFDebugger, and the text I'd like to extract
is inside the content stream at node Root/Pages/Kids/[0]/Contents.
How do I programmatically dig down to this node to extract the
Flate-decoded ASCII stream stored in object [3 0 R]?

The first bytes of the stream's content look as follows in PDFDebugger:

q
   1 1 1 rg
   /a0 gs
   14.16 827.76 566.879 -824.879 re
   f
   BT
     9.9984 0 0 9.9984 70.8 806.64 Tm
     /f-0-0 1 Tf
     [ ($) 6 ($) 6 (Do) 6 (s) 20 (s) 20 (i) -17 (er) -7 (n) 6 (r=) -16 (3) 26 ('3) 6 (9) 6 (4'5) 6 (98) ] TJ

Any pointers would be most welcome. In the above example, I'd like to
extract the text "$$DossierNr".

You could use PDFTextStripper.

I hope you're not going to ask how to replace $$DossierNr in the PDF, because that would be really tricky, if not impossible.
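
To illustrate why replacing is tricky: in the stream quoted above, the text is not stored as one string but as a TJ array of short fragments separated by kerning adjustments. Below is a minimal sketch in plain Java that joins those fragments back together. The class TjArrayDecoder and its methods are hypothetical and not part of PDFBox. The sketch assumes that the font's encoding maps the bytes to the characters as displayed, and that the array contains only literal (...) strings, not hex strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; this is not PDFBox API. */
class TjArrayDecoder {

    /** Returns the literal-string fragments of a TJ array operand, in order. */
    static List<String> fragments(String tjArray) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tjArray.length()) {
            if (tjArray.charAt(i) != '(') {
                i++; // skip kerning numbers, brackets, whitespace and the operator
                continue;
            }
            StringBuilder sb = new StringBuilder();
            int depth = 1;
            i++; // step past the opening parenthesis
            while (i < tjArray.length() && depth > 0) {
                char c = tjArray.charAt(i++);
                if (c == '\\' && i < tjArray.length()) {
                    // Simplification: keep the escaped character literally;
                    // octal and \n-style escapes are not interpreted.
                    sb.append(tjArray.charAt(i++));
                } else if (c == '(') {
                    depth++; // balanced nested parentheses are part of the string
                    sb.append(c);
                } else if (c == ')') {
                    depth--;
                    if (depth > 0) {
                        sb.append(c);
                    }
                } else {
                    sb.append(c);
                }
            }
            out.add(sb.toString());
        }
        return out;
    }

    /** Joins all fragments, ignoring the kerning adjustments between them. */
    static String text(String tjArray) {
        return String.join("", fragments(tjArray));
    }

    public static void main(String[] args) {
        String tj = "[ ($) 6 ($) 6 (Do) 6 (s) 20 (s) 20 (i) -17 (er) -7 (n) 6 (r=) -16 (3) "
                  + "26 ('3) 6 (9) 6 (4'5) 6 (98) ] TJ";
        System.out.println(fragments(tj).size() + " fragments"); // prints "14 fragments"
        System.out.println(text(tj)); // prints "$$Dossiernr=3'394'598"
    }
}
```

The placeholder is spread over 14 separately positioned fragments, so a plain search-and-replace on the stream bytes would not find it. PDFTextStripper does this kind of reassembly for you, and takes the font encoding and glyph positions into account as well.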

Tilman


As a sidenote: a wonderful enhancement to the PDFDebugger would be to
obtain working PDFBox code for a given node upon right-click on certain
nodes inside the left-hand side pane.


