Re: Problems Using PDFBox To Manually Track TextPosition

johnw Sat, 15 Aug 2015 20:44:05 -0700

John,

Thanks for the response.

PDFStreamEngine looks promising.  My use case is a bit weird.  On the off 
chance that I can't extract the info I need with PDFStreamEngine, I have some 
follow-up questions about the operators:

I thought that Tm replaces the current text matrix completely (unlike cm), and 
that therefore, if I'm only concerned about text position, I could just treat 
the Tx and Ty members of the new matrix as the new text position.  Is this not 
accurate? Or is it just that I have to watch for cm's and other operations 
after Tm that transform (not replace) the current text matrix?
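
For reference, here is a minimal sketch of the model in question, assuming Tm replaces the text matrix while cm concatenates onto the CTM. The class TextCursor and its method names are hypothetical and are not part of PDFBox; font size, horizontal scaling and text rise are ignored. The point it illustrates is that a cm issued before a Tm still affects the final position, because the text origin is taken from Tm x CTM.

```java
/** Toy model of PDF text positioning; illustrative only, not PDFBox code. */
class TextCursor {
    // Each matrix is stored as the six PDF operands [a b c d e f].
    private double[] ctm = identity();
    private double[] textMatrix = identity();
    private double[] textLineMatrix = identity();

    static double[] identity() {
        return new double[] {1, 0, 0, 1, 0, 0};
    }

    /** Returns m x n in PDF's row-vector convention (m is applied first). */
    static double[] multiply(double[] m, double[] n) {
        return new double[] {
            m[0] * n[0] + m[1] * n[2],
            m[0] * n[1] + m[1] * n[3],
            m[2] * n[0] + m[3] * n[2],
            m[2] * n[1] + m[3] * n[3],
            m[4] * n[0] + m[5] * n[2] + n[4],
            m[4] * n[1] + m[5] * n[3] + n[5]
        };
    }

    /** cm: concatenates onto the CTM (transforms, does not replace). */
    void cm(double a, double b, double c, double d, double e, double f) {
        ctm = multiply(new double[] {a, b, c, d, e, f}, ctm);
    }

    /** BT: resets the text matrix and text line matrix to identity. */
    void beginText() {
        textMatrix = identity();
        textLineMatrix = identity();
    }

    /** Tm: replaces the text matrix and the text line matrix. */
    void tm(double a, double b, double c, double d, double e, double f) {
        textMatrix = new double[] {a, b, c, d, e, f};
        textLineMatrix = textMatrix.clone();
    }

    /** Td: starts a new line, offset from the start of the current line. */
    void td(double tx, double ty) {
        textLineMatrix = multiply(new double[] {1, 0, 0, 1, tx, ty}, textLineMatrix);
        textMatrix = textLineMatrix.clone();
    }

    /** Text origin x: translation part of Tm x CTM. */
    double x() {
        return multiply(textMatrix, ctm)[4];
    }

    /** Text origin y: translation part of Tm x CTM. */
    double y() {
        return multiply(textMatrix, ctm)[5];
    }
}
```

Under these assumptions, "1 0 0 1 0 100 cm" followed by "1 0 0 1 50 700 Tm" puts the text origin at y = 800 rather than 700, and a later "1 0 0 1 50 300 Tm" moves it to y = 400.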

With the q operations, does graphics state include text position?  What about 
path clipping?  
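
To make the q/Q question concrete, here is a rough sketch of a save/restore stack. It rests on my reading of the PDF specification, which I have not confirmed against PDFBox: the graphics state saved by q includes the CTM and the current clipping path, while the text matrix belongs to the enclosing BT...ET text object. The class GraphicsStateStack is hypothetical and tracks only the CTM; the clipping path is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy q/Q stack that tracks only the CTM; illustrative, not PDFBox code. */
class GraphicsStateStack {
    // CTM as the six PDF operands [a b c d e f], starting at identity.
    private double[] ctm = {1, 0, 0, 1, 0, 0};
    private final Deque<double[]> saved = new ArrayDeque<>();

    /** q: pushes a copy of the current state (here, just the CTM). */
    void save() {
        saved.push(ctm.clone());
    }

    /** Q: restores the most recently saved state, if there is one. */
    void restore() {
        if (!saved.isEmpty()) {
            ctm = saved.pop();
        }
    }

    /** cm: concatenates the given matrix onto the CTM (new = m x old). */
    void cm(double a, double b, double c, double d, double e, double f) {
        double[] o = ctm;
        ctm = new double[] {
            a * o[0] + b * o[2],
            a * o[1] + b * o[3],
            c * o[0] + d * o[2],
            c * o[1] + d * o[3],
            e * o[0] + f * o[2] + o[4],
            e * o[1] + f * o[3] + o[5]
        };
    }

    /** Returns a copy of the current CTM. */
    double[] ctm() {
        return ctm.clone();
    }
}
```

In this model a cm issued between q and Q is undone by the Q, so any text shown after the Q is positioned with the CTM that was in effect before the q.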

Sorry for the denseness, I'm in a bit over my head on this one.  (And I 
realize that PDFStreamEngine is the cleaner way to go if I can -- thank you for 
that recommendation!)

-John

-----Original Message-----
From: "John Hewson" <[email protected]>
Sent: Saturday, August 15, 2015 11:29pm
To: [email protected]
Subject: Re: Problems Using PDFBox To Manually Track TextPosition

> On 14 Aug 2015, at 17:06, John Walker <[email protected]> wrote:
> 
> Hello,
> 
> 
> 
> I'm using PDFBox to parse the content stream for a page in a PDF.   Based on
> the list of operations, there are two lines of text that I expect to be in
> very different places on the page vertically.  However, when the page is
> displayed in Sumatra or Acrobat, this text is vertically aligned.

I’d recommend subclassing PDFStreamEngine if you want to hook into the PDF 
operators, specifically showTextString(s) and associated methods, such as 
showGlyph.

Parsing the stream yourself brings many challenges.

> 
> The method I'm using to predict text position has been accurate in the past.
> I'm not sure if the method is faulty, or if I'm misunderstanding the
> operation list I'm getting from PDFBox.
> 
> 
> 
> Here is the list of operations, with annotations explaining how I think they
> should impact vertical position of text cursor: 
> 
> 
> 
> http://pastebin.com/GUWWX3Kv
> 
> 
> 
> As you can see, I'm basically only moving my model of the cursor in reaction
> to Tm's and Td's.  (TJ's aren't relevant because text is horizontal and the
> y position is the one I'm tracking.)   I also ignored the cm, because
> there's a Tm right after it.

You’re definitely misunderstanding the operators. Tm doesn’t set the x and y 
values, it specifies a matrix which is multiplied with the current Tm matrix in 
the graphics state. In addition, the graphics state itself can be 
saved/restored via the q and Q operators. You’ll also need to take the CTM into 
account (that’s the cm operator).

Anyway, don’t do that, use PDFStreamEngine instead.

— John

> 
> Am I misinterpreting the PDF Operators (as I suspect)?  Is there any
> potential that this is a PDFBox issue?  
> 
> 
> 
> Thanks in advance!
> 
> 
> 
> -John 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
