> On 15 Aug 2015, at 20:43, [email protected] wrote:
> 
> John,
> 
> Thanks for the response.
> 
> PDFStreamEngine looks promising.  My use case is a bit weird.  Just on the 
> off chance that I can't extract the info I need with PDFStreamEngine, I had 
> some follow up questions about the operations:

PDFStreamEngine powers all of the various text extraction and rendering classes 
in PDFBox so it should do everything you need. Take a look at PageDrawer to see 
how it handles text using (only a subset of) PDFStreamEngine’s APIs.

> I thought that Tm replaces the current text matrix completely (unlike cm), 
> and that therefore, if I'm only concerned about text position, I could just 
> treat the Tx and Ty members of the new matrix as the new text position.  Is 
> this not accurate? Or is it just that I have to watch for cm's and other 
> operations after Tm that transform (not replace) the current text matrix?

Sorry, yes, that’s right: Tm replaces the entire text matrix. It’s cm which 
multiplies against the existing matrix (the CTM). The text position depends on 
both of those matrices, though. Both matrices are also part of the graphics state.

Note that tx and ty don’t give you the final x and y position; they only 
specify the x and y translation of the matrix. The scale and rotation elements 
will also affect the final x and y position, which is why you need to perform 
the full matrix multiplication instead of extracting just those elements.
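
To make that concrete, here is a minimal sketch (not PDFBox code) that uses 
java.awt.geom.AffineTransform to stand in for the two matrices. The class name 
TextOrigin, the method origin and the sample matrix values are made up for 
illustration. It only maps the text-space origin through Tm and then the CTM; 
it ignores font size, text rise and glyph advances, which a real 
implementation also has to handle.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

class TextOrigin {
    // Maps the text-space origin (0, 0) through Tm first and then the CTM.
    // PDF's [a b c d e f] maps directly onto AffineTransform's constructor order.
    static Point2D origin(AffineTransform ctm, AffineTransform tm) {
        AffineTransform combined = new AffineTransform(ctm);
        combined.concatenate(tm); // result applies tm first, then ctm
        return combined.transform(new Point2D.Double(0, 0), null);
    }

    public static void main(String[] args) {
        // Hypothetical operands: "1 0 0 1 100 700 Tm" and "0.5 0 0 0.5 0 100 cm"
        AffineTransform tm = new AffineTransform(1, 0, 0, 1, 100, 700);
        AffineTransform ctm = new AffineTransform(0.5, 0, 0, 0.5, 0, 100);
        Point2D p = origin(ctm, tm);
        System.out.println(p.getX() + ", " + p.getY());
    }
}
```

With these values Tm alone suggests y = 700, but the CTM halves the scale and 
shifts up by 100, so the origin lands at (50, 450).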

— John

> With the q operations, does graphics state include text position?  What about 
> path clipping?  

Yes, it includes the text matrix and the CTM, as well as the clipping path. See 
PDGraphicsState.
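
If you do end up tracking state by hand, the save/restore part can be sketched 
roughly like this. It is an illustration under assumptions, not how PDFBox 
implements it: the class GraphicsStateStack and its method names are 
hypothetical, and it models only the CTM. A faithful version would also copy 
the rest of the state (clipping path and so on) on every q.

```java
import java.awt.geom.AffineTransform;
import java.util.ArrayDeque;
import java.util.Deque;

class GraphicsStateStack {
    private final Deque<AffineTransform> saved = new ArrayDeque<>();
    private AffineTransform ctm = new AffineTransform(); // identity

    // q: push a copy of the current state
    void save() {
        saved.push(new AffineTransform(ctm));
    }

    // Q: restore the most recently saved state (an unbalanced Q is ignored here)
    void restore() {
        if (!saved.isEmpty()) {
            ctm = saved.pop();
        }
    }

    // cm: the new matrix is applied first, then the existing CTM
    void concatenate(double a, double b, double c, double d, double e, double f) {
        ctm.concatenate(new AffineTransform(a, b, c, d, e, f));
    }

    // Returns a copy so callers cannot modify the tracked state
    AffineTransform ctm() {
        return new AffineTransform(ctm);
    }

    public static void main(String[] args) {
        GraphicsStateStack gs = new GraphicsStateStack();
        gs.save();                           // q
        gs.concatenate(1, 0, 0, 1, 0, 200);  // 1 0 0 1 0 200 cm
        System.out.println(gs.ctm().getTranslateY());
        gs.restore();                        // Q
        System.out.println(gs.ctm().getTranslateY());
    }
}
```

The point is that a cm issued between q and Q stops applying after the Q, so 
any text shown later is positioned with the earlier CTM again.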

> 
> Sorry for the dense-ness, I'm in a bit over my head on this one.  (And I 
> realize that PDFStreamEngine is the cleaner way to go if I can -- thank you 
> for that recommendation!)
> 
> -John
> 
> 
> 
> -----Original Message-----
> From: "John Hewson" <[email protected]>
> Sent: Saturday, August 15, 2015 11:29pm
> To: [email protected]
> Subject: Re: Problems Using PDFBox To Manually Track TextPosition
> 
> 
>> On 14 Aug 2015, at 17:06, John Walker <[email protected]> wrote:
>> 
>> Hello,
>> 
>> 
>> 
>> I'm using PDFBox to parse the content stream for a page in a PDF.   Based on
>> the list of operations, there are two lines of text that I expect to be in
>> very different places on the page vertically.  However, when the page is
>> displayed in Sumatra or Acrobat, this text is vertically aligned.
> 
> I’d recommend subclassing PDFStreamEngine if you want to hook into the PDF 
> operators, specifically showTextString(s) and associated methods, such as 
> showGlyph.
> 
> Parsing the stream yourself brings many challenges.
> 
>> 
>> The method I'm using to predict text position has been accurate in the past.
>> I'm not sure if the method is faulty, or if I'm misunderstanding the
>> operation list I'm getting from PDFBox.
>> 
>> 
>> 
>> Here is the list of operations, with annotations explaining how I think they
>> should impact vertical position of text cursor: 
>> 
>> 
>> 
>> http://pastebin.com/GUWWX3Kv
>> 
>> 
>> 
>> As you can see, I'm basically only moving my model of the cursor in reaction
>> to Tm's and Td's.  (TJ's aren't relevant because text is horizontal and the
>> y position is the one I'm tracking.)   I also ignored the cm, because
>> there's a Tm right after it.
> 
> You’re definitely misunderstanding the operators. Tm doesn’t set the x and y 
> values, it specifies a matrix which is multiplied with the current Tm matrix 
> in the graphics state. In addition, the graphics state itself can be 
> saved/restored via the q and Q operators. You’ll also need to take the CTM 
> into account (that’s the cm operator).
> 
> Anyway, don’t do that, use PDFStreamEngine instead.
> 
> — John
> 
>> 
>> Am I misinterpreting the PDF operators (as I suspect)?  Is there any
>> potential that this is a PDFBox issue?  
>> 
>> 
>> 
>> Thanks in advance!
>> 
>> 
>> 
>> -John 
>> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 
> 


