John,

Thanks for the response.

PDFStreamEngine looks promising. My use case is a bit weird. Just on the off chance that I can't extract the info I need with PDFStreamEngine, I had some follow-up questions about the operations:

I thought that Tm replaces the current text matrix completely (unlike cm), and that therefore, if I'm only concerned about text position, I could just treat the Tx and Ty members of the new matrix as the new text position. Is this not accurate? Or is it just that I have to watch for cm's and other operations after Tm that transform (not replace) the current text matrix?

With the q operations, does graphics state include text position? What about path clipping?

Sorry for the denseness, I'm in a bit over my head on this one. (And I realize that PDFStreamEngine is the cleaner way to go if I can -- thank you for that recommendation!)

-John

-----Original Message-----
From: "John Hewson" <[email protected]>
Sent: Saturday, August 15, 2015 11:29pm
To: [email protected]
Subject: Re: Problems Using PDFBox To Manually Track TextPosition

> On 14 Aug 2015, at 17:06, John Walker <[email protected]> wrote:
>
> Hello,
>
> I'm using PDFBox to parse the content stream for a page in a PDF. Based on
> the list of operations, there are two lines of text that I expect to be in
> very different places on the page vertically. However, when the page is
> displayed in Sumatra or Acrobat, this text is vertically aligned.

I’d recommend subclassing PDFStreamEngine if you want to hook into the PDF operators, specifically showTextString(s) and associated methods, such as showGlyph. Parsing the stream yourself brings many challenges.

> The method I'm using to predict text position has been accurate in the past.
> I'm not sure if the method is faulty, or if I'm misunderstanding the
> operation list I'm getting from PDFBox.
>
> Here is the list of operations, with annotations explaining how I think they
> should impact vertical position of text cursor:
>
> http://pastebin.com/GUWWX3Kv
>
> As you can see, I'm basically only moving my model of the cursor in reaction
> to Tm's and Td's. (TJ's aren't relevant because text is horizontal and the
> y position is the one I'm tracking.) I also ignored the cm, because
> there's a Tm right after it.

You’re definitely misunderstanding the operators. Tm doesn’t set the x and y values, it specifies a matrix which is multiplied with the current Tm matrix in the graphics state. In addition, the graphics state itself can be saved/restored via the q and Q operators. You’ll also need to take the CTM into account (that’s the cm operator).

Anyway, don’t do that, use PDFStreamEngine instead.

-- John

> Am I misinterpreting the PDF Operators (as I suspect)? Is there any
> potential that this is a PDFBox issue?
>
> Thanks in advance!
>
> -John

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
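
For anyone following the thread, here is a minimal sketch of the bookkeeping being discussed. It is plain Java and does not use PDFBox. It assumes the operator semantics as I read them in the PDF specification (ISO 32000-1): Tm replaces the text matrix and text line matrix rather than multiplying into them, Td translates the text line matrix, cm concatenates onto the CTM, q/Q save and restore the CTM but not the text matrix, and BT resets the text matrices. The class and method names (TextOriginTracker, concat, moveText, and so on) are hypothetical, and font size, text rise and horizontal scaling are ignored. The point it illustrates is that the same Tm operands land at different page positions under different CTMs, so a cm cannot be skipped just because a Tm follows it.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of PDFBox: tracks where the text origin lands.
class TextOriginTracker {
    // Affine matrices stored as {a, b, c, d, e, f}, PDF row-vector convention.
    private static final double[] IDENTITY = {1, 0, 0, 1, 0, 0};

    private double[] ctm = IDENTITY.clone();
    private double[] textMatrix = IDENTITY.clone();
    private double[] textLineMatrix = IDENTITY.clone();
    private final Deque<double[]> savedCtms = new ArrayDeque<>();

    // Returns m x n, i.e. apply m first, then n.
    static double[] multiply(double[] m, double[] n) {
        return new double[] {
            m[0] * n[0] + m[1] * n[2],
            m[0] * n[1] + m[1] * n[3],
            m[2] * n[0] + m[3] * n[2],
            m[2] * n[1] + m[3] * n[3],
            m[4] * n[0] + m[5] * n[2] + n[4],
            m[4] * n[1] + m[5] * n[3] + n[5]
        };
    }

    // q: save the CTM (assumed: text matrices are not part of the saved state).
    void save() { savedCtms.push(ctm.clone()); }

    // Q: restore the most recently saved CTM, if any.
    void restore() { if (!savedCtms.isEmpty()) ctm = savedCtms.pop(); }

    // cm: concatenate the operand matrix onto the CTM.
    void concat(double a, double b, double c, double d, double e, double f) {
        ctm = multiply(new double[] {a, b, c, d, e, f}, ctm);
    }

    // BT: reset both text matrices to identity.
    void beginText() {
        textMatrix = IDENTITY.clone();
        textLineMatrix = IDENTITY.clone();
    }

    // Tm: replace (not concatenate) the text matrix and text line matrix.
    void setTextMatrix(double a, double b, double c, double d, double e, double f) {
        textLineMatrix = new double[] {a, b, c, d, e, f};
        textMatrix = textLineMatrix.clone();
    }

    // Td: translate the text line matrix, then copy it to the text matrix.
    void moveText(double tx, double ty) {
        textLineMatrix = multiply(new double[] {1, 0, 0, 1, tx, ty}, textLineMatrix);
        textMatrix = textLineMatrix.clone();
    }

    // Text origin after applying the text matrix and then the CTM, as {x, y}.
    double[] deviceOrigin() {
        double[] trm = multiply(textMatrix, ctm);
        return new double[] {trm[4], trm[5]};
    }

    public static void main(String[] args) {
        TextOriginTracker t = new TextOriginTracker();
        t.save();                              // q
        t.concat(0.5, 0, 0, 0.5, 0, 100);      // 0.5 0 0 0.5 0 100 cm
        t.beginText();                         // BT
        t.setTextMatrix(1, 0, 0, 1, 10, 700);  // 1 0 0 1 10 700 Tm
        System.out.println(t.deviceOrigin()[1]); // 450.0, not 700
        t.moveText(0, -20);                    // 0 -20 Td
        System.out.println(t.deviceOrigin()[1]); // 440.0
        t.restore();                           // Q (after ET)
        t.beginText();                         // BT
        t.setTextMatrix(1, 0, 0, 1, 10, 700);  // same Tm operands as before
        System.out.println(t.deviceOrigin()[1]); // 700.0
    }
}
```

This only models the operators named in the thread. A real content stream has more ways to move text (TD, T*, ', ", and the advance after Tj/TJ), which is a large part of why subclassing PDFStreamEngine is the safer route.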

