Hi,

Michael Howard wrote:
I have a question about the coordinate system orientation for text.

Using PrintTextLocations and the ExtractTextByArea example, I have
observed that the coordinate system for the position of the text has
the Y coordinate running down the page.

I was surprised by this because ExtractImageLocations reports images
with the origin being at the lower left.
I'm not sure whether the PrintImageLocations example works correctly; see
PDFBOX-585 for further details. [1]

My quick browsing of the pdf spec at
http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf shows
examples of text with the origin being defined from the lower left
corner of the page.
That's correct.

I tried it on multiple .pdf documents from different sources to ensure
that there wasn't something strange with my .pdf files.

I didn't find any discussion of this in the email archives.

Any comments or explanation about why the Y coordinate system runs
down the page would be helpful.
Both PrintTextLocations and ExtractTextByArea use the PDFTextStripper
class. It uses the rendering code to extract the text, and the renderer itself
uses Java2D to show each page. Because Java2D puts the origin (0,0) at the
upper left corner, the Y coordinate runs down the page.

Probably we should just improve the mentioned examples to calculate/process both
coordinate systems (PDF and Java2D).
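
As a rough sketch of what handling both systems could look like: for an
unrotated page whose lower left corner sits at 0,0, the two Y values differ
only by a flip against the page height. The class and method names below are
made up for illustration and are not part of PDFBox, and the 792 pt page
height is just an assumed US Letter page.

```java
// Hypothetical helper, not part of PDFBox. It assumes an unrotated page
// whose lower left corner is at 0,0 in PDF user space.
public final class PageCoordinates {

    private PageCoordinates() {
        // static helpers only
    }

    /** Converts a Y value measured down from the top edge (Java2D style) to a PDF-style Y. */
    public static float toPdfY(float topDownY, float pageHeight) {
        return pageHeight - topDownY;
    }

    /** Converts a PDF-style Y (measured up from the bottom edge) to a top-down Y. */
    public static float toTopDownY(float pdfY, float pageHeight) {
        return pageHeight - pdfY;
    }

    public static void main(String[] args) {
        float pageHeight = 792f; // assumed US Letter height in points
        float topDownY = 100f;   // a Y value as reported with the origin at the top
        float pdfY = toPdfY(topDownY, pageHeight);
        System.out.println("top-down Y " + topDownY + " -> PDF Y " + pdfY);
    }
}
```

The X coordinate is the same in both systems under these assumptions, so only
Y needs converting. A rotated page or a page box that does not start at 0,0
would need more than this simple flip.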

WDYT?

BR
Andreas Lehmkühler


[1] https://issues.apache.org/jira/browse/PDFBOX-585
