Re: Coordinate system for text

Michael Howard Sat, 01 May 2010 06:26:50 -0700

On Sat, May 1, 2010 at 7:45 AM, Andreas Lehmkuehler <[email protected]> wrote:
> Hi,
>
> Michael Howard schrieb:
>>
>> I have a question about the coordinate system orientation for text.
>>
>> Using PrintTextLocations and the ExtractTextByArea example, I have
>> observed that the coordinate system for the position of the text has
>> the Y coordinate running down the page.
>>
>> I was surprised by this because PrintImageLocations reports images
>> with the origin being at the lower left.
>
> I'm not sure if the PrintImageLocations example works correctly; see
> PDFBOX-585 for further details. [1]


It is true that PrintImageLocations has problems with the width and
height. But it correctly reports the x,y coordinates of the lower-left
corner of each embedded image, measured from the lower left of the page.

I have become familiar with PrintImageLocations and have some
understanding of the errors in the image width and height
calculations. We can discuss that in a separate thread if you would
like.

<snip>

>> Any comments or explanation about why the Y coordinate system runs
>> down the page would be helpful.
>
> Both PrintTextLocations and ExtractTextByArea use the PDFTextStripper
> class. It uses the rendering code to extract the text, and the renderer
> itself uses Java2D to show each page. As Java2D uses the upper left
> corner as the 0,0 reference, the Y coordinate runs down the page.

OK, that is a good explanation as to why the text extraction routines
have the coordinate system running down the page.

I observe that the units are still PDF units (1/72 inch, i.e. 72 units
per inch) ... only the Y axis runs down the page.
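
To make the relationship concrete, here is a minimal sketch of the
conversion between the two orientations. It assumes an unrotated page
whose lower-left corner is at 0,0 and a page height given in the same
1/72 inch units (792 for US Letter). The class and method names are
hypothetical and are not part of pdfbox.

```java
/** Hypothetical helper for illustration only; not part of pdfbox. */
final class YAxisFlip {
    private YAxisFlip() {}

    /**
     * Converts a Y value measured down from the top of the page
     * (Java2D style) to one measured up from the bottom (PDF style).
     * Assumes an unrotated page with its lower-left corner at 0,0.
     */
    static float toPdfY(float yDown, float pageHeight) {
        return pageHeight - yDown;
    }

    /** Inverse conversion: PDF style (Y up) to Java2D style (Y down). */
    static float toJava2dY(float yUp, float pageHeight) {
        return pageHeight - yUp;
    }
}
```

The X coordinate is unchanged in both directions; only Y is mirrored
about the page height.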

> Probably we should just improve the mentioned examples to calculate/process
> both coordinate systems (PDF and Java2D).
>
> WDYT?

Yes, I think it would be best if the text extraction routines could
work with the PDF coordinate system orientation.

I think that if the coordinate unit is 1/72 inch in both cases, then it
would be clearer to users if we consistently maintained the PDF
coordinate system, for both text and images.

At this point pdfbox needs to support the existing users who have
built text extraction code with the Y-down orientation ... we should
support both.

I am not yet familiar enough with the pdfbox code base to recommend
whether supporting both Y-axis orientations should be done through
different methods or with a flag/setting that changes the behavior of
the existing methods.
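
For the flag/setting option, a rough sketch of the kind of thing I have
in mind is below. All of the names are hypothetical; none of them come
from the pdfbox code base, and the sketch assumes an unrotated page with
its lower-left corner at 0,0. The default would stay Y-down so that
existing callers see no change.

```java
/** Hypothetical setting for illustration only; not part of pdfbox. */
enum YAxisOrientation {
    /** Current behavior: Y measured down from the top of the page. */
    Y_DOWN {
        @Override
        float apply(float yDown, float pageHeight) {
            return yDown;
        }
    },
    /** PDF behavior: Y measured up from the bottom of the page. */
    Y_UP {
        @Override
        float apply(float yDown, float pageHeight) {
            return pageHeight - yDown;
        }
    };

    /**
     * Maps a Y value as produced today (Y down) to the orientation
     * this setting selects. Units are 1/72 inch on both sides.
     */
    abstract float apply(float yDown, float pageHeight);
}
```

The alternative of separate methods would avoid carrying this state
around, at the cost of a larger API surface.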

Another related factor is the coordinate system that is used when pdf
documents are generated using pdfbox. Thus far I have only been
reading pdf documents, not generating them. Therefore I do not know
which coordinate system is used when pdf documents are generated.
However it seems to me that the generation and retrieval sides should
use a consistent coordinate system orientation.



Thanks for all of your work. Let me know how I can help.

Michael

> BR
> Andreas Lehmkühler
>
>
> [1] https://issues.apache.org/jira/browse/PDFBOX-585
