Hi,
On 03.01.2012 12:25, Ilija Pavlic wrote:
On Tue, Jan 3, 2012 at 10:50 AM, Ilija Pavlic<[email protected]> wrote:
Can somebody explain how the pdfbox library really works; how the parts fit
together, and how it is used -- something of a bird's eye-view on various
parts of the pdfbox and connections between them?
http://pdfbox.apache.org/userguide/index.html is a good starting point.
Sadly, it is not.
On-and-off, I spent two weeks with pdfbox, reading what documentation
exists on the website, skimming through source files and
try-miss-repeat programming using pdfbox. I am willing to write
tutorials/documentation on what little I learned along the way, but
Patches are always welcome, especially those addressing the docs.
I don't know how the library really works in the background or why the
things I wrote really work. The documentation on the website (which I
did read, thoroughly) is inadequate and very much incomplete.
I'm afraid that's correct.
I realize that you do not have time to write answers to all
beginners' questions. I also realize that you are trying to be
helpful. Thank you for that.
You have to get a digital copy of the PDF specification [1] to understand the
format of PDFs. It'll become your new "bible".
A PDF consists of different objects, such as simple strings and booleans, but
also dictionaries and streams. A special kind of stream is the content stream,
which contains a sequence of instructions describing the graphical elements to
be painted on a page. I guess this is what you're looking for.
The class PDFStreamEngine processes those streams: it executes every operator as
long as it is supported/needed for the given use case. The mapping from an
operator to the implementing class is done within a properties file, e.g.
PageDrawer.properties contains the mapping for all operators which are used for
rendering. PDFTextStripper.properties contains a smaller subset of mappings as
some of the supported operators aren't useful for text extraction.
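To make the mapping idea concrete, here is a minimal, self-contained sketch of
a properties-driven operator lookup. It is not PDFBox code: the class
OperatorMappingSketch, its methods, and the handler names (BeginText, ShowText,
MoveTo, LineTo) are made up for illustration, and the real properties files may
be laid out differently. It only shows why a smaller mapping means fewer
operators get executed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class OperatorMappingSketch {

    /** Parses "operator=handlerName" lines the way a .properties file is read. */
    public static Properties loadMapping(String text) {
        Properties mapping = new Properties();
        try {
            mapping.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return mapping;
    }

    /**
     * Returns the (hypothetical) handler names that would run for the given
     * operators. Operators without a mapping are silently skipped.
     */
    public static List<String> dispatch(Properties mapping, String... operators) {
        List<String> invoked = new ArrayList<>();
        for (String op : operators) {
            String handler = mapping.getProperty(op);
            if (handler != null) {
                invoked.add(handler);
            }
        }
        return invoked;
    }
}
```

With a "rendering" mapping that lists BT, Tj, m and l, all four operators are
dispatched; with a "text" mapping that lists only BT and Tj, the path operators
m and l are simply ignored.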
The operators for graphics objects are explained in chapter 8.2. There is no
simple command like drawLine; it's a little bit more complicated:
- 0 G -> set the stroking color to black
- x y m -> move to the starting point (x,y)
- x y l -> draw a line to the endpoint (x,y)
- s -> close and stroke the path
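Since the question was about line positions, the operator sequence above can be
turned into a rough sketch that collects line segments from a content stream.
This is plain Java, not the PDFBox API: LineOperatorSketch and findLines are
hypothetical names, and the code assumes an already decoded stream whose tokens
are separated by whitespace (no strings, arrays or inline images). It also
ignores the transformation matrix, so the coordinates are in user space.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class LineOperatorSketch {

    /** Returns every "m ... l" segment as {x1, y1, x2, y2}. */
    public static List<double[]> findLines(String contentStream) {
        List<double[]> segments = new ArrayList<>();
        Deque<Double> operands = new ArrayDeque<>(); // used as a stack
        double curX = 0, curY = 0;                   // current point of the path

        for (String token : contentStream.trim().split("\\s+")) {
            switch (token) {
                case "m": { // x y m: start a new subpath at (x, y)
                    curY = operands.pop();
                    curX = operands.pop();
                    break;
                }
                case "l": { // x y l: line from the current point to (x, y)
                    double y = operands.pop();
                    double x = operands.pop();
                    segments.add(new double[] {curX, curY, x, y});
                    curX = x;
                    curY = y;
                    break;
                }
                default:
                    try {
                        // Numbers are operands for the next operator.
                        operands.push(Double.parseDouble(token));
                    } catch (NumberFormatException e) {
                        // Any other operator: drop its operands and move on.
                        operands.clear();
                    }
            }
        }
        return segments;
    }
}
```

For example, findLines("0 G 10 20 m 110 20 l s") yields a single segment from
(10, 20) to (110, 20).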
But be aware that path objects can be used to stroke a path, to fill a path, or
as a clipping path. There is a transformation matrix which has to be taken into
account for scaling or translation and, last but not least, PDFs use a graphics
state stack whose entries hold different graphics parameters.
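As a rough illustration of the matrix and the stack, the sketch below models
only three operators: q (save), Q (restore) and cm (concatenate a matrix). The
class GraphicsStateSketch and its method names are invented for this mail and
are not PDFBox classes; a real graphics state holds many more parameters
(colors, line width, clipping path and so on). The matrix is stored as the six
numbers a b c d e f used by the cm operator.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class GraphicsStateSketch {

    // Current transformation matrix {a, b, c, d, e, f}; starts as identity.
    private double[] ctm = {1, 0, 0, 1, 0, 0};
    private final Deque<double[]> saved = new ArrayDeque<>();

    /** q: push a copy of the current state. */
    public void save() {
        saved.push(ctm.clone());
    }

    /** Q: pop the most recently saved state. */
    public void restore() {
        ctm = saved.pop();
    }

    /** a b c d e f cm: the new matrix is applied before the existing one. */
    public void concat(double a, double b, double c, double d, double e, double f) {
        double[] m = ctm;
        ctm = new double[] {
            a * m[0] + b * m[2],
            a * m[1] + b * m[3],
            c * m[0] + d * m[2],
            c * m[1] + d * m[3],
            e * m[0] + f * m[2] + m[4],
            e * m[1] + f * m[3] + m[5]
        };
    }

    /** Maps a point given in path coordinates through the current matrix. */
    public double[] transform(double x, double y) {
        return new double[] {
            ctm[0] * x + ctm[2] * y + ctm[4],
            ctm[1] * x + ctm[3] * y + ctm[5]
        };
    }
}
```

After save(), concat(2, 0, 0, 2, 0, 0) and concat(1, 0, 0, 1, 10, 5), the point
(1, 1) is first translated to (11, 6) and then scaled to (22, 12). After
restore() it maps to (1, 1) again. So the x y values found next to m and l only
become page positions once they have been run through the matrix that was
active at that point in the stream.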
That sounds really complicated, but IMHO once you get used to it, it won't be
that hard anymore. :-)
Still, I am not any closer to determining line positions.
BR,
Ilija.
BR
Andreas Lehmkühler
[1] http://www.adobe.com/devnet/pdf/pdf_reference.html