Here's something that comes close:
https://stackoverflow.com/questions/38931422/pdfbox-2-0-2-calling-of-pagedrawer-processpage-method-caught-exceptions
https://stackoverflow.com/questions/55166990/pdfbox-line-rectangle-extraction

It collects lines for later use. You need to alter that code so it also collects non-rectangular paths.
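As a rough illustration of that idea, here is a minimal sketch of a collector that keeps every painted path, not only lines and rectangles. The class name PathCollector is hypothetical. Its method names mirror the path callbacks of PDFBox's PDFGraphicsStreamEngine, with simplified signatures, but the class uses only java.awt.geom and does not depend on PDFBox; in a real program the same bodies would presumably go into a subclass of that engine.

```java
import java.awt.geom.Path2D;
import java.awt.geom.Point2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical collector; not part of PDFBox.
class PathCollector {
    private final List<Path2D> paths = new ArrayList<>();
    private Path2D.Double current = new Path2D.Double();

    void moveTo(float x, float y) { current.moveTo(x, y); }

    void lineTo(float x, float y) { current.lineTo(x, y); }

    void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) {
        current.curveTo(x1, y1, x2, y2, x3, y3);
    }

    void closePath() { current.closePath(); }

    // A rectangle is stored as an ordinary closed subpath.
    void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
        current.moveTo(p0.getX(), p0.getY());
        current.lineTo(p1.getX(), p1.getY());
        current.lineTo(p2.getX(), p2.getY());
        current.lineTo(p3.getX(), p3.getY());
        current.closePath();
    }

    // Painting operators finish the current path and keep it.
    void strokePath() { finish(); }

    void fillPath() { finish(); }

    // Ending a path without painting discards it in this sketch.
    void endPath() { current.reset(); }

    private void finish() {
        paths.add(current);
        current = new Path2D.Double();
    }

    List<Path2D> getPaths() { return paths; }
}
```

The index of a path in the returned list could serve as a simple identifier, assuming the page content is not modified in between.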

Tilman

On 07.12.2024 07:58, Rob McDonald wrote:
I am considering starting a new project and I'm looking at using PDFBox to
do it.  I would appreciate any thoughts on the appropriateness of PDFBox
vs. other PDF libraries -- Java or C++.

My program will be similar in many ways to a vector drawing program
(Inkscape, Illustrator, etc.).  I want to be able to parse a page of a PDF
document and work with the entities in memory.  I am mostly interested in
vector graphics paths -- I don't really care about the text or raster
images.

A user will need to be able to click on a given path to select it.  They
should then be able to manipulate that path -- perhaps suppress it from
display, change the stroke width, color, re-ordering, etc.  A particular
path needs to be uniquely identifiable and manipulated.  The program needs
to be interactive -- there is not enough information available a priori to
process a file or a page in a batch manner.
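For the click-to-select requirement, one possible approach is sketched below using only java.awt. It assumes the paths are already held in memory as java.awt.Shape objects in paint order, and that the click point has already been transformed into the same coordinate space as the paths. PathPicker and pick are hypothetical names, not PDFBox API.

```java
import java.awt.BasicStroke;
import java.awt.Shape;
import java.awt.geom.Point2D;
import java.util.List;

// Hypothetical helper; not part of PDFBox.
class PathPicker {
    // Returns the index of the topmost path whose outline lies within
    // 'tolerance' units of the click, or -1 if no path is hit.
    static int pick(List<? extends Shape> paths, Point2D click, float tolerance) {
        BasicStroke hitStroke = new BasicStroke(2 * tolerance);
        // Later paths are painted on top, so search from the end.
        for (int i = paths.size() - 1; i >= 0; i--) {
            Shape outline = hitStroke.createStrokedShape(paths.get(i));
            if (outline.contains(click)) {
                return i;
            }
        }
        return -1;
    }
}
```

This only tests the outline of each path. Selecting a filled path by clicking its interior would need an additional contains() check on the path itself.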


From what I can tell, PDFBox mostly treats a PDF file as a stream.  It
reads a file incrementally, processing as it goes.  Each page is processed
operator by operator, without storing anything in memory beyond the current
operator and its operands.  In this way, memory usage is kept very low --
even for documents with many pages or very complex pages.

From what I can tell, the existing operator data structures are set up to
take action (process(), i.e. draw, print, or convert), but are not set up
for storage, i.e. keeping the data around to do something with it later.


I can imagine constructing data structures to store each operator with its
operands (will need a concrete class for every possible operator).  Then, a
separate Parser would be needed to go through the Page and store the stream
of operators into a collection of some sort (vector, array, list, etc.).

Then, another pass could be made to consolidate / interpret groups of
operators into paths.  That is, a path starts with a MoveTo, consists of a
series of LineTo and CurveTo operators, and is terminated by a Close, End,
Stroke, or similar operator.
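A minimal sketch of such a consolidation pass, assuming the operators have already been stored in a list. Op and PathGrouper are hypothetical classes, not PDFBox types. The operator names are the PDF content stream operators for path construction (m, l, c, v, y, h, re) and path painting (S, s, f, F, f*, B, B*, b, b*, n).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stored operator: a name plus its numeric operands.
class Op {
    final String name;
    final float[] operands;

    Op(String name, float... operands) {
        this.name = name;
        this.operands = operands;
    }
}

// Hypothetical second pass that groups stored operators into paths.
class PathGrouper {
    private static final Set<String> CONSTRUCTION =
            Set.of("m", "l", "c", "v", "y", "h", "re");
    private static final Set<String> PAINTING =
            Set.of("S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n");

    static List<List<Op>> group(List<Op> ops) {
        List<List<Op>> paths = new ArrayList<>();
        List<Op> current = new ArrayList<>();
        for (Op op : ops) {
            if (CONSTRUCTION.contains(op.name)) {
                current.add(op);
            } else if (PAINTING.contains(op.name) && !current.isEmpty()) {
                // A painting operator terminates the path under construction.
                current.add(op);
                paths.add(current);
                current = new ArrayList<>();
            }
            // Everything else (graphics state, text, clipping) is skipped here.
        }
        return paths;
    }
}
```

Note that one path may contain several subpaths, each starting with its own m or re. A fuller version would also need to record the graphics state (stroke width, color, transformation) in effect when each path is painted.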


I will want to be able to visualize the manipulated page -- so I'll either
need to write my own renderer to work from my page data structure, or I
will need to be able to re-serialize my data structure back into a PDF
stream and then feed the modified page to the main renderer I'm using.


Does this kind of capability already exist in PDFBox -- perhaps in one of
the examples?  Or possibly in a 3rd party open source project that uses
PDFBox?

Does this seem like the right approach with PDFBox?  Am I missing an
obviously better way?

Does anyone know of an alternate library that would be more suitable for
these use cases and abstractions?


Thanks in advance for any help.  Thanks also for all the work that has gone
into PDFBox so far.

Rob



