Here's something that comes close:
https://stackoverflow.com/questions/38931422/pdfbox-2-0-2-calling-of-pagedrawer-processpage-method-caught-exceptions
https://stackoverflow.com/questions/55166990/pdfbox-line-rectangle-extraction
It collects lines for later use. You need to alter that code so that it also collects non-rectangular paths.
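A rough sketch of what "collecting" paths rather than drawing them could look like. The method names mirror callbacks of PDFBox's PDFGraphicsStreamEngine (moveTo, lineTo, curveTo, closePath, strokePath, fillPath, endPath), but PathCollector and CollectedPath are hypothetical stand-alone classes, not PDFBox API, so the sketch runs without the library. In real code these would presumably be overrides in a subclass, fillPath would take a winding rule, and rectangles, clipping and the current transformation matrix would also need handling.

```java
import java.awt.geom.Path2D;
import java.util.ArrayList;
import java.util.List;

/** One finished path together with the operation that painted it (hypothetical type). */
final class CollectedPath {
    final Path2D shape;
    final String paintOp; // "stroke" or "fill" in this sketch

    CollectedPath(Path2D shape, String paintOp) {
        this.shape = shape;
        this.paintOp = paintOp;
    }
}

/**
 * Accumulates path segments and stores each painted path for later use.
 * Stand-in for a PDFGraphicsStreamEngine subclass (assumption: same callback order).
 */
final class PathCollector {
    private final List<CollectedPath> paths = new ArrayList<>();
    private Path2D.Float current = new Path2D.Float();

    // Path construction callbacks: extend the path being built.
    void moveTo(float x, float y) { current.moveTo(x, y); }
    void lineTo(float x, float y) { current.lineTo(x, y); }
    void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) {
        current.curveTo(x1, y1, x2, y2, x3, y3);
    }
    void closePath() { current.closePath(); }

    // Painting callbacks: store the finished path, then start a new one.
    void strokePath() { finish("stroke"); }
    void fillPath() { finish("fill"); }

    // Path ended without painting: discard it in this sketch.
    void endPath() { current = new Path2D.Float(); }

    private void finish(String paintOp) {
        paths.add(new CollectedPath(current, paintOp));
        current = new Path2D.Float();
    }

    List<CollectedPath> paths() { return paths; }
}
```

Because each stored shape is a java.awt.geom.Path2D, a click could be tested against a filled path with shape.contains(x, y). Stroked paths would need a stroked outline first.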
Tilman

On 07.12.2024 07:58, Rob McDonald wrote:
I am considering starting a new project and I'm looking at using PDFBox to do it. I would appreciate any thoughts on the appropriateness of PDFBox vs. other PDF libraries -- Java or C++.

My program will be similar in many ways to a vector drawing program (Inkscape, Illustrator, etc.). I want to be able to parse a page of a PDF document and work with the entities in memory. I am mostly interested in vector graphics paths -- I don't really care about the text or raster images.

A user will need to be able to click on a given path to select it. They should then be able to manipulate that path -- perhaps suppress it from display, change its stroke width or color, reorder it, etc. A particular path needs to be uniquely identifiable and individually editable. The program needs to be interactive -- there is not enough information available a priori to process a file or a page in a batch manner.

From what I can tell, PDFBox mostly treats a PDF file as a stream. It reads a file incrementally, processing as it goes. Each page is processed operator by operator, without storing anything in memory beyond the current operator and its operands. In this way, memory usage is kept very low -- even for documents with many pages or very complex pages.

From what I can tell, the existing operator data structures are set up to take action (process(), i.e. draw, print, or convert), but are not set up for storage, i.e. keeping the data around to do something with it later.

I can imagine constructing data structures to store each operator with its operands (this would need a concrete class for every possible operator). Then, a separate Parser would be needed to go through the Page and store the stream of operators in a collection of some sort (vector, array, list, etc.). Then, another pass could be made to consolidate / interpret groups of operators into paths. That is, a path starts with a MoveTo, consists of a series of LineTo and CurveTo operators, and is terminated by a Close, End, Stroke, or whatever.
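The store-then-consolidate idea above might look roughly like this in Java. Op, StoredPath and PathGrouper are hypothetical names, not PDFBox classes. The sketch uses a single generic operator class instead of one class per operator, covers only the PDF path construction operators (m, l, c, v, y, h, re) and path-painting operators (S, s, f, F, f*, B, B*, b, b*, n), and ignores clipping (W, W*) and graphics state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** One content-stream operator with its numeric operands, kept for later use. */
final class Op {
    final String name;      // PDF operator name, e.g. "m", "l", "c", "h", "S"
    final float[] operands;

    Op(String name, float... operands) {
        this.name = name;
        this.operands = operands;
    }
}

/** A run of construction operators terminated by one painting operator. */
final class StoredPath {
    final int id;            // index of the path on the page, usable as a handle
    final List<Op> segments; // the construction operators, in stream order
    final String paintOp;    // the operator that terminated the path
    boolean visible = true;  // example of a per-path edit an application might keep

    StoredPath(int id, List<Op> segments, String paintOp) {
        this.id = id;
        this.segments = segments;
        this.paintOp = paintOp;
    }
}

final class PathGrouper {
    private static final Set<String> CONSTRUCTION =
            Set.of("m", "l", "c", "v", "y", "h", "re");
    private static final Set<String> PAINT =
            Set.of("S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n");

    /** Second pass: consolidates a flat operator list into paths. */
    static List<StoredPath> group(List<Op> ops) {
        List<StoredPath> paths = new ArrayList<>();
        List<Op> pending = new ArrayList<>();
        for (Op op : ops) {
            if (CONSTRUCTION.contains(op.name)) {
                pending.add(op);
            } else if (PAINT.contains(op.name)) {
                if (!pending.isEmpty()) {
                    paths.add(new StoredPath(paths.size(), pending, op.name));
                }
                pending = new ArrayList<>();
            }
            // Any other operator (state, text, images) is skipped in this sketch.
        }
        return paths;
    }
}
```

The id here is just the path's position in the stream, so it stays stable only as long as the operator list is not reordered. An editor that reorders paths would need a different identity scheme.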
I will want to be able to visualize the manipulated page -- so I'll either need to write my own renderer to work from my page data structure, or I will need to be able to re-serialize my data structure back into a PDF stream and then feed the modified page to the main renderer I'm using.

Does this kind of capability already exist in PDFBox -- perhaps in one of the examples? Or possibly in a third-party open source project that uses PDFBox?

Does this seem like the right approach with PDFBox? Am I missing an obviously better way?

Does anyone know of an alternative library that would be more suitable for these use cases and abstractions?

Thanks in advance for any help. Thanks also for all the work that has gone into PDFBox so far.

Rob