Thanks much for those pointers.

Rob

On Fri, Dec 6, 2024 at 11:23 PM Tilman Hausherr <thaush...@t-online.de> wrote:

> Here's something that comes close:
>
> https://stackoverflow.com/questions/38931422/pdfbox-2-0-2-calling-of-pagedrawer-processpage-method-caught-exceptions
>
> https://stackoverflow.com/questions/55166990/pdfbox-line-rectangle-extraction
>
> It collects lines for later use. You need to alter that code so it
> also collects non-rectangular paths.
>
> Tilman
>
> On 07.12.2024 07:58, Rob McDonald wrote:
> > I am considering starting a new project and I'm looking at using
> > PDFBox to do it. I would appreciate any thoughts on the
> > appropriateness of PDFBox vs. other PDF libraries -- Java or C++.
> >
> > My program will be similar in many ways to a vector drawing program
> > (Inkscape, Illustrator, etc.). I want to be able to parse a page of
> > a PDF document and work with the entities in memory. I am mostly
> > interested in vector graphics paths -- I don't really care about
> > the text or raster images.
> >
> > A user will need to be able to click on a given path to select it.
> > They should then be able to manipulate that path -- perhaps suppress
> > it from display, change the stroke width or color, re-order it, etc.
> > A particular path needs to be uniquely identifiable and
> > manipulable. The program needs to be interactive -- there is not
> > enough information available a priori to process a file or a page
> > in a batch manner.
> >
> > From what I can tell, PDFBox mostly treats a PDF file as a stream.
> > It reads a file incrementally, processing as it goes. Each page is
> > processed operator by operator, without storing anything in memory
> > beyond the current operator and its operands. In this way, memory
> > usage is kept very low -- even for documents with many pages or
> > very complex pages.
> >
> > From what I can tell, the existing operator data structures are set
> > up to take action (process(), i.e. draw, print, or convert), but
> > are not set up for storage -- keeping the data around to do
> > something with later.
> >
> > I can imagine constructing data structures to store each operator
> > with its operands (this will need a concrete class for every
> > possible operator). Then, a separate Parser would be needed to go
> > through the Page and store the stream of operators into a
> > collection of some sort (vector, array, list, etc.).
> >
> > Then, another pass could be made to consolidate / interpret groups
> > of operators into paths. I.e. a path starts with a MoveTo, consists
> > of a bunch of LineTo's and CurveTo's, and is terminated by a Close,
> > End, Stroke, or whatever.
> >
> > I will want to be able to visualize the manipulated page -- so I'll
> > either need to write my own renderer to work from my page data
> > structure, or I will need to be able to re-serialize my data
> > structure back into a PDF stream and then feed the modified page to
> > the main renderer I'm using.
> >
> > Does this kind of capability already exist in PDFBox -- perhaps in
> > one of the examples? Or possibly in a 3rd party open source project
> > that uses PDFBox?
> >
> > Does this seem like the right approach with PDFBox? Am I missing an
> > obviously better way?
> >
> > Does anyone know of an alternate library that would be more suitable
> > for these use cases and abstractions?
> >
> > Thanks in advance for any help. Thanks also for all the work that
> > has gone into PDFBox so far.
> >
> > Rob
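
For reference, the two-pass idea described in the quoted message (store
every operator with its operands, then consolidate runs of operators
into paths) could be sketched roughly as below. This is plain Java and
makes no PDFBox calls; Op, StoredPath, and PathGrouper are hypothetical
names invented for this illustration, not PDFBox classes. The operator
names (m, l, c, v, y, h, re for construction; S, s, f, F, f*, B, B*, b,
b*, n for painting) are the ones from the PDF specification. The sketch
assumes the operators have already been parsed into a flat list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

/** One content-stream operator with its numeric operands (hypothetical type). */
final class Op {
    final String name;
    final double[] operands;

    Op(String name, double... operands) {
        this.name = name;
        this.operands = operands;
    }
}

/** A painted path: its construction operators plus the operator that painted it. */
final class StoredPath {
    final int id;                 // position in the result list, usable as a selection handle
    final List<Op> construction;  // m, l, c, v, y, h, re
    final String paintOperator;   // S, f, B, n, ...

    StoredPath(int id, List<Op> construction, String paintOperator) {
        this.id = id;
        this.construction = construction;
        this.paintOperator = paintOperator;
    }
}

final class PathGrouper {
    // Path construction and path painting operator names from the PDF specification.
    static final Set<String> CONSTRUCTION = Set.of("m", "l", "c", "v", "y", "h", "re");
    static final Set<String> PAINTING =
            Set.of("S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n");

    /** Groups a flat operator list into paths; all other operators are skipped. */
    static List<StoredPath> group(List<Op> ops) {
        List<StoredPath> paths = new ArrayList<>();
        List<Op> current = new ArrayList<>();
        for (Op op : ops) {
            if (CONSTRUCTION.contains(op.name)) {
                current.add(op);
            } else if (PAINTING.contains(op.name) && !current.isEmpty()) {
                // A painting operator ends the current path, which may hold several subpaths.
                paths.add(new StoredPath(paths.size(), current, op.name));
                current = new ArrayList<>();
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        List<Op> ops = Arrays.asList(
                new Op("m", 10, 10), new Op("l", 50, 10), new Op("l", 50, 40),
                new Op("h"), new Op("S"),
                new Op("w", 2),                       // line width: ignored by this sketch
                new Op("re", 0, 0, 20, 20), new Op("f"));
        for (StoredPath p : group(ops)) {
            System.out.println(p.id + ": " + p.construction.size()
                    + " construction ops, painted with " + p.paintOperator);
        }
    }
}
```

Note that this sketch throws away everything that is not a path
operator. A real version would also have to track the graphics state
(line width, colors, transformation matrix, clipping) in effect when
each path is painted, since that is what the stroke width and color
edits would act on.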