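Also note that text can also live in interactive form fields and annotations. Those are drawn from their own appearance streams, not from the page content streams, so filtering the page content alone will not catch them. I don't know whether you'd want to treat that text as part of the text layer too.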

BR
Maruan

On Sunday, 23.06.2024 at 10:32 -0700, Kevin Day wrote:
> PS I should clarify: if your PDF generator actually specifies OCG layers,
> then you may be able to use that. Just know that *not* all PDFs have OCG
> layers. So if you are creating a general-purpose tool, you will need to
> handle the content stream operations.
>
> I don't have experience with the OCG features in PDFBox, so I'll leave it
> to others to comment on how to do that if your source documents for sure
> have OCG data.
>
> On Sun, Jun 23, 2024, 10:25 AM Kevin Day <ke...@trumpetinc.com> wrote:
>
> > Certainly possible. Not simple, though - and I don't think you will find
> > sample code... PDFs don't have "layers" like you are suggesting - they
> > just have a sequence of operations. You will need to interpret those
> > operations.
> >
> > As a general strategy, you'll want to process the operations in the
> > content stream. Anything between a BT and ET operator will be text
> > related. Everything else will be image or vector operations.
> >
> > It will probably be easiest to think of this as a filtering operation.
> > So you will want to suppress every operation between BT and ET to create
> > your image version of the PDF - but leave everything else alone.
> >
> > Be aware that there can be multiple content streams for a page, so
> > you'll need to check for that. But the PDF spec does not allow a BT in
> > one stream to be closed by an ET in a different stream. So you should be
> > able to just filter each stream individually.
> >
> > Finally, for the text-only extraction, I'm pretty sure you will need to
> > make sure you preserve any coordinate system operators outside of the
> > BT/ET blocks. It's been a while since I've looked at this, so I might be
> > wrong (i.e. it's possible that the text coordinate system is completely
> > independent of the regular coordinate system operators).
> >
> > That should do it - you should plan on spending some time reading the
> > coordinate system details in the PDF spec to figure out which operators
> > you need to preserve.
> >
> > K
> >
> > On Sun, Jun 23, 2024, 3:47 AM PDF Developer <pdf...@yahoo.com.invalid>
> > wrote:
> >
> > > Hello,
> > > I have been asked to process a large number of PDFs and, for reasons I
> > > can't go into, I need to separate the text from the graphics. I know I
> > > can create separate PDFs from the originals (using a variety of tools)
> > > but I prefer not to, mainly for speed reasons.
> > > So I thought it might be possible to use OCGs (aka Layers) for this:
> > > parsing the PDPageContentStream into two buckets, one for text and the
> > > other for graphics.
> > > If this is feasible, does anyone know of any sample code that might be
> > > relevant that I could use to kick-start things?
> > > Thanks in advance.
> > > PDFDev/
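
For anyone who finds this thread later: the BT/ET filtering Kevin describes in the quoted message can be sketched roughly as below. This is only an illustration under simplifying assumptions, not PDFBox code. The class BtEtFilter and its methods are hypothetical, and each operation is modelled as a plain string whose last token is the operator. In a real tool you would take the tokens from PDFBox's PDFStreamParser instead, and you would have to cope with inline images (BI ... ID ... EI), whose binary data this sketch does not handle. The set of operators kept outside BT/ET for the text-only version (q, Q, cm) is also an assumption. It reflects the fact that text is positioned through the current transformation matrix as well as the text matrix, but text state operators such as Tf can legally appear outside BT/ET too, so the real list is likely longer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of PDFBox: splits simplified content-stream operations. */
class BtEtFilter {

    // Operators kept outside BT/ET in the text-only output (assumed minimal set).
    private static final Set<String> GRAPHICS_STATE = Set.of("q", "Q", "cm");

    /** Returns the operator, i.e. the last whitespace-separated token of an operation. */
    static String operatorOf(String operation) {
        String[] tokens = operation.trim().split("\\s+");
        return tokens[tokens.length - 1];
    }

    /** Image/vector version: drops everything from BT through ET inclusive. */
    static List<String> withoutText(List<String> operations) {
        List<String> kept = new ArrayList<>();
        boolean inText = false;
        for (String op : operations) {
            String operator = operatorOf(op);
            if (operator.equals("BT")) {
                inText = true;
            } else if (operator.equals("ET")) {
                inText = false;
            } else if (!inText) {
                kept.add(op);
            }
        }
        return kept;
    }

    /** Text version: keeps BT..ET blocks plus q, Q and cm found outside them. */
    static List<String> textOnly(List<String> operations) {
        List<String> kept = new ArrayList<>();
        boolean inText = false;
        for (String op : operations) {
            String operator = operatorOf(op);
            if (operator.equals("BT")) {
                inText = true;
                kept.add(op);
            } else if (operator.equals("ET")) {
                inText = false;
                kept.add(op);
            } else if (inText || GRAPHICS_STATE.contains(operator)) {
                kept.add(op);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // One page content stream, already split into operations (made-up sample data).
        List<String> stream = List.of(
                "q",
                "1 0 0 1 50 700 cm",
                "0 0 100 50 re",
                "f",
                "BT",
                "/F1 12 Tf",
                "(Hello) Tj",
                "ET",
                "Q");
        System.out.println("graphics: " + withoutText(stream));
        System.out.println("text:     " + textOnly(stream));
    }
}
```

With the sample stream, the graphics version keeps q, cm, re, f and Q, while the text version keeps q, cm, the whole BT ... ET block and Q. Because a BT cannot be closed by an ET in a different stream, the same filter can be run on each content stream of a page separately.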