PS: I should clarify: if your PDF generator actually writes OCG layers, then you may be able to use them. Just know that *not* all PDFs have OCG layers. So if you are creating a general-purpose tool, you will need to handle the content stream operations yourself.
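To make "handle the content stream operations" a bit more concrete, here is a minimal sketch of the BT/ET filtering idea described in the quoted message below. It is plain Java and deliberately does not call PDFBox: it assumes the operations of one content stream have already been parsed into operator/operand pairs by whatever parser you use. The names ContentStreamFilter, Op, dropText and keepText are made up for illustration, and treating q, Q and cm as the coordinate system operators to preserve is my assumption, so check it against the spec. Note also that any graphics state operators that sit inside a BT/ET block (colour changes, for example) are removed along with the text in dropText.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of PDFBox: filters already-parsed operations.
final class ContentStreamFilter {

    /** One operation: an operator name plus the operands that precede it in the stream. */
    static final class Op {
        final String operator;
        final List<String> operands;

        Op(String operator, String... operands) {
            this.operator = operator;
            this.operands = List.of(operands);
        }

        @Override
        public String toString() {
            // PDF content streams are postfix: operands first, operator last.
            return operands.isEmpty() ? operator : String.join(" ", operands) + " " + operator;
        }
    }

    // Assumption: these are the operators outside BT/ET that affect the coordinate system.
    private static final Set<String> COORDINATE_OPS = Set.of("q", "Q", "cm");

    /** Graphics-only version: drop BT, ET and everything between them. */
    static List<Op> dropText(List<Op> ops) {
        List<Op> out = new ArrayList<>();
        boolean inText = false;
        for (Op op : ops) {
            if (op.operator.equals("BT")) {
                inText = true;
            } else if (op.operator.equals("ET")) {
                inText = false;
            } else if (!inText) {
                out.add(op);
            }
        }
        return out;
    }

    /** Text-only version: keep BT..ET blocks plus coordinate system operators outside them. */
    static List<Op> keepText(List<Op> ops) {
        List<Op> out = new ArrayList<>();
        boolean inText = false;
        for (Op op : ops) {
            if (op.operator.equals("BT")) {
                inText = true;
            }
            if (inText || COORDINATE_OPS.contains(op.operator)) {
                out.add(op);
            }
            if (op.operator.equals("ET")) {
                inText = false;
            }
        }
        return out;
    }
}
```

Given the operations q, cm, BT, Tf, Tj, ET, re, f, Q, dropText would keep q, cm, re, f and Q, while keepText would keep q, cm, the whole BT..ET block, and Q. Writing the filtered operations back into a page is left out here.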
I don't have experience with the OCG features in PDFBox, so I'll leave it to others to comment on how to do that if your source documents definitely have OCG data.

On Sun, Jun 23, 2024, 10:25 AM Kevin Day <ke...@trumpetinc.com> wrote:

> Certainly possible. Not simple, though - and I don't think you will find
> sample code... PDFs don't have "layers" like you are suggesting - they just
> have a sequence of operations. You will need to interpret those operations.
>
> As a general strategy, you'll want to process the operations in the
> content stream. Anything between a BT and ET operator will be text related.
> Everything else will be image or vector operations.
>
> It will probably be easiest to think of this as a filtering operation. So
> you will want to suppress every operation between BT and ET to create your
> image version of the PDF - but leave everything else alone.
>
> Be aware that there can be multiple content streams for a page, so you'll
> need to check for that. But the PDF spec does not allow a BT in one stream
> to be closed by an ET in a different stream. So you should be able to just
> filter each stream individually.
>
> Finally, for the text-only extraction, I'm pretty sure you will need to
> make sure you preserve any coordinate system operators outside of the BT/ET
> blocks. It's been a while since I've looked at this, so I might be wrong
> (i.e. it's possible that the text coordinate system is completely
> independent of the regular coordinate system operators).
>
> That should do it - you should plan on spending some time reading the
> coordinate system details in the PDF spec to figure out which operators you
> need to preserve.
>
> K
>
>
> On Sun, Jun 23, 2024, 3:47 AM PDF Developer <pdf...@yahoo.com.invalid>
> wrote:
>
>> Hello,
>> I have been asked to process a large number of PDFs and, for reasons I
>> can't go into, I need to separate the text from the graphics. I know I can
>> create separate PDFs from the originals (using a variety of tools) but I
>> prefer not to, mainly for speed reasons.
>> So I thought it might be possible to use OCGs (aka Layers) for this,
>> parsing the PDPageContentStream into two buckets, one for text and the other
>> for graphics.
>> If this is feasible, does anyone know of any sample code that might be
>> relevant that I could use to kick start things?
>> Thanks in advance.
>> PDFDev/
>>
>