Certainly possible. Not simple, though - and I don't think you will find sample code... PDFs don't have "layers" like you are suggesting - they just have a sequence of operations. You will need to interpret those operations.
As a general strategy, you'll want to process the operations in the content stream. Anything between a BT and an ET operator will be text related. Everything else will be image or vector operations.
It will probably be easiest to think of this as a filtering operation. So you will want to suppress every operation between BT and ET to create your image version of the PDF, but leave everything else alone.
Be aware that there can be multiple content streams for a page, so you'll need to check for that. I believe the PDF spec does not allow a BT in one stream to be closed by an ET in a different stream, so you should be able to just filter each stream individually. (The spec treats multiple streams as if they were concatenated in order, so filtering the concatenation is the safe fallback if that turns out not to hold.)
Finally, for the text-only extraction, I'm pretty sure you will need to make sure you preserve any coordinate system operators outside of the BT/ET blocks. It's been a while since I've looked at this, so I might be wrong (i.e. it's possible that the text coordinate system is completely independent of the regular coordinate system operators).
That should do it. You should plan on spending some time reading the coordinate system details in the PDF spec to figure out which operators you need to preserve.

K

On Sun, Jun 23, 2024, 3:47 AM PDF Developer <pdf...@yahoo.com.invalid> wrote:

> Hello,
> I have been asked to process a large number of PDF and, for reasons I
> can't go into, I need to separate the text from the graphics. I know I can
> create separate PDFs from the originals (using a variety of tools) but I
> prefer not to, mainly for speed reasons.
> So I thought it might be possible to use OCGs (aka Layers) for this.
> Parsing the PDPageContentStream in two buckets, one for text and the other
> for graphics.
> If this is feasible, does anyone know of any sample code that might be
> relevant that I could use to kick start things?
> Thanks in advance.
> PDFDev/
>
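
For what it's worth, here is a rough sketch of the BT/ET filtering idea from the reply above. It is a toy, not working PDFBox code: `ContentStreamFilter`, `Mode`, and both `filter` methods are hypothetical names, and it assumes the stream has already been split into whitespace-separated tokens with no spaces inside strings and no inline images. A real content stream would need a proper tokenizer (PDFBox ships one, `PDFStreamParser`). Keeping only `q`, `Q` and `cm` for the text-only output is also an assumption; other graphics state set outside BT/ET (fill color, clipping, and so on) affects how text is painted too, so that list would likely have to grow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of PDFBox or any other library.
final class ContentStreamFilter {

    enum Mode { GRAPHICS_ONLY, TEXT_ONLY }

    // Simplification: operands start with a digit, sign, dot or a delimiter
    // such as / ( [ <, and anything else is treated as an operator keyword.
    static boolean isOperator(String tok) {
        if (tok.isEmpty() || tok.equals("true") || tok.equals("false") || tok.equals("null")) {
            return false;
        }
        char c = tok.charAt(0);
        return !(Character.isDigit(c) || "+-./([]<>".indexOf(c) >= 0);
    }

    // GRAPHICS_ONLY drops everything from BT through ET.
    // TEXT_ONLY keeps BT...ET blocks plus q, Q and cm (with operands) outside them.
    static List<String> filter(List<String> tokens, Mode mode) {
        List<String> out = new ArrayList<>();
        List<String> operands = new ArrayList<>(); // operands waiting for their operator
        boolean inText = false;

        for (String tok : tokens) {
            if (tok.equals("BT") || tok.equals("ET")) {
                inText = tok.equals("BT");
                operands.clear();
                if (mode == Mode.TEXT_ONLY) {
                    out.add(tok);
                }
            } else if (inText) {
                // Inside a text object: pass through verbatim or suppress entirely.
                if (mode == Mode.TEXT_ONLY) {
                    out.add(tok);
                }
            } else if (isOperator(tok)) {
                boolean keep = mode == Mode.GRAPHICS_ONLY
                        || tok.equals("q") || tok.equals("Q") || tok.equals("cm");
                if (keep) {
                    out.addAll(operands);
                    out.add(tok);
                }
                operands.clear();
            } else {
                operands.add(tok);
            }
        }
        return out;
    }

    // Convenience wrapper for the toy whitespace-tokenized form.
    static String filter(String stream, Mode mode) {
        List<String> tokens = Arrays.asList(stream.trim().split("\\s+"));
        return String.join(" ", filter(tokens, mode));
    }

    public static void main(String[] args) {
        String page = "q 1 0 0 1 50 700 cm 0 0 100 50 re f BT /F1 12 Tf (Hello) Tj ET Q";
        // prints: q 1 0 0 1 50 700 cm 0 0 100 50 re f Q
        System.out.println(filter(page, Mode.GRAPHICS_ONLY));
        // prints: q 1 0 0 1 50 700 cm BT /F1 12 Tf (Hello) Tj ET Q
        System.out.println(filter(page, Mode.TEXT_ONLY));
    }
}
```

The same loop handles both outputs so that the two results stay complementary: every token outside the kept graphics-state operators ends up in exactly one of them.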