Certainly possible. Not simple, though - and I don't think you will find
sample code... PDFs don't have "layers" like you are suggesting - they just
have a sequence of operations. You will need to interpret those operations.

As a general strategy, you'll want to process the operations in the content
stream. Anything between a BT and an ET operator is part of a text object.
Everything else will be image or vector operations, or graphics state
changes.

It will probably be easiest to think of this as a filtering operation. So
you will want to suppress every operation between BT and ET to create your
image version of the PDF - but leave everything else alone.
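
As a rough illustration of that filter, here is a minimal Java sketch. It
does not use PDFBox: the class name GraphicsOnlyFilter and both method names
are made up for this example, and the whitespace tokenizer is only a
stand-in. It will break on string literals that contain spaces and on inline
image data, so a real version would take its tokens from a proper content
stream parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class GraphicsOnlyFilter {

    // Naive stand-in tokenizer (assumption): splits on whitespace only.
    static List<String> tokenize(String content) {
        String trimmed = content.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Drops BT, ET, and every token between them; keeps everything else.
    static List<String> dropTextObjects(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        boolean inText = false;
        for (String tok : tokens) {
            if (tok.equals("BT")) {
                inText = true;
            } else if (tok.equals("ET")) {
                inText = false;
            } else if (!inText) {
                kept.add(tok);
            }
        }
        return kept;
    }
}
```

For example, running "q BT /F1 12 Tf (Hi) Tj ET 0 0 50 50 re f Q" through
tokenize and then dropTextObjects leaves "q 0 0 50 50 re f Q".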

Be aware that there can be multiple content streams for a page, so you'll
need to check for that. As far as I can tell, the PDF spec treats them as
one concatenated stream that may be split at any token boundary, so a BT in
one stream can be closed by an ET in a different stream. To be safe, join
the streams before filtering rather than filtering each one individually.
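
One conservative way to deal with multiple streams is to join them first and
filter the result, which works whether or not a text object ever spans two
streams. A minimal Java sketch, assuming you already have each stream's
decoded bytes in page order (StreamJoiner and join are made-up names, and
nothing here is PDFBox API):

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class StreamJoiner {

    // Concatenates decoded content streams in order. A newline is written
    // after each one so the last token of a stream cannot run into the
    // first token of the next.
    static byte[] join(List<byte[]> decodedStreams) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] stream : decodedStreams) {
            out.write(stream, 0, stream.length);
            out.write('\n');
        }
        return out.toByteArray();
    }
}
```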

Finally, for the text-only extraction, I'm pretty sure you will need to
make sure you preserve any coordinate system operators outside of the BT/ET
blocks. It's been a while since I've looked at this, so I might be wrong
(i.e. it's possible that the text coordinate system is completely
independent of the regular coordinate system operators).
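
To make the text-only side concrete, here is a minimal Java sketch of the
inverse filter. It keeps every BT/ET block and, outside those blocks, keeps
only the operators in an allow-list together with their operands. The
allow-list is my assumption and only a starting point: q, Q and cm because,
as far as I understand the spec, text is positioned through the current
transformation matrix, plus the text state operators, which may be set
outside a text object. Colour operators are left out here and you may want
them. TextOnlyFilter and its methods are made-up names, and the input is
assumed to be already tokenized by a real parser (one token per string,
array bracket or inline image).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of PDFBox.
final class TextOnlyFilter {

    // Assumed allow-list: save/restore, CTM changes, text state operators.
    private static final Set<String> KEEP_OUTSIDE_TEXT = new HashSet<>(
        Arrays.asList("q", "Q", "cm", "Tc", "Tw", "Tz", "TL", "Tf", "Tr", "Ts"));

    // Heuristic (assumption): operators start with a letter, ' or ".
    // This misreads the operands true, false and null as operators.
    static boolean isOperator(String tok) {
        char c = tok.charAt(0);
        return Character.isLetter(c) || c == '\'' || c == '"';
    }

    static List<String> keepTextObjects(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        List<String> pending = new ArrayList<>(); // operands awaiting an operator
        boolean inText = false;
        for (String tok : tokens) {
            if (inText) {
                kept.add(tok); // copy the text object verbatim, ET included
                if (tok.equals("ET")) {
                    inText = false;
                }
            } else if (tok.equals("BT")) {
                pending.clear();
                kept.add(tok);
                inText = true;
            } else if (isOperator(tok)) {
                if (KEEP_OUTSIDE_TEXT.contains(tok)) {
                    kept.addAll(pending);
                    kept.add(tok);
                }
                pending.clear(); // operands of a dropped operator go too
            } else {
                pending.add(tok);
            }
        }
        return kept;
    }
}
```

Given the tokens of "q 1 0 0 1 72 720 cm 0 0 50 50 re f BT /F1 12 Tf (Hi) Tj
ET Q", this keeps "q 1 0 0 1 72 720 cm BT /F1 12 Tf (Hi) Tj ET Q" and drops
the rectangle.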

That should do it - you should plan on spending some time reading the
coordinate system details in the PDF spec to figure out which operators you
need to preserve.

K


On Sun, Jun 23, 2024, 3:47 AM PDF Developer <pdf...@yahoo.com.invalid>
wrote:

> Hello,
> I have been asked to process a large number of PDFs and, for reasons I
> can't go into, I need to separate the text from the graphics. I know I can
> create separate PDFs from the originals (using a variety of tools) but I
> prefer not to, mainly for speed reasons.
> So I thought it might be possible to use OCGs (aka Layers) for this,
> parsing the PDPageContentStream into two buckets, one for text and the
> other for graphics.
> If this is feasible, does anyone know of any sample code that might be
> relevant that I could use to kick start things?
> Thanks in advance.
> PDFDev/
>
