Also note that text might be contained in interactive form fields and
annotations. I don't know whether you'd like to treat that text content
as a text layer too.

BR
Maruan

On Sunday, 23.06.2024 at 10:32 -0700, Kevin Day wrote:
> PS: I should clarify: if your PDF generator actually specifies OCG
> layers, then you may be able to use them. Just know that not all PDFs
> have OCG layers. So if you are creating a general-purpose tool, you
> will need to handle the content stream operations.
> 
> I don't have experience with the OCG features in PDFBox, so I'll leave
> it to others to comment on how to do that if your source documents
> definitely have OCG data.
> 
> On Sun, Jun 23, 2024, 10:25 AM Kevin Day <ke...@trumpetinc.com> wrote:
> 
> > Certainly possible. Not simple, though - and I don't think you will
> > find sample code... PDFs don't have "layers" like you are suggesting -
> > they just have a sequence of operations. You will need to interpret
> > those operations.
> > 
> > As a general strategy, you'll want to process the operations in the
> > content stream. Anything between a BT and ET operator will be
> > text-related. Everything else will be image or vector operations.
> > 
> > It will probably be easiest to think of this as a filtering operation.
> > So you will want to suppress every operation between BT and ET to
> > create your image version of the PDF - but leave everything else alone.
> > 
> > Be aware that there can be multiple content streams for a page, so
> > you'll need to check for that. But the PDF spec does not allow a BT in
> > one stream to be closed by an ET in a different stream. So you should
> > be able to just filter each stream individually.
> > 
> > Finally, for the text-only extraction, I'm pretty sure you will need
> > to make sure you preserve any coordinate system operators outside of
> > the BT/ET blocks. It's been a while since I've looked at this, so I
> > might be wrong (i.e. it's possible that the text coordinate system is
> > completely independent of the regular coordinate system operators).
> > 
> > That should do it - you should plan on spending some time reading the
> > coordinate system details in the PDF spec to figure out which
> > operators you need to preserve.
> > 
> > K
> > 
> > 
> > On Sun, Jun 23, 2024, 3:47 AM PDF Developer <pdf...@yahoo.com.invalid>
> > wrote:
> > 
> > > Hello,
> > > I have been asked to process a large number of PDFs and, for reasons
> > > I can't go into, I need to separate the text from the graphics. I
> > > know I can create separate PDFs from the originals (using a variety
> > > of tools), but I would prefer not to, mainly for speed reasons.
> > > So I thought it might be possible to use OCGs (aka Layers) for this:
> > > parsing the PDPageContentStream into two buckets, one for text and
> > > the other for graphics.
> > > If this is feasible, does anyone know of any sample code that might
> > > be relevant that I could use to kick-start things?
> > > Thanks in advance.
> > > PDFDev/
> > > 
> > 
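
The BT/ET filtering strategy described in the quoted messages above can be
sketched roughly as follows. This is only an illustration under stated
assumptions, not PDFBox code: the Op record and the TextGraphicsSplitter
class are hypothetical stand-ins for already-parsed content stream
operations (in PDFBox the tokens would presumably come from something like
PDFStreamParser). The text-only variant keeps q, Q and cm outside the text
blocks, because the PDF text rendering matrix is built on top of the
current transformation matrix. Inline images, colour operators and text
state operators set outside BT/ET are deliberately ignored here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class TextGraphicsSplitter {

    /** Hypothetical model (not a PDFBox type): one operator plus its operands. */
    public record Op(List<String> operands, String name) {
        public static Op of(String name, String... operands) {
            return new Op(List.of(operands), name);
        }
    }

    // Assumption: the minimal set of coordinate-related operators to keep in
    // the text-only output (save state, restore state, concatenate matrix).
    private static final Set<String> COORDINATE_OPS = Set.of("q", "Q", "cm");

    /** Graphics-only view: drops BT, ET and every operation between them. */
    public static List<Op> graphicsOnly(List<Op> ops) {
        List<Op> out = new ArrayList<>();
        boolean inText = false;
        for (Op op : ops) {
            if (op.name().equals("BT")) {
                inText = true;
            } else if (op.name().equals("ET")) {
                inText = false;
            } else if (!inText) {
                out.add(op);
            }
        }
        return out;
    }

    /** Text-only view: keeps BT..ET blocks plus the coordinate operators. */
    public static List<Op> textOnly(List<Op> ops) {
        List<Op> out = new ArrayList<>();
        boolean inText = false;
        for (Op op : ops) {
            if (op.name().equals("BT")) {
                inText = true;
                out.add(op);
            } else if (op.name().equals("ET")) {
                inText = false;
                out.add(op);
            } else if (inText || COORDINATE_OPS.contains(op.name())) {
                out.add(op);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A tiny hand-written page: one text block and one filled rectangle.
        List<Op> page = List.of(
            Op.of("q"),
            Op.of("cm", "1", "0", "0", "1", "72", "720"),
            Op.of("BT"),
            Op.of("Tf", "/F1", "12"),
            Op.of("Tj", "(Hello)"),
            Op.of("ET"),
            Op.of("re", "0", "0", "100", "50"),
            Op.of("f"),
            Op.of("Q"));
        System.out.println("graphics: " + graphicsOnly(page));
        System.out.println("text:     " + textOnly(page));
    }
}
```

Both methods take one flat list of operations, so the same code applies
whether that list comes from a single stream or from all of a page's
streams read in order. The filtered list would then have to be written
back as a new content stream for the page.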


