Re: Extract underlying PDF code from PDF file by selecting an area

Stefan Falk Wed, 14 Jan 2015 23:51:10 -0800

Hi John!

Yes, clipping the PDF is basically what I would like to do! So would pdfbox be the best choice for this? I have looked a lot for a library, but it does not seem that there are many open source tools out there.

My goal is a program that lets the user clip PDFs in order to create a composed PDF out of all the clips. Maybe you could tell me whether pdfbox would be the best choice for such a task.

@fairly difficult: Well yes, I was quite astonished to find out that extracting content from a PDF is actually a scientific topic :D


Best regards,
Stefan

On 2015-01-15 03:21, John Hewson wrote:

Hi Stefan

What you’re describing is actually fairly difficult due to the complexity of 
the PDF operators. We have a special processor for text in PDFBox, but it is 
not necessarily accurate.

If you’re just trying to embed pages from existing PDFs into new PDFs then the 
SuperimposePage example which comes with PDFBox might already serve your needs. 
If you specify a custom BBox for the FormXObject, then you can use that to clip 
the page - which sounds like what you want. Please note that this technique 
still embeds all of the original page contents, so it’s not suitable for 
removing private or sensitive data, but otherwise it’s fine.
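
A minimal sketch of how a mouse selection could be turned into the rectangle 
used for such a BBox, assuming the page is rendered unrotated at a uniform 
scale and its lower-left corner sits at the PDF origin. ClipBox and toPdfBBox 
are hypothetical names for illustration only and are not part of PDFBox; 
passing the result to a form XObject is left out here.

```java
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox.
final class ClipBox {
    private ClipBox() {}

    /**
     * Converts a mouse selection given in rendered-image pixels
     * (origin top-left, y pointing down) into a PDF rectangle
     * {llx, lly, urx, ury} in page units (origin bottom-left, y pointing up).
     *
     * Assumptions: no page rotation, uniform scale, page origin at (0, 0).
     *
     * @param scale      pixels per PDF unit used when rendering the page
     * @param pageHeight page height in PDF units
     */
    static float[] toPdfBBox(float x1, float y1, float x2, float y2,
                             float scale, float pageHeight) {
        if (scale <= 0) {
            throw new IllegalArgumentException("scale must be positive");
        }
        // Normalise so the drag direction of the mouse does not matter.
        float left = Math.min(x1, x2) / scale;
        float right = Math.max(x1, x2) / scale;
        float top = Math.min(y1, y2) / scale;
        float bottom = Math.max(y1, y2) / scale;
        // Flip the y axis: screen y grows downwards, PDF y grows upwards.
        return new float[] { left, pageHeight - bottom, right, pageHeight - top };
    }

    public static void main(String[] args) {
        // Drag from (300, 150) to (100, 50) on a 792-unit-high page rendered at 2x.
        float[] bbox = toPdfBBox(300f, 150f, 100f, 50f, 2.0f, 792f);
        System.out.println(Arrays.toString(bbox)); // prints [50.0, 717.0, 150.0, 767.0]
    }
}
```

The four values are in the lower-left/upper-right order that PDF rectangles 
use, so they could serve as the custom BBox described above.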

If you have PDFs which PDFReader can’t render, please try using the 2.0 trunk 
version of PDFBox, where we have fixed many bugs.

Thanks

-- John

On 14 Jan 2015, at 15:14, Stefan Falk <[email protected]> wrote:

Well, basically just extract it to load it into another PDF, but it should be 
possible to do so interactively, e.g. with the mouse.


On 2015-01-14 22:52, Maruan Sahyoun wrote:

what would you like to do with that content?

BR
Maruan

Am 14.01.2015 um 21:42 schrieb Stefan Falk <[email protected]>:

Hello pdfbox people!

I was wondering if anybody can help me with my needs. What I am looking for is 
a possibility to extract the underlying PDF code from a PDF file by simply 
selecting an area with your mouse.

After reading a few things about PDFs I have learned that anything that has to 
do with extracting content from a PDF can be quite a hard task.

So I was wondering if pdfbox could do that somehow. I've taken a rough look at 
the PDFReader and I noticed that there is e.g. processTextPosition in the 
class PageDrawer that seems to allow me to get at least the position of text - 
am I right in assuming that?

My concrete question is: what is possible with pdfbox regarding this matter? E.g. I 
have a PDF on my drive whose text seems to be "extractable" by pdfbox on the 
one hand, but on the other hand the PDFReader is not able to render any of it. It just 
renders the images (see attachment).

Thank you for your help in advance!

Best regards,
Stefan
