Hi John!
Yes, clipping the PDF is basically what I would like to do! So would
pdfbox be the best choice for this? I have looked a lot for a library but
it does not seem that there are many open source tools out there.
My goal is a program that lets me clip PDFs and create a composed PDF
out of all the clips, and maybe you could tell me if pdfbox would be the
best choice for such a task.
@fairly difficult: Well yes, I was quite astonished to find out that
extracting content from a PDF is actually a scientific topic :D
Best regards,
Stefan
On 2015-01-15 03:21, John Hewson wrote:
Hi Stefan
What you’re describing is actually fairly difficult due to the complexity of
the PDF operators. We have a special processor for text in PDFBox, but it is
not necessarily accurate.
If you’re just trying to embed pages from existing PDFs into new PDFs then the
SuperimposePage example which comes with PDFBox might already serve your needs.
If you specify a custom BBox for the FormXObject, then you can use that to clip
the page - which sounds like what you want. Please note that this technique
still embeds all of the original page contents, so it’s not suitable for
removing private or sensitive data, but otherwise it’s fine.
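A minimal sketch of how a mouse selection could be turned into the rectangle used for such a BBox. It uses no PDFBox calls; it only shows the coordinate conversion, since screen coordinates start at the top left with y pointing down, while default PDF user space starts at the bottom left with y pointing up and is measured in points (1/72 inch). The names ClipBox and toPdfSpace are hypothetical, and the sketch assumes an unrotated page whose lower-left corner is at (0, 0), rendered at a uniform scale in pixels per point.

```java
/** Hypothetical helper: maps a mouse selection on a rendered page to PDF user space. */
final class ClipBox {

    /**
     * Converts a selection rectangle from screen pixels to PDF points.
     *
     * @param px           left edge of the selection in pixels
     * @param py           top edge of the selection in pixels (y grows downwards)
     * @param pw           selection width in pixels
     * @param ph           selection height in pixels
     * @param pageHeightPt page height in PDF points
     * @param scale        pixels per PDF point used when the page was rendered (assumed uniform)
     * @return {x, y, width, height} in PDF user space, where (x, y) is the lower-left corner
     */
    static double[] toPdfSpace(double px, double py, double pw, double ph,
                               double pageHeightPt, double scale) {
        double x = px / scale;
        double width = pw / scale;
        double height = ph / scale;
        // The bottom edge of the selection on screen is py + ph; flip it against the page height.
        double y = pageHeightPt - (py + ph) / scale;
        return new double[] {x, y, width, height};
    }

    public static void main(String[] args) {
        // US Letter is 612 x 792 points; assume it was rendered at 2 pixels per point.
        double[] box = toPdfSpace(100, 200, 300, 150, 792, 2.0);
        System.out.println(java.util.Arrays.toString(box));
    }
}
```

The four resulting values are what one would pass on as the lower-left x, lower-left y, width and height of the clip rectangle. A rotated page or a crop box that does not start at the origin would need an extra offset or transform, which this sketch leaves out.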
If you have PDFs which PDFReader can’t render, please try using the 2.0 trunk
version of PDFBox, where we have fixed many bugs.
Thanks
-- John
On 14 Jan 2015, at 15:14, Stefan Falk <[email protected]> wrote:
Well, basically just extract it and load it into another PDF, but selecting
the content should be possible with the mouse, for example.
On 2015-01-14 22:52, Maruan Sahyoun wrote:
What would you like to do with that content?
BR
Maruan
On 14 Jan 2015, at 21:42, Stefan Falk <[email protected]> wrote:
Hello pdfbox people!
I was wondering if anybody can help me with my needs. What I am looking for is
a way to extract the underlying PDF code from a PDF file by simply
selecting an area with the mouse.
After reading a few things about PDFs I have learned that anything that has to
do with extracting content from a PDF can be quite a hard task.
So I was wondering if pdfbox could do that somehow. I've taken a rough look at
the PDFReader and I noticed that there is e.g. processTextPosition in the
class PageDrawer, which seems to allow me to get at least the position of text -
am I right in assuming that?
My concrete question would be: what is possible with pdfbox regarding this matter? E.g. I
have a PDF on my drive whose text seems to be "extractable" by pdfbox on the
one hand, but on the other hand the PDFReader is not able to render any of it. It just
renders the images (see attachment).
Thank you for your help in advance!
Best regards,
Stefan