Re: Extract underlying PDF code from PDF file by selecting an area

John Hewson Thu, 15 Jan 2015 08:27:39 -0800

Yes, PDFBox can do this.

-- John


> On 14 Jan 2015, at 23:48, Stefan Falk <[email protected]> wrote:
> 
> Hi John!
> 
> Yes, clipping the PDF is basically what I would like to do! So would pdfbox 
> be the best choice for this? I have looked a lot for a library but it does not 
> seem that there are many open source tools out there.
> 
> My target is a program that allows clipping PDFs in order to create a composed 
> PDF out of all the clips, and maybe you could tell me if pdfbox would be the 
> best choice for such a task.
> 
> @fairly difficult: Well yes, I was quite astonished to find out that 
> extracting content from a PDF is actually a scientific topic :D
> 
> Best regards,
> Stefan
> 
>> On 2015-01-15 03:21, John Hewson wrote:
>> Hi Stefan
>> 
>> What you’re describing is actually fairly difficult due to the complexity of 
>> the PDF operators. We have a special processor for text in PDFBox, but it is 
>> not necessarily accurate.
>> 
>> If you’re just trying to embed pages from existing PDFs into new PDFs then 
>> the SuperimposePage example which comes with PDFBox might already serve your 
>> needs. If you specify a custom BBox for the FormXObject, then you can use 
>> that to clip the page - which sounds like what you want. Please note that 
>> this technique still embeds all of the original page contents, so it’s not 
>> suitable for removing private or sensitive data, but otherwise it’s fine.
>> 
>> If you have PDFs which PDFReader can’t render, please try using the 2.0 
>> trunk version of PDFBox, where we have fixed many bugs.
>> 
>> Thanks
>> 
>> -- John
>> 
>>> On 14 Jan 2015, at 15:14, Stefan Falk <[email protected]> wrote:
>>> 
>>> Well, basically just extract it to load it into another PDF, but it should 
>>> be possible e.g. with the mouse.
>>> 
>>> 
>>>> On 2015-01-14 22:52, Maruan Sahyoun wrote:
>>>> What would you like to do with that content?
>>>> 
>>>> BR
>>>> Maruan
>>>> 
>>>>> On 14.01.2015 at 21:42, Stefan Falk <[email protected]> wrote:
>>>>> 
>>>>> Hello pdfbox people!
>>>>> 
>>>>> I was wondering if anybody can help me with my needs. What I am looking 
>>>>> for is a possibility to extract the underlying PDF code from a PDF file 
>>>>> by simply selecting an area with your mouse.
>>>>> 
>>>>> After reading a few things about PDFs I have learned that anything that 
>>>>> has to do with extracting anything from a PDF can be quite a hard task.
>>>>> 
>>>>> So I was wondering if pdfbox could do that somehow. I've taken a rough 
>>>>> look at the PDFReader and I noticed that there is e.g. 
>>>>> processTextPosition from the class PageDrawer that seems to allow me to 
>>>>> get at least the position of text - am I right in assuming that?
>>>>> 
>>>>> My concrete question would be what is possible with pdfbox regarding this 
>>>>> matter? E.g. I have a PDF on my drive whose text seems to be 
>>>>> "extractable" by pdfbox on the one hand, but on the other hand the 
>>>>> PDFReader is not able to render any of it. It just renders the images 
>>>>> (see attachment).
>>>>> 
>>>>> Thank you for your help in advance!
>>>>> 
>>>>> Best regards,
>>>>> Stefan
>
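
A minimal sketch of one step the thread leaves implicit: turning a mouse
selection into the rectangle that would be used as the custom BBox mentioned
above. It uses only plain Java and no PDFBox calls. The class name
SelectionToBBox, the method toPdfBBox and all of its parameters are
hypothetical, not part of PDFBox. It assumes an unrotated page whose lower-left
corner is at (0,0), a viewer that draws the page at a uniform scale (pixels per
PDF point), and screen coordinates whose y axis grows downwards.

```java
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox: maps a mouse selection in viewer
// pixels to a rectangle in PDF user space (origin at bottom-left, y up).
public final class SelectionToBBox {

    /**
     * @param x1,y1        first corner of the drag, in viewer pixels
     * @param x2,y2        opposite corner of the drag, in viewer pixels
     * @param scale        viewer pixels per PDF point (assumed uniform)
     * @param pageHeightPt page height in PDF points (assumed unrotated page)
     * @return {lowerLeftX, lowerLeftY, upperRightX, upperRightY} in points
     */
    public static double[] toPdfBBox(double x1, double y1, double x2, double y2,
                                     double scale, double pageHeightPt) {
        if (scale <= 0) {
            throw new IllegalArgumentException("scale must be positive");
        }
        // Normalise so the drag direction does not matter.
        double left = Math.min(x1, x2) / scale;
        double right = Math.max(x1, x2) / scale;
        double top = Math.min(y1, y2) / scale;     // nearest to top of screen
        double bottom = Math.max(y1, y2) / scale;  // furthest from top of screen
        // Flip the y axis: screen y grows downwards, PDF y grows upwards.
        return new double[] { left, pageHeightPt - bottom, right, pageHeightPt - top };
    }

    public static void main(String[] args) {
        // Drag from (300,400) to (100,200) on a 792 pt high page shown at 2x.
        double[] bbox = toPdfBBox(300, 400, 100, 200, 2.0, 792);
        System.out.println(Arrays.toString(bbox)); // [50.0, 592.0, 150.0, 692.0]
    }
}
```

The four resulting values would then be the candidate for the form's BBox.
Rotated pages or a crop box that does not start at (0,0) would need extra
handling that this sketch does not attempt.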
