Re: Extract underlying PDF code from PDF file by selecting an area

Maruan Sahyoun Thu, 15 Jan 2015 00:32:54 -0800

You're welcome - and yes, we are always interested in getting our hands on 
files that cannot be rendered correctly.
Please also try opening them in Adobe Reader/Acrobat, just to get an idea of 
how they are processed there. Sometimes we get PDFs that are so corrupted that 
there is not a lot we can do about them.


For all usage questions the users mailing list is fine. If you are sure, or 
think, that you have found a bug, please open an issue at 
https://issues.apache.org/jira/browse/PDFBOX with a test case to reproduce the 
issue and the PDF in question attached. If you have an idea of how to overcome 
the issue, you can also attach a patch for us to review.

Good luck with your project and feel free to ask additional questions as they 
arise.

BR
Maruan


On 15.01.2015 at 09:18, Stefan Falk <[email protected]> wrote:

> This is awesome! Thank you!
> 
> I will take a close look at it and update to the trunk version too.
> 
> Do you want me to report PDFs that could not be displayed correctly in the 
> future?
> 
> Best regards,
> Stefan
> 
> On 2015-01-15 09:03, Maruan Sahyoun wrote:
>> Hi Stefan,
>> 
>> yes, PDFBox is capable of doing this. To crop the page to the dimensions you 
>> need you can use
>> 
>> PDPage.setCropBox 
>> [http://pdfbox.apache.org/docs/1.8.8/javadocs/org/apache/pdfbox/pdmodel/PDPage.html#setCropBox(org.apache.pdfbox.pdmodel.common.PDRectangle)]
>> 
>> As John pointed out, the SuperimposePage example will give you the basics to 
>> import and 'mount' the page into a new or existing PDF.
>> 
>> The only thing left is to get the coordinates from the mouse and translate 
>> them into the dimensions of the rectangle in the PDF.
>> 
>> BR
>> Maruan
>> 
>> On 15.01.2015 at 08:48, Stefan Falk <[email protected]> wrote:
>> 
>>> Hi John!
>>> 
>>> Yes, clipping the PDF is basically what I would like to do! So would pdfbox 
>>> be the best choice for this? I have looked a lot for a library, but it does 
>>> not seem that there are many open source tools out there.
>>> 
>>> My goal is a program that allows clipping PDFs in order to create a 
>>> composed PDF out of all the clips, and maybe you could tell me if pdfbox 
>>> would be the best choice for such a task.
>>> 
>>> @fairly difficult: Well yes, I was quite astonished to find out that 
>>> extracting content from a PDF is actually a scientific topic :D
>>> 
>>> Best regards,
>>> Stefan
>>> 
>>> On 2015-01-15 03:21, John Hewson wrote:
>>>> Hi Stefan
>>>> 
>>>> What you’re describing is actually fairly difficult due to the complexity 
>>>> of the PDF operators. We have a special processor for text in PDFBox, but 
>>>> it is not necessarily accurate.
>>>> 
>>>> If you’re just trying to embed pages from existing PDFs into new PDFs then 
>>>> the SuperimposePage example which comes with PDFBox might already serve 
>>>> your needs. If you specify a custom BBox for the FormXObject, then you can 
>>>> use that to clip the page - which sounds like what you want. Please note 
>>>> that this technique still embeds all of the original page contents, so it’s 
>>>> not suitable for removing private or sensitive data, but otherwise it’s 
>>>> fine.
>>>> 
>>>> If you have PDFs which PDFReader can’t render, please try using the 2.0 
>>>> trunk version of PDFBox, where we have fixed many bugs.
>>>> 
>>>> Thanks
>>>> 
>>>> -- John
>>>> 
>>>>> On 14 Jan 2015, at 15:14, Stefan Falk <[email protected]> wrote:
>>>>> 
>>>>> Well, basically just extract it to load it into another PDF, but it 
>>>>> should be possible to do so e.g. with the mouse.
>>>>> 
>>>>> 
>>>>> On 2015-01-14 22:52, Maruan Sahyoun wrote:
>>>>>> What would you like to do with that content?
>>>>>> 
>>>>>> BR
>>>>>> Maruan
>>>>>> 
>>>>>> On 14.01.2015 at 21:42, Stefan Falk <[email protected]> wrote:
>>>>>> 
>>>>>>> Hello pdfbox people!
>>>>>>> 
>>>>>>> I was wondering if anybody can help me with my needs. What I am looking 
>>>>>>> for is a possibility to extract the underlying PDF code from a PDF file 
>>>>>>> by simply selecting an area with your mouse.
>>>>>>> 
>>>>>>> After reading a few things about PDFs I have learned that anything that 
>>>>>>> has to do with extracting content from a PDF can be quite a hard task.
>>>>>>> 
>>>>>>> So I was wondering if pdfbox could do that somehow. I've taken a rough 
>>>>>>> look at the PDFReader and I noticed that there is e.g. 
>>>>>>> processTextPosition from the class PageDrawer, which seems to allow me 
>>>>>>> to get at least the position of text - am I right in assuming that?
>>>>>>> 
>>>>>>> My concrete question would be: what is possible with pdfbox regarding 
>>>>>>> this matter? E.g. I have a PDF on my drive whose text seems to be 
>>>>>>> "extractable" by pdfbox on the one hand, but on the other hand the 
>>>>>>> PDFReader is not able to render any of it. It just renders the images 
>>>>>>> (see attachment).
>>>>>>> 
>>>>>>> Thank you for your help in advance!
>>>>>>> 
>>>>>>> Best regards,
>>>>>>> Stefan
>> 
>
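
The coordinate translation mentioned in the thread (from a mouse selection to 
a rectangle in the PDF) could be sketched roughly as follows. This is only a 
minimal sketch in plain Java and not code taken from PDFBox: the class 
SelectionToPdfRect, the PdfRect holder and the toPdfRect method are 
hypothetical names, and it assumes an unrotated page whose MediaBox starts at 
(0, 0) and which is displayed at a uniform scale given in pixels per PDF 
point. The four resulting values are the kind of numbers one would then put 
into the PDRectangle handed to PDPage.setCropBox.

```java
// Hypothetical helper, not part of PDFBox: converts a mouse drag on a rendered
// page (pixels, origin top-left, y grows downwards) into a rectangle in PDF
// user space (points, origin bottom-left, y grows upwards).
class SelectionToPdfRect {

    /** Lower-left and upper-right corner of the selection, in PDF points. */
    static final class PdfRect {
        final float llx, lly, urx, ury;

        PdfRect(float llx, float lly, float urx, float ury) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }
    }

    /**
     * Assumptions: the page is not rotated, its MediaBox starts at (0, 0),
     * and 'scale' is the number of screen pixels per PDF point.
     */
    static PdfRect toPdfRect(int x1, int y1, int x2, int y2,
                             float scale, float pageHeightPt) {
        if (scale <= 0f) {
            throw new IllegalArgumentException("scale must be positive");
        }
        // Normalise the drag so it works in any direction, then undo the zoom.
        float left = Math.min(x1, x2) / scale;
        float right = Math.max(x1, x2) / scale;
        float topFromTop = Math.min(y1, y2) / scale;
        float bottomFromTop = Math.max(y1, y2) / scale;
        // Flip the y axis: screen y is measured from the top edge of the page,
        // PDF y from the bottom edge.
        return new PdfRect(left, pageHeightPt - bottomFromTop,
                           right, pageHeightPt - topFromTop);
    }

    public static void main(String[] args) {
        // A page 842 points high, shown at 2 pixels per point,
        // with a drag from pixel (100, 200) to pixel (300, 500).
        PdfRect r = toPdfRect(100, 200, 300, 500, 2.0f, 842f);
        System.out.println(r.llx + " " + r.lly + " " + r.urx + " " + r.ury);
    }
}
```

With the values used in main, the selection maps to a rectangle with its 
lower-left corner at (50, 592) and its upper-right corner at (150, 742). A 
rotated page or a MediaBox with a non-zero origin would need an additional 
offset or rotation step that is not shown here.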
