Re: Extract Embedded files from pdf using pdfbox in .NET application

Ramesh Shrestha Thu, 20 Jun 2013 03:49:37 -0700

Thanks,

Following your suggestion to use annotations, I was able to extract the name
of the embedded file; however, I could not extract the contents of that file.
Please refer to the code below.


var originalDocument = PDDocument.load(_PdfFile);
var originalCatalog = originalDocument.getDocumentCatalog();
java.util.List sourceDocumentPages = originalCatalog.getAllPages();
var newDocument = new PDDocument();

// Number of pages in the PDF file = 2
int[] pageNumbers = { 1, 2 };

foreach (var pageNumber in pageNumbers)
{
    // Page numbers are 1-based, but PDPages are held in a zero-based list.
    int pageIndex = pageNumber - 1;

    try
    {
        PDPage pdpage = (PDPage)sourceDocumentPages.get(pageIndex);
        java.util.List anno = pdpage.getAnnotations();

        if (anno.size() > 0)
        {
            // Assumes the first annotation on the page is the file attachment.
            PDAnnotationFileAttachment pafa = (PDAnnotationFileAttachment)anno.get(0);

            // getContents() gives me the file name
            string filename = pafa.getContents();

            // The file contents still need to be read from this specification.
            PDFileSpecification fs = pafa.getFile();
        }
    }
    catch (Exception)
    {
        // Errors are currently ignored.
    }
}

Can you help me one more time to extract and dump the embedded file in the
specified location?
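
For reference, the "dump to a specified location" step can be sketched on its
own, separately from the PDFBox calls. The Java sketch below is an illustration
under assumptions, not code from this thread: the class EmbeddedFileDumper and
its methods are hypothetical names, and it only covers writing an
already-opened stream to disk under a sanitized file name. In PDFBox 1.8 the
object returned by getFile() is usually a PDComplexFileSpecification, whose
getEmbeddedFile() returns a PDEmbeddedFile that can supply the stream via
createInputStream(); that wiring appears only as a comment and is untested
here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper (not part of PDFBox): writes a stream into a target folder.
final class EmbeddedFileDumper {

    // Strip any directory part so a name stored in the PDF cannot escape targetDir.
    static String baseName(String name) {
        if (name == null) {
            return "attachment.bin";
        }
        int cut = Math.max(name.lastIndexOf('/'), name.lastIndexOf('\\'));
        String base = name.substring(cut + 1).trim();
        boolean unusable = base.isEmpty() || base.equals(".") || base.equals("..");
        return unusable ? "attachment.bin" : base;
    }

    // Copy the whole stream to targetDir/<baseName>, replacing any existing file.
    static Path dump(InputStream content, Path targetDir, String fileName)
            throws IOException {
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(baseName(fileName));
        try (InputStream in = content) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    // Assumed wiring with PDFBox 1.8 (untested here, shown for orientation only):
    //   PDComplexFileSpecification spec = (PDComplexFileSpecification) pafa.getFile();
    //   PDEmbeddedFile embedded = spec.getEmbeddedFile();
    //   EmbeddedFileDumper.dump(embedded.createInputStream(), targetDir, spec.getFilename());
}
```

From C# through IKVM the same PDFBox methods would be called on the same
classes; the file-writing part could equally be done with the .NET file APIs.
Sanitizing the name first is a deliberate choice, since the file name stored in
a PDF is untrusted input.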



On Thu, Jun 20, 2013 at 2:46 PM, Ramesh Shrestha <[email protected]> wrote:

>
> Even after trying annotations, I am still not able to extract the
> embedded/attached doc file located in a page of the PDF.
>
> On Tue, Jun 11, 2013 at 5:29 PM, Andreas Lehmkuehler <[email protected]> wrote:
>
>> On 11.06.2013 07:06, Ramesh Shrestha wrote:
>>
>>> Thanks,
>>>
>>> The Java example link I provided should have been:
>>>
>>> http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
>>>
>>> But your suggestion WORKS.
>>>
>>> Now I am able to extract the attached file located in the *attachments
>>> tab*, but *haven't been able to extract the attached file located in a
>>> page*. I am getting a null efTree in this case.
>>>
>>>          PDDocumentNameDictionary namesDictionary =
>>>              new PDDocumentNameDictionary(pdfDoc.getDocumentCatalog());
>>>          PDEmbeddedFilesNameTreeNode efTree =
>>>              namesDictionary.getEmbeddedFiles();
>>>
>>> So now working on it.
>>>
>> Embedded files are always document related. If an embedded file is
>> referenced on a single page, a file attachment annotation is used. Try
>> something like this to get all annotations of a single page:
>>
>> List annotations = page.getAnnotations();
>>
>> The one you are looking for has to be an instance of the class
>>
>>
>> org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationFileAttachment.
>>
>>> On Mon, Jun 10, 2013 at 7:38 PM, Andreas Lehmkuehler <[email protected]> wrote:
>>>
>>>  Hi,
>>>>
>>>> On 10.06.2013 11:22, Ramesh Shrestha wrote:
>>>>
>>>>   Hi,
>>>>
>>>>>
>>>>>
>>>>>     I am developing a .NET application using PDFBox to extract metadata,
>>>>> content and attached files from PDFs.
>>>>>
>>>>> I was able to extract metadata and content, but stuck while extracting
>>>>> attached/embedded files.
>>>>>
>>>>> I have a pdf with embedded/attached doc file and want to retrieve that
>>>>> file. I have gone through the java example -
>>>>>
>>>>> http://www.docjar.com/html/api/org/apache/pdfbox/examples/pdmodel/EmbeddedFiles.java.html
>>>>>
>>>>> But while trying to use it in .NET, I got "non generic type
>>>>> 'java.util.Map' cannot be used with type arguments" for the
>>>>> following code snippet:
>>>>>
>>>>> java.util.Map<String, COSObjectable> names = efTree.getNames();
>>>>>
>>>>> So I would be grateful if anybody could help me extract the file from
>>>>> the PDF.
>>>>>
>>>> I'm not a .NET expert and don't know what may cause that issue. But maybe
>>>> it is a good idea to just omit the generics and try something like this:
>>>>
>>>> java.util.Map names = efTree.getNames();
>>>>
>>>>   Thanks in advance.
>>>>
>>>>>
>>>>>
>>>> HTH
>>>> Andreas Lehmkühler
>>>>
>>>
>> BR
>> Andreas Lehmkühler
>>
>>
>
>
>


-- 
pasa
