Re: Extract Text from PDF

Pooja4 G Thu, 27 May 2010 23:16:49 -0700

Daniel,

Is issues.apache.org the address where I need to create the issue? I am
new here, so I am not familiar with it.
For now, as I understand it, I am sending this to '[email protected]'.


I am also attaching the example file.

Problem Description

My application is a Document Management application that manages PDF
documents.

When the PDF documents are created as version '1.4 (5.x) or later' by
Adobe Professional 9.0, revision 1 is uploaded and the PDF is generated
successfully.
But when I try to create a new revision, say revision 2, then during
creation of the 'PDF difference file', line 11 below

 1  // Text Stripper initialisation
 2  PDFTextStripper stripper = new PDFTextStripper();
 3  pdfStream = new ByteArrayInputStream(pdf_buf);
 4
 5  // Open and load PDF document content.
 6  document = PDDocument.load(pdfStream);
 7  // Get the document content in String format.
 8  // And suppress all non ascii characters
 9  //try
10  //{
11  PDF_text = stripper.getText(document);

returns null in 'PDF_text' instead of the text content of the PDF file.
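
One quick check when files from a newer authoring tool behave differently is to log the version that the file header declares. The following is only a minimal sketch using the Java standard library, not PDFBox. The class name PdfHeaderVersion and the method declaredVersion are made up for illustration, and the sketch relies on a PDF file starting with a header of the form %PDF-1.x. From PDF 1.4 onward the document catalog may also carry a /Version entry that overrides the header, and this sketch does not read it.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: reads the version declared in a PDF header such as "%PDF-1.4". */
final class PdfHeaderVersion {

    private static final String MARKER = "%PDF-";

    /**
     * Returns the declared version (for example "1.4"), or null when no
     * header marker is found in the first 1024 bytes of the buffer.
     */
    static String declaredVersion(byte[] pdfBuf) {
        if (pdfBuf == null) {
            return null;
        }
        int limit = Math.min(pdfBuf.length, 1024);
        // ISO-8859-1 maps every byte to exactly one char, so indices stay aligned.
        String head = new String(pdfBuf, 0, limit, StandardCharsets.ISO_8859_1);
        int start = head.indexOf(MARKER);
        if (start < 0) {
            return null;
        }
        int pos = start + MARKER.length();
        int end = pos;
        // The version token consists of digits and dots, e.g. "1.4" or "1.7".
        while (end < head.length()
                && (Character.isDigit(head.charAt(end)) || head.charAt(end) == '.')) {
            end++;
        }
        return end > pos ? head.substring(pos, end) : null;
    }
}
```

Calling PdfHeaderVersion.declaredVersion(pdf_buf) just before PDDocument.load would let the log show whether the failing revisions declare a different version from the ones that work.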


Thanks & Regards,
Pooja Gupta
Tata Consultancy Services
Mailto: [email protected]
Website: http://www.tcs.com
____________________________________________
Experience certainty.   IT Services
                        Business Solutions
                        Outsourcing
____________________________________________



From: Daniel Wilson <[email protected]>
To: [email protected]
Date: 05/27/2010 07:57 PM
Subject: Re: Extract Text from PDF



Pooja,

Would you create an issue at issues.apache.org for this and attach an
example file?

Thanks.

Daniel

On Wed, May 26, 2010 at 12:03 AM, Pooja4 G <[email protected]> wrote:

> I tried to use PDFBox 1.1.0, but with it the PDF generation failed while
> we were checking the documents for encryption.
> Does anyone know of another API besides PDFBox that we could use to
> create the PDF diff file in our DMS?
> We upload documents from Adobe Professional 9.0, and when we create a new
> revision of a document, creation of the PDF diff file fails. The
> getText() method of the PDFTextStripper class returns null, as below.
>
>        String PDF_text = new String();
>        PDFTextStripper stripper = new PDFTextStripper();
>
> PDF_text = stripper.getText(document);
>
> So please help me in solving this.
>
> Thanks & Regards,
> Pooja Gupta
>
>
>
> From: Andreas Lehmkuehler <[email protected]>
> To: [email protected]
> Date: 05/20/2010 09:21 PM
> Subject: Re: Extract Text from PDF
>
>
>
> Hi,
>
> Thomas Fischer wrote:
> > Hello Pooja,
> >
> > I don't have any Adobe 9.0 documents, but I know that in my tests the
> > newer versions of PDFBox perform significantly better than version 0.7.3.
> > I would suggest you try the fairly recent version 1.1.0, which works very
> > well, at least on my Adobe Acrobat 8.1 documents.
> Which can be found at [1]
>
>
> BR
> Andreas Lehmkühler
>
> [1] http://pdfbox.apache.org/download.html
> >
> > Kind regards
> > Thomas Fischer
> >
> >
> > On 20.05.2010 at 14:07, Pooja4 G wrote:
> >
> >> Which versions of PDF documents are supported by PDFBox 0.7.3? We
> >> upload a document created with Adobe Professional 9.0, and while
> >> creating the difference files to compare, we extract the text data
> >> from the PDF document using the getText() method of the
> >> PDFTextStripper class.
> >>
> >>        String PDF_text = new String();
> >>        PDFTextStripper stripper = new PDFTextStripper();
> >>
> >> PDF_text = stripper.getText(document);
> >>
> >> But it returns null if the document passed as the argument was created
> >> with Adobe Professional 9.0; otherwise it runs successfully.
> >> Please help, or at least let us know whether any upcoming version of
> >> PDFBox will support this.
> >>
> >> Thanks & Regards,
> >> Pooja Gupta
> >> =====-----=====-----=====
> >> Notice: The information contained in this e-mail
> >> message and/or attachments to it may contain
> >> confidential or privileged information. If you are
> >> not the intended recipient, any dissemination, use,
> >> review, distribution, printing or copying of the
> >> information contained in this e-mail message
> >> and/or attachments to it are strictly prohibited. If
> >> you have received this communication in error,
> >> please notify us by reply e-mail or telephone and
> >> immediately and permanently delete the message
> >> and any attachments. Thank you
> >>
> >>
> >
>
>
>
>
>
>



/**
         * Open PDF Document Stream from a byte array buffer.
         * 
         * @param pdf_buf
         * @return byte[] containing PDF text
         */
        public static byte[] fromBuffer(byte[] pdf_buf)
                        throws TechnicalException, FunctionalException {
                String m_name = "PDFBOX-fromBuffer";

                PDDocument document = null;
                InputStream pdfStream = null;
                try {
                        logger.debug("Entering function " + m_name);

                        String PDF_text = new String();
                        // Text Stripper initialisation
                        PDFTextStripper stripper = new PDFTextStripper();

                        pdfStream = new ByteArrayInputStream(pdf_buf);

                        // Open and load PDF document content.
                        document = PDDocument.load(pdfStream);
                        
                        boolean isCrypt = false;
                        try {
                                PdfPdfbox.isCrypted(pdf_buf);
                        } catch (FunctionalException fu) {
                                throw new FunctionalException(m_name, fu.getMessage());
                        }
                        
                        logger.debug("Get the Document Content in String Format");
                        // Get the document content in String format.
                        // And suppress all non ascii characters
                        //try
                        //{
                        PDF_text = stripper.getText(document);
                        
                        logger.debug("Ascii Character Error " + PDF_text);
                        //}
                        //catch (Exception fu){
                        //      logger.debug("Ascii Character Error " + fu.getMessage());
                        //      fu.printStackTrace();
                        //      logger.debug("-------------------------- " + PDF_text);
                        //}
                        
                        
                        PDF_text = StringUtils.filterNonAsciiCharacters(PDF_text);

                        // document and pdfStream are closed in the finally block
                        // below, so the document is still open when
                        // getDocumentInformation() is called.

                        if ( isCrypt == false ){
                                logger.debug("Get Creator " + m_name);
                                String docCreator = document.getDocumentInformation().getCreator();
                                logger.info("Creator " + docCreator);

                                // Suppress all windings characters
                                // if requested for the site and if the creator is PDF Creator
                                if (!Utils.StringIsNull(Constants.PDF_CREATOR_SUPPRESS_WINDINGS_FONT)
                                                && !Utils.StringIsNull(docCreator)
                                                && docCreator.startsWith("PDFCreator")) {

                                        return pdfCreatorTools.pdfCreatorUpdate(PDF_text.getBytes());
                                }

                        }

                        return PDF_text.getBytes();

                } catch (FunctionalException fu) {
                        throw new FunctionalException(m_name, fu.getMessage());
                } catch (IOException ioe) {
                        throw new TechnicalException(m_name, ioe);
                } catch (Exception e) {
                        throw new TechnicalException(m_name, e);
                } finally {
                        
                        if (document != null) {
                                try {
                                        document.close();
                                } catch (IOException ioe) {
                                        // nothing;
                                }
                        }

                        if (pdfStream != null) {
                                try {
                                        pdfStream.close();
                                } catch (IOException ioe) {
                                        // nothing;
                                }
                        }
                } // end finally


        } // end fromBuffer
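
Since getText() is reported to return null here, whatever runs next on PDF_text has to cope with a null input. The source of StringUtils.filterNonAsciiCharacters is not part of this attachment, so its behaviour on null is unknown. The block below is a hypothetical stand-in (class AsciiFilter, method filterNonAscii, both invented for illustration) that shows one null-safe way such a filter could be written, assuming "non-ASCII" means any char above 0x7F.

```java
/** Hypothetical stand-in for the project's StringUtils.filterNonAsciiCharacters. */
final class AsciiFilter {

    /** Returns the input with every char above 0x7F removed; null becomes "". */
    static String filterNonAscii(String text) {
        if (text == null) {
            // Assumption: an empty result is preferable to a NullPointerException.
            return "";
        }
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c <= 0x7F) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Returning an empty string keeps the later PDF_text.getBytes() call from failing, but it also hides the fact that extraction produced nothing, so logging the null case before filtering may still be worthwhile.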
