Problem when extracting text from a pdf file

Mehmet Ali Abdulhayoglu Wed, 14 May 2014 07:59:30 -0700

Dear all,

As part of my research, I am trying to convert PDF files to text files. I have 
tried both iText and PDFBox, but I encounter the same issue with each.


When I extract text from the dnm1.pdf file (attached), both approaches work 
well. However, when I apply them to dnm2.pdf, they fail.

The text file I get back is full of NULL values. Is this normal for such 
differently shaped PDFs, or am I missing something else?

Thanks in advance.

Regards,
Mehmet


My code:

package retrievingfulltetxsfromweb;

import java.io.File;
import java.io.FileInputStream;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PdfBox {

    // Extract the text of pages 1 to 5 of a PDF document and print it
    public PdfBox(String fileName) {
        String parsedText = null;
        PDFTextStripper pdfStripper = null;
        PDDocument pdDoc = null;
        COSDocument cosDoc = null;
        File file = new File(fileName);
        if (!file.isFile()) {
            System.err.println("File " + fileName + " does not exist.");
            // Nothing to parse, so stop here
            return;
        }
        try {
            // Parse the file into a low-level COS document
            PDFParser parser = new PDFParser(new FileInputStream(file));
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            // Limit extraction to the first five pages
            pdfStripper.setStartPage(1);
            pdfStripper.setEndPage(5);
            parsedText = pdfStripper.getText(pdDoc);
            System.out.println(parsedText);
        } catch (Exception e) {
            System.err.println("An exception occurred in parsing the PDF Document. "
                    + e.getMessage());
        } finally {
            try {
                if (cosDoc != null)
                    cosDoc.close();
                if (pdDoc != null)
                    pdDoc.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String args[]) {
        PdfBox pdf = new PdfBox("C:/dnm1.pdf");
    }

}
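
To put a number on the symptom described above, the extracted string could be 
checked for NUL characters before it is written out. This is only a minimal 
sketch: it assumes that "NULL values" means U+0000 characters in the output, 
and the class name NulCheck and the method nulRatio are hypothetical helpers, 
not part of PDFBox or iText.

```java
public class NulCheck {

    // Returns the fraction of characters in text that are NUL (U+0000).
    // A null or empty string yields 0.0.
    public static double nulRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int nulCount = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\u0000') {
                nulCount++;
            }
        }
        return (double) nulCount / text.length();
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by the text extraction step
        String sample = "ab\u0000\u0000";
        System.out.println("NUL ratio: " + nulRatio(sample));
    }
}
```

Passing the extracted text for dnm1.pdf and dnm2.pdf to nulRatio would show 
whether the second file yields mostly NUL characters or only a few.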
