problem with parsing pdf with PDFBox

Golovko Anna Thu, 26 Mar 2015 01:23:47 -0700

Hello!

My name is Anna Yakubenko. I'm a Java developer, and I currently support an 
application that uses PDFBox to convert PDF files to text and then stores the 
data in an XML file as output. Until now, every PDF file was parsed correctly, 
but I have now received a PDF file that is parsed in a way I did not expect. It 
seems that the customer added a new layer with a picture, a header and a footer 
to the PDF. Now PDFBox extracts only the header and footer text from every 
page and misses the important information in the middle of the page.


I use the following source code to call the PDFBox API:

import java.io.File;
import java.io.FileInputStream;
import java.io.PrintStream;
import java.io.PrintWriter;
import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdmodel.PDDocumentInformation;
import org.pdfbox.util.PDFTextStripper;

public class PDFTextParser
{
  PDFParser parser;
  String parsedText;
  PDFTextStripper pdfStripper;
  PDDocument pdDoc;
  COSDocument cosDoc;
  PDDocumentInformation pdDocInfo;
  
  String pdftoText(String fileName)
  {
    System.out.println("Parsing text from PDF file " + fileName + "....");
    File f = new File(fileName);
    if (!f.isFile())
    {
      System.out.println("File " + fileName + " does not exist.");
      return null;
    }
    try
    {
      System.out.println("Now creating the parser: new PDFParser");
      this.parser = new PDFParser(new FileInputStream(f));
    }
    catch (Exception e)
    {
      System.out.println("Unable to open PDF Parser.");
      return null;
    }
    try
    {
      System.out.println("Now working with the parser:");
      this.parser.parse();
      this.cosDoc = this.parser.getDocument();
      this.pdfStripper = new PDFTextStripper();
      this.pdDoc = new PDDocument(this.cosDoc);
      this.parsedText = this.pdfStripper.getText(this.pdDoc);
    }
    catch (Exception e)
    {
      System.out.println("An exception occurred while parsing the PDF document.");
      e.printStackTrace();
      return null;
    }
    finally
    {
      // Release the documents on both the success and the failure path.
      try
      {
        if (this.cosDoc != null) {
          this.cosDoc.close();
        }
        if (this.pdDoc != null) {
          this.pdDoc.close();
        }
      }
      catch (Exception e1)
      {
        // Report the failure from close(), not the earlier parse exception.
        e1.printStackTrace();
      }
    }
    System.out.println("Done.");
    return this.parsedText;
  }
  
  void writeTexttoFile(String pdfText, String fileName)
  {
    System.out.println("\nWriting PDF text to output text file " + fileName
        + "....");
    try
    {
      PrintWriter pw = new PrintWriter(fileName);
      pw.print(pdfText);
      pw.close();
    }
    catch (Exception e)
    {
      System.out.println("An exception occurred in writing the pdf text to file.");
      e.printStackTrace();
    }
    System.out.println("Done.");
  }
  
  public static void main(String[] args)
  {
    if (args.length != 2)
    {
      System.out.println("Usage: java PDFTextParser <InputPDFFilename> <OutputTextFile>");
      System.exit(1);
    }
    System.out.println(" MAIN: start, both files were passed ");
    System.out.println(" MAIN:  PDF file (arg 0) : " + args[0]);
    System.out.println(" MAIN:  Text file (arg 1) : " + args[1]);
    PDFTextParser pdfTextParserObj = new PDFTextParser();
    String pdfToText = pdfTextParserObj.pdftoText(args[0]);
    if (pdfToText == null)
    {
      System.out.println("PDF to Text Conversion failed.");
    }
    else
    {
      System.out.println("\nThe text parsed from the PDF Document....\n"
          + pdfToText);
      pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
    }
  }
}


Could you please advise me how I can extract all the information from the PDF 
file, or at least the data from the middle of the page? I don't really need 
the text in the header and footer.
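
As an aside on the unwanted header and footer text: below is a minimal sketch, 
assuming the text of each page is already available as a separate string (for 
example from one stripper call per page), of dropping lines that repeat on 
every page. RepeatedLineFilter and stripRepeatedLines are hypothetical names, 
not part of PDFBox, and this does not address the missing text in the middle 
of the page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not a PDFBox class): drops lines that repeat on every page. */
final class RepeatedLineFilter {

    /** Returns a copy of the pages without the lines that occur, trimmed, on all pages. */
    static List<String> stripRepeatedLines(List<String> pages) {
        // With fewer than two pages nothing can be recognised as repeating.
        if (pages.size() < 2) {
            return new ArrayList<>(pages);
        }
        // Count on how many pages each distinct non-empty line appears.
        Map<String, Integer> pageCount = new HashMap<>();
        for (String page : pages) {
            Set<String> seenOnThisPage = new HashSet<>();
            for (String line : page.split("\\R")) {
                String key = line.trim();
                if (!key.isEmpty() && seenOnThisPage.add(key)) {
                    pageCount.merge(key, 1, Integer::sum);
                }
            }
        }
        // Rebuild each page, keeping only lines that are not present on every page.
        List<String> result = new ArrayList<>();
        for (String page : pages) {
            StringBuilder body = new StringBuilder();
            for (String line : page.split("\\R")) {
                Integer count = pageCount.get(line.trim());
                boolean onEveryPage = count != null && count == pages.size();
                if (!onEveryPage) {
                    body.append(line).append('\n');
                }
            }
            result.add(body.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for two extracted pages.
        List<String> pages = Arrays.asList(
            "Sample header\nFirst page body\nSample footer",
            "Sample header\nSecond page body\nSample footer");
        for (String body : stripRepeatedLines(pages)) {
            System.out.print(body);
        }
    }
}
```

Lines that differ from page to page, such as a footer containing the page 
number, are not caught by this exact-match comparison and would need a looser 
rule.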

I can send my PDF and the resulting TXT file if needed.

Many thanks in advance!

Best regards,
Anna Yakubenko

