Re: problem with parsing pdf with PDFBox

Maruan Sahyoun Thu, 26 Mar 2015 02:28:05 -0700

Hello Anna,

> On 26.03.2015 at 09:11, Golovko Anna <[email protected]> wrote:
> 
> Hello!
> 
> My name is Anna Yakubenko. I'm a Java developer and I currently support an 
> application that converts PDF to text with PDFBox and then stores the data 
> in an XML file as output. Previously, all PDF files were parsed by PDFBox 
> properly, but now I have a PDF file that is parsed in a way I did not expect. 
> It seems that the customer added a new layer with a picture, a running header 
> and a footer to the PDF. Now PDFBox extracts only the header and footer text 
> from every page and misses the important information in the middle of the page. 
> 
> I use the following source code to call the PDFBox API:
> 
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.PrintStream;
> import java.io.PrintWriter;
> import org.pdfbox.cos.COSDocument;
> import org.pdfbox.pdfparser.PDFParser;
> import org.pdfbox.pdmodel.PDDocument;
> import org.pdfbox.pdmodel.PDDocumentInformation;
> import org.pdfbox.util.PDFTextStripper;
> 
> public class PDFTextParser
> {
>  PDFParser parser;
>  String parsedText;
>  PDFTextStripper pdfStripper;
>  PDDocument pdDoc;
>  COSDocument cosDoc;
>  PDDocumentInformation pdDocInfo;
> 
>  String pdftoText(String fileName)
>  {
>    System.out.println("Parsing text from PDF file " + fileName + "....");
>    File f = new File(fileName);
>    if (!f.isFile())
>    {
>      System.out.println("File " + fileName + " does not exist.");
>      return null;
>    }
>    try
>    {
>      System.out.println("Now defining the parser: new PDFParser ");
>      this.parser = new PDFParser(new FileInputStream(f));
>    }
>    catch (Exception e)
>    {
>      System.out.println("Unable to open PDF Parser.");
>      return null;
>    }
>    try
>    {
>      System.out.println("Now working with the parser:  ");
>      this.parser.parse();
>      this.cosDoc = this.parser.getDocument();
>      this.pdfStripper = new PDFTextStripper();
>      this.pdDoc = new PDDocument(this.cosDoc);
>      this.parsedText = this.pdfStripper.getText(this.pdDoc);
>    }
>    catch (Exception e)
>    {
>      System.out.println("An exception occurred in parsing the PDF Document.");
>      e.printStackTrace();
>      try
>      {
>        if (this.cosDoc != null) {
>          this.cosDoc.close();
>        }
>        if (this.pdDoc != null) {
>          this.pdDoc.close();
>        }
>      }
>      catch (Exception e1)
>      {
>        e1.printStackTrace();
>      }
>      return null;
>    }
>    System.out.println("Done.");
>    return this.parsedText;
>  }
> 
>  void writeTexttoFile(String pdfText, String fileName)
>  {
>    System.out.println("\nWriting PDF text to output text file " + fileName + 
> "....");
>    try
>    {
>      PrintWriter pw = new PrintWriter(fileName);
>      pw.print(pdfText);
>      pw.close();
>    }
>    catch (Exception e)
>    {
>      System.out.println("An exception occurred in writing the pdf text to file.");
>      e.printStackTrace();
>    }
>    System.out.println("Done.");
>  }
> 
>  public static void main(String[] args)
>  {
>    if (args.length != 2)
>    {
>      System.out.println("Usage: java PDFTextParser <InputPDFFilename> <OutputTextFile>");
>      System.exit(1);
>    }
>    System.out.println(" MAIN: Start, both files have been passed ");
>    System.out.println(" MAIN:  PDF file (arg 0) : " + args[0]);
>    System.out.println(" MAIN:  Text file (arg 1) : " + args[1]);
>    PDFTextParser pdfTextParserObj = new PDFTextParser();
>    String pdfToText = pdfTextParserObj.pdftoText(args[0]);
>    if (pdfToText == null)
>    {
>      System.out.println("PDF to Text Conversion failed.");
>    }
>    else
>    {
>      System.out.println("\nThe text parsed from the PDF Document....\n" + 
> pdfToText);
>      pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
>    }
>  }
> }
>


You could simplify your code a lot with something like the following (I haven't 
tested it, so there might be typos). The typical way to parse a PDF document is 
PDDocument.load, which does the rest in the background and already returns the 
PDDocument you need for the PDFTextStripper:

    void pdftoText(String pdfFile, String outputFile)
    {
        System.out.println("Parsing text from PDF file " + pdfFile + "....");
        File f = new File(pdfFile);
        if (!f.isFile())
        {
            System.out.println("File " + pdfFile + " does not exist.");
            return; // nothing to parse
        }

        PDDocument pdDoc = null;
        Writer output = null;
        try
        {
            // load() parses the file and returns the PDDocument directly
            pdDoc = PDDocument.load(f);
            output = new OutputStreamWriter(new FileOutputStream(outputFile));
            PDFTextStripper pdfStripper = new PDFTextStripper();
            // write the extracted text straight to the output writer
            pdfStripper.writeText(pdDoc, output);
        }
        catch (IOException e)
        {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
        }
        finally
        {
            // close both resources whether or not extraction succeeded
            IOUtils.closeQuietly(pdDoc);
            IOUtils.closeQuietly(output);
        }

        System.out.println("Done.");
    }
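
As a side note on the cleanup in the finally block: on Java 7 or later the same 
"always close the writer" guarantee can be had with try-with-resources. Here is a 
minimal sketch of just the file-writing part, using only the standard library. 
The class name TextFileWriter and the choice of UTF-8 are my assumptions for 
illustration, not something PDFBox provides or requires:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of PDFBox.
class TextFileWriter
{
    // Writes text to fileName as UTF-8. The writer is closed automatically
    // when the try block exits, even if write() throws.
    static void write(String text, String fileName) throws IOException
    {
        try (Writer output = new OutputStreamWriter(
                new FileOutputStream(fileName), StandardCharsets.UTF_8))
        {
            output.write(text);
        }
    }
}
```

Naming the charset explicitly keeps the output independent of the platform 
default encoding, which matters if the extracted text contains non-ASCII 
characters.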

In addition, there is already a command-line app, ExtractText, which does that 
for you. 



> 
> Could you please advise me how I can extract all the information from the PDF 
> file, or at least the data from the middle of the page? I don't really need 
> the text in the header and footer.
> 
> I can send my PDF and TXT files if needed.
> 

Regarding the PDF: could you upload it to a public location so we can give it a try?

BR
Maruan


> Many thanks in advance!
> 
> Best regards,
> Anna Yakubenko
