Hello Anna,
> On 26.03.2015 at 09:11, Golovko Anna <[email protected]> wrote:
>
> Hello!
>
> My name is Anna Yakubenko. I'm a Java developer and I currently support an
> application that converts PDF to text with PDFBox and then stores the data in
> an XML file as output. Until now every PDF file was parsed properly by PDFBox,
> but now I have a PDF file that is parsed in a way I did not expect. It seems
> that the customer added a new layer with a picture, a header and a footer to
> the PDF. Now PDFBox extracts only the header and footer text from every page
> and misses the important information in the middle of the page.
>
> I use the following source code to call the PDFBox API:
>
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.PrintStream;
> import java.io.PrintWriter;
> import org.pdfbox.cos.COSDocument;
> import org.pdfbox.pdfparser.PDFParser;
> import org.pdfbox.pdmodel.PDDocument;
> import org.pdfbox.pdmodel.PDDocumentInformation;
> import org.pdfbox.util.PDFTextStripper;
>
> public class PDFTextParser
> {
> PDFParser parser;
> String parsedText;
> PDFTextStripper pdfStripper;
> PDDocument pdDoc;
> COSDocument cosDoc;
> PDDocumentInformation pdDocInfo;
>
> String pdftoText(String fileName)
> {
> System.out.println("Parsing text from PDF file " + fileName + "....");
> File f = new File(fileName);
> if (!f.isFile())
> {
> System.out.println("File " + fileName + " does not exist.");
> return null;
> }
> try
> {
> System.out.println("Jetzt wird der Parser definiert: new PDFParser ");
> this.parser = new PDFParser(new FileInputStream(f));
> }
> catch (Exception e)
> {
> System.out.println("Unable to open PDF Parser.");
> return null;
> }
> try
> {
> System.out.println("Jetzt wird mit dem Parser gearbeitet: ");
> this.parser.parse();
> this.cosDoc = this.parser.getDocument();
> this.pdfStripper = new PDFTextStripper();
> this.pdDoc = new PDDocument(this.cosDoc);
> this.parsedText = this.pdfStripper.getText(this.pdDoc);
> }
> catch (Exception e)
> {
> System.out.println("An exception occurred in parsing the PDF Document.");
> e.printStackTrace();
> try
> {
> if (this.cosDoc != null) {
> this.cosDoc.close();
> }
> if (this.pdDoc != null) {
> this.pdDoc.close();
> }
> }
> catch (Exception e1)
> {
> e1.printStackTrace();
> }
> return null;
> }
> System.out.println("Done.");
> return this.parsedText;
> }
>
> void writeTexttoFile(String pdfText, String fileName)
> {
> System.out.println("\nWriting PDF text to output text file " + fileName +
> "....");
> try
> {
> PrintWriter pw = new PrintWriter(fileName);
> pw.print(pdfText);
> pw.close();
> }
> catch (Exception e)
> {
> System.out.println("An exception occurred in writing the pdf text to file.");
> e.printStackTrace();
> }
> System.out.println("Done.");
> }
>
> public static void main(String[] args)
> {
> if (args.length != 2)
> {
> System.out.println("Usage: java PDFTextParser <InputPDFFilename> <OutputTextFile>");
> System.exit(1);
> }
> System.out.println(" MAIN: Beginn, alle beiden Dateien sind übergeben ");
> System.out.println(" MAIN: PDF-Datei (arg 0) : " + args[0]);
> System.out.println(" MAIN: Text-Datei (arg 1) : " + args[1]);
> PDFTextParser pdfTextParserObj = new PDFTextParser();
> String pdfToText = pdfTextParserObj.pdftoText(args[0]);
> if (pdfToText == null)
> {
> System.out.println("PDF to Text Conversion failed.");
> }
> else
> {
> System.out.println("\nThe text parsed from the PDF Document....\n" +
> pdfToText);
> pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
> }
> }
> }
>
You could simplify your code a lot by doing something like the following (I
haven't tested it, so there might be typos). The typical way to parse a PDF
document is to call PDDocument.load, which does the rest in the background for
you and already returns the PDDocument you need for the PDFTextStripper:
void pdftoText(String pdfFile, String outputFile)
{
    System.out.println("Parsing text from PDF file " + pdfFile + "....");
    File f = new File(pdfFile);
    if (!f.isFile())
    {
        System.out.println("File " + pdfFile + " does not exist.");
        // nothing to parse, so stop here
        return;
    }
    PDDocument pdDoc = null;
    Writer output = null;
    try
    {
        // load parses the file and returns the PDDocument directly
        pdDoc = PDDocument.load(f);
        output = new OutputStreamWriter(new FileOutputStream(outputFile));
        PDFTextStripper pdfStripper = new PDFTextStripper();
        // writes the extracted text straight to the output file
        pdfStripper.writeText(pdDoc, output);
    }
    catch (IOException e)
    {
        System.out.println("An exception occurred in parsing the PDF Document.");
        e.printStackTrace();
    }
    finally
    {
        // close both resources whether or not parsing succeeded
        IOUtils.closeQuietly(pdDoc);
        IOUtils.closeQuietly(output);
    }
    System.out.println("Done.");
}
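The finally block above closes both resources by hand. If you are on Java 7 or
later, try-with-resources might be an alternative for the output side. The
following is only a sketch using plain JDK classes: the class name
TextFileWriterSketch and the method writeText are made up for illustration, and
the PDDocument part is left out on purpose so that it runs without PDFBox.
Whether PDDocument itself can be used in a try-with-resources statement depends
on the PDFBox version you have, so please check that before relying on it.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox.
public class TextFileWriterSketch {

    /**
     * Writes the given text to outputFile as UTF-8.
     * Returns true on success and false if an IOException occurred.
     */
    static boolean writeText(String text, String outputFile) {
        Path target = Paths.get(outputFile);
        // The writer is closed automatically, on success and on failure.
        try (Writer output = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            output.write(text);
            return true;
        } catch (IOException e) {
            System.out.println("An exception occurred while writing " + outputFile + ".");
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a short sample text to a temporary file and clean up afterwards.
        Path tmp = Files.createTempFile("pdf-text", ".txt");
        boolean ok = writeText("text extracted from the PDF\n", tmp.toString());
        System.out.println("written: " + ok);
        Files.delete(tmp);
    }
}
```

Setting the charset explicitly also avoids depending on the platform default
encoding, which the plain OutputStreamWriter constructor would use.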
In addition, there is already a command-line app, ExtractText, which does that
for you.
>
> Could you please advise me how I can extract all the information from the PDF
> file, or at least the data from the middle of the page? I don't really need
> the text in the header and footer.
>
> I can send my PDF and TXT files if needed.
>
Regarding the PDF, could you upload it to a public location so we can give it a
try?
BR
Maruan
> Many thanks in advance!
>
> Best regards,
> Anna Yakubenko
>