RE: Why would PDFTextStripper.getText() generate a NullPointerException ?

Adam Tue, 11 May 2010 14:43:39 -0700

aPage.findCropBox() in PDFStreamEngine.java line 202 (on HEAD tag) is 
returning null.  That null is passed into the PDGraphicsState constructor 
and causes the exception you're seeing.


I'm guessing that the page doesn't have a cropbox, nor does its parent, 
nor does it have a media box.  This may not be a valid PDF, but I'm not 
sure as I haven't looked into the page layout part of the specification. Since 
I'm not familiar with the spec, I can't really suggest a fix for this. Can 
you open the PDF with another program like Adobe Reader?  If it works 
there then there should be some sane solution for PDFBox even if the PDF 
is out of spec.
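
To make the guess above concrete, here is a rough sketch of the lookup I 
would expect: the crop box is searched on the page and then up its parents, 
and the media box is used if no crop box turns up.  PageNode and its methods 
are hypothetical stand-ins written for this illustration, not PDFBox 
classes, and I have not checked this against the actual PDFBox source.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a page-tree node; not a PDFBox class. */
final class PageNode {
    final PageNode parent;                        // null for the root node
    final Map<String, float[]> entries = new HashMap<>();

    PageNode(PageNode parent) {
        this.parent = parent;
    }

    /** Walk from this node up the parent chain until some node defines the key. */
    float[] findInherited(String key) {
        for (PageNode node = this; node != null; node = node.parent) {
            float[] box = node.entries.get(key);
            if (box != null) {
                return box;
            }
        }
        return null;                              // nobody in the chain has it
    }

    /** Crop box if one exists in the chain, otherwise the media box, otherwise null. */
    float[] findCropBox() {
        float[] crop = findInherited("CropBox");
        return crop != null ? crop : findInherited("MediaBox");
    }
}
```

If neither entry exists anywhere in the chain, findCropBox() in this sketch 
returns null, which would be consistent with the stack trace.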

As a test, you could add a quick if(page != null) check in 
PDGraphicsState.java before line 83.  That's not a proper solution, but it 
might at least give you more info, and perhaps someone else on the list 
will have an idea of where to go next.  Of course, you'll have to compile 
PDFBox to do this (not sure if you're doing that now or just using the 
1.1.0 jar files).  Hope this helps you at least a little.
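
The kind of guard I mean looks roughly like this.  GraphicsStateSketch is a 
hypothetical stand-in, not the real PDGraphicsState, and the clip 
computation is only an assumption about what a constructor like that might 
do with the page box.

```java
/** Hypothetical stand-in for a graphics state; not the real PDGraphicsState. */
final class GraphicsStateSketch {
    /** Clipping rectangle as {x, y, width, height}, or null if no page box was given. */
    final float[] clip;

    /**
     * @param pageBox page rectangle as {lowerLeftX, lowerLeftY, upperRightX, upperRightY},
     *                or null when the page defines no usable box
     */
    GraphicsStateSketch(float[] pageBox) {
        if (pageBox != null) {
            // Normal case: derive the initial clip from the page rectangle.
            clip = new float[] {
                pageBox[0],
                pageBox[1],
                pageBox[2] - pageBox[0],
                pageBox[3] - pageBox[1]
            };
        } else {
            // Guarded case: skip the clip instead of dereferencing null.
            clip = null;
        }
    }
}
```

With the guard in place, constructing the state with a null box no longer 
throws, so you can at least see how far processing gets after that point.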

--Adam





From:
"Lupton, Chris B." <[email protected]>
To:
<[email protected]>
Cc:
"Lupton, Chris B." <[email protected]>
Date:
05/11/2010 14:09
Subject:
RE: Why would PDFTextStripper.getText() generate a NullPointerException ?



Follow Up on my previous post:

The Stack Trace that I am getting:
===============================================================
Exception in thread "main" java.lang.NullPointerException
at org.apache.pdfbox.pdmodel.graphics.PDGraphicsState.<init>
(PDGraphicsState.java:83)
at org.apache.pdfbox.util.PDFStreamEngine.processStream
(PDFStreamEngine.java:201)
at org.apache.pdfbox.util.PDFTextStripper.processPage
(PDFTextStripper.java:367)
at org.apache.pdfbox.util.PDFTextStripper.processPages
(PDFTextStripper.java:291)
at org.apache.pdfbox.util.PDFTextStripper.writeText
(PDFTextStripper.java:247)
at org.apache.pdfbox.util.PDFTextStripper.getText
(PDFTextStripper.java:180)


Description of my PDF that is having the problem:
===================================================
The PDF (which I am not allowed to share) was created by a Scanner that
OCR'd data from a FAXed page.
The original FAX page was a Govt Form which had poor scan quality to
begin with.

As a result most of the lines that make up the Form's "boxes" are faded
/ incomplete when FAXed.
Fortunately, the original Text Content is clearly typed and is correctly
represented as text within the PDF Document.
(I can actually copy/paste it from Acrobat Reader,  for example).

However, the PDF does contain much of the Govt Form's original fuzzy
outline.
Most of these graphical lines are faded/incomplete.
Would these lines cause a problem with Text Extraction perhaps ?


The Sample Code that I based my simple Java Class on:
===========================================================
http://www.java-forums.org/advanced-java/8546-reading-text-using-pdfbox.html

PDDocument pddDocument=PDDocument.load(new File("a.pdf"));
PDFTextStripper textStripper=new PDFTextStripper();
System.out.println(textStripper.getText(pddDocument));
pddDocument.close();

(I am using the latest version of PDFBox and its supporting jar file for
fo



