problem with parsing pdf with PDFBox
Hello!

My name is Anna Yakubenko. I'm a Java developer and I currently maintain an application that uses PDFBox to convert PDF files to text and then stores the data in an XML file as output. Until now every PDF file was parsed by PDFBox correctly, but I have now received a PDF file that is parsed in a way I did not expect. It seems that the customer added a new layer with a picture, a header and a footer to the PDF. PDFBox now extracts only the header and footer text from every page and misses the important information in the middle of the page.

I use the following source code to call the PDFBox API:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.PrintStream;
    import java.io.PrintWriter;
    import org.pdfbox.cos.COSDocument;
    import org.pdfbox.pdfparser.PDFParser;
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.pdmodel.PDDocumentInformation;
    import org.pdfbox.util.PDFTextStripper;

    public class PDFTextParser {
        PDFParser parser;
        String parsedText;
        PDFTextStripper pdfStripper;
        PDDocument pdDoc;
        COSDocument cosDoc;
        PDDocumentInformation pdDocInfo;

        String pdftoText(String fileName) {
            System.out.println("Parsing text from PDF file " + fileName);
            File f = new File(fileName);
            if (!f.isFile()) {
                System.out.println("File " + fileName + " does not exist.");
                return null;
            }
            try {
                System.out.println("Now defining the parser: new PDFParser");
                this.parser = new PDFParser(new FileInputStream(f));
            } catch (Exception e) {
                System.out.println("Unable to open PDF Parser.");
                return null;
            }
            try {
                System.out.println("Now working with the parser:");
                this.parser.parse();
                this.cosDoc = this.parser.getDocument();
                this.pdfStripper = new PDFTextStripper();
                this.pdDoc = new PDDocument(this.cosDoc);
                this.parsedText = this.pdfStripper.getText(this.pdDoc);
            } catch (Exception e) {
                System.out.println("An exception occurred in parsing the PDF Document.");
                e.printStackTrace();
                try {
                    if (this.cosDoc != null) {
                        this.cosDoc.close();
                    }
                    if (this.pdDoc != null) {
                        this.pdDoc.close();
                    }
                } catch (Exception e1) {
                    // report the failure to close, not the outer exception again
                    e1.printStackTrace();
                }
                return null;
            }
            System.out.println("Done.");
            return this.parsedText;
        }

        void writeTexttoFile(String pdfText, String fileName) {
            System.out.println("\nWriting PDF text to output text file " + fileName);
            try {
                PrintWriter pw = new PrintWriter(fileName);
                pw.print(pdfText);
                pw.close();
            } catch (Exception e) {
                System.out.println("An exception occurred in writing the pdf text to file.");
                e.printStackTrace();
            }
            System.out.println("Done.");
        }

        public static void main(String[] args) {
            if (args.length != 2) {
                System.out.println("Usage: java PDFTextParser InputPDFFilename OutputTextFile");
                System.exit(1);
            }
            System.out.println(" MAIN: start, both files were passed");
            System.out.println(" MAIN: PDF file (arg 0) : " + args[0]);
            System.out.println(" MAIN: text file (arg 1) : " + args[1]);
            PDFTextParser pdfTextParserObj = new PDFTextParser();
            String pdfToText = pdfTextParserObj.pdftoText(args[0]);
            if (pdfToText == null) {
                System.out.println("PDF to Text Conversion failed.");
            } else {
                System.out.println("\nThe text parsed from the PDF Document\n" + pdfToText);
                pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
            }
        }
    }

Could you please advise me how I can extract all information from the PDF file, or at least the data from the middle of the page? I don't really need the text in the header and footer. I can send my PDF and TXT files if needed.

Many thanks in advance!

Best regards,
Anna Yakubenko

- To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org
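For the request above (only the text in the middle of the page), one option might be region-based extraction. PDFBox ships a PDFTextStripperByArea class whose addRegion(name, Rectangle2D), extractRegions(page) and getTextForRegion(name) methods work on a rectangle per page. The sketch below only computes such a rectangle with the JDK. The class name, the method name and the header and footer heights are hypothetical and would have to be measured on the actual PDF, and the top-left origin is an assumption about how regions are interpreted. It does not explain why the body text was skipped in the first place.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper: computes the page area between a header and a footer. */
final class BodyRegionSketch {

    /**
     * Returns the rectangle left over after removing a header strip at the top
     * and a footer strip at the bottom. All values are in the same unit
     * (assumed here: PDF points, with y measured from the top edge).
     */
    static Rectangle2D.Float bodyRegion(float pageWidth, float pageHeight,
                                        float headerHeight, float footerHeight) {
        float bodyHeight = pageHeight - headerHeight - footerHeight;
        if (bodyHeight <= 0) {
            throw new IllegalArgumentException("header and footer cover the whole page");
        }
        // x starts at the left edge, y starts right below the header strip
        return new Rectangle2D.Float(0f, headerHeight, pageWidth, bodyHeight);
    }
}
```

For an A4 page (595 x 842 points) with an assumed 80-point header and 60-point footer, bodyRegion(595f, 842f, 80f, 60f) yields a region starting at y = 80 with a height of 702 points, which could then be registered under a name such as "body" with the area stripper.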
Re: problem with parsing pdf with PDFBox
Hello Anna,

On 26.03.2015 at 09:11, Golovko Anna ann-golo...@yandex.ru wrote:

[...]

You could simplify your code a lot by doing something similar to the following (I haven't tested it, so there might be typos). The typical way to parse a PDF document is to call PDDocument.load, which does the rest in the background for you and already returns the PDDocument you need for the PDFTextStripper:

    void pdftoText(String pdfFile, String outputFile) {
        System.out.println("Parsing text from PDF file " + pdfFile);
        File f = new File(pdfFile);
        if (!f.isFile()) {
            System.out.println("File " + pdfFile + " does not exist.");
            return; // nothing to parse
        }
        PDDocument pdDoc = null;
        Writer output = null;
        try {
            // load() parses the file and returns the PDDocument directly
            pdDoc = PDDocument.load(f);
            output = new OutputStreamWriter(new FileOutputStream(outputFile));
            PDFTextStripper pdfStripper = new PDFTextStripper();
            pdfStripper.writeText(pdDoc, output);
        } catch (IOException e) {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
        } finally {
            IOUtils.closeQuietly(pdDoc);
            IOUtils.closeQuietly(output);
        }
        System.out.println("Done.");
    }

In addition there is
Re: org.apache.pdfbox.searchengine?
Hi,

that's now part of the examples package (org.apache.pdfbox.examples.lucene) and no longer within core.

BR
Maruan

On 25.03.2015 at 23:13, Yoel R. GARCIA DIAZ yr.garciad...@me.com wrote:

I am sure that this has been asked before, but I am new to pdfbox and can't reconcile its documentation with the available API after adding version 1.8.8 to my app. The second line on this page suggests that I should be able to 'import org.apache.pdfbox.searchengine.lucene.LucenePDFDocument', but I can't. This means that the Lucene integration example, 'Document luceneDocument = LucenePDFDocument.getDocument( ... );', is also wrong. So, is Lucene integrated with PDFBox in version 1.8.8 and above (2.0.0-SNAPSHOT), or not at all anymore?

Yoel
Re: Blank page rendered with wrong xref start objid (batch 1.8)
Hi,

jg...@e-nautia.com wrote on 25 March 2015 at 15:25:

Hello, bug PDFBOX-2679, entitled "Blank page rendered with wrong xref start objid", was recently fixed for branch 2.0.0, but the same issue still affects the NonSequentialParser in v1.8.8, which also renders a blank page for this kind of malformed PDF (in our case these PDFs are generated by some SOHO scanners!). Do you plan to fix this issue for branch 1.8 as well, or at least open a JIRA issue?

thank you
Jerome

No, we don't backport every fix from the trunk to 1.8, for various reasons. If someone wants to do so, patches are welcome :-)

BR
Andreas Lehmkühler
pdfbox warnings
When I call the PDFRenderer renderPageToGraphics method I get warnings in my log. Is there a fix for these?

    org.apache.pdfbox.pdmodel.font.PDType1Font init
    WARNING: Using fallback font ArialMT for ZapfDingbats
    org.apache.pdfbox.rendering.font.Type1Glyph2D getPathForCharacterCode
    WARNING: No glyph for 52 (.notdef) in font ZapfDingbats
    org.apache.pdfbox.pdmodel.font.PDType1Font init
    WARNING: Using fallback font ArialMT for ZapfDingbats
PDF BOX Savable pdf
Hi,

I have a few questions regarding PDFBox in Java. Can you please tell me how to create a savable PDF using PDFBox, i.e. one where the user is allowed to enter text in a text field and save the PDF form?

Thanks

Regards,
Sandeep Varma. M
Pennant Technologies
External Tool?
Hi,

I'm using pdfbox to fill fields in PDF forms. The problem is that most forms we need to fill arrive broken. Broken in the sense that we can load the file and we can find the field by name, but the first field on the page invariably gives us the infamous exception:

    java.lang.NullPointerException
        at org.apache.pdfbox.pdmodel.interactive.form.PDAppearance.calculateFontSize(PDAppearance.java:923)

This is the sign that the form is broken. Previously we used MasterPDFEditor to save the document and build the fields on it (all over again), and everything worked. Now, apparently, that tool creates equally broken documents that we cannot fill with pdfbox.

1. Is there a better tool to use, one that would actually save usable files and not cost an arm and a leg? (For instance, should I use pdfbox to do this work programmatically?)
2. Regardless of the answer to question 1, could I be doing something better in terms of opening and manipulating the file? (Code sample below.)
3. Is there anything else that is probably well known but that I don't know enough to ask about?

Sample:

    PDDocument pdfDocument = null;
    PDAcroForm form = null;
    try {
        pdfDocument = PDDocument.load(templateLocation);
        form = pdfDocument.getDocumentCatalog().getAcroForm();
    } catch (IOException ioe) {
        log.log(Level.SEVERE, "Unable to load a template: {0}\n{1}", new Object[]{templateLocation, ioe});
    }
    PDField employerName = form.getField("Name_of_Employer");
    employerName.setValue("TEST"); // <-- NPE here

--
Richard Johnson
Re: PDDeviceN
On 25.03.2015 at 22:52, Floris wrote:

Hey there, I am struggling to get the PDDeviceN color space to work. I am trying to create a simple page with one rectangle whose color is defined in a DeviceN space. I tried the following:

    PDDeviceN cs = new PDDeviceN();
    cs.setAlternateColorSpace(PDDeviceCMYK.INSTANCE);
    cs.setColorantNames(Arrays.asList("Cyan", "Magenta", "Yellow", "Black", "Orange", "Green"));
    PDFunction func;
    COSArray domains = new COSArray();
    COSArray ranges = new COSArray();
    domains.setFloatArray(new float[]{0.0f, 1.0f, 0.0f, 1.0f, 0.0f, 1.0f, 0.0f, 1.0f, 0.0f, 1.0f, 0.0f, 1.0f});
    ranges.setFloatArray(new float[]{0.0f, 1.0f, 0.0f, 1.0f, 0.0f, 1.0f, 0.0f, 1.0f});
    COSDictionary dict = new COSDictionary();
    dict.setItem(COSName.FUNCTION_TYPE, COSInteger.get(4));
    dict.setItem(COSName.DOMAIN, domains);
    dict.setItem(COSName.RANGE, ranges);
    try {
        func = PDFunction.create(tintdata);
        cs.setTintTransform(func);
    } catch (IOException e) {
        e.printStackTrace();
    }

This results in a NullPointerException in the first line of the try block. The type 4 function should just pop the first two values (to get from CMYKOG to CMYK). Could someone give me a jump start on how to define a PDDeviceN space?

Have a nice day,
Floris

Hello,

- tintdata is not defined
- the PostScript code of the type 4 function is missing

Here's an excerpt of a PDF file with a type 4 function:

    48 0 obj
    <<
    /FunctionType 4
    /Domain [0 1]
    /Range [0 1 0 1 0 1 0 1]
    /Length 62
    >>
    stream
    {dup 0.37 mul exch dup 0 mul exch dup 0.34 mul exch 0.34 mul }
    endstream
    endobj

That is, your function must have a stream which contains the PostScript code. Thus call PDFunction.create() with a COSStream. This object should have the dictionary that you already have, and the code of the function as its stream content.

    COSDictionary dict = new COSDictionary();
    dict.setItem(COSName.FUNCTION_TYPE, COSInteger.get(4));
    dict.setItem(COSName.DOMAIN, domains);
    dict.setItem(COSName.RANGE, ranges);
    String functionText = "{ push pop whatever }";
    COSStream functionStream = new COSStream(dict, new RandomAccessBuffer());
    OutputStream out = functionStream.createUnfilteredStream();
    out.write(functionText.getBytes("US-ASCII"));
    out.close();
    func = PDFunction.create(functionStream);

I didn't test this code. If it doesn't work, please post some complete code that creates a PDF (that fails) and I'll try to help you more. (However, I can't help you with the PostScript code.)

Tilman

PS: always mention the version when you ask a question (in your case probably 1.8.*)
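The two inputs the reply leaves open, the /Domain and /Range arrays and the function body, can be sketched with plain Java. This is only an illustration under assumptions: the class and method names are hypothetical, and the body simply discards the topmost operands with the `pop` operator of PDF type 4 (PostScript calculator) functions. With six inputs pushed in colorant order, the two values on top of the stack are Orange and Green, so the result ignores those colorants instead of approximating them in CMYK.

```java
/** Hypothetical helper for the pieces of a type 4 tint transform. */
final class TintTransformSketch {

    /** Returns [0 1 0 1 ...]: one (min, max) pair per color component. */
    static float[] unitIntervals(int components) {
        float[] bounds = new float[2 * components];
        for (int i = 0; i < components; i++) {
            bounds[2 * i] = 0.0f;      // lower bound of component i
            bounds[2 * i + 1] = 1.0f;  // upper bound of component i
        }
        return bounds;
    }

    /** Builds a calculator-function body that pops the top `extra` operands. */
    static String dropTopOperands(int extra) {
        StringBuilder body = new StringBuilder("{");
        for (int i = 0; i < extra; i++) {
            body.append(" pop");
        }
        return body.append(" }").toString();
    }
}
```

For the six-colorant case in this thread, unitIntervals(6) would feed the /Domain array, unitIntervals(4) the /Range array, and dropTopOperands(2) would replace the placeholder function text.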
Re: pdfbox warnings
On 26.03.2015 at 14:16, Eric Douglas wrote:

When I call the PDFRenderer renderPageToGraphics method I get warnings in my log. Is there a fix for these?

[...]

Yes, buy a ZapfDingbats Type 1 font and copy it into your font directory. Or maybe you already have it; in that case the file is named ZD__.PFB. License restrictions may apply.

Tilman

PS: always mention the version (in your case trunk or 2.0)
Re: pdfbox warnings
I don't know where the fallback font comes from, but it doesn't work. If I view the PDF in Adobe Reader it has an editable checkbox with a check mark in it. The image rendered by pdfbox is just an empty box. I am using pdfbox 2.0.0 from trunk. I can upload a small sample PDF that shows this error if it helps.

I have a ZapfDingbats.ttf file, but I can't see any way to embed that for the editable checkbox. I'm creating the PDF using iText. If I view the PDF in Adobe Reader and check Properties, the Fonts tab shows:

    ZapfDingbats
        Type: Type 1
        Encoding: Built-in
        Actual Font: AdobePiStd
        Actual Font Type: Type 1

My normal text fonts, which are properly embedded and render fine with this trunk, show as:

    LucidaSans-Typewriter (Embedded Subset)
        Type: TrueType
        Encoding: Custom

On Thu, Mar 26, 2015 at 12:23 PM, Tilman Hausherr thaush...@t-online.de wrote:

[...]
Re: pdfbox warnings
I want it to use embedded fonts, but it appears to be looking for installed fonts for the check mark on an iText editable check box field, so I tried installing ZapfDingbats.ttf. Now it only logs this:

    org.apache.pdfbox.rendering.font.Type1Glyph2D getPathForCharacterCode
    WARNING: No glyph for 52 (.notdef) in font ZapfDingbats

It no longer logs "Using fallback font", so I don't know what it's doing. I can only guess that iText is also using a fallback font which is apparently not available to pdfbox? Is there a way I can set the pdfbox fallback font for renderPageToGraphics?

On Thu, Mar 26, 2015 at 1:50 PM, Tilman Hausherr thaush...@t-online.de wrote:

It doesn't work with the ttf font for some reason. I can only tell that it works for me with the type1 file.

Tilman

On 26.03.2015 at 18:41, Eric Douglas wrote:

[...]
Re: pdfbox warnings
Maybe the TTF and the Type 1 versions of the font have different encodings?

Re "set the pdfbox fallback": no. What I did to test ideas was to mess with the substitutes map in ExternalFonts.java.

Tilman

On 26.03.2015 at 19:10, Eric Douglas wrote:

[...]
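To make the encoding hypothesis above a bit more concrete: the warning names character code 52, which as a single byte is the ASCII digit '4'. In the built-in encoding of the standard ZapfDingbats Type 1 font, that code is, as far as I know, the check mark glyph (glyph name a20, usually associated with U+2714), which fits a check box appearance. A TrueType version of the font would typically be looked up through a Unicode table instead, which might explain why it finds no glyph for code 52. The snippet below is only a sketch with hypothetical names; it decodes the number from the log and does not touch PDFBox.

```java
import java.util.Locale;

/** Hypothetical helper that decodes the character code from the warning. */
final class DingbatsCodeSketch {

    /** Code point usually associated with ZapfDingbats glyph a20 (assumption). */
    static final int HEAVY_CHECK_MARK = 0x2714;

    /** Interprets a single-byte character code as its ASCII character. */
    static char asAscii(int code) {
        return (char) code;
    }

    /** Formats a code in decimal, hexadecimal and as an ASCII character. */
    static String describe(int code) {
        return String.format(Locale.ROOT, "code %d = 0x%02X = '%c'",
                code, code, asAscii(code));
    }
}
```

Calling describe(52) shows that the missing glyph corresponds to the byte '4' (0x34) in the check box appearance stream.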
Re: convert html to pdf with pdfbox
Nope, but you can use PhantomJS.

-- John

On 25 Mar 2015, at 11:16, Daniel Borlean dborl...@extraview.com wrote:

Does pdfbox support conversion of HTML files/documents to PDF format?

Thanks,
Daniel