RE: Make PDFBox fail on bad pdf

2017-03-30 Thread Allison, Timothy B.
If you have any recommendations for the more general case, let us know on TIKA-1443 [1].
[1] https://issues.apache.org/jira/browse/TIKA-1443
-Original Message-
From: Wouter De Borger [mailto:wouter.debor...@inmanta.com]
Sent: Thursday, March 30, 2017 6:00 AM
To: users@pdfbox.apache.org

Re: Make PDFBox fail on bad pdf

2017-03-30 Thread Maruan Sahyoun
Hi,
> On 30.03.2017 at 14:25, Wouter De Borger wrote:
>
> Hi,
>
> Thanks for the hint! I'll try to add some content there, as I can definitely use a garbage detector.
>
> In this case, however, I was specifically trying to avoid using a statistical detector.

Re: Make PDFBox fail on bad pdf

2017-03-30 Thread Maruan Sahyoun
> On 30.03.2017 at 14:37, Wouter De Borger wrote:
>
> Hi,
>
> Well, PDFBox does know it can't decode the Unicode characters (as it outputs a stream of warnings). It would be nice if I could ask PDFBox how many undecodable characters a document has.
well,

Re: Make PDFBox fail on bad pdf

2017-03-30 Thread Wouter De Borger
Oh, sorry, my bad. The log lines are:
2017-46-30 14:46:04.788 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c49 (86) in font null
2017-46-30 14:46:04.788 WARN ---

Re: Make PDFBox fail on bad pdf

2017-03-30 Thread Wouter De Borger
Hi, Thanks for the hint! I'll try to add some content there, as I can definitely use a garbage detector. In this case, however, I was specifically trying to avoid using a statistical detector. PDFBox already knows there is a problem, so there is no need to examine the content to attempt to

Re: Make PDFBox fail on bad pdf

2017-03-30 Thread Wouter De Borger
Hi,
Well, PDFBox does know it can't decode the Unicode characters (as it outputs a stream of warnings). It would be nice if I could ask PDFBox how many undecodable characters a document has.
Wouter
On Thu, Mar 30, 2017 at 2:29 PM, Maruan Sahyoun wrote:
> Hi,
>
> > Am

Make PDFBox fail on bad pdf

2017-03-30 Thread Wouter De Borger
Hi All, When a pdf has bad encoding, PDFBox produces garbage (as explained in the FAQ https://pdfbox.apache.org/2.0/faq.html#gibberish). Can I make PDFBox fail in this case instead of producing garbage? (I'm working on a system that can also do OCR, so at the least sign of trouble, I would like

A Dumb Acroform Question

2017-03-30 Thread Evan Williams
I would like to change the default font of every field in an existing PDF form. This suggests to me that I should change the default appearance of the acroform and, to be on the safe side (since I am not the author of these forms), change the default appearance of each PDTextField (I don't care about

Re: A Dumb Acroform Question

2017-03-30 Thread Maruan Sahyoun
Hi,
> On 30.03.2017 at 15:54, Evan Williams wrote:
>
> I would like to change the default font of every field in an existing PDF form. This suggests to me that I should change the default appearance of the acroform and, to be on the safe side (since I am not the author

Re: Make PDFBox fail on bad pdf

2017-03-30 Thread Maruan Sahyoun
An option that does not require changing PDFBox would be to create a custom log appender and capture the org.apache.pdfbox.pdmodel.font.PDSimpleFont log messages. You could then count them afterwards and, if the count is above a certain threshold, drop the result of the text extraction.
> On 30.03.2017 at
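The appender idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the PDFBox log output ends up in java.util.logging (in a Spring setup it may go to another backend, in which case that backend's own appender API would be needed instead), and the class name UnmappedGlyphCounter and the helper countWarnings are hypothetical names. The logger name and the "No Unicode mapping" text are taken from the log lines quoted in this thread.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical helper: counts "No Unicode mapping" warnings while attached to a logger. */
class UnmappedGlyphCounter extends Handler {
    // Logger name as it appears in the log lines quoted in this thread.
    // Assumption: PDFBox logging is routed to java.util.logging.
    static final String FONT_LOGGER = "org.apache.pdfbox.pdmodel.font.PDSimpleFont";

    private final AtomicInteger count = new AtomicInteger();

    @Override
    public void publish(LogRecord record) {
        String msg = record.getMessage();
        // Count only warnings (or worse) that report a missing Unicode mapping.
        if (record.getLevel().intValue() >= Level.WARNING.intValue()
                && msg != null && msg.contains("No Unicode mapping")) {
            count.incrementAndGet();
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered.
    }

    @Override
    public void close() {
        // No resources to release.
    }

    int getCount() {
        return count.get();
    }

    /** Runs the given extraction with a counter attached and returns the number of warnings seen. */
    static int countWarnings(Runnable extraction) {
        Logger logger = Logger.getLogger(FONT_LOGGER);
        UnmappedGlyphCounter counter = new UnmappedGlyphCounter();
        logger.addHandler(counter);
        try {
            extraction.run();
        } finally {
            // Always detach, so the logging setup is left as it was found.
            logger.removeHandler(counter);
        }
        return counter.getCount();
    }
}
```

The caller would wrap its text extraction in the Runnable and compare the returned count against a threshold of its own choosing before deciding to discard the text and fall back to OCR. Because the handler is attached to a shared logger, extractions running concurrently on other threads would be counted as well, so this sketch only suits one extraction at a time.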

Re: Make PDFBox fail on bad pdf

2017-03-30 Thread Wouter De Borger
I'm working in a Spring/Tomcat container, so I'm reluctant to mess with the logging, as I'm not quite sure whether Spring/Tomcat ever reloads or updates the logging config. Another option would be to create a subclass of PDFTextStripper, override showText, grab the font after each call and extract the

Re: A Dumb Acroform Question

2017-03-30 Thread Evan Williams
Thank you Maruan! That solution works perfectly. At last my users will be able to enjoy illegibly tiny 4 pt text in autosized fields (which, sadly, they believe that they want).
On Thu, Mar 30, 2017 at 10:38 AM, Maruan Sahyoun wrote:
> Hi,
>
> > On 30.03.2017 at 15:54

Re: Make PDFBox fail on bad pdf

2017-03-30 Thread Tilman Hausherr
The problem is that some files do this as an obfuscation technique. What might be detected is fonts that don't have unicode extraction. See in LegacyPDFStreamEngine "if (unicode == null)". Make your own or extend it and check for TextPosition objects with unicode null. (See PrintTextLocations

Re: There was a problem when loading a pdf file to Adobe Acrobat. document(18)

2017-03-30 Thread Tilman Hausherr
On 31.03.2017 at 07:21, karthick g wrote:
Hi Team,
Apologies for sending the previous mail to the developer team. Please guide me.
* There was a problem when loading a PDF file into Adobe Acrobat: "There was a problem reading this document (18)" (error shown by Acrobat).
* When running the

There was a problem when loading a pdf file to Adobe Acrobat. document(18)

2017-03-30 Thread karthick g
Hi Team,
Apologies for sending the previous mail to the developer team. Please guide me.
* There was a problem when loading a PDF file into Adobe Acrobat: "There was a problem reading this document (18)" (error shown by Acrobat).
* When running the same file through PDFBox, it works fine.