Hi,

Well, PDFBox does know it can't decode the Unicode characters (as it outputs a stream of warnings). It would be nice if I could ask PDFBox how many undecodable characters a document has.
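Until something like that exists, one workaround might be to count the warnings themselves rather than inspect the extracted text. Below is a minimal sketch, not a PDFBox API. It assumes that PDFBox's logging (done through Apache Commons Logging) ends up in java.util.logging, that the loggers are named under "org.apache.pdfbox", and that the warning text contains "No Unicode mapping". All three are assumptions to verify against the PDFBox version and logging setup in use. The class name UnmappedGlyphCounter and the method countDuring are hypothetical, and the lambda in main is only a stand-in for the real text extraction.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical helper: counts log warnings that look like "No Unicode mapping" messages. */
public class UnmappedGlyphCounter extends Handler {
    // Assumption: the warning text contains this marker; check the PDFBox version in use.
    static final String MARKER = "No Unicode mapping";
    // Assumption: PDFBox loggers are named after their classes, so they sit under this prefix.
    static final String LOGGER_NAME = "org.apache.pdfbox";

    private final AtomicInteger count = new AtomicInteger();

    @Override
    public void publish(LogRecord record) {
        String msg = record.getMessage();
        // Count only WARNING-or-higher records whose text contains the marker.
        if (record.getLevel().intValue() >= Level.WARNING.intValue()
                && msg != null && msg.contains(MARKER)) {
            count.incrementAndGet();
        }
    }

    @Override public void flush() { }
    @Override public void close() { }

    public int getCount() { return count.get(); }

    /** Runs the given extraction with the counter attached; returns the number of matching warnings. */
    public static int countDuring(Runnable extraction) {
        Logger logger = Logger.getLogger(LOGGER_NAME);
        UnmappedGlyphCounter counter = new UnmappedGlyphCounter();
        logger.addHandler(counter);
        try {
            extraction.run();
        } finally {
            // Always detach, so later extractions start from a clean state.
            logger.removeHandler(counter);
        }
        return counter.getCount();
    }

    public static void main(String[] args) {
        // Stand-in for the real text extraction: emits made-up warnings of the expected shape.
        int n = countDuring(() -> {
            Logger font = Logger.getLogger("org.apache.pdfbox.pdmodel.font.PDSimpleFont");
            font.warning("No Unicode mapping for g123 (42) in font ABCDEF+Foo");
            font.warning("No Unicode mapping for g124 (43) in font ABCDEF+Foo");
            font.info("unrelated message");
        });
        System.out.println("unmapped glyph warnings: " + n);
    }
}
```

In real use the lambda would wrap the text extraction call, and a non-zero count would be the signal to fall back to OCR. Two caveats: the handler sees warnings from every thread, so concurrent extractions would share one count, and if the library logs each unmapped code only once per font, the number is a lower bound rather than a per-character total.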
Wouter

On Thu, Mar 30, 2017 at 2:29 PM, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:

> Hi,
>
> > On 30.03.2017 at 14:25, Wouter De Borger <wouter.debor...@inmanta.com> wrote:
> >
> > Hi,
> >
> > Thanks for the hint! I'll try to add some content there, as I can
> > definitely use a garbage detector.
> >
> > In this case, however, I was specifically trying to avoid using a
> > statistical detector. PDFBox already knows there is a problem,
>
> that is not the case here. From PDFBox perspective everything is fine.
> It's extracting the text according to the definition and information in the
> PDF. That this is garbage from a users perspective would mean that PDFBox
> 'understands' that the extracted text is not meaningful.
>
> BR
> Maruan
>
> > so there is
> > no need to examine the content to attempt to detect a problem.
> > I would like to be able to capture the problem when and where it is known,
> > as this is easier and more accurate.
> >
> > Thanks,
> > Wouter
> >
> > On Thu, Mar 30, 2017 at 2:16 PM, Allison, Timothy B. <talli...@mitre.org> wrote:
> >
> >> If you have any recommendations for the more general case, let us know on
> >> TIKA-1443 [1].
> >>
> >> [1] https://issues.apache.org/jira/browse/TIKA-1443
> >>
> >> -----Original Message-----
> >> From: Wouter De Borger [mailto:wouter.debor...@inmanta.com]
> >> Sent: Thursday, March 30, 2017 6:00 AM
> >> To: users@pdfbox.apache.org
> >> Subject: Make PDFBox fail on bad pdf
> >>
> >> Hi All,
> >>
> >> When a pdf has bad encoding, PDFBox produces garbage (as explained in the
> >> FAQ https://pdfbox.apache.org/2.0/faq.html#gibberish).
> >>
> >> Can I make PDFBox fail in this case instead of producing garbage?
> >>
> >> (I'm working on a system that can also do OCR, so at the least sign of
> >> trouble, I would like PDF box to fail and try OCR.)
> >>
> >> Thanks,
> >> Wouter

--
Wouter De Borger, PhD
Co-founder Inmanta
www.inmanta.com
Email: wouter.debor...@inmanta.com