Hi,

> On 30.03.2017 at 14:25, Wouter De Borger <[email protected]> wrote:
>
> Hi,
>
> Thanks for the hint! I'll try to add some content there, as I can
> definitely use a garbage detector.
>
> In this case, however, I was specifically trying to avoid using a
> statistical detector. PDFBox already knows there is a problem,

That is not the case here. From PDFBox's perspective everything is fine: it extracts the text according to the definitions and information in the PDF. To recognize that the result is garbage from a user's perspective, PDFBox would have to 'understand' that the extracted text is not meaningful.

BR
Maruan

> so there is
> no need to examine the content to attempt to detect a problem.
> I would like to be able to capture the problem when and where it is known,
> as this is easier and more accurate.
>
> Thanks,
> Wouter
>
> On Thu, Mar 30, 2017 at 2:16 PM, Allison, Timothy B. <[email protected]>
> wrote:
>
>> If you have any recommendations for the more general case, let us know on
>> TIKA-1443 [1].
>>
>> [1] https://issues.apache.org/jira/browse/TIKA-1443
>>
>> -----Original Message-----
>> From: Wouter De Borger [mailto:[email protected]]
>> Sent: Thursday, March 30, 2017 6:00 AM
>> To: [email protected]
>> Subject: Make PDFBox fail on bad pdf
>>
>> Hi All,
>>
>> When a pdf has bad encoding, PDFBox produces garbage (as explained in the
>> FAQ https://pdfbox.apache.org/2.0/faq.html#gibberish).
>>
>> Can I make PDFBox fail in this case instead of producing garbage?
>>
>> (I'm working on a system that can also do OCR, so at the least sign of
>> trouble, I would like PDF box to fail and try OCR.)
>>
>> Thanks,
>> Wouter
>>
>
> --
> Wouter De Borger, PhD
> Co-founder Inmanta
> www.inmanta.com
> Email: [email protected]

