> On 30.03.2017 at 14:37, Wouter De Borger <[email protected]> wrote:
>
> Hi,
>
> Well, PDFBox does know it can't decode the Unicode characters (as it
> outputs a stream of warnings). It would be nice if I could ask PDFBox how
> many undecodable characters a document has.
Well, that's something you didn't mention before. Could you post some of the messages here so we know which ones you are talking about?

BR
Maruan

>
> Wouter
>
> On Thu, Mar 30, 2017 at 2:29 PM, Maruan Sahyoun <[email protected]> wrote:
>
>> Hi,
>>
>>> On 30.03.2017 at 14:25, Wouter De Borger <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> Thanks for the hint! I'll try to add some content there, as I can
>>> definitely use a garbage detector.
>>>
>>> In this case, however, I was specifically trying to avoid using a
>>> statistical detector. PDFBox already knows there is a problem,
>>
>> That is not the case here. From PDFBox's perspective everything is fine.
>> It's extracting the text according to the definition and information in the
>> PDF. Flagging this as garbage from a user's perspective would require PDFBox
>> to 'understand' that the extracted text is not meaningful.
>> BR
>> Maruan
>>
>>> so there is
>>> no need to examine the content to attempt to detect a problem.
>>> I would like to be able to capture the problem when and where it is known,
>>> as this is easier and more accurate.
>>>
>>> Thanks,
>>> Wouter
>>>
>>> On Thu, Mar 30, 2017 at 2:16 PM, Allison, Timothy B. <[email protected]> wrote:
>>>
>>>> If you have any recommendations for the more general case, let us know on
>>>> TIKA-1443 [1].
>>>>
>>>> [1] https://issues.apache.org/jira/browse/TIKA-1443
>>>>
>>>> -----Original Message-----
>>>> From: Wouter De Borger [mailto:[email protected]]
>>>> Sent: Thursday, March 30, 2017 6:00 AM
>>>> To: [email protected]
>>>> Subject: Make PDFBox fail on bad pdf
>>>>
>>>> Hi All,
>>>>
>>>> When a pdf has bad encoding, PDFBox produces garbage (as explained in the
>>>> FAQ https://pdfbox.apache.org/2.0/faq.html#gibberish).
>>>>
>>>> Can I make PDFBox fail in this case instead of producing garbage?
>>>>
>>>> (I'm working on a system that can also do OCR, so at the least sign of
>>>> trouble, I would like PDFBox to fail and try OCR.)
>>>>
>>>> Thanks,
>>>> Wouter
>>>>
>>>
>>> --
>>> Wouter De Borger, PhD
>>> Co-founder Inmanta
>>> www.inmanta.com
>>> Email: [email protected]
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
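
The thread does not settle whether PDFBox can report a count of undecodable characters. As a rough stand-in, in the spirit of the garbage detector mentioned for TIKA-1443, one could inspect the extracted text itself for code points that tend to appear when a Unicode mapping fails. The sketch below is only a heuristic and not part of PDFBox: the class `SuspiciousCharCounter`, its methods, and the choice of "suspicious" categories (replacement character, private-use, unassigned, stray control characters) are all assumptions made for illustration.

```java
public class SuspiciousCharCounter {

    /**
     * Counts code points that often indicate a failed Unicode mapping.
     * The categories checked here are an assumption, not a PDFBox rule.
     */
    public static int countSuspicious(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isSuspicious(cp)) {
                count++;
            }
        }
        return count;
    }

    /** Returns true for U+FFFD, private-use, unassigned, and non-whitespace control code points. */
    static boolean isSuspicious(int cp) {
        if (cp == 0xFFFD) {
            return true; // Unicode replacement character
        }
        int type = Character.getType(cp);
        if (type == Character.PRIVATE_USE || type == Character.UNASSIGNED) {
            return true;
        }
        // Control characters other than the usual line breaks and tabs
        return type == Character.CONTROL && cp != '\n' && cp != '\r' && cp != '\t';
    }

    /** Fraction of suspicious code points in the text; 0.0 for empty input. */
    public static double suspiciousRatio(String text) {
        int total = text.codePointCount(0, text.length());
        return total == 0 ? 0.0 : (double) countSuspicious(text) / total;
    }
}
```

A caller could run this on the string returned by `PDFTextStripper.getText(document)` and fall back to OCR when the ratio exceeds some threshold. Any such threshold would have to be tuned on real documents, and text that is garbled into ordinary-looking letters will not be caught by this check.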

