Hi,

Thanks for the hint! I'll try to add some content there, as I can
definitely use a garbage detector.

In this case, however, I was specifically trying to avoid using a
statistical detector. PDFBox already knows there is a problem, so there is
no need to inspect the extracted text afterwards to detect it.
I would like to be able to capture the problem when and where it is known,
as this is easier and more accurate.
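
To make the idea concrete, here is a minimal sketch of the control flow I
have in mind. Everything in it is hypothetical: UnmappedGlyphException and
ExtractWithOcrFallback are made-up names, not existing PDFBox or Tika
classes, and the two extractors are passed in as plain functions so the
sketch stays independent of any real API.

```java
import java.util.function.Function;

// Hypothetical exception: raised by the text extractor at the point where
// it already knows that a glyph has no usable Unicode mapping.
class UnmappedGlyphException extends RuntimeException {
    UnmappedGlyphException(String message) {
        super(message);
    }
}

// Hypothetical wrapper: try plain text extraction first and fall back to
// OCR only when the extractor itself reports an encoding problem.
final class ExtractWithOcrFallback {
    private ExtractWithOcrFallback() {
    }

    static String extract(String path,
                          Function<String, String> textExtractor,
                          Function<String, String> ocrExtractor) {
        try {
            return textExtractor.apply(path);
        } catch (UnmappedGlyphException e) {
            // The problem was signalled where it was detected, so the
            // extracted text never needs to be inspected.
            return ocrExtractor.apply(path);
        }
    }
}
```

The missing piece is the first function: something on the PDFBox side that
throws instead of silently emitting garbage.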

Thanks,
Wouter

On Thu, Mar 30, 2017 at 2:16 PM, Allison, Timothy B. <[email protected]>
wrote:

> If you have any recommendations for the more general case, let us know on
> TIKA-1443 [1].
>
> [1] https://issues.apache.org/jira/browse/TIKA-1443
>
> -----Original Message-----
> From: Wouter De Borger [mailto:[email protected]]
> Sent: Thursday, March 30, 2017 6:00 AM
> To: [email protected]
> Subject: Make PDFBox fail on bad pdf
>
> Hi All,
>
> When a pdf has bad encoding, PDFBox produces garbage (as explained in the
> FAQ https://pdfbox.apache.org/2.0/faq.html#gibberish).
>
> Can I make PDFBox fail in this case instead of producing garbage?
>
> (I'm working on a system that can also do OCR, so at the first sign of
> trouble, I would like PDFBox to fail so that I can fall back to OCR.)
>
> Thanks,
> Wouter
>



-- 
Wouter De Borger, PhD
Co-founder Inmanta
www.inmanta.com
Email: [email protected]
