Hi,

Well, PDFBox does know it can't decode the Unicode characters (as it
outputs a stream of warnings). It would be nice if I could ask PDFBox how
many undecodable characters a document has.
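
Since those warnings are the only signal available here, one workaround
would be to count them. Below is a minimal sketch, not a PDFBox API. It
assumes that PDFBox 2.0 logs through commons-logging onto java.util.logging
(the default when no other logging backend is on the classpath), that its
loggers live under "org.apache.pdfbox", and that the warning text contains
"No Unicode mapping". The class name UnmappedGlyphCounter is hypothetical,
and the logger call in main is only a stand-in for a real extraction run.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical helper: counts warnings that look like unmapped-glyph reports. */
final class UnmappedGlyphCounter extends Handler {
    // Assumed substring of the PDFBox warning; verify it against your own log output.
    static final String MARKER = "No Unicode mapping";

    private final AtomicInteger count = new AtomicInteger();

    @Override
    public void publish(LogRecord record) {
        String msg = record.getMessage();
        boolean isWarning = record.getLevel().intValue() >= Level.WARNING.intValue();
        if (isWarning && msg != null && msg.contains(MARKER)) {
            count.incrementAndGet();
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered.
    }

    @Override
    public void close() {
        // No resources to release.
    }

    int count() {
        return count.get();
    }

    public static void main(String[] args) {
        // Assumed parent logger name for all PDFBox classes.
        Logger pdfboxLogger = Logger.getLogger("org.apache.pdfbox");
        UnmappedGlyphCounter counter = new UnmappedGlyphCounter();
        pdfboxLogger.addHandler(counter);
        try {
            // The real text extraction would run here. As a stand-in,
            // emit one warning the way a PDFBox font class might.
            Logger.getLogger("org.apache.pdfbox.pdmodel.font.PDSimpleFont")
                  .warning("No Unicode mapping for a123 (5) in font ABCDEF+Foo");
        } finally {
            // Always detach, so the handler does not leak into later runs.
            pdfboxLogger.removeHandler(counter);
        }
        System.out.println("Unmapped-glyph warnings: " + counter.count());
    }
}
```

A nonzero count could then be the trigger to fall back to OCR. Note that
this counts warnings, not characters: if PDFBox logs each unmapped code only
once per font, the number is a lower bound on the undecodable characters.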

Wouter

On Thu, Mar 30, 2017 at 2:29 PM, Maruan Sahyoun <sahy...@fileaffairs.de>
wrote:

> Hi,
>
> > On 30.03.2017 at 14:25, Wouter De Borger <wouter.debor...@inmanta.com>
> > wrote:
> >
> > Hi,
> >
> > Thanks for the hint! I'll try to add some content there, as I can
> > definitely use a garbage detector.
> >
> > In this case, however, I was specifically trying to avoid using a
> > statistical detector. PDFBox already knows there is a problem,
>
> that is not the case here. From PDFBox's perspective everything is fine.
> It's extracting the text according to the definition and information in the
> PDF. Recognizing this as garbage from a user's perspective would require
> PDFBox to 'understand' that the extracted text is not meaningful.
> BR
> Maruan
>
> > so there is
> > no need to examine the content to attempt to detect a problem.
> > I would like to be able to capture the problem when and where it is known,
> > as this is easier and more accurate.
> >
> > Thanks,
> > Wouter
> >
> > On Thu, Mar 30, 2017 at 2:16 PM, Allison, Timothy B. <talli...@mitre.org>
> > wrote:
> >
> >> If you have any recommendations for the more general case, let us know on
> >> TIKA-1443 [1].
> >>
> >> [1] https://issues.apache.org/jira/browse/TIKA-1443
> >>
> >> -----Original Message-----
> >> From: Wouter De Borger [mailto:wouter.debor...@inmanta.com]
> >> Sent: Thursday, March 30, 2017 6:00 AM
> >> To: users@pdfbox.apache.org
> >> Subject: Make PDFBox fail on bad pdf
> >>
> >> Hi All,
> >>
> >> When a PDF has bad encoding, PDFBox produces garbage (as explained in the
> >> FAQ https://pdfbox.apache.org/2.0/faq.html#gibberish).
> >>
> >> Can I make PDFBox fail in this case instead of producing garbage?
> >>
> >> (I'm working on a system that can also do OCR, so at the first sign of
> >> trouble, I would like PDFBox to fail and try OCR.)
> >>
> >> Thanks,
> >> Wouter
> >>
> >
> >
> >
> > --
> > Wouter De Borger, PhD
> > Co-founder Inmanta
> > www.inmanta.com
> > Email: wouter.debor...@inmanta.com


-- 
Wouter De Borger, PhD
Co-founder Inmanta
www.inmanta.com
Email: wouter.debor...@inmanta.com
