Re: Make PDFBox fail on bad pdf

Maruan Sahyoun Thu, 30 Mar 2017 05:43:20 -0700

> On 30.03.2017 at 14:37, Wouter De Borger <[email protected]> wrote:
> 
> Hi,
> 
> Well, PDFBox does know that it can't decode the Unicode characters (as it
> outputs a stream of warnings). It would be nice if I could ask PDFBox how
> many undecodable characters a document has.


Well, that's something you didn't mention before. Could you post some of the
messages here so we know which ones you are talking about?
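
Until the exact messages are known, here is a rough sketch of one way to count
such warnings from the outside, without changing PDFBox. It assumes that the
log output ends up in java.util.logging (which, as far as I know, is what
Commons Logging falls back to when no other backend is configured) and that
the warnings start with "No Unicode mapping". The class name, the message
prefix, the logger name and the fallback policy are all assumptions made for
illustration, not PDFBox API. The simulated warning in main only stands in for
a real text extraction run.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Counts log records that look like "cannot map to Unicode" warnings. */
class UnicodeWarningCounter extends Handler {
    // Assumed message prefix; adjust it to the warnings you actually see.
    static final String ASSUMED_PREFIX = "No Unicode mapping";

    private final AtomicInteger count = new AtomicInteger();

    @Override
    public void publish(LogRecord record) {
        String msg = record.getMessage();
        // Only count WARNING (or more severe) records with the assumed prefix.
        if (record.getLevel().intValue() >= Level.WARNING.intValue()
                && msg != null && msg.startsWith(ASSUMED_PREFIX)) {
            count.incrementAndGet();
        }
    }

    @Override
    public void flush() { }

    @Override
    public void close() { }

    int count() {
        return count.get();
    }

    /** Hypothetical policy: any matching warning means "try OCR instead". */
    boolean shouldFallBackToOcr() {
        return count.get() > 0;
    }

    public static void main(String[] args) {
        // Assumed logger name: the package that holds the PDFBox font classes.
        Logger fontLog = Logger.getLogger("org.apache.pdfbox.pdmodel.font");
        UnicodeWarningCounter counter = new UnicodeWarningCounter();
        fontLog.addHandler(counter);
        try {
            // The real text extraction call would go here.
            // A simulated warning stands in for it so the sketch runs alone.
            fontLog.warning("No Unicode mapping for g12 (12) in font ExampleFont");
        } finally {
            fontLog.removeHandler(counter);
        }
        System.out.println(counter.count() + " warning(s), OCR fallback: "
                + counter.shouldFallBackToOcr());
    }
}
```

Such warnings may be logged only once per font and character code, so the
count is probably better read as "at least this many distinct problems" than
as the number of undecodable characters in the document.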

BR
Maruan 

> 
> Wouter
> 
> On Thu, Mar 30, 2017 at 2:29 PM, Maruan Sahyoun <[email protected]>
> wrote:
> 
>> Hi,
>> 
>>> On 30.03.2017 at 14:25, Wouter De Borger <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> Thanks for the hint! I'll try to add some content there, as I can
>>> definitely use a garbage detector.
>>> 
>>> In this case, however, I was specifically trying to avoid using a
>>> statistical detector. PDFBox already knows there is a problem,
>> 
>> That is not the case here. From PDFBox's perspective everything is fine:
>> it extracts the text according to the definitions and information in the
>> PDF. To recognize the result as garbage from a user's perspective, PDFBox
>> would have to 'understand' that the extracted text is not meaningful.
>> BR
>> Maruan
>> 
>>> so there is
>>> no need to examine the content to attempt to detect a problem.
>>> I would like to be able to capture the problem when and where it is known,
>>> as this is easier and more accurate.
>>> 
>>> Thanks,
>>> Wouter
>>> 
>>> On Thu, Mar 30, 2017 at 2:16 PM, Allison, Timothy B. <[email protected]>
>>> wrote:
>>> 
>>>> If you have any recommendations for the more general case, let us know on
>>>> TIKA-1443 [1].
>>>> 
>>>> [1] https://issues.apache.org/jira/browse/TIKA-1443
>>>> 
>>>> -----Original Message-----
>>>> From: Wouter De Borger [mailto:[email protected]]
>>>> Sent: Thursday, March 30, 2017 6:00 AM
>>>> To: [email protected]
>>>> Subject: Make PDFBox fail on bad pdf
>>>> 
>>>> Hi All,
>>>> 
>>>> When a PDF has bad encoding, PDFBox produces garbage (as explained in the
>>>> FAQ: https://pdfbox.apache.org/2.0/faq.html#gibberish).
>>>> 
>>>> Can I make PDFBox fail in this case instead of producing garbage?
>>>> 
>>>> (I'm working on a system that can also do OCR, so at the first sign of
>>>> trouble, I would like PDFBox to fail so that I can fall back to OCR.)
>>>> 
>>>> Thanks,
>>>> Wouter
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Wouter De Borger, PhD
>>> Co-founder Inmanta
>>> www.inmanta.com
>>> Email: [email protected]
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>> 
>> 
> 
> 
> -- 
> Wouter De Borger, PhD
> Co-founder Inmanta
> www.inmanta.com
> Email: [email protected]

