Re: Make PDFBox fail on bad pdf

Wouter De Borger Thu, 30 Mar 2017 05:55:19 -0700

Oh, sorry, my bad.

The log lines are:


2017-46-30 14:46:04.788 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c49 (86) in font null
2017-46-30 14:46:04.788 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c103 (87) in font null
2017-46-30 14:46:04.789 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c59 (88) in font null
2017-46-30 14:46:04.792 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c86 (89) in font null
2017-46-30 14:46:04.792 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c122 (90) in font null
2017-46-30 14:46:04.795 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c174 (32) in font null
2017-46-30 14:46:04.795 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c104 (33) in font null
2017-46-30 14:46:04.795 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c231 (34) in font null
2017-46-30 14:46:04.796 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c175 (35) in font null
2017-46-30 14:46:04.796 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c99 (36) in font null
2017-46-30 14:46:04.796 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c98 (37) in font null
2017-46-30 14:46:04.802 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c76 (32) in font null
2017-46-30 14:46:04.802 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c101 (33) in font null
2017-46-30 14:46:04.803 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c114 (34) in font null
2017-46-30 14:46:04.803 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c109 (35) in font null
2017-46-30 14:46:04.803 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c98 (36) in font null
2017-46-30 14:46:04.803 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c111 (37) in font null
2017-46-30 14:46:04.804 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c117 (38) in font null
2017-46-30 14:46:04.804 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c115 (39) in font null
2017-46-30 14:46:04.804 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c110 (40) in font null
2017-46-30 14:46:04.804 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c116 (41) in font null
2017-46-30 14:46:04.805 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c97 (42) in font null
2017-46-30 14:46:04.805 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c108 (43) in font null
2017-46-30 14:46:04.805 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c100 (44) in font null
2017-46-30 14:46:04.805 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c39 (45) in font null

I can't forward the PDF, as it contains banking information. I can try to
get permission to pass it on, but I don't have much hope.
The PDF is quite weird. It renders well, but pdftotext and Chrome are also
unable to get meaningful text out of it.
The PDF was created with PDF Converter 3.0.
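
The warnings themselves could serve as the signal. Below is a minimal
sketch, assuming PDFBox's commons-logging output ends up in
java.util.logging. That is an assumption: the log format above suggests a
different backend, where the equivalent would be a custom appender. The
class name UnmappedGlyphCounter and the shouldFallBackToOcr policy are
hypothetical; only the logger name and the message text come from the log
lines.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Counts "No Unicode mapping" warnings seen while a document is processed. */
class UnmappedGlyphCounter extends Handler {
    // Logger name as it appears in the log lines.
    static final String FONT_LOGGER = "org.apache.pdfbox.pdmodel.font.PDSimpleFont";

    private final AtomicInteger count = new AtomicInteger();

    @Override
    public void publish(LogRecord record) {
        String msg = record.getMessage();
        // Only warnings (or worse) with the expected message prefix are counted.
        if (record.getLevel().intValue() >= Level.WARNING.intValue()
                && msg != null && msg.startsWith("No Unicode mapping")) {
            count.incrementAndGet();
        }
    }

    @Override
    public void flush() { }

    @Override
    public void close() { }

    int getCount() {
        return count.get();
    }

    /** Hypothetical policy: give up on text extraction above a threshold. */
    boolean shouldFallBackToOcr(int maxUnmapped) {
        return count.get() > maxUnmapped;
    }

    public static void main(String[] args) {
        Logger logger = Logger.getLogger(FONT_LOGGER);
        UnmappedGlyphCounter counter = new UnmappedGlyphCounter();
        logger.addHandler(counter);
        try {
            // In real use the PDFBox text extraction would run here;
            // these two calls only simulate the warnings.
            logger.warning("No Unicode mapping for c49 (86) in font null");
            logger.warning("No Unicode mapping for c103 (87) in font null");
        } finally {
            logger.removeHandler(counter);
        }
        System.out.println("unmapped: " + counter.getCount()
                + ", OCR: " + counter.shouldFallBackToOcr(0));
    }
}
```

PDFBox appears to log each unmapped code only once per font, so the count
is better read as "distinct unmapped codes" than as a character total. The
handler is also global, so concurrent extractions would need separate
bookkeeping.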

The header/footer of the PDF are extracted somewhat OK, but the body looks
like this:

"#$%&'(
)*+$ ,% -,+$#$' .% /!&# $"0!'1%' 2&% 34 5"#&'+"6% %#7 .$#-!#8% 9 /!&#
6!"#%"7$' &"% +/+"6% #&'
/!7'% 6!"7'+7 .*+##&'+"6% ": ;;<=>?;;@AA=><B
C%# 6!".$7$!"# %" #%'+$%"7 "!7+11%"7 ,%# #&$/+"7%# D
!"7+"7 "!1$"+,
D >BE;;(;; FGH
... I&'8%
D ? +"#
J+&K
D ?(L?; M ,*+"
N$"78'O7# -+P+Q,%# +"7$6$-+7$/%1%"7 -+' -8'$!.% .*&" +"R
S$ ,%# 1!.+,$78# .% 6%77% +/+"6% .8/%,!--8%# .+"# 6% 6!&''$%' /!&#
6!"/$%""%"7( T% /!&# $"/$7% 9
1% '%"/!P%' ,%# -$U6%# #&$/+"7%# +/+"7 ,% @; +/'$, A;V? D
.. ,% .!&Q,% .% ,+ -'8#%"7% ,%77'%( .+78 %7 #$W"8X
/!7'% 6!"7'+7 .*+##&'+"6% %7 7!&# ,%# +/%"+"7#( 2&$ #%'!"7 6!"#%'/8# -+' 34
5"#&'+"6% -%".+"7
,+ .&'8% .% ,*+/+"6%B
IU# '86%-7$!" .% 6%# .!6&1%"7#( ,% 1!"7+"7 "!1$"+, .% ,*+/+"6%( .$1$"&8 .%#
$"78'O7# +"7$6$-8#
-!&' ,+ -'%1$U'% -8'$!.%( /!&# #%'+ -+P8B
C+ .+7% .% -'$#% .% 6!&'# .% /!7'% +/+"6% #&' 6!"7'+7 %#7 0$K8% +& V%' .&
1!$# .% ,+ '86%-7$!" -+'
34 5"#&'+"6% .& .!&Q,% #$W"8 .% ,+ -'8#%"7%B C%#
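
For comparison, a content-based check on a body like the one above could be
as crude as the following sketch. It is not PDFBox functionality; the class
GibberishCheck, its method names and the 0.5 threshold are all hypothetical.
It works on this sample only because most letters were mapped to
punctuation; a font that maps letters to other letters would pass
unnoticed.

```java
/** Crude check: which fraction of non-whitespace characters are letters or digits? */
final class GibberishCheck {

    static double alphanumericRatio(String text) {
        int total = 0;
        int alnum = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // whitespace says nothing about the encoding
            }
            total++;
            if (Character.isLetterOrDigit(cp)) {
                alnum++;
            }
        }
        // Empty input is treated as "not gibberish".
        return total == 0 ? 1.0 : (double) alnum / total;
    }

    static boolean looksLikeGibberish(String text, double minRatio) {
        return alphanumericRatio(text) < minRatio;
    }

    public static void main(String[] args) {
        // One line of the extracted body and, for contrast, readable text.
        String body = "/!7'% 6!\"7'+7 .*+##&'+\"6% %7 7!&# ,%# +/%\"+\"7#";
        String readable = "Make PDFBox fail on bad pdf";
        System.out.println(looksLikeGibberish(body, 0.5));
        System.out.println(looksLikeGibberish(readable, 0.5));
    }
}
```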

Wouter

On Thu, Mar 30, 2017 at 2:42 PM, Maruan Sahyoun <[email protected]>
wrote:

>
> > On 30.03.2017 at 14:37, Wouter De Borger <[email protected]> wrote:
> >
> > Hi,
> >
> > Well, PDFBox does know it can't decode the Unicode characters (as it
> > outputs a stream of warnings). It would be nice if I could ask PDFBox how
> > many undecodable characters a document has.
>
> well, that's something you didn't mention before - could you drop some of
> the messages here so we know which one you are talking about?
>
> BR
> Maruan
>
> >
> > Wouter
> >
> > On Thu, Mar 30, 2017 at 2:29 PM, Maruan Sahyoun <[email protected]>
> > wrote:
> >
> >> Hi,
> >>
> >>> On 30.03.2017 at 14:25, Wouter De Borger <[email protected]> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Thanks for the hint! I'll try to add some content there, as I can
> >>> definitely use a garbage detector.
> >>>
> >>> In this case, however, I was specifically trying to avoid using a
> >>> statistical detector. PDFBox already knows there is a problem,
> >>
> >> that is not the case here. From PDFBox's perspective everything is fine.
> >> It's extracting the text according to the definition and information in
> >> the PDF. That this is garbage from a user's perspective would mean that
> >> PDFBox 'understands' that the extracted text is not meaningful.
> >> BR
> >> Maruan
> >>
> >>> so there is
> >>> no need to examine the content to attempt to detect a problem.
> >>> I would like to be able to capture the problem when and where it is
> >>> known, as this is easier and more accurate.
> >>>
> >>> Thanks,
> >>> Wouter
> >>>
> >>> On Thu, Mar 30, 2017 at 2:16 PM, Allison, Timothy B. <
> [email protected]
> >>>
> >>> wrote:
> >>>
> >>>> If you have any recommendations for the more general case, let us know
> >>>> on TIKA-1443 [1].
> >>>>
> >>>> [1] https://issues.apache.org/jira/browse/TIKA-1443
> >>>>
> >>>> -----Original Message-----
> >>>> From: Wouter De Borger [mailto:[email protected]]
> >>>> Sent: Thursday, March 30, 2017 6:00 AM
> >>>> To: [email protected]
> >>>> Subject: Make PDFBox fail on bad pdf
> >>>>
> >>>> Hi All,
> >>>>
> >>>> When a PDF has bad encoding, PDFBox produces garbage (as explained in
> >>>> the FAQ https://pdfbox.apache.org/2.0/faq.html#gibberish).
> >>>>
> >>>> Can I make PDFBox fail in this case instead of producing garbage?
> >>>>
> >>>> (I'm working on a system that can also do OCR, so at the least sign of
> >>>> trouble, I would like PDFBox to fail and try OCR.)
> >>>>
> >>>> Thanks,
> >>>> Wouter
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Wouter De Borger, PhD
> >>> Co-founder Inmanta
> >>> www.inmanta.com
> >>> Email: [email protected]
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >>
> >
> >
> > --
> > Wouter De Borger, PhD
> > Co-founder Inmanta
> > www.inmanta.com
> > Email: [email protected]


-- 
Wouter De Borger, PhD
Co-founder Inmanta
www.inmanta.com
Email: [email protected]
