I'm working in a Spring/Tomcat container, so I'm reluctant to touch the
logging: I'm not sure whether Spring/Tomcat ever reloads or updates the
logging config.

Another option would be to create a subclass of PDFTextStripper,
override showText, grab the font after each call and extract the noUnicode
field from the font.

Both are rather hacky, but possible.
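
For reference, the log-capturing idea could look roughly like the sketch
below. It assumes PDFBox's commons-logging output ends up in
java.util.logging, which may well not be true in a Spring setup (with Logback
or Log4j the same idea would need an appender for that framework instead).
The class name UnmappedGlyphCounter and the exceeds() helper are made up for
illustration; the logger name and the message prefix are taken from the
warnings quoted below.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/**
 * Hypothetical helper: counts "No Unicode mapping" warnings while attached.
 * Assumes the PDFBox warnings are routed to java.util.logging.
 */
final class UnmappedGlyphCounter extends Handler implements AutoCloseable {
    // Logger name as it appears in the warnings quoted in this thread.
    static final String FONT_LOGGER = "org.apache.pdfbox.pdmodel.font.PDSimpleFont";

    // Strong reference, so the logger (and our handler) is not garbage collected.
    private final Logger logger = Logger.getLogger(FONT_LOGGER);
    private final AtomicInteger count = new AtomicInteger();

    UnmappedGlyphCounter() {
        logger.addHandler(this);
    }

    @Override
    public void publish(LogRecord record) {
        String message = record.getMessage();
        // Message prefix as seen in the quoted log output.
        if (message != null && message.startsWith("No Unicode mapping")) {
            count.incrementAndGet();
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered.
    }

    @Override
    public void close() {
        // Detach, so later extractions are not counted by this instance.
        logger.removeHandler(this);
    }

    int getCount() {
        return count.get();
    }

    /** True if more than 'threshold' warnings were seen. */
    boolean exceeds(int threshold) {
        return count.get() > threshold;
    }
}
```

The idea would be to create the counter right before running the text
extraction, check exceeds(threshold) afterwards to decide whether to fall
back to OCR, and close() it in a finally block. Note that it counts warnings
from all threads, so concurrent extractions would be mixed together.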

Thanks
Wouter

On Thu, Mar 30, 2017 at 3:51 PM, Maruan Sahyoun <[email protected]>
wrote:

> An option that doesn't require changing PDFBox would be to create a custom
> log appender and capture the org.apache.pdfbox.pdmodel.font.PDSimpleFont log
> messages. You could then count them afterwards and, if they exceed a
> certain threshold, discard the result of the text extraction.
>
> > On 30.03.2017 at 14:54, Wouter De Borger <[email protected]> wrote:
> >
> > Oh, sorry, my bad.
> >
> > The log lines are:
> >
> > 2017-46-30 14:46:04.788 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c49 (86) in font null
> > 2017-46-30 14:46:04.788 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c103 (87) in font null
> > 2017-46-30 14:46:04.789 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c59 (88) in font null
> > 2017-46-30 14:46:04.792 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c86 (89) in font null
> > 2017-46-30 14:46:04.792 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c122 (90) in font null
> > 2017-46-30 14:46:04.795 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c174 (32) in font null
> > 2017-46-30 14:46:04.795 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c104 (33) in font null
> > 2017-46-30 14:46:04.795 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c231 (34) in font null
> > 2017-46-30 14:46:04.796 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c175 (35) in font null
> > 2017-46-30 14:46:04.796 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c99 (36) in font null
> > 2017-46-30 14:46:04.796 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c98 (37) in font null
> > 2017-46-30 14:46:04.802 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c76 (32) in font null
> > 2017-46-30 14:46:04.802 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c101 (33) in font null
> > 2017-46-30 14:46:04.803 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c114 (34) in font null
> > 2017-46-30 14:46:04.803 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c109 (35) in font null
> > 2017-46-30 14:46:04.803 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c98 (36) in font null
> > 2017-46-30 14:46:04.803 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c111 (37) in font null
> > 2017-46-30 14:46:04.804 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c117 (38) in font null
> > 2017-46-30 14:46:04.804 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c115 (39) in font null
> > 2017-46-30 14:46:04.804 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c110 (40) in font null
> > 2017-46-30 14:46:04.804 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c116 (41) in font null
> > 2017-46-30 14:46:04.805 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c97 (42) in font null
> > 2017-46-30 14:46:04.805 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c108 (43) in font null
> > 2017-46-30 14:46:04.805 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c100 (44) in font null
> > 2017-46-30 14:46:04.805 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c39 (45) in font null
> >
> > I can't forward the PDF, as it contains banking information. I can try to
> > get permission to pass it on, but I don't have much hope.
> > The PDF is quite weird. It renders well, but pdftotext and Chrome are also
> > unable to get meaningful text out of it.
> > The PDF was created with PDF Converter 3.0.
> >
> > The header/footer of the PDF are extracted somewhat OK, but the body looks
> > like this:
> >
> > "#$%&'(
> > )*+$ ,% -,+$#$' .% /!&# $"0!'1%' 2&% 34 5"#&'+"6% %#7 .$#-!#8% 9 /!&#
> > 6!"#%"7$' &"% +/+"6% #&'
> > /!7'% 6!"7'+7 .*+##&'+"6% ": ;;<=>?;;@AA=><B
> > C%# 6!".$7$!"# %" #%'+$%"7 "!7+11%"7 ,%# #&$/+"7%# D
> > !"7+"7 "!1$"+,
> > D >BE;;(;; FGH
> > ... I&'8%
> > D ? +"#
> > J+&K
> > D ?(L?; M ,*+"
> > N$"78'O7# -+P+Q,%# +"7$6$-+7$/%1%"7 -+' -8'$!.% .*&" +"R
> > S$ ,%# 1!.+,$78# .% 6%77% +/+"6% .8/%,!--8%# .+"# 6% 6!&''$%' /!&#
> > 6!"/$%""%"7( T% /!&# $"/$7% 9
> > 1% '%"/!P%' ,%# -$U6%# #&$/+"7%# +/+"7 ,% @; +/'$, A;V? D
> > .. ,% .!&Q,% .% ,+ -'8#%"7% ,%77'%( .+78 %7 #$W"8X
> > /!7'% 6!"7'+7 .*+##&'+"6% %7 7!&# ,%# +/%"+"7#( 2&$ #%'!"7 6!"#%'/8# -+'
> 34
> > 5"#&'+"6% -%".+"7
> > ,+ .&'8% .% ,*+/+"6%B
> > IU# '86%-7$!" .% 6%# .!6&1%"7#( ,% 1!"7+"7 "!1$"+, .% ,*+/+"6%( .$1$"&8
> .%#
> > $"78'O7# +"7$6$-8#
> > -!&' ,+ -'%1$U'% -8'$!.%( /!&# #%'+ -+P8B
> > C+ .+7% .% -'$#% .% 6!&'# .% /!7'% +/+"6% #&' 6!"7'+7 %#7 0$K8% +& V%' .&
> > 1!$# .% ,+ '86%-7$!" -+'
> > 34 5"#&'+"6% .& .!&Q,% #$W"8 .% ,+ -'8#%"7%B C%#
> >
> > Wouter
> >
> > On Thu, Mar 30, 2017 at 2:42 PM, Maruan Sahyoun <[email protected]>
> > wrote:
> >
> >>
> >>> On 30.03.2017 at 14:37, Wouter De Borger <[email protected]> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Well, PDFBox does know it can't decode the Unicode characters (as it
> >>> outputs a stream of warnings). It would be nice if I could ask PDFBox how
> >>> many undecodable characters a document has.
> >>
> >> Well, that's something you didn't mention before. Could you post some of
> >> the messages here so we know which ones you are talking about?
> >>
> >> BR
> >> Maruan
> >>
> >>>
> >>> Wouter
> >>>
> >>> On Thu, Mar 30, 2017 at 2:29 PM, Maruan Sahyoun <[email protected]>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>>> On 30.03.2017 at 14:25, Wouter De Borger <[email protected]> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> Thanks for the hint! I'll try to add some content there, as I can
> >>>>> definitely use a garbage detector.
> >>>>>
> >>>>> In this case, however, I was specifically trying to avoid using a
> >>>>> statistical detector. PDFBox already knows there is a problem,
> >>>>
> >>>> That is not the case here. From PDFBox's perspective everything is fine.
> >>>> It's extracting the text according to the definitions and information in
> >>>> the PDF. Recognizing that this is garbage from a user's perspective would
> >>>> require PDFBox to 'understand' that the extracted text is not meaningful.
> >>>> BR
> >>>> Maruan
> >>>>
> >>>>> so there is
> >>>>> no need to examine the content to attempt to detect a problem.
> >>>>> I would like to be able to capture the problem when and where it is known,
> >>>>> as this is easier and more accurate.
> >>>>>
> >>>>> Thanks,
> >>>>> Wouter
> >>>>>
> >>>>> On Thu, Mar 30, 2017 at 2:16 PM, Allison, Timothy B. <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> If you have any recommendations for the more general case, let us know on
> >>>>>> TIKA-1443 [1].
> >>>>>>
> >>>>>> [1] https://issues.apache.org/jira/browse/TIKA-1443
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Wouter De Borger [mailto:[email protected]]
> >>>>>> Sent: Thursday, March 30, 2017 6:00 AM
> >>>>>> To: [email protected]
> >>>>>> Subject: Make PDFBox fail on bad pdf
> >>>>>>
> >>>>>> Hi All,
> >>>>>>
> >>>>>> When a PDF has bad encoding, PDFBox produces garbage (as explained in the
> >>>>>> FAQ https://pdfbox.apache.org/2.0/faq.html#gibberish).
> >>>>>>
> >>>>>> Can I make PDFBox fail in this case instead of producing garbage?
> >>>>>>
> >>>>>> (I'm working on a system that can also do OCR, so at the first sign of
> >>>>>> trouble, I would like PDFBox to fail so I can try OCR.)
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Wouter
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Wouter De Borger, PhD
> >>>>> Co-founder Inmanta
> >>>>> www.inmanta.com
> >>>>> Email: [email protected]
> >>>>
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: [email protected]
> >>>> For additional commands, e-mail: [email protected]


-- 
Wouter De Borger, PhD
Co-founder Inmanta
www.inmanta.com
Email: [email protected]
