If you have any recommendations for the more general case, let us know on
TIKA-1443 [1].
[1] https://issues.apache.org/jira/browse/TIKA-1443
-----Original Message-----
From: Wouter De Borger [mailto:wouter.debor...@inmanta.com]
Sent: Thursday, March 30, 2017 6:00 AM
To: users@pdfbox.apache.org
Hi,
> On 30.03.2017 at 14:25, Wouter De Borger wrote:
>
> Hi,
>
> Thanks for the hint! I'll try to add some content there, as I can
> definitely use a garbage detector.
>
> In this case, however, I was specifically trying to avoid using a
> statistical detector.
> On 30.03.2017 at 14:37, Wouter De Borger wrote:
>
> Hi,
>
> Well, PDFBox does know it can't decode the Unicode characters (as it
> outputs a stream of warnings). It would be nice if I could ask PDFBox how
> many undecodable characters a document has.
well,
Oh, sorry, my bad.
The log lines are:
2017-46-30 14:46:04.788 WARN ---
[DefaultMessageListenerContainer-1]
org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
: No Unicode mapping for c49 (86) in font null
2017-46-30 14:46:04.788 WARN ---
Hi,
Thanks for the hint! I'll try to add some content there, as I can
definitely use a garbage detector.
In this case, however, I was specifically trying to avoid using a
statistical detector. PDFBox already knows there is a problem, so there is
no need to examine the content to attempt to
Hi,
Well, PDFBox does know it can't decode the Unicode characters (as it
outputs a stream of warnings). It would be nice if I could ask PDFBox how
many undecodable characters a document has.
Wouter
On Thu, Mar 30, 2017 at 2:29 PM, Maruan Sahyoun
wrote:
> Hi,
>
> > On
Hi All,
When a PDF has bad encoding, PDFBox produces garbage (as explained in the
FAQ https://pdfbox.apache.org/2.0/faq.html#gibberish).
Can I make PDFBox fail in this case instead of producing garbage?
(I'm working on a system that can also do OCR, so at the least sign of
trouble, I would like
I would like to change the default font of every field in an existing PDF
form. This suggests to me that I should change the default appearance of the
acroform and, to be on the safe side (since I am not the author of these
forms), change the default appearance of each PDTextField (I don't care
about
Hi,
> On 30.03.2017 at 15:54, Evan Williams wrote:
>
> I would like to change the default font of every field in an existing PDF
> form. This suggests to me that I should change the default appearance of the
> acroform and, to be on the safe side (since I am not the author
An option that doesn't require changing PDFBox would be to create a custom log
appender and capture the org.apache.pdfbox.pdmodel.font.PDSimpleFont log
messages. You could then count them afterwards and, if the count exceeds a
certain threshold, decide to discard the result of the text extraction.
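
A minimal sketch of that counting idea, under some loudly stated assumptions:
it only works if PDFBox's commons-logging output ends up in java.util.logging
(which may not be the case in a Spring/Tomcat setup that bridges to another
logging framework), and the class name UnmappedGlyphCounter is made up for
illustration; it is not part of PDFBox. The logger name and the message prefix
are taken from the log lines quoted in this thread.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical helper: counts "No Unicode mapping" log records. */
class UnmappedGlyphCounter extends Handler {
    // Logger name as it appears in the warnings quoted in this thread.
    static final String LOGGER_NAME = "org.apache.pdfbox.pdmodel.font.PDSimpleFont";
    // Message prefix as it appears in the same warnings.
    static final String PREFIX = "No Unicode mapping";

    private final AtomicInteger count = new AtomicInteger();

    @Override
    public void publish(LogRecord record) {
        if (record == null || !isLoggable(record)) {
            return;
        }
        String message = record.getMessage();
        if (message != null && message.startsWith(PREFIX)) {
            count.incrementAndGet();
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered.
    }

    @Override
    public void close() {
        // No resources to release.
    }

    int getCount() {
        return count.get();
    }

    public static void main(String[] args) {
        Logger logger = Logger.getLogger(LOGGER_NAME);
        UnmappedGlyphCounter counter = new UnmappedGlyphCounter();
        logger.addHandler(counter);
        try {
            // The real text extraction would run here. This line only
            // simulates the warning PDFBox logged in the thread.
            logger.warning("No Unicode mapping for c49 (86) in font null");

            int threshold = 0; // assumed value: reject on the first warning
            if (counter.getCount() > threshold) {
                System.out.println("Discarding extracted text, falling back to OCR");
            }
        } finally {
            // Detach the handler so later extractions are not counted here.
            logger.removeHandler(counter);
        }
    }
}
```

Note that the logger is shared across the whole JVM, so this sketch assumes
only one extraction runs at a time; concurrent extractions would be counted
together.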
> On 30.03.2017 at
I'm working in a Spring/Tomcat container, so I'm reluctant to mess with the
logging, as I'm not quite sure whether Spring/Tomcat ever reloads/updates the
logging config.
Another option would be to create a subclass of PDFTextStripper,
override showText, grab the font after each call and extract the
Thank you Maruan!
That solution works perfectly.
At last my users will be able to enjoy illegibly tiny 4 pt text in
autosized fields (which, sadly, they believe that they want).
On Thu, Mar 30, 2017 at 10:38 AM, Maruan Sahyoun
wrote:
> Hi,
>
> > On 30.03.2017 at 15:54
The problem is that some files do this as an obfuscation technique.
What might be detected is fonts that don't have Unicode extraction. See
in LegacyPDFStreamEngine "if (unicode == null)". Make your own or extend
it and check for TextPosition objects with unicode null. (See
PrintTextLocations
On 31.03.2017 at 07:21, karthick g wrote:
Hi Team,
Apologies for sending the previous mail to the developer team. Please guide me
* There was a problem when loading a PDF file into Adobe Acrobat.
*"There was a problem reading this document (18)" (error shown by
Acrobat)*
* When running the
Hi Team,
Apologies for sending the previous mail to the developer team. Please guide me
* There was a problem when loading a PDF file into Adobe Acrobat.
*"There was a problem reading this document (18)" (error shown by
Acrobat)*
* When running the same file through PDFBox, it works fine.