Hi Villu,

I added the ICU4J 3.8 JAR you recommended to my classpath, and it 
works very well on standard PDF files.
Ligatures are resolved and diacritics are displayed correctly (although as 
combined characters) for files created with Acrobat Distiller.
Files created with pdftex don't work as well: diacritical characters are 
completely lost (at least for my test file). With dvipdfmx I get "FRÉDÉRIC" 
correct, but also "D epartement de Math ematiques" for "Département de 
Mathématiques" and "Z urich" for "Zürich" (the original is in small capitals, 
which probably causes the problem).
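
On the "combined characters" mentioned above: where the extracted text keeps 
the base letter followed by a combining mark, it can be recomposed after 
extraction. The following is only a minimal sketch, assuming Java 6 or later 
and the JDK's own java.text.Normalizer (not PDFBox's TextNormalize); the class 
name ComposeDiacritics is made up for illustration. It cannot help where the 
accent has been lost or replaced by a space, as in "Z urich".

```java
import java.text.Normalizer;

public class ComposeDiacritics {

    // Recompose "base letter + combining mark" sequences into single
    // precomposed code points (Unicode normalization form NFC).
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'u' followed by U+0308 COMBINING DIAERESIS, as a decomposed "ü"
        String decomposed = "Zu\u0308rich";
        String composed = compose(decomposed);

        // The decomposed form has 7 chars, the composed form 6.
        System.out.println(decomposed.length() + " -> " + composed.length());
    }
}
```

Running this over the output file of ExtractText would be a separate 
post-processing step, not something PDFBox does in this form.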
One file claiming to be created with "LaTeX with hyperref package" using "dvips 
+ distiller" crashes:
Exception in thread "main" java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
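
The NoClassDefFoundError suggests that the Bouncy Castle JAR is not on the 
classpath. A quick way to see which optional JARs the JVM can actually find is 
a small check like the one below. This is a sketch under assumptions: the class 
name ClasspathCheck is hypothetical, com.ibm.icu.text.Normalizer is used only 
as a representative ICU4J class (which ICU4J classes PDFBox loads is not stated 
in this thread), and the Bouncy Castle class name is taken from the stack trace 
above.

```java
public class ClasspathCheck {

    // Returns true if the named class can be located on the current classpath.
    // The class is looked up without being initialized.
    static boolean isAvailable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            // Representative ICU4J class (assumption, see the note above)
            "com.ibm.icu.text.Normalizer",
            // Class named in the NoClassDefFoundError
            "org.bouncycastle.jce.provider.BouncyCastleProvider"
        };
        for (String name : names) {
            System.out.println(name + ": " + (isAvailable(name) ? "found" : "MISSING"));
        }
    }
}
```

It has to be run with the same classpath as the ExtractText command, otherwise 
the result says nothing about that setup.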

Thanks for the advice
Thomas


On 22.12.2009 at 09:33, Villu Ruusmann wrote:

> Hello there,
> 
>>> 
>>> The need for text post-processing depends on the class you're using for the 
>>> job.
>>> 
>>> Class org.apache.pdfbox.util.PDFTextStripper does it for you, because
>>> all texts are filtered through
>>> org.apache.pdfbox.util.TextNormalize#normalizeDiac(String)/#normalizePres(String)
>>> before they are exposed to the application programmer via methods like
>>> PDFTextStripper#writeString(String). However, it must be borne in mind
>>> that TextNormalize relies on the external ICU4J dependency - if it is
>>> not properly installed, then the original string is returned unchanged.
>>> 
>> This is interesting. I use PDFBox as a command line tool on my Mac:
>> java org.apache.pdfbox.ExtractText -encoding UTF-8 file.pdf file.txt
>> Is there a way to activate some post-processing if I do it this way?
>> Or shouldn't it be included automatically?
>> 
> 
> The command-line application org.apache.pdfbox.ExtractText uses class
> org.apache.pdfbox.util.PDFTextStripper internally. So, in principle,
> there shouldn't be any need for text post-processing if the ICU4J
> dependency is properly installed.
> 
> Since PDFBox JAR comes in many flavours, it is very hard for me to
> tell if you have it all right or not. I guess the easiest solution
> would be to download the ICU4J 3.8 JAR manually and append it to your
> command-line application's classpath. You can find the said JAR for
> example here:
> http://www.jarvana.com/jarvana/browse/com/ibm/icu/icu4j/3.8/
> 
> 
> VR
