The problem here is that the ASCII encoding can't represent the German
umlauts, so they are replaced with the question marks you see in the
output.
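To illustrate the effect in isolation, here is a minimal sketch, not taken from the OpenNLP code; the class and method names are made up for this example. It encodes a string with an umlaut as US-ASCII and decodes it again, relying on the documented behaviour of String.getBytes(Charset) to substitute the charset's replacement byte for unmappable characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class AsciiReplacementDemo {

    // Encodes the text with the given charset and decodes it again with the same one.
    static String roundTrip(String text, Charset cs) {
        // Characters the charset cannot represent are replaced, not rejected.
        byte[] bytes = text.getBytes(cs);
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        // "Jörn", written with an escape so the source file encoding does not matter.
        String name = "J\u00f6rn";
        System.out.println(roundTrip(name, StandardCharsets.US_ASCII)); // prints "J?rn"
        System.out.println(roundTrip(name, StandardCharsets.UTF_8).equals(name)); // prints "true"
    }
}
```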
Any ideas on how we can improve this? In any case, if we can't do much about
it, we should at least document the workaround of manually setting the
encoding via the file.encoding system property.
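One possible direction, sketched here only as an assumption about how a command line tool could be structured, would be to wrap stdin and stdout with an explicit charset instead of relying on the platform default. The class and method names below are hypothetical and are not the actual OpenNLP code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitEncodingSketch {

    // Copies text line by line, decoding and encoding with the given charset
    // rather than with whatever default the JVM picked at startup.
    static void copyLines(InputStream in, OutputStream out, Charset cs) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, cs));
        PrintStream writer = new PrintStream(out, true, cs.name());
        String line;
        while ((line = reader.readLine()) != null) {
            writer.println(line);
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // UTF-8 is hard-coded here for brevity; a real tool would likely make it configurable.
        copyLines(System.in, System.out, StandardCharsets.UTF_8);
    }
}
```

With something like this the result no longer depends on file.encoding, at the cost of having to pick or expose the charset ourselves.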
Jörn
On 02/28/2013 06:29 PM, Stefan Matheis wrote:
On Thursday, February 28, 2013 at 5:26 PM, Jörn Kottmann wrote:
Hmm, I'm pretty sure there is an encoding mismatch. Do you know which
encoding is used by your JVM? I would guess that it is not UTF-8. You can
probably get around the issue by re-encoding the input file to the encoding
the JVM is using.
Have a look here:
http://stackoverflow.com/questions/1749064/how-to-find-default-charset-encoding-in-java
Would be nice if you can run the println statements there.
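For reference, a test along these lines is roughly what I mean. This is a sketch adapted from memory of the linked answer, so treat the details as an assumption rather than an exact copy. Note that it sets file.encoding to Latin-1 itself at runtime, so that value is what the second line prints regardless of how the JVM was started.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

class CharsetTest {

    // Name of the encoding an OutputStreamWriter picks when none is given,
    // i.e. the encoding actually used for default character I/O.
    static String charsetInUse() {
        OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
        return writer.getEncoding();
    }

    public static void main(String[] args) {
        System.out.println("Default Charset=" + Charset.defaultCharset());
        // Changing the property at runtime only changes what getProperty returns;
        // it is not expected to change the default charset of a running JVM.
        System.setProperty("file.encoding", "Latin-1");
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));
        System.out.println("Default Charset=" + Charset.defaultCharset());
        System.out.println("Default Charset in Use=" + charsetInUse());
    }
}
```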
Jörn
Wherever this comes from ..
$ java CharsetTest
Default Charset=US-ASCII
file.encoding=Latin-1
Default Charset=US-ASCII
Default Charset in Use=ASCII
$ echo $JAVA_TOOL_OPTIONS
(empty)
$ export JAVA_TOOL_OPTIONS='-Dfile.encoding=UTF8'
$ java CharsetTest
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
Default Charset=UTF-8
file.encoding=Latin-1
Default Charset=UTF-8
Default Charset in Use=UTF8
But this change by itself didn't help .. the output remains unchanged, so I
took the road down to dirty-hack-land and applied the following change to
bin/opennlp - for sure not how it should be done .. but it works, at least
for the moment:
-$JAVACMD -Xmx1024m -jar $OPENNLP_HOME/lib/opennlp-tools-*.jar $@
+$JAVACMD -Xmx1024m -Dfile.encoding=UTF8 -jar $OPENNLP_HOME/lib/opennlp-tools-*.jar $@