Windows has similar issues, but redirecting the output to a file works. There is a way to address that, but we are waiting until we get 1.5.3 out and can focus on the 1.6.0 series before making radical changes. The output encoding can be set in Java to fix this, but it isn't pretty: it involves creating a new System.out stream that supports the new encoding. The code already does this on the input side, because of similar issues with the conversion process before training; that is the reason for the -encoding parameter.
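
For anyone wondering what replacing System.out could look like, here is a minimal sketch. It is not the OpenNLP code: the class name Utf8Stdout and the helper printStreamFor are made up for illustration, and UTF-8 is only an assumed target encoding.

```java
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Stdout {

    // Hypothetical helper: wrap a byte stream in an auto-flushing PrintStream
    // that encodes text with the named charset.
    static PrintStream printStreamFor(OutputStream out, String encoding) {
        try {
            return new PrintStream(out, true, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported encoding: " + encoding, e);
        }
    }

    public static void main(String[] args) {
        // Replace System.out with a stream that writes UTF-8 to the real stdout,
        // whatever default encoding the JVM picked at startup.
        System.setOut(printStreamFor(new FileOutputStream(FileDescriptor.out), "UTF-8"));

        // \u00fc is the u-umlaut; the escape keeps this independent of the source file encoding.
        System.out.println("\u00fcber sein Deb\u00fctalbum");
    }
}
```

Whether the terminal then shows the umlauts correctly still depends on the terminal being configured for the same encoding.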

On 3/1/2013 10:46 PM, Lance Norskog wrote:
The Mac defaults to the proprietary MacRoman (I think?) encoding from decades past (really). Technical decisions can haunt your entire career.

On 03/01/2013 10:08 AM, Leonel de Alencar wrote:
Running Mac OS 10.4 and the original opennlp bash script, I saved the file input.txt in the UTF-8 encoding and got the correct output both on the Terminal and in an output file, which was also saved as UTF-8. My Terminal display is configured for UTF-8. I don't know if these facts are of any help for Linux users...

$ opennlp SimpleTokenizer < input.txt
Sein Song " Nightcall " hat den Film " Drive " mit Ryan Gosling erst so richtig bekannt gemacht . Wir haben uns mit Vincent Belorgey , besser bekannt als Kavinsky , über sein Debütalbum , seine Musik und die 80 er Jahre unterhalten .


Average: 33,3 sent/s
Total: 1 sent
Runtime: 0.03s

$ opennlp SimpleTokenizer < input.txt > output.txt


Average: 111,1 sent/s
Total: 1 sent
Runtime: 0.0090s

$ cat output.txt
Sein Song " Nightcall " hat den Film " Drive " mit Ryan Gosling erst so richtig bekannt gemacht . Wir haben uns mit Vincent Belorgey , besser bekannt als Kavinsky , über sein Debütalbum , seine Musik und die 80 er Jahre unterhalten .







________________________________
From: Jörn Kottmann <[email protected]>
To: [email protected]
Sent: Friday, March 1, 2013 5:32
Subject: Re: German Umlauts broken while using Command Line?
  The problem here is the ASCII encoding can't encode the German Umlauts
and therefore they are replaced with the question marks you see in the
output.
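
The replacement described above is easy to reproduce in isolation. The following is a minimal sketch, not code from OpenNLP; the class name AsciiUmlautDemo and the helper roundTripAscii are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class AsciiUmlautDemo {

    // Encode the text as US-ASCII and decode it again.
    // String.getBytes replaces characters the charset cannot represent
    // with the charset's replacement byte, which is '?' for US-ASCII.
    static String roundTripAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // \u00fc is the u-umlaut; the escape keeps this independent of the source file encoding.
        System.out.println(roundTripAscii("\u00fcber sein Deb\u00fctalbum"));
        // prints "?ber sein Deb?talbum"
    }
}
```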

Any ideas on how we can improve this? Anyway, if we can't do much about it, we should at least document the workaround of manually setting the encoding via file.encoding.

Jörn

On 02/28/2013 06:29 PM, Stefan Matheis wrote:
On Thursday, February 28, 2013 at 5:26 PM, Jörn Kottmann wrote:

Hmm, pretty sure there is an encoding mismatch. Do you know which encoding is used by your JVM? I would guess that it is not UTF-8. You can probably get around the issue by re-encoding the input file to the encoding the JVM is using.

Have a look here:
http://stackoverflow.com/questions/1749064/how-to-find-default-charset-encoding-in-java

It would be nice if you could run the println statements there.

Jörn
Wherever this comes from ..

$ java CharsetTest
Default Charset=US-ASCII
file.encoding=Latin-1
Default Charset=US-ASCII
Default Charset in Use=ASCII

$ echo $JAVA_TOOL_OPTIONS
(empty)

$ export JAVA_TOOL_OPTIONS='-Dfile.encoding=UTF8'

$ java CharsetTest
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
Default Charset=UTF-8
file.encoding=Latin-1
Default Charset=UTF-8
Default Charset in Use=UTF8
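
The source of CharsetTest is not included in this thread. A test along the following lines would produce output of this shape; it is a reconstruction under that assumption, not necessarily the code that was actually run. The helper name defaultCharsetInUse is made up.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

public class CharsetTest {

    // Ask a freshly created writer which encoding it picked up by default.
    // getEncoding returns the historical name, e.g. "UTF8" or "ASCII".
    static String defaultCharsetInUse() {
        return new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding();
    }

    public static void main(String[] args) {
        System.out.println("Default Charset=" + Charset.defaultCharset());

        // Assumption: the test sets the property itself at runtime, which would
        // explain why file.encoding prints as Latin-1 in both runs above.
        System.setProperty("file.encoding", "Latin-1");
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));

        // The default charset is fixed at JVM startup, so it does not follow the property change.
        System.out.println("Default Charset=" + Charset.defaultCharset());
        System.out.println("Default Charset in Use=" + defaultCharsetInUse());
    }
}
```

If that reading is right, the lines to watch are the "Default Charset" ones: they change from US-ASCII to UTF-8 once -Dfile.encoding is passed at startup.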



But this change by itself didn't help .. the output remained unchanged, so I took the road down to dirty-hack-land and applied the following change to bin/opennlp. It is certainly not how it should be done .. but it works, at least for the moment:

-$JAVACMD -Xmx1024m -jar $OPENNLP_HOME/lib/opennlp-tools-*.jar $@
+$JAVACMD -Xmx1024m -Dfile.encoding=UTF8 -jar $OPENNLP_HOME/lib/opennlp-tools-*.jar $@





