Windows has similar issues, but redirecting the output to a file
works. There is a way to address that, but we are waiting until 1.5.3
is out and we can focus on the 1.6.0 series before making radical changes.
There is a way to set the output encoding in Java to fix this, but
it isn't pretty: it involves creating a new System.out stream that
supports the new encoding.
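For illustration, replacing System.out with a stream that uses an explicit encoding could look roughly like the sketch below. This is not the OpenNLP code; the class name Utf8Stdout and the helper utf8Stream are made up for this example, and UTF-8 is assumed to be the encoding the terminal expects.

```java
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical example class, not part of OpenNLP.
class Utf8Stdout {

    // Wraps any byte stream in an auto-flushing PrintStream that encodes
    // characters as UTF-8, independent of the JVM default charset.
    static PrintStream utf8Stream(OutputStream out) throws UnsupportedEncodingException {
        return new PrintStream(out, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Replace System.out with a stream on the raw stdout file descriptor.
        System.setOut(utf8Stream(new FileOutputStream(FileDescriptor.out)));

        // \u00fc is the u-umlaut; it is escaped so that this source file
        // compiles the same way whatever encoding javac assumes.
        System.out.println("\u00fcber sein Deb\u00fctalbum");
    }
}
```

The terminal still has to be configured for the same encoding, otherwise the bytes are written correctly but displayed wrongly.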
The code already does this on the input side, because of similar issues
with the conversion process before training; that is the reason for the
-encoding parameter.
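As a rough sketch of what reading input with an explicitly named encoding means (again not the actual OpenNLP implementation; ExplicitInputEncoding and readFirstLine are hypothetical names, and taking the encoding from the first argument is only an assumption for the example):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical example class, not part of OpenNLP.
class ExplicitInputEncoding {

    // Decodes the stream with the named charset instead of the JVM default
    // and returns the first line (or null if the stream is empty).
    static String readFirstLine(InputStream in, String encoding) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, Charset.forName(encoding)));
        return reader.readLine();
    }

    public static void main(String[] args) throws IOException {
        // Assumption for this sketch: the encoding name is the first argument.
        String encoding = args.length > 0 ? args[0] : "UTF-8";
        System.out.println(readFirstLine(System.in, encoding));
    }
}
```

Decoding the same bytes with the wrong charset does not fail; it silently yields different characters, which is why the mismatch is easy to miss.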
On 3/1/2013 10:46 PM, Lance Norskog wrote:
The Mac defaults to the proprietary MacRoman (I think?) encoding from
decades past (really). Technical decisions can haunt your entire career.
On 03/01/2013 10:08 AM, Leonel de Alencar wrote:
Running Mac OS 10.4 and the original opennlp bash script, I saved
the file input.txt in UTF-8 and got the correct output both in the
Terminal and in an output file, which was also saved as UTF-8. My
Terminal display is configured for UTF-8. I don't know whether these
facts are of any help to Linux users...
$ opennlp SimpleTokenizer < input.txt
Sein Song " Nightcall " hat den Film " Drive " mit Ryan Gosling erst
so richtig bekannt gemacht . Wir haben uns mit Vincent Belorgey ,
besser bekannt als Kavinsky , über sein Debütalbum , seine Musik und
die 80 er Jahre unterhalten .
Average: 33,3 sent/s
Total: 1 sent
Runtime: 0.03s
$ opennlp SimpleTokenizer < input.txt > output.txt
Average: 111,1 sent/s
Total: 1 sent
Runtime: 0.0090s
$ cat output.txt
Sein Song " Nightcall " hat den Film " Drive " mit Ryan Gosling erst
so richtig bekannt gemacht . Wir haben uns mit Vincent Belorgey ,
besser bekannt als Kavinsky , über sein Debütalbum , seine Musik und
die 80 er Jahre unterhalten .
________________________________
From: Jörn Kottmann <[email protected]>
To: [email protected]
Sent: Friday, March 1, 2013 5:32
Subject: Re: German Umlauts broken while using Command Line?
The problem here is that the ASCII encoding can't encode the German
umlauts, so they are replaced with the question marks you see in the
output.
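The replacement can be reproduced in isolation. The sketch below (AsciiReplacementDemo and roundTrip are hypothetical names for this example) encodes a string and decodes it again with the same charset; String.getBytes substitutes the charset's replacement byte for characters it cannot map, which for US-ASCII is '?'.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical example class, not part of OpenNLP.
class AsciiReplacementDemo {

    // Encodes the text to bytes and decodes it again with the same charset.
    // Characters the charset cannot represent are lost during encoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // \u00fc is the u-umlaut, escaped to keep this source file pure ASCII.
        String word = "Deb\u00fctalbum";

        // US-ASCII has no u-umlaut, so it becomes a question mark.
        System.out.println(roundTrip(word, StandardCharsets.US_ASCII)); // prints Deb?talbum

        // UTF-8 can represent it, so the round trip is lossless.
        System.out.println(roundTrip(word, StandardCharsets.UTF_8).equals(word)); // prints true
    }
}
```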
Any ideas on how we can improve this? Anyway, if we can't do much
about it, we should at least document the workaround of manually
setting the encoding via file.encoding.
Jörn
On 02/28/2013 06:29 PM, Stefan Matheis wrote:
On Thursday, February 28, 2013 at 5:26 PM, Jörn Kottmann wrote:
Hmm, pretty sure there is an encoding mismatch. Do you know which
encoding is used by your JVM? I would guess it is not UTF-8. You can
probably get around the issue by re-encoding the input file to the
encoding the JVM is using.
Have a look here:
http://stackoverflow.com/questions/1749064/how-to-find-default-charset-encoding-in-java
Would be nice if you can run the println statements there.
Jörn
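A minimal sketch of such a check follows. It is not necessarily the code from the linked page; CharsetReport and its method names are hypothetical. It prints the JVM default charset, the file.encoding system property, and the encoding a default OutputStreamWriter actually picks.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

// Hypothetical example class, not the CharsetTest used in this thread.
class CharsetReport {

    // Canonical name of the charset the JVM uses when none is specified.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    // Encoding chosen by a writer created without an explicit charset.
    // getEncoding() may return a historical name such as "UTF8" or "ASCII".
    static String writerEncoding() {
        return new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding();
    }

    public static void main(String[] args) {
        System.out.println("Default Charset=" + defaultCharsetName());
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));
        System.out.println("Default Charset in Use=" + writerEncoding());
    }
}
```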
Wherever this comes from ..
$ java CharsetTest
Default Charset=US-ASCII
file.encoding=Latin-1
Default Charset=US-ASCII
Default Charset in Use=ASCII
$ echo $JAVA_TOOL_OPTIONS
(empty)
$ export JAVA_TOOL_OPTIONS='-Dfile.encoding=UTF8'
$ java CharsetTest
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
Default Charset=UTF-8
file.encoding=Latin-1
Default Charset=UTF-8
Default Charset in Use=UTF8
But this change by itself didn't help; the output remained unchanged.
So I took the road down to dirty-hack-land and applied the following
change to bin/opennlp. It is certainly not how it should be done, but
it works at least for the moment:
-$JAVACMD -Xmx1024m -jar $OPENNLP_HOME/lib/opennlp-tools-*.jar $@
+$JAVACMD -Xmx1024m -Dfile.encoding=UTF8 -jar $OPENNLP_HOME/lib/opennlp-tools-*.jar $@