UTF8 encoding

Adrian Petru Dimulescu 30 Jun 2002 22:09:38 -0000

hello,

I recently posted some messages about iso-8859-2 encoding problems.

Trying to solve that problem, I encoded the Latin-2 XML document as UTF-8 and 
added it to Xindice with AddDocument. 

The behaviour is similar: the characters that also exist in iso-8859-1 
(è, â, î) are all right. The ones that are specific to 8859-2 are replaced by 
"?". This happens in the very file where Xindice holds its database.

This is probably caused by a Writer being opened somewhere in the I/O part of 
Xindice (I have not yet found the code that actually does this) without 
an encoding being specified.

As the platform default encoding is usually iso-8859-1, the Latin-2 text is 
handled improperly.
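
To illustrate the suspected cause, here is a minimal sketch. It is not Xindice
code; the class name LossyWriterDemo and its helper are made up for this
example. It pushes one character that exists in both character sets (î) and
one that exists only in Latin-2 (ă) through an OutputStreamWriter bound to
iso-8859-1, which is effectively what happens when no encoding is given and
the default is iso-8859-1.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyWriterDemo {
    // Hypothetical helper: encodes text through a Writer and returns the raw bytes.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String text = "\u00EE\u0103"; // î (Latin-1 and Latin-2), ă (Latin-2 only)
        // ă has no iso-8859-1 mapping, so the encoder substitutes '?' (0x3f)
        System.out.println(hex(write(text, StandardCharsets.ISO_8859_1))); // ee 3f
        // UTF-8 can represent both characters, two bytes each
        System.out.println(hex(write(text, StandardCharsets.UTF_8)));      // c3 ae c4 83
    }
}
```

The substitution is silent: the Writer raises no error, which would explain
why the "?" only shows up later, in the stored file.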

Indeed, one solution is to change the file.encoding property for Java. For 
instance, if I call java this way:

java -Dfile.encoding=utf-8

the problem disappears: the Latin-2 text is stored as UTF-8 in the Xindice db, 
which is OK for me.
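
A quick way to see which default a given JVM actually picked up is to print
it. This is only a diagnostic sketch (the class name DefaultCharsetCheck is
made up); note that on JDK 18 and later the default is UTF-8 regardless of the
machine's locale.

```java
import java.nio.charset.Charset;

class DefaultCharsetCheck {
    // Reports the file.encoding property next to the charset the JVM really uses
    // for Writers and Readers that are opened without an explicit encoding.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", default charset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```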

I wonder whether it would be more proper to let the user choose the encoding 
in which their text will be stored, and do something like:

        Writer writer = new ...Writer(outputStream, "my-encoding-here")

in the I/O code of Xindice.
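
Spelled out with the standard OutputStreamWriter, that could look roughly like
the sketch below. The class ExplicitEncodingStore and the encodingName
parameter are hypothetical; where Xindice would take the value from (a
configuration entry, for instance) is left open.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class ExplicitEncodingStore {
    // Hypothetical helper: encodingName stands for a user-chosen setting.
    // Closing the Writer also closes the underlying stream.
    static void store(String text, OutputStream out, String encodingName) {
        try (Writer writer = new OutputStreamWriter(out, encodingName)) {
            writer.write(text);
        } catch (IOException e) {
            // Also covers UnsupportedEncodingException for an unknown encoding name
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        store("\u0103", out, "UTF-8"); // ă, a Latin-2 character
        System.out.println(out.size() + " bytes written");
    }
}
```

With the encoding passed explicitly, the result no longer depends on
file.encoding or on the locale of the machine running the database.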

Or, even better, look at the <?xml version="1.0" encoding="my-encoding-here" ?>
declaration and use the given encoding when storing the document in Xindice. 
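
As a sketch of how the declared encoding could be read, the standard StAX API
exposes it through XMLStreamReader.getCharacterEncodingScheme(). The class
DeclaredEncoding is made up for this example, and falling back to UTF-8 when
nothing is declared is an assumption of the sketch, not something Xindice is
known to do.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class DeclaredEncoding {
    // Returns the encoding named in the XML declaration,
    // or "UTF-8" (assumed fallback) when the document declares none.
    static String of(InputStream xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            try {
                String declared = reader.getCharacterEncodingScheme();
                return declared != null ? declared : "UTF-8";
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not a readable XML document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-2\"?><doc/>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.US_ASCII));
        System.out.println(of(in));
    }
}
```

The returned name could then be handed to the Writer that stores the
document, so that the database file keeps the encoding the author declared.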

Otherwise, most users will end up using, without knowing it, the default 
encodings of their machines.


best regards,
adrian.
