RE: [bug] encoding problems

James Bates 25 Feb 2002 09:06:03 -0000

Stefano: this is the kind of thing I address in the patch I "posted zipped" to 
this list, and subsequently made available as a patch.


As far as I know, for the moment there are three issues relating to Xindice (I 
too use Cocoon with Xindice, so I've seen exactly your problems as well, in 
French ;) ):

1) In the core database, documents are stored as Latin-1. The Java API 
transforms Java Strings (all Unicode) into Latin-1 bytes and saves these into 
the database files. This means that all Latin-1 (ISO-8859-1) texts will be 
stored correctly if the Java API is used. Any non-Latin-1 characters (Polish, 
Greek, Russian, ...) are stored as '?'.

My patch attempts to address this by storing all characters internally by their 
UTF-8 representation rather than their Latin-1 representation, but it's not 
perfect yet...
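
To make the difference concrete, here is a minimal sketch of the two storage 
choices using only the standard Java String/charset calls. It is not the 
Xindice code or my patch; the class and method names are made up for 
illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Xindice.
class EncodingLossDemo {
    // Encode to Latin-1 bytes and decode again, as a Latin-1 store would.
    static String roundTripLatin1(String s) {
        byte[] stored = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(stored, StandardCharsets.ISO_8859_1);
    }

    // Same round trip, but with UTF-8 as the storage encoding.
    static String roundTripUtf8(String s) {
        byte[] stored = s.getBytes(StandardCharsets.UTF_8);
        return new String(stored, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String italian = "pi\u00f9";           // u-grave is in Latin-1
        String polish = "\u017c\u00f3\u0142w"; // only o-acute is in Latin-1

        // Latin-1 text survives the Latin-1 round trip.
        System.out.println(roundTripLatin1(italian).equals(italian));     // true
        // Characters outside Latin-1 come back as '?'.
        System.out.println(roundTripLatin1(polish).equals("?\u00f3?w"));  // true
        // UTF-8 keeps everything.
        System.out.println(roundTripUtf8(polish).equals(polish));         // true
    }
}
```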

2) The command-line tools blatantly assume that all XML files are Latin-1 
encoded, regardless of the "encoding" pseudo-attribute in the XML declaration. 
However it is a simple matter to correct the source code to let Xerces sort 
encodings out instead of Xindice: Xerces does it really well (auto-detecting 
UTF-8, UTF-16 little and big endian, Latin-1, and a host of Asian encodings 
too). On the output front, I changed the cmd-line tool to always output in 
UTF-8, but a cleaner solution would be to let the user choose with a cmd-line 
switch, defaulting to UTF-8.
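
A rough sketch of the idea, assuming a standard JAXP parser (the one bundled 
with current JDKs is derived from Xerces): hand the parser the raw bytes and 
let it read the XML declaration itself, rather than decoding the file to a 
String first. The class and method names are hypothetical, and this is not 
the actual change to the Xindice tools.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class, not part of Xindice.
class ParserDetectsEncoding {
    // Let the parser work out the encoding from the byte stream.
    static String textViaParser(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    // What a tool sees if it assumes Latin-1 regardless of the declaration.
    static String textAssumingLatin1(byte[] xmlBytes) {
        String s = new String(xmlBytes, StandardCharsets.ISO_8859_1);
        return s.substring(s.indexOf("<t>") + 3, s.indexOf("</t>"));
    }

    public static void main(String[] args) {
        byte[] utf8Doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><t>pi\u00f9</t>"
                .getBytes(StandardCharsets.UTF_8);

        // The parser honours the declaration and recovers the real text.
        System.out.println(textViaParser(utf8Doc).equals("pi\u00f9"));                // true
        // The Latin-1 assumption turns one character into two (A-tilde, superscript 1).
        System.out.println(textAssumingLatin1(utf8Doc).equals("pi\u00c3\u00b9"));     // true
    }
}
```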

Remember: the command-line tool simply reads in the XML document to a Java 
string: this Java string can still only contain Latin-1 defined code-points as 
they're the only ones Xindice can store internally (for the moment), even if 
these characters were encoded in UTF-8 in your input XML.
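
As a hedged illustration of the symptom in the quoted message below: if UTF-8 
input is decoded as Latin-1 and the resulting string is later written out as 
UTF-8, each accented character ends up encoded twice. This sketch only shows 
that mechanism with standard Java calls; it is an assumption about how the 
bytes could arise, not a trace of the actual Xindice code path, and the names 
are invented.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Xindice.
class DoubleEncodingDemo {
    // Misread UTF-8 bytes as Latin-1, then encode the result as UTF-8 again.
    static byte[] misreadThenStoreAsUtf8(byte[] utf8Input) {
        String misread = new String(utf8Input, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    // Render bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] original = "\u00f9".getBytes(StandardCharsets.UTF_8); // u-grave
        System.out.println(hex(original));                         // C3 B9
        System.out.println(hex(misreadThenStoreAsUtf8(original))); // C3 83 C2 B9
    }
}
```

The second line of output is four bytes for what was a single character, 
which is consistent with the doubled sequences Stefano found in the table file.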

3) XPath and XUpdate instructions are sent through CORBA (the remote call API 
used by Xindice's Java interfaces) as is (i.e. as "strings"). Unfortunately, 
strings are 8-bit in CORBA. Unicode character strings should be typed as 
wstrings, but for some reason (Kimbro has more on this: see an earlier post), 
wstrings cause compatibility issues between different CORBA implementations, 
and so this doesn't work either. So even if you fix point 1), queries will 
still not work.
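
Until that changes, a client could at least check up front whether a query 
will survive an 8-bit string. The sketch below assumes the 8-bit transport 
carries Latin-1, which is my assumption and not something the CORBA layer 
guarantees; the helper name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Xindice or any CORBA API.
class QueryCharsetCheck {
    // True if every character in the query has a single-byte Latin-1 form.
    static boolean fitsInLatin1(String query) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(query);
    }

    public static void main(String[] args) {
        // Italian text stays within Latin-1.
        System.out.println(fitsInLatin1("//subtitle[contains(., 'pi\u00f9')]"));   // true
        // Polish l-stroke does not.
        System.out.println(fitsInLatin1("//subtitle[contains(., '\u0142')]"));     // false
    }
}
```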

The only solution really worth considering here is moving from CORBA to XML-RPC 
or SOAP, and this is far from over yet (though I'm working on it;) )



For documents containing only Latin-1 characters, though (e.g. Italian, 
Portuguese, Swedish, German, French, Danish, Norwegian, etc.), you can get 
away with ONLY patching the command-line tools to correctly convert your 
documents to Java strings and back again.


As for getting it fixed in release 1.0, I'd have liked it too, but Kimbro 
(rightly) prefers to wait, as making UTF-8 the database's internal encoding 
breaks existing data files. There isn't really a reason not to just fix the 
command-line tools though, thus already fixing Stefano's Italian problem...
Kimbro?


Hope that clarifies things,
James



> -----Original Message-----
> From: Stefano Mazzocchi [mailto:[EMAIL PROTECTED]
> Sent: 23 February 2002 16:01
> To: Apache XIndice
> Cc: Apache Cocoon
> Subject: [bug] encoding problems
> 
> 
> [cross posted because people on the cocoon list might hit 
> this as well]
> 
> I've always tested xindice with english documents, so I didn't notice
> this behavior until today when I imported an italian XML document.
> 
> The document is encoded using UTF-8 and looks like this:
> 
>  <?xml version="1.0" encoding="UTF-8"?>
>  ...
>   <subtitle>
>    In sempre piÃ¹ film il computer con la Mela Ã¨ l'arma 
>    dei giusti contro criminali di ogni specie che invece 
>    preferiscono i pc
>   </subtitle>
>  ...
> 
> [this is a news document taken from an italian on-line newspaper]
> 
>  Ã¹ -> ù
>  Ã¨ -> è
> 
> are the two unicode translations for the non-ASCII character (since
> UTF-8 is back compatible to ASCII you don't note any difference until
> you use non-ASCII letters such as these)
> 
> Opening the document in Explorer or XML-Spy yields the correct
> characters.
> 
> Then I import it into the database and I access it from the cocoon
> XML:DB source I get (in the explorer window):
> 
>   <?xml version="1.0" encoding="UTF-8" ?> 
>    ...
>   <subtitle>
>    In sempre piÃ¹ film il computer con la Mela Ã¨ l'arma dei giusti 
>    contro criminali di ogni specie che invece preferiscono i pc
>   </subtitle> 
> 
> same thing when opening the source from the notepad window. But in
> win2k notepad is UNICODE-aware... so I saved the source on disk and I
> opened it with UltraEdit (which is UNICODE-aware but has a nice binary
> view) and voila'
> 
>   ...
>   <subtitle>
>    In sempre piÃfÂ¹ film il computer con la Mela ÃfÂ¨ 
>    l'arma dei giusti contro criminali di ogni specie 
>    che invece preferiscono i pc
>   </subtitle>
>   ...
> 
> where I believe that
> 
>  Ãf -> Ã
>  Â¹ -> ¹
> 
> This similarity in encoding probably shows why nobody noticed this
> before.
> 
> So I went directly into the news.tbl and got the same bytes:
> 
>    n sempre piÃfÂ¹ film il compu
>    ter con la Mela ÃfÂ¨ l'arma d
>    ei giusti 
> 
> which clearly indicates that 'xindice' command line import tool is
> somewhat ignoring the 'UTF-8' encoding and performing UTF-8 
> encoding on
> something that is *already* UTF-8 encoded.
> 
> My perception is that there is nothing wrong in the way XIndice or
> Cocoon get the information *out* of the database: the problem 
> resides on
> how the information gets *in* the database.
> 
> I would suggest the XIndice dev community to consider this bug a
> showstopper for the 1.0 final release.
> 
> -- 
> Stefano Mazzocchi      One must still have chaos in oneself to be
>                           able to give birth to a dancing star.
> <[EMAIL PROTECTED]>                             Friedrich Nietzsche
> --------------------------------------------------------------------
> 
> 
>
