Stefano: this is the kind of thing I address in the patch I "posted zipped" to this list, and subsequently made available as a patch.
As far as I know, there are currently three issues relating to Xindice (I too use Cocoon with Xindice, so I've seen exactly your problems too, in French ;) :

1) In the core database, documents are stored as Latin-1. The Java API transforms Java Strings (all Unicode) into Latin-1 bytes and saves these into the database files. This means that all Latin-1 (ISO-8859-1) text will be stored correctly if the Java API is used. Any non-Latin-1 characters (Polish, Greek, Russian, ...) are stored as '?'. My patch attempts to address this by storing all characters internally in their UTF-8 representation rather than their Latin-1 representation, but it's not perfect yet...

2) The command-line tools blatantly assume that all XML files are Latin-1 encoded, regardless of the "encoding" pseudo-attribute in the XML declaration. However, it is a simple matter to correct the source code to let Xerces sort encodings out instead of Xindice: Xerces does it really well (auto-detecting UTF-8, UTF-16 little and big endian, Latin-1, and a host of Asian encodings too). On the output front, I changed the cmd-line tool to always output in UTF-8, but a cleaner solution would be to let the user choose with a cmd-line switch, defaulting to UTF-8. Remember: the command-line tool simply reads the XML document into a Java string. This Java string can still only contain Latin-1 code points, as they're the only ones Xindice can store internally (for the moment), even if these characters were encoded in UTF-8 in your input XML.

3) XPath and XUpdate instructions are sent through CORBA (the remote call API used by Xindice's Java interfaces) as is (i.e. as "strings"). Unfortunately, strings are 8-bit in CORBA. Unicode character strings should be typed as wstrings, but for some reason (Kimbro has more on this: see an earlier post), wstrings cause compatibility issues between different CORBA implementations, and so this doesn't work either. So even if you fix point 1), queries will still not work.
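
To make points 1) and 2) a bit more concrete, here is a minimal Java sketch. It is not Xindice code: the class and method names are made up for illustration, and it uses the JAXP parser bundled with a modern JDK rather than the Xerces calls the command-line tools would actually need. It shows that a Latin-1 round trip turns non-Latin-1 characters into '?', and that handing the parser raw bytes lets it honour the XML declaration, whereas decoding the bytes as Latin-1 first does not.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration only; none of these names exist in Xindice.
final class EncodingSketch {

    // Point 1: encoding a String as Latin-1 replaces every character
    // outside ISO-8859-1 with '?', so that text cannot be recovered.
    public static String roundTripLatin1(String text) {
        byte[] stored = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(stored, StandardCharsets.ISO_8859_1);
    }

    // The same round trip through UTF-8 keeps every character.
    public static String roundTripUtf8(String text) {
        byte[] stored = text.getBytes(StandardCharsets.UTF_8);
        return new String(stored, StandardCharsets.UTF_8);
    }

    // Point 2, the suggested fix: give the parser the raw bytes so that it
    // detects the encoding from the XML declaration (or the byte order mark).
    public static String rootTextFromBytes(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    // Point 2, the faulty behaviour: decode as Latin-1 before parsing.
    // The parser receives characters, so the declared encoding is ignored.
    public static String rootTextAssumingLatin1(byte[] xml) {
        try {
            String decoded = new String(xml, StandardCharsets.ISO_8859_1);
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(decoded)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String greek = "\u0391\u03b8\u03ae\u03bd\u03b1"; // five Greek letters
        System.out.println(roundTripLatin1(greek).equals("?????"));
        System.out.println(roundTripUtf8(greek).equals(greek));

        byte[] xml = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<subtitle>pi\u00f9</subtitle>").getBytes(StandardCharsets.UTF_8);
        System.out.println(rootTextFromBytes(xml).equals("pi\u00f9"));
        System.out.println(rootTextAssumingLatin1(xml).equals("pi\u00c3\u00b9"));
    }
}
```

The non-ASCII characters are written as \u escapes so that the sketch does not depend on the encoding of the source file itself.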
For the CORBA problem in point 3), the only solution really worth considering is moving from CORBA to XML-RPC or SOAP, and that work is far from finished (though I'm working on it ;) ).

For documents containing only Latin-1 characters though (e.g. Italian, Portuguese, Swedish, German, French, Danish, Norwegian, etc.), you can get away with ONLY patching the command-line tools to correctly convert your documents to Java strings and back again.

As for getting it fixed in release 1.0, I'd have liked that too, but Kimbro (rightly) prefers to wait, as making UTF-8 the database's internal encoding breaks existing data files. There isn't really a reason not to just fix the command-line tools though, which would already fix Stefano's Italian problem... Kimbro?

Hope that clarifies things,
James

> -----Original Message-----
> From: Stefano Mazzocchi [mailto:[EMAIL PROTECTED]
> Sent: 23 February 2002 16:01
> To: Apache XIndice
> Cc: Apache Cocoon
> Subject: [bug] encoding problems
>
>
> [cross posted because people on the cocoon list might hit
> this as well]
>
> I've always tested xindice with english documents, so I didn't notice
> this behavior until today when I imported an italian XML document.
>
> The document is encoded using UTF-8 and looks like this:
>
> <?xml version="1.0" encoding="UTF-8"?>
> ...
> <subtitle>
>   In sempre più film il computer con la Mela è l'arma
>   dei giusti contro criminali di ogni specie che invece
>   preferiscono i pc
> </subtitle>
> ...
>
> [this is a news document taken from an italian on-line newspaper]
>
> ù -> ù
> è -> è
>
> are the two unicode translations for the non-ASCII character (since
> UTF-8 is back compatible to ASCII you don't note any difference until
> you use non-ASCII letters such as these)
>
> Opening the document in Explorer or XML-Spy yields the correct
> characters.
>
> Then I import it into the database and I access it from the cocoon
> XML:DB source I get (in the explorer window):
>
> <?xml version="1.0" encoding="UTF-8" ?>
> ...
> <subtitle>
>   In sempre più film il computer con la Mela è l'arma dei giusti
>   contro criminali di ogni specie che invece preferiscono i pc
> </subtitle>
>
> same thing when opening the source from the notepad window. But in
> win2k notepad is UNICODE-aware... so I saved the source on disk and I
> opened it with UltraEdit (which is UNICODE-aware but has a nice binary
> view) and voila'
>
> ...
> <subtitle>
>   In sempre piÃf¹ film il computer con la Mela Ãf¨
>   l'arma dei giusti contro criminali di ogni specie
>   che invece preferiscono i pc
> </subtitle>
> ...
>
> where I believe that
>
> Ãf -> à
> ¹ -> ¹
>
> This similarity in encoding probably shows why nobody noticed this
> before.
>
> So I went directly into the news.tbl and got the same bytes:
>
> n sempre piÃf¹ film il compu
> ter con la Mela Ãf¨ l'arma d
> ei giusti
>
> which clearly indicates that 'xindice' command line import tool is
> somewhat ignoring the 'UTF-8' encoding and performing UTF-8 encoding on
> something that is *already* UTF-8 encoded.
>
> My perception is that there is nothing wrong in the way XIndice or
> Cocoon get the information *out* of the database: the problem resides on
> how the information gets *in* the database.
>
> I would suggest the XIndice dev community to consider this bug a
> showstopper for the 1.0 final release.
>
> --
> Stefano Mazzocchi      One must still have chaos in oneself to be
>                           able to give birth to a dancing star.
> <[EMAIL PROTECTED]>                             Friedrich Nietzsche
> --------------------------------------------------------------------
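
For anyone who wants to reproduce the byte pattern Stefano describes, the sketch below applies UTF-8 encoding to text that is already UTF-8 encoded by misreading the bytes as Latin-1 in between. It is only an assumption that this is exactly what the import tool does; the class and method names are hypothetical. For 'ù' (UTF-8 bytes C3 B9) the result is C3 83 C2 B9, which a Windows-1252 viewer shows as something like the "Ãf¹" seen in news.tbl.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is not code from Xindice.
final class DoubleEncodingSketch {

    // Misread UTF-8 bytes as Latin-1 characters, then encode them as UTF-8 again.
    public static byte[] doubleEncode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    // Undo exactly one extra layer of encoding. This only works when the
    // damage is a single Latin-1 misreading, as produced by doubleEncode.
    public static String repair(byte[] doubleEncoded) {
        String misread = new String(doubleEncoded, StandardCharsets.UTF_8);
        byte[] utf8 = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Format bytes as space-separated upper-case hex pairs.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("\u00f9".getBytes(StandardCharsets.UTF_8))); // C3 B9
        System.out.println(hex(doubleEncode("\u00f9")));                    // C3 83 C2 B9
        System.out.println(repair(doubleEncode("pi\u00f9 film")).equals("pi\u00f9 film")); // true
    }
}
```

The repair step is shown only to confirm the diagnosis; the cleaner fix remains to stop the second encoding from happening on the way into the database.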