>> My patch attempts to address this by storing all characters internally by
>> their UTF-8 representation rather than their Latin-1 representation, but
>> it's not perfect yet...
>
>What's the problem on this? Maybe I can help since the XML compilation
>classes that I wrote for Cocoon are already UTF-8 based.

(James) Well, no technical problem really: the work is done (check the
patch). It's just a bit dangerous to apply it now, as it will break existing
datafiles... Kimbro and I plan on introducing it after 1.0 ships. Also, it
will need some serious testing.

>> 2) The command-line tools blatantly assume that all XML files are Latin-1
>> encoded, regardless of the "encoding" pseudo-attribute in the XML
>> declaration.
>
>Yes, 'blatantly' is the correct word :)
>
>> However it is a simple matter to correct the source code to let Xerces
>> sort encodings out instead of Xindice: Xerces does it really well
>> (auto-detecting UTF-8, UTF-16 little and big endian, Latin-1, and a host of
>> Asian encodings too). On the output front, I changed the cmd-line tool to
>> always output in UTF-8, but a cleaner solution would be to let the user
>> choose with a cmd-line switch, defaulting to UTF-8.
>
>I don't get it: XML is *designed* to be encoding-safe. Why must a database
>client tool have special command line parameters to indicate what
>encoding is used while it's already indicated inside the document (and
>if you read the XML spec there are a few encoding-guessing algorithms
>explained there)?

(James) I know: Xerces does all that. The proposed command-line tool option
is for OUTPUT (the xindiceadmin rd and xindiceadmin export commands). Users
who don't like the default (UTF-8) OUTPUT encoding can then still get their
files in Latin-1 if they want (e.g. because they don't have / don't like
UTF-8 text editors). Input commands (xindiceadmin ad and xindiceadmin import)
require no such option since, as you point out, the information is contained
in the XML file, as per the XML spec. The output option is my wish, but it
isn't done yet. For the moment (i.e. since today) all output from the
cmd-line tools is unconditionally UTF-8. No options. No choices.

>Besides: can't the client tool simply ask an XML parser to create SAX
>events for you and then store those in the database?

(James) That's more or less what happens now. Anyway, check this evening's
CVS (in the main source tree): this point (command-line tools) is resolved
now.

>> Remember: the command-line tool simply reads in the XML document to a
>> Java string: this Java string can still only contain Latin-1 defined
>> code-points as they're the only ones Xindice can store internally (for the
>> moment), even if these characters were encoded in UTF-8 in your input XML.
>
>Sorry but I don't get it: Java is entirely based on Unicode and
>characters are represented as unsigned 16 bits. Since I've seen Japanese
>Java strings with my eyes, I think you are mistaken saying that Java
>strings can only contain Latin-1 chars.

(James) Java strings can indeed contain any Unicode characters, that's
what's cool about them. However, due to point 1) above (a limitation in
Xindice, not Java, but a limitation that frustrates me as much as it does
you, believe me, whence my development contribution ;-) ), only those
characters in the Java string that also exist in Latin-1 will actually make
it into the Xindice datafiles (which are byte-based). This is because (for
the moment) the strings are converted to byte arrays using functions that
rely on the platform's default encoding, such as
String.getBytes(/*nothing*/) and FileReader, and these byte arrays are then
used in all of the complex Tree/Symbol table/compression stuff that goes on
inside the Xindice datafiles.

>> 3) XPath and XUpdate instructions are sent through CORBA (the remote call
>> API used by Xindice's Java interfaces) as is (i.e. as "strings").
>> Unfortunately, strings are 8-bit in CORBA. Unicode character strings should
>> be typed as wstrings, but for some reason (Kimbro has more on this: see an
>> earlier post), wstrings cause compatibility issues between different CORBA
>> implementations, and so this doesn't work either. So even if you fix point
>> 1), queries will still not work.
>
>Ok, that might be the reason.
>I see this as a *BIG* push into the 'throw Corba away' direction.

(James) I agree 100%.

>> The only solution really worth considering here is moving from CORBA to
>> XML-RPC or SOAP, and this is far from over yet (though I'm working on
>> it ;) )
>really? I thought we all agreed that SOAP/XML-RPC are far better options
>than CORBA.

(James) We agree, but the work isn't *done* yet ;) In fact XML-RPC isn't
much good either, as it accepts only ASCII (yuck!) in strings (even worse
than CORBA). I'm thus working flat out on a SOAP/WSDL solution now. (I hope
to have something presentable by Friday.)

>> For Latin-1 character only documents though (e.g. Italian, Portuguese,
>> Swedish, German, French, Danish, Norwegian, etc...) you can get away with
>> ONLY patching the command-line tools to correctly convert your documents to
>> Java strings and back again.

(James) Again, I'm referring here to the *actual characters* your document
can contain, *regardless* of what encoding is used to represent them. So you
can still upload UTF-8 or UCS documents, as long as the characters being
represented also exist in Latin-1. Your Italian UTF-8 files, for example,
should work fine.

>>I saw that Kimbro applied the patches, I'll look into it ASAP.

>> As for getting it fixed in release 1.0, I'd have liked it too, but Kimbro
>> (rightly) prefers to wait, as making UTF-8 the database's internal encoding
>> breaks existing datafiles. There isn't really a reason not to just fix the
>> command-line tools though, thus already fixing Stefano's Italian problem...
>> Kimbro?

(James) This is now done.

>>Ok for release early and often, but, please, write a 'known bugs' page
>>or you'll pretty soon be flooded with 'encoding-related' problems (as it
>>happened with Cocoon a while ago).

(James) I know very well, I was one of the complainers ;) I agree with the
"known bugs" page; imho we should also mention there that we intend to fix
it asap.
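
For anyone who wants to see the point 1) limitation in isolation, here is a
minimal sketch. It is not the Xindice code: the class EncodingDemo and the
method roundTrip are hypothetical names made up for illustration, and the
charsets are passed explicitly (via java.nio.charset.StandardCharsets from
newer JDKs) instead of relying on the platform default, so the behaviour is
the same on every machine. It stores a string as bytes and reads it back,
once as Latin-1 and once as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Xindice.
public class EncodingDemo {

    // Encode the text to bytes (as a byte-based datafile would hold it),
    // then decode those bytes again with the same charset.
    static String roundTrip(String text, Charset charset) {
        byte[] stored = text.getBytes(charset);
        return new String(stored, charset);
    }

    public static void main(String[] args) {
        String italian = "perch\u00e9";          // contains e-acute, which exists in Latin-1
        String japanese = "\u65e5\u672c\u8a9e";  // three CJK characters, none in Latin-1

        // Latin-1 storage keeps characters that exist in Latin-1...
        System.out.println(roundTrip(italian, StandardCharsets.ISO_8859_1).equals(italian));
        // ...but characters outside Latin-1 are replaced during encoding and lost.
        System.out.println(roundTrip(japanese, StandardCharsets.ISO_8859_1).equals(japanese));
        // UTF-8 storage can represent every Unicode character, so both survive.
        System.out.println(roundTrip(italian, StandardCharsets.UTF_8).equals(italian));
        System.out.println(roundTrip(japanese, StandardCharsets.UTF_8).equals(japanese));
    }
}
```

The Java string itself holds the Japanese text without trouble; the loss
only happens at the string-to-bytes step, which is why the input document's
own encoding does not matter and the internal storage encoding does.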

That's about as much as I can see toward answering the questions.

Hope it helps,
James