Re: [bug] encoding problems

James Bates 26 Feb 2002 22:00:36 -0000

>> My patch attempts to address this by storing all characters internally by
>> their UTF-8 representation rather than their Latin-1 representation, but
>> it's not perfect yet...
>
>What's the problem with this? Maybe I can help, since the XML compilation
>classes that I wrote for Cocoon are already UTF-8 based.


(James) Well, no technical problem really; the work is done (check the
patch). It's just a bit dangerous to apply it now, as it will break existing
datafiles... Kimbro and I plan on introducing it after 1.0 ships. Also, it
will need some serious testing.

>> 2) The command-line tools blatantly assume that all XML files are Latin-1
>> encoded, regardless of the "encoding" pseudo-attribute in the XML
>> declaration.
>
>Yes, 'blatantly' is the correct word :)
>
>> However, it is a simple matter to correct the source code to let Xerces
>> sort encodings out instead of Xindice: Xerces does it really well
>> (auto-detecting UTF-8, UTF-16 little and big endian, Latin-1, and a host
>> of Asian encodings too). On the output front, I changed the cmd-line tool
>> to always output in UTF-8, but a cleaner solution would be to let the user
>> choose with a cmd-line switch, defaulting to UTF-8.
>
>I don't get it: XML is *designed* to be encoding-safe. Why must a database
>client tool have special command-line parameters to indicate what the
>encoding is when it's already indicated inside the document? (And if you
>read the XML spec, there are a few encoding-guessing algorithms explained
>there.)

(James) I know: Xerces does all that. The proposed command-line tool option
is for OUTPUT (the xindiceadmin rd and xindiceadmin export commands). Then
users who don't like the default (UTF-8) OUTPUT encoding can still get their
files in Latin-1 if they want (e.g. because they don't have or don't like
UTF-8 text editors). Input commands (xindiceadmin ad and xindiceadmin import)
require no such option since, as you point out, the information is contained
in the XML file, as per the XML spec.
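
To make the input/output split concrete, here is a minimal sketch using the
standard JAXP API rather than the actual Xindice command-line code. The class
and method names (EncodingRoundTrip, parse, serialize) are hypothetical. The
point it illustrates: on input the parser is given raw bytes and works out the
encoding itself; on output the caller names the encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical helper, not part of Xindice.
final class EncodingRoundTrip {

    // Input: pass bytes (not a Reader or a String) so the XML parser reads the
    // encoding from the XML declaration or byte-order mark on its own.
    static Document parse(byte[] xmlBytes) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Output: the caller chooses the encoding, e.g. "UTF-8" or "ISO-8859-1".
    static byte[] serialize(Document doc, String encoding) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize XML", e);
        }
    }
}
```

A command-line switch for the output encoding would then only need to pass its
value through as the second argument of serialize, with "UTF-8" as the default.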

The output-encoding option is my wish, but it isn't done yet. For the
moment (i.e. since today) all output from the cmd-line tools is
unconditionally UTF-8. No options. No choices.

 >Besides: can't the client tool simply ask an XML parser to create SAX
>events for you and then store those in the database?

That's more or less what happens now. Anyway, check this evening's CVS (in
the main source tree): this point (command-line tools) is resolved now.

>> Remember: the command-line tool simply reads in the XML document to a
>> Java string: this Java string can still only contain Latin-1 defined
>> code-points, as they're the only ones Xindice can store internally (for the
>> moment), even if these characters were encoded in UTF-8 in your input XML.

>Sorry, but I don't get it: Java is entirely based on Unicode, and
>characters are represented as unsigned 16 bits. Since I've seen Japanese
>Java strings with my eyes, I think you are mistaken saying that Java
>strings can only contain Latin-1 chars.

(James) Java strings can indeed contain any Unicode characters; that's what's
cool about them. However, due to point 1) above (a limitation in Xindice, not
Java, but a limitation that frustrates me as much as it does you, believe
me, whence my development contribution ;-) ), only those characters in the
Java string that also exist in Latin-1 will actually make it into the Xindice
data-files (which are byte-based). This is because (for the moment) the
strings are converted to byte arrays using functions that rely on the
platform's default encoding, such as String.getBytes(/*nothing*/) and
FileReader, and these byte arrays are then used in all of the complex
Tree/Symbol table/compression stuff that goes on inside the Xindice datafiles.
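
A small sketch of the effect, assuming the bytes end up Latin-1 encoded. It is
not the Xindice storage code: the class and method names are hypothetical, and
the charset is named explicitly (instead of calling the no-argument getBytes())
so that the result does not depend on the machine it runs on.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Xindice.
final class ByteConversionDemo {

    // Stores the text as Latin-1 bytes and reads it back. Characters that do
    // not exist in Latin-1 are replaced by '?' when the bytes are produced.
    static String roundTripLatin1(String s) {
        byte[] stored = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(stored, StandardCharsets.ISO_8859_1);
    }

    // Same round trip with UTF-8 bytes, which can represent any Unicode text.
    static String roundTripUtf8(String s) {
        byte[] stored = s.getBytes(StandardCharsets.UTF_8);
        return new String(stored, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String italian = "perch\u00e9";          // e-acute exists in Latin-1
        String japanese = "\u65e5\u672c\u8a9e";  // three kanji, not in Latin-1
        System.out.println(roundTripLatin1(italian).equals(italian));   // true
        System.out.println(roundTripLatin1(japanese).equals(japanese)); // false
        System.out.println(roundTripUtf8(japanese).equals(japanese));   // true
    }
}
```

The Java string itself holds the Japanese text without trouble; the loss only
happens at the point where the string is turned into bytes.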

>> 3) XPath and XUpdate instructions are sent through CORBA (the remote call
>> API used by Xindice's Java interfaces) as is (i.e. as "strings").
>> Unfortunately, strings are 8-bit in CORBA. Unicode character strings
>> should be typed as wstrings, but for some reason (Kimbro has more on this:
>> see an earlier post), wstrings cause compatibility issues between
>> different CORBA implementations, and so this doesn't work either. So even
>> if you fix point 1), queries will still not work.

 >Ok, that might be the reason.

 >I see this as a *BIG* push into the 'throw Corba away' direction.

 (James) I agree 100%.

>> The only solution really worth considering here is moving from CORBA to
>> XML-RPC or SOAP, and this is far from over yet (though I'm working on
>> it ;) )

 >really? I thought we all agreed that SOAP/XML-RPC are far better options
>than CORBA.

(James) We agree, but the work isn't *done* yet ;)
In fact, XML-RPC isn't much good either, as it accepts only ASCII (yuck!) as
strings (even worse than CORBA). I'm thus working flat out on a SOAP/WSDL
solution now. (I hope to have something presentable by Friday.)


>> For Latin-1 character only documents though (e.g. Italian, Portuguese,
>> Swedish, German, French, Danish, Norwegian, etc...) you can get away with
>> ONLY patching the command-line tools to correctly convert your documents
>> to Java strings and back again.

(James) Again, I'm referring here to the *actual characters* your document
can contain *regardless* of what encoding is used to represent them. So you
can still upload UTF-8 or UCS documents, as long as the characters being
represented also exist in Latin-1.
Your Italian UTF-8 files for example should work fine.
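
As a rough way to check whether a given document falls inside that limit, one
can test the decoded characters rather than the file's encoding. This is only
a sketch; the class and method names are hypothetical and nothing like it
exists in Xindice as far as this thread shows.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Xindice.
final class Latin1Check {

    // True when every character of the text exists in Latin-1, whatever
    // encoding the file it came from was using on disk.
    static boolean fitsInLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // An Italian word stored as UTF-8 bytes, then decoded to a Java string.
        byte[] utf8File = "citt\u00e0".getBytes(StandardCharsets.UTF_8);
        String decoded = new String(utf8File, StandardCharsets.UTF_8);
        System.out.println(fitsInLatin1(decoded));   // true
        // The euro sign is not part of ISO-8859-1.
        System.out.println(fitsInLatin1("\u20ac"));  // false
    }
}
```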

 >>I saw that Kimbro applied the patches, I'll look into it ASAP.

>> As for getting it fixed in release 1.0, I'd have liked it too, but Kimbro
>> (rightly) prefers to wait, as making UTF-8 the database's internal
>> encoding breaks existing datafiles. There isn't really a reason not to
>> just fix the command-line tools though, thus already fixing Stefano's
>> Italian problem... Kimbro?

(James) This is now done.

>>Ok for release early and often, but, please, write a 'known bugs' page
>>or you'll pretty soon be flooded with 'encoding-related' problems (as it
>>happened with Cocoon a while ago).

(James) I know very well, I was one of the complainers ;) I agree with the
"known bugs" page; also, we should mention imho that we intend to fix it asap.


 That's about as much as I can see toward answering the questions. Hope it
 helps,

 James
