Re: Unicode again

Dirk Wed, 16 Aug 2006 15:20:55 -0700

And also look here for TinyXML's support for UTF-8:
http://www.grinninglizard.com/tinyxmldocs/index.html
We're using TinyXML just for writing, so it doesn't need recognition code. The converter may run on a platform other than that used to create the VSS DB, so the VSS locale may not be available. Hence my latest patch to force TinyXML to "trust" the specified locale, declare it in the output XML, and pass through all characters from the source physical files unmodified. It's left to the Perl XML parser to convert the encoding to Unicode internally. Currently the XML encoding is hardcoded to Windows-1252 and needs to be patched in the C++ source by users of other VSS locales. It might be desirable to pass this in as an argument to ssphys.
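
A minimal sketch of the "declare the encoding and pass bytes through" idea described above. It deliberately does not use TinyXML or ssphys code; `xml_declaration` and `escape_markup_only` are hypothetical helper names invented for illustration, and the encoding name is assumed to arrive as a plain string (for example from a command-line argument).

```cpp
#include <string>

// Hypothetical helper (not a TinyXML or ssphys function): builds the XML
// declaration for whatever encoding name the caller passes in.
std::string xml_declaration(const std::string& encoding) {
    return "<?xml version=\"1.0\" encoding=\"" + encoding + "\"?>";
}

// Hypothetical helper: escapes only the characters that are markup in XML.
// Every other byte, including bytes >= 0x80, is copied through unchanged,
// so the consuming parser must decode them using the declared encoding.
std::string escape_markup_only(const std::string& raw) {
    std::string out;
    for (char c : raw) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;";  break;
            case '>': out += "&gt;";  break;
            default:  out += c;       break;
        }
    }
    return out;
}
```

With this shape the encoding is data rather than a constant in the source, so a user of another VSS locale would only have to pass a different name.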

Yes, that's what I always wrote in my mails: let ssphys pass all characters through unmodified and use Perl's XML parser for the real conversion. But after reading the links mentioned in my first mail of this thread, I'm starting to think a little differently:

1.) We still have the problem of transporting "discouraged" characters, even if they are encoded in the "&#" form

2.) We need to specify the correct codepage in the header (this should be easy)

3.) We should bypass the console and directly write into a file.

While I'm searching for more information, do you have an idea about the encoding of what Windows thinks is UNICODE? Is it UTF16 or UCS2? If I understand everything correctly, UTF16 is again a variable-length encoding, since code points from 0x10000 upwards are mapped to two 16-bit code values.

UCS2 seems to clip the range of Unicode scalar values to those that fit into 16 bits.
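
The difference between the two can be sketched in a few lines of C++. This is only an illustration of the standard UTF-16 surrogate arithmetic, not code from ssphys; `to_utf16` is a made-up name, and the input is assumed to be a valid Unicode scalar value (not a surrogate, at most 0x10FFFF).

```cpp
#include <cstdint>
#include <vector>

// Encodes one Unicode scalar value as UTF-16 code units.
// Assumes cp is valid: not in D800..DFFF and not above 0x10FFFF.
std::vector<std::uint16_t> to_utf16(std::uint32_t cp) {
    if (cp < 0x10000) {
        // Fits in one 16-bit unit; this is the only range UCS2 can represent.
        return {static_cast<std::uint16_t>(cp)};
    }
    cp -= 0x10000;  // leaves a 20-bit value
    return {
        static_cast<std::uint16_t>(0xD800 + (cp >> 10)),   // high surrogate: top 10 bits
        static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF))  // low surrogate: bottom 10 bits
    };
}
```

For example, U+20AC (the euro sign) stays a single unit, while U+1D11E becomes the pair 0xD834 0xDD1E.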

I have a better understanding of the "discouraged" characters now (http://skew.org/xml/tutorial/):

Note that the XML 1.0 Recommendation refers to UCS characters by their Unicode scalar values, using a notation of |#x| followed by only as many hex digits as needed. So |#x9| in the EBNF productions means the abstract character that would be represented in Unicode 3.1's "U+" notation as |U+0009|. It does not necessarily mean a byte with hex value 9.
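
To make the distinction concrete, here is a small sketch that classifies Unicode scalar values (not bytes) the way the XML 1.0 Char production does. The function names are invented for this example, and `is_discouraged_control` covers only the control-character ranges of the discouraged list (#x7F-#x84 and #x86-#x9F), which are the ones relevant to this thread; the remaining discouraged ranges are intentionally left out.

```cpp
#include <cstdint>

// True if cp is allowed by the XML 1.0 Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
bool is_xml10_char(std::uint32_t cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

// True for the control characters that XML 1.0 allows but discourages.
// Only the control-character ranges are checked in this sketch.
bool is_discouraged_control(std::uint32_t cp) {
    return (cp >= 0x7F && cp <= 0x84) || (cp >= 0x86 && cp <= 0x9F);
}
```

A Windows-1252 byte such as 0x80 only ends up in the discouraged range if it is treated as the scalar value U+0080; decoded according to Windows-1252 it is U+20AC, which is an ordinary character.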

I always interpreted this UCS mapping and Microsoft's UNICODE mapping as being equivalent. I'm not so sure anymore.


So what is the correct way?????

1.) We know that VSS is encoded in the current user's ANSI locale

2.) We can automatically determine this locale, or we can specify it on the command line

3.) We can output the XML file in the current locale, even if we get discouraged characters

4.) We can try to convert to UTF8 using the |WideCharToMultiByte| functions
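
Option 4 can also be sketched without the Windows API, for the single case of Windows-1252. The following is a portable illustration, not the Win32 conversion path and not ssphys code; all function names are invented here. Bytes 0x80-0x9F are looked up in the Windows-1252 table, and the five unassigned slots are mapped to the same-numbered C1 control, which is an assumption of this sketch.

```cpp
#include <cstdint>
#include <string>

// Maps one Windows-1252 byte to a Unicode scalar value.
// Unassigned slots (0x81, 0x8D, 0x8F, 0x90, 0x9D) keep their numeric value;
// that choice is an assumption made for this example.
std::uint32_t cp1252_to_code_point(unsigned char b) {
    static const std::uint16_t high[32] = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};
    if (b >= 0x80 && b <= 0x9F) return high[b - 0x80];
    return b;  // 0x00-0x7F and 0xA0-0xFF match Latin-1 / Unicode directly
}

// Appends cp as UTF-8. Windows-1252 never yields a value above 0xFFFF,
// so the four-byte form is not needed here.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts a whole Windows-1252 byte string to UTF-8.
std::string cp1252_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char b : in) append_utf8(out, cp1252_to_code_point(b));
    return out;
}
```

For other VSS locales the lookup table would have to change, which is the reason a real implementation would more likely rely on the platform's conversion functions.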

Grrrrr

Any further ideas?

Dirk
_______________________________________________
vss2svn-users mailing list
Project homepage:
http://www.pumacode.org/projects/vss2svn/
Subscribe/Unsubscribe/Admin:
http://lists.pumacode.org/mailman/listinfo/vss2svn-users-lists.pumacode.org
Mailing list web interface (with searchable archives):
http://dir.gmane.org/gmane.comp.version-control.subversion.vss2svn.user
