Hi,

I just found a comment saying that Windows UNICODE is UCS-2. What do you think about the following Windows-specific code to convert the decoded ANSI input to UTF-8:

// Convert the file's ANSI text to Windows UNICODE (AKA UCS-2)
MultiByteToWideChar(CP_ACP, 0, ....);

// Now convert from Windows UNICODE (AKA UCS-2) to UTF-8
WideCharToMultiByte(CP_UTF8, 0, ....);


On Linux we could use iconv, or something similar.
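For the Linux side, here is a minimal sketch of what the iconv route could look like. The helper name to_utf8 and the code page name "WINDOWS-1252" are assumptions for illustration only; nothing like this exists in ssphys yet, and the set of code page names accepted depends on the iconv implementation.

```cpp
#include <iconv.h>
#include <cerrno>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of ssphys): convert a byte string from the
// given source code page to UTF-8 using POSIX iconv.
std::string to_utf8(const std::string& input, const char* from_codepage)
{
    iconv_t cd = iconv_open("UTF-8", from_codepage);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed: unsupported code page");

    std::string output;
    char buffer[256];
    char* in = const_cast<char*>(input.data());  // iconv never writes to the input
    size_t in_left = input.size();

    while (in_left > 0) {
        char* out = buffer;
        size_t out_left = sizeof(buffer);
        size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
        int err = errno;  // save before any other call can overwrite it
        output.append(buffer, sizeof(buffer) - out_left);
        // E2BIG only means the output buffer was full; loop and continue.
        if (rc == (size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or incomplete input");
        }
    }
    iconv_close(cd);
    return output;
}
```

With this, to_utf8("\xE4", "WINDOWS-1252") should give the two bytes C3 A4 (the UTF-8 form of U+00E4). The code page would have to be passed in by the user, since the converter may not run on the machine that created the VSS DB.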

Dirk


Dirk wrote:

And also look here for TinyXML's support for UTF-8:
http://www.grinninglizard.com/tinyxmldocs/index.html

We're using TinyXML just for writing, so it doesn't need recognition code. The converter may run on a platform other than the one used to create the VSS DB, so the VSS locale may not be available. Hence my latest patch to force TinyXML to "trust" the specified locale, declare it in the output XML, and pass through all characters from the source physical files unmodified. It's left to the Perl XML parser to convert the encoding to Unicode internally. Currently the XML encoding is hardcoded to Windows-1252 and needs to be patched in the C++ source by users of other VSS locales. It might be desirable to pass this in as an argument to ssphys.


Yes, that's what I have always written in my mails: let ssphys pass all characters through unmodified and use Perl's XML parser for the real conversion. But when I read the links mentioned in my first mail of this thread, I started to think a little differently:

1.) We still have the problem of transporting "discouraged" characters, even if they are encoded in the "&#" form
2.) We need to specify the correct codepage in the header (this should be easy)
3.) We should bypass the console and write directly into a file.

While I'm searching for more information, have you got an idea about the encoding of what Windows thinks is UNICODE? Is it UTF-16 or UCS-2? If I understand everything correctly, UTF-16 is again a variable-length encoding, since code points above 0xFFFF are mapped to two 16-bit code units.
UCS-2 seems to be limited to the 16-bit range, so it cannot represent all Unicode scalar values.
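To make the difference concrete, here is a small sketch of how UTF-16 splits a code point above 0xFFFF into two 16-bit units (a surrogate pair). The function name encode_utf16 is made up for this example, and it assumes the input is already a valid Unicode scalar value.

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode scalar value as UTF-16 code units.
// Assumes cp is valid: at most 0x10FFFF and not in 0xD800..0xDFFF.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp)
{
    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        // Fits in one 16-bit unit; this is the only case UCS-2 can express.
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        std::uint32_t v = cp - 0x10000;  // 20 bits remain
        units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));    // high surrogate
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return units;
}
```

For example, U+20AC stays a single unit 0x20AC, while U+1D11E becomes the pair 0xD834 0xDD1E. For text that only uses characters below 0x10000, UTF-16 and UCS-2 produce the same 16-bit values.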

I have a better understanding of the "discouraged" characters now (see http://skew.org/xml/tutorial/):

Note that the XML 1.0 Recommendation refers to UCS characters by their Unicode scalar values, using a notation of |#x| followed by only as many hex digits as needed. So |#x9| in the EBNF productions means the abstract character that would be represented in Unicode 3.1's "U+" notation as |U+0009|. It does not necessarily mean a byte with hex value 9.
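As a sketch of what this means in practice, the check below tests a code point (not a byte) against the Char production of XML 1.0, plus the ranges that, as far as I can tell, the later editions of the Recommendation list as discouraged. The function names are made up for illustration.

```cpp
#include <cstdint>

// True if cp matches the XML 1.0 Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
bool is_xml10_char(std::uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

// True if cp is a legal Char that the Recommendation nevertheless discourages:
// the C1 controls except #x85, and the Unicode noncharacters.
bool is_xml10_discouraged(std::uint32_t cp)
{
    if ((cp >= 0x7F && cp <= 0x84) || (cp >= 0x86 && cp <= 0x9F))
        return true;
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return true;
    // Last two code points of each supplementary plane (#x1FFFE, #x1FFFF, ...).
    return cp >= 0x10000 && cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE;
}
```

Note that a character reference such as "&#x1;" does not help for code points that fail is_xml10_char: references must also refer to a legal Char, so those characters cannot be transported in XML 1.0 at all.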

I always interpreted this UCS mapping and Microsoft's UNICODE mapping as being equivalent. I'm not so sure anymore.

So what is the correct way?????

1.) We know that VSS is encoded in the current user's ANSI locale
2.) We can automatically determine this locale, or we can specify it on the command line
3.) We can output the XML file in the current locale, even if we get discouraged characters
4.) We can try to convert to UTF-8 using the WideCharToMultiByte functions

Grrrrr

Any further ideas?

Dirk
_______________________________________________
vss2svn-users mailing list
Project homepage:
http://www.pumacode.org/projects/vss2svn/
Subscribe/Unsubscribe/Admin:
http://lists.pumacode.org/mailman/listinfo/vss2svn-users-lists.pumacode.org
Mailing list web interface (with searchable archives):
http://dir.gmane.org/gmane.comp.version-control.subversion.vss2svn.user
