And also look here for TinyXML's support for UTF-8:
http://www.grinninglizard.com/tinyxmldocs/index.html
We're using TinyXML just for writing, so it doesn't need recognition
code. The converter may run on a platform other than that used to
create the VSS DB, so the VSS locale may not be available. Hence my
latest patch to force TinyXML to "trust" the specified locale, declare
it in the output XML, and pass through all characters from the source
physical files unmodified. It's left to the Perl XML parser to convert
the encoding to Unicode internally. Currently the XML encoding is
hardcoded to Windows-1252 and needs to be patched in the C++ source by
users of other VSS locales. It might be desirable to pass this in as
an argument to ssphys.
Yes, that's what I always wrote in my mails: let ssphys pass all
characters unmodified and use Perl's XML parser for the real conversion.
After reading the links mentioned in my first mail of this thread, I am
starting to think a little differently:
1.) We still have the problem of transporting "discouraged" characters,
even if they are encoded in the "&#" form
2.) We need to specify the correct codepage (this should be easy) in the
header
3.) We should bypass the console and directly write into a file.
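
Points 2 and 3 could look roughly like the following Python sketch. The
function name, file name and sample bytes are invented for illustration,
and it assumes the body bytes are already in the declared code page. The
file is opened in binary mode, so neither newline translation nor the
console code page gets in the way:

```python
import os
import tempfile

def write_xml_bytes(path, body_bytes, encoding_name):
    """Write an XML declaration plus already-encoded body bytes to a file."""
    declaration = '<?xml version="1.0" encoding="%s"?>\n' % encoding_name
    # Binary mode: no newline translation and no console code page involved.
    with open(path, "wb") as out:
        out.write(declaration.encode("ascii"))
        out.write(body_bytes)

# Hypothetical usage: 0xFC is "u umlaut" in Windows-1252.
out_path = os.path.join(tempfile.mkdtemp(), "sample.xml")
write_xml_bytes(out_path, b"<File>\xfcber.txt</File>\n", "Windows-1252")
with open(out_path, "rb") as check:
    written = check.read()
```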
While I'm searching for more information, do you have any idea which
encoding Windows actually means by UNICODE? Is it UTF-16 or UCS-2?
If I understand everything correctly, UTF-16 is again a variable-length
encoding, since code points from 0x10000 upwards are mapped onto two
16-bit code units. UCS-2 seems to clip the possible range of Unicode
scalar values.
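
A small Python sketch of the difference (the helper name is made up): a
character inside the Basic Multilingual Plane takes one 16-bit code unit,
while one above U+FFFF takes a surrogate pair, which plain UCS-2 cannot
represent. As far as I know, Windows 2000 and later treat wide strings as
UTF-16, while older NT versions were effectively UCS-2.

```python
def utf16_code_units(ch):
    """Return the 16-bit code units UTF-16 uses for a single character."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

bmp_units = utf16_code_units("\u20ac")         # euro sign, inside the BMP
astral_units = utf16_code_units("\U0001d11e")  # musical G clef, above U+FFFF
print([hex(u) for u in bmp_units])
print([hex(u) for u in astral_units])
```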
I have a better understanding of the "discouraged" characters now:
(http://skew.org/xml/tutorial/)
Note that the XML 1.0 Recommendation refers to UCS characters by their
Unicode scalar values, using a notation of |#x| followed by only as
many hex digits as needed. So |#x9| in the EBNF productions means the
abstract character that would be represented in Unicode 3.1's "U+"
notation as |U+0009|. It does not necessarily mean a byte with hex
value 9.
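
To make the distinction between bytes and code points concrete, here is a
hedged Python sketch. The function names are made up; the ranges are the
Char production and the "discouraged" ranges as I read them in the later
editions of the XML 1.0 Recommendation. The same byte 0x80 is harmless
when read as Windows-1252 but lands in a discouraged range when read as
ISO-8859-1:

```python
def is_xml10_char(cp):
    """True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def is_discouraged(cp):
    """True if cp is in the ranges the XML 1.0 spec lists as discouraged."""
    return (
        0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
        or 0xFDD0 <= cp <= 0xFDEF
        # Last two code points of each supplementary plane (U+1FFFE, U+1FFFF, ...).
        or (cp >= 0x10000 and (cp & 0xFFFE) == 0xFFFE)
    )

byte_value = b"\x80"
as_cp1252 = ord(byte_value.decode("cp1252"))   # U+20AC, the euro sign
as_latin1 = ord(byte_value.decode("latin-1"))  # U+0080, a C1 control character
```

So declaring the wrong encoding can turn ordinary characters into
discouraged ones without a single byte changing.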
I always interpreted this UCS mapping and Microsoft's UNICODE mapping as
being equivalent. I'm not so sure anymore.
So what is the correct way?
1.) We know that VSS is encoded in the current user's ANSI locale (code page)
2.) We can automatically determine this locale, or we can specify it on
the command line
3.) We can output the XML file in the current locale, even if we get
discouraged characters
4.) We can try to convert to UTF-8 using the WideCharToMultiByte functions
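
Option 4 amounts to decoding from the source code page and re-encoding as
UTF-8. A minimal Python sketch of that idea (the function name and the
cp1252 default are assumptions for illustration); in C++ on Windows the
corresponding route would presumably be MultiByteToWideChar followed by
WideCharToMultiByte with CP_UTF8:

```python
def ansi_to_utf8(raw, codepage="cp1252"):
    """Decode bytes from the given code page and re-encode them as UTF-8."""
    # Strict decoding (the default) raises UnicodeDecodeError on bytes the
    # code page does not define, for example 0x81 in cp1252.
    return raw.decode(codepage).encode("utf-8")

# 0xFC is "u umlaut" and 0xDF is "sharp s" in Windows-1252.
utf8_bytes = ansi_to_utf8(b"Gr\xfc\xdfe")
```

This only works if the code page from points 1 and 2 is known; with the
wrong one the conversion succeeds silently but produces the wrong text.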
Grrrrr
Any further ideas?
Dirk