(1) Whether the "local code page" encoding for Mac OS X should be UTF-8.
(2) Feedback on a revised and simplified Mac OS Transcoder that now supports UTF-8
as a local code page encoding.
Discussion of UTF-8 as local code page:
Xerces internally supports "Local Code Page" transcoders, whose purpose is to convert between the local encoding of a C-style character string and UTF-16, which Xerces uses internally. These "LCP" transcoders are called by Xerces whenever it is necessary to convert to or from a char string where the exact char encoding is not explicit.
For most practical purposes, that means that the LCP transcoder is used in construction of DOMStrings from C-strings, and in the construction of XMLStrings from C-strings. These cases are encountered in processing of Xerces sample command line arguments (like filenames), as well as any other cases where raw C strings are passed into Xerces.
Our traditional approach has been to set the LCP encoding to match the Mac system script encoding.
This approach, while traditionally the right answer under Mac OS, is a problem in Mac OS X when we're run from the terminal. Since the encoding of POSIX pathnames under Mac OS X is UTF-8, and since the Unix-layer tools generally use parameters without any transcoding, the interface to command line tools is implicitly UTF-8. Not so coincidentally, the default character set encoding in the Mac OS Terminal program is also UTF-8.
But since the Xerces LCP encoding has been set to match the system script (and is thus MacRoman in western areas), parameters, including pathnames, that are passed in on the command line have had to be MacRoman. Thus one has had to encode pathnames to Xerces samples differently than for typical command line tools. This does not seem desirable. But note that it's also not "generally" a problem, as UTF-8 is identical to MacRoman in the ASCII portion of its range; only characters beyond ASCII are encoded differently.
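To make the mismatch concrete, here is a small sketch (not Xerces code; the function name looksLikeUtf8 is my own, purely for illustration). It only checks the lead/continuation byte pattern of UTF-8, and does not reject overlong forms or surrogates. The point is that a MacRoman "é" is the single byte 0x8E, which can never stand alone in UTF-8, while the UTF-8 form is the two bytes 0xC3 0xA9.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `bytes` follows the UTF-8
// lead-byte / continuation-byte pattern. Structural check only.
bool looksLikeUtf8(const std::string& bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;                        // continuation bytes expected
        if (lead < 0x80)                extra = 0;    // plain ASCII
        else if ((lead & 0xE0) == 0xC0) extra = 1;    // 2-byte sequence
        else if ((lead & 0xF0) == 0xE0) extra = 2;    // 3-byte sequence
        else if ((lead & 0xF8) == 0xF0) extra = 3;    // 4-byte sequence
        else return false;                            // stray continuation or bad lead

        if (bytes.size() - i - 1 < extra) return false;   // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char next = static_cast<unsigned char>(bytes[i + k]);
            if ((next & 0xC0) != 0x80) return false;      // not a continuation byte
        }
        i += extra + 1;
    }
    return true;
}
```

A pure ASCII name such as "personal.xml" passes either way, which is why the problem has gone mostly unnoticed. A name like "café.xml" typed in MacRoman ("caf\x8E.xml") fails the check, while the same name in UTF-8 ("caf\xC3\xA9.xml") passes.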
A change to UTF-8 as the LCP encoding for Xerces under Mac OS X brings usage of Xerces into line with other command line tools.
But note the downside of switching the LCP encoding to UTF-8: any application that has relied internally on the fact that char strings passed to Xerces can be in the system script encoding will begin to fail. Since most modern apps are written using Unicode, and since Xerces provides a full Unicode interface, I submit that this is probably not such a common occurrence, nor one that we should go overboard to accommodate. But I'd like your feedback on this point.
As of my check-in today, and unless contrary feedback is received, the LCP encoding of Xerces on the Mac has been changed to UTF-8 if it's running on Mac OS X. The previous behavior can be restored by defining XML_MACOS_LCP_TRADITIONAL.
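Roughly, the selection works along these lines. This is a sketch of the idea only, not the checked-in code: the function chooseLcpEncoding and both of its parameters are hypothetical, and the only name taken from the actual change is the XML_MACOS_LCP_TRADITIONAL define.

```cpp
#include <string>

// Hypothetical helper: names the encoding the LCP transcoder would use.
// `systemScriptEncoding` stands in for whatever the OS reports as the
// system script encoding (for example "MacRoman" in western areas).
inline std::string chooseLcpEncoding(bool runningOnMacOSX,
                                     const std::string& systemScriptEncoding) {
#if defined(XML_MACOS_LCP_TRADITIONAL)
    // Traditional behavior: always follow the system script.
    (void)runningOnMacOSX;
    return systemScriptEncoding;
#else
    // New behavior: UTF-8 on Mac OS X, system script otherwise.
    return runningOnMacOSX ? std::string("UTF-8") : systemScriptEncoding;
#endif
}
```

Building with the define set (for instance in a prefix file, or with -DXML_MACOS_LCP_TRADITIONAL on a command-line compiler) would take the first branch and keep the old system-script behavior.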
As a real-world example of how this change can cause things to break, note that the change described above causes samples built under CodeWarrior to fail if passed file names containing non-ASCII characters. Since the CodeWarrior samples use the ccommand interface to collect "command line" arguments through a dialog, and since this dialog collects the arguments using the system script, any characters outside the ASCII range will be misconstrued by Xerces as UTF-8. This is exactly the opposite of the problem that has been occurring from the terminal. This is either (a) an unfortunate occurrence, (b) an argument for why this switch in the LCP encoding shouldn't be implemented, or (c) a good reason to define XML_MACOS_LCP_TRADITIONAL for the CodeWarrior samples.
Changes to the Mac OS Transcoder:
In order to enable the switchover to UTF-8 as the LCP encoding, I've revised the Mac OS Transcoder. Please watch for any problems that arise due to these code changes. The changes are:
The Mac OS Transcoder now uses the Mac Text Encoding Converter (TEC), rather than the previously used Unicode Converter. The TEC allows transcoding between a broader variety of encoding formats, while the Unicode Converter converts only to and from Unicode (but not between Unicode formats, such as UTF-8 <--> UTF-16).
As part of re-implementing the transcoder, I've also simplified it:
- I took out support for size mismatch between XMLCh and UniChar. This version assumes XMLCh will always be 16-bit UTF-16.
- The LCP transcoder is now completely generic. It is implemented strictly in terms of the virtual transcoder interface. As such, any XMLTranscoder can be passed in as the transcoder to fulfill the LCP transcoding.
Additionally, since LCP transcoders are used globally, I've added a mutex that serializes use of the LCP transcoder such that it can be actively used in only one thread at a time. This may slow things down a bit, but it's correct, and note also that LCP transcoders aren't used too often in typical Xerces usage.
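For those who want a picture of the shape of this, here is a minimal sketch in present-day C++ (std::mutex, std::u16string). It is an illustration under my own assumptions, not the checked-in implementation: Transcoder, AsciiTranscoder and SerializedLcpTranscoder are hypothetical names, and the real code is written against the Xerces XMLTranscoder interface and the Xerces mutex classes.

```cpp
#include <cstddef>
#include <mutex>
#include <string>

// Hypothetical stand-in for a virtual transcoder interface
// (NOT the actual Xerces XMLTranscoder class).
class Transcoder {
public:
    virtual ~Transcoder() {}
    virtual std::u16string toUtf16(const std::string& src) = 0;
    virtual std::string fromUtf16(const std::u16string& src) = 0;
};

// Toy transcoder used only to exercise the wrapper: ASCII passes
// through, anything else becomes a replacement character.
class AsciiTranscoder : public Transcoder {
public:
    std::u16string toUtf16(const std::string& src) override {
        std::u16string out;
        for (std::size_t i = 0; i < src.size(); ++i) {
            const unsigned char c = static_cast<unsigned char>(src[i]);
            out.push_back(c < 0x80 ? static_cast<char16_t>(c) : u'\uFFFD');
        }
        return out;
    }
    std::string fromUtf16(const std::u16string& src) override {
        std::string out;
        for (std::size_t i = 0; i < src.size(); ++i)
            out.push_back(src[i] < 0x80 ? static_cast<char>(src[i]) : '?');
        return out;
    }
};

// Generic LCP wrapper: knows nothing about the encoding, only the
// virtual interface, and lets one thread at a time use the transcoder.
class SerializedLcpTranscoder {
public:
    explicit SerializedLcpTranscoder(Transcoder& t) : fTranscoder(t) {}

    std::u16string toUtf16(const std::string& src) {
        std::lock_guard<std::mutex> lock(fMutex);   // released on return
        return fTranscoder.toUtf16(src);
    }
    std::string fromUtf16(const std::u16string& src) {
        std::lock_guard<std::mutex> lock(fMutex);
        return fTranscoder.fromUtf16(src);
    }

private:
    Transcoder& fTranscoder;   // not owned; must outlive the wrapper
    std::mutex  fMutex;
};
```

A single lock around each call is the simplest thing that is correct. Since the wrapper holds the lock for the whole conversion, long strings converted from many threads at once would contend, which is the slowdown mentioned above.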
Your feedback on either of these issues is appreciated.
James.