DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT <http://nagoya.apache.org/bugzilla/show_bug.cgi?id=27270>. ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND INSERTED IN THE BUG DATABASE.
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=27270

Xerces cannot open file whose name includes UTF8 characters

[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

------- Additional Comments From [EMAIL PROTECTED]  2004-02-26 23:38 -------
Hi Dave,

The code page doesn't vary randomly. It does vary as a function of the
system environment and configuration. The "code page" for Mac OS in most
western locations is MacRoman, while when localized for other locations it
is something else (say, MacCyrillic or MacThai ;). But for a given system,
it is predictable. This, as I understand it, is how code pages work under
Windows as well.

The twist here is that on Mac OS X, the posix file system layer always (and
consistently) uses UTF8 encoding for filenames. This doesn't vary with the
code page.

The problem occurs because Xerces doesn't distinguish posix file names from
any other kind of 8 bit encoding that it's trying to turn into UTF-16--the
LCP transcoder just gets called for all of them.

To address this bug, we could:

(1) Switch the LCP transcoder to always assume the LCP is UTF8 (at least on
    Mac OS X). This would fix the command line case, but might cause
    problems if people have compiled hard-coded strings into their code in
    MacRoman, say, or whatever their localized encoding is.

(2) Do some other hack to try to fix the command line cases.

(3) My preference: do (1) on Mac OS X by default, adding an API and defines
    to let somebody change the default LCP to something else (or to our
    traditional behavior).

Since the standard Mac OS X terminal defaults to UTF8 these days, this
should all work pretty well--unless there are cases where somebody
hard-compiled char constants where the encoding of the source code wasn't
UTF8 compatible. In that case they'd have to use the API to select a
different LCP, and _deal_with_ the UTF8 encoding of posix file names
themselves.

Ideas, feedback?

James.
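
To make option (3) a bit more concrete, here is a rough sketch of the idea. It is not Xerces code: LcpEncoding, setDefaultLcp, transcodeLcp and the two decode helpers are hypothetical names made up for illustration, and decodeSingleByte is only a stand-in for a real code page table such as MacRoman (it maps each byte straight to one UTF-16 unit). The UTF8 decoder is deliberately minimal and does not reject overlong forms.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names throughout; none of this is existing Xerces API.
enum class LcpEncoding { Utf8, SingleByte };

// Option (3): UTF8 is the default, but callers may change it.
static LcpEncoding gDefaultLcp = LcpEncoding::Utf8;

void setDefaultLcp(LcpEncoding e) { gDefaultLcp = e; }

// Stand-in for a single-byte code page (assumption: byte value == code unit).
// A real implementation would look each byte up in a MacRoman-style table.
std::u16string decodeSingleByte(const std::string& in) {
    std::u16string out;
    for (unsigned char c : in) out.push_back(static_cast<char16_t>(c));
    return out;
}

// Minimal UTF8 -> UTF-16 decoder; malformed input becomes U+FFFD.
std::u16string decodeUtf8(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i++]);
        char32_t cp = 0xFFFD;      // used for stray or invalid lead bytes
        std::size_t extra = 0;     // number of continuation bytes expected
        if (b < 0x80)                { cp = b; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                cp = 0xFFFD;       // truncated or broken sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        }
        if (cp > 0x10FFFF) cp = 0xFFFD;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {                   // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// What the LCP transcoder would do under option (3): honor the default.
std::u16string transcodeLcp(const std::string& in) {
    switch (gDefaultLcp) {
        case LcpEncoding::Utf8:       return decodeUtf8(in);
        case LcpEncoding::SingleByte: return decodeSingleByte(in);
    }
    return decodeSingleByte(in);
}
```

With the UTF8 default, the posix file name bytes 63 61 66 C3 A9 ("café") come out as four UTF-16 units ending in U+00E9. After setDefaultLcp(LcpEncoding::SingleByte), the same bytes come out as five units, with the C3 and A9 bytes each turned into a separate character, which is the kind of mangled name that then fails to open.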
