http://nagoya.apache.org/bugzilla/show_bug.cgi?id=27270

Xerces cannot open file whose name includes UTF8 characters

[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED



------- Additional Comments From [EMAIL PROTECTED]  2004-02-26 23:38 -------
Hi Dave,

The code page doesn't vary randomly. It does vary as a function of the
system environment and configuration. The "code page" for Mac OS in most
western locations is MacRoman, while when localized for other locations it
is something else (say, MacCyrillic or MacThai ;). But for a given system,
it is predictable. This, as I understand it, is how code pages work under
Windows as well.

The twist here is that on Mac OS X, the POSIX file system layer always (and
consistently) uses UTF8 encoding for filenames. This doesn't vary with the
code page. The problem occurs because Xerces doesn't distinguish POSIX file
names from any other kind of 8-bit encoded text that it's trying to turn
into UTF-16--the LCP transcoder just gets called.
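
To make the mismatch concrete, here is a minimal sketch (not Xerces code)
that decodes the same file name bytes two ways. The function names are
made up for illustration, the UTF8 decoder assumes well-formed input, and
the MacRoman decoder maps only the two high bytes needed for this example
(0xC3 and 0xA9); a real transcoder would use the full table.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into UTF-16 code units.
// Sketch only: assumes well-formed input and performs no validation.
std::u16string decodeUtf8(const std::string& bytes) {
    std::u16string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t len;
        if (b < 0x80)      { cp = b;        len = 1; }  // ASCII
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }  // 2-byte sequence
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }  // 3-byte sequence
        else               { cp = b & 0x07; len = 4; }  // 4-byte sequence
        // Fold in the 6 payload bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < bytes.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        i += len;
        if (cp >= 0x10000) {
            // Outside the BMP: emit a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}

// Decode the same bytes as if they were MacRoman, one byte per character.
// Deliberately partial: only 0xC3 and 0xA9 are mapped; every other high
// byte becomes U+FFFD (replacement character).
std::u16string decodeMacRomanPartial(const std::string& bytes) {
    std::u16string out;
    for (char c : bytes) {
        unsigned char b = static_cast<unsigned char>(c);
        if (b < 0x80)       out.push_back(static_cast<char16_t>(b));
        else if (b == 0xC3) out.push_back(static_cast<char16_t>(0x221A)); // square root sign
        else if (b == 0xA9) out.push_back(static_cast<char16_t>(0x00A9)); // copyright sign
        else                out.push_back(static_cast<char16_t>(0xFFFD));
    }
    return out;
}
```

For the bytes "caf\xC3\xA9.xml" (how the POSIX layer hands back a name
containing e-acute), decodeUtf8 yields 8 code units with U+00E9 at index 3,
while decodeMacRomanPartial yields 9 code units with U+221A U+00A9 in its
place. The second result names a file that doesn't exist, which is the
failure described in this bug.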

To address this bug, we could:

(1) Switch the LCP transcoder to always assume the LCP is UTF8 (at least on
Mac OS X). This would fix the command line case, but might cause problems
if people have compiled hard-coded strings into their code in MacRoman,
say, or whatever their localized encoding is.

(2) Do some other hack to try to fix the command line cases.

(3) My preference: do (1) on Mac OS X by default, adding an API and defines
to let somebody change the default LCP to something else (or to our
traditional behavior). Since the standard Mac OS X terminal defaults to
UTF8 these days, this should all work pretty well--unless there are cases
where somebody hard-compiled char constants where the encoding of the
source code wasn't UTF8 compatible--in that case they'd have to use the API
to select a different default, and _deal_with_ the UTF8 encoding of POSIX
file names.
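
As a rough sketch of what (3) might look like (purely hypothetical: none of
these names exist in Xerces, and SKETCH_DEFAULT_LCP, LcpEncoding and
LcpConfig are invented here for illustration), the default could come from
a define and be changeable at run time, while file names stay UTF8 no
matter what:

```cpp
#include <atomic>

// Hypothetical encodings the LCP transcoder could assume.
enum class LcpEncoding { Utf8, SystemCodePage };

// Hypothetical compile-time default; a build could pass
// -DSKETCH_DEFAULT_LCP=LcpEncoding::SystemCodePage to keep the
// traditional behavior.
#ifndef SKETCH_DEFAULT_LCP
#define SKETCH_DEFAULT_LCP LcpEncoding::Utf8
#endif

// Hypothetical run-time switch for the default LCP.
class LcpConfig {
public:
    // Encoding currently assumed for ordinary 8-bit strings.
    static LcpEncoding get() { return current().load(); }

    // Let the application pick a different default.
    static void set(LcpEncoding e) { current().store(e); }

    // POSIX file names on Mac OS X are UTF8 regardless of the setting
    // above, so file name transcoding would ignore the default.
    static LcpEncoding forPosixFileName() { return LcpEncoding::Utf8; }

private:
    static std::atomic<LcpEncoding>& current() {
        static std::atomic<LcpEncoding> value{SKETCH_DEFAULT_LCP};
        return value;
    }
};
```

Somebody with MacRoman char constants would call
LcpConfig::set(LcpEncoding::SystemCodePage) at startup; file names would
still go through the UTF8 path.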

Ideas, feedback?

James.
