[ https://issues.apache.org/jira/browse/XERCESC-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Scott Cantor updated XERCESC-1166:
----------------------------------
    Fix Version/s:     (was: Nightly build (please specify the date))
                   2.5.0

> Xerces cannot open file whose name includes UTF8 characters
> -----------------------------------------------------------
>
>                 Key: XERCESC-1166
>                 URL: https://issues.apache.org/jira/browse/XERCESC-1166
>             Project: Xerces-C++
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.4.0
>         Environment: Operating System: Other
>                      Platform: Macintosh
>            Reporter: Mark Goldstein
>            Assignee: James Berry
>             Fix For: 2.5.0
>
>
> I originally wrote about this as attached below.
> James Berry asked me to file the bug report, see his e-mail below as well.
>
> On Feb 25, 2004, at 5:31 PM, Mark Goldstein wrote:
>
> Hello,
>
> Using Xalan/Xerces I tried to transform a file with a name that included an
> "e" with accent. Your mailer might show it:
>
> féébad.xml
>
> The command line call (using Mac OS-X copy/paste which converts the
> characters to octal constants) looks like this:
>
> mark$ ./Xalan -o foo.out fe\314\201e\314\201bad.xml foo.xsl
>
> And it results in this error:
>
> Fatal Error at (unknown file , line 0 , column {null} ): An exception occurred!
> Type:RuntimeException,
> Message:The primary document entity could not be opened. Id=féébad.xml
> SAXParseException: An exception occurred! Type:RuntimeException,
> Message:The primary document entity could not be opened. Id=féébad.xml (, line 0, column 0)
>
> Is this a known bug? Is there a work-around?
>
> This isn't a known bug, but, having done a bit of snooping, I do believe
> that it is a bug.
>
> Here's what I think is going on:
>
> Xerces creates a transcoder that converts from the local code page to
> unicode (LCP Transcoder). On Mac OS, it assumes the local code page is
> whatever the default system script encoding is, which is often MacRoman.
> This LCP Transcoder is used whenever an XMLString is created from a char*.
> That is done, for instance, as part of taking a file off the command line
> and creating a parser from it.
>
> The problem in your case is that the characters coming off the Mac OS X
> command line are actually utf-8, not (MacRoman, or whatever). They're being
> converted to utf-16 as if they were MacRoman. And all hell breaks loose,
> including the unfortunate fact that the file can't be opened.
>
> This is a bit of a no-win situation. We could simply make the LCP Transcoder
> assume the LCP is always utf-8, but that would require a major
> re-architecting of the transcoder, since we rely on the lower level unicode
> converter, which can't transcode between unicode encodings, only to and from
> them. It also may not be quite the right answer either, since it just fixes
> the situation for the command line and ignores the fact that there are a
> number of other LCP encodings being used, which this decision could affect.
>
> There are probably a number of workarounds, but they all basically boil down
> to not relying on the LCP transcoder to convert the utf-8 string from the
> command line into unicode in the first place. For instance, you could
> explicitly call the intrinsic utf-8 transcoder through Transervice, or cheat
> and call TranscodeUTF8ToUniChars, which is buried down in
> MacOSPlatformUtils. There are probably better solutions, but it's getting
> late for me now. Once you have the filename in utf-16, pass that directly
> into the parser.
>
> There may be other simpler workarounds, which might include simply changing
> the encoding of text in the terminal to MacRoman, or whatever. But it's
> making my head hurt to understand the interactions that would occur in that
> case...your file wouldn't list correctly in that case, I would think.
>
> Please let me know how it goes, and if you could write a bug report that
> would help as well.
>
> James.
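
The mismatch described above can be reproduced without Xerces. The following is only a sketch: `utf8ToUtf16` and `singleByteToUtf16` are hypothetical helpers, not Xerces APIs, and the two MacRoman mappings (0xCC to U+00C3, 0x81 to U+00C5) are assumed from the standard MacRoman table and cover only the bytes in the reported file name. It decodes the same command-line bytes once as utf-8 and once byte by byte, the way a single-byte local code page transcoder would.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode UTF-8 into UTF-16 code units.
// Sketch only: assumes well-formed input and does no validation.
std::vector<std::uint16_t> utf8ToUtf16(const std::string& in) {
    std::vector<std::uint16_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }
        ++i;
        for (std::size_t k = 0; k < extra && i < in.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        }
        if (cp >= 0x10000) {
            // Outside the BMP: emit a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<std::uint16_t>(cp));
        }
    }
    return out;
}

// Decode the same bytes as if each byte were one character of a
// single-byte code page. Only the two high bytes that occur in the
// reported file name are mapped (assumed MacRoman values); any other
// high byte becomes U+FFFD.
std::vector<std::uint16_t> singleByteToUtf16(const std::string& in) {
    std::vector<std::uint16_t> out;
    for (char c : in) {
        const unsigned char b = static_cast<unsigned char>(c);
        if (b < 0x80)       out.push_back(b);
        else if (b == 0xCC) out.push_back(0x00C3);  // assumed MacRoman 0xCC
        else if (b == 0x81) out.push_back(0x00C5);  // assumed MacRoman 0x81
        else                out.push_back(0xFFFD);  // not covered by this sketch
    }
    return out;
}
```

For the 14 bytes of `"fe\314\201e\314\201bad.xml"`, the utf-8 path yields 12 code units with U+0301 (combining acute accent) after each `e`, while the byte-wise path yields 14 code units with U+00C3 U+00C5 in those positions. The second sequence names a different file, which is consistent with the "could not be opened" error in the report.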