On 2/17/2011 1:58 AM, Hussein Shafie wrote:
> On 02/16/2011 11:37 PM, Andy Black wrote:
>> Hussein:
>>
>> One of our users recently ran into a frustrating situation. He had a
>> <link> element whose href attribute was set as a relative URI to a sound
>> file in MP3 format. The sound file was on his hard disk in a
>> subdirectory of where the main file is located. (This is for the
>> XLingPaper custom configuration files we have for use with XXE, but I
>> suspect that is not a factor.) The file name included accented vowels.
>> As an example, one was
>>
>> ɔ̀kɔ́ àbɔ̀n.mp3
>>
>> (The name may not come through correctly in this email. The Unicode
>> characters are: U+0254 U+0300 k U+0254 U+0301 (space) a U+0300 b U+0254
>> U+0300 n . m p 3.)
> Surprising! Why, for example, express agrave as U+0061 U+0300 (that is,
> Letter 'a' followed by Combining Diacritical Mark '`') when there is a
> Unicode character for agrave: U+00E0?

I think that either his input method created it as NFD or the user
copied this string from some material in a program that internally uses
NFD (this is a sound file for a particular phrase in a minority language
and the name of the file is that phrase).

>> When he used the Browse Files... option in the Attributes Editor of XXE,
>> the result was
>>
>> %C9%94%CC%80k%C9%94%CC%81%20%C3%A0b%C9%94%CC%80n.mp3
>>
>> I understand that this is the same file name using percent-encoding
>> (with UTF-8 encoding values) for those requiring it with the exception
>> that the grave a in the file name is two Unicode characters (U+0061
>> U+0300) while the grave a in the URI is one Unicode character (U+00E0).
>> Apparently, this difference is crucial.
>>
>> The problem becomes clear when the user's XML file is converted to
>> either a web page output or a PDF output and the user clicks on the
>> link. The browser or PDF reader indicates that it will look for the
>> correct file name (at least, it looks correct - one can see the grave a,
>> for example), but these applications report that they could not find the
>> file. Looking at the web page, the file name is exactly as the URI
>> returned from the Attributes Editor as given above. Similarly for the
>> PDF file. So why is it that everything looks good, but these
>> applications say that they cannot find the file?
>>
>> From what we can tell, the problem is for characters like the grave
>> accented a. The file system on the user's hard drive
> Probably Mac OS X HFS+. See
> http://stackoverflow.com/questions/3610013/file-listfiles-mangles-unicode-names-with-jdk-6-unicode-normalization-issues

Actually it was on Linux. I was able to reproduce the problem on both a
Linux system and on Windows XP. I had not tried it on a Mac. Thanks,
though, for the insightful link.

>> uses the decomposed
>> form (NFD) of the grave a (i.e. it is U+0061 U+0300) while the result of
>> the Browse Files... option in the Attributes Tool (U+00E0) uses the
>> composed form (NFC). When the composed form is used by a web browser or
>> PDF reader, there is a mismatch to the file name on the hard drive so
>> the file cannot be found.
>>
>> Is this a known issue with the Browse Files... dialog box in the
>> Attributes Tool? That is, is it known that this tool converts NFD format
>> to NFC? Is there a preferences setting that can be set to control this?
>> Is there some other work-around available?
>>
> XXE simply uses the characters of the filenames passed to it by the Java
> runtime. Therefore it's a Java issue and not an XXE issue. Java does not
> seem to keep the original decomposed form of the characters. I don't
> know any (simple) workaround.

Thanks again for looking into this. I understand the problem much
better now, although what we're going to do for a general solution is
not yet clear...

--Andy

--
XMLmind XML Editor Support List
[email protected]
http://www.xmlmind.com/mailman/listinfo/xmleditor-support
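For anyone who wants to see the mismatch outside of XXE, here is a minimal sketch in Python (not Java, and not what XXE itself does). It assumes the file name consists of exactly the code points listed in the message above; the helper name percent_encode is made up for this illustration. It normalizes the name to NFC or NFD, percent-encodes the UTF-8 bytes, and shows that only the "a" plus U+0300 pair changes, because U+0254 has no precomposed form with a grave or acute accent.

```python
import unicodedata
from urllib.parse import quote, unquote

# File name from the message, spelled out code point by code point.
# It is already in decomposed form (NFD): each accent is a separate
# combining character following its base letter.
name_on_disk = "\u0254\u0300k\u0254\u0301 a\u0300b\u0254\u0300n.mp3"


def percent_encode(name: str, form: str) -> str:
    """Normalize `name` to the given Unicode form, then percent-encode
    its UTF-8 bytes (hypothetical helper for this illustration)."""
    return quote(unicodedata.normalize(form, name))


uri_nfc = percent_encode(name_on_disk, "NFC")  # composed form
uri_nfd = percent_encode(name_on_disk, "NFD")  # decomposed form

print(uri_nfc)
print(uri_nfd)

# Decoding the NFC URI gives a string that renders the same as the name
# on disk but is a different code point sequence, so an exact lookup fails.
print(unquote(uri_nfc) == name_on_disk)

# Re-normalizing the decoded name to NFD recovers the on-disk name.
print(unicodedata.normalize("NFD", unquote(uri_nfc)) == name_on_disk)
```

The NFC result is the same string that the Browse Files... option produced (with %C3%A0 for U+00E0), while the NFD result has a%CC%80 in that position. A possible workaround along these lines would be to re-normalize the href value to the form the file system actually stores, but whether that is practical for XLingPaper users is an open question.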

