On 2/17/2011 1:58 AM, Hussein Shafie wrote:
> On 02/16/2011 11:37 PM, Andy Black wrote:
>> Hussein:
>>
>> One of our users recently ran into a frustrating situation. He had a
>> <link> element whose href attribute was set as a relative URI to a sound
>> file in MP3 format. The sound file was on his hard disk in a
>> subdirectory of where the main file is located. (This is for the
>> XLingPaper custom configuration files we have for use with XXE, but I
>> suspect that is not a factor.) The file name included accented vowels.
>> As an example, one was
>>
>> ɔ̀kɔ́ àbɔ̀n.mp3
>>
>> (The name may not come through correctly in this email. The Unicode
>> characters are: U+0254 U+0300 k U+0254 U+0301 (space) a U+0300 b U+0254
>> U+0300 n . m p 3.)
> Surprising! Why, for example, express agrave as U+0061 U+0300 (that is,
> Letter 'a' followed by Combining Diacritical Mark '`') when there is a
> Unicode character for agrave: U+00E0?

I think that either his input method created it as NFD or the user
copied this string from some material in a program that internally uses
NFD (this is a sound file for a particular phrase in a minority
language and the name of the file is that phrase).

>> When he used the Browse Files... option in the Attributes Editor of XXE,
>> the result was
>>
>> %C9%94%CC%80k%C9%94%CC%81%20%C3%A0b%C9%94%CC%80n.mp3
>>
>> I understand that this is the same file name using percent-encoding
>> (with UTF-8 encoding values) for the characters that require it, with
>> the exception that the grave a in the file name is two Unicode
>> characters (U+0061 U+0300) while the grave a in the URI is one Unicode
>> character (U+00E0). Apparently, this difference is crucial.
>>
>> The problem becomes clear when the user's XML file is converted to
>> either a web page output or a PDF output and the user clicks on the
>> link. The browser or PDF reader indicates that it will look for the
>> correct file name (at least, it looks correct - one can see the grave a,
>> for example), but these applications report that they could not find the
>> file. Looking at the web page, the file name is exactly the URI
>> returned from the Attributes Editor as given above. Similarly for the
>> PDF file. So why is it that everything looks good, but these
>> applications say that they cannot find the file?
>>
>> From what we can tell, the problem is for characters like the grave
>> accented a. The file system on the user's hard drive
> Probably Mac OS X HFS+. See
> http://stackoverflow.com/questions/3610013/file-listfiles-mangles-unicode-names-with-jdk-6-unicode-normalization-issues

Actually it was on Linux.  I was able to reproduce the problem on both a 
Linux system and on Windows XP.  I had not tried it on a Mac.

Thanks, though, for the insightful link.
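
The mismatch described above can be reproduced outside XXE. The following is only a minimal sketch, assuming Java 10 or later (for the URLEncoder.encode overload that takes a Charset); the class name NormalizationDemo and the helper percentEncode are made up for illustration and are not part of XXE. It percent-encodes the example file name in both its decomposed (NFD) and composed (NFC) forms:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class NormalizationDemo {
    // Percent-encode a name as UTF-8. URLEncoder targets HTML forms and
    // writes a space as '+', so map that to %20 (a literal '+' in the
    // input has already become %2B and is not affected).
    static String percentEncode(String name) {
        return URLEncoder.encode(name, StandardCharsets.UTF_8).replace("+", "%20");
    }

    public static void main(String[] args) {
        // The file name as stored on disk: U+0254 U+0300 k U+0254 U+0301
        // (space) a U+0300 b U+0254 U+0300 n . m p 3
        String nfd = "\u0254\u0300k\u0254\u0301 a\u0300b\u0254\u0300n.mp3";

        // NFC composes a + U+0300 into U+00E0; the open o (U+0254) has no
        // precomposed accented form, so those sequences stay as they are.
        String nfc = Normalizer.normalize(nfd, Normalizer.Form.NFC);

        System.out.println(percentEncode(nfd));
        // %C9%94%CC%80k%C9%94%CC%81%20a%CC%80b%C9%94%CC%80n.mp3
        System.out.println(percentEncode(nfc));
        // %C9%94%CC%80k%C9%94%CC%81%20%C3%A0b%C9%94%CC%80n.mp3
        System.out.println(nfd.equals(nfc));
        // false
    }
}
```

The second line of output is the URI that the Attributes Editor produced; the first is what would match the name on disk. The two differ only in how the grave a is encoded (a%CC%80 versus %C3%A0).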

>> uses the decomposed
>> form (NFD) of the grave a (i.e. it is U+0061 U+0300) while the result of
>> the Browse Files... option in the Attributes Tool (U+00E0) uses the
>> composed form (NFC). When the composed form is used by a web browser or
>> PDF reader, there is a mismatch to the file name on the hard drive so
>> the file cannot be found.
>>
>> Is this a known issue with the Browse Files... dialog box in the
>> Attributes Tool? That is, is it known that this tool converts NFD format
>> to NFC? Is there a preferences setting that can be set to control this?
>> Is there some other work-around available?
>>
> XXE simply uses the characters of the filenames passed to it by the Java
> runtime. Therefore it's a Java issue and not an XXE issue. Java does not
> seem to keep the original decomposed form of the characters. I don't
> know any (simple) workaround.

Thanks again for looking into this.  I understand the problem much 
better now, although what we're going to do for a general solution is 
not yet clear...
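
One possible direction, offered purely as a sketch and not as something XXE or XLingPaper does today: check whether a file name is already in NFC before linking to it, so that names which would change under composition can be flagged and renamed. The class NfcNameCheck and its methods are hypothetical names; only java.text.Normalizer is a real API here.

```java
import java.text.Normalizer;

class NfcNameCheck {
    // True if the name is already in NFC, i.e. composing it changes nothing,
    // so a URI built from the composed form still matches the name on disk.
    static boolean isSafe(String fileName) {
        return Normalizer.isNormalized(fileName, Normalizer.Form.NFC);
    }

    // The name the file would need to have for an NFC-based URI to match.
    static String suggestedName(String fileName) {
        return Normalizer.normalize(fileName, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String name = "\u0254\u0300k\u0254\u0301 a\u0300b\u0254\u0300n.mp3";
        if (!isSafe(name)) {
            System.out.println("Not NFC; consider renaming to: " + suggestedName(name));
        }
    }
}
```

This only detects the condition; whether renaming the user's files is acceptable is a separate question.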

--Andy

 
--
XMLmind XML Editor Support List
[email protected]
http://www.xmlmind.com/mailman/listinfo/xmleditor-support
