Thank you, Hussein, for looking into this. I fully understand your
conclusion.
I've written to our user about this and suggested he try using a tool
like convmv (https://linux.die.net/man/1/convmv) to convert his file
names to NFC.
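If convmv is not available to him, the same renaming could be sketched in Java. This is only a minimal sketch under stated assumptions, not a tested tool: the class name NfcRenamer, the "--apply" argument, and the dry-run default are hypothetical choices made for illustration. It looks only at the entries directly inside one directory and assumes a file system that stores names without normalizing them, as a typical Linux file system does.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of XXE or convmv.
class NfcRenamer {
    /** Returns the NFC form of a file name (pure string operation). */
    static String toNfc(String fileName) {
        return Normalizer.normalize(fileName, Normalizer.Form.NFC);
    }

    /**
     * Renames entries directly inside dir whose names are not already NFC.
     * With dryRun set, nothing is renamed; the would-be targets are returned.
     */
    static List<Path> renameToNfc(Path dir, boolean dryRun) throws IOException {
        List<Path> entries = new ArrayList<>();
        try (Stream<Path> stream = Files.list(dir)) {
            stream.forEach(entries::add);
        }
        List<Path> targets = new ArrayList<>();
        for (Path entry : entries) {
            String name = entry.getFileName().toString();
            String nfc = toNfc(name);
            if (nfc.equals(name)) {
                continue; // already NFC (or plain ASCII)
            }
            Path target = entry.resolveSibling(nfc);
            if (Files.exists(target)) {
                // Never overwrite an existing file with the same NFC name.
                System.err.println("Skipping " + entry + ": target already exists");
                continue;
            }
            if (!dryRun) {
                Files.move(entry, target);
            }
            targets.add(target);
        }
        return targets;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        // Assumption: only an explicit "--apply" second argument renames anything.
        boolean dryRun = args.length < 2 || !args[1].equals("--apply");
        for (Path p : renameToNfc(dir, dryRun)) {
            System.out.println((dryRun ? "Would rename to " : "Renamed to ") + p);
        }
    }
}
```

Running it first without "--apply" lists what would change, which seems prudent before touching generated file names.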
--Andy
On 2/15/2020 12:18 PM, Hussein Shafie wrote:
On 2/6/20 5:40 PM, H. Andrew Black wrote:
Mine is Windows 10. A user is on some form of Linux. He just wrote
to me the following:
"The decomposed (NFD) form is preferred for our work, because High
tone is not a part of a vowel, but rather something that goes with
it. This is reflected in our virtual keyboards, and allows searching
for high tone, irrespective of vowel ... This is different from the
normal NFC case, like French é, which is its own vowel, not e and
high tone, etc."
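To illustrate the point about searching, here is a small Java sketch using java.text.Normalizer. The class name ToneSearch and the sample word are made up for the example, and the combining acute accent (U+0301) is assumed to be the high-tone mark, as in the "e" plus accent case. In NFD text the tone mark is a separate character that can be found regardless of the vowel; after conversion to NFC it is absorbed into the precomposed vowel and a search for it finds nothing.

```java
import java.text.Normalizer;

// Hypothetical demo class; the sample word below is invented.
class ToneSearch {
    // Assumption: high tone is written with the combining acute accent.
    static final char COMBINING_ACUTE = '\u0301';

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Counts high-tone marks in the text exactly as it is stored. */
    static int countHighTones(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == COMBINING_ACUTE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String word = "ba\u0301ko\u0301"; // NFD: each vowel followed by U+0301
        System.out.println(countHighTones(word));           // 2
        System.out.println(countHighTones(nfc(word)));      // 0, marks are composed away
        System.out.println(countHighTones(nfd(nfc(word)))); // 2, decomposing restores them
    }
}
```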
So for him, he always uses NFD at least internally in his programs
(and maybe in his text files so he can do searches). The file name at
issue was automatically generated by a program called Praat
(http://www.fon.hum.uva.nl/praat/) based on the information he keyed
into it. That is, Praat created the file name using NFD and his
Linux file system accepted it without converting it to NFC. (In the
NFD sample file in the zip file above, I copied just a small portion
of the Praat-generated file name and renamed my PNG file to use it.)
So, what to do? I see at least two possibilities:
1. Our user finds a way to convert his *file names* to NFC instead of
NFD. They would then always load correctly in XXE. If, however, he
needs to search on file names where the decomposed tone mark (NFD) is
crucial, that will be a problem for him. If, instead, he only
needs to search on the *contents* of files, then converting file
names to NFC should work well for him (assuming he can figure out how
to do that).
2. See if XXE can keep the NFD/NFC distinction when getting the file
name. One possibility might be at
https://stackoverflow.com/questions/43380362/java-differentiate-between-files-in-unicode-nfc-and-nfd.
Sorry, but we currently see no way to preserve NFD-encoded characters
(e.g. "é" represented by "e" followed by combining acute accent
U+0301) when converting a file path to a URL.
We tested the Stack Overflow workaround mentioned above (which is "use
Path.toUri") on all platforms but, in the end, we did not keep it.
With Path.toUri replacing File.toURI, using the most recent version of
Java, the results are:
* On Linux, it fixes the bug.
* On the Mac, NFC chars seem to be converted to NFD (which is just the
opposite bug).
* On Windows, Path.toUri works just like File.toURI. That is, NFD
chars are converted to NFC chars.
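For anyone who wants to repeat this comparison on their own platform, a rough Java sketch follows. It is not the code used in XXE: the class name UriNormalizationCheck and its helper methods are hypothetical, and the two file names are just the "é" example in NFD and NFC. It builds a URI with File.toURI() and with Path.toUri() (the NIO method) and reports the normalization form of the last path segment. No expected output is shown because, as described above, the result depends on the platform.

```java
import java.io.File;
import java.net.URI;
import java.nio.file.InvalidPathException;
import java.nio.file.Paths;
import java.text.Normalizer;

// Hypothetical test harness, not taken from XXE.
class UriNormalizationCheck {
    /** Classifies a string as "NFC", "NFD", "both" (e.g. pure ASCII) or "neither". */
    static String form(String s) {
        boolean isNfc = Normalizer.isNormalized(s, Normalizer.Form.NFC);
        boolean isNfd = Normalizer.isNormalized(s, Normalizer.Form.NFD);
        if (isNfc && isNfd) return "both";
        if (isNfc) return "NFC";
        if (isNfd) return "NFD";
        return "neither";
    }

    /** Last path segment of a URI, after percent-decoding. */
    static String lastSegment(URI uri) {
        String path = uri.getPath(); // getPath() returns the decoded path
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // The files need not exist; only the name-to-URI conversion is examined.
        String[] names = { "e\u0301.png", "\u00e9.png" }; // NFD, then NFC
        for (String name : names) {
            System.out.println("input form: " + form(name));
            URI fromFile = new File(name).toURI();
            System.out.println("  File.toURI -> " + form(lastSegment(fromFile)));
            try {
                URI fromPath = Paths.get(name).toUri();
                System.out.println("  Path.toUri -> " + form(lastSegment(fromPath)));
            } catch (InvalidPathException e) {
                // Can happen when the platform locale cannot encode the name.
                System.out.println("  Path.toUri -> not representable: " + e.getReason());
            }
        }
    }
}
```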
--
XMLmind XML Editor Support List
https://www.xmlmind.com/mailman/listinfo/xmleditor-support