Why do I get the feeling Apple made everything worse by not sticking with either UTF-16 or UTF-8 encodings and POSIX collation for Finder etc.?
On Mon, Mar 17, 2014 at 8:18 PM, Bjoern Kahl <googlelo...@bjoern-kahl.de> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>
> Too late in the night, hit "send" too early :-(
>
> On 18.03.14 00:31, Bjoern Kahl wrote:
>>
>> I apologize for this being a bit longer, but I tried to really
>> clarify what normalization is all about and how it affects ZFS on
>> OSX.
>>
>> On 17.03.14 20:56, Philip Robar wrote:
>>> On Mon, Mar 17, 2014 at 3:40 PM, Dave Cottlehuber
>>> <d...@jsonified.com> wrote:
>>
>>>> On 17 March 2014 at 19:17:23, Philip Robar
>>>> (philip.ro...@gmail.com) wrote:
>>>>> I admit to being one whose eyes glaze over when the
>>>>> discussion turns to i18n/l10n. So why should I use formD
>>>>> normalization?
>>>>
>>>> Because (as you point out ;-) poorly written software won't
>>>> work.
>>>>
>>>> iTunes is one of them, sadly.
>>>>
>>
>>> OK, let me try again. I read a description of the various
>>> normalization forms and despite my being a native speaker of
>>> English I couldn't find any meaning in the words. (Something,
>>> unfortunately all too common when it comes to standards docs.)
>>> So can you explain for the naive and mildly interested what
>>> "formD" means?
>>
>> The two normalization forms "formD" and "formC" mandate how
>> certain characters outside the standard ASCII range (A-Z, a-z, 0-9
>> and a few punctuation characters ".,-;" and some others) are
>> represented.
>>
>>
>> For example (note: the following is not fully technically correct,
>> but illustrates the idea), the German letter "ö", named o_umlaut,
>> could be represented as-is, that is as a single entity of Unicode
>> code point number 246.
>>
>> However, the "ö" could also be seen as a plain "o" with two dots
>> (in printed text and modern German handwriting since 1978) or two
>> short downward lines (in some German handwriting scripts, for
>> example the Sütterlin script or other Kurrent scripts and
>> handwriting taught before 1978).
>>
>> Similarly, the "ö" can be encoded in Unicode by a two-character
>> sequence, a plain "o" and a modifier '"' with the meaning "put two
>> dots above the previous character" (note: '"' is not such a
>> modifier, it serves here as a visualization of the actual
>> modifier).
>>
>>
>> Now, a text in normalization "formC" or "composed form" would have
>> all characters which can be represented by a single entity encoded
>> using this single character.
>>
>> A text in "formD" or "decomposed form" normalization would have
>> all characters that have some dots, accents, or other "additions"
>> encoded using the plain base character followed by one or more
>> modifiers.
>>
>> It is normalized formD if the modifiers come in a defined order,
>> for example if a character has a dot above and below, the modifier
>> for "dot below" always comes first.
>>
>> It is in irregular formD if all characters are decomposed, but
>> the modifiers do not come in the defined order. In the example of a
>> dot below and above a character, having the modifier for "dot
>> above" coming before the modifier for "dot below" makes the string
>> irregular.
>>
>>
>> This whole mess is important because it affects how sorting
>> works. For example, two strings "o" + "dot_below" + "dot_above"
>> and "o" + "dot_above" + "dot_below" should compare equal, because
>> they carry the same information, despite the fact that they differ
>> in their binary representation.
>>
>> Normalizing makes comparing and sorting easier.
>>
>>
>> Normalization and ZFS and OSX
>> =============================
>>
>>
>> Why should we care?
>>
>> Because Finder wants to sort directory listings, and for this needs
>> to know how the byte sequence it gets from the VFS maps to
>> script symbols and how these symbols order.
>>
>> Finder expects text like filenames to be in formD.
>>
>> For file systems like ZFS this means they need to
>>
>> (a) simple case: ignore encoding altogether and just deal with
>> byte sequences.
>> Since names are stored and returned as they arrive
>> from the Finder & Co., no problem arises. (In practice, problems
>> arise when using the terminal or applications that don't follow
>> Apple's encoding rules, because names in the wrong encoding could
>> end up on the file system.)
>>
>> (b) complex case: Convert the internal form to and from formD when
>> communicating with the VFS (and through it with higher levels like
>> Finder)
>>
>> In case of (b) we have two implementation choices:
>>
>> (1) stick to the rules and really do the conversion, in both
>> directions, verifying that whatever we get from the VFS is
>> actually in formD (it might not be, when using the terminal or 3rd
>> party applications not following Apple's encoding rules). In that
>> case, the setting of the normalization property doesn't matter,
>> because it controls how names are recorded *on* *disk*, and this
>> encoding would *never* be exposed to the VFS.
>>
>> (2) be lazy and essentially do (a), that is present the names to
>> VFS in the form mandated by the normalization property when
>> reading, i.e. pass-through, but still do a best effort to force
>> names received from the VFS into the form mandated by the
>> normalization property when writing.
>
> That should have read:
>
> (2) be lazy and essentially do (a) but require the user to set "formD"
> as value for the normalization property and then present the names to
> VFS in the form found on disk, but still do a best effort to force
> names received from the VFS into the form mandated by the
> normalization property when writing, in order not to taint a ZFS pool
> originating from some other system.
>
> Obviously (b.2) isn't a real option.
>
>
>
>> I hope this answers the question and sheds some light on the
>> problem of filename encoding.
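
A small sketch to make the explanation above concrete, using Python's standard unicodedata module. It only illustrates Unicode normalization in general, where NFD and NFC are the Unicode names for what the thread calls "formD" and "formC". It is not a description of what ZFS, the VFS layer, or Finder actually do, and the helper name same_name is made up for this example.

```python
import unicodedata

# "ö" as a single code point (U+00F6, decimal 246): the composed form.
composed = "\u00f6"
# "o" followed by COMBINING DIAERESIS (U+0308): the decomposed form.
decomposed = "o\u0308"

# Same visible character, different code points and different bytes.
assert composed != decomposed
assert composed.encode("utf-8") == b"\xc3\xb6"
assert decomposed.encode("utf-8") == b"o\xcc\x88"

# Normalizing converts between the two representations.
assert unicodedata.normalize("NFD", composed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == composed

# Modifier order: COMBINING DOT BELOW (U+0323) has a lower combining
# class than COMBINING DOT ABOVE (U+0307), so it sorts first in NFD.
assert unicodedata.combining("\u0323") < unicodedata.combining("\u0307")
regular = "o\u0323\u0307"    # dot below, then dot above
irregular = "o\u0307\u0323"  # dot above, then dot below
assert unicodedata.normalize("NFD", irregular) == regular


def same_name(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names after NFD normalization."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The two orderings differ byte for byte but compare equal once normalized.
assert regular != irregular
assert same_name(regular, irregular)
assert same_name(composed, decomposed)
```

Comparing raw byte sequences, as in case (a), would treat these names as different; comparing after normalization treats them as the same name.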
>>
>>
>> Best regards
>>
>> Björn
>
> - --
> | Bjoern Kahl +++ Siegburg +++ Germany |
> | "googlelogin@-my-domain-" +++ www.bjoern-kahl.de |
> | Languages: German, English, Ancient Latin (a bit :-)) |
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>
> iQCVAgUBUyeQtVsDv2ib9OLFAQK3MQP+JEhwtmjyAwikJ+KRMdcKOqWxy/Sf1jjG
> z2tVkM2BM2zkZAFV+iq3W3BwWHftESiKWRObzbLkvZjEhYUYxGfCbuTfD0f4V8Ng
> oV5vjOkoxNCi82QiCDQq04vUlCEpbp0QSojguixLpBKPM4OisPYdGqoNo510w8cx
> J9f+G88Iw10=
> =2Z9E
> -----END PGP SIGNATURE-----

--

---
You received this message because you are subscribed to the Google Groups "zfs-macos" group.
To unsubscribe from this group and stop receiving emails from it, send an email to zfs-macos+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.