Why do I get the feeling Apple made everything worse by not sticking
with either UTF-16 or UTF-8 encodings and POSIX collation for Finder
etc.?

On Mon, Mar 17, 2014 at 8:18 PM, Bjoern Kahl <googlelo...@bjoern-kahl.de> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>
>  Too late in the night, hit "send" too early :-(
>
> On 18.03.14 00:31, Bjoern Kahl wrote:
>>
>> I apologize for this being a bit longer, but I tried to really
>> clarify what normalization is all about and how it affects ZFS on
>> OSX.
>>
>> On 17.03.14 20:56, Philip Robar wrote:
>>> On Mon, Mar 17, 2014 at 3:40 PM, Dave Cottlehuber
>>> <d...@jsonified.com> wrote:
>>
>>>> On 17 March 2014 at 19:17:23, Philip Robar
>>>> (philip.ro...@gmail.com) wrote:
>>>>> I admit to being one whose eyes glaze over when the
>>>>> discussion turns to i18n/l10n. So why should I use formD
>>>>> normalization?
>>>>
>>>> Because (as you point out ;-) poorly written software won't
>>>> work.
>>>>
>>>> iTunes is one of them, sadly.
>>>>
>>
>>> OK, let me try again. I read a description of the various
>>> normalization forms and despite my being a native speaker of
>>> English I couldn't find any meaning in the words. (Something,
>>> unfortunately all too common when it comes to standards docs.)
>>> So can you explain for the naive and mildly interested what
>>> "formD" means?
>>
>> The two normalization forms "formD" and "formC" mandate how
>> certain characters outside the standard ASCII range (A-Z, a-z, 0-9
>> and a few punctuation characters ".,-;" and some others) are
>> represented.
>>
>>
>> For example (note: the following is not fully technically correct,
>> but illustrates the idea), the German letter "ö", named o_umlaut,
>> could be represented as-is, that is as a single entity of Unicode
>> code point number 246.
>>
>> However, the "ö" could also be seen as a plain "o" with two dots
>> (in printed text and modern German handwriting since 1978) or two
>> short downward lines (in some German handwriting scripts, for
>> example the Sütterlin script or other Kurrent scripts and
>> handwriting taught before 1978).
>>
>> Similarly, the "ö" can be encoded in Unicode by a two-character
>> sequence, a plain "o" and a modifier '"' with the meaning "put two
>> dots above the previous character" (note: '"' is not such a
>> modifier, it serves here as a visualization of the actual
>> modifier).
>>
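Inline illustration of the two representations just described, as a minimal Python sketch. It assumes only the standard library's unicodedata module; the variable names are made up for illustration, and U+0308 COMBINING DIAERESIS is the real modifier that the '"' above stands in for.

```python
import unicodedata

composed = "\u00f6"      # "ö" as a single code point (decimal 246)
decomposed = "o\u0308"   # plain "o" followed by COMBINING DIAERESIS

print(ord(composed))                               # 246
print([unicodedata.name(c) for c in decomposed])   # names of the two code points
print(composed == decomposed)                      # False: the code points differ
print(composed.encode("utf-8"))                    # b'\xc3\xb6'
print(decomposed.encode("utf-8"))                  # b'o\xcc\x88'
```

Both strings render as the same letter, yet they differ in length and in their UTF-8 bytes, which is what a byte-oriented file system sees.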
>>
>> Now, a text in normalization "formC" or "composed form" would have
>> all characters that can be represented by a single entity encoded
>> using this single character.
>>
>> A text in "formD" or "decomposed form" normalization would have
>> all characters that have some dots, accents, or other "additions"
>> encoded using the plain base character followed by one or more
>> modifiers.
>>
>> It is normalized formD if the modifiers come in a defined order;
>> for example, if a character has a dot above and below, the modifier
>> for "dot below" always comes first.
>>
>> It is in irregular formD if all characters are decomposed but
>> the modifiers do not come in the defined order. In the example of a
>> dot below and above a character, having the modifier for "dot
>> above" come before the modifier for "dot below" makes the string
>> irregular.
>>
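Inline illustration of formC, formD and the modifier ordering, again as a hedged Python sketch. It assumes that "formC" and "formD" correspond to what Python's unicodedata.normalize calls "NFC" and "NFD"; the constant and variable names are hypothetical.

```python
import unicodedata

DOT_ABOVE = "\u0307"   # COMBINING DOT ABOVE
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW

# Composed <-> decomposed for the "ö" example.
print(unicodedata.normalize("NFD", "\u00f6") == "o\u0308")   # True
print(unicodedata.normalize("NFC", "o\u0308") == "\u00f6")   # True

# Decomposed, but with the modifiers in the "wrong" order.
irregular = "o" + DOT_ABOVE + DOT_BELOW
regular = unicodedata.normalize("NFD", irregular)

# The defined order follows each modifier's combining class (lower first).
print(unicodedata.combining(DOT_BELOW), unicodedata.combining(DOT_ABOVE))
print([hex(ord(c)) for c in regular])   # "dot below" now precedes "dot above"
```

Normalizing to NFD both decomposes characters and reorders the modifiers, so the irregular string comes out with "dot below" first.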
>>
>> This whole mess is important, because it affects how sorting
>> works. For example, two strings "o" + "dot_below" + "dot_above"
>> and "o" + "dot_above" + "dot_below" should compare equal, because
>> they carry the same information, despite the fact that they differ
>> in their binary representation.
>>
>> Normalizing makes comparing and sorting easier.
>>
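Inline illustration of the comparison point, as a small Python sketch. The helper name nfd and the sample names are hypothetical; sorting by code point after normalizing is only a stand-in here and is not the language-aware collation a file browser would actually use.

```python
import unicodedata

def nfd(s: str) -> str:
    """Return s in decomposed form with modifiers in the defined order."""
    return unicodedata.normalize("NFD", s)

a = "o\u0323\u0307"   # o + dot below + dot above
b = "o\u0307\u0323"   # o + dot above + dot below

print(a == b)             # False: the raw sequences differ
print(nfd(a) == nfd(b))   # True: equal once both are normalized

# Two spellings of the same name collapse into one after normalizing.
names = ["B\u00f6rse", "Bo\u0308rse", "Boot"]
unique = {nfd(n) for n in names}
print(len(unique))        # 2

# Sorting on the normalized key keeps equivalent names next to each other.
print(sorted(names, key=nfd))
```

Comparing or sorting on the normalized key, rather than on the raw strings, is what makes equivalent names behave as one name.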
>>
>> Normalization and ZFS and OSX
>> =============================
>>
>>
>> Why should we care?
>>
>> Because Finder wants to sort directory listings, and for this needs
>> to know how the byte sequence it gets from the VFS maps to
>> written symbols (characters) and how these symbols order.
>>
>> Finder expects text like filenames to be in formD.
>>
>> For file systems like ZFS this means they need to either:
>>
>> (a) simple case: ignore encoding altogether and just deal with
>> byte sequences.  Since names are stored and returned as they arrive
>> from the Finder & Co., no problem arises.  (In practice, problems
>> arise when using the terminal or applications that don't follow
>> Apple's encoding rules, because names in the wrong encoding could
>> end up on the file system.)
>>
>> (b) complex case: Convert the internal form to and from formD when
>> communicating with the VFS (and through it with higher levels like
>> Finder)
>>
>> In case of (b) we have two implementation choices:
>>
>> (1) stick to the rules and really do the conversion, in both
>> directions, verifying that whatever we get from the VFS is
>> actually in formD (it might not be, when using the terminal or 3rd
>> party applications not following Apple's encoding rules).  In that
>> case, the setting of the normalization property doesn't matter,
>> because it controls how names are recorded *on* *disk*, and this
>> encoding would *never* be exposed to the VFS.
>>
>> (2) be lazy and essentially do (a), that is present the names to
>> VFS in the form mandated by the normalization property when
>> reading, i.e. pass-through, but still do a best effort to force
>> names received from the VFS into the form mandated by normalization
>> property when writing.
>
>  That should have read:
>
>  (2) be lazy and essentially do (a) but require the user to set "formD"
>  as value for the normalization property and then present the names to
>  VFS in the form found on disk, but still do a best effort to force
>  names received from the VFS into the form mandated by normalization
>  property when writing, in order not to taint a ZFS pool originating
>  from some other system.
>
>  Obviously (b.2) isn't a real option.
>
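Inline illustration of option (b.1), as a toy Python sketch. ToyDirectory and its methods are entirely hypothetical and have nothing to do with the real ZFS or VFS code; the sketch assumes that "formD" corresponds to Python's "NFD" and that the on-disk form is whatever the normalization property says.

```python
import unicodedata

class ToyDirectory:
    """Hypothetical in-memory stand-in for a directory; not ZFS code."""

    def __init__(self, on_disk_form: str = "NFC"):
        # Plays the role of the normalization property: the form used "on disk".
        self.on_disk_form = on_disk_form
        self._entries = {}

    def _to_disk(self, name: str) -> str:
        # Names arriving from above may or may not be in formD; convert anyway.
        return unicodedata.normalize(self.on_disk_form, name)

    def create(self, name: str, data) -> None:
        self._entries[self._to_disk(name)] = data

    def lookup(self, name: str):
        return self._entries[self._to_disk(name)]

    def listing(self) -> list:
        # Whatever is stored, names handed back upward are always in formD.
        return [unicodedata.normalize("NFD", key) for key in self._entries]

d = ToyDirectory(on_disk_form="NFC")
d.create("o\u0308l.txt", "contents")        # created with a decomposed name
print(d.lookup("\u00f6l.txt"))              # found via the composed spelling
print(d.listing() == ["o\u0308l.txt"])      # True: listing is in formD
```

Because every name is converted on the way in and on the way out, the caller never sees the on-disk form, which matches the remark that the property setting does not matter in case (b.1).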
>
>
>> I hope this answers the question and sheds some light on the
>> problem of filename encoding.
>>
>>
>> Best regards
>>
>> Björn
>
> - --
> |     Bjoern Kahl   +++   Siegburg   +++    Germany     |
> | "googlelogin@-my-domain-"   +++   www.bjoern-kahl.de  |
> | Languages: German, English, Ancient Latin (a bit :-)) |
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>
> iQCVAgUBUyeQtVsDv2ib9OLFAQK3MQP+JEhwtmjyAwikJ+KRMdcKOqWxy/Sf1jjG
> z2tVkM2BM2zkZAFV+iq3W3BwWHftESiKWRObzbLkvZjEhYUYxGfCbuTfD0f4V8Ng
> oV5vjOkoxNCi82QiCDQq04vUlCEpbp0QSojguixLpBKPM4OisPYdGqoNo510w8cx
> J9f+G88Iw10=
> =2Z9E
> -----END PGP SIGNATURE-----
>
> --
>
> ---
> You received this message because you are subscribed to the Google Groups 
> "zfs-macos" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to zfs-macos+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
