-----BEGIN PGP SIGNED MESSAGE-----
Too late in the night, hit "send" too early :-(
On 18.03.14 00:31, Bjoern Kahl wrote:
> I apologize for this being a bit longer, but I tried to really
> clarify what normalization is all about and how it affects ZFS on
> On 17.03.14 20:56, Philip Robar wrote:
>> On Mon, Mar 17, 2014 at 3:40 PM, Dave Cottlehuber
>> <d...@jsonified.com> wrote:
>>> On 17 March 2014 at 19:17:23, Philip Robar
>>> (philip.ro...@gmail.com) wrote:
>>>> I admit to being one whose eyes glaze over when the
>>>> discussion turns to i18n/l10n. So why should I use formD
>>> Because (as you point out ;-) poorly written software won't
>>> iTunes is one of them, sadly.
>> OK, let me try again. I read a description of the various
>> normalization forms and despite my being a native speaker of
>> English I couldn't find any meaning in the words. (Something
>> unfortunately all too common when it comes to standards docs.)
>> So can you explain for the naive and mildly interested what
>> "formD" means?
> The two normalization forms "formD" and "formC" mandate how
> certain characters outside the standard ASCII range (A-Z, a-z, 0-9
> and a few punctuation characters ".,-;" and some others) are
> For example (note: the following is not fully technically correct,
> but illustrates the idea), the German letter "ö", named o_umlaut,
> could be represented as-is, that is as a single entity of Unicode
> code point number 246.
> However, the "ö" could also be seen as a plain "o" with two dots
> (in printed text and modern German handwriting since 1978) or two
> short downward lines (in some German handwriting scripts, for
> example the Sütterlin script or other Kurrent scripts and
> handwriting taught before 1978).
> Similarly, the "ö" can be encoded in Unicode by a two-character
> sequence, a plain "o" and a modifier '"' with the meaning "put two
> dots above the previous character" (note: '"' is not such a
> modifier, it serves here as a visualization of the actual
> Now, a text in normalization "formC" or "composed form" would have
> all characters that can be represented by a single entity encoded
> using this single entity.
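To make the two encodings of "ö" concrete, here is a small sketch in
Python using the standard unicodedata module. The variable names are
only chosen for this illustration; the real modifier for the two dots
is U+0308 (COMBINING DIAERESIS).

```python
import unicodedata

single = "\u00f6"      # "ö" as one code point, number 246
sequence = "o\u0308"   # plain "o" followed by COMBINING DIAERESIS

# Both display as "ö", but they are different code point sequences.
assert ord(single) == 246
assert len(sequence) == 2
assert single != sequence

# formC (called "NFC" in Python) folds the two-character sequence
# into the single code point wherever such a code point exists.
composed = unicodedata.normalize("NFC", sequence)
```

After this, composed holds the same single code point as single.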
> A text in "formD" or "decomposed form" normalization would have
> all characters that have some dots, accents, or other "additions"
> encoded using the plain base character followed by one or more
> A string is in normalized formD if the modifiers come in a defined
> order; for example, if a character has a dot above and a dot below,
> the modifier for "dot below" always comes first.
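The ordering rule can be checked with a short Python sketch, again
with unicodedata. It assumes the two modifiers meant here are U+0323
(COMBINING DOT BELOW) and U+0307 (COMBINING DOT ABOVE); the constant
and variable names are made up for the example.

```python
import unicodedata

DOT_BELOW = "\u0323"   # COMBINING DOT BELOW
DOT_ABOVE = "\u0307"   # COMBINING DOT ABOVE

# Every modifier carries a numeric "combining class"; formD sorts
# adjacent modifiers by ascending class, so "below" precedes "above".
class_below = unicodedata.combining(DOT_BELOW)
class_above = unicodedata.combining(DOT_ABOVE)

ordered = "o" + DOT_BELOW + DOT_ABOVE
swapped = "o" + DOT_ABOVE + DOT_BELOW

# Normalizing to formD ("NFD") moves the modifiers into the defined
# order and leaves an already ordered string unchanged.
reordered = unicodedata.normalize("NFD", swapped)
```

Here reordered ends up identical to ordered, although swapped was not.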
> A string is in irregular formD if all characters are decomposed but
> the modifiers do not come in the defined order. In the example of a
> dot below and above a character, having the modifier for "dot
> above" come before the modifier for "dot below" makes the string
> This whole mess is important, because it affects how sorting
> works. For example, two strings "o" + "dot_below" + "dot_above"
> and "o" + "dot_above" + "dot_below" should compare equal, because
> they carry the same information, despite the fact that they differ
> in their binary representation.
> Normalizing makes comparing and sorting easier.
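As a rough sketch of why this helps, the following Python snippet
compares and sorts strings only after bringing them into formD. The
helper names nfd and same_text are hypothetical, and the sort is a
plain code point sort, not the locale-aware ordering a real file
browser would use.

```python
import unicodedata

def nfd(text):
    """Return text in normalized formD."""
    return unicodedata.normalize("NFD", text)

def same_text(a, b):
    """Compare two strings by content, not by binary representation."""
    return nfd(a) == nfd(b)

below_first = "o\u0323\u0307"   # "o" + dot_below + dot_above
above_first = "o\u0307\u0323"   # "o" + dot_above + dot_below

# The same name in composed and decomposed spelling, plus another name.
names = ["z\u00fcrich", "zu\u0308rich", "zebra"]

# After normalization the two spellings collapse into one entry.
unique_sorted = sorted({nfd(name) for name in names})
```

A byte-wise comparison treats below_first and above_first as different,
while same_text reports them as equal.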
> Normalization and ZFS and OSX
> =============================
> Why should we care?
> Because Finder wants to sort directory listings, and for this needs
> to know how the byte sequence it gets from the VFS maps to
> written symbols and how these symbols are ordered.
> Finder expects text like filenames to be in formD.
> For file systems like ZFS this means they need to either
> (a) simple case: ignore encoding altogether and just deal with
> byte sequences. Since names are stored and returned as they arrive
> from the Finder & Co., no problem arises. (In practice, problems
> arise when using the terminal or applications that don't follow
> Apple's encoding rules, because names in the wrong encoding could
> end up on the file system.)
> (b) complex case: Convert the internal form to and from formD when
> communicating with the VFS (and through it with higher levels like
> In case of (b) we have two implementation choices:
> (1) stick to the rules and really do the conversion, in both
> directions, and verify that whatever we get from the VFS is
> actually in formD (it might not be, when using the terminal or 3rd
> party applications not following Apple's encoding rules). In that
> case, the setting of the normalization property doesn't matter,
> because it controls how names are recorded *on* *disk*, and this
> encoding would *never* be exposed to the VFS.
> (2) be lazy and essentially do (a), that is present the names to
> VFS in the form mandated by the normalization property when
> reading, i.e. pass-through, but still do a best effort to force
> names received from the VFS into the form mandated by normalization
> property when writing.
That should have read:
(2) be lazy and essentially do (a) but require the user to set "formD"
as the value of the normalization property and then present the names to
VFS in the form found on disk, but still do a best effort to force
names received from the VFS into the form mandated by normalization
property when writing, in order not to taint a ZFS pool originating
from some other system.
Obviously (b.2) isn't a real option.
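For comparison, option (b.1) could look roughly like the following
Python sketch. This is not actual ZFS code, only an illustration of the
idea as described above: the function names name_from_vfs and
name_to_vfs are made up, and the mapping assumes the property values
"formC", "formD", "formKC" and "formKD" correspond to the Unicode forms
NFC, NFD, NFKC and NFKD.

```python
import unicodedata

# Assumed mapping from normalization property values to Unicode forms.
FORMS = {"formC": "NFC", "formD": "NFD", "formKC": "NFKC", "formKD": "NFKD"}

def name_from_vfs(name, normalization="none"):
    """Convert a name received from the VFS into the on-disk form."""
    if normalization == "none":
        return name   # no form mandated: record the name as received
    return unicodedata.normalize(FORMS[normalization], name)

def name_to_vfs(name_on_disk):
    """Hand a name to the VFS in formD, whatever form is on disk."""
    return unicodedata.normalize("NFD", name_on_disk)

# A name arrives decomposed, is recorded composed, and is returned
# decomposed again, so the on-disk form is never exposed to the VFS.
on_disk = name_from_vfs("o\u0308", "formC")
returned = name_to_vfs(on_disk)
```

With this split, the upper layers always see formD, independent of the
value of the normalization property.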
> I hope this answers the question and sheds some light on the
> problem of filename encoding.
> Best regards
| Bjoern Kahl +++ Siegburg +++ Germany |
| "googlelogin@-my-domain-" +++ www.bjoern-kahl.de |
| Languages: German, English, Ancient Latin (a bit :-)) |
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
-----END PGP SIGNATURE-----