 I apologize for this being a bit long, but I tried to really clarify
 what normalization is all about and how it affects ZFS on OSX.

On 17.03.14 20:56, Philip Robar wrote:
> On Mon, Mar 17, 2014 at 3:40 PM, Dave Cottlehuber
> <d...@jsonified.com> wrote:
>> On 17 March 2014 at 19:17:23, Philip Robar
>> (philip.ro...@gmail.com) wrote:
>>> I admit to being one whose eyes glaze over when the discussion
>>> turns to i18n/l10n. So why should I use formD normalization?
>> Because (as you point out ;-) poorly written software won't
>> work.
>> iTunes is one of them, sadly.
> OK, let me try again. I read a description of the various
> normalization forms and despite my being a native speaker of
> English I couldn't find any meaning in the words. (Something,
> unfortunately all too common when it comes to standards docs.) So
> can you explain for the naive and mildly interested what "formD"
> means?

 The two normalization forms "formD" and "formC" mandate how certain
 characters outside the standard ASCII range (A-Z, a-z, 0-9, a few
 punctuation characters such as ".,-;", and some others) are represented.

 For example (note: the following is not fully technically correct, but
 illustrates the idea), the German letter "ö", named o_umlaut, could
 be represented as-is, that is, as a single entity with Unicode code
 point number 246.

 However, the "ö" could also be seen as a plain "o" with two dots (in
 printed text and modern German handwriting since 1978) or two short
 downward lines (in some German handwriting scripts, for example the
 Sütterlin script or other Kurrent scripts and handwriting taught
 before 1978).

 Similarly, the "ö" can be encoded in Unicode as a two-character
 sequence, a plain "o" and a modifier '"' with the meaning "put two
 dots above the previous character" (note: '"' is not such a modifier,
 it serves here as a visualization of the actual modifier).

 Now, a text in normalization "formC" or "composed form" would have all
 characters that can be represented by a single entity encoded as
 that single character.

 A text in "formD" or "decomposed form" normalization would have all
 characters that have some dots, accents, or other "additions" encoded
 using the plain base character followed by one or more modifiers.
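
 To make the two representations concrete, here is a minimal sketch
 using Python's standard unicodedata module. It assumes that "formC"
 and "formD" correspond to the Unicode forms NFC and NFD; the variable
 names are only for illustration.

```python
import unicodedata

composed = "\u00f6"     # "ö" as one code point (246)
decomposed = "o\u0308"  # plain "o" followed by COMBINING DIAERESIS

# Both render as "ö", but they are different code point sequences ...
print(composed == decomposed)  # False

# ... and therefore different byte sequences in UTF-8.
print(composed.encode("utf-8"))    # b'\xc3\xb6'
print(decomposed.encode("utf-8"))  # b'o\xcc\x88'

# Normalizing converts between the two representations.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed)    # True
print(nfd == decomposed)  # True
```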

 A text is in normalized formD if the modifiers come in a defined
 order. For example, if a character has a dot above and a dot below,
 the modifier for "dot below" always comes first.

 A text is in irregular formD if all characters are decomposed, but the
 modifiers do not come in the defined order. In the example of a
 character with a dot below and a dot above, putting the modifier for
 "dot above" before the modifier for "dot below" makes the string
 irregular.

 This whole mess is important because it affects how comparing and
 sorting work. For example, the two strings "o" + "dot_below" +
 "dot_above" and "o" + "dot_above" + "dot_below" should compare equal,
 because they carry the same information, despite the fact that they
 differ in their binary representation.

 Normalizing makes comparing and sorting easier.
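
 The dot_below / dot_above example can be tried out directly. The
 following is a small sketch in Python, again assuming that formD
 means Unicode NFD; the helper nfd() and the constant names are made
 up for this illustration.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW

regular = "o" + DOT_BELOW + DOT_ABOVE    # modifiers in the defined order
irregular = "o" + DOT_ABOVE + DOT_BELOW  # same modifiers, swapped

def nfd(text):
    """Return text in normalized decomposed form."""
    return unicodedata.normalize("NFD", text)

# A plain comparison sees two different strings.
print(regular == irregular)  # False

# After normalization the modifiers are reordered and both compare equal.
print(nfd(regular) == nfd(irregular))  # True

# The defined order follows the combining class: lower values come first.
print(unicodedata.combining(DOT_BELOW), unicodedata.combining(DOT_ABOVE))
```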

 Normalization and ZFS and OSX

 Why should we care?

 Because Finder wants to sort directory listings, and for this it needs
 to know how the byte sequence it gets from the VFS maps to written
 characters and how these characters order.

 Finder expects text like filenames to be in formD.

 For file systems like ZFS, this means they need to do one of two things:

 (a) simple case: ignore encoding altogether and just deal with byte
 sequences.  Since names are stored and returned as they arrive from
 the Finder & Co., no problem arises.  (In practice, problems do arise
 when using the terminal or applications that don't follow Apple's
 encoding rules, because names in the wrong encoding could end up on
 the file system.)

 (b) complex case: Convert the internal form to and from formD when
 communicating with the VFS (and through it with higher levels like

 In case of (b) we have two implementation choices:

 (1) stick to the rules and really do the conversion in both
 directions, verifying that whatever we get from the VFS is
 actually in formD (it might not be, when using the terminal or 3rd
 party applications not following Apple's encoding rules).  In that
 case, the setting of the normalization property doesn't matter,
 because it controls how names are recorded *on* *disk*, and this
 encoding would *never* be exposed to the VFS.

 (2) be lazy and essentially do (a), that is, present the names to the
 VFS in the form mandated by the normalization property when reading
 (i.e. pass-through), but still make a best effort to force names
 received from the VFS into the form mandated by the normalization
 property when writing.
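
 As a rough illustration of the difference between these choices, here
 is a toy model in Python. It is purely hypothetical and has nothing to
 do with the actual ZFS code: ToyDirectory, its methods and the
 normalization argument are invented for this sketch, with None
 standing for case (a) and "NFC"/"NFD" for a normalizing file system.

```python
import unicodedata

class ToyDirectory:
    """Hypothetical in-memory directory, only to illustrate the cases."""

    def __init__(self, normalization=None):
        self.normalization = normalization  # None, "NFC" or "NFD"
        self.entries = {}

    def _on_disk(self, name):
        # Case (a): store names untouched. Otherwise force the on-disk form.
        if self.normalization is None:
            return name
        return unicodedata.normalize(self.normalization, name)

    def create(self, name, data):
        self.entries[self._on_disk(name)] = data

    def lookup(self, name):
        return self.entries.get(self._on_disk(name))

    def listing_passthrough(self):
        # Choice (2): hand out names exactly as recorded on disk.
        return sorted(self.entries)

    def listing_converted(self):
        # Choice (1): always hand out formD, whatever is on disk.
        return sorted(unicodedata.normalize("NFD", n) for n in self.entries)

composed_name = "h\u00f6ren.txt"     # "ö" as a single code point
decomposed_name = "ho\u0308ren.txt"  # "o" plus combining diaeresis

raw = ToyDirectory()                 # case (a)
raw.create(composed_name, "data")
print(raw.lookup(decomposed_name))   # None: the byte sequences differ

normalizing = ToyDirectory("NFC")    # on-disk form is formC
normalizing.create(decomposed_name, "data")
print(normalizing.lookup(composed_name))                        # data
print(normalizing.listing_passthrough() == [composed_name])     # True
print(normalizing.listing_converted() == [decomposed_name])     # True
```

 In this toy model, a caller that expects formD only gets what it
 expects from listing_converted(); with listing_passthrough() it sees
 whatever form the normalization setting put on disk.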

 I hope this answers the question and sheds some light on the problem
 of filename encoding.

 Best regards


-- 
|     Bjoern Kahl   +++   Siegburg   +++    Germany     |
| "googlelogin@-my-domain-"   +++   www.bjoern-kahl.de  |
| Languages: German, English, Ancient Latin (a bit :-)) |