On 06/12/2003 09:49, Doug Ewell wrote:
...
But as C10 does not mandate any normalized form (just canonical
equivalence of the results), I don't think that it requires that a
compressor should produce its result in either NFC or NFD form.
Right. I know that. But Mark and Ken said it should, and so I'm trying
to find out if this *SHOULD* be in C10, or alternatively how I would
expect to find out about this extra recommendation if I weren't on the
Unicode mailing list.
My feeling is that it should NOT be in C10. The conformance rules do not
mention normalisation, which is a convenience and a recommendation but
not a conformance issue - that is, apart from C14-C16 which do not
specify that normalisation should be done, only how it must be done if
it is done. I would also argue that adding such a rule to C10 would
conflict with C9 or at least with the principle underlying it, that for
conformance purposes all canonically equivalent forms are equal and
indistinguishable.
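
As a small illustration of the C9 principle, here is a minimal Python sketch of a canonical-equivalence test. The helper name canonically_equivalent is my own invention for this example and is not defined by the standard; it simply compares the NFD forms produced by Python's standard unicodedata module.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Return True if a and b have the same canonical decomposition (NFD).

    Hypothetical helper for illustration: two strings are canonically
    equivalent exactly when their NFD forms are identical.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# The code point sequences differ, yet for conformance purposes (C9)
# a process need not treat the two spellings differently.
same_code_points = composed == decomposed
same_text = canonically_equivalent(composed, decomposed)
```

A process that compares strings this way never needs to know which form its peer emitted.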
Note that this is a devil's advocate argument. I'm not necessarily
disagreeing with Mark and Ken's recommendations. I'm just trying to
reconcile the differences between what they say (NFC or NFD only) and
what C10 says (any canonically equivalent sequence).
I would agree with these recommendations as long as they remain
recommendations rather than conformance requirements, and are perhaps
added to section 5.6 of TUS (4.0), which already includes the sentence
(in the context of canonically equivalent alternative spellings):
Implementations that are “liberal” in what they accept, but
“conservative” in what they issue, will have the fewest compatibility
problems.
...
With this view, normalization of strings should not be done on output
from a process but on its input.
Yes. Correct. To paraphrase Peter Kirk, according to C9, if my Unicode
text is going through Process A (which outputs NFD) and on to Process B
(which wants NFC input), A does not have to convert the text to NFC just
to appease B. Instead, it is up to B to do the work of converting to
NFC.
A more accurate paraphrase would be that process A may output any
canonically equivalent form, maybe one which is not a normalisation form
at all e.g. because it has composed composition exclusions.
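
To make the composition-exclusion case concrete, here is a rough Python sketch. U+0958 DEVANAGARI LETTER QA is a composition exclusion, so a string containing it is in neither NFC nor NFD, although it is canonically equivalent to KA + NUKTA. The function process_b is a hypothetical stand-in for "Process B" in this thread, assumed here to want NFC.

```python
import unicodedata

qa_precomposed = "\u0958"    # DEVANAGARI LETTER QA (a composition exclusion)
qa_sequence = "\u0915\u093c"  # DEVANAGARI LETTER KA + DEVANAGARI SIGN NUKTA


def is_in_form(form: str, text: str) -> bool:
    """True if text is already in the given normalisation form."""
    return unicodedata.normalize(form, text) == text


def process_b(text: str) -> str:
    """Hypothetical Process B: it wants NFC, so it normalises its own
    input instead of relying on whatever form the sender chose."""
    return unicodedata.normalize("NFC", text)


# What "Process A" emitted is a legal, canonically equivalent spelling
# that is nevertheless in no normalisation form at all.
in_nfc = is_in_form("NFC", qa_precomposed)
in_nfd = is_in_form("NFD", qa_precomposed)

# B ends up with the same text whichever spelling A chose.
b_sees_same_text = process_b(qa_precomposed) == process_b(qa_sequence)
```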
So, in a compressor/decompressor environment, the compressor is the one
that has to do any normalization work. The decompressor just
decompresses. This is consistent with what I wrote a few days ago about
wanting this normalized-compression bit to work with existing
decompressors.
For "any normalization work" I would prefer to say "any canonically
equivalent transformation work". The compressor performs whatever
canonically equivalent transformations it chooses, perhaps because they
produce the best compression. The decompressor is only obliged to
decompress; C9 implies that it is not obliged to normalise and that no
other process can rely on it doing so - although it is free to do so and
may choose to do so if that is expected to improve overall efficiency.
Mark and Ken's recommendation is indeed that it should do so, and I have
no quarrel with that.
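
The division of labour might be sketched as follows. This is only an illustration under loud assumptions: zlib over UTF-8 stands in for a real Unicode compression scheme, and the names compress and decompress and the form parameter are invented for this example.

```python
import unicodedata
import zlib
from typing import Optional


def compress(text: str, form: Optional[str] = "NFC") -> bytes:
    """Hypothetical compressor. It may apply any canonically equivalent
    transformation before compressing; here that is an optional
    normalisation, chosen by the compressor alone."""
    if form is not None:
        text = unicodedata.normalize(form, text)
    return zlib.compress(text.encode("utf-8"))


def decompress(data: bytes) -> str:
    """Hypothetical decompressor. It only decompresses and does not
    normalise, so callers cannot rely on any particular form."""
    return zlib.decompress(data).decode("utf-8")


original = "re\u0301sume\u0301"  # "résumé" spelled with combining accents
restored = decompress(compress(original))

# The round trip need not be code-point identical, only canonically
# equivalent, which is all that C10 asks for.
identical = restored == original
equivalent = (unicodedata.normalize("NFD", restored)
              == unicodedata.normalize("NFD", original))
```

Because the transformation lives entirely in the compressor, an existing decompressor keeps working unchanged.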
...
Subprocesses within a closed system may be able to make certain
assumptions for efficiency. Process B, for example, may know that its
only source of input is Process A, which is guaranteed always to produce
NFC. ...
Does C9 actually allow this? Well, perhaps within a closed system, but
then standardisation, and so Unicode, is irrelevant to data transfer
between sub-processes within a closed system. Outside a closed system,
Process B's best assumption for efficiency may be that Process A has
*probably* normalised, and so it is worth performing first a quick check
before a full normalisation.
... For any other situation, the algorithm described in Annex 8 of UAX
#15 should be employed (not re-invented).
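
The "check first, normalise only if needed" idea can be sketched in Python as below. The function name to_nfc is my own; unicodedata.is_normalized is a standard-library call available from Python 3.8, and I am assuming, not asserting, that it is cheaper than a full normalisation for input that is already in NFC.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Hypothetical input filter for a process that wants NFC.

    It tests first, on the assumption that the sender has probably
    normalised already, and pays for a full normalisation only when
    the test says the text is not in NFC.
    """
    if unicodedata.is_normalized("NFC", text):
        return text  # common case: nothing to do
    return unicodedata.normalize("NFC", text)


already_nfc = to_nfc("r\u00e9sum\u00e9")      # returned unchanged
needed_work = to_nfc("re\u0301sume\u0301")    # recomposed to NFC
```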
--
Peter Kirk
http://www.qaya.org/