On Fri, Dec 19 2025 at 10:32:57 -06:00:00, Jacob Moody via Unicode <[email protected]> wrote:
I do wish the documents on migration[1] had explicitly explained that these new characters have ccc=0 conjoiners. It may be implied when they are discussed,
and maybe I'm still a bit too green on the details to put 2 and 2 together,
but it would have saved me some time.

No objection here despite the foregoing.

On the topic, I did find the suggested resolution of using the quickcheck value a bit strange; as far as I know, use of quickcheck was not strictly required for normalization prior to this update. Or well, my v15 implementation did not use it and passed all the normalization tests.

I haven’t gotten to implementing canonical composition yet, nor have I looked at any other implementation including yours, but AFAICT the QC properties aren’t required now either: looking at section 3.11, Normalization Forms, in Unicode 13, which predates this change, the recomposition algorithm that suggests itself is:

starter = 0  # sentinel not part of any compositions
starter index = uninitialized

index = 0
while index < length of string:
   composition = try to compose (starter, string[index])
   if succeeded:
       assert ccc[composition] = 0
       string[starter index] = composition
       starter = composition  # later lookups start from the new composite
       delete string[index]
   else:
       if ccc[string[index]] = 0:  # NB only this late
           starter = string[index]
           starter index = index
       index = index + 1

If you check conditions in this order, then the handling of starter+starter compositions falls out naturally. (Also note that the composition table only needs to contain pairs of an NFC-form starter and an NFD character, and there are possible optimizations connected to the fact that, if the next character after a successful composition is a nonstarter too, then the first character in the next lookup will be the result of this one.)
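For anyone who wants to poke at this, here is a rough Python rendering of the loop above. It is a sketch under assumptions, not a tested implementation: the names build_composition_table and recompose are made up for illustration, the pair table is derived from Python's unicodedata module rather than from hand-built tables, Hangul is left out entirely, and the input is assumed to be already in NFD. It also spells out the standard's blocking condition (an intervening kept character with ccc greater than or equal to the candidate's), which I am assuming belongs inside the "try to compose" step.

```python
import unicodedata


def build_composition_table():
    """Collect (first, second) -> composite pairs from canonical decompositions.

    Hypothetical helper: characters that NFC would not keep (singletons,
    composition exclusions, non-starter decompositions) are skipped.
    Hangul syllables are not covered by this table.
    """
    table = {}
    for cp in range(0x110000):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        if not decomp or decomp.startswith("<"):
            continue  # no decomposition, or a compatibility one
        parts = decomp.split()
        if len(parts) != 2:
            continue  # singleton decomposition
        if unicodedata.normalize("NFC", ch) != ch:
            continue  # excluded from composition
        first, second = (chr(int(p, 16)) for p in parts)
        table[first, second] = ch
    return table


def recompose(text, table):
    """Canonical composition of a string that is assumed to be in NFD."""
    out = list(text)
    starter_index = None  # no starter seen yet
    last_ccc = 0          # ccc of the most recently kept character
    index = 0
    while index < len(out):
        ch = out[index]
        ccc = unicodedata.combining(ch)
        composite = None
        if starter_index is not None:
            adjacent = index == starter_index + 1
            # In canonically ordered input the last kept mark has the
            # highest ccc of all marks kept since the starter.
            blocked = not adjacent and last_ccc >= ccc
            if not blocked:
                composite = table.get((out[starter_index], ch))
        if composite is not None:
            out[starter_index] = composite  # the starter is now the composite
            del out[index]
        else:
            if ccc == 0:  # only checked this late, as in the pseudocode
                starter_index = index
            last_ccc = ccc
            index += 1
    return "".join(out)


table = build_composition_table()
decomposed = unicodedata.normalize("NFD", "\u1ead")  # a + dot below + circumflex
print(recompose(decomposed, table) == "\u1ead")
```

Because the composition attempt comes before the ccc check, an adjacent starter+starter pair such as U+09C7 U+09BE is looked up like any other pair, with no separate branch.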

Trying to merge de- and recomposition into a single streaming process (e.g. with limits on the length of a composing character sequence to avoid worst-case linear memory consumption) will of course make things much more difficult.

I guess as an upside, I found that with these changes and the inclusion of quickcheck, Hangul no longer needed to be special-cased.

I don’t believe you ever actually *have* to special-case Hangul after you’ve generated your tables, it’s just that if you are trying to keep your table size down (as I am) then doing so will give you something like 2x savings.
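To make the trade-off concrete, here is a small Python sketch of both options, using the Hangul constants from the standard's conjoining jamo arithmetic. The function names compose_hangul and hangul_pair_table are hypothetical, and the sketch says nothing about how any particular implementation lays out its tables; it only shows that the same pairs can either be computed or be stored as ordinary table entries.

```python
# Constants from the standard's Hangul syllable composition arithmetic.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588
S_COUNT = L_COUNT * N_COUNT  # 11172


def compose_hangul(first, second):
    """Arithmetic pair composition: L+V -> LV, LV+T -> LVT, else None."""
    a, b = ord(first), ord(second)
    l_index, v_index = a - L_BASE, b - V_BASE
    if 0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT:
        return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT)
    s_index, t_index = a - S_BASE, b - T_BASE
    if (0 <= s_index < S_COUNT and s_index % T_COUNT == 0
            and 0 < t_index < T_COUNT):
        return chr(a + t_index)
    return None


def hangul_pair_table():
    """The same pairs as plain table entries, with no special case at lookup."""
    table = {}
    for l in range(L_COUNT):
        for v in range(V_COUNT):
            lv = chr(S_BASE + (l * V_COUNT + v) * T_COUNT)
            table[chr(L_BASE + l), chr(V_BASE + v)] = lv
            for t in range(1, T_COUNT):
                table[lv, chr(T_BASE + t)] = chr(ord(lv) + t)
    return table


print(len(hangul_pair_table()))
```

The generated table holds one pair per Hangul syllable (399 L+V pairs plus 399 × 27 LV+T pairs), which is the space the arithmetic version avoids spending.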

--
HTH,
Alex


