Re: Combining Characters

Alex Shpilkin via Unicode Fri, 19 Dec 2025 13:08:31 -0800

On Fri, Dec 19 2025 at 10:32:57 -06:00:00, Jacob Moody via Unicode wrote:

I do wish the documents on migration[1] had explicitly explained that these new characters have ccc=0 conjoiners. It may imply it when discussing them,
and maybe I'm still a bit green on the details to put 2 and 2 together,
but it would have saved me some time.


No objection here despite the foregoing.

On the topic, I did find the suggested resolution of using the quickcheck value a bit strange; as far as I know, use of quickcheck was not strictly required for normalization prior to this update. Or well, my v15 implementation did not use it and passed all the normalization tests.

I haven’t gotten to implementing canonical composition yet, nor have I looked at any other implementation including yours, but AFAICT the QC properties aren’t required now either: looking at section 3.11, Normalization Forms, in Unicode 13, which predates this change, the recomposition algorithm that suggests itself is:


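starter = 0  # sentinel not part of any compositions
starter index = uninitialized
last ccc = 0  # ccc of the last uncomposed character after the starter

index = 0
while index < length of string:
   # a character is blocked if an uncomposed nonstarter with ccc >= its own
   # sits between it and the starter
   blocked = last ccc != 0 and last ccc >= ccc[string[index]]
   composition = try to compose (starter, string[index]) unless blocked
   if succeeded:
       assert ccc[composition] == 0
       string[starter index] = composition
       starter = composition  # the next lookup starts from the result
       delete string[index]
   else:
       if ccc[string[index]] == 0:  # NB only this late
           starter = string[index]
           starter index = index
       last ccc = ccc[string[index]]
       index = index + 1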

If you check conditions in this order, then the handling of starter+starter compositions falls out naturally. (Also note that the composition table only needs to contain pairs of an NFC-form starter and an NFD character, and there are possible optimizations connected to the fact that, if the next character after a successful composition is a nonstarter too, then the first character in the next lookup will be the result of this one.)
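
For anyone who wants to experiment, here is a minimal Python sketch of the loop described above. It is an illustration under stated assumptions, not anyone's actual implementation: the names build_composition_table, COMPOSE and compose are made up for this example, the pair table is derived from Python's unicodedata module rather than from the UCD files, composition exclusions are filtered with an NFC round-trip shortcut, and Hangul pairs are simply added to the table by arithmetic. It also spells out the blocking rule from the standard's definition of canonical composition (a character cannot combine with the starter if an uncomposed character with an equal or higher ccc sits in between).

```python
import sys
import unicodedata


def build_composition_table():
    """Map (first, second) pairs to their primary composite (hypothetical helper)."""
    table = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        mapping = unicodedata.decomposition(ch)
        if not mapping or mapping.startswith("<"):
            continue  # no mapping, or a compatibility mapping
        parts = [chr(int(h, 16)) for h in mapping.split()]
        # Assumption: keeping only characters that NFC leaves alone drops
        # composition exclusions without parsing the exclusion data file.
        if len(parts) == 2 and unicodedata.normalize("NFC", ch) == ch:
            table[parts[0], parts[1]] = ch
    # Hangul syllable decompositions are algorithmic, so add those pairs
    # by arithmetic: L+V -> LV, then LV+T -> LVT.
    for l in range(19):
        for v in range(21):
            lv = 0xAC00 + (l * 21 + v) * 28
            table[chr(0x1100 + l), chr(0x1161 + v)] = chr(lv)
            for t in range(1, 28):
                table[chr(lv), chr(0x11A7 + t)] = chr(lv + t)
    return table


COMPOSE = build_composition_table()


def compose(nfd):
    """Canonically compose a string that is already in NFD."""
    out = list(nfd)
    starter_index = None  # no starter seen yet
    last_ccc = 0          # ccc of the last uncomposed character after the starter
    index = 0
    while index < len(out):
        ch = out[index]
        ccc = unicodedata.combining(ch)
        # Blocked if an uncomposed nonstarter with ccc >= ours intervenes.
        blocked = last_ccc != 0 and last_ccc >= ccc
        composite = None
        if starter_index is not None and not blocked:
            composite = COMPOSE.get((out[starter_index], ch))
        if composite is not None:
            out[starter_index] = composite  # next lookup uses the result
            del out[index]
        else:
            if ccc == 0:  # only checked after the lookup failed
                starter_index = index
            last_ccc = ccc
            index += 1
    return "".join(out)
```

A quick way to sanity-check the sketch is to compare compose(unicodedata.normalize("NFD", s)) against unicodedata.normalize("NFC", s) for a handful of strings, including ones with blocked marks and Hangul jamo.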

Trying to merge de- and recomposition into a single streaming process (e.g. with limits on the length of a composing character sequence to avoid worst-case linear memory consumption) will of course make things much more difficult.

I guess as an upside I found that with these changes and the inclusion of quickcheck, Hangul no longer needed to be special-cased.

I don’t believe you ever actually *have* to special-case Hangul after you’ve generated your tables; it’s just that if you are trying to keep your table size down (as I am), then doing so will give you something like 2x savings.
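
To make the trade-off concrete, here is a small Python sketch of the arithmetic special case, using the Hangul constants from the standard's conjoining jamo section. The function name compose_hangul and the constant TABLE_ENTRIES_REPLACED are made up for this illustration, and how much this saves in practice depends entirely on how a given implementation stores its tables.

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose_hangul(first, second):
    """Return the composed syllable for a pair of code points, or None."""
    l, v = first - L_BASE, second - V_BASE
    if 0 <= l < L_COUNT and 0 <= v < V_COUNT:
        return S_BASE + (l * V_COUNT + v) * T_COUNT  # L + V -> LV
    s, t = first - S_BASE, second - T_BASE
    is_lv = 0 <= s < L_COUNT * V_COUNT * T_COUNT and s % T_COUNT == 0
    if is_lv and 0 < t < T_COUNT:
        return first + t  # LV + T -> LVT
    return None


# Pairs a table-only approach would have to store instead:
# 399 L+V pairs plus 399 * 27 LV+T pairs, i.e. one per syllable.
TABLE_ENTRIES_REPLACED = L_COUNT * V_COUNT * T_COUNT  # 11172
```

In a composition loop, a function like this would be tried before (or instead of) the table lookup whenever the starter is a leading jamo or an LV syllable.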


--
HTH,
Alex
