"Doug Ewell" <[email protected]> wrote:
|Philippe Verdy wrote:
|>>> Well, you do have eleven bits for flags per codepoint, for example.
|>>
|>> That's not UCS-4; that's a custom encoding.
|>>
|>> (any UCS-4 code unit) & 0xFFE00000 == 0
|
|(changing to "UTF-32" per Ken's observation)
|
|> Per definition yes, but UTC-4 is not Unicode.
|
|I guess it's not. What is UTC-4, anyway? Another name for a UWG meeting
|held in 1989?
|
|> As well (any UCS-4 code unit) & 0xFFE00000 == 0 (i.e. 21 bits) is not
|> Unicode, UTF-32 is Unicode (more restrictive than just 21 bits which
|> would allow 32 planes instead of just the 17 first ones).
|
|I used bitwise arithmetic strictly to address Steffen's premise that the
|11 "unused bits" in a UTF-32 code unit were available to store metadata
|about the code point. Of course UTF-32 does not allow 0x110000 through
|0x1FFFFF either.
|
|> I suppose he meant 21 bits, not 11 bits which covers only a small part
|> of the BMP.
|
|No, his comment "you do have eleven bits for flags per codepoint" pretty
|clearly referred to using the "extra" 11 bits beyond what is needed to
|hold the Unicode scalar value.
It surely is a weak argument for a general string encoding. But
sometimes, for local use cases, it surely is valid. You could, for
example, store the wcwidth(3) result plus a grapheme cluster code point
count in these bits of the first code point of a cluster, and then hide
that storage detail behind an access method interface.

--steffen
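
To make the idea concrete, here is a minimal sketch in C of such an
access method interface. The bit layout, the cell_t type and all
function names are assumptions invented for this illustration; they do
not come from any standard or library. As noted in the quoted text, the
packed value is a private in-memory format and no longer a valid UTF-32
code unit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout (an assumption for illustration only):
 *   bits  0..20  Unicode scalar value (at most 0x10FFFF)
 *   bits 21..22  column width, 0..3 (e.g. a clamped wcwidth(3) result)
 *   bits 23..31  number of code points in the grapheme cluster, 0..511
 */
#define CELL_CP_MASK     UINT32_C(0x001FFFFF)
#define CELL_WIDTH_SHIFT 21
#define CELL_WIDTH_MASK  UINT32_C(0x3)
#define CELL_COUNT_SHIFT 23
#define CELL_COUNT_MASK  UINT32_C(0x1FF)

typedef uint32_t cell_t;

/* Pack a code point and its metadata into one 32-bit cell. */
static cell_t cell_pack(uint32_t cp, unsigned width, unsigned count)
{
    assert(cp <= UINT32_C(0x10FFFF));
    assert(width <= CELL_WIDTH_MASK);
    assert(count <= CELL_COUNT_MASK);
    return cp
        | ((uint32_t)width << CELL_WIDTH_SHIFT)
        | ((uint32_t)count << CELL_COUNT_SHIFT);
}

/* Accessors: callers never touch the bit layout directly. */
static uint32_t cell_codepoint(cell_t c)
{
    return c & CELL_CP_MASK;
}

static unsigned cell_width(cell_t c)
{
    return (unsigned)((c >> CELL_WIDTH_SHIFT) & CELL_WIDTH_MASK);
}

static unsigned cell_count(cell_t c)
{
    return (unsigned)((c >> CELL_COUNT_SHIFT) & CELL_COUNT_MASK);
}
```

Anything that leaves the local data structure would have to go through
cell_codepoint() first, so that only the plain scalar value is ever
exchanged as UTF-32.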

