I don't think there is any particular value to that restructuring, from what
I can make of your email.

Note also, with regard to your message about 'real' weights, that there is
no requirement that implementations preserve the DUCET values, as long as
the ordering is the same. In particular, CLDR and many implementations use
the 'fractional' UCA weights, which are derived from the DUCET values, but
express weights using a variable number of bytes. These are similar to your
'rationals' but are really decimal values chunked into bytes, with some extra
features to allow interleaving and avoid overlap.
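
To illustrate the point about ordering, here is a minimal Python sketch. The
weights are made-up illustrative values, not DUCET data, and `renumber` is a
hypothetical helper. It is not the CLDR fractional-weight algorithm; it only
shows that any strictly increasing renumbering of the weights yields the same
sort order.

```python
def renumber(keys):
    """Replace every weight by its rank among the distinct weights in use.

    The mapping is strictly increasing, so comparing the new keys gives
    the same result as comparing the old ones.
    """
    distinct = sorted({w for key in keys.values() for w in key})
    rank = {w: i + 1 for i, w in enumerate(distinct)}
    return {s: tuple(rank[w] for w in key) for s, key in keys.items()}


# Hypothetical primary-weight sequences (illustrative values only).
keys = {
    "ab": (0x2000, 0x2100),
    "a": (0x2000,),
    "b": (0x2100,),
    "\u4e00": (0xFB40, 0xCE00),
}
compact = renumber(keys)

# Sorting by the original keys and by the renumbered keys gives one order.
original_order = sorted(keys, key=keys.get)
compact_order = sorted(compact, key=compact.get)
```

The renumbered weights are much smaller than the originals, yet both lists
come out in the same order, which is all that an implementation has to
preserve.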

http://unicode.org/Public/UCA/6.0.0/CollationAuxiliary.html

Mark
*— Il meglio è l’inimico del bene —*


On Sun, Sep 11, 2011 at 01:06, Philippe Verdy <[email protected]> wrote:

> I think that the UCA forgets to specify which are the valid primary weights
> inferred from the default rules used in the current DUCET.
>
> # Derived weight ranges:  FB40..FBFF
> #   [Hani] core primaries:           FB40..FB41 (2)
>       U+4E00..U+9FFF                    FB40..FB41 (2)
>       U+F900..U+FAFF                    FB41       (1)
> #   [Hani] extended primaries:       FB80..FB9D (30)
>       U+3400..U+4DBF                    FB80       (1)
>       U+20000..U+EFFFF                  FB84..FB9D (29)
> #   Other primaries:                 FBC0..FBE1 (34)
>       U+0000..U+EFFFF                   FBC0..FBDD (30)
>       U+F0000..U+10FFFF                 FBDE..FBE1 (4)
> #  Trailing weights:                 FC00..FFFF (1024)
>
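
For reference, the lead-weight ranges in the table above follow from the
implicit-weight formula in UTS #10: AAAA = base + (CP >> 15) and
BBBB = (CP & 0x7FFF) | 0x8000. Below is a small Python sketch, assuming the
caller supplies the base (FB40, FB80, or FBC0). The real choice of base
depends on the Unified_Ideograph property, which is not modelled here, and
the function names are hypothetical.

```python
def implicit_weights(cp, base):
    """Return the two implicit primary weights (AAAA, BBBB) for a code point."""
    aaaa = base + (cp >> 15)        # lead weight: one value per 32K code points
    bbbb = (cp & 0x7FFF) | 0x8000   # second weight: always in 8000..FFFF
    return aaaa, bbbb


def lead_range(first, last, base):
    """Return the lowest and highest lead weight used by a code point range."""
    return implicit_weights(first, base)[0], implicit_weights(last, base)[0]


# Ranges and bases as listed in the quoted table.
core = lead_range(0x4E00, 0x9FFF, 0xFB40)
compat = lead_range(0xF900, 0xFAFF, 0xFB40)
ext_a = lead_range(0x3400, 0x4DBF, 0xFB80)
ext_planes = lead_range(0x20000, 0xEFFFF, 0xFB80)
other = lead_range(0x0000, 0xEFFFF, 0xFBC0)
last_planes = lead_range(0xF0000, 0x10FFFF, 0xFBC0)
```

For example, `implicit_weights(0x4E00, 0xFB40)` gives the pair FB40 CE00, and
`ext_planes` covers FB84..FB9D as in the table.
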
> It clearly shows that the currently assigned ranges of primary weights
> are far larger than needed.
>
> - Sinograms can fully be assigned a first primary weight within a set of
> only 32 values, instead of the 128 assigned.
>
> - This leaves enough room to separate the primary weights used by PUA
> blocks (both in the BMP and in planes 15 and 16), which just requires 1
> primary weight for the PUAs in the BMP, and 4 primary weights for the last
> two planes (if some other future PUA ranges are assigned, for example for
> RTL PUAs, we could imagine that this count of 5 weights would be extended
> to
>
> - All other primaries will never be assigned to anything outside planes 0
> to 14, and only for unassigned code points (whose primary weight value
> should probably be between the first derived primary weights for sinograms,
> and those from the PUA), so they'll never need more than 30 primary weights.
>
> Couldn't we remap these default bases for derived primary weights like
> this, and keep more space for the rest:
>
> # Derived weight ranges: FBB0..FBFF (80)
> #   [Hani] core primaries:      FBB0..FBB1 (2)
>       U+4E00..U+9FFF                    FBB0 (1)
>         (using base=U+2000 for the 2nd primary weight)
>       U+F900..U+FAFF                    FBB1 (1)
>         (using base=U+A000 for the 2nd primary weight)
> #   [Hani] extended primaries:  FBB2..FBCF (30)
>       U+3400..U+4DBF                    FBB2       (1)
>         (using base=U+2000 for the 2nd primary weight)
>       reserved                          FBB3       (1)
>       U+20000..U+EFFFF                  FBB4..FBCF (26)
>         (using base=U+n0000 or U+n8000 for the 2nd primary weight)
> #   Other non-PUA primaries:    FBD0..FBEF (32)
>       U+0000..U+EFFFF                   FBD0..FBED (30)
>         (using base=U+n0000 or U+n8000 for the 2nd primary weight)
>       reserved                          FBEE..FBEF (2)
> #   PUA primaries:              FBF0..FBFF (16)
>       U+D800..U+DFFF                    FBF0       (1)
>         (using base=U+n8000 for the 2nd primary weight)
>       reserved                          FBF1..FBFB (11)
>       U+F0000..U+10FFFF                 FBFC..FBFF (4)
>         (using base=U+n0000 or U+n8000 for the 2nd primary weight)
> # Trailing weights:      FC00..FFFF (1024)
>
> This scheme completely frees the range FB40..FBAF, while reducing the gaps
> currently left which will never have any use.
>
> (In this scheme, I have no opinion on which range is best for code
> points assigned to non-characters, but they could all map to FBFF, used here
> for PUA, with the second primary weight at the end of the encoding space
> moved from 8000..FFFF to 4000..BFFF, so that the second primary weight for
> non-characters fits easily into C000..FFFF.)
>
> This way, we would keep ranges available for future large non-sinographic
> scripts (pictographic, non-Han ideographic), that would probably use only
> derived weights, or for a refined DUCET containing more precise levels or
> gaps facilitating some derived collation tables (for example in CLDR).
>
> And all PUAs would clearly sort within dedicated ranges of primary weights,
> with a guarantee that they all sort at the end, after all scripts.
>
> -- Philippe.
>
>
