2013/1/4 Asmus Freytag <[email protected]>:
> On 1/4/2013 2:36 AM, Stephan Stiller wrote:
>>
>> All,
>>
>> There are plenty of unassigned code points within blocks that are in use;
>> these often come at the end of a block but there are plenty of holes as
>> well.
>>
>> I have a cluster of interrelated questions:
>> 1. What sorts of reasons are there (or have there been) for leaving holes?
>> Code page conversion and changes to casing by simple arithmetic? What else?
>
>
> There are a number of reasons why a code chart may not be contiguous besides
> the reason you give. Sometimes, a character gets removed from the draft at
> last minute, In those cases, a hole may be left. In general, the possible
> reasons for leaving a hole can not be enumerated in a fixed list. It's more
> of a case-by-case thing.
And sometimes a hole is left pending a further decision: the position remains reserved as long as the proposed character has not been formally rejected.

Sometimes holes come from simple mappings of legacy encodings, where the relative order was preserved. A position was left unallocated because the legacy encoding referenced a character already encoded elsewhere. Such holes, initially kept to preserve compatibility with simple mappings of legacy encodings and with some fonts, may stay empty for a long time (even though the font assignments are normally invalid: this is the case in the block of Wingdings symbols).

For normal scripts (alphabets, abjads, alphasyllabaries, sinograms, ideographs), holes may be allocated later to completely unrelated new characters of the same script, as long as there is evidence that the script is likely to gain more historic characters in the future. This is the case for Latin, Arabic, Cyrillic, and many Indic scripts, and for blocks containing punctuation, mathematical symbols, and pictograms such as emoji or game symbols such as playing cards.

As long as a single proposal fits in the existing holes of existing blocks, no new block is allocated. If the proposal contains more characters than fit in a hole, a new block is allocated to hold them all at once. This lets new fonts be added that support all of them at once, without having to update many fonts for full coverage of the accepted proposal, which simplifies implementation, deployment and usage.

Many proposals consist of a single character or very few characters: slowly they fill the holes left in blocks by prior assignments. I think the rationale is to group together characters that will be used together and in the same fonts (notably if there are contextual substitution rules or ligatures).
Just look at the history of Unicode versions in the Extended Latin blocks, and you'll find these later allocations filling holes left by prior assignments. The roadmap also gives some information about the estimated number of characters for which proposals are pending. Very often these proposals reference the holes, but they will not be concluded for a long time, and they must avoid colliding with each other by competing for the same positions once the initial encoding steps have been passed but not finalized, or before a proposal is finally abandoned in favor of a newer, more complete one. Many proposals take months or years to complete, even when their blocks are already accepted and already encode a small part of the needed characters.
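
If you want to see the holes in a block for yourself, here is a minimal Python sketch, not an authoritative tool. It assumes that "hole" simply means a code point with general category Cn (unassigned) in the Unicode data shipped with your Python build, so the output depends on that data version. The helper name unassigned_in_range is hypothetical, and the Greek and Coptic range U+0370..U+03FF is picked only as an illustration; it contains U+03A2, a position left empty where a capital final sigma would sit, which is the "casing by simple arithmetic" kind of hole mentioned in the question.

```python
import unicodedata


def unassigned_in_range(first, last):
    """Return the code points in [first, last] whose general category is Cn.

    Cn means "unassigned" in the Unicode data bundled with this Python
    build (noncharacters are also Cn, but none lie in the range used here).
    """
    return [cp for cp in range(first, last + 1)
            if unicodedata.category(chr(cp)) == "Cn"]


# Greek and Coptic block, U+0370..U+03FF (illustrative choice)
GREEK_FIRST, GREEK_LAST = 0x0370, 0x03FF

holes = unassigned_in_range(GREEK_FIRST, GREEK_LAST)

# The list depends on which Unicode version this Python was built with.
print("Unicode data version:", unicodedata.unidata_version)
print(" ".join(f"U+{cp:04X}" for cp in holes))
```

Running the same function under Python builds that ship different Unicode data versions is one rough way to watch holes being filled over time.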

