On Thu, 04 Jun 2015 14:39:27 +0000 David Starner <[email protected]> wrote:
> On Thu, Jun 4, 2015 at 6:09 AM John <[email protected]> wrote:
>
> > Mostly just a matter of upgrading the character size.
>
> Which totally blows any concern with text size out of the water.
> Using 30 bytes to define certain very rare characters and 1 byte to
> define ASCII is way better than using 8 bytes to define all
> characters.

The character size can be increased to 64 bits in such a way that no
new surrogates are required: current UTF-8 text remains UTF-8, current
UTF-16 text remains UTF-16 and current UTF-32 text remains UTF-32. The
extended UTF-8 still has 8-bit code units, the extended UTF-16 still
has 16-bit code units, and the extended UTF-32 still has 32-bit code
units.

In fact, the character size can be made unbounded. The trick is to
extend UTF-8 indefinitely, and then for UTF-16 and UTF-32 to repeat the
idea of the UTF-8 scheme, using sequences of two or more low surrogates
(or two or more high surrogates - one must choose) much as UTF-8 uses
bytes.

Tom Bishop publicised the idea.

Richard.
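
For illustration only, here is a minimal Python sketch of the direction
"extend UTF-8" points in. It assumes the familiar UTF-8 bit pattern (a
lead byte with n leading one-bits, then n-1 continuation bytes of the
form 10xxxxxx) is simply carried on to sequences of up to 7 bytes, which
gives 36 payload bits. The function name and the 7-byte cut-off are
hypothetical choices made for this sketch; it is not the scheme Tom
Bishop described, and an unbounded version would need some further rule
for the lead byte that is not specified here.

```python
def encode_extended_utf8(cp: int) -> bytes:
    """Encode a non-negative integer using the UTF-8 bit pattern,
    allowing sequences of 1 to 7 bytes (hypothetical extension)."""
    if cp < 0:
        raise ValueError("value must be non-negative")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx, identical to ASCII.
        return bytes([cp])
    for n in range(2, 8):
        lead_bits = 7 - n                     # payload bits in the lead byte
        total_bits = lead_bits + 6 * (n - 1)  # plus 6 bits per continuation byte
        if cp < (1 << total_bits):
            # Lead byte: n one-bits, a zero bit, then the top payload bits.
            # For n == 7 the lead byte is 0xFE and carries no payload.
            lead_prefix = (0xFF << (8 - n)) & 0xFF
            out = [lead_prefix | (cp >> (6 * (n - 1)))]
            # Continuation bytes: 10xxxxxx, most significant group first.
            for i in range(n - 2, -1, -1):
                out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
            return bytes(out)
    raise ValueError("value needs more than 36 bits; outside this sketch")
```

For values that are valid Unicode scalar values the output matches
standard UTF-8, which is the sense in which current UTF-8 text remains
UTF-8. Anything from 2**21 upwards takes 5 or more bytes and would be
rejected by a conforming UTF-8 decoder today.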

