Re: Code pages and Unicode

2011-08-25 Thread Asmus Freytag

On 8/24/2011 7:45 PM, Richard Wordingham wrote:


Which earlier coding system supported Welsh?  (I'm thinking of 'W WITH
CIRCUMFLEX', U+0174 and U+0175.)  How was the use of the canonical
decompositions incompatible with the character encodings of legacy
systems?  Latin-1 has the same codes as ISO-8859-1, but that's as far
as having the same codes goes. Was the use of combining jamo
incompatible with legacy Hangul encodings?


See how time flies.

Early adopters were interested in 1:1 transcoding, using a single 256-entry 
table for an 8-bit character set, with guaranteed predictable 
length. Early designs of Unicode (and 10646) attempted to address these 
concerns, because failing to do so promised severe impediments to migration.


Some characters were included as part of the merger, without the same 
rigorous process as is in force for characters today. At that time, 
scuttling the deal over a few characters here or there would not have 
been a reasonable action. So you will always find some exceptions to 
many of the principles - which doesn't make them less valid.


Obviously D800 D800 000E DC00 is non-conformant with current UTF-16. 
Remembering that there is a guarantee that there will be no more 
surrogate points, an extension form has to be non-conformant with 
current UTF-16! 


And that's the reason why there's no interest in this part of the 
discussion. Nobody will need an extension next Tuesday, or in a decade 
or even in several decades - or ever. Haven't seen an upgrade to Morse 
code recently to handle Unicode, for example. Technology has a way of 
moving on.


So, the best thing is to drop this silly discussion, and let those future 
people that might be facing a real *requirement* use their good judgment 
to come to a technical solution appropriate to their time - instead of 
wasting collective cycles of discussion on how to make 1990s technology 
work for an unknown future requirement. It's just bad engineering.

Everyone should know how to extend UTF-8 and UTF-32 to cover the 31-bit
range.


I disagree (as would anyone with a bit of long-term perspective). Nobody 
needs to look into this for decades, so let it rest.


A./



RE: Code pages and Unicode

2011-08-25 Thread Erkki I Kolehmainen
+1

I'm also guilty of pushing through one particular proposal (much to Ken's 
disliking) that I most certainly would no longer even try, but, alas, times 
were different.

Sincerely, Erkki 

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] 
On behalf of Asmus Freytag
Sent: 25 August 2011 9:00
To: Richard Wordingham
Cc: Ken Whistler; unicode@unicode.org
Subject: Re: Code pages and Unicode






RE: Code pages and Unicode

2011-08-24 Thread William_J_G Overington
On Tuesday 23 August 2011, Doug Ewell d...@ewellic.org wrote:
 
 Asmus Freytag asmusf at netcom dot com wrote:
 
  Until then, I find further speculation rather pointless and would love if 
  it moved off this list (until such time).
 
 +1
 
-0.7
 
It is harmless fun, indeed it is fun that assists learning and understanding, 
and so, as long as it does not go on for a long time, I think that it is good.
 
http://www.unicode.org/policies/mail_policy.html
 
quote
 
A mail list is also a social organization, and as such, there will inevitably 
be some off-topic posting, fun, and games. This is not inherently discouraged 
unless it dominates a list for a length of time.
 
end quote
 
William Overington
 
24 August 2011
 





Re: Re: Code pages and Unicode

2011-08-24 Thread Jean-François Colson
On 23 August 2011 21:44 Richard Wordingham
richard.wording...@ntlworld.com
wrote:

 On Tue, 23 Aug 2011 07:18:21 +0200
 Jean-François Colson j...@colson.eu wrote:
 
  And what dou you think about (H1,H2,VS1,L3,L4)?
 
 The L4 is unnecessary. The trick then is to think of a BMP
 character that would very rarely be searched for on its own.
 
 Richard.
 

With (H1,H2,VS1,L3), you'd only reach U+4010FFFF.
To reach U+7FFFFFFF, you'd need either an additional low surrogate
(H1,H2,VS1,L3,L4)
or two VS (H1,H2,VS1,L3) and (H1,H2,VS2,L3).
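
A quick check of the arithmetic behind those figures (a Python sketch added
for illustration; the extension scheme itself is purely hypothetical):

    HS = LS = 1024               # number of high / low surrogate code units
    existing = 0x110000          # code points already reachable (planes 0-16)

    # (H1,H2,VS1,L3): two high surrogates, one selector, one low surrogate
    # give 1024*1024*1024 new combinations on top of the existing space.
    print(hex(existing + HS * HS * LS - 1))        # 0x4010ffff
    # A second selector (VS2) doubles that, which covers U+7FFFFFFF:
    print(hex(existing + 2 * HS * HS * LS - 1))    # 0x8010ffff
    # (H1,H2,VS1,L3,L4): a further low surrogate covers 31 bits many times over.
    print(HS * HS * LS * LS > 0x7FFFFFFF)          # True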



Re: Multiple private agreements (was: RE: Code pages and Unicode)

2011-08-24 Thread Doug Ewell
Philippe Verdy verdy underscore p at wanadoo dot fr wrote:

 (1) a plain-text file
 (2) using only plain-text conventions (i.e. not adding rich text)
 (3) which contains the same PUA code point with two meanings
 (4) using different fonts or other mechanisms
 (5) in a platform-independent, deterministic way

 One or more of the numbered items above must be sacrificed.

 The only numbered item to sacrifice is number (3) here. That's the case
 where separate PUA agreements are still coordinated so that they don't
 use the same PUA assignments. This is the case of PUA agreements in the
 Conscript registry.

Number 3 was the entire basis for srivas's question:

If same codes within PUA becomes standard for different purposes, how
to get both working using same font?
How to instruct text docs, what font if different fonts are used?

Changing the question around, so that we are no longer talking about one
code point with two meanings, doesn't accomplish anything.

 With only this exception, you can perfectly have separate agreements
 (using multiple fonts transporting them), for rendering a plain-text
 document. Of course the only parts of the PUA agreement stored in the font
 are the set of glyphs and the display properties. Other properties (for
 collation, case mappings, text segmentation, and so on...) are not
 suitable for being in the font, but they are not needed for correct
 editing (without automated case changes) or for correct rendering.

We have different views concerning the relative importance of these
other properties, and I'm not going to try further to convince you.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell






Re: Multiple private agreements (was: RE: Code pages and Unicode)

2011-08-24 Thread Doug Ewell
Luke-Jr luke at dashjr dot org wrote:

 Too bad the Conscript registry is censoring assignments the maintainer
 doesn't like for unspecified personal reasons, increasing the chances
 of an overlap.

This isn't censorship, which would imply some sort of political,
ethical, or moral agenda.  This is a registrar making a technical (not
an unspecified personal) decision, which he already explained to you,
not to add something to the registry he maintains.

(For what it's worth, and as you'll remember, I agreed with you about
registering the tonal digits.  But Michael is the CSUR registrar, not
me.)

Philippe Verdy verdy_p_at_wanadoo.fr replied:

 Even the UTC could create its own PUA registry, probably coordinating
 it with WG2 and with the IRG, for experimenting with new encodings, or
 working on proposals, helping document the needed features or
 difficulties, and cooperating better with non-technical people that have
 good cultural knowledge, or that have access to rare texts or corpora
 for which there does not yet exist any digitisation (scans), or
 whose digitisation is restricted or not financed, and for which it is
 also impossible to create OCR versions.

As Richard said, and you probably already know, there is no chance that
UTC will ever do anything with the PUA, especially anything that gives
the appearance of endorsing its use.  I'm just thankful they haven't
deprecated it.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell






Re: Code pages and Unicode

2011-08-24 Thread John H. Jenkins

Asmus Freytag wrote on 23 August 2011 at 2:00 PM:

 
 Until then, I find further speculation rather pointless and would love if it 
 moved off this list (until such time).
 


That would be wonderful, because we could then turn our attention to more 
urgent subjects, such as what to do when the sun reaches its red giant stage 
and threatens to engulf the Earth. ☺ 

=
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com






RE: Code pages and Unicode

2011-08-24 Thread Doug Ewell
William_J_G Overington wjgo underscore 10009 at btinternet dot com
wrote:

 Until then, I find further speculation rather pointless and would
 love if it moved off this list (until such time).

 It is harmless fun, indeed it is fun that assists learning and
 understanding, and so as long as it does not go on for a long time,
 I think that it is good.

If it were limited to the fun and the hypothetical, I would probably
agree.

But some people seem to be dead serious about the need to go beyond 1.1
million code points, and are making dead-serious arguments that we need
to plan for it.  I don't know if they truly believe we are going to
communicate with space aliens using Unicode (judicious use of a smiley
might reassure me here), or whether they think adding 2 billion code points
will provide a back door to encoding all sorts of non-character "every
grain of sand on the beach" objects, or what.  But it isn't rooted in
any sort of reality; both UTC and WG2 have permanently sealed the upper
limit at 0x10FFFF, and knowledgeable people have tried and tried until
they are blue in the face to explain why this is NOT a problem.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell






Re: Code pages and Unicode

2011-08-24 Thread Richard Wordingham
On Wed, 24 Aug 2011 08:02:42 -0700
Doug Ewell d...@ewellic.org wrote:

 But some people seem to be dead serious about the need to go beyond
 1.1 million code points, and are making dead-serious arguments that
 we need to plan for it.

Those are two different claims.  'Never say never' is a useful maxim.
The extension of UCS-2, namely UTF-16, is far from optimal, but it
could have been a lot worse - at least the surrogates are contiguous.
All I ask is that we have a reasonable way of extending it if, say,
code points are squandered.  I think, however, that <high><high><rare
BMP code><low> offers a legitimate extension mechanism that can
actually safely be ignored when scattering code assignments about the
17 planes (of which only 2 are full).

Perhaps it is just as well we will never need a CJK character for every
surname.  It seems that we can safely accommodate CJK language tags.

Richard.



Re: Code pages and Unicode

2011-08-24 Thread Ken Whistler

On 8/24/2011 10:48 AM, Richard Wordingham wrote:

Those are two different claims.  'Never say never' is a useful maxim.


So is "Leave well enough alone."

The problem would be in using maxims instead
of an analysis of engineering requirements to drive architectural decisions.


The extension of UCS-2, namely UTF-16, is far from optimal, but it
could have been a lot worse - at least the surrogates are contiguous.
All I ask is that we have a reasonable way of extending it


Why?


  if, say,
code points are squandered.


Oh.

Well, in that case, the correct action is to work to ensure that code
points are not squandered.


  I think, however, that <high><high><rare
BMP code><low> offers a legitimate extension mechanism


One could argue about the description as "legitimate". It is clearly not
conformant, and would require a decision about an architectural change to
the standard. I see no chance of that happening for either the Unicode
Standard or 10646.


that can
actually safely be ignored when scattering code assignments about the
17 planes (of which only 2 are full).


A quibble (I know), but only 1 plane is arguably full. Or, if you
count PUA, then *3* planes are full.

Here are the current stats for the forthcoming Unicode 6.1, counting
*designated* code points (as opposed to assigned graphic characters).

Plane 0: 63,207 / 65,536 = 96.45% full
Plane 1: 7,497 / 65,536 = 11.44% full
Plane 2: 47,626 / 65,536 = 72.67% full (plane reserved for CJK ideographs)
Plane 14: 339 / 65,536 = 0.52% full
Plane 15: 65,536 / 65,536 = 100% full (PUA)
Plane 16: 65,536 / 65,536 = 100% full (PUA)
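
As a quick arithmetic check, the percentages follow directly from the counts
(a small Python snippet added here only as an illustration, not from the
original mail):

    designated = {0: 63207, 1: 7497, 2: 47626, 14: 339, 15: 65536, 16: 65536}
    for plane, count in designated.items():
        print(f"Plane {plane}: {count / 65536:.2%} full")   # e.g. Plane 0: 96.45% full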

--Ken






Re: Code pages and Unicode

2011-08-24 Thread Richard Wordingham
On Wed, 24 Aug 2011 12:40:54 -0700
Ken Whistler k...@sybase.com wrote:

 On 8/24/2011 10:48 AM, Richard Wordingham wrote:
   if, say,
 code points are squandered.
 
 Oh.
 
 Well, in that case, the correct action is to work to ensure that code 
 points are not squandered.

Have there not already been several failures on that front?  The BMP is
littered with concessions to the limitations of rendering systems -
precomposed characters, Hangul syllables and Arabic presentation forms
are the most significant.  Hangul syllables being also a political
compromise does not instil confidence in the lines of defence.  I don't
dispute that there have also been victories. Has Japanese
disunification been completely killed, or merely scotched?

I think, however, that <high><high><rare
  BMP code><low> offers a legitimate extension mechanism

 One could argue about the description as legitimate. It is clearly
 not conformant,

With what?  It's obviously not UTF-16 as we know it, but a possibly new
type of code-unit sequence.

 and would require a decision about an architectural change to the
 standard.

Naturally.  The standard says only 17 planes.  However, apart from
UTF-16, the change to the *standard* would not be big.  (Even so, a lot
of UTF-8 and UTF-32 code would have to be changed to accommodate the new
limit.)
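
For readers who have not seen it: the original, pre-2003 definition of UTF-8
already reached U+7FFFFFFF with 5- and 6-byte sequences. A minimal Python
sketch of that historical scheme, added here purely as an illustration:

    def utf8_31bit(cp):
        # Original (pre-2003) UTF-8 byte ranges, up to 6 bytes / 31 bits.
        if cp < 0x80:
            return bytes([cp])
        for nbytes, limit, lead in ((2, 0x800, 0xC0), (3, 0x10000, 0xE0),
                                    (4, 0x200000, 0xF0), (5, 0x4000000, 0xF8),
                                    (6, 0x80000000, 0xFC)):
            if cp < limit:
                trail = [0x80 | ((cp >> (6 * i)) & 0x3F)
                         for i in range(nbytes - 2, -1, -1)]
                return bytes([lead | (cp >> (6 * (nbytes - 1)))] + trail)
        raise ValueError("beyond 31 bits")

    print(utf8_31bit(0x10FFFF).hex())     # f48fbfbf (same bytes as today's UTF-8)
    print(utf8_31bit(0x7FFFFFFF).hex())   # fdbfbfbfbfbf (6 bytes, no longer permitted)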

 I see no chance of that happening for either the Unicode
 Standard or 10646.

It will only happen when the need becomes obvious, which may be never,
or may be 30 years hence.  It's even conceivable that UTF-16 will
drop out of use.

 Here are the current stats for the forthcoming Unicode 6.1, counting 
 *designated*
 code points (as opposed to assigned graphic characters).
 
 Plane 0: 63,207 / 65,536 = 96.45% full
 Plane 1: 7497 / 65,536 = 11.44% full
 Plane 2: 47,626 / 65,536 = 72.67% full (plane reserved for CJK
 ideographs)
 Plane 14: 339 / 65,536 = 0.52% full
 Plane 15: 65,536 / 65,536 = 100% full (PUA)
 Plane 16: 65,536 / 65,536 = 100% full (PUA)

I only see two planes that are actually full.  Which are you counting
as the full non-PUA plane?

Richard.



Re: Code pages and Unicode

2011-08-24 Thread John H. Jenkins
It has ceased to be. It's expired and gone to meet its maker. It's a stiff. 
Bereft of life, it rests in peace. … Its metabolic processes are now history. 
It's off the twig. It's kicked the bucket, it's shuffled off its mortal coil, 
run down the curtain and joined the bleedin' choir invisible.  This is an 
ex-possibility.

And even if that *weren't* true, there are nowhere *near* enough kanji to have 
a serious impact on Ken's analysis.  

Richard Wordingham wrote on 24 August 2011 at 4:51 PM:

 Has Japanese
 disunification been completely killed, or merely scotched?

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: Code pages and Unicode

2011-08-24 Thread Ken Whistler

On 8/24/2011 3:51 PM, Richard Wordingham wrote:

Well, in that case, the correct action is to work to ensure that code
  points are not squandered.

Have there not already been several failures on that front?  The BMP is
littered with concessions to the limitations of rendering systems -
precomposed characters, Hangul syllables and Arabic presentation forms
are the most significant.


Those are not concessions to the limitations of rendering systems -- they
are concessions to the need to stay compatible with the character encodings
of legacy systems, which had limitations for their rendering systems.

A quibble? I think not.

Note the outcome for Tibetan, for example. A proposal came in some years
ago to encode all of the stacks for Tibetan as separate, precomposed
characters -- ostensibly because of the limitations of rendering systems.
That proposal was stopped dead in its tracks in the encoding committees,
both because it would have been a duplicate encoding and normalization
nightmare, and because, well, current rendering systems *can* render
Tibetan just fine, thank you, given the current encoding.



Hangul syllables being also a political
compromise


From *1995*, when such a compromise was necessary to keep in place
the still fragile consensus which had driven 10646 and the Unicode Standard
into a still-evolving coexistence.

It is a mistake to extrapolate from that one example to conclusions that
political decisions will inevitably lead to encoding useless additional
hundreds of thousands of characters.


does not instil confidence in the lines of defence.  I don't
dispute that there have also been victories. Has Japanese
disunification been completely killed, or merely scotched?


   I think, however, that <high><high><rare
BMP code><low> offers a legitimate extension mechanism

  One could argue about the description as legitimate. It is clearly
  not conformant,

With what?  It's obviously not UTF-16 as we know it, but a possibly new
type of code-unit sequence.


In whichever encoding form you choose to specify, the sequence <high><high>
is non-conformant. Not merely a possibly new type of code unit sequence.

D800 D800 is non-conformant UTF-16

D800 D800 is non-conformant UTF-32

ED A0 80 ED A0 80 is non-conformant UTF-8
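
One way to see the non-conformance concretely is that strict decoders and
encoders reject these sequences. A Python illustration (added here, not part
of Ken's mail):

    # Two lone high surrogates cannot be encoded to UTF-16, and the
    # corresponding surrogate byte pattern is rejected by a strict UTF-8 decoder.
    for attempt in (lambda: "\ud800\ud800".encode("utf-16"),
                    lambda: b"\xed\xa0\x80\xed\xa0\x80".decode("utf-8")):
        try:
            attempt()
        except (UnicodeEncodeError, UnicodeDecodeError) as err:
            print(err)   # "surrogates not allowed" / invalid UTF-8 byte sequence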




  and would require a decision about an architectural change to the
  standard.

Naturally.  The standard says only 17 planes.  However, apart from
UTF-16, the change to the*standard*  would not be big.  (Even so, a lot
of UTF-8 and UTF-32 code would have to be changed to accommodate the new
limit.)


Which is why this is never going to happen. (And yes, I said never. ;-) )


  I see no chance of that happening for either the Unicode
  Standard or 10646.

It will only happen when the need becomes obvious, which may be never,
or may be 30 years hence.  It's even conceivable that UTF-16 will
drop out of use.


Could happen. It still doesn't matter, because such a proposal also breaks
UTF-8 and UTF-32.




  Plane 0: 63,207 / 65,536 = 96.45% full


I only see two planes that are actually full.  Which are you counting
as the full non-PUA plane?


The BMP. 96.45% full is, for all intents and purposes, considered full now.


If you look at the BMP roadmap:

http://www.unicode.org/roadmaps/bmp/

there are only 9 columns left which are not already in assigned blocks.
More characters will gradually be added to existing blocks, of course,
filling in nooks and crannies, but the real action for new encoding has
now turned almost entirely to Plane 1.

--Ken




Re: Multiple private agreements (was: RE: Code pages and Unicode)

2011-08-24 Thread Philippe Verdy
2011/8/24 Doug Ewell d...@ewellic.org:
 Philippe Verdy verdy underscore p at wanadoo dot fr wrote:

 (1) a plain-text file
 (2) using only plain-text conventions (i.e. not adding rich text)
 (3) which contains the same PUA code point with two meanings
 (4) using different fonts or other mechanisms
 (5) in a platform-independent, deterministic way

 One or more of the numbered items above must be sacrificed.

 The only numbered item to sacrifice is number (3) here. That's the case
 where separate PUA agreements are still coordinated so that they don't
 use the same PUA assignments. This is the case of PUA agreements in the
 Conscript registry.

 Number 3 was the entire basis for srivas's question:

 If same codes within PUA becomes standard for different purposes, how
 to get both working using same font?
 How to instruct text docs, what font if different fonts are used?

 Changing the question around, so that we are no longer talking about one
 code point with two meanings, doesn't accomplish anything.

But my initial suggestion implied that condition 3 was not part of it.
It was not me but srivas who modified the problem. The problem
was changed later by adding new conditions that I had never intended.
It is clear that this condition 3 is completely unsatisfiable in all
cases.



Re: Code pages and Unicode

2011-08-24 Thread Philippe Verdy
2011/8/25 Richard Wordingham richard.wording...@ntlworld.com:
 It will only happen when the need becomes obvious, which may be never,
 or may be 30 years hence.  It's even conceivable that UTF-16 will
 drop out of use.
Conceivable, but extremely unlikely, because it will remain in use in
extremely common cases, even if it can only support a subset of the
new encoding.

[begin side note]
This is a situation similar to the case of the UCS-2 subset, and of
the ISO 10646 implementation levels that have been withdrawn and are
no longer meaningful as a condition for conformance: conforming
applications today *must* exhibit behaviors that effectively respect
the unbreakability and unreorderability of surrogate pairs;
supporting isolated surrogates, or custom encodings that depend on
pairing rules other than a high surrogate followed by a low surrogate,
is not conforming.

This does not mean that applications have to assign distinctive
semantics to surrogates or have to support non-BMP characters by
recognizing their distinctive properties: as long as runs of
surrogates are handled in such a way that they will never be reordered
or composed in arbitrary sequences, these applications can satisfy the
conformance requirement, without having to fully assert a higher
implementation level.

So a UCS-2-only application can continue to blindly treat surrogates
*as if* they were unbreakable strings of symbols with a strong LTR
directionality and unknown glyphs (or just the same .notdef glyph),
or to treat them *as if* they were unassigned (but valid) code points
in the BMP (all with the same default property values, except that the
values of individual code units must all be preserved; alternatively a
UCS-2 application may still replace all those surrogate code units
simultaneously with the same value associated to a non-ignorable
character, such as 0xFFFD or 0x003F, or may still suppress all of
them, knowing that this is destructive of information, or opt for
throwing a fatal exception for all of them; these are some of the
worst situations where this UCS-2-only behavior is still conforming).
[end side note]

This does not mean that the existing UTFs will be the favored encodings in
the future (we can't say that even about UTF-8, or UTF-32). It's just
impossible to magically predict now which of the three standard UTFs
(or their standard byte-order variants) will fall out of use, or if
any one of them will fall out of use: for now there is absolutely no
sign that this will ever occur. Instead, we still see a very large
(and still accelerating) adoption rate for these UTFs (notably UTF-8).




Re: Multiple private agreements (was: RE: Code pages and Unicode)

2011-08-24 Thread Doug Ewell
Philippe wrote:

 But my initial suggestion implied that condition 3 was not part of it.
This is not me, but sriva that has modified the problem. The problem
was changed later by adding new conditions that I have never intended.
It is clear that this condition 3 is completely unsatisfiable in all
cases.

The problem was stated initially by srivas, yesterday, so it's hard to imagine 
how he modified it. But of course I agree, and said so first, that condition 3 
(one font, two different characters, same font, plain text) is impossible.

--
Doug Ewell • d...@ewellic.org
Sent via BlackBerry by AT&T






Re: Multiple private agreements (was: RE: Code pages and Unicode)

2011-08-24 Thread Doug Ewell
s/one font/one code point/

--
Doug Ewell • d...@ewellic.org
Sent via BlackBerry by AT&T








Re: Multiple private agreements (was: RE: Code pages and Unicode)

2011-08-24 Thread Philippe Verdy
2011/8/24 Doug Ewell d...@ewellic.org:
 As Richard said, and you probably already know, there is no chance that
 UTC will ever do anything with the PUA, especially anything that gives
 the appearance of endorsing its use.  I'm just thankful they haven't
 deprecated it.

The appearance of endorsing its use would only arise if the website
describing the registry used a frame displaying the Unicode logo.

It can act exactly like the CSUR registry, as an independent project
(with its own membership and participation policies), that would also
be helpful for collaborating with liaison members, ISO NBs, or some
local cultural organizations or collaborative projects.

The focus of this registry would only be on helping the encoding
process: registered PUAs or PUA ranges would not survive finalized
proposals that were formally proposed and rejected by both the UTC and
WG2 and abandoned as well by their initial promoters in the registry
(no new updated proposal), or proposals that have finally been
released in the UCS (and there would likely be a short timeframe for
the death of these registrations, probably not exceeding one year).

It would be different from the CSUR, because the CSUR also focuses on
supporting PUA characters that will never be supported in the UCS (for
example, due to legal reasons, such as copyright, which would restrict
the publication of any representative glyph in the UCS charts), or
creative/artistic designs.

(For example, I'm still not convinced that Klingon qualifies for
encoding in the UCS, because of copyright restrictions and the absence of
a formal free licence from the rights owners; the same would apply to any
collection of logos, including the logos of national or international
standard bodies that you can find on lots of manufactured products and
in their documentation, because the usage of these logos is severely
restricted and often implies contractual assessments by those
displaying it on their products or publications; this would also apply
to corporate logos, even if they are widely used, sometimes with
permission, but this time because these logos frequently change for
marketing reasons).




Re: Multiple private agreements (was: RE: Code pages and Unicode)

2011-08-24 Thread Andrew Cunningham
so you will end up with the CSUR AND the registry Philippe is
suggesting AND all the existing uses of PUA that will not end up in
CSUR or the other registry.

sounds like it will be a mess.

it's bad enough dealing with Unicode and pseudo-Unicode in the Myanmar
script, adding PUA potentially into the mix  ummm...







-- 
Andrew Cunningham
Senior Project Manager, Research and Development
Vicnet
State Library of Victoria
Australia

andr...@vicnet.net.au
lang.supp...@gmail.com




Re: Multiple private agreements (was: RE: Code pages and Unicode)

2011-08-24 Thread Philippe Verdy
2011/8/25 Andrew Cunningham lang.supp...@gmail.com:
 so you will end up with the CSUR AND the registry Philippe is
 suggesting AND all the existing uses of PUA that will not end up in
 CSUR or the other registry.

 sounds like it will be a mess.

 its bad enough dealing with Unicode and pseudo-Unicode in the Myanmar
 script, adding PUA potentially into the mix  ummm...

Where did someone speak about the Myanmar script case? Maybe you're
now the source of this mix.
And anyway I've not said that the CSUR project, or a putative project
to help the UCS encoding process, are the only options for using PUAs.
And not all PUA usages need to be coordinated:

- in East Asia, PUAs are frequently used only for personal reasons, in
a purely creative way (for example for personal ideographs), only for
the final purpose of creating something else that will be communicated
to others, and which will not necessarily be plain text.

- people can encode their own photos or colorful drawings in a PUA, if
they want it for their own uses... They don't require the
authorization or approval of others.

- applications may internally use their own PUAs as a simple part of
their implementation, to make it work and produce the results wanted,
without having to even expose how this internal use is effectively
defined (others may try to investigate, by reverse engineering, or the
application author may change this representation at any time, or
remove it by using some other solutions, it should not matter).

However, I am still convinced that the coordinated use of PUAs is
justified by the desire to create something else, as a temporary
working tool, which can be justified by the current limitations of
existing standards in their defined scope (including policies) or in
their initially expected usage. In that case, PUAs are a very useful
transition mechanism.



Re: Code pages and Unicode

2011-08-23 Thread Richard Wordingham
On Mon, 22 Aug 2011 16:18:56 -0700
Ken Whistler k...@sybase.com wrote:

 How about Clause 12.5 of ISO/IEC 10646:
 
 001B, 0025, 0040
 
 You escape out of UTF-16 to ISO 2022, and then you can do whatever
 the heck you want, including exchange and processing of complete
 4-byte forms, with all the billions of characters folks seem to think
 they need.

 Of course you would have to convince implementers to honor the ISO
 2022 escape sequence...

Which they only need to do if the text is in an ISO 2022 or similar
context.  Your idea does suggest that a pattern of
<high><high><SO><low> would be reasonable.  The shift-out code U+000E
has no meaning as a Unicode character, so it wouldn't be unreasonable to
require a special check that one finds a full character if looking for
a one-character string consisting only of U+000E.  We could also have
<high><high><SI><low> to give the full *two* thousand million odd
characters that would be resupported by UTF-32.

Richard.



Re: Code pages and Unicode

2011-08-23 Thread Asmus Freytag

On 8/23/2011 12:00 PM, Richard Wordingham wrote:

On Mon, 22 Aug 2011 16:18:56 -0700
Ken Whistler k...@sybase.com wrote:


How about Clause 12.5 of ISO/IEC 10646:

001B, 0025, 0040

You escape out of UTF-16 to ISO 2022, and then you can do whatever
the heck you want, including exchange and processing of complete
4-byte forms, with all the billions of characters folks seem to think
they need.
Of course you would have to convince implementers to honor the ISO
2022 escape sequence...

Which they only need to if the text is in an ISO 2022 or similar
context.  Your idea does suggest that a pattern of
<high><high><SO><low> would be reasonable.


I don't see where Ken's reply (as quoted) suggests anything like that.

What he wrote is that, formally, 10646 supports a mechanism to switch to 
ISO 2022.


Therefore, formally, there's an escape hatch built in.

If and when such should be needed, in a few hundred years, it'll be there.
Until then, I find further speculation rather pointless and would love 
if it moved off this list (until such time).


A./




RE: Code pages and Unicode

2011-08-23 Thread Doug Ewell
Asmus Freytag asmusf at netcom dot com wrote:

 Until then, I find further speculation rather pointless and would
 love if it moved off this list (until such time).

+1

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell






Multiple private agreements (was: RE: Code pages and Unicode)

2011-08-23 Thread Doug Ewell
srivas sinnathurai sisrivas at blueyonder dot co dot uk wrote:

 If same codes within PUA becomes standard for different purposes,

They aren't standard.  Two different private agreements could assign
different characters to the same PUA code points.

 how to get both working using same font?

You can't.

 How to instruct text docs, what font if different fonts are used?

There's no standard way to specify even one font or private agreement in
plain text, let alone how to switch between them within the same
document.  This is not an intended use of the PUA.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell






Re: Multiple private agreements (was: RE: Code pages and Unicode)

2011-08-23 Thread Philippe Verdy
2011/8/23 Doug Ewell d...@ewellic.org:
 srivas sinnathurai sisrivas at blueyonder dot co dot uk wrote:

 If same codes within PUA becomes standard for different purposes,

 They aren't standard.  Two different private agreements could assign
 different characters to the same PUA code points.

 how to get both working using same font?

 You can't.

I do agree.

 How to instruct text docs, what font if different fonts are used?

 There's no standard way to specify even one font or private agreement in
 plain text, let alone how to switch between them within the same
 document.  This is not an intended use of the PUA.

There exists such a standard in the context of plain-text rendering,
because of font fallback mechanisms (in Windows with Uniscribe, such a
fallback mechanism is not tunable per user preferences, as the list of
alternative fonts that are tried is fixed by the implementation of
Uniscribe; but anyway it still exists), which implies that multiple
fonts will be scanned in an order of preference; font fallback is
involved each time a character is not mapped in the selected font but
may be mapped in another font.

Such a mechanism is exactly similar to the explicit fallback mechanism
in CSS (where one provides an ordered comma-separated list of
font-family names), except that it also extends this list of fonts
automatically using the default font fallback mechanisms used for
plain-text rendering.
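
A toy sketch of the ordered-fallback idea in Python (the font names and
coverage ranges below are invented for illustration and do not describe any
real renderer's API):

    # Each entry: (font name, set of code points the font covers).
    fallback_chain = [
        ("MainText",     set(range(0x0020, 0x0250))),   # no PUA coverage
        ("PrivateFontA", set(range(0xE000, 0xE100))),   # one private agreement
        ("PrivateFontB", set(range(0xF8D0, 0xF900))),   # a different agreement
    ]

    def pick_font(cp):
        # The first font in the ordered list that maps the code point wins.
        for name, coverage in fallback_chain:
            if cp in coverage:
                return name
        return None   # renderer falls back to a last-resort / .notdef glyph

    print(pick_font(0xE005))   # PrivateFontA
    print(pick_font(0xF8D5))   # PrivateFontB

This only works, of course, when the two agreements use disjoint PUA ranges,
which is exactly the coordination point being argued in this thread.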

In other words, even if you can't instruct a plain-text document to use
glyphs from one font or from another for the same code point (a PUA code
point here), such a possibility still exists in rich-text rendering,
because all glyphs can become selectable as variants (including the
variants listed in the same font for the same glyph, in standardized
OpenType features, provided that the rich-text application implements
such a glyph-selection mechanism).

PUA code points are effectively not meant to supply the PUA agreement.
This has to be provided elsewhere, but a font can perfectly well transport
this agreement (for the font as a whole, which is separately selectable,
just as its designed glyph variants are individually selectable by
some typographic feature tables, as well as by index, for example
several swash variants of the same letter with more or less
decoration).

If you can use font fallbacks, then you can render the same text
containing distinct PUAs designed for distinct PUA agreements (and
this demonstrates the utility of the conscript registry, which allows
cooperation between authors of separate agreements, that have accepted
to encode their PUA characters with non-conflicting PUA code point
assignments).

-- Philippe.




RE: Multiple private agreements (was: RE: Code pages and Unicode)

2011-08-23 Thread Doug Ewell
Philippe Verdy verdy underscore p at wanadoo dot fr wrote:

 There's no standard way to specify even one font or private agreement in
 plain text, let alone how to switch between them within the same
 document.  This is not an intended use of the PUA.
 
 There exists such standard in the context of plain-text rendering,
 because of font fallback mechanisms (in Windows with Uniscribe, such
 fallback mechanism is not tunable per user preferences, as the list of
 alternative fonts that are tried is fixed by the implementation of
 Uniscribe; but anyway it still exists), which implies that multiple
 fonts will be scanned with an order of preference; font fallback is
 involved each time a character is not mapped on the selected font but
 may be mapped in another font.

That's not a way to specify a font.  Neither the creator nor the reader
has control over which fallback font is used.  In any event, if the
document contains a PUA code point that is used with two or more
different intended meanings (srivas's scenario), the engine will surely
pick the same font for both instances.

 Such mechanism is exactly similar to the explicit fallback mechanism
 in CSS (where one provides an ordered comma-separated list of
 font-family names), but that also extends this list of fonts
 automatically using the default font fallback mechanisms used for
 plain-text rendering.

In CSS the author can at least pick the fonts.

 In other words, even if you can't instruct a plain-text to use glyphs
 from one font or from another for the same code point (PUA here), such
 possibility still exists in rich-text rendering, because all glyphs
 can become selectable as variants (including the variants listed in
 the same font for the same glyph, in standardized OpenType features,
 provided that the rich-text application implements such
 glyph-selection mechanism).

Then it's not plain text, which is all I was talking about.

 PUAs are effectively not meant to supply the PUA agreement. This has
 to be provided elsewhere, but a font can perfectly transport this
 agreement (for the font as a whole which is separately selectable,
 just like its designed glyph variants are individually selectable by
 some typographic feature tables, as well as by index, for example
 several swash variants of the same letter with more or less
 decorations).

Not perfectly, unless you think that display is everything.

 If you can use font fallbacks, then you can render the same text
 containing distinct PUAs designed for distinct PUA agreements (and
 this demonstrates the utility of the conscript registry, which allows
 cooperation between authors of separate agreements, that have accepted
 to encode their PUA characters with non-conflicting PUA code point
 assignments).

Coordinating private agreements so they don't conflict is clearly the
ideal situation.  But many different people and organizations have
already claimed the same chunk of PUA space, as Richard exemplified
yesterday with his Taiwan/Hong Kong example.  There is no standard way
to display:

(1) a plain-text file
(2) using only plain-text conventions (i.e. not adding rich text)
(3) which contains the same PUA code point with two meanings
(4) using different fonts or other mechanisms
(5) in a platform-independent, deterministic way

One or more of the numbered items above must be sacrificed.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell






Re: Multiple private agreements (was: RE: Code pages and Unicode)

2011-08-23 Thread Philippe Verdy
2011/8/24 Doug Ewell d...@ewellic.org:
 Coordinating private agreements so they don't conflict is clearly the
 ideal situation.  But many different people and organizations have
 already claimed the same chunk of PUA space, as Richard exemplified
 yesterday with his Taiwan/Hong Kong example.  There is no standard way
 to display:

 (1) a plain-text file
 (2) using only plain-text conventions (i.e. not adding rich text)
 (3) which contains the same PUA code point with two meanings
 (4) using different fonts or other mechanisms
 (5) in a platform-independent, deterministic way

 One or more of the numbered items above must be sacrificed.

The only numbered item to sacrifice is number (3) here. That's the case
where separate PUA agreements are still coordinated so that they don't
use the same PUA assignments. This is the case of PUA agreements in the
Conscript registry.

With only this exception, you can perfectly have separate agreements
(using multiple fonts transporting them), for rendering a plain-text
document. Of course the only parts of the PUA agreement stored in the font
are the set of glyphs and the display properties. Other properties (for
collation, case mappings, text segmentation, and so on...) are not
suitable for being in the font, but they are not needed for correct
editing (without automated case changes) or for correct rendering.




Re: Multiple private agreements (was: RE: Code pages and Unicode)

2011-08-23 Thread Luke-Jr
On Tuesday, August 23, 2011 10:29:58 PM Philippe Verdy wrote:
 2011/8/24 Doug Ewell d...@ewellic.org:
  (3) which contains the same PUA code point with two meanings
 The only numbered item to sacrifice is number (3) here. That's the case
 where separate PUA agreements are still coordinated so that they don't
 use the same PUA assignments. This is the case of PUA agreements in the
 Conscript registry.

Too bad the Conscript registry is censoring assignments the maintainer doesn't 
like for unspecified personal reasons, increasing the chances of an overlap.



Re: Multiple private agreements (was: RE: Code pages and Unicode)

2011-08-23 Thread Philippe Verdy
2011/8/24 Luke-Jr l...@dashjr.org:
 On Tuesday, August 23, 2011 10:29:58 PM Philippe Verdy wrote:
 2011/8/24 Doug Ewell d...@ewellic.org:
  (3) which contains the same PUA code point with two meanings
 The only numbered item to sacrifice is number (3) here. That's the case
 where separate PUA agreements are still coordinated so that they don't
 use the same PUA assignments. This is the case of PUA agreements in the
 Conscript registry.

 Too bad the Conscript registry is censoring assignments the maintainer doesn't
 like for unspecified personal reasons, increasing the chances of an overlap.

It's their choice, their private decision. Nobody is required to
accept the conditions of CSUR. In fact other groups could be created
to coordinate other choices compatible with each other.

Even the UTC could create its own PUA registry, probably coordinating
it with WG2 and with the IRG, for experimenting with new encodings, or
working on proposals, helping document the needed features or
difficulties, and cooperating better with non-technical people that have
good cultural knowledge, or that have access to rare texts or corpora
for which there does not yet exist any digitisation (scans), or
whose digitisation is restricted or not financed, and for which it is
also impossible to create OCR versions.

In order to get financing, some of those projects would need to
exhibit only some fragments, explaining what is found in the rest of
the corpus, using significant samples, but also create new didactic
documents, for which PUAs will be needed if they want to interchange
with something else than handwritten papers, and photocopies or scans
(which are not easy to handle via email or in HTML pages, or to
reproduce).

Such a PUA registry is not required to be stable for extensive periods.
Its content will evolve, so the encoded documents will be valid
for a limited time. This also means that the necessary fonts required
to keep those texts in a legible way (and allow possible future reencoding,
to new PUAs or to standard assignments in the UCS) would have to be
kept with those PUA texts. Those fonts should be clearly versioned,
containing an expected lifetime for which the PUA registry may
warrant some stability (example: the PUA registry would make
assignments only by early leases that would need to be renewed by
interested people).

Note that I clearly want PUA fonts to contain explicitly the
character properties needed for proper rendering, simply because it is
expected that PUA documents will be created and interchanged for a
limited time. There will be almost no transforms of those texts, only
updates to their content via editing.

Now, which font format will be best suited for this work with PUA
texts? Maybe OpenType is not the best fit (tools to create OpenType fonts
are too complex for most users, and often too costly, probably a
consequence of this complexity, which restricts those tools to very
few specialists), when there are simpler formats that are easily
editable with more tools (SVG fonts look promising, even if their
typographic capabilities are not very advanced for now; I just hope
that someday there will be support for this format in more renderers,
even if those fonts are larger in size for fewer glyphs inside; but
this SVG format can be easily zipped into an SVGZ format that is also
recognized automatically).

But some OSes or applications offer simple accessory tools to
create PUA glyphs stored in personal fonts that can be re-edited,
embedded, or uploaded to the recipients of a document needing these
glyphs (this may be used as an extension to input method editors,
notably for entering custom sinograms). Those tools won't let you
create glyphs with perfect metrics, or fonts with ligatures/GSUB
features, or advanced GPOS'itioning. Drawing tools are minimized to
reproduce how we draw basic shapes with the circular head of a pen,
the elliptic head of a pencil, or the thin linear head of some
highlighting pens. Some other tools just let you use a scan to produce
basic shapes.



Re: Code pages and Unicode

2011-08-22 Thread Andrew West
On 21 August 2011 02:14, Richard Wordingham
richard.wording...@ntlworld.com wrote:
 On Fri, 19 Aug 2011 17:03:41 -0700
 Ken Whistler k...@sybase.com wrote:

 O.k., so apparently we have awhile to go before we have to start
 worrying about the Y2K or IPv4 problem for Unicode. Call me again in
 the year 2851, and we'll still have 5 years left to design a new
 scheme and plan for the transition. ;-)

 It'll be much easier to extend UTF-16 if there are still enough
 contiguous points available.  Set that wake-up call for 2790, or
 whenever plane 13 (better, plane 12) is about to come into use.

Stymied by the Unicode® stability policies again:

The General_Category property values will not be further subdivided. 
The General_Category property value Surrogate (Cs) is immutable: the
set of code points with that value will never change.

http://unicode.org/policies/stability_policy.html#Property_Value

Can anyone think of a way to extend UTF-16 without adding new
surrogates or inventing a new general category?

Andrew




Re: Code pages and Unicode

2011-08-22 Thread Shriramana Sharma

On 08/22/2011 03:05 PM, Andrew West wrote:

Can anyone think of a way to extend UTF-16 without adding new
surrogates or inventing a new general category?


Why would anyone *need* to do so? UTF-16 can represent all codepoints
up to Plane 16, right?


--
Shriramana Sharma



Re: Code pages and Unicode

2011-08-22 Thread Andrew West
On 22 August 2011 12:51, Shriramana Sharma samj...@gmail.com wrote:
 On 08/22/2011 03:05 PM, Andrew West wrote:

 Can anyone think of a way to extend UTF-16 without adding new
 surrogates or inventing a new general category?

 Why would anyone *need* to do so? UTF-16 can represent all codepoints up to
 Plane 16, right?

To clarify, I was replying to Richard Wordingham's tongue in cheek
suggestion to extend UTF-16 to go beyond Plane 16 in the year 2790 or
when only one free plane remains.  I am not advocating extending
UTF-16 or the Unicode code space, or suggesting that it will ever be
necessary to do so.

But hypothetically, I don't see a way to extend UTF-16 without
breaking the stability policy.  The same stability policies would also
prohibit the assignment of any area of the Unicode code space for code
page usage as Srivas Sinnathurai has proposed.  (If there was an
automatic filter on ideas that break one or more stability policies
this mailing list would be a far quieter place.)

Andrew



RE: Code pages and Unicode

2011-08-22 Thread Doug Ewell
srivas sinnathurai sisrivas at blueyonder dot co dot uk wrote:

 The true lifting of UTF-16 would be to UTF-32.
 
 Leave UTF-16 untouched and make the new half as versatile as possible.
 
 I think any other solution is just a patch-up for the time being.

There is no evidence whatsoever that this is a problem that needs to be
solved, not in 700 or 800 years, not ever.  Ken's words are again being
ignored.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell






Re: Code pages and Unicode

2011-08-22 Thread John H. Jenkins

Christoph Päper wrote on 20 August 2011 at 2:31 AM:

 Mark Davis ☕:
 
 Under the original design principles of Unicode, the goal was a bit more 
 limited; we envisioned […] a generative mechanism for infrequent CJK 
 ideographs,
 
 I'd still like having that as an option.
 


Et voilà!  We have Ideographic Description Sequences.  Or, if you're more 
ambitious, CDL.  

Generative mechanisms for Han are very attractive given the nature of the 
script, but once you try to support something other than display, or even try 
to write a rendering engine, all sorts of nasty problems crop up that have 
proven difficult to solve.  We won't even get into the problem of wanting to 
discourage people from making up new ad hoc characters for Han. 

I won't say some sort of generative mechanism will never become the preferred 
way of handling unencoded ideographs, but there is a lot of work to be done 
before that would be practical.

=
John H. Jenkins
jenk...@apple.com






Re: Code pages and Unicode

2011-08-22 Thread William_J_G Overington
On Monday 22 August 2011, Andrew West andrewcw...@gmail.com wrote:
 
 Can anyone think of a way to extend UTF-16 without adding new surrogates or 
 inventing a new general category?
 
 Andrew
 
How about a triple sequence of two high surrogates followed by one low 
surrogate?
 
I suggest this as a solution to the problem posed by Andrew because I feel it 
would be interesting to know whether it would be possible or whether it would 
be forbidden by an existing policy that has already been guaranteed to be 
unchangeable.
 
William Overington
 
22 August 2011
 







Re: Code pages and Unicode

2011-08-22 Thread Jean-François Colson

On 22/08/11 16:55, Doug Ewell wrote:

srivas sinnathurai sisrivas at blueyonder dot co dot uk wrote:


The true lifting of UTF-16 would be to UTF-32.

Leave UTF-16 untouched and make the new half as versatile as possible.

I think any other solution is just a patch-up for the time being.

There is no evidence whatsoever that this is a problem that needs to be
solved, not in 700 or 800 years, not ever.  Ken's words are again being
ignored.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell


I see at least one reason to extend the present 17-plane Unicode space: 
it would provide space for an RTL PUA. ☺


Presently, UTF-16 uses surrogate pairs to address non-BMP characters: HS 
LS (High Surrogate followed by Low Surrogate).


What would happen if we interleaved them? Would HS1 HS2 LS1 LS2 be 
acceptable for addressing more characters?
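For concreteness, a minimal Python sketch of the standard surrogate-pair 
arithmetic, plus a decoder for such an HS1 HS2 LS1 LS2 quadruple. The quadruple 
scheme is purely hypothetical, defined nowhere, and the base offset above 
Plane 16 is an arbitrary assumption for illustration:

def utf16_encode_pair(cp):
    # Standard UTF-16: U+10000..U+10FFFF -> (high surrogate, low surrogate)
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def utf16_decode_pair(hs, ls):
    return 0x10000 + ((hs - 0xD800) << 10) + (ls - 0xDC00)

def decode_quadruple(hs1, hs2, ls1, ls2):
    # HYPOTHETICAL and non-standard: four surrogates carry 4 x 10 = 40
    # payload bits; 0x110000 is an assumed base just above Plane 16.
    # Note that (hs2, ls1) on its own also looks like an ordinary pair,
    # which is the searching problem raised elsewhere in this thread.
    payload = ((hs1 - 0xD800) << 30) | ((hs2 - 0xD800) << 20) | \
              ((ls1 - 0xDC00) << 10) | (ls2 - 0xDC00)
    return 0x110000 + payload

print(hex(utf16_decode_pair(*utf16_encode_pair(0x10FFFD))))   # 0x10fffd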




Re: Code pages and Unicode

2011-08-22 Thread Jean-François Colson

On 20/08/11 02:03, Ken Whistler wrote:

O.k., so apparently we have awhile to go before we have to start worrying
about the Y2K or IPv4 problem for Unicode. Call me again in the
year 2851, and we'll still have 5 years left to design a new scheme and plan
for the transition. ;-)

--Ken


I wonder whether you aren’t a little too optimistic.

Have you considered the unencoded ideographic scripts?

1,071 hieroglyphs have already been encoded. I think there are 
approximately 4,000 more to encode.


1,165 Yi syllables and 55 Yi radicals have been encoded. But they only 
support one dialect of Yi, and I have read that there are tens of thousands 
of Yi ideographs and that a proposal to encode 88,613 classical Yi characters 
was made 4 years ago.


The threshold of 200,000 characters doesn’t seem very far off.



Re: Code pages and Unicode

2011-08-22 Thread Ken Whistler

On 8/22/2011 9:58 AM, Jean-François Colson wrote:

I wonder whether you aren’t a little too optimistic.


No. If anything I'm assuming that the folks working on proposals will
be amazingly assiduous during the next decade.



Have you considered the unencoded ideographic scripts?


Why, yes I have.



1,071 hieroglyphs have already been encoded. I think there are 
approximately 4,000 more to encode.


A preliminary listing of 4548 additional hieroglyphs, based on 
Hieroglyphica (1993), was
presented to WG2 in 1999. Twelve years have passed, and no additional 
document has
been forthcoming to work through the issues in standardizing such a list 
as characters.

I won't hold my breath, but somebody *might* get through that work by 2021.



1,165 Yi syllables and 55 Yi radicals have been encoded. But they only 
support one dialect of Yi and I read there are tens of thousands of Yi 
ideographs and that a proposal to encode 88,613 classical Yi 
characters was made 4 years ago.


88,613 classical Yi *glyphs*. This is just a collection of every glyph 
form noted from wherever. Even the proponents acknowledged that it was more on the
order of maybe 7000 *characters* involved. They got feedback to do the 
homework
to work through the character/glyph model for classical Yi, and come 
back when
they have a documented, reliable listing of the Yi *characters* that 
need encoding,
together with the list of variants for each character. Given the nature 
and scope of the work, and no (current) indication of the progress being 
made, this also *might* get done by 2021.



The threshold of 200,000 characters doesn’t seem very far.


Nah. It is still way over the extended horizon. The only big historic 
ideographic script that is close to being done is Tangut, and the wrangling 
even over that one has gone on for years now.

--Ken





Re: Code pages and Unicode

2011-08-22 Thread Richard Wordingham
On Mon, 22 Aug 2011 14:06:00 +0100 (BST)
William_J_G Overington wjgo_10...@btinternet.com wrote:

 On Monday 22 August 2011, Andrew West andrewcw...@gmail.com wrote:
  
  Can anyone think of a way to extend UTF-16 without adding new
  surrogates or inventing a new general category?
  
  Andrew
  
 How about a triple sequence of two high surrogates followed by one
 low surrogate? 

The problem is that a search for the character represented by the code
unit sequence (H2,L3) would also pick up the sequence (H1,H2,L3).
While there is no ambiguity, it does make searching more complicated
to code.  The same issue applies to the suggestion of using
(H1,H2,L3,L4) sequences.
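To make that concrete, a small Python sketch (the specific code unit values 
are arbitrary, for illustration only):

H1, H2, L3 = 0xD801, 0xD802, 0xDC03        # arbitrary surrogate code units
text   = [0x0041, H1, H2, L3, 0x0042]      # hypothetical extended-UTF-16 text
needle = [H2, L3]                          # an ordinary surrogate pair

def find(haystack, needle):
    # a naive code-unit search, as a simple searcher might implement it
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return i
    return -1

print(find(text, needle))   # prints 2: a false match inside the triple,
                            # unless the searcher also checks that the
                            # preceding unit is not a high surrogate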

Now, we could use (H1,H2,L3,L4) sequences and never assign the (H2,L3)
combinations.  They would therefore be category Cn, which currently
consists of both the unassigned characters and the non-characters.
However, I can't help feeling that they'd be almost a sort of
surrogate.  It's slightly more efficient to replace L3 by a single BMP
character.

Practically, I think that if we can change the semantics of the Myanmar
script, our descendants can go back on the guarantee of no more
surrogates.

Richard.



Re: Code pages and Unicode

2011-08-22 Thread Ken Whistler

On 8/22/2011 3:15 PM, Richard Wordingham wrote:

On Monday 22 August 2011, Andrew Westandrewcw...@gmail.com  wrote:
  

Can anyone think of a way to extend UTF-16 without adding new
surrogates or inventing a new general category?

Andrew
  
  How about a triple sequence of two high surrogates followed by one

  low surrogate?


How about Clause 12.5 of ISO/IEC 10646:

001B, 0025, 0040

You escape out of UTF-16 to ISO 2022, and then you can do whatever the
heck you want, including exchange and processing of complete 4-byte forms,
with all the billions of characters folks seem to think they need.
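Spelled out, those are just three BMP code units (ESC, PERCENT SIGN, 
COMMERCIAL AT). A small Python sketch of how they would appear in a UTF-16BE 
stream; everything after them is then read under whatever coding system has 
been announced, not as UTF-16:

escape = "\u001b\u0025\u0040".encode("utf-16-be")   # U+001B U+0025 U+0040
print(escape.hex(" "))                              # 00 1b 00 25 00 40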

Of course you would have to convince implementers to honor the ISO 2022
escape sequence and liberate themselves into a high-level world of nosebleed
character numerosity. But then I guess by the time this is needed, folks are
counting on the need being self-evident. ;-)

--Ken




Re: Code pages and Unicode

2011-08-22 Thread Jean-François Colson

On 23/08/11 00:15, Richard Wordingham wrote:

The problem is that a search for the character represented by the code
unit sequence (H2,L3) would also pick up the sequence (H1,H2,L3).
While there is no ambiguity, it does make searching more complicated
to code.  The same issue applies to the suggestion of using
(H1,H2,L3,L4) sequences.


And what do you think about (H1,H2,VS1,L3,L4)?




Re: Code pages and Unicode

2011-08-20 Thread srivas sinnathurai
About the research works:

I alone (along with my colleagues) am researching the fact that
Sumerian is Tamil / Tamil is Sumerian. This requires quite a lot of space.

Additionally, I do research on the Tamil alphabet as based on scientific
definitions: it only represents the mechanical parts, i.e. only the places
of articulation, as an alphabet, and is not sound-based. And what is
called a mathematical multiplier theory for expanding the alphabets leads
not just to long mathematics (nedung kaNaku) but also to extra-long
mathematics.

This is just a sample requirement from me and my colleagues. How many others
are there who would require Unicode support? Do you think allocating 32,000
codes to the code page model would help?

Regards
Sinnathurai

On 20 August 2011 09:31, Christoph Päper christoph.pae...@crissov.de wrote:

 Mark Davis ☕:

  Under the original design principles of Unicode, the goal was a bit more
 limited; we envisioned […] a generative mechanism for infrequent CJK
 ideographs,

 I'd still like having that as an option.




Re: Code pages and Unicode

2011-08-20 Thread Doug Ewell
It sounds like you’re trying to encode glyphs or glyph fragments, not 
characters.  There is a virtually endless repertoire of “shapes” that could be 
encoded, but unless each of these is a character actually used in a writing 
system (not just hypothetically), it’s probably not appropriate for a character 
encoding.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell



From: srivas sinnathurai 
Sent: Saturday, August 20, 2011 3:35
To: Christoph Päper 
Cc: unicode@unicode.org 
Subject: Re: Code pages and Unicode

About the research works:

I alone (along with my colleagues) am researching the fact that
Sumerian is Tamil / Tamil is Sumerian. This requires quite a lot of space.

Additionally, I do research on the Tamil alphabet as based on scientific 
definitions: it only represents the mechanical parts, i.e. only the places of 
articulation, as an alphabet, and is not sound-based. And what is called a 
mathematical multiplier theory for expanding the alphabets leads not just to 
long mathematics (nedung kaNaku) but also to extra-long mathematics.

This is just a sample requirement from me and my colleagues. How many others 
are there who would require Unicode support? Do you think allocating 32,000 
codes to the code page model would help?

Regards
Sinnathurai 



Re: Code pages and Unicode

2011-08-20 Thread Richard Wordingham
On Fri, 19 Aug 2011 17:03:41 -0700
Ken Whistler k...@sybase.com wrote:

 O.k., so apparently we have awhile to go before we have to start
 worrying about the Y2K or IPv4 problem for Unicode. Call me again in
 the year 2851, and we'll still have 5 years left to design a new
 scheme and plan for the transition. ;-)

It'll be much easier to extend UTF-16 if there are still enough
contiguous points available.  Set that wake-up call for 2790, or
whenever plane 13 (better, plane 12) is about to come into use.

Richard.



Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)

2011-08-19 Thread srivas sinnathurai
Doug,

First of all, flat code space is the primary functionality of Unicode, and I
am not calling for any changes to existing encodings.

What I propose is to assign about 16,000 codes to a code-page switching model.

Why this suggestion?
With the current flat space, one code point is only ever allocated to one and
only one purpose.
We can run out of code space soon.

While processing the contemporary languages and other things like mathematical
symbols in flat space, the 16,000 codes in the portion that is code-page
switchable would be able to support thousands of different characters on each
code.
I.e., take 16 codes: flat space only supports 16 characters, but with code
pages they can support 16 different purposes, each with a capacity of 14
characters; that is 140 characters instead of just 10 flat characters.
Sinnathurai
On 19 August 2011 15:27, Doug Ewell d...@ewellic.org wrote:

 srivas sinnathurai sisrivas at blueyonder dot co dot uk wrote:

  PUA is not structured

 It's not supposed to be.  It's a private-use area.  You use it the way
 you see fit.

  and not officially programmable to accommodate
  numerous code pages.

 None of Unicode is designed around code-page switching.  It's a flat
 code space.  This is true even for ISO 10646, which nominally divides
 the space into groups and planes and rows.

 As a programmer, I don't understand what "not officially programmable"
 means here.  I've written lots of programs that use and understand the
 PUA.

  Take the ISO 8859-1, 2, 3, and so on .
  These are now allocating the same code points to many languages and
  for other purposes.

 Character encodings don't allocate code points to languages.  They
 allocate code points to characters, which are used to write text in
 languages.  This is not a trivial distinction; it is crucial to
 understanding how character encodings work.

  Similarly, a structured and official allocations to any many
  requirements can be done using the same codes, say 16,000 of them.

 If you want to use ISO 2022, just use ISO 2022.

 I guess what I'm missing is why the code-page switching model is
 considered superior, in any way, to the flat code space of
 Unicode/10646.

 --
 Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell





RE: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)

2011-08-19 Thread Doug Ewell
srivas sinnathurai sisrivas at blueyonder dot co dot uk wrote:

 Why this suggestion?
 With current flat space, one code point is only allocated to one and
 only one purpose.
 We can run out of code space soon.

Argument over.  There are not 800,000 more characters that need to be
encoded for storage or interchange.  There may well be 800,000 glyphs,
or images, or meanings, but that is not what any character encoding
standard is for.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell






Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)

2011-08-19 Thread John H. Jenkins

srivas sinnathurai wrote on 19 August 2011 at 9:40 AM:

 Why this suggestion?
 With current flat space, one code point is only allocated to one and only one 
 purpose.
 We can run out of code space soon.
 


There are a couple of problems here.

We currently have over 860,000 unassigned code points.  Surveys of all known 
writing systems indicate that only a small fraction of these will be needed.  
Indeed, although it looks likely that Han will spill out of the SIP into plane 
3, all non-Han will likely fit into the SMP.  (Michael, you can correct me on 
this if I'm wrong.)

Even if we allow for the possibility that there are a lot of writing systems 
out there we don't know about, there would have to be a *lot* of writing 
systems out there we don't know about to fill up planes 4 through 14.  If the 
average script requires 256 code points, there would have to be some 2800 
unencoded scripts to do that.  
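The back-of-envelope arithmetic behind that figure, as a quick check:

planes = 14 - 4 + 1               # planes 4 through 14 inclusive: 11 planes
code_points = planes * 65536      # 720,896 code points
print(code_points // 256)         # 2816 hypothetical 256-character scripts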

Moreover, it's taken us 20 years to use 250,000 code points.  Even if that rate 
remained steady (and it's been going down), it will take us something on the 
order of a century to fill up the remaining space, if that's even possible, and 
that hardly qualifies as soon.

And there already is a code page switching mechanism such as you propose.  It's 
called ISO 2022 and it supports Unicode.  

In order to get the UTC and WG2 to agree to a major architectural change such 
as you're suggesting, you'd have to have some very solid evidence that it's 
needed—not an interesting idea, not potentially useful, but seriously *needed*. 
 That's how surrogates and the astral planes came about—people came up with 
solid figures showing that 65,536 code points was not nearly enough.  So far, 
the evidence suggests that we're in no danger of running out of code points.  

=
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com




Re: Code pages and Unicode

2011-08-19 Thread Christoph Päper
John H. Jenkins:

 there would have to be a *lot* of writing systems out there we don't know 
 about to fill up planes 4 through 14

That’s quite possible, though; the universe is huge. The question, rather, is 
whether we will ever know about them. It’s quite possible we won’t.



RE: Code pages and Unicode

2011-08-19 Thread Doug Ewell
Maybe we should step back a bit:

 I'm not calling for any change to existing major allocations. However,
 it is about time we allocate (not in the PUA) a large number of codes to
 code-page-based sub-codes, so that all 7000+ languages can
 freely use them without INTERFERENCE from Unicode and have the freedom
 to carry out research works, like we were doing with the legacy 8-bit
 codes.

Can you provide some detail about these research works that have to do
with encoding characters and are projected to require more than the
137,468 code points available in the PUA?  What sort of INTERFERENCE
from Unicode needs to be avoided, and are we talking about encoding or
architectural decisions or what?  (There must be a GREAT DEAL of
perceived interference here, because of the capital letters.)
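For reference, the 137,468 figure is simply the three Private Use Areas added
together; a quick check:

bmp_pua  = 0xF8FF - 0xE000 + 1        # U+E000..U+F8FF     =  6,400
plane_15 = 0xFFFFD - 0xF0000 + 1      # U+F0000..U+FFFFD   = 65,534
plane_16 = 0x10FFFD - 0x100000 + 1    # U+100000..U+10FFFD = 65,534
print(bmp_pua + plane_15 + plane_16)  # 137468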

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell






Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)

2011-08-19 Thread Michael Everson
On 19 Aug 2011, at 18:24, John H. Jenkins wrote:

 We currently have over 860,000 unassigned code points.  Surveys of all known 
 writing systems indicate that only a small fraction of these will be needed.  
 Indeed, although it looks likely that Han will spill out of the SIP into 
 plane 3, all non-Han will likely fit into the SMP.  (Michael, you can correct 
 me on this if I'm wrong.)

I wouldn't like to guarantee that non-Han won't spill over out of the SMP, but 
I doubt we'd fill Plane 4.

Michael Everson * http://www.evertype.com/





Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)

2011-08-19 Thread Mark E. Shoulson

On 08/19/2011 01:24 PM, John H. Jenkins wrote:


In order to get the UTC and WG2 to agree to a major architectural 
change such as you're suggesting, you'd have to have some very solid 
evidence that it's needed—not an interesting idea, not potentially 
useful, but seriously *needed*. That's how surrogates and the astral 
planes came about—people came up with solid figures showing that 
65,536 code points was not nearly enough. So far, the evidence 
suggests that we're in no danger of running out of code points.


And indeed, it went the other way too, back when ISO-10646 had not 17, 
but 65536 *planes* and someone provided some reasonable evidence (or 
just plain reasoned arguments) that 4.3 *billion* characters was 
probably overkill.


~mark



RE: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)

2011-08-19 Thread Doug Ewell
Mark E. Shoulson mark at kli dot org wrote:

 And indeed, it went the other way too, back when ISO-10646 had not 17, 
 but 65536 *planes* and someone provided some reasonable evidence (or 
 just plain reasoned arguments) that 4.3 *billion* characters was 
 probably overkill.

Technically, I think 10646 was always limited to 32,768 planes so that
one could always address a code point with a 32-bit signed integer (a
nod to the Java fans).

Of course, 2.1 billion characters is also overkill, but the advent of
UTF-16 was how we ended up with 17 planes.
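The arithmetic behind the 17 planes, as a quick check:

bmp  = 0x10000          # Plane 0: 65,536 code points
supp = 1024 * 1024      # 1,024 high surrogates x 1,024 low surrogates
print(bmp + supp)                 # 1114112 code points
print((bmp + supp) // 0x10000)    # = 17 planes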

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell






Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)

2011-08-19 Thread Jukka K. Korpela

20.8.2011 0:07, Doug Ewell wrote:


Of course, 2.1 billion characters is also overkill, but the advent of
UTF-16 was how we ended up with 17 planes.


And now we think that a little over a million is enough for everyone, 
just as they thought in the late 1980s that 16 bits is enough for everyone.


--
Yucca, http://www.cs.tut.fi/~jkorpela/



Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)

2011-08-19 Thread Mark E. Shoulson

On 08/19/2011 05:07 PM, Doug Ewell wrote:

Mark E. Shoulsonmark at kli dot org  wrote:

And indeed, it went the other way too, back when ISO-10646 had not 17,
but 65536 *planes* and someone provided some reasonable evidence (or
just plain reasoned arguments) that 4.3 *billion* characters was
probably overkill.

Technically, I think 10646 was always limited to 32,768 planes so that
one could always address a code point with a 32-bit signed integer (a
nod to the Java fans).

Whew!  So I guess it wasn't THAT many characters anyway... :)

(Like Hofstadter's story about the professor who says that she 
calculates that the sun will burn out in 5 billion years.  A nervous 
voice in the back of the room asks "h-how soon again?"  "5 billion 
years."  "Whew!" says the voice, sounding relieved.  "For a minute I 
thought you said only 5 *million*.")


~mark



Re: Code pages and Unicode

2011-08-19 Thread Benjamin M Scarborough
On 20 Aug 2011, at 00:35, Jukka K. Korpela wrote:
And now we think that a little over a million is enough for everyone,
just as they thought in the late 1980s that 16 bits is enough for everyone. 

Whenever somebody talks about needing 31 bits for Unicode, I always think of 
the hypothetical situation of discovering some extraterrestrial civilization 
and trying to add all of their writing systems to Unicode. I imagine there 
would be little to unify outside of U+002E FULL STOP.

The point I'm getting at is that somebody always claims that U+0000..U+10FFFF 
isn't enough, but I never see convincing evidence or rationale that an 
expansion is necessary—just speculation.

—Ben Scarborough




RE: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)

2011-08-19 Thread Doug Ewell
Jukka K. Korpela jkorpela at cs dot tut dot fi wrote:

 And now we think that a little over a million is enough for everyone, 
 just as they thought in the late 1980s that 16 bits is enough for
 everyone.

I know this is an enjoyable exercise — people love to ridicule Bill
Gates for his comment in 1981 about 640K, even though that was an order
of magnitude larger than any home computer of the day — but every time
I hear someone protest that the Unicode code space won't be large
enough, it eventually comes down to one of:

1. Expanding scope to cover extraterrestrial characters
2. Expanding scope to cover glyphs or other things that aren't
   currently considered characters

I don't worry about item 1.  I suppose I should worry some about item 2,
ever since the emoji experience.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell






Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)

2011-08-19 Thread Ken Whistler

On 8/19/2011 2:07 PM, Doug Ewell wrote:

Technically, I think 10646 was always limited to 32,768 planes so that
one could always address a code point with a 32-bit signed integer (a
nod to the Java fans).


Well, yes, but it didn't really have anything to do with Java. Remember 
that Java wasn't released until 1995, but the 10646 architecture dates back 
to circa 1986. So more likely it was a nod to C implementations which would, 
it was supposed, have implemented the 2-, 3-, or 4-octet forms of 10646 with 
a wchar_t, and which would have wanted a signed 32 bit type to work. I 
suspect, by the way, that that limitation was probably originally brought to 
WG2 by the U.S. national body, as they would have been the ones most worried 
about the C implementations of 10646 multi-octet forms.

And the original architecture was also not really a full 32K planes in 
the sense that we now understand planes for Unicode and 10646. The original 
design for 10646 was for a 1- to 4-octet encoding, with all octets conforming 
to the ISO 2022 specification. It used the option that the working sets for 
the encoding octets would be the 94-unit ranges. So for G0: 0x21..0x7E and
for G1: 0xA1..0xFE. The other bytes C0, 0x20, 0x7F, C1, 0xA0, 0xFF, were
not used except for the single-octet form, as in 2022-conformant schemes
still used today for some East Asian character encodings.

And the octets were then designated G (group) P (plane) R (row) and C.

The 1-octet form thus allowed 95 + 96 = 191 code positions.

The 2-octet form thus allowed (94 + 94)^2 = 35,344 code positions

The 3-octet form thus allowed (94 + 94)^3 = 6,644,672 code positions

The Group octet was constrained to the low set of 94. (This is the origin
of the constraint to half the planes, which would keep wchar_t 
implementations out of negative signed range.)

The 4-octet form thus allowed 94 * (94 +94)^3 = 624,599,168 code positions

The grand total for all possible forms was the sum of those values or:

*631,279,375* code positions

(before various *other* set-asides for plane swapping and private
use start getting taken into account)
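A quick check of those totals, using the working-set sizes given above:

one_octet   = 95 + 96                  # 191
two_octet   = (94 + 94) ** 2           # 35,344
three_octet = (94 + 94) ** 3           # 6,644,672
four_octet  = 94 * (94 + 94) ** 3      # 624,599,168 (group octet limited to 94)
print(one_octet + two_octet + three_octet + four_octet)   # 631279375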



Of course, 2.1 billion characters is also overkill, but the advent of
UTF-16 was how we ended up with 17 planes.


So a lot less than 2.1 billion characters. But I think Doug's point is 
still valid:

631 million plus code points was still overkill for the problem to
be addressed.

And I think that we can thank our lucky stars that it isn't *that* 
architecture for a universal character encoding that we would now be 
implementing and debating on the alternative universe version of this 
email list. ;-)

--Ken



Re: Code pages and Unicode

2011-08-19 Thread John H. Jenkins

Benjamin M Scarborough wrote on 19 August 2011 at 3:53 PM:

 Whenever somebody talks about needing 31 bits for Unicode, I always think of 
 the hypothetical situation of discovering some extraterrestrial civilization 
 and trying to add all of their writing systems to Unicode. I imagine there 
 would be little to unify outside of U+002E FULL STOP.

Oh, I imagine they'll have one or two turtle ideographs.  :-)

Seriously, though, if and when we run into ETs with all their myriad writing 
systems, I really don't think that we'll be using Unicode to represent them.

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: Code pages and Unicode

2011-08-19 Thread Ken Whistler

On 8/19/2011 2:53 PM, Benjamin M Scarborough wrote:

Whenever somebody talks about needing 31 bits for Unicode, I always think of 
the hypothetical situation of discovering some extraterrestrial civilization 
and trying to add all of their writing systems to Unicode. I imagine there 
would be little to unify outside of U+002E FULL STOP.


It is the *terrestrial* denizens of this discussion list that I worry 
more about. Most of the proposals for filling up uncountable planes with 
numbers representing -- well, who knows? -- originate here. ;-)



The point I'm getting at is that somebody always claims that U+0000..U+10FFFF 
isn't enough, but I never see convincing evidence or rationale that an 
expansion is necessary—just speculation.



Well, it is a late Friday afternoon in August. A slow news day, I guess.

So it is time to trot out the periodically updated statistics that long ago
convinced the folks who think 21 bits is just fine and dandy, and has a 
usefulness warranty that far exceeds our lifetimes, but which of course no 
matter how often repeated never convince the we-need-31-bits crowd.

Newly updated to include the Unicode 6.1 repertoire in process for 
publication very early next year, the figures are:

110,181 characters encoded (graphic, format, and control codes counted)

Now let's just assign that number an era of 2011, to make the math a 
little simpler.


The first version of Unicode was published in 1991, so we've been at 
this for 20 years, not counting start up time. If you just divide 110,181 
by 20 years, that is a rough average of 5509 characters added per year.

But here is the interesting part: the rate of inclusion is declining, 
rather than being steady. Again, to make the math simpler, just compare the 
*first* decade of Unicode (1991 - 2001) and the *second* decade of Unicode 
(2001 - 2011). Unicode 3.1 (2001) had 94,205 characters in it. So:

1st decade: 94,205 characters, or roughly 9420 characters/year

2nd decade: 15,976 characters, or roughly 1598 characters/year
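The per-year figures follow directly from the repertoire counts; a quick check:

total_2011 = 110181     # Unicode 6.1 repertoire, as counted above
total_2001 = 94205      # Unicode 3.1
print(round(total_2011 / 20))                  # ~5509 per year over 20 years
print(round(total_2001 / 10))                  # ~9420 per year, first decade
print(round((total_2011 - total_2001) / 10))   # ~1598 per year, second decade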

Also keep in mind that the absolute numbers have always been completely
dominated by CJK. 75.46% of the characters encoded in Unicode 3.1
are CJK ideographs (unified and compatibility). The IRG has been working
mightily to keep adding to the total of encoded CJK ideographs, but they
are starting to scrape the bottom even of that deep barrel.

And look at the SMP Roadmap:

http://www.unicode.org/roadmaps/smp/

We know there are a few big historic ideographic scripts to go: Tangut 
is the biggest and most advanced of the proposals, weighing in at something 
over 7000 characters. But even with East Asian heavyweights like Tangut, 
Jurchen, and Khitan given tentative allocations on the SMP roadmap, there is 
plenty of unassigned air on Plane 1 still. And frankly, a lot of very serious 
people have been looking hard for good, encodable candidate scripts to add to 
the roadmap, for a very long time.

The upshot is, based on 20 years in the business, as it were, my best
estimate of what we can expect for the next decade is something as follows:

Two big chunks: roughly 10K more CJK ideographs nobody ever heard of,
plus 7K+ Tangut ideographs. After that, the two committees (UTC and WG2)
will be hard pressed to find and process many more than 1000 characters per
year. Why? Because all the *easy* stuff was done long ago, during the
first decade of Unicode. Everything from here on out is very obscure, hard
to research, hard to document and review, hard to get consensus on, and
is often fragmentary or even undeciphered, or consists of sets of notations
that many folks won't even agree *are* characters.

So: 10K + 7K + 1k/year for 10 years = 27,000 *maximum* additions by 2021.

And that is to fill the gaping hole -- nay, gigantic chasm -- of 862,020 
unassigned code points still left in the 21-bit space.

Past 2021, who knows? Many of us will no longer be participating by then,
but there are various possible scenarios:

1. The committees may creak to a halt, freeze the standards, and the
delta encoding rate will drop from 1000/year to 0/year. This is actually
a scenario with a non-zero probability.

2. Somebody with non-character agendas may seize control and start using
numbers for, I don't know, perhaps localizable sentences, or something, just
because over 835,000 numbers will be available and nature abhors a
vacuum. I consider that a very low likelihood, because of the enormous
vested interest there will be by the entire worldwide IT industry in keeping
the character encoding standard stable.

3. Or, the committees may limp along more or less indefinitely, with more and
more obscure scripts being documented and standardized, with a trickle
of new ones always being invented, and new sets of symbols or notations
being invented and stuck in. So maybe they could keep up the pace
of 1000 characters encoded per year for some time off into the future.
But at that rate, when do we have to start worrying? 835,000 divided by
1000 per year works out to more than 800 years from now.

Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)

2011-08-19 Thread Asmus Freytag

On 8/19/2011 2:35 PM, Jukka K. Korpela wrote:

20.8.2011 0:07, Doug Ewell wrote:


Of course, 2.1 billion characters is also overkill, but the advent of
UTF-16 was how we ended up with 17 planes.


And now we think that a little over a million is enough for everyone, 
just as they thought in the late 1980s that 16 bits is enough for 
everyone.




The difference is that these early plans were based on rigorously *not* 
encoding certain characters, or using combining methodology or variation 
selection much more aggressively. That might have been more feasible, 
except for the needs of migrating software and having Unicode-based 
systems play nicely in a world where character sets had different ideas 
of what constitutes a character.


Allowing thousands of characters for compatibility reasons, more than 
ten thousand precomposed characters, and many types of other characters 
and symbols not originally on the radar still has not inflated the 
numbers all that much. The count stands at roughly double that original 
goal, after over twenty years of steady accumulation.


Was the original concept of being able to shoehorn the world into 
sixteen bits overly aggressive? Probably, because the estimates had 
always been that there are about a quarter million written elements. 
If you took the current repertoire and used code-space-saving techniques 
in hindsight, you might be able to create something that fits into 
16 bits. But it would end up using strings for many things that are now 
single characters.


But the numbers, so far, show that this original estimate of a quarter 
million, rough as it was, appears to be rather accurate. Over twenty 
years of encoding characters have not been enough to exceed that.


The million code points are therefore a much more comfortable limit 
and, from the beginning, assume a ceiling that has ample headroom (as 
opposed to the "can we fit the world in this shoebox" approach of 
earlier designs).


So, no, the two cases are not as comparable.

A./




Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)

2011-08-19 Thread Asmus Freytag

On 8/19/2011 3:24 PM, Ken Whistler wrote:

On 8/19/2011 2:07 PM, Doug Ewell wrote:

Technically, I think 10646 was always limited to 32,768 planes so that
one could always address a code point with a 32-bit signed integer (a
nod to the Java fans).


Well, yes, but it didn't really have anything to do with Java. 
Remember that Java
wasn't released until 1995, but the 10646 architecture dates back to 
circa 1986.


Yep.

So more likely it was a nod to C implementations which would, it was 
supposed, have implemented the 2-, 3-, or 4-octet forms of 10646 with a 
wchar_t, and which would have wanted a signed 32 bit type to work. I suspect, 
by the way, that that limitation was probably originally brought to WG2 by 
the U.S. national body, as they would have been the ones most worried about 
the C implementations of 10646 multi-octet forms.


No, it was the Japanese NB, as represented by the individual from Toppan 
Printing.


This limitation was insisted upon in 1991, after the accord on the 
merger between Unicode and 10646, when 10646 was changed to use a flat 
codespace, not the ISO 2022-like scheme.



And the original architecture was also not really a full 32K planes in 
the sense that we now understand planes for Unicode and 10646. The original 
design for 10646 was for a 1- to 4-octet encoding, with all octets conforming 
to the ISO 2022 specification. It used the option that the working sets for 
the encoding octets would be the 94-unit ranges. So for G0: 0x21..0x7E and
for G1: 0xA1..0xFE. The other bytes C0, 0x20, 0x7F, C1, 0xA0, 0xFF, were
not used except for the single-octet form, as in 2022-conformant schemes
still used today for some East Asian character encodings.

And the octets were then designated G (group) P (plane) R (row) and C.

The 1-octet form thus allowed 95 + 96 = 191 code positions.

The 2-octet form thus allowed (94 + 94)^2 = 35,344 code positions

The 3-octet form thus allowed (94 + 94)^3 = 6,644,672 code positions

The Group octet was constrained to the low set of 94. (This is the origin
of the constraint to half the planes, which would keep wchar_t 
implementations out of negative signed range.)

The 4-octet form thus allowed 94 * (94 +94)^3 = 624,599,168 code positions

The grand total for all possible forms was the sum of those values or:

*631,279,375* code positions

(before various *other* set-asides for plane swapping and private
use start getting taken into account)


This was so mind-bogglingly complicated that it was a deal-breaker for 
many companies. Unicode's more restrictive concept of a character, its 
combining technology, and its many other innovations weren't initially seen 
as its primary benefits by people faced with evaluating the 
differences between the formal ISO-backed project and the de facto 
industry collaboration forming around Apple and Xerox. But the flat code 
space: now you were talking.



Of course, 2.1 billion characters is also overkill, but the advent of
UTF-16 was how we ended up with 17 planes.


So a lot less than 2.1 billion characters. But I think Doug's point is 
still valid:

631 million plus code points was still overkill for the problem to
be addressed.

And I think that we can thank our lucky stars that it isn't *that* 
architecture for a universal character encoding that we would now be 
implementing and debating on the alternative universe version of this 
email list. ;-)


Even remembering it makes my head hurt.

A./