Re: Code pages and Unicode
On 8/24/2011 7:45 PM, Richard Wordingham wrote: Which earlier coding system supported Welsh? (I'm thinking of 'W WITH CIRCUMFLEX', U+0174 and U+0175.) How was the use of the canonical decompositions incompatible with the character encodings of legacy systems? Latin-1 has the same codes as ISO-8859-1, but that's as far as having the same codes goes. Was the use of combining jamo incompatible with legacy Hangul encodings? See, how time flies. Early adopters were interested in 1:1 transcoding, using a single 256-entry table for an 8-bit character set, with guaranteed predictable length. Early designs of Unicode (and 10646) attempted to address these concerns, because they promised severe impediments to migration. Some characters were included as part of the merger, without the same rigorous process as is in force for characters today. At that time, scuttling the deal over a few characters here or there would not have been a reasonable action. So you will always find some exceptions to many of the principles - which doesn't make them less valid. Obviously D800 D800 000E DC00 is non-conformant with current UTF-16. Remembering that there is a guarantee that there will be no more surrogate code points, an extension form has to be non-conformant with current UTF-16! And that's the reason why there's no interest in this part of the discussion. Nobody will need an extension next Tuesday, or in a decade or even in several decades - or ever. Haven't seen an upgrade to Morse code recently to handle Unicode, for example. Technology has a way of moving on. So, the best thing is to drop this silly discussion, and let those future people that might be facing a real *requirement* use their good judgment to come to a technical solution appropriate to their time - instead of wasting collective cycles discussing how to make 1990s technology work for an unknown future requirement. It's just bad engineering. Everyone should know how to extend UTF-8 and UTF-32 to cover the 31-bit range. I disagree (as would anyone with a bit of long-term perspective). Nobody needs to look into this for decades, so let it rest. A./
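The non-conformance is easy to see in practice. A small illustrative check (Python is used here merely as a convenient strict UTF-16 decoder; nothing in it comes from the original message): the sequence D800 D800 000E DC00 is rejected, while a well-formed surrogate pair decodes normally.

    # Illustrative check that current, strict UTF-16 rejects <D800 D800 000E DC00>.
    import struct

    units = [0xD800, 0xD800, 0x000E, 0xDC00]         # the hypothetical extension sequence
    data = struct.pack('<4H', *units)                # serialize as UTF-16LE code units

    try:
        data.decode('utf-16-le')                     # strict decoding
    except UnicodeDecodeError as exc:
        print('rejected:', exc)                      # high surrogate not followed by a low one

    # A well-formed pair decodes normally to U+10000:
    print(hex(ord(struct.pack('<2H', 0xD800, 0xDC00).decode('utf-16-le'))))   # 0x10000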
RE: Code pages and Unicode
+1 I'm also guilty of pushing through one particular proposal (much to Ken's disliking) that I most certainly would no longer even try, but, alas, times were different. Sincerely, Erkki
RE: Code pages and Unicode
On Tuesday 23 August 2011, Doug Ewell d...@ewellic.org wrote: Asmus Freytag asmusf at netcom dot com wrote: Until then, I find further speculation rather pointless and would love if it moved off this list (until such time). +1 -0.7 It is harmless fun, indeed it is fun that assists learning and understanding, and so as long as it does not go on for a long time, I think that it is good. http://www.unicode.org/policies/mail_policy.html quote A mail list is also a social organization, and as such, there will inevitably be some off-topic posting, fun, and games. This is not inherently discouraged unless it dominates a list for a length of time. end quote William Overington 24 August 2011
Re: Re: Code pages and Unicode
On 23 August 2011 21:44 Richard Wordingham richard.wording...@ntlworld.com wrote: On Tue, 23 Aug 2011 07:18:21 +0200 Jean-François Colson j...@colson.eu wrote: And what do you think about (H1,H2,VS1,L3,L4)? The L4 is unnecessary. The trick then is to think of a BMP character that would very rarely be searched for on its own. Richard. With (H1,H2,VS1,L3), you'd only reach U+4010FFFF. To reach U+7FFFFFFF, you'd need either an additional low surrogate (H1,H2,VS1,L3,L4) or two VSes: (H1,H2,VS1,L3) and (H1,H2,VS2,L3).
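A quick back-of-the-envelope check on those figures (illustrative only; it assumes the extended range would be numbered upward from U+110000, which nothing in the standard defines):

    # Capacity of the hypothetical (H1,H2,VS,L3) schemes discussed above.
    HIGH = LOW = 1024                     # 1,024 high and 1,024 low surrogates

    one_vs = HIGH * HIGH * LOW            # (H1,H2,VS1,L3): 2**30 extra code points
    print(hex(0x110000 + one_vs - 1))     # 0x4010ffff -> "you'd only reach U+4010FFFF"

    two_vs = 2 * one_vs                   # adding (H1,H2,VS2,L3): 2**31 extra code points
    print(hex(0x110000 + two_vs - 1))     # 0x8010ffff -> past the 31-bit limit U+7FFFFFFF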
Re: Multiple private agreements (was: RE: Code pages and Unicode)
Philippe Verdy verdy underscore p at wanadoo dot fr wrote: (1) a plain-text file (2) using only plain-text conventions (i.e. not adding rich text) (3) which contains the same PUA code point with two meanings (4) using different fonts or other mechanisms (5) in a platform-independent, deterministic way One or more of the numbered items above must be sacrificed. The only numbered item to sacrifice is number (3) here. That's the case where separate PUA agreements are still coordinated so that they don't use the same PUA assignments. This is the case of PUA agreements in the Conscript registry. Number 3 was the entire basis for srivas's question: If same codes within PUA becomes standard for different purposes, how to get both working using same font? How to instruct text docs, what font if different fonts are used? Changing the question around, so that we are no longer talking about one code point with two meanings, doesn't accomplish anything. With only this exception, you can perfectly have separate agreements (using multiple fonts transporting them) for rendering a plain-text document. Of course the PUA-only agreement stored in the font is the set of glyphs, and the display properties. Other properties (for collation, case mappings, text segmentation, and so on...) are not suitable for being in the font, but they are not needed for correct editing (without automated case changes) or for correct rendering. We have different views concerning the relative importance of these other properties, and I'm not going to try further to convince you. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Re: Multiple private agreements (was: RE: Code pages and Unicode)
Luke-Jr luke at dashjr dot org wrote: Too bad the Conscript registry is censoring assignments the maintainer doesn't like for unspecified personal reasons, increasing the chances of an overlap. This isn't censorship, which would imply some sort of political, ethical, or moral agenda. This is a registrar making a technical (not an unspecified personal) decision, which he already explained to you, not to add something to the registry he maintains. (For what it's worth, and as you'll remember, I agreed with you about registering the tonal digits. But Michael is the CSUR registrar, not me.) Philippe Verdy verdy_p_at_wanadoo.fr replied: Even the UTC could create its own PUA registry, probably coordinating it with WG2, and with the IRG, for experimenting with new encodings, or working on proposals, helping document the needed features or difficulties, and cooperating better with non-technical people that have good cultural knowledge, or that have access to rare texts or corpora for which there still does not exist any digitization (scans), or whose digitization is restricted or not funded, and for which it is also impossible to create OCR versions. As Richard said, and you probably already know, there is no chance that UTC will ever do anything with the PUA, especially anything that gives the appearance of endorsing its use. I'm just thankful they haven't deprecated it. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Re: Code pages and Unicode
Asmus Freytag 於 2011年8月23日 下午2:00 寫道: Until then, I find further speculation rather pointless and would love if it moved off this list (until such time). That would be wonderful, because we could then turn our attention to more urgent subjects, such as what to do when the sun reaches its red giant stage and threatens to engulf the Earth. ☺ = Siôn ap-Rhisiart John H. Jenkins jenk...@apple.com
RE: Code pages and Unicode
William_J_G Overington wjgo underscore 10009 at btinternet dot com wrote: Until then, I find further speculation rather pointless and would love if it moved off this list (until such time). It is harmless fun, indeed it is fun that assists learning and understanding, and so as long as it does not go on for a long time, I think that it is good. If it were limited to the fun and the hypothetical, I would probably agree. But some people seem to be dead serious about the need to go beyond 1.1 million code points, and are making dead-serious arguments that we need to plan for it. I don't know if they truly believe we are going to communicate with space aliens using Unicode (judicious use of ☺ might reassure me here), or whether they think adding 2 billion code points will provide a back door to encoding all sorts of non-character "every grain of sand on the beach" objects, or what. But it isn't rooted in any sort of reality; both UTC and WG2 have permanently sealed the upper limit at 0x10FFFF, and knowledgeable people have tried and tried until they are blue in the face to explain why this is NOT a problem. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Re: Code pages and Unicode
On Wed, 24 Aug 2011 08:02:42 -0700 Doug Ewell d...@ewellic.org wrote: But some people seem to be dead serious about the need to go beyond 1.1 million code points, and are making dead-serious arguments that we need to plan for it. Those are two different claims. 'Never say never' is a useful maxim. The extension of UCS-2, namely UTF-16, is far from optimal, but it could have been a lot worse - at least the surrogates are contiguous. All I ask is that we have a reasonable way of extending it if, say, code points are squandered. I think, however, that <high><high><rare BMP code><low> offers a legitimate extension mechanism that can actually safely be ignored when scattering code assignments about the 17 planes (of which only 2 are full). Perhaps it is just as well we will never need a CJK character for every surname. It seems that we can safely accommodate CJK language tags. Richard.
Re: Code pages and Unicode
On 8/24/2011 10:48 AM, Richard Wordingham wrote: Those are two different claims. 'Never say never' is a useful maxim. So is "Leave well enough alone." The problem would be in using maxims instead of an analysis of engineering requirements to drive architectural decisions. The extension of UCS-2, namely UTF-16, is far from optimal, but it could have been a lot worse - at least the surrogates are contiguous. All I ask is that we have a reasonable way of extending it Why? if, say, code points are squandered. Oh. Well, in that case, the correct action is to work to ensure that code points are not squandered. I think, however, that <high><high><rare BMP code><low> offers a legitimate extension mechanism One could argue about the description as legitimate. It is clearly not conformant, and would require a decision about an architectural change to the standard. I see no chance of that happening for either the Unicode Standard or 10646. that can actually safely be ignored when scattering code assignments about the 17 planes (of which only 2 are full). A quibble (I know), but only 1 plane is arguably full. Or, if you count PUA, then *3* planes are full. Here are the current stats for the forthcoming Unicode 6.1, counting *designated* code points (as opposed to assigned graphic characters).
Plane 0: 63,207 / 65,536 = 96.45% full
Plane 1: 7,497 / 65,536 = 11.44% full
Plane 2: 47,626 / 65,536 = 72.67% full (plane reserved for CJK ideographs)
Plane 14: 339 / 65,536 = 0.52% full
Plane 15: 65,536 / 65,536 = 100% full (PUA)
Plane 16: 65,536 / 65,536 = 100% full (PUA)
--Ken
Re: Code pages and Unicode
On Wed, 24 Aug 2011 12:40:54 -0700 Ken Whistler k...@sybase.com wrote: On 8/24/2011 10:48 AM, Richard Wordingham wrote: if, say, code points are squandered. Oh. Well, in that case, the correct action is to work to ensure that code points are not squandered. Have there not already been several failures on that front? The BMP is littered with concessions to the limitations of rendering systems - precomposed characters, Hangul syllables and Arabic presentation forms are the most significant. Hangul syllables being also a political compromise does not instil confidence in the lines of defence. I don't dispute that there have also been victories. Has Japanese disunification been completely killed, or merely scotched? I think, however, that <high><high><rare BMP code><low> offers a legitimate extension mechanism One could argue about the description as legitimate. It is clearly not conformant, With what? It's obviously not UTF-16 as we know it, but a possibly new type of code-unit sequence. and would require a decision about an architectural change to the standard. Naturally. The standard says only 17 planes. However, apart from UTF-16, the change to the *standard* would not be big. (Even so, a lot of UTF-8 and UTF-32 code would have to be changed to accommodate the new limit.) I see no chance of that happening for either the Unicode Standard or 10646. It will only happen when the need becomes obvious, which may be never, or may be 30 years hence. It's even conceivable that UTF-16 will drop out of use. Here are the current stats for the forthcoming Unicode 6.1, counting *designated* code points (as opposed to assigned graphic characters).
Plane 0: 63,207 / 65,536 = 96.45% full
Plane 1: 7,497 / 65,536 = 11.44% full
Plane 2: 47,626 / 65,536 = 72.67% full (plane reserved for CJK ideographs)
Plane 14: 339 / 65,536 = 0.52% full
Plane 15: 65,536 / 65,536 = 100% full (PUA)
Plane 16: 65,536 / 65,536 = 100% full (PUA)
I only see two planes that are actually full. Which are you counting as the full non-PUA plane? Richard.
Re: Code pages and Unicode
It has ceased to be. It's expired and gone to meet its maker. It's a stiff. Bereft of life, it rests in peace.…Its metabolic processes are now history. It's off the twig. It's kicked the bucket, it's shuffled off its mortal coil, run down the curtain and joined the bleedin' choir invisible. This is an ex-possibility. And even if that *weren't* true, there are nowhere *near* enough kanji to have a serious impact on Ken's analysis. Richard Wordingham 於 2011年8月24日 下午4:51 寫道: Has Japanese disunification been completely killed, or merely scotched? = 井作恆 John H. Jenkins jenk...@apple.com
Re: Code pages and Unicode
On 8/24/2011 3:51 PM, Richard Wordingham wrote: Well, in that case, the correct action is to work to ensure that code points are not squandered. Have there not already been several failures on that front? The BMP is littered with concessions to the limitations of rendering systems - precomposed characters, Hangul syllables and Arabic presentation forms are the most significant. Those are not concessions to the limitations of rendering systems -- they are concessions to the need to stay compatible with the character encodings of legacy systems, which had limitations for their rendering systems. A quibble? I think not. Note the outcome for Tibetan, for example. A proposal came in some years ago to encode all of the stacks for Tibetan as separate, precomposed characters -- ostensibly because of the limitations of rendering systems. That proposal was stopped dead in its tracks in the encoding committees, both because it would have been a duplicate encoding and normalization nightmare, and because, well, current rendering systems *can* render Tibetan just fine, thank you, given the current encoding. Hangul syllables being also a political compromise From *1995*, when such a compromise was necessary to keep in place the still fragile consensus which had driven 10646 and the Unicode Standard into a still-evolving coexistence. It is a mistake to extrapolate from that one example to conclusions that political decisions will inevitably lead to encoding useless additional hundreds of thousands of characters. does not instil confidence in the lines of defence. I don't dispute that there have also been victories. Has Japanese disunification been completely killed, or merely scotched? I think, however, that <high><high><rare BMP code><low> offers a legitimate extension mechanism One could argue about the description as legitimate. It is clearly not conformant, With what? It's obviously not UTF-16 as we know it, but a possibly new type of code-unit sequence. In whichever encoding form you choose to specify, the sequence <high><high> is non-conformant. Not merely a possibly new type of code unit sequence.
D800 D800 is non-conformant UTF-16
D800 D800 is non-conformant UTF-32
ED A0 80 ED A0 80 is non-conformant UTF-8
and would require a decision about an architectural change to the standard. Naturally. The standard says only 17 planes. However, apart from UTF-16, the change to the *standard* would not be big. (Even so, a lot of UTF-8 and UTF-32 code would have to be changed to accommodate the new limit.) Which is why this is never going to happen. (And yes, I said never. ;-) ) I see no chance of that happening for either the Unicode Standard or 10646. It will only happen when the need becomes obvious, which may be never, or may be 30 years hence. It's even conceivable that UTF-16 will drop out of use. Could happen. It still doesn't matter, because such a proposal also breaks UTF-8 and UTF-32. Plane 0: 63,207 / 65,536 = 96.45% full I only see two planes that are actually full. Which are you counting as the full non-PUA plane? The BMP. 96.45% full is, for all intents and purposes, considered full now. If you look at the BMP roadmap: http://www.unicode.org/roadmaps/bmp/ there are only 9 columns left which are not already in assigned blocks. More characters will gradually be added to existing blocks, of course, filling in nooks and crannies, but the real action for new encoding has now turned almost entirely to Plane 1. --Ken
Re: Multiple private agreements (was: RE: Code pages and Unicode)
2011/8/24 Doug Ewell d...@ewellic.org: Philippe Verdy verdy underscore p at wanadoo dot fr wrote: (1) a plain-text file (2) using only plain-text conventions (i.e. not adding rich text) (3) which contains the same PUA code point with two meanings (4) using different fonts or other mechanisms (5) in a platform-independent, deterministic way One or more of the numbered items above must be sacrificed. The only numbered item to sacrifice is number (3) here. That's the case where separate PUA agreements are still coordinated so that they don't use the same PUA assignments. This is the case of PUA agreements in the Conscript registry. Number 3 was the entire basis for srivas's question: If same codes within PUA becomes standard for different purposes, how to get both working using same font? How to instruct text docs, what font if different fonts are used? Changing the question around, so that we are no longer talking about one code point with two meanings, doesn't accomplish anything. But my initial suggestion implied that condition 3 was not part of it. It was not me but srivas who modified the problem. The problem was changed later by adding new conditions that I never intended. It is clear that this condition 3 is completely unsatisfiable in all cases.
Re: Code pages and Unicode
2011/8/25 Richard Wordingham richard.wording...@ntlworld.com: It will only happen when the need becomes obvious, which may be never, or may be 30 years hence. It's even conceivable that UTF-16 will drop out of use. Conceivable but extremely unlikely, because it will remain used in extremely frequent cases, even if it can only support a subset of the new encoding. [begin side note] This is a situation similar to the case of the UCS-2 subset, and of the ISO 10646 implementation levels that have been withdrawn and are no longer meaningful as a condition for conformance: conforming applications today *must* exhibit behaviors that effectively respect the unbreakability and unreorderability of surrogate pairs; supporting isolated surrogates, or custom encodings that would depend on pairing rules other than a high surrogate followed by a low surrogate, is not conforming. This does not mean that applications have to assign distinctive semantics to surrogates or have to support non-BMP characters by recognizing their distinctive properties: as long as runs of surrogates are handled in such a way that they will never be reordered or composed in arbitrary sequences, these applications can satisfy the conformance requirement, without having to fully assert a higher implementation level. So a UCS-2-only application can continue to blindly treat surrogates *as if* they were unbreakable strings of symbols with a strong LTR directionality and unknown glyphs (or just the same .notdef glyph), or to treat them *as if* they were unassigned (but valid) code points in the BMP (all with the same default property values, except that the value of individual code units must all be preserved; alternatively, a UCS-2 application may still replace those surrogate code units, all simultaneously, with the same value associated with a non-ignorable character, such as 0xFFFD or 0x003F, or may still suppress all of them, knowing that this is destructive of information, or opt for throwing a fatal exception for all of them; these are some of the worst situations where this UCS-2-only behavior is still conforming). [end side note] This does not mean that existing UTFs will be the favored encoding in the future (we can't say that even about UTF-8, or UTF-32). It's just impossible to magically predict now which of the three standard UTFs (or their standard byte-order variants) will fall out of use, or if any one of them will fall out of use: for now there is absolutely no sign that this will ever occur. Instead, we still see a very large (and still accelerating) adoption rate for these UTFs (notably UTF-8).
Re: Multiple private agreements (was: RE: Code pages and Unicode)
Philippe wrote: But my initial suggestion implied that condition 3 was not part of it. It was not me but srivas who modified the problem. The problem was changed later by adding new conditions that I never intended. It is clear that this condition 3 is completely unsatisfiable in all cases. The problem was stated initially by srivas, yesterday, so it's hard to imagine how he modified it. But of course I agree, and said so first, that condition 3 (one font, two different characters, same font, plain text) is impossible. -- Doug Ewell • d...@ewellic.org Sent via BlackBerry by AT&T
Re: Multiple private agreements (was: RE: Code pages and Unicode)
s/one font/one code point/ -- Doug Ewell • d...@ewellic.org Sent via BlackBerry by AT&T
Re: Multiple private agreements (was: RE: Code pages and Unicode)
2011/8/24 Doug Ewell d...@ewellic.org: As Richard said, and you probably already know, there is no chance that UTC will ever do anything with the PUA, especially anything that gives the appearance of endorsing its use. I'm just thankful they haven't deprecated it. The appearance of endorsing its use would only come if the website describing the registry was using a frame using the Unicode logo. It can act exactly like the CSUR registry, as an independent project (with its own membership and participation policies), that would also be helpful for collaborating with liaison members, ISO NBs, or some local cultural organizations or collaborative projects. The focus of this registry would only be on helping the encoding process: registered PUAs or PUA ranges would not survive proposals that were formally proposed and rejected by both the UTC and WG2, and abandoned as well by their initial promoters in the registry (no new updated proposal), or proposals that have been finally released in the UCS (and there would likely be a short timeframe for the death of these registrations, probably not exceeding one year). It would be different from the CSUR, because the CSUR also focuses on PUA uses that will never be supported in the UCS (for example, due to legal reasons, such as copyright which would restrict the publication of any representative glyph in the UCS charts), or creative/artistic designs (for example, I'm still not convinced that Klingon qualifies for encoding in the UCS, because of copyright restrictions and absence of a formal free licence from right owners; the same would apply to any collection of logos, including the logos of national or international standard bodies that you can find on lots of manufactured products and in their documentation, because the usage of these logos is severely restricted and often implies contractual assessments by those displaying them on their products or publications; this would also apply to corporate logos, even if they are widely used, sometimes with permission, but this time because these logos frequently change for marketing reasons).
Re: Multiple private agreements (was: RE: Code pages and Unicode)
so you will end up with the CSUR AND the registry Philippe is suggesting AND all the existing uses of PUA that will not end up in CSUR or the other registry. Sounds like it will be a mess. It's bad enough dealing with Unicode and pseudo-Unicode in the Myanmar script; adding PUA potentially into the mix ummm... -- Andrew Cunningham Senior Project Manager, Research and Development Vicnet State Library of Victoria Australia andr...@vicnet.net.au lang.supp...@gmail.com
Re: Multiple private agreements (was: RE: Code pages and Unicode)
2011/8/25 Andrew Cunningham lang.supp...@gmail.com: so you will end up with the CSUR AND the registry Philippe is suggesting AND all the existing uses of PUA that will not end up in CSUR or the other registry. Sounds like it will be a mess. It's bad enough dealing with Unicode and pseudo-Unicode in the Myanmar script; adding PUA potentially into the mix ummm... Where did someone speak about the Myanmar script case? Maybe you're now the source of this mix. And anyway I've not said that the CSUR project, or a putative project to help the UCS encoding process, are the only options for using PUAs. And not all PUA usages need to be coordinated:
- in East Asia, PUAs are frequently used only for personal reasons, in a purely creative way (for example in personal ideographs), only for the final purpose of creating something else that will be communicated with others, and which will not necessarily be plain text.
- people can encode their own photos or colorful drawings in a PUA, if they want it for their own uses... They don't require the authorization or approval from others.
- applications may internally use their own PUAs as a simple part of their implementation, to make it work and produce the results wanted, without having to even expose how this internal use is effectively defined (others may try to investigate, by reverse engineering, or the application author may change this representation at any time, or remove it by using some other solutions; it should not matter).
However, I am still convinced that the coordinated use of PUAs is justified by the desire to create something else, as a temporary working tool, which can be justified by the current limitations of existing standards in their defined scope (including policies) or in their initially expected usage. In that case, PUAs are a very useful transition mechanism.
Re: Code pages and Unicode
On Mon, 22 Aug 2011 16:18:56 -0700 Ken Whistler k...@sybase.com wrote: How about Clause 12.5 of ISO/IEC 10646: 001B, 0025, 0040 You escape out of UTF-16 to ISO 2022, and then you can do whatever the heck you want, including exchange and processing of complete 4-byte forms, with all the billions of characters folks seem to think they need. Of course you would have to convince implementers to honor the ISO 2022 escape sequence... Which they only need to do if the text is in an ISO 2022 or similar context. Your idea does suggest that a pattern of <high><high><SO><low> would be reasonable. The shift-out code U+000E has no meaning as a Unicode character, so it wouldn't be unreasonable to require a special check that one finds a full character if looking for a one-character string consisting only of U+000E. We could also have <high><high><SI><low> to give the full *two* thousand million odd characters that would again be supported by UTF-32. Richard.
Re: Code pages and Unicode
On 8/23/2011 12:00 PM, Richard Wordingham wrote: On Mon, 22 Aug 2011 16:18:56 -0700 Ken Whistler k...@sybase.com wrote: How about Clause 12.5 of ISO/IEC 10646: 001B, 0025, 0040 You escape out of UTF-16 to ISO 2022, and then you can do whatever the heck you want, including exchange and processing of complete 4-byte forms, with all the billions of characters folks seem to think they need. Of course you would have to convince implementers to honor the ISO 2022 escape sequence... Which they only need to do if the text is in an ISO 2022 or similar context. Your idea does suggest that a pattern of <high><high><SO><low> would be reasonable. I don't see where Ken's reply (as quoted) suggests anything like that. What he wrote is that, formally, 10646 supports a mechanism to switch to ISO 2022. Therefore, formally, there's an escape hatch built in. If and when such should be needed, in a few hundred years, it'll be there. Until then, I find further speculation rather pointless and would love if it moved off this list (until such time). A./
RE: Code pages and Unicode
Asmus Freytag asmusf at netcom dot com wrote: Until then, I find further speculation rather pointless and would love if it moved off this list (until such time). +1 -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Multiple private agreements (was: RE: Code pages and Unicode)
srivas sinnathurai sisrivas at blueyonder dot co dot uk wrote: If same codes within PUA becomes standard for different purposes, They aren't standard. Two different private agreements could assign different characters to the same PUA code points. how to get both working using same font? You can't. How to instruct text docs, what font if different fonts are used? There's no standard way to specify even one font or private agreement in plain text, let alone how to switch between them within the same document. This is not an intended use of the PUA. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Re: Multiple private agreements (was: RE: Code pages and Unicode)
2011/8/23 Doug Ewell d...@ewellic.org: srivas sinnathurai sisrivas at blueyonder dot co dot uk wrote: If same codes within PUA becomes standard for different purposes, They aren't standard. Two different private agreements could assign different characters to the same PUA code points. how to get both working using same font? You can't. I do agree. How to instruct text docs, what font if different fonts are used? There's no standard way to specify even one font or private agreement in plain text, let alone how to switch between them within the same document. This is not an intended use of the PUA. There exists such a standard in the context of plain-text rendering, because of font fallback mechanisms (in Windows with Uniscribe, such a fallback mechanism is not tunable per user preferences, as the list of alternative fonts that are tried is fixed by the implementation of Uniscribe; but anyway it still exists), which implies that multiple fonts will be scanned with an order of preference; font fallback is involved each time a character is not mapped in the selected font but may be mapped in another font. Such a mechanism is very similar to the explicit fallback mechanism in CSS (where one provides an ordered comma-separated list of font-family names), but it also extends this list of fonts automatically using the default font fallback mechanisms used for plain-text rendering. In other words, even if you can't instruct a plain-text document to use glyphs from one font or from another for the same code point (PUA here), such a possibility still exists in rich-text rendering, because all glyphs can become selectable as variants (including the variants listed in the same font for the same glyph, in standardized OpenType features, provided that the rich-text application implements such a glyph-selection mechanism). PUAs are effectively not meant to supply the PUA agreement. This has to be provided elsewhere, but a font can perfectly transport this agreement (for the font as a whole which is separately selectable, just like its designed glyph variants are individually selectable by some typographic feature tables, as well as by index, for example several swash variants of the same letter with more or less decorations). If you can use font fallbacks, then you can render the same text containing distinct PUAs designed for distinct PUA agreements (and this demonstrates the utility of the Conscript registry, which allows cooperation between authors of separate agreements, that have accepted to encode their PUA characters with non-conflicting PUA code point assignments). -- Philippe.
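To make the fallback idea concrete, here is a toy sketch (all font names and coverage sets are invented, and real shaping engines are considerably more involved): the renderer walks an ordered font list and picks the first font whose cmap covers the code point, which is how two non-conflicting PUA agreements, each carried by its own font, can coexist in one run of plain text.

    # Toy model of per-character font fallback; every name and coverage set is invented.
    FALLBACK_ORDER = [
        ('RegularTextFont', set(range(0x0020, 0x007F))),    # ordinary Latin coverage
        ('PuaAgreementA',   {0xE000, 0xE001, 0xE002}),       # font carrying PUA agreement A
        ('PuaAgreementB',   {0xE100, 0xE101}),                # font carrying PUA agreement B
    ]

    def font_for(cp):
        """Return the first font in the fallback list that maps this code point."""
        for name, coverage in FALLBACK_ORDER:
            if cp in coverage:
                return name
        return '.notdef'                                      # no font covers it

    for cp in (0x0041, 0xE000, 0xE100):
        print(f'U+{cp:04X} -> {font_for(cp)}')
    # U+0041 -> RegularTextFont, U+E000 -> PuaAgreementA, U+E100 -> PuaAgreementB:
    # the two agreements render side by side as long as their assignments don't overlap.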
RE: Multiple private agreements (was: RE: Code pages and Unicode)
Philippe Verdy verdy underscore p at wanadoo dot fr wrote: There's no standard way to specify even one font or private agreement in plain text, let alone how to switch between them within the same document. This is not an intended use of the PUA. There exists such standard in the context of plain-text rendering, because of font fallback mechanisms (in Windows with Uniscribe, such fallback mechanism is not tunable per user preferences, as the list of alternative fonts that are tried is fixed by the implementation of Uniscribe; but anyway it still exists), which implies that multiple fonts will be scanned with an order of preference; font fallback is involved each time a character is not mapped on the selected font but may be mapped in another font. That's not a way to specify a font. Neither the creator nor the reader has control over which fallback font is used. In any event, if the document contains a PUA code point that is used with two or more different intended meanings (srivas's scenario), the engine will surely pick the same font for both instances. Such mechanism is exactly similar to the explicit fallback mechanism in CSS (where one provides an ordered comma-separated list of font-family names), but that also extends this list of fonts automatically using the default font fallback mechanisms used for plain-text rendering. In CSS the author can at least pick the fonts. In other words, even if you can't instruct a plain-text to use glyphs from one font or from another for the same code point (PUA here), such possibility still exists in rich-text rendering, because all glyphs can become selectable as variants (including the variants listed in the same font for the same glyph, in standardized OpenType features, provided that the rich-text application implements such glyph-selection mechanism). Then it's not plain text, which is all I was talking about. PUAs are effectively not meant to supply the PUA agreement. This has to be provided elsewhere, but a font can perfectly transport this agreement (for the font as a whole which is separately selectable, just like its designed glyph variants are individually selectable by some typographic feature tables, as well as by index, for example several swash variants of the same letter with more or less decorations). Not perfectly, unless you think that display is everything. If you can use font fallbacks, then you can render the same text containing distinct PUAs designed for distinct PUA agreements (and this demonstrates the utility of the conscript registry, which allows cooperation between authors of separate agreements, that have accepted to encode their PUA characters with non-conflicting PUA code point assignments). Coordinating private agreements so they don't conflict is clearly the ideal situation. But many different people and organizations have already claimed the same chunk of PUA space, as Richard exemplified yesterday with his Taiwan/Hong Kong example. There is no standard way to display:
(1) a plain-text file
(2) using only plain-text conventions (i.e. not adding rich text)
(3) which contains the same PUA code point with two meanings
(4) using different fonts or other mechanisms
(5) in a platform-independent, deterministic way
One or more of the numbered items above must be sacrificed. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Re: Multiple private agreements (was: RE: Code pages and Unicode)
2011/8/24 Doug Ewell d...@ewellic.org: Coordinating private agreements so they don't conflict is clearly the ideal situation. But many different people and organizations have already claimed the same chunk of PUA space, as Richard exemplified yesterday with his Taiwan/Hong Kong example. There is no standard way to display: (1) a plain-text file (2) using only plain-text conventions (i.e. not adding rich text) (3) which contains the same PUA code point with two meanings (4) using different fonts or other mechanisms (5) in a platform-independent, deterministic way One or more of the numbered items above must be sacrificed. The only numbered item to sacrifice is number (3) here. That's the case where separate PUA agreements are still coordinated so that they don't use the same PUA assignments. This is the case of PUA agreements in the Conscript registry. With only this exception, you can perfectly have separate agreements (using multiple fonts transporting them) for rendering a plain-text document. Of course the PUA-only agreement stored in the font is the set of glyphs, and the display properties. Other properties (for collation, case mappings, text segmentation, and so on...) are not suitable for being in the font, but they are not needed for correct editing (without automated case changes) or for correct rendering.
Re: Multiple private agreements (was: RE: Code pages and Unicode)
On Tuesday, August 23, 2011 10:29:58 PM Philippe Verdy wrote: 2011/8/24 Doug Ewell d...@ewellic.org: (3) which contains the same PUA code point with two meanings The only numbered item to sacifice is number (3) here. that's the case where separate PUA agreements are still coordinated so that they don't use the same PUA assignments. This is the case of PUA greements in the Conscript registry. Too bad the Conscript registry is censoring assignments the maintainer doesn't like for unspecified personal reasons, increasing the chances of an overlap.
Re: Multiple private agreements (was: RE: Code pages and Unicode)
2011/8/24 Luke-Jr l...@dashjr.org: On Tuesday, August 23, 2011 10:29:58 PM Philippe Verdy wrote: 2011/8/24 Doug Ewell d...@ewellic.org: (3) which contains the same PUA code point with two meanings The only numbered item to sacrifice is number (3) here. That's the case where separate PUA agreements are still coordinated so that they don't use the same PUA assignments. This is the case of PUA agreements in the Conscript registry. Too bad the Conscript registry is censoring assignments the maintainer doesn't like for unspecified personal reasons, increasing the chances of an overlap. It's their choice, their private decision. Nobody is required to accept the conditions of CSUR. In fact other groups could be created to coordinate other choices compatible with each other. Even the UTC could create its own PUA registry, probably coordinating it with WG2, and with the IRG, for experimenting with new encodings, or working on proposals, helping document the needed features or difficulties, and cooperating better with non-technical people that have good cultural knowledge, or that have access to rare texts or corpora for which there still does not exist any digitization (scans), or whose digitization is restricted or not funded, and for which it is also impossible to create OCR versions. In order to get funding, some of those projects would need to exhibit only some fragments, explaining what is found in the rest of the corpus, using significant samples, but also creating new didactic documents, for which PUAs will be needed if they want to interchange anything other than handwritten papers, and photocopies or scans (which are not easy to handle via emails or in HTML pages, or that are hard to reproduce). Such a PUA registry is not required to be stable for extensive periods. Its content will evolve so that the encoded documents will be valid for a limited time. This also means that the necessary fonts required to keep those texts in a legible way (and permit possible future reencoding, to new PUAs or to standard assignments in the UCS) would have to be kept with those PUA texts. Those fonts should be clearly versioned, containing an expected lifetime for which the PUA registry may guarantee some stability (example: the PUA registry will make assignments only by yearly leases that will need to be renewed by interested people). Note that I clearly want PUA fonts to contain explicitly the character properties needed for proper rendering. Simply because it is expected that PUA documents will be created and interchanged for a limited time. There will be almost no transforms of those texts, only updates to their content via editing. Now which font format will be the best suited for this work with PUA texts? Maybe OpenType is not the best fit (tools to create them are too complex for most users, and often are too costly, probably a consequence of this complexity, which restricts those tools to very few specialists), when there are simpler formats that are easily editable from more tools (SVG fonts look promising, even if their typographic capabilities are not very advanced for now; I just hope that someday there will be support for this format in more renderers, even if those fonts are larger in size for fewer glyphs inside; but this SVG format can be easily zipped into an SVGZ format also recognized automatically).
But some OSes or applications are offering simple accessory tools to create PUA glyphs stored in personal fonts that can be reedited, embedded, or uploaded to the recipients of a document needing these glyphs (this may be used as an extension to input method editors, notably for entering custom sinograms). Those tools won't let you create glyphs with perfect metrics, or fonts with ligatures/GSUB features, or advanced GPOS positioning. Drawing tools are minimized to reproduce how we draw basic shapes with the round head of a pen, the elliptic head of a pencil, or the thin linear head of some highlighting pens. Some other tools just let you use a scan and produce basic shapes.
Re: Code pages and Unicode
On 21 August 2011 02:14, Richard Wordingham richard.wording...@ntlworld.com wrote: On Fri, 19 Aug 2011 17:03:41 -0700 Ken Whistler k...@sybase.com wrote: O.k., so apparently we have awhile to go before we have to start worrying about the Y2K or IPv4 problem for Unicode. Call me again in the year 2851, and we'll still have 5 years left to design a new scheme and plan for the transition. ;-) It'll be much easier to extend UTF-16 if there are still enough contiguous points available. Set that wake-up call for 2790, or whenever plane 13 (better, plane 12) is about to come into use. Stymied by the Unicode® stability policies again: The General_Category property values will not be further subdivided. The General_Category property value Surrogate (Cs) is immutable: the set of code points with that value will never change. http://unicode.org/policies/stability_policy.html#Property_Value Can anyone think of a way to extend UTF-16 without adding new surrogates or inventing a new general category? Andrew
Re: Code pages and Unicode
On 08/22/2011 03:05 PM, Andrew West wrote: Can anyone think of a way to extend UTF-16 without adding new surrogates or inventing a new general category? Why would anyone *need* to do so? UTF-16 can represent all code points up to Plane 16, right? -- Shriramana Sharma
Re: Code pages and Unicode
On 22 August 2011 12:51, Shriramana Sharma samj...@gmail.com wrote: On 08/22/2011 03:05 PM, Andrew West wrote: Can anyone think of a way to extend UTF-16 without adding new surrogates or inventing a new general category? Why would anyone *need* to do so? UTF-16 can represent all codepoints upto Plane 16 right? To clarify, I was replying to Richard Wordingham's tongue in cheek suggestion to extend UTF-16 to go beyond Plane 16 in the year 2790 or when only one free plane remains. I am not advocating extending UTF-16 or the Unicode code space, or suggesting that it will ever be necessary to do so. But hypothetically, I don't see a way to extend UTF-16 without breaking the stability policy. The same stability policies would also prohibit the assignment of any area of the Unicode code space for code page usage as Srivas Sinnathurai has proposed. (If there was an automatic filter on ideas that break one or more stability policies this mailing list would be a far quieter place.) Andrew
RE: Code pages and Unicode
srivas sinnathurai sisrivas at blueyonder dot co dot uk wrote: The true lifting of UTF-16 would be to UTF-32. Leave the UTF-16 untouched and make the new half as versatile as possible. I think any other solution is just a patch-up for the time being. There is no evidence whatsoever that this is a problem that needs to be solved, not in 700 or 800 years, not ever. Ken's words are again being ignored. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Re: Code pages and Unicode
Christoph Päper 於 2011年8月20日 上午2:31 寫道: Mark Davis ☕: Under the original design principles of Unicode, the goal was a bit more limited; we envisioned […] a generative mechanism for infrequent CJK ideographs, I'd still like having that as an option. Et voilà! We have Ideographic Description Sequences. Or, if you're more ambitious, CDL. Generative mechanisms for Han are very attractive given the nature of the script, but once you try to support something other than display, or even try to write a rendering engine, all sorts of nasty problems crop up that have proven difficult to solve. We won't even get into the problem of wanting to discourage people from making up new ad hoc characters for Han. I won't say some sort of generative mechanism will never become the preferred way of handling unencoded ideographs, but there is a lot of work to be done before that would be practical. = John H. Jenkins jenk...@apple.com
Re: Code pages and Unicode
On Monday 22 August 2011, Andrew West andrewcw...@gmail.com wrote: Can anyone think of a way to extend UTF-16 without adding new surrogates or inventing a new general category? Andrew How about a triple sequence of two high surrogates followed by one low surrogate? I suggest this as a solution to the problem that is posed by Andrew as I feel that it would be interesting to know if that would be possible or whether it would be forbidden due to an existing policy that has already been guaranteed to be unchangeable. William Overington 22 August 2011
Re: Code pages and Unicode
On 22/08/11 16:55, Doug Ewell wrote: srivas sinnathurai sisrivas at blueyonder dot co dot uk wrote: The true lifting of UTF-16 would be to UTF-32. Leave the UTF-16 untouched and make the new half as versatile as possible. I think any other solution is just a patch-up for the time being. There is no evidence whatsoever that this is a problem that needs to be solved, not in 700 or 800 years, not ever. Ken's words are again being ignored. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell I see at least one reason to extend the present 17-plane Unicode space: that would provide space for an RTL PUA. ☺ Presently, UTF-16 uses surrogate pairs to address non-BMP characters: HS LS (High Surrogate followed by Low Surrogate). What would happen if we nested them? Would HS1 HS2 LS1 LS2 be acceptable to address more characters?
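One thing worth noting about that idea (a purely illustrative sketch follows; the specific code-unit values are arbitrary): decoding HS1 HS2 LS1 LS2 left to right can be made deterministic, since a high surrogate followed by another high surrogate is ill-formed today, but the scheme is not self-synchronizing, because the middle pair HS2 LS1 is indistinguishable from an ordinary supplementary-plane character unless you also inspect the preceding unit.

    # Illustrative only: the nested form HS1 HS2 LS1 LS2 hides a valid ordinary
    # surrogate pair (HS2,LS1) in its middle, so a consumer that starts reading
    # or searching at the second unit reports a bogus but perfectly legal character.
    HS1, HS2, LS1, LS2 = 0xD800, 0xD801, 0xDC00, 0xDC01   # arbitrary example values

    def pair_to_scalar(hi, lo):
        """Scalar value of a normal UTF-16 surrogate pair."""
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

    seq = [HS1, HS2, LS1, LS2]
    print(hex(pair_to_scalar(seq[1], seq[2])))   # 0x10400: a false, in-range hit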
Re: Code pages and Unicode
On 20/08/11 02:03, Ken Whistler wrote: O.k., so apparently we have awhile to go before we have to start worrying about the Y2K or IPv4 problem for Unicode. Call me again in the year 2851, and we'll still have 5 years left to design a new scheme and plan for the transition. ;-) --Ken I wonder whether you aren’t a little too optimistic. Have you considered the unencoded ideographic scripts? 1,071 hieroglyphs have already been encoded. I think there are approximately 4,000 more to encode. 1,165 Yi syllables and 55 Yi radicals have been encoded. But they only support one dialect of Yi and I read there are tens of thousands of Yi ideographs and that a proposal to encode 88,613 classical Yi characters was made 4 years ago. The threshold of 200,000 characters doesn’t seem very far.
Re: Code pages and Unicode
On 8/22/2011 9:58 AM, Jean-François Colson wrote: I wonder whether you aren’t a little too optimistic. No. If anything I'm assuming that the folks working on proposals will be amazingly assiduous during the next decade. Have you considered the unencoded ideographic scripts? Why, yes I have. 1,071 hieroglyphs have already been encoded. I think there are approximately 4,000 more to encode. A preliminary listing of 4548 additional hieroglyphs, based on Hieroglyphica (1993), was presented to WG2 in 1999. Twelve years have passed, and no additional document has been forthcoming to work through the issues in standardizing such a list as characters. I won't hold my breath, but somebody *might* get through that work by 2021. 1,165 Yi syllables and 55 Yi radicals have been encoded. But they only support one dialect of Yi and I read there are tens of thousands of Yi ideographs and that a proposal to encode 88,613 classical Yi characters was made 4 years ago. 88,613 classical Yi *glyphs*. This is just a collection of every glyph form noted from wherever. Even the proponents acknowledged that it was more on the order of maybe 7000 *characters* involved. They got feedback to do the homework to work through the character/glyph model for classical Yi, and come back when they have a documented, reliable listing of the Yi *characters* that need encoding, together with the list of variants for each character. Given the nature and scope of the work, and no (current) indication of the progress being made, this also *might* get done by 2021. The threshold of 200,000 characters doesn’t seem very far. Nah. It is still way over the extended horizon. The only big historic ideographic script that is close to being done is Tangut, and the wrangling even over that one has gone on for years now. --Ken
Re: Code pages and Unicode
On Mon, 22 Aug 2011 14:06:00 +0100 (BST) William_J_G Overington wjgo_10...@btinternet.com wrote: On Monday 22 August 2011, Andrew West andrewcw...@gmail.com wrote: Can anyone think of a way to extend UTF-16 without adding new surrogates or inventing a new general category? Andrew How about a triple sequence of two high surrogates followed by one low surrogate? The problem is that a search for the character represented by the code unit sequence (H2,L3) would also pick up the sequence (H1,H2,L3). While there is no ambiguity, it does make searching more complicated to code. The same issue applies to the suggestion of using (H1,H2,L3,L4) sequences. Now, we could use (H1,H2,L3,L4) sequences and never assign the (H2,L3) combinations. They would therefore be category Cn, which currently consists of both the unassigned characters and the non-characters. However, I can't help feeling that they'd be almost a sort of surrogate. It's slightly more efficient to replace L3 by a single BMP character. Practically, I think that if we can change the semantics of the Myanmar script, our descendants can go back on the guarantee of no more surrogates. Richard.
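A small illustration of the searching point Richard makes, using a hypothetical (H1,H2,L3) extension triple that is not part of any standard: a naive code-unit search for an ordinary pair gets a hit inside the triple, so every match would need extra look-behind to rule out a preceding high surrogate (a toy sketch in Python):

    needle = [0xD801, 0xDC37]                       # an ordinary pair (H2, L3)
    haystack = [0x0041, 0xD800, 0xD801, 0xDC37]     # 'A' plus a hypothetical triple (H1, H2, L3)

    def find(hay, nee):
        return [i for i in range(len(hay) - len(nee) + 1) if hay[i:i + len(nee)] == nee]

    print(find(haystack, needle))   # [2] -- the pair is found inside the triple

    # In well-formed UTF-16 today a high surrogate is never preceded by another
    # high surrogate, so a pair match is always a real character boundary.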
Re: Code pages and Unicode
On 8/22/2011 3:15 PM, Richard Wordingham wrote: On Monday 22 August 2011, Andrew West andrewcw...@gmail.com wrote: Can anyone think of a way to extend UTF-16 without adding new surrogates or inventing a new general category? Andrew How about a triple sequence of two high surrogates followed by one low surrogate? How about Clause 12.5 of ISO/IEC 10646: 001B, 0025, 0040 (ESC % @)? You escape out of UTF-16 to ISO 2022, and then you can do whatever the heck you want, including exchange and processing of complete 4-byte forms, with all the billions of characters folks seem to think they need. Of course you would have to convince implementers to honor the ISO 2022 escape sequence and liberate themselves into a high-level world of nosebleed character numerosity. But then I guess by the time this is needed, folks are counting on the need being self-evident. ;-) --Ken
Re: Code pages and Unicode
On 23/08/11 00:15, Richard Wordingham wrote: The problem is that a search for the character represented by the code unit sequence (H2,L3) would also pick up the sequence (H1,H2,L3). While there is no ambiguity, it does make searching more complicated to code. The same issue applies to the suggestion of using (H1,H2,L3,L4) sequences. And what do you think about (H1,H2,VS1,L3,L4)?
Re: Code pages and Unicode
About the research works. I, with my colleagues, am researching the fact that Sumerian is Tamil / Tamil is Sumerian. This requires quite a lot of space. Additionally, I do research on the Tamil alphabet as based on scientific definitions: it represents only the mechanical parts, i.e. only the places of articulation, as an alphabet, and is not sound based. And what is called a mathematical multiplier theory on expanding the alphabets leads not just to long mathematics (nedung kaNaku) but also to extra-long mathematics. This is just a sample requirement from me and my colleagues. How many others are there who would require Unicode support? Do you think allocating 32,000 codes to the code page model would help? Regards Sinnathurai On 20 August 2011 09:31, Christoph Päper christoph.pae...@crissov.de wrote: Mark Davis ☕: Under the original design principles of Unicode, the goal was a bit more limited; we envisioned […] a generative mechanism for infrequent CJK ideographs, I'd still like having that as an option.
Re: Code pages and Unicode
It sounds like you’re trying to encode glyphs or glyph fragments, not characters. There is a virtually endless repertoire of “shapes” that could be encoded, but unless each of these is a character actually used in a writing system (not just hypothetically), it’s probably not appropriate for a character encoding. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell From: srivas sinnathurai Sent: Saturday, August 20, 2011 3:35 To: Christoph Päper Cc: unicode@unicode.org Subject: Re: Code pages and Unicode About the research works. I, with my colleagues, am researching the fact that Sumerian is Tamil / Tamil is Sumerian. This requires quite a lot of space. Additionally, I do research on the Tamil alphabet as based on scientific definitions: it represents only the mechanical parts, i.e. only the places of articulation, as an alphabet, and is not sound based. And what is called a mathematical multiplier theory on expanding the alphabets leads not just to long mathematics (nedung kaNaku) but also to extra-long mathematics. This is just a sample requirement from me and my colleagues. How many others are there who would require Unicode support? Do you think allocating 32,000 codes to the code page model would help? Regards Sinnathurai
Re: Code pages and Unicode
On Fri, 19 Aug 2011 17:03:41 -0700 Ken Whistler k...@sybase.com wrote: O.k., so apparently we have awhile to go before we have to start worrying about the Y2K or IPv4 problem for Unicode. Call me again in the year 2851, and we'll still have 5 years left to design a new scheme and plan for the transition. ;-) It'll be much easier to extend UTF-16 if there are still enough contiguous points available. Set that wake-up call for 2790, or whenever plane 13 (better, plane 12) is about to come into use. Richard.
Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)
Doug, First of all, the flat code space is the primary functionality of Unicode, and I am not calling for any changes to existing encodings. What I propose is to assign about 16,000 codes to a code-page switching model. Why this suggestion? With the current flat space, one code point is allocated to one and only one purpose. We can run out of code space soon. While the contemporary languages and other things like mathematical symbols are processed in flat space, the 16,000 codes in the portion that is code-page switchable would be able to support thousands of different characters on each of the codes. I.e., take 16 codes: flat space supports only 16 characters, but with code pages they can support 16 different purposes, each with a capacity of 14 characters; that is 140 characters instead of just 10 flat characters. Sinnathurai On 19 August 2011 15:27, Doug Ewell d...@ewellic.org wrote: srivas sinnathurai sisrivas at blueyonder dot co dot uk wrote: PUA is not structured It's not supposed to be. It's a private-use area. You use it the way you see fit. and not officially programmable to accommodate numerous code pages. None of Unicode is designed around code-page switching. It's a flat code space. This is true even for ISO 10646, which nominally divides the space into groups and planes and rows. As a programmer, I don't understand what not officially programmable means here. I've written lots of programs that use and understand the PUA. Take ISO 8859-1, 2, 3, and so on. These are now allocating the same code points to many languages and for other purposes. Character encodings don't allocate code points to languages. They allocate code points to characters, which are used to write text in languages. This is not a trivial distinction; it is crucial to understanding how character encodings work. Similarly, structured and official allocations to many requirements can be done using the same codes, say 16,000 of them. If you want to use ISO 2022, just use ISO 2022. I guess what I'm missing is why the code-page switching model is considered superior, in any way, to the flat code space of Unicode/10646. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
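For contrast, a toy, purely hypothetical code-page-switching decoder (this is not ISO 2022, just an illustration of the statefulness at issue): the same byte means different things depending on hidden state, which is exactly what a flat code space avoids.

    # Hypothetical toy scheme: byte 0x1B followed by a page number switches
    # pages; every other byte is interpreted relative to the current page.
    SWITCH = 0x1B

    def decode_switched(stream, pages):
        page, out = 0, []
        it = iter(stream)
        for b in it:
            if b == SWITCH:
                page = next(it)          # hidden state: meaning of later bytes changes
            else:
                out.append(pages[page][b])
        return out

    pages = {0: {0x41: "CHARACTER FROM PAGE 0"}, 1: {0x41: "CHARACTER FROM PAGE 1"}}
    print(decode_switched([0x41, SWITCH, 0x01, 0x41], pages))
    # ['CHARACTER FROM PAGE 0', 'CHARACTER FROM PAGE 1'] -- one byte, two meanings.
    # With a flat code space the code point alone identifies the character, so
    # random access, searching and collation need no such state tracking.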
RE: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)
srivas sinnathurai sisrivas at blueyonder dot co dot uk wrote: Why this suggestion? With current flat space, one code point is only allocated to one and only one purpose. We can run out of code space soon. Argument over. There are not 800,000 more characters that need to be encoded for storage or interchange. There may well be 800,000 glyphs, or images, or meanings, but that is not what any character encoding standard is for. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)
srivas sinnathurai 於 2011年8月19日 上午9:40 寫道: Why this suggestion? With current flat space, one code point is only allocated to one and only one purpose. We can run out of code space soon. There are a couple of problems here. We currently have over 860,000 unassigned code points. Surveys of all known writing systems indicate that only a small fraction of these will be needed. Indeed, although it looks likely that Han will spill out of the SIP into plane 3, all non-Han will likely fit into the SMP. (Michael, you can correct me on this if I'm wrong.) Even if we allow for the possibility that there are a lot of writing systems out there we don't know about, there would have to be a *lot* of writing systems out there we don't know about to fill up planes 4 through 14. If the average script requires 256 code points, there would have to be some 2800 unencoded scripts to do that. Moreover, it's taken us 20 years to use 250,000 code points. Even if that rate remained steady (and it's been going down), it will take us something on the order of a century to fill up the remaining space, if that's even possible, and that hardly qualifies as soon. And there already is a code page switching mechanism such as you propose. It's called ISO 2022 and it supports Unicode. In order to get the UTC and WG2 to agree to a major architectural change such as you're suggesting, you'd have to have some very solid evidence that it's needed—not an interesting idea, not potentially useful, but seriously *needed*. That's how surrogates and the astral planes came about—people came up with solid figures showing that 65,536 code points was not nearly enough. So far, the evidence suggests that we're in no danger of running out of code points. = Siôn ap-Rhisiart John H. Jenkins jenk...@apple.com
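The back-of-the-envelope figures in this message are easy to reproduce (a quick check using the numbers quoted above: 256 code points per average script, about 250,000 code points used in 20 years, over 860,000 still unassigned):

    plane = 0x10000                         # 65,536 code points per plane
    print(11 * plane)                       # planes 4..14: 720,896 code points
    print(11 * plane // 256)                # 2,816 -> "some 2800" average-sized scripts
    print(round(860_000 / (250_000 / 20)))  # ~69 years at the historical rate, and the
                                            # rate is falling, hence "on the order of a century"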
Re: Code pages and Unicode
John H. Jenkins: there would have to be a *lot* of writing systems out there we don't know about to fill up planes 4 through 14 That’s quite possible, though; the universe is huge. The question rather is whether we will ever know about them. It’s quite possible we won’t.
RE: Code pages and Unicode
Maybe we should step back a bit: I'm not calling for any change to existing major allocations. However, it is about time we allocate (not in the PUA) a large number of codes to code-page-based sub-codes, so that all 7000+ languages can freely use them without INTERFERENCE from Unicode and have the freedom to carry out research works, as we were doing with the legacy 8-bit codes. Can you provide some detail about these research works that have to do with encoding characters and are projected to require more than the 137,468 code points available in the PUA? What sort of INTERFERENCE from Unicode needs to be avoided, and are we talking about encoding or architectural decisions or what? (There must be a GREAT DEAL of perceived interference here, because of the capital letters.) -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
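For reference, the 137,468 figure is just the sum of the three Private Use Areas (a quick check):

    bmp_pua = 0xF8FF - 0xE000 + 1        # U+E000..U+F8FF     =  6,400
    plane15 = 0xFFFFD - 0xF0000 + 1      # U+F0000..U+FFFFD   = 65,534
    plane16 = 0x10FFFD - 0x100000 + 1    # U+100000..U+10FFFD = 65,534
    print(bmp_pua + plane15 + plane16)   # 137,468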
Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)
On 19 Aug 2011, at 18:24, John H. Jenkins wrote: We currently have over 860,000 unassigned code points. Surveys of all known writing systems indicate that only a small fraction of these will be needed. Indeed, although it looks likely that Han will spill out of the SIP into plane 3, all non-Han will likely fit into the SMP. (Michael, you can correct me on this if I'm wrong.) I wouldn't like to guarantee that non-Han won't spill over out of the SMP, but I doubt we'd fill Plane 4. Michael Everson * http://www.evertype.com/
Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)
On 08/19/2011 01:24 PM, John H. Jenkins wrote: In order to get the UTC and WG2 to agree to a major architectural change such as you're suggesting, you'd have to have some very solid evidence that it's needed—not an interesting idea, not potentially useful, but seriously *needed*. That's how surrogates and the astral planes came about—people came up with solid figures showing that 65,536 code points was not nearly enough. So far, the evidence suggests that we're in no danger of running out of code points. And indeed, it went the other way too, back when ISO-10646 had not 17, but 65536 *planes* and someone provided some reasonable evidence (or just plain reasoned arguments) that 4.3 *billion* characters was probably overkill. ~mark
RE: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)
Mark E. Shoulson mark at kli dot org wrote: And indeed, it went the other way too, back when ISO-10646 had not 17, but 65536 *planes* and someone provided some reasonable evidence (or just plain reasoned arguments) that 4.3 *billion* characters was probably overkill. Technically, I think 10646 was always limited to 32,768 planes so that one could always address a code point with a 32-bit signed integer (a nod to the Java fans). Of course, 2.1 billion characters is also overkill, but the advent of UTF-16 was how we ended up with 17 planes. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
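The arithmetic behind the signed-32-bit limit mentioned here (a quick check):

    print(32_768 * 65_536)   # 2,147,483,648 = 2**31 code points ("2.1 billion")
    print(65_536 * 65_536)   # 4,294,967,296 = 2**32 ("4.3 billion", the full four-octet space)
    print(2**31 - 1)         # 2,147,483,647, the largest signed 32-bit value, so code
                             # points 0..2**31-1 all fit in a signed 32-bit integer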
Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)
20.8.2011 0:07, Doug Ewell wrote: Of course, 2.1 billion characters is also overkill, but the advent of UTF-16 was how we ended up with 17 planes. And now we think that a little over a million is enough for everyone, just as they thought in the late 1980s that 16 bits is enough for everyone. -- Yucca, http://www.cs.tut.fi/~jkorpela/
Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)
On 08/19/2011 05:07 PM, Doug Ewell wrote: Mark E. Shoulson mark at kli dot org wrote: And indeed, it went the other way too, back when ISO-10646 had not 17, but 65536 *planes* and someone provided some reasonable evidence (or just plain reasoned arguments) that 4.3 *billion* characters was probably overkill. Technically, I think 10646 was always limited to 32,768 planes so that one could always address a code point with a 32-bit signed integer (a nod to the Java fans). Whew! So I guess it wasn't THAT many characters anyway... :) (Like Hofstadter's story about the professor who says that she calculates that the sun will burn out in 5 billion years. A nervous voice in the back of the room asks "H-how soon again?" "5 billion years." "Whew!" says the voice, sounding relieved. "For a minute I thought you said only 5 *million*.") ~mark
Re: Code pages and Unicode
On 20 Aug 2011, at 00:35, Jukka K. Korpela wrote: And now we think that a little over a million is enough for everyone, just as they thought in the late 1980s that 16 bits is enough for everyone. Whenever somebody talks about needing 31 bits for Unicode, I always think of the hypothetical situation of discovering some extraterrestrial civilization and trying to add all of their writing systems to Unicode. I imagine there would be little to unify outside of U+002E FULL STOP. The point I'm getting at is that somebody always claims that U+0000..U+10FFFF isn't enough, but I never see convincing evidence or rationale that an expansion is necessary—just speculation. —Ben Scarborough
RE: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)
Jukka K. Korpela jkorpela at cs dot tut dot fi wrote: And now we think that a little over a million is enough for everyone, just as they thought in the late 1980s that 16 bits is enough for everyone. I know this is an enjoyable exercise — people love to ridicule Bill Gates for his comment in 1981 about 640K, even though that was an order of magnitude larger than any home computer of the day — but every time I hear someone protest that the Unicode code space won't be large enough, it eventually comes down to one of: 1. Expanding scope to cover extraterrestrial characters 2. Expanding scope to cover glyphs or other things that aren't currently considered characters I don't worry about item 1. I suppose I should worry some about item 2, ever since the emoji experience. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)
On 8/19/2011 2:07 PM, Doug Ewell wrote: Technically, I think 10646 was always limited to 32,768 planes so that one could always address a code point with a 32-bit signed integer (a nod to the Java fans). Well, yes, but it didn't really have anything to do with Java. Remember that Java wasn't released until 1995, but the 10646 architecture dates back to circa 1986. So more likely it was a nod to C implementations which would, it was supposed, have implemented the 2-, 3-, or 4-octet forms of 10646 with a wchar_t, and which would have wanted a signed 32-bit type to work. I suspect, by the way, that that limitation was probably originally brought to WG2 by the U.S. national body, as they would have been the ones most worried about the C implementations of 10646 multi-octet forms. And the original architecture was also not really a full 32K planes in the sense that we now understand planes for Unicode and 10646. The original design for 10646 was for a 1- to 4-octet encoding, with all octets conforming to the ISO 2022 specification. It used the option that the working sets for the encoding octets would be the 94-unit ranges. So for G0: 0x21..0x7E and for G1: 0xA1..0xFE. The other bytes C0, 0x20, 0x7F, C1, 0xA0, 0xFF, were not used except for the single-octet form, as in 2022-conformant schemes still used today for some East Asian character encodings. And the octets were then designated G (group), P (plane), R (row), and C (cell). The 1-octet form thus allowed 95 + 96 = 191 code positions. The 2-octet form thus allowed (94 + 94)^2 = 35,344 code positions. The 3-octet form thus allowed (94 + 94)^3 = 6,644,672 code positions. The Group octet was constrained to the low set of 94. (This is the origin of the constraint to half the planes, which would keep wchar_t implementations out of negative signed range.) The 4-octet form thus allowed 94 * (94 + 94)^3 = 624,599,168 code positions. The grand total for all possible forms was the sum of those values, or *631,279,375* code positions (before various *other* set-asides for plane swapping and private use start getting taken into account). Of course, 2.1 billion characters is also overkill, but the advent of UTF-16 was how we ended up with 17 planes. So a lot less than 2.1 billion characters. But I think Doug's point is still valid: 631 million plus code points was still overkill for the problem to be addressed. And I think that we can thank our lucky stars that it isn't *that* architecture for a universal character encoding that we would now be implementing and debating on the alternative universe version of this email list. ;-) --Ken
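The capacity figures quoted in this message reduce to straightforward arithmetic (reproducing the numbers above as a quick check):

    one   = 95 + 96                  #         191 (single-octet form)
    two   = (94 + 94) ** 2           #      35,344
    three = (94 + 94) ** 3           #   6,644,672
    four  = 94 * (94 + 94) ** 3      # 624,599,168 (group octet limited to 94 values)
    print(one + two + three + four)  # 631,279,375 code positions in all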
Re: Code pages and Unicode
Benjamin M Scarborough 於 2011年8月19日 下午3:53 寫道: Whenever somebody talks about needing 31 bits for Unicode, I always think of the hypothetical situation of discovering some extraterrestrial civilization and trying to add all of their writing systems to Unicode. I imagine there would be little to unify outside of U+002E FULL STOP. Oh, I imagine they'll have one or two turtle ideographs. :-) Seriously, though, if and when we run into ETs with all their myriad writing systems, I really don't think that we'll be using Unicode to represent them. = 井作恆 John H. Jenkins jenk...@apple.com
Re: Code pages and Unicode
On 8/19/2011 2:53 PM, Benjamin M Scarborough wrote: Whenever somebody talks about needing 31 bits for Unicode, I always think of the hypothetical situation of discovering some extraterrestrial civilization and trying to add all of their writing systems to Unicode. I imagine there would be little to unify outside of U+002E FULL STOP. It is the *terrestrial* denizens of this discussion list that I worry more about. Most of the proposals for filling up uncountable planes with numbers representing -- well, who knows? -- originate here. ;-) The point I'm getting at is that somebody always claims that U+0000..U+10FFFF isn't enough, but I never see convincing evidence or rationale that an expansion is necessary—just speculation. Well, it is a late Friday afternoon in August. A slow news day, I guess. So it is time to trot out the periodically updated statistics that long ago convinced the folks who think 21 bits is just fine and dandy, and has a usefulness warranty that far exceeds our lifetimes, but which of course no matter how often repeated never convince the we-need-31-bits crowd. Newly updated to include the Unicode 6.1 repertoire in process for publication very early next year, the figures are: 110,181 characters encoded (graphic, format, and control codes counted). Now let's just assign that number an era of 2011, to make the math a little simpler. The first version of Unicode was published in 1991, so we've been at this for 20 years, not counting start-up time. If you just divide 110,181 by 20 years, that is a rough average of 5509 characters added per year. But here is the interesting part: the rate of inclusion is declining, rather than being steady. Again, to make the math simpler, just compare the *first* decade of Unicode (1991 - 2001) and the *second* decade of Unicode (2001 - 2011). Unicode 3.1 (2001) had 94,205 characters in it. So: 1st decade: 94,205 characters, or roughly 9420 characters/year; 2nd decade: 15,976 characters, or roughly 1598 characters/year. Also keep in mind that the absolute numbers have always been completely dominated by CJK. 75.46% of the characters encoded in Unicode 3.1 are CJK ideographs (unified and compatibility). The IRG has been working mightily to keep adding to the total of encoded CJK ideographs, but they are starting to scrape the bottom even of that deep barrel. And look at the SMP Roadmap: http://www.unicode.org/roadmaps/smp/ We know there are a few big historic ideographic scripts to go: Tangut is the biggest and most advanced of the proposals, weighing in at something over 7000 characters. But even with East Asian heavyweights like Tangut, Jurchen, and Khitan given tentative allocations on the SMP roadmap, there is plenty of unassigned air on Plane 1 still. And frankly, a lot of very serious people have been looking hard for good, encodable candidate scripts to add to the roadmap, for a very long time. The upshot is, based on 20 years in the business, as it were, my best estimate of what we can expect for the next decade is something as follows: Two big chunks: roughly 10K more CJK ideographs nobody ever heard of, plus 7K+ Tangut ideographs. After that, the two committees (UTC and WG2) will be hard pressed to find and process many more than 1000 characters per year. Why? Because all the *easy* stuff was done long ago, during the first decade of Unicode.
Everything from here on out is very obscure, hard to research, hard to document and review, hard to get consensus on, and is often fragmentary or even undeciphered, or consists of sets of notations that many folks won't even agree *are* characters. So: 10K + 7K + 1k/year for 10 years = 27,000 *maximum* additions by 2021. And that is to fill the gaping hole -- nay, gigantic chasm -- of 862,020 unassigned code points still left in the 21-bit space. Past 2021, who knows? Many of us will no longer be participating by then, but there are various possible scenarios: 1. The committees may creak to a halt, freeze the standards, and the delta encoding rate will drop from 1000/year to 0/year. This is actually a scenario with a non-zero probability. 2. Somebody with non-character agendas may seize control and start using numbers for, I don't know, perhaps localizable sentences, or something, just because over 835,000 numbers will be available and nature abhors a vacuum. I consider that a very low likelihood, because of the enormous vested interest there will be by the entire worldwide IT industry in keeping the character encoding standard stable. 3. Or, the committees may limp along more or less indefinitely, with more and more obscure scripts being documented and standardized, with a trickle of new ones always being invented, and new sets of symbols or notations being invented and stuck in. So maybe they could keep up the pace of 1000 characters encoded per year for some time off into the future. But at that rate, when do we have to start worrying? 835,000 divided
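The growth figures quoted in this message likewise reduce to simple arithmetic (a quick check of the numbers as given):

    total_2011, total_2001 = 110_181, 94_205
    print(total_2011 / 20)                 # ~5,509 characters/year averaged over 20 years
    print(total_2001 / 10)                 # ~9,420/year in the first decade
    print((total_2011 - total_2001) / 10)  # ~1,598/year in the second decade
    print(10_000 + 7_000 + 1_000 * 10)     # 27,000 maximum additions projected by 2021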
Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)
On 8/19/2011 2:35 PM, Jukka K. Korpela wrote: 20.8.2011 0:07, Doug Ewell wrote: Of course, 2.1 billion characters is also overkill, but the advent of UTF-16 was how we ended up with 17 planes. And now we think that a little over a million is enough for everyone, just as they thought in the late 1980s that 16 bits is enough for everyone. The difference is that these early plans were based on rigorously *not* encoding certain characters, or on using combining methodology or variation selection much more aggressively. That might have been more feasible, except for the needs of migrating software and having Unicode-based systems play nicely in a world where character sets had different ideas of what constitutes a character. Allowing thousands of characters for compatibility reasons, more than ten thousand precomposed characters, and many types of other characters and symbols not originally on the radar still has not inflated the numbers all that much. The count stands at roughly double that original goal, after over twenty years of steady accumulation. Was the original concept of being able to shoehorn the world into sixteen bits overly aggressive? Probably, because the estimates had always been that there are about a quarter million written elements. If you took the current repertoire and used code-space-saving techniques in hindsight, you might be able to create something that fits into 16 bits. But it would end up using strings for many things that are now single characters. The numbers, so far, show that this original estimate of a quarter million, rough as it was, appears to be rather accurate: over twenty years of encoding characters have not been enough to exceed it. The million code points are therefore a much more comfortable limit and, from the beginning, assume a ceiling that has ample head-room (as opposed to the "can we fit the world in this shoebox" approach of earlier designs). So, no, the two cases are not really comparable. A./
Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)
On 8/19/2011 3:24 PM, Ken Whistler wrote: On 8/19/2011 2:07 PM, Doug Ewell wrote: Technically, I think 10646 was always limited to 32,768 planes so that one could always address a code point with a 32-bit signed integer (a nod to the Java fans). Well, yes, but it didn't really have anything to do with Java. Remember that Java wasn't released until 1995, but the 10646 architecture dates back to circa 1986. Yep. So more likely it was a nod to C implementations which would, it was supposed, have implemented the 2-, 3-, or 4-octet forms of 10646 with a wchar_t, and which would have wanted a signed 32-bit type to work. I suspect, by the way, that that limitation was probably originally brought to WG2 by the U.S. national body, as they would have been the ones most worried about the C implementations of 10646 multi-octet forms. No, it was the Japanese NB, as represented by the individual from Toppan Printing. This limitation was insisted upon in 1991, after the accord on the merger between Unicode and 10646, when 10646 was changed to use a flat codespace, not the ISO 2022-like scheme. And the original architecture was also not really a full 32K planes in the sense that we now understand planes for Unicode and 10646. The original design for 10646 was for a 1- to 4-octet encoding, with all octets conforming to the ISO 2022 specification. It used the option that the working sets for the encoding octets would be the 94-unit ranges. So for G0: 0x21..0x7E and for G1: 0xA1..0xFE. The other bytes C0, 0x20, 0x7F, C1, 0xA0, 0xFF, were not used except for the single-octet form, as in 2022-conformant schemes still used today for some East Asian character encodings. And the octets were then designated G (group), P (plane), R (row), and C (cell). The 1-octet form thus allowed 95 + 96 = 191 code positions. The 2-octet form thus allowed (94 + 94)^2 = 35,344 code positions. The 3-octet form thus allowed (94 + 94)^3 = 6,644,672 code positions. The Group octet was constrained to the low set of 94. (This is the origin of the constraint to half the planes, which would keep wchar_t implementations out of negative signed range.) The 4-octet form thus allowed 94 * (94 + 94)^3 = 624,599,168 code positions. The grand total for all possible forms was the sum of those values, or *631,279,375* code positions (before various *other* set-asides for plane swapping and private use start getting taken into account). This was so mind-bogglingly complicated that it was a deal breaker for many companies. Unicode's more restrictive concept of a character, its combining technology, and its many other innovations weren't initially seen as its primary benefits by people faced with evaluating the differences between the formal ISO-backed project and the de facto industry collaboration forming around Apple and Xerox. But the flat code space -- now you were talking. Of course, 2.1 billion characters is also overkill, but the advent of UTF-16 was how we ended up with 17 planes. So a lot less than 2.1 billion characters. But I think Doug's point is still valid: 631 million plus code points was still overkill for the problem to be addressed. And I think that we can thank our lucky stars that it isn't *that* architecture for a universal character encoding that we would now be implementing and debating on the alternative universe version of this email list. ;-) Even remembering it makes my head hurt. A./