Re: unicode Digest V12 #108

2011-07-06 Thread Ken Whistler
On 7/6/2011 11:18 AM, Asmus Freytag wrote: The Danes, over a decade ago, when they made the official recommendation to use SHY appear to have come to the conclusion that AA can never occur accidentally, except at word division in compounds. Not really a safe conclusion. :)

Re: What are the issues in having U+FB06 fold to U+FB05?

2011-07-06 Thread Ken Whistler
On 7/6/2011 1:40 PM, Mark Davis ☕ wrote: The other two are special cases; they casefold together because of the way that the full case mapping is computed. Their equivalence is normally captured by a canonical-equivalent folding. Because the simple
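
The folding behaviour mentioned in this snippet can be checked with a short sketch. It assumes Python 3, whose str.casefold() applies Unicode full case folding; the variable names are illustrative and do not come from the thread.

```python
# Sketch: U+FB05 and U+FB06 under full case folding (Python's str.casefold()).
long_s_t = "\ufb05"  # LATIN SMALL LIGATURE LONG S T
st = "\ufb06"        # LATIN SMALL LIGATURE ST

folded_long_s_t = long_s_t.casefold()
folded_st = st.casefold()

# The two ligatures are different code points, but full folding expands
# both to the same two-letter string, so they match after folding.
same_after_folding = folded_long_s_t == folded_st
print(repr(folded_long_s_t), repr(folded_st), same_after_folding)
```

This only illustrates full case folding; it says nothing about the simple (one-to-one) folding that the thread title asks about.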

Re: Proposed Update UAXes for Unicode 6.1

2011-07-08 Thread Ken Whistler
On 7/8/2011 10:26 AM, Philippe Verdy wrote: This is not strictly related to this Unicode version update, but I have an interesting question about the Unicode Stability Policy. Summary: How does it apply to the exact value (or aliases) of the property Decomposition Type (dt), for

Re: Unicode 7.0 goals and ++

2011-07-11 Thread Ken Whistler
On 7/10/2011 4:58 PM, Ernest van den Boogaard wrote: For the long term, I suggest Unicode should aim for this: Unicode 6.5 should claim: There will be a *Unicode dictionary*, limiting and reducing ambiguous semantics within Unicode (Background: e.g. the word character will have one single

Re: Definition of character

2011-07-13 Thread Ken Whistler
On 7/13/2011 12:45 AM, Jukka K. Korpela wrote: For one thing, defining “Unicode character” as a technical term and using it consistently makes it possible to formulate clearly its relation to “character” in the common meaning, thereby helping people to understand and use Unicode better.

Re: Definition of character

2011-07-13 Thread Ken Whistler
On 7/13/2011 1:23 PM, Jukka K. Korpela wrote: I don’t see that biologists use the word “life” in any confusing manner comparable to the Unicode confusion around “character.” “Life” isn’t really a central concept in biology, and its use in biology hardly differs much from everyday use. Defining

Re: Definition of character

2011-07-13 Thread Ken Whistler
Since Jukka seemed to take issue with my responding to his proffered definitions by instead bringing up an analogy between life and character, I'll try responding directly to the attempted clarifications. On 7/13/2011 12:45 AM, Jukka K. Korpela wrote: That’s a completely different issue. The

Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)

2011-07-15 Thread Ken Whistler
On 7/15/2011 11:36 AM, Michael Everson wrote: Look at Figures 8-1 through 8-4 in the Unicode Standard 5.0. We see graphic characters shown, one representing space and two representing joiners. This is plain text. Bt. Thanks for playing! But the correct answer

Productive Glyph Design vs. Productive Character Representation (was: Re: Quick survey of Apple symbol fonts ... )

2011-07-18 Thread Ken Whistler
[changing the thread title to disentangle this issue from the Apple symbol font discussion] On 7/16/2011 1:08 AM, Julian Bradfield wrote: The other two could be proposed as unitary symbols, if anybody really needs to represent them. They are commensurate with a large number of similar symbols

Re: Gaps in Brahmic scripts section of SMP

2011-08-08 Thread Ken Whistler
On 8/2/2011 3:26 PM, stas624-...@yahoo.com wrote: [Mainly aimed at people who can change roadmaps] [I used online feedback form, but got no response, so reposting it here.] Your feedback was forwarded to the roadmap committee, which will consider it in the context of other requests and

Re: How is NBH (U0083) Implemented?

2011-08-08 Thread Ken Whistler
On 8/1/2011 7:26 AM, Naena Guru wrote: This thread wandered off into an argument about whether U+FEFF ZWNBSP or U+2060 WJ is best supported and which should be used to inhibit line breaks. However, there are still several other issues which bear addressing in Naena Guru's questions: The

Re: on proposed new Arab script characters for African lanugages (n3882)

2011-08-12 Thread Ken Whistler
On 8/12/2011 3:19 PM, Lorna Priest wrote: Our original proposal had these unified, but for various reasons we were asked to disunify them. Lorna Original Message Subject: on proposed new Arab script characters for African lanugages (n3882) From: mmarx

Re: Proposed new characters updated in Pipeline Table

2011-08-15 Thread Ken Whistler
On 8/15/2011 10:38 AM, Philippe Verdy wrote: Unicode cannot encode a combining Wasla (because of various stability policies), so if Syriac needs a Wasla to be shown only over a letter or two, one needs to propose precomposed characters for them. Just like the existing Arabic Alef-Wasla.

Re: Greek Characters Duplicated as Latin

2011-08-15 Thread Ken Whistler
On 8/15/2011 8:50 AM, Andreas Prilop wrote: The Ohm sign should have been encoded as another example of squared letters and abbreviations. It comes from Asian character sets, I’d say the ohm sign comes from the MacRoman character set (0xBD).

Re: C1 Control Pictures Proposal

2011-08-17 Thread Ken Whistler
In general, I agree with Doug Ewell's assessment. I don't see a convincing case here for the need to encode more control picture characters for C1 controls. There seems to be a confusion here between the need for glyphs and the need for characters. Also, this would seem to me to be a receding

Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)

2011-08-19 Thread Ken Whistler
On 8/19/2011 2:07 PM, Doug Ewell wrote: Technically, I think 10646 was always limited to 32,768 planes so that one could always address a code point with a 32-bit signed integer (a nod to the Java fans). Well, yes, but it didn't really have anything to do with Java. Remember that Java wasn't
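
The arithmetic behind the 32,768-plane figure can be sketched as follows. The plane count comes from the post; the 65,536 code points per plane and the constant names are assumptions added for illustration.

```python
# Sketch: why 32,768 planes fit in a signed 32-bit integer.
PLANES = 32_768                  # figure quoted in the post
CODE_POINTS_PER_PLANE = 65_536   # assumed: 2**16 code points per plane

total_code_points = PLANES * CODE_POINTS_PER_PLANE
highest_code_point = total_code_points - 1

INT32_MAX = 2**31 - 1            # largest signed 32-bit value
fits_in_signed_32 = highest_code_point <= INT32_MAX
print(hex(highest_code_point), fits_in_signed_32)
```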

Re: Code pages and Unicode

2011-08-19 Thread Ken Whistler
On 8/19/2011 2:53 PM, Benjamin M Scarborough wrote: Whenever somebody talks about needing 31 bits for Unicode, I always think of the hypothetical situation of discovering some extraterrestrial civilization and trying to add all of their writing systems to Unicode. I imagine there would be

Re: Code pages and Unicode

2011-08-22 Thread Ken Whistler
On 8/22/2011 9:58 AM, Jean-François Colson wrote: I wonder whether you aren’t a little too optimistic. No. If anything I'm assuming that the folks working on proposals will be amazingly assiduous during the next decade. Have you considered the unencoded ideographic scripts? Why, yes I

ALM (was: Re: RTL PUA?)

2011-08-22 Thread Ken Whistler
On 8/21/2011 3:31 PM, Richard Wordingham wrote: I expect ARABIC LANGUAGE MARK would not go down well - has it already been proposed and rejected? ARABIC *LETTER* MARK, not *LANGUAGE* mark. (And suggested to just be renamed to AL MARK.) Proposed? Yes. Discussed? Yes. Rejected? No. The last

Re: Code pages and Unicode

2011-08-22 Thread Ken Whistler
On 8/22/2011 3:15 PM, Richard Wordingham wrote: On Monday 22 August 2011, Andrew West <andrewcw...@gmail.com> wrote: Can anyone think of a way to extend UTF-16 without adding new surrogates or inventing a new general category? Andrew How about a triple sequence of two

Re: Code pages and Unicode

2011-08-24 Thread Ken Whistler
On 8/24/2011 10:48 AM, Richard Wordingham wrote: Those are two different claims. 'Never say never' is a useful maxim. So is "Leave well enough alone." The problem would be in using maxims instead of an analysis of engineering requirements to drive architectural decisions. The extension of

Re: Code pages and Unicode

2011-08-24 Thread Ken Whistler
On 8/24/2011 3:51 PM, Richard Wordingham wrote: Well, in that case, the correct action is to work to ensure that code points are not squandered. Have there not already been several failures on that front? The BMP is littered with concessions to the limitations of rendering systems -

Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-08-26 Thread Ken Whistler
On 8/26/2011 3:13 PM, Philippe Verdy wrote: Isn't there an intersection between NameAliases.txt proposed in PRI202, and the informational table defined for UTR #25 at http://www.unicode.org/Public/math/revision-12/MathClassEx-12.txt which also lists other name aliases for other standards ? No.

Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-08-26 Thread Ken Whistler
On 8/26/2011 5:01 PM, Philippe Verdy wrote: "we could as well include..." are dangerous words here. Going encyclopedic is *completely* at odds with the normative intention of NameAliases.txt. Your statement then contradicts what PRI 202 says: the intent is to add various standard and de facto

Re: Continue: Glaring Mistake in the Code List of South Asian Script, Reply to Daug Ewell and Others

2011-09-12 Thread Ken Whistler
On 9/12/2011 9:13 AM, Philippe Verdy wrote: Well, wasn't the ISCII standard naming the script Bengali? It also gave the name Assamese, but was it a synonym or did it require a separate codepage switching code ? They were separate. Annex A of ISCII 1991 shows Bengali (BNG) and Assamese (ASM)

Re: Noticed improvement in the Code chart link http://www.unicode.org/charts/

2011-09-28 Thread Ken Whistler
On 9/28/2011 12:12 PM, delex r wrote: Not possible. Character and block names cannot be changed once they are assigned. It's two decades too late to make that change. The most that can be done now is adding a few annotations for Assamese. —Ben Scarborough ...It's two decades too late to

Re: definition of plain text

2011-10-14 Thread Ken Whistler
On 10/13/2011 10:49 PM, Peter Cyrus wrote: Is there a definition or guideline for the distinction between plain text and rich text? I think where you may be getting hung up is trying to define plain text versus rich text in terms of the content and/or appearance of the text (i.e. the

Re: definition of plain text

2011-10-14 Thread Ken Whistler
On 10/14/2011 11:47 AM, Joó Ádám wrote: Peter asked for what the Unicode Consortium considers plain text, i.e. what principles it applies when deciding whether to encode a certain element or aspect of writing as a character. In turn, you thoroughly explained that plain text is what the Unicode

Re: definition of plain text

2011-10-17 Thread Ken Whistler
On 10/17/2011 1:23 AM, Peter Cyrus wrote: Perhaps the idea of something embedded in the text that then controls the display of the subsequent run of text is the very definition of markup, whether or not that markup is a special character or an ASCII sequence like </span><span style=gait:xxx;

Re: Yiddish digraphs

2011-10-19 Thread Ken Whistler
On 10/19/2011 12:08 PM, Mark E. Shoulson wrote: I think the issue here is (probably) a matter of legacy encodings, though someone else would need to confirm that. O.k., as self-appointed historian of the standard, I guess I need to be the one to answer that. ;-) The Yiddish digraphs were

Re: Default bidi ranges

2011-11-09 Thread Ken Whistler
On 11/9/2011 9:30 AM, Asmus Freytag wrote: On 11/9/2011 1:18 AM, Martin J. Dürst wrote: I tried to find something like a normative description of the default bidi class of unassigned code points. In UTR #9, it says

Economic Self-Interest (was: Re: combining: half, double, triple et cetera ad infinitum)

2011-11-14 Thread Ken Whistler
On 11/14/2011 2:39 PM, Naena Guru wrote: On the other hand, no company would send people to work at Unicode if they did not have an economic interest. One might as well rephrase that as: No company would send people to work at *any standard* if they did not have an economic interest. And

Re: missing characters: combining marks above runs of more than 2 base letters

2011-11-18 Thread Ken Whistler
On 11/17/2011 11:28 PM, Philippe Verdy wrote: Could the Unicode text specify that a left half mark, when it is followed by a right half-mark on the same line, has to be joined ? And which character can we select in a font to mark the intermediate characters between them ? No. This kind of

Re: missing characters: combining marks above runs of more than 2 base letters

2011-11-18 Thread Ken Whistler
On 11/18/2011 11:21 AM, Peter Cyrus wrote: Ken, you mention defined markup constructions, but nothing would prevent specialized rendering software from, for example, connecting a left half mark with the corresponding right half mark via titlo, even though the text is still only plain text with

Re: more flexible pipeline for new scripts and characters

2011-11-18 Thread Ken Whistler
On 11/18/2011 1:30 PM, Karl Williamson wrote: How is this different from Named sequences, which are published provisionally? Named sequences aren't character properties. When a newly encoded character is published in the standard, its code point, its name, and dozens of other properties all

Re: missing characters: combining marks above runs of more than 2 base letters

2011-11-18 Thread Ken Whistler
On 11/18/2011 5:24 PM, Philippe Verdy wrote: This arc in the example is definitely NOT mathematics Nor did I say it was. (even if you have read a version where it was attempted to represent it using a Math TeX notation in this page, an obvious error because it used an angular \widehat and

Re: missing characters: combining marks above runs of more than 2 base letters

2011-11-18 Thread Ken Whistler
On 11/18/2011 5:36 PM, Philippe Verdy wrote: I have absolutely no clear way to represent sequences like in this example that use such elongated diacritic applied to runs of more than two characters. Nor should you expect to be able to represent such things in plain text. Such conventions are

Re: name change

2011-11-22 Thread Ken Whistler
On 11/22/2011 11:02 AM, a...@peoplestring.com wrote: In one of the discussions in this community, it was stated that once assigned, the name of a character cannot be changed. But I have noticed some characters have their name changed eg 'ARABIC LETTER YEH BARREE' (U+06D2) was previously named

Re: Archaic Pashto letter

2011-12-09 Thread Ken Whistler
On 12/9/2011 9:06 AM, Andreas Prilop wrote: Arabic letter U+0682 shows two dots above. It has the cryptic remark not used in modern Pashto. But was it ever used? To understand where the cryptic remark came from, you need to know more about the history of the character in the standard. U+0682

Re: Upside Down Fu character

2012-01-09 Thread Ken Whistler
On 1/9/2012 12:23 PM, Asmus Freytag wrote: So, my question remains, are there any other avenues besides hot-metal printed text I assume that was an exaggeration for rhetorical effect -- since hot-metal printing technology went out half a century ago, replaced first by phototypesetting and then

Re: Encoding Georgian and Nuskhuri letters for Ossetian and Abkhaz

2012-01-17 Thread Ken Whistler
On 1/17/2012 4:43 AM, satai wrote: I would like to address two textual issues in this proposal. These are not actually textual issues in the *proposal*, but rather issues regarding the annotation of the code charts for these additions. 1) U+10C8—U+10CC and U+2D28—U+2D2C are marked as

Re: UCA tertiary weight assignment vs. decomposition type definition in Unicode character database

2012-01-27 Thread Ken Whistler
On 1/27/2012 1:16 PM, Matt Ma wrote: Hi, There are a few characters having no decomposition type defined in UnicodeData.txt, but they were assigned tertiary weight in allkeys.text as if the characters had decomposition type. Here are a few examples (version 6.0.0), ... U+A733, U+A732,

Re: Question on U+33D7

2012-02-23 Thread Ken Whistler
On 2/23/2012 2:44 PM, António Martins-Tuválkin wrote: It is defined as 33D7;SQUARE PH;So;0;L;<square> 0050 0048;;;;N;SQUARED PH;;;; in UnicodeData.txt, but it is shown as pH in code chart. Should it be 0070 0048 or PH? It should certainly be pH, i.e., <square>0070 0048</square>, because that's
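
The record quoted in this post can be inspected with Python's standard unicodedata module. This is a minimal sketch, assuming the Unicode Character Database bundled with the interpreter still carries the same decomposition; the variable names are illustrative.

```python
import unicodedata

# Sketch: look at the properties of U+33D7 as shipped with Python.
ch = "\u33d7"

name = unicodedata.name(ch)
decomposition = unicodedata.decomposition(ch)   # tag plus hex code points
nfkc = unicodedata.normalize("NFKC", ch)        # compatibility normalization

# The mapping targets U+0050 and U+0048 (uppercase P and H),
# which is the mismatch with the chart glyph that the post discusses.
print(name, "|", decomposition, "|", nfkc)
```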

Re: Combining latin small letters with diacritics

2012-03-05 Thread Ken Whistler
On 3/5/2012 11:44 AM, Philippe Verdy wrote: So what do you propose ? It doesn't matter what *Michael* proposes at this point. These have already been approved by both the UTC and WG2 and are currently in DAM ballot. - Encoding the new precomposed pairs as a new combining character (there may

Re: Combining latin small letters with diacritics

2012-03-05 Thread Ken Whistler
On 3/5/2012 11:56 AM, Philippe Verdy wrote: Note that the first alternative is the one used in the DAM for encoding a separate COMBINING LATIN SMALL LETTER A/O/U WITH DIAERESIS Correct. But the document cited by Denis gives a much more productive way that allows stacking any kind of letters

Re: Combining latin small letters with diacritics

2012-03-05 Thread Ken Whistler
On 3/5/2012 12:17 PM, Benjamin M Scarborough wrote: On Mon, Mar 5, 2012 at 19:09, Michael Everson wrote: No, because both the combining-a and the combining-diaeresis are bound to the base letter; the combining diaeresis is not bound to the combining-a. Just like the proposed U+1ABB COMBINING

Re: Combining latin small letters with diacritics

2012-03-05 Thread Ken Whistler
On 3/5/2012 12:51 PM, Philippe Verdy wrote: You are so much attached to keep the existing encoding model unchanged, Yep. That's why I work on *standards*, after all. that now you are going to prepare for LOTS of additions of combining Latin characters with diacritics... The BMP won't be

Re: Combining latin small letters with diacritics

2012-03-05 Thread Ken Whistler
On 3/5/2012 2:01 PM, Denis Jacquerye wrote: Wouldn't CGJ be useful in some way in cases like that of the cedilla or the light centralization stroke 1AB9 ? Base character + combining letter + CGJ + combining cedilla would be clear, the cedilla would not be moved. How is that simpler than Base

Re: Combining latin small letters with diacritics

2012-03-05 Thread Ken Whistler
On 3/5/2012 2:32 PM, Denis Jacquerye wrote: I guess it's less messy than other situations. I just couldn't help wondering why combining letters with diacritics are being encoded but letters with diacritics are out of the question. Because the combining ones are *not* decomposed, and hence don't

Re: Combining latin small letters with diacritics

2012-03-06 Thread Ken Whistler
On 3/6/2012 2:34 PM, Leo Broukhis wrote: On 3/6/12, Doug Ewell <d...@ewellic.org> wrote: Speaking of U+17D2 KHMER SIGN COENG, what is a conforming renderer to do if someone writes A្B ? (U+0041 U+17D2 U+0042) Roll its eyes? I guess :), but how should it look on the screen? Just the way your

Fallback Display for COENG (was: Re: Combining latin small letters with diacritics)

2012-03-06 Thread Ken Whistler
On 3/6/2012 3:19 PM, Leo Broukhis wrote: On 3/6/12, Ken Whistler <k...@sybase.com> wrote: On 3/6/2012 2:34 PM, Leo Broukhis wrote: On 3/6/12, Doug Ewell <d...@ewellic.org> wrote: Speaking of U+17D2 KHMER SIGN COENG, what is a conforming renderer to do if someone writes A្B ? (U+0041 U+17D2

Re: Fallback Display for COENG

2012-03-06 Thread Ken Whistler
On 3/6/2012 4:25 PM, Leo Broukhis wrote: What about Grapheme_Extend class characters placed out of context? It would be nice to see a dotted box in cases like AׁB (U+0041 U+05C1 HEBREW POINT SHIN DOT U+0042) That is pretty much up to the rendering system or font designer. --Ken

Re: Connector Punctuation and Overlines

2012-03-07 Thread Ken Whistler
On 3/6/2012 8:27 PM, fantasai wrote: Unicode has a Pc category into which it assigns various low lines: _ U+005F LOW LINE, ‿ U+203F UNDERTIE, ⁀ U+2040 CHARACTER TIE, ⁔ U+2054 INVERTED UNDERTIE. Those 4 are the actual connectors. The concept arose because of the
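
The General_Category of the four characters listed in this post can be confirmed with a small sketch, assuming Python's bundled unicodedata module. The list and dictionary names are made up for the example.

```python
import unicodedata

# Sketch: General_Category of the four connectors named in the post.
connectors = ["\u005f", "\u203f", "\u2040", "\u2054"]

categories = {
    "U+%04X %s" % (ord(c), unicodedata.name(c)): unicodedata.category(c)
    for c in connectors
}

for label, category in categories.items():
    print(label, category)
```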

Re: Klingon on Unicode site?

2012-04-03 Thread Ken Whistler
On 4/3/2012 9:51 AM, Shawn Steele wrote: My assumption is the page uses JS to get the dates? Since my user locale happened to be set to Klingon, that’s what it displayed. Exactly. There is a call to: Date(document.lastModified).toLocaleString() in the Javascript. So for those who assumed

Re: Three character canonical decompositions in version 2 releases

2012-04-04 Thread Ken Whistler
On 4/3/2012 6:57 PM, Karl Williamson wrote: Is it an error on the web site that this policy was in effect in 2.0, and it really should be 3.0? (as there no such decompositions in the data files starting in 3.0). Yes. Or were these data files defective? No. The research to determine how

Re: Origins of ẘ

2012-04-16 Thread Ken Whistler
On 4/15/2012 10:04 PM, Asmus Freytag wrote: The 1E00 and 1F00 blocks were populated, in Unicode 1.1 by rejects from Unicode 1.0 that were re-admitted as part of the merger with ISO/IEC 10646. If you have anyone with access to the early (paper only) meeting documents of WG2, you might, just

Re: Support for non-BMP characters

2012-04-25 Thread Ken Whistler
On 4/25/2012 6:55 AM, Juanma Barranquero wrote: Ada 2012 is adding (quoting from the ARM): A.4.11 String Encoding [...] {AI05-0137-2} {AI05-0262-1} The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds to the UTF-8 encoding scheme defined by Annex D of ISO/IEC 10646. UTF_16BE

Re: Kaktovik Inupiaq numerals

2012-04-27 Thread Ken Whistler
On 4/27/2012 10:45 AM, Richard Wordingham wrote: If they are to be adopted by the CLDR, the digits need to be coded consecutively. I doubt this matters in any case, because this proposed use is for a vigesimal system, which has digits 0..19, not digits 0..9. Trying to treat the first 10 digits
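
To make the vigesimal point concrete, here is a minimal sketch of positional base-20 conversion. The function name is hypothetical and the output is a list of digit values 0..19, not encoded characters.

```python
# Sketch: express a non-negative integer as base-20 digit values (0..19).
def to_base20_digits(n):
    if n < 0:
        raise ValueError("only non-negative integers are handled")
    if n == 0:
        return [0]
    digits = []
    while n:
        n, digit = divmod(n, 20)   # peel off the least significant digit
        digits.append(digit)
    return digits[::-1]            # most significant digit first

print(to_base20_digits(399))   # two digits, both with value 19
print(to_base20_digits(400))   # rolls over to three digits
```

A system like this needs twenty distinct digit symbols, which is why treating only the first ten as decimal digits does not describe it.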

Re: Writing Babylonian Numbers in Unicode

2012-04-30 Thread Ken Whistler
On 4/30/2012 3:33 PM, Richard Wordingham wrote: One is not compelled to construct U+3039 (〹) 'twenty' from two U+3038 (〸) 'ten', so a CUNEIFORM TWO U may well be missing. It looks as though it is. No, it isn't. It was present in Proposal N2664

Re: [unicode] Re: Canadian aboriginal syllabics in vertical writing mode

2012-05-01 Thread Ken Whistler
On 5/1/2012 11:19 AM, Michael Everson wrote: It does not matter if sideways text can be read as words, or just as gibberish. Good practice and typographic design will not rotate syllabic text because of the inherent confusability. Michael has a generally valid point. Rotating *small*

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-16 Thread Ken Whistler
On 5/16/2012 2:54 PM, Richard Wordingham wrote: Similar remarks apply to 'reorder'. What if I move 'Q' and 'q' into the Cyrillic sequence? (I've a recollection that this letter is used in Kurdish written in Cyrillic.) Obsolete recollection. See: 051A;CYRILLIC CAPITAL LETTER

Re: CaseFirst and CaseLevel Tailorings of UCA and LDML

2012-05-21 Thread Ken Whistler
On 5/21/2012 4:37 PM, Richard Wordingham wrote: Again, even the interpretation of uppercase in terms of weights is not certain, for the ISO/IEC 14651:2007 example of a tailoring for uppercase first does not adjust the collation elements with a tertiary weight of 1C, although they are listed as

Agile Processing for Standard Encoding of New Currency Signs (was: Re: Unicode 6.2 to Support the Turkish Lira Sign)

2012-05-23 Thread Ken Whistler
On 5/23/2012 7:05 AM, William_J_G Overington asked: For example, if a situation arose where a fast timetable is set for introducing one or more new currencies, each with a new currency symbol, is there a contingency plan in place such that what is presently set to be called Unicode 6.2 becomes

Re: Shift-JIS encoded text (was: RE: Tags and future new technologies [...])

2012-06-01 Thread Ken Whistler
On 6/1/2012 1:51 PM, Doug Ewell wrote: At what point does text encoded in a vendor's private-use extension to Shift-JIS become Shift-JIS encoded text? A possibly less confusing way to put this is: At what point does text encoded in a vendor's private-use extension to *JIS X 0208* become

Re: Are Named sequences always going to be graphemes?

2012-06-20 Thread Ken Whistler
On 6/20/2012 3:22 PM, Karl Williamson wrote: All current named sequences appear to be each a single grapheme. That seems like it should always be the case. Possibly, but keep in mind that neither the Unicode Standard nor UAX #29 in particular define what a grapheme is. UAX #29 specifies an

Re: Unicode Core

2012-06-22 Thread Ken Whistler
On 6/21/2012 11:22 PM, Julian Bradfield wrote: So, as long as code charts create production issues, print-on-demand for them is effectively not feasible. My hard-copy of the code charts was printed by Lulu - they're too big to print out on my office laserprinters! The only issue was joining

Re: Unicode Core

2012-06-22 Thread Ken Whistler
On 6/22/2012 3:55 PM, John H. Jenkins wrote: Wait a minute. Isn't 6.2 just adding the Turkish Lira? Does that really take the chart people more than about 10 minutes? The only *character* change is the Turkish lira. There are numerous updates to UAXes and other parts of the

Re: Too narrowly defined: DIVISION SIGN COLON

2012-07-10 Thread Ken Whistler
On 7/10/2012 4:22 PM, Mark Davis ☕ wrote: I would disagree about the preference for ratio; I think it is a historical accident in Unicode. Not really. The following pairs dating from Unicode 1.0 were deliberate: U+002D HYPHEN-MINUS U+2212 MINUS SIGN U+002F SOLIDUS (Unicode 1.0 called it

Re: BOM ambiguity?

2012-07-13 Thread Ken Whistler
On 7/13/2012 1:54 PM, Stephan Stiller wrote: So there is a BOM-ambiguity when a file starts with FF FE and then a couple of U+0000 characters, yes? Because this could be either UTF-16 or UTF-32 under little-endianness. Has this been pointed out and discussed beforehand? No, there is
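
The byte pattern raised in the question can be examined directly. This sketch assumes Python's built-in "utf-16" and "utf-32" codecs, which consume a leading BOM; it only shows that the same bytes are well formed under both readings and takes no position on how a detector should choose.

```python
# Sketch: the bytes FF FE 00 00 read under two BOM-aware codecs.
data = b"\xff\xfe\x00\x00"

# Read as UTF-16: FF FE is a little-endian BOM, then 00 00 is U+0000.
as_utf16 = data.decode("utf-16")

# Read as UTF-32: FF FE 00 00 is a little-endian BOM and nothing follows.
as_utf32 = data.decode("utf-32")

print(repr(as_utf16), repr(as_utf32))
```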

Re: CLDR and ICU

2012-07-25 Thread Ken Whistler
On 7/25/2012 5:01 PM, Richard Wordingham wrote: What is the formal relationship between the Common Locale Data Repository (CLDR) and International Components for Unicode (ICU)? ... The ICU implementation of collation tailoring for changed ordering is bizarre in some complicated cases. (Life

Re: CLDR and ICU

2012-07-26 Thread Ken Whistler
On 7/26/2012 1:21 PM, Richard Wordingham wrote: I thought the Unicode Consortium had a formal policy of forbidding untrue (or misleading) claims of conformance to Unicode standards. No. What would be the point? Voluntary standards organizations have no mechanism for policing compliance. Sure,

Claims of Conformance (was: Re: CLDR and ICU)

2012-07-26 Thread Ken Whistler
On 7/26/2012 4:20 PM, Richard Wordingham wrote: Perhaps I've read too much into http://www.unicode.org/policies/logo_policy.html . The implication is that untrue or misleading claims using the word 'Unicode' are contravening the trademark. That's more on the level of making sure that when

Re: Claims of Conformance

2012-07-26 Thread Ken Whistler
On 7/26/2012 5:32 PM, Asmus Freytag wrote: However, such a misleading claim might subject someone to civil suit, don't you think? Sure, if someone could make a reasonable case that the misleading claim led to damages and wanted to litigate. But that isn't something that the Unicode Consortium

Re: U+25CA LOZENGE - why is it in the Mac OS Roman character set (and therefore widespread in current fonts)?

2012-08-13 Thread Ken Whistler
On 8/13/2012 10:11 AM, Peter Edberg wrote: I do not believe it was for accounting, logic, or mathematical use. It was included in the original Macintosh character set as shown in Figure 2 of the Font Manager chapter of Inside Macintosh, volume I (1985), but was not included in the shaded

Re: U+25CA LOZENGE - why is it in the Mac OS Roman character set (and therefore widespread in current fonts)?

2012-08-13 Thread Ken Whistler
On 8/13/2012 12:50 PM, Asmus Freytag wrote: In that context, you can't distinguish a lozenge from a squished diamond (*) from a diamond suit symbol. While the character is one of a set, it was not uncommon to have people make do with somewhat similar characters standing in for each other.

Re: Why no combining‐character form for U+00F8?

2012-08-16 Thread Ken Whistler
On 8/16/2012 9:32 AM, Erkki I Kolehmainen wrote: Although the stroke is not a diacritic, keyboard drivers can be made to generate atomic characters with stroke by using a dead letter key for stroke together with the base character. And in addition to this observation by Erkki, it is also the

Re: Wrong plane numbers

2015-02-06 Thread Ken Whistler
Markus has already explained this. But the following explanation fills out some details. These @@ lines are conveniences for chart production. They are headers read by the unibook chart layout tool, which help guide where chart layout for a block starts and stops. The @@ lines are *NOT* block

Re: Unicode block for programming related symbols and codepoints?

2015-02-09 Thread Ken Whistler
I think this discussion is confusing the need for separate syntactic functions in formal language definitions with the need for *encoding* of characters. The distinction between assignment and test for equality has been around for decades in formal languages, and of course it is almost always

Language tags redux (was: Re: About cultural/languages communities flags)

2015-02-13 Thread Ken Whistler
Philippe may have overlooked the fact that this has been tried (years ago) in the Unicode Standard. See: language tags. http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf#G26419 The syntax for those even goes beyond just ISO 639-2/3 to incorporate the full range of BCP 47 tags, in

Re: About cultural/languages communities flags

2015-02-09 Thread Ken Whistler
To follow up on Doug Ewell's response, the mechanism currently standardized in the Unicode Standard for regional indicator codes has an interpretation tied to the two-letter codes of ISO 3166-1, and *not* to TLD's. The two are not directly connected. If anyone really wants to pursue getting a
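
The mapping from a two-letter code to a pair of regional indicator symbols can be sketched as below. The function name is hypothetical, and the sketch checks only the shape of the input (two letters A to Z); it does not check the code against the ISO 3166-1 list.

```python
# Sketch: map a two-letter code to two REGIONAL INDICATOR SYMBOL characters.
REGIONAL_INDICATOR_A = 0x1F1E6   # REGIONAL INDICATOR SYMBOL LETTER A

def regional_indicators(alpha2):
    if len(alpha2) != 2 or not all("A" <= c <= "Z" for c in alpha2):
        raise ValueError("expected two uppercase ASCII letters")
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in alpha2
    )

pair = regional_indicators("FR")
print(["U+%04X" % ord(c) for c in pair])
```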

Re: Meroitic cursive fractions numerical values

2015-03-28 Thread Ken Whistler
On 3/28/2015 1:05 PM, Karl Williamson wrote: In the 8.0 Beta files, some numerical values are not reduced to their lowest forms. Is there a compelling reason that 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;; is not written as 109FB;MEROITIC CURSIVE FRACTION SIX
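
A consumer of the data file can reduce such a value itself. This is a minimal sketch using Python's standard fractions module on the string from the quoted record; it does not read the actual beta file.

```python
from fractions import Fraction

# Sketch: an unreduced Numeric_Value string, as quoted in the post.
raw_value = "6/12"

# Fraction parses "numerator/denominator" and reduces to lowest terms.
value = Fraction(raw_value)

print(raw_value, "->", value, "=", float(value))
```

The name of the character states the unreduced form (six twelfths), while the numeric value is the same number either way.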

Re: Usage stats?

2015-03-27 Thread Ken Whistler
Search engine companies (and in particular, Google) have such information squirreled away in their index databases, at least as far as usage stats for Unicode characters on the web go -- but it is proprietary information, and they generally don't publish information about such statistics.

Re: Origin of the digital encoding of accented characters for Esperanto

2015-03-23 Thread Ken Whistler
On 3/23/2015 8:35 AM, William_J_G Overington wrote: Origin of the digital encoding of accented characters for Esperanto Twelve accented characters (uppercase versions and lowercase versions of six accented letters) used for Esperanto are encoded in Unicode. WJO is referring to U+0109,

Re: Origin of the digital encoding of accented characters for Esperanto

2015-03-23 Thread Ken Whistler
For ISO 8859-3, the answer is in the wiki: http://en.wikipedia.org/wiki/ISO/IEC_8859-3 It was designed to cover Turkish, Maltese and Esperanto, ... The answer for IBM CP905 is simple -- it is simply the EBCDIC code page of June, 1986 that corresponded to ISO 8859-3. That also covers the answer

Re: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?

2015-04-28 Thread Ken Whistler
Taking this thread back to the original question... The Line_Break property values for halfwidth katakana (lb=AL) and regular katakana (lb=ID) have been stable since they were first defined for Unicode 3.0 -- 15 years ago. Regardless of whether lb=AL is the optimal assignment for the halfwidth

Re: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?

2015-05-01 Thread Ken Whistler
Suzuki-san, On 5/1/2015 8:25 AM, suzuki toshiya wrote: Excuse me, there is any discussion record how UAX#14 class for halfwidth-katakana in 15 years ago? If there is such, I want to see a sample text (of halfwidth-katakana) and expected layout result for it. The *founding* document for the

A few emoji per year... (was: Re: Tag characters)

2015-05-15 Thread Ken Whistler
And to put Mark's comments in some statistical perspective, in the context of all the media hype, the true big bang for emoji in Unicode was Version 6.0, released over 4-1/2 years ago now. *That* was the Unicode release that added hundreds and hundreds of emoji for Japanese carrier

Custom characters (was: Re: Private Use Area in Use)

2015-06-03 Thread Ken Whistler
On 6/3/2015 5:17 PM, John wrote: so what? There should be a standard way to put custom characters anywhere that characters belong and have things “just work”. Well, that's the rub, isn't it? We (in IT) are still working pretty dang hard on the simpler problem, to wit: There

Re: Why aren't the emoji modifiers GCB=Extend?

2015-06-19 Thread Ken Whistler
Karl, This results from the fact that the fallback behavior for the modifiers is simply as independent pictographic blorts, i.e. the color swatch images. That is also related to why they are treated as gc=Sk symbol modifiers, rather than as combining marks or format characters. If you *support*
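
The gc=Sk assignment can be inspected with a short Python sketch. It assumes the interpreter's `unicodedata` tables are at Unicode 8.0 or later, and that the modifiers in question are U+1F3FB..U+1F3FF; that range is supplied here for illustration and is not quoted from the message.

```python
import unicodedata

# Assumed range of the emoji modifiers (the color swatch characters).
EMOJI_MODIFIERS = [chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)]

# General_Category per modifier, keyed by code point label.
categories = {
    f"U+{ord(ch):04X}": unicodedata.category(ch) for ch in EMOJI_MODIFIERS
}

# Canonical combining class; 0 means "not a reordering combining mark".
combining = [unicodedata.combining(ch) for ch in EMOJI_MODIFIERS]

for label, gc in categories.items():
    print(label, gc)
```

A General_Category of Sk with combining class 0 is consistent with the description above: symbol modifiers rather than combining marks or format characters.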

Re: trying to understand the relationship between the Version 1 Hangul syllables and the later versions'

2015-06-19 Thread Ken Whistler
Karl, As usual, the situation is way more complicated than perhaps it has any business being! It isn't just Version 1 Hangul that have to be considered, but also Version 1.1 Hangul. Version 1.0 contained 2350 Hangul syllables, encoded in the range 3400..3D2D. Version 1.1 contained 6646
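
The Version 1.0 figures above are self-consistent, as a quick arithmetic check shows. This is only a sketch; the variable names are illustrative.

```python
# Version 1.0 Hangul syllable range as quoted in the message.
first, last = 0x3400, 0x3D2D

# An inclusive range holds (last - first + 1) code points.
count = last - first + 1
print(count)
```

The inclusive range 3400..3D2D holds exactly the 2350 code points stated for Version 1.0.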

Re: trying to understand the relationship between the Version 1 Hangul syllables and the later versions'

2015-06-24 Thread Ken Whistler
the early 1990's might know, however. --Ken On 6/24/2015 1:03 PM, Karl Williamson wrote: On 06/19/2015 04:12 PM, Ken Whistler wrote: The Unicode 2.0 set of 11,172 was known as the Johab set from KS C 5601-1992. That was an algorithmically designed replacement of the earlier sets from Korean
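
The figure 11,172 follows from the algorithmic design: 19 leading consonants times 21 vowels times 28 trailing positions (27 consonants plus "none"). The sketch below uses the constants from the Unicode Standard's Hangul syllable composition arithmetic; the function name `compose_syllable` is illustrative.

```python
# Constants from the Unicode Hangul syllable composition arithmetic.
S_BASE = 0xAC00   # first precomposed syllable in Unicode 2.0 and later
L_COUNT = 19      # leading consonants
V_COUNT = 21      # vowels
T_COUNT = 28      # trailing consonants, including "no trailing consonant"
S_COUNT = L_COUNT * V_COUNT * T_COUNT


def compose_syllable(l_index: int, v_index: int, t_index: int = 0) -> str:
    """Map zero-based (L, V, T) indices to the precomposed syllable."""
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT
            and 0 <= t_index < T_COUNT):
        raise ValueError("jamo index out of range")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


first_syllable = compose_syllable(0, 0, 0)
last_syllable = compose_syllable(L_COUNT - 1, V_COUNT - 1, T_COUNT - 1)
```

The first and last results bound the block AC00..D7A3, which contains S_COUNT code points.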

Unicode Terms of Use Clarification (was: Re: free download of ISO/IEC 10646)

2015-06-11 Thread Ken Whistler
and specifications in the development of products, but to discourage attempts to use the data in nonconformant or otherwise misleading implementations that would undermine the intended open interoperability of the Unicode Standard for all. Clear? --Ken Whistler, Technical Director, Unicode, Inc. On 6

Re: Arrow dingbats

2015-05-28 Thread Ken Whistler
Michel Suignard (editor of ISO/IEC 10646) responded to these questions, but let me augment his response with some more detailed history here. (Pardon the length of the reply, but these things tend never to be as simple as people assume and hope they are.) On 5/28/2015 2:08 PM, Chris wrote: So

Re: Tag characters

2015-05-27 Thread Ken Whistler
Doug, Read on in the minutes to the next day. 143-C27 and related actions. There are a few things to keep in mind here. 1. The un-deprecation of the tags U+E0020..U+E007E *is* part of the UCD for Unicode 8.0. The change has already taken place in the revised beta files now posted (see
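
For orientation, the tag characters U+E0020..U+E007E mirror printable ASCII at a fixed offset of 0xE0000. A minimal sketch of that mapping follows; the helper names `to_tags` and `from_tags` are hypothetical, and the sketch does not reproduce the syntax of any particular tag sequence proposal.

```python
TAG_OFFSET = 0xE0000  # U+E0020..U+E007E shadow ASCII 0x20..0x7E


def to_tags(ascii_text: str) -> str:
    """Map printable ASCII characters to the corresponding tag characters."""
    out = []
    for ch in ascii_text:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError(f"not printable ASCII: {ch!r}")
        out.append(chr(ord(ch) + TAG_OFFSET))
    return "".join(out)


def from_tags(tagged: str) -> str:
    """Inverse mapping: tag characters back to printable ASCII."""
    out = []
    for ch in tagged:
        if not 0xE0020 <= ord(ch) <= 0xE007E:
            raise ValueError(f"not a tag character: U+{ord(ch):04X}")
        out.append(chr(ord(ch) - TAG_OFFSET))
    return "".join(out)


tagged = to_tags("us")
```

Here `tagged` holds two code points, U+E0075 and U+E0073, and `from_tags(tagged)` returns the original string.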

Re: Some questions about Unicode's CJK Unified Ideograph

2015-05-29 Thread Ken Whistler
On 5/29/2015 5:20 PM, gfb hjjhjh wrote: 1. I have seen a Chinese character ⿰言亜 from a Vietnamese dictionary NHAT DUNG THUONG DAM DICTIONARY** So, a.) In http://www.unicode.org/alloc/Pipeline.html , it shows that CJK Extension E and F have already been accepted, but where can I check

Re: Tag characters and in-line graphics (from Tag characters)

2015-06-02 Thread Ken Whistler
On 6/2/2015 2:01 AM, William_J_G Overington wrote: Local glyph memory, for use in compressing a document where the same glyph is used two or more times in the document: Um, that technology already exists. It is called a font. A mechanism to be able to use the method to define a glyph

Re: Adding RAINBOW FLAG to Unicode

2015-07-03 Thread Ken Whistler
On 7/2/2015 5:56 PM, Peter Constable wrote: Erkki, in this case, I think Philippe is making valid points. -For the proposal to be workable requires some means of ensuring stability of encoded representations. The way this would be done would be for CLDR to provide data with all valid

Re: PRI #299

2015-07-03 Thread Ken Whistler
On 7/3/2015 9:14 PM, Leo Broukhis wrote: On Fri, Jul 3, 2015 at 12:50 PM, Doug Ewell d...@ewellic.org wrote: Leo Broukhis leob at mailcom dot com wrote: What I don't like about PRI #399 is its proposing to use default-ignorable characters. On a non-vexillology-aware platform, I'd like to

Re: Adding RAINBOW FLAG to Unicode

2015-06-29 Thread Ken Whistler
Noah, Additional information you should have is that the UTC is about to publish a new Public Review Issue on the topic of an extended mechanism for the representation of more flag emoji with sequences of tag characters. (Note: *not* representation as encoded single character symbols.) That

Re: Adding RAINBOW FLAG to Unicode

2015-07-02 Thread Ken Whistler
On 7/2/2015 2:01 AM, Philippe Verdy wrote: The frozen status of Antarctica ... ... will be addressed separately by global warming. But be that as it may... In reality there's still no standard way to encode flags unambiguously and in a stable way. We'd like to have FOTW (Flags of the
