Re: U+23D0 VERTICAL LINE EXTENSION

2003-07-23 Thread Kenneth Whistler
Jim Allan answered Alan Wood's question: Alan Wood posted on U+23D0 VERTICAL LINE EXTENSION: Is it intended as a Unicode replacement for Vertical arrow extender in Symbol font? Yes. See http://std.dkuug.dk/JTC1/SC2/WG2/docs/n2508.htm for the proposal. The Unicode manual should

Re: Yerushala(y)im - or Biblical Hebrew

2003-07-23 Thread Kenneth Whistler
Peter Kirk cited Paul Nelson: On 23/07/2003 03:20, Paul Nelson (TYPOGRAPHY) wrote: Please look at the definition of CGJ and other such characters. Understand the differences between CGJ and ZWJ/ZWNJ. This discussion is very disturbing to me because after reading through the L2 document

Re: Yerushala(y)im - or Biblical Hebrew

2003-07-23 Thread Kenneth Whistler
I have been doing a little research into the defined properties of CGJ. I note also that according to http://www.unicode.org/book/preview/ch03.pdf it is defined in Unicode 4.0 as a Default Ignorable. Well, I am not surprised that some people are confused ... Yes, I'm not surprised,

Re: [Private Use Area] Audio Description, Subtitle, Signing

2003-07-17 Thread Kenneth Whistler
William spilled another ocean of digital ink. Found bobbing in that ocean was the comment: Roozbeh and I assigned two unencoded characters for Afghanistan to the PUA, and we encourage implementors to use them until such time as the characters are encoded. Yes. ... Now that at least one of

Re: About the European MES-2 subset (was: PUA Audio Description, Subtitle, Signing)

2003-07-17 Thread Kenneth Whistler
282 MES-2 is specified by the following ranges of code positions as indicated for each row... Philippe Verdy asked: As most of these characters are canonically decomposable, shouldn't this list include also the decomposed characters? Why is row 03 so restricted? Shouldn't it include

Re: Aramaic, Samaritan, Phoenician

2003-07-15 Thread Kenneth Whistler
Peter Kirk responded to Michael Everson: What is this thread for? We're going to encode Phoenician. It is the forerunner of Greek and Etruscan. Hebrew went its separate way. The fact that there is a one-to-one correspondence isn't important. We have that for Coptic and Greek too and we

Re: [Private Use Area] Audio Description, Subtitle, Signing

2003-07-14 Thread Kenneth Whistler
At 10:34 -0700 2003-07-14, Peter Kirk wrote: On 14/07/2003 09:04, Doug Ewell wrote: * Michael Everson's and Roozbeh Pournader's provisional PUA assignments for ARABIC PASHTO ZWARAKAY and AFGHANI SIGN, two legitimate characters that cannot be represented in Unicode by any other means.

Re: Aramaic, Samaritan, Phoenician

2003-07-14 Thread Kenneth Whistler
Peter Kirk asked: So is there a real justification for separate alphabets here? http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2311.pdf And Michael Everson can, no doubt, provide further justification beyond this sketch of how the roadmap has been structured for this script family. Note that when

Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Kenneth Whistler
Peter Kirk asked: In Turkish and Azeri the sequences f - i and f - dotless i both occur, and are fairly frequent. So it is inappropriate in these languages to use fi ligatures in which the dot on the i is lost or invisible, at least where the second character is a dotted i. Has any

Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures

2003-07-10 Thread Kenneth Whistler
and Philippe Verdy responded with another question: Isn't there a Grapheme Disjoiner format control character to force the absence of a ligature like fi, i.e. f, GDJ, i? The answer to Philippe's rejoinder question is no, there is not a Grapheme Disjoiner format control

RE: When is a character a currency sign?

2003-07-09 Thread Kenneth Whistler
Asmus wrote: Unicode assigns the general category value, Sk, or Symbol, [k]urrency to all characters whose *primary* function is to act as a currency symbol. recte: Sc, or Symbol, [c]urrency Sk is for Symbol, modifier, referring basically to spacing accents and other similar
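
The Sc versus Sk distinction in this correction can be checked against the Unicode Character Database. The following is a minimal sketch using Python's standard `unicodedata` module; the four sample characters are chosen here for illustration and do not come from the thread.

```python
import unicodedata

# Illustrative samples (not from the original message): two currency
# symbols and two spacing accents.
samples = ["$", "\u20ac", "^", "\u00a8"]

# Map each character to its General_Category value.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cat}")
```

Currency symbols report Sc, while spacing accents such as CIRCUMFLEX ACCENT and DIAERESIS report Sk.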

Re: Reading Chinese Characters from a browser

2003-07-09 Thread Kenneth Whistler
Philippe Verdy responded to a question by SRIDHARAN Aravind: How can I differentiate whether a given character in Chinese is simplified or traditional? Normally you can't with Unicode/ISO10646: They are unified now by the UniHan working group, to be used for Traditional or Simplified

Re: Biblical Hebrew

2003-06-27 Thread Kenneth Whistler
Karljürgen, 2. Consequently ANY OTHER solution than 'FIX the obvious mistake(s)' is a kludge (contra Philippe's (?) recent comment). One *pays* for all kludges, one way or the other. Digital encoding of writing systems is a kludge. And boy, do we seem to be paying for the Unicode version of

Re: Biblical Hebrew

2003-06-27 Thread Kenneth Whistler
Philippe Verdy said: I understand the frustration: if Unicode had not attempted to define combining classes, which were not necessary to Unicode, all existing combining characters would have been given a CC=0 (or all the same 220 or 230 value). Uh, no. Under this scheme, a, diaeresis,

Mongolian Rant (was: Biblical Hebrew... was: Tibetan... was: ...)

2003-06-27 Thread Kenneth Whistler
Andrew West wrote: I have to agree 100% with Peter on this. The potential fiasco with regards to Mongolian Free Variation Selectors is another area where our grandchildren are going to be weeping with despair if we are not careful. Well, I doubt that our grandchildren will be quite *that*

Re: Biblical Hebrew (U+034F Combining Grapheme Joiner works)

2003-06-27 Thread Kenneth Whistler
Peter countered: Could this finally be the missing killer ap for the CGJ? It will be perfect to allow an application like XML to encode Hebrew text using Unicode 4.0 rules (and before). It is not perfect. CGJ is supposed to be significant (and kept in the text) for a variety of

Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread Kenneth Whistler
Peter responded: Kenneth Whistler wrote on 06/26/2003 05:36:34 PM: Why is making use of the existing behavior of existing characters a groanable kludge, if it has the desired effect and makes the required distinctions in text? Why is it a kludge to insert some cc=0 control character

RE: Question about Unicode Ranges in TrueType fonts

2003-06-26 Thread Kenneth Whistler
Elisha Berns asked: It would appear from your answer that even after implementing the algorithm to search the Unicode block coverage of a font, the actual comparison data, that is which blocks to compare and how many code points, is totally undefined. Is there any kind of standard for

Re: Revised N2586R

2003-06-26 Thread Kenneth Whistler
Doug, Peter, and Michael already provided good responses to this suggestion by William O, but here is a little further clarification. Well, certainly authority would be needed, yet I am suggesting that where a few characters added into an established block are accepted, which is what is

Re: Yerushala(y)im - or Biblical Hebrew (was Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread Kenneth Whistler
Jony took the words right out of my mouth: How about RLM? Jony This already belongs, naturally, in the context of the Hebrew text handling, which is going to have to handle bidi controls. Another possibility to consider is U+2060 WORD JOINER, the version of the zero width non-breaking space

Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread Kenneth Whistler
Peter responded: Ken Whistler wrote on 06/25/2003 06:57:56 PM: People could consider, for example, representation of the required sequence: lamed, qamets, hiriq, final mem as: lamed, qamets, ZWJ, hiriq, final mem So, we want to introduce yet *another* distinct

Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread Kenneth Whistler
Michael wrote: At 15:36 -0700 2003-06-26, Kenneth Whistler wrote: I now like better the suggestions of RLM or WJ for this. ZZZT. Thank you for playing. RLM is for forcing the right behaviour for stops and parentheses and question marks and so on. Introducing it between two

Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-26 Thread Kenneth Whistler
John, At 03:36 PM 6/26/2003, Kenneth Whistler wrote: Why is making use of the existing behavior of existing characters a groanable kludge, if it has the desired effect and makes the required distinctions in text? If there is not some rendering system or font lookup showstopper here, I'm

Re: Biblical Hebrew

2003-06-26 Thread Kenneth Whistler
John Hudson wrote: At 03:52 PM 6/26/2003, Rick McGowan wrote: I'll weigh in to agree with Ken here. The solution of cloning a whole set of these things just to fix combining behavior is, to understate, not quite nice. No, but would be far from the not nicest thing in Unicode, and there's

Saguaros in Tucson (was Re: Revised N2586R)

2003-06-25 Thread Kenneth Whistler
Oh yeah, that reminds me. When are you going to propose the SUGUARO SYMBOL? My wife's from Arizona; I'll back that one. Recte SAGUARO. I lived in Tucson from junior high to my B.A. I guess I would propose one if it were, as the SHAMROCK is, used to indicate something in lexicography or

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Kenneth Whistler
Peter asked: How can things that are visually indistinguishable be lexically different? chat (en) chat (fr) We don't encode the phonological distinctions between homographs; we encode text. But I agree that we encode text. Both words above, which are *lexically* distinct, would have the

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Kenneth Whistler
At 18:26 +0100 2003-06-25, Michael Everson wrote: You'd like to think so. But Deprecate TIBETAN THINGY and add TIBETAN THINGY BIS so that we can fix the problem is utterly ridiculous. And by that I mean, given the TWO standards Unicode and ISO/IEC 10646, adding duplicate characters is

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Kenneth Whistler
John Hudson wrote: In Biblical Hebrew, it is possible for more than one vowel to be attached to a single consonant. This means that it is very important to maintain the ordering of vowels applied to a single consonant. The Unicode Standard assigns an individual combining class to every

Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-25 Thread Kenneth Whistler
John Hudson wrote: At 02:36 PM 6/25/2003, Michael Everson wrote: Write it up with glyphs and minimal pairs and people will see the problem, if any. Or propose some solution. (That isn't add duplicate characters.) Peter Constable has written this up and submitted a proposal to the UTC.

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Kenneth Whistler
John Hudson wrote: This idea of Hebrew vowels as 'fixed' marks is problematical, because in Biblical Hebrew they are not fixed: they move relative to additional marks (other vowels or cantillation marks). It may be more *difficult* for applications to do correct rendering, but there was

Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-25 Thread Kenneth Whistler
For example, the alleged problem of the vocalization order of the Masoretes might be amenable to a much less drastic solution. People could consider, for example, representation of the required sequence: lamed, qamets, hiriq, final mem as: lamed, qamets, ZWJ, hiriq, final mem
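
The effect described in this message can be sketched with Python's standard `unicodedata` module. The code point values below are the standard ones for the named characters; the variable and function names are introduced here for illustration. Without an intervening character, canonical reordering sorts hiriq (combining class 14) ahead of qamets (combining class 18). A ZWJ has combining class 0, so it stops the two marks from being reordered across it.

```python
import unicodedata

LAMED = "\u05dc"      # HEBREW LETTER LAMED
QAMATS = "\u05b8"     # HEBREW POINT QAMATS, combining class 18
HIRIQ = "\u05b4"      # HEBREW POINT HIRIQ, combining class 14
FINAL_MEM = "\u05dd"  # HEBREW LETTER FINAL MEM
ZWJ = "\u200d"        # ZERO WIDTH JOINER, combining class 0

plain = LAMED + QAMATS + HIRIQ + FINAL_MEM
with_zwj = LAMED + QAMATS + ZWJ + HIRIQ + FINAL_MEM

def codepoints(text):
    """Return the code points of text as four-digit hex strings."""
    return [f"{ord(ch):04X}" for ch in text]

plain_nfd = unicodedata.normalize("NFD", plain)
zwj_nfd = unicodedata.normalize("NFD", with_zwj)

print(codepoints(plain_nfd))  # the two vowel points come out swapped
print(codepoints(zwj_nfd))    # the original order survives
```

This only shows the normalization behaviour. Whether a rendering system displays the ZWJ sequence correctly is a separate question that the sketch does not address.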

Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-24 Thread Kenneth Whistler
Chris Fynn wrote: In Unicode's UnicodeData.txt ( http://www.unicode.org/Public/UNIDATA/UnicodeData.txt ) 0F7E has a Canonical Combining Class Value (CCCV) of 0; 0F71 a CCCV of 129; 0F72 0F7A 0F7B 0F7C 0F7D and 0F80 a CCCV of 130; 0F74 a CCCV of 132; and 0F82 and 0F83 have a CCCV of
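
The combining class values quoted here can be read back from the character database, and their effect on normalization is easy to observe. This is a small sketch using Python's standard `unicodedata` module. The base letter KA and the typed order are assumptions chosen for illustration, not taken from the message.

```python
import unicodedata

# Combining classes for a few of the Tibetan signs named in the message.
tibetan_signs = [0x0F7E, 0x0F71, 0x0F72, 0x0F74]
ccc = {cp: unicodedata.combining(chr(cp)) for cp in tibetan_signs}

for cp, value in ccc.items():
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: ccc={value}")

# Hypothetical input: base letter KA followed by vowel sign U (ccc 132)
# and then vowel sign AA (ccc 129).
KA = "\u0f40"
typed = KA + "\u0f74" + "\u0f71"

# Canonical ordering sorts the lower class first, so AA moves before U.
normalized = unicodedata.normalize("NFD", typed)
print([f"{ord(ch):04X}" for ch in normalized])
```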

RE: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

2003-06-23 Thread Kenneth Whistler
Actually, there are a number of loose ends still, as it appears that some of Rob Mount's questions were not actually answered. I understand what you say about word formation, and combining marks, and that the Alphabetic classification should not be limited to Ls. But 30FC is of General

Wash Symbols and Iconography (was Re: Revised N2586R)

2003-06-23 Thread Kenneth Whistler
At 23:33 +0200 2003-06-23, Philippe Verdy wrote: What about the many symbols used to signal how clothes can be cleaned, And Michael Everson responded: A well-defined semantic set that I think deserves encoding. :-) If what you mean is: http://www.waschsymbole.de/en/index.html then some

Re: Problem with Arial Unicode MS font for BOLD/ITALICS in PDF

2003-06-20 Thread Kenneth Whistler
Philippe Verdy, But it's true that complex scripts like Han will be poorly rendered in Bold or Italic... But does someone actually want to read Han text with Bold characters (or even worse slanted with Italic) ? What is true is that use of italicized text is unusual in Chinese or Japanese

Re: Chinese departing tone marks

2003-06-19 Thread Kenneth Whistler
John Cowan remarked: I've never seen these particular U+02EA and U+02EB signs, but from the names, I'd say U+02EA is Cantonese 33 tone, and U+02EB is 22; U+02e7 and U+02e8 might then be used for 3 and 2, respectively. They look weird. The U+02EB (yang) one looks like reversed or turned

Re: Arabic script web site hosting solution for all platforms

2003-06-19 Thread Kenneth Whistler
Philippe Verdy said: I can say that bulk+unsolicited makes it fully qualifiable as SPAM. And Theodore Smith countered: No. I'd say spam also needs to be untargeted. Also spammers don't tend to come on list, identified with their full names and argue the relevance of their posts, as

Re: Looking for two mathematical characters

2003-06-16 Thread Kenneth Whistler
Philippe Verdy noted: In the APL subblock of the Misc.Technical block, The APL range (not subblock) of the Miscellaneous Technical block is U+2336..U+237A, so the following characters are not part of that APL range: the character ⌟ (U+231F) is also a small bottom-right corner operator, and
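
The range check behind this correction is simple enough to write down. This is a minimal sketch in which `APL_RANGE` and `in_apl_range` are names made up for illustration, using the bounds U+2336..U+237A given in the message.

```python
import unicodedata

# APL range of the Miscellaneous Technical block, as stated in the message.
APL_RANGE = range(0x2336, 0x237A + 1)

def in_apl_range(ch):
    """Return True if the character lies within U+2336..U+237A."""
    return ord(ch) in APL_RANGE

corner = "\u231f"
print(unicodedata.name(corner), in_apl_range(corner))
```

U+231F falls below the start of the range, which is the point being made.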

Re: When do you use U+2024 ONE DOT LEADER instead of U+002E FULL STOP?

2003-05-31 Thread Kenneth Whistler
Philippe Verdy vamped: For example I would not be shocked if a text using it was rendered with a monospaced font, where the base line of the character cell shows multiple tiny dots, that create a contiguous dotted line when multiple U+2024 characters (one per display cell) are used

Re: When do you use U+2024 ONE DOT LEADER instead of U+002E FULL STOP?

2003-05-31 Thread Kenneth Whistler
Michael, As a typesetter on Mac OS X, I see no reason to abandon the use of the three-dotted horizontal ellipsis character, Ken. Nor do I. It is fine for ellipses... And it was encoded for that. But in encodings which don't have an ellipsis character, it is roughly comparable to a sequence

Re: When do you use U+2024 ONE DOT LEADER instead of U+002E FULL STOP?

2003-05-31 Thread Kenneth Whistler
Philippe Verdy continued: What surprises me the most in the Unicode spec is that it both says that its purpose is to create arbitrary length of leaders As in plain text, as can be seen in Table of Content listings in many RFCs, for example. (Which, however, use ASCII 0x2E for the same

RE: IPA Null Consonant

2003-05-30 Thread Kenneth Whistler
Kent: Others gave references where it in most cases did NOT look at all like the empty set symbol. Gustav Leunbach (1973), Morphological Analysis as a Step in Automated Syntactic Analysis of a Text. http://acl.ldc.upenn.edu/C/C73/C73-2022.pdf uses an empty set symbol to denote a morphological

Re: U+1D29

2003-05-30 Thread Kenneth Whistler
António asked: I've just downloaded the PDF files with 4.0 additions (U40-*.pdf). One question: How is one supposed to tell apart the glyphs for U+1D29 and U+1D18?... Or one isn't?... (OK, this question is probably more suited to be posed to IPA, but.) Visually, you usually couldn't, any

Re: “book end” or enclosing characters in most languages?

2003-05-30 Thread Kenneth Whistler
Ben Dougall asked: On Thursday, May 29, 2003, at 02:10 pm, Philippe Verdy wrote: Interestingly, the French first-level quotation marks use what we call chevrons (double angle brackets). are they something that's in unicode? apart from the less than and greater than symbols i can't

Re: book end or enclosing characters in most languages?

2003-05-30 Thread Kenneth Whistler
Philippe Verdy wrote: Code positions 0xAB and 0xBB (in ISO-8859-1) are canonically equivalent to Unicode U+00AB («) and U+00BB (») code points. One correction -- this has nothing to do with canonical equivalence. This (as for all other ISO/IEC 8859-1 encoded characters) is an example of
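
The distinction drawn here is between a character set mapping and canonical equivalence. The sketch below uses only Python's built-in codecs and the standard `unicodedata` module to show both side by side. The choice of U+00E9 as the canonical-equivalence example is an assumption made for illustration.

```python
import unicodedata

# Mapping: every ISO/IEC 8859-1 byte value converts to the Unicode code
# point with the same number. This is transcoding, not equivalence.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
same_numbers = all(ord(ch) == b for ch, b in zip(decoded, all_bytes))

left_guillemet = bytes([0xAB]).decode("latin-1")
right_guillemet = bytes([0xBB]).decode("latin-1")

# Canonical equivalence relates two Unicode sequences to each other,
# for example a precomposed letter and its decomposed form.
precomposed = "\u00e9"
decomposed = unicodedata.normalize("NFD", precomposed)

print(same_numbers, left_guillemet, right_guillemet)
print([f"{ord(ch):04X}" for ch in decomposed])
```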

Re: [Not OT] localized names of the Unicode Control characters

2003-05-30 Thread Kenneth Whistler
Philippe Verdy said: So I think names in both Windows and this Hapax page come from a ISO10646 normative reference file in French, and it contains the names for Unicode3.2 characters (but still not new characters added or modified in Unicode 4.0) and then asked: Also, as this alternate

Re: Not snazzy (was: New Unicode Savvy Logo)

2003-05-27 Thread Kenneth Whistler
Theodore Smith wrote: My first reaction, is that the logos don't look like they compare to other logos in terms of style. For example Mac OSX logos, XML logos, and that generally do look more snazzy. They were loosely modelled on the W3C HTML validation logo, which is comparable, in some

Re: IPA Null Consonant

2003-05-27 Thread Kenneth Whistler
Thomas Widmann continued: [EMAIL PROTECTED] writes: Yes, I think you're right that an annotation is best -- but only if EMPTY SET is indeed the right character. I'm increasingly of the opinion that a different character might be needed. I would disagree. As would I. Oh

Re: Dutch IJ, again

2003-05-27 Thread Kenneth Whistler
Philippe Verdy continued: From: Mark Davis [EMAIL PROTECTED] From: Anto'nio Martins-Tuva'lkin [EMAIL PROTECTED] On 2003.05.25, 00:00, Philippe Verdy [EMAIL PROTECTED] wrote: even if the Dutch language considers it as a single letter, in a way similar to the Spanish ch I see

Re: ogonek vs. retroflex hook

2003-04-04 Thread Kenneth Whistler
Peter continued: Ken Whistler wrote on 04/02/2003 03:54:10 PM: That isn't the only convention. I am finding several samples of typographic retroflex hook being used to indicate nasalisation of vowels. Jim Allan is right. It is the *ogonek* which is used to signify the nasalization

Re: ogonek vs. retroflex hook

2003-04-04 Thread Kenneth Whistler
Peter, Why you would feel that such user sense of the characters they are using is belied by your analysis of the shape of the hooks used in the IJAL font is beyond me. I'm sorry I wasn't clearer. I was not referring to their status in terms of defining characters. I was *only*

Re: ogonek vs. retroflex hook

2003-04-04 Thread Kenneth Whistler
Peter, Note that the example you posted also had an h-ogonek, so the usage is not limited to vowels, per se. Indeed. (Although that particular entity itself is a little bizarre, since you cannot really nasalize a voiceless glottal fricative. Then you'd be even more surprised

Re: ogonek vs. retroflex hook

2003-04-02 Thread Kenneth Whistler
At 11:33 -0600 2003-04-02, [EMAIL PROTECTED] wrote: John Hudson [EMAIL PROTECTED] wrote on 04/02/2003 11:28:28 AM: Yes, I would consider those ogoneks. What do they signify in Dogrib? Nasalisation? Not yet sure, but waiting to find out. I would imagine they are nasals as in

Re: letters with palatal hook

2003-04-02 Thread Kenneth Whistler
Peter quoted me: As far as I know, the same completeness issue does not apply for the retroflex and palatal hooks -- so for those, use of the preformed base letters is probably the better recommendation, rather than use of the non-spacing diacritics together with ligature tables in the fonts.

RE: ogonek vs. retroflex hook

2003-04-02 Thread Kenneth Whistler
Jim Allan responded to Joe Becker: Joe posted: c. CEDILLAS AND HOOKS: Two cedillas and two hooks are required as diacritical marks for bibliographic transcription, and also for the proper representation of a number of languages (as documented in ANSI Z39.47-1985 and ISO

Re: letters with palatal hook

2003-04-02 Thread Kenneth Whistler
Creating palatal-hook v's, x's, k's, s's, and so on if they are not in significant use and when multiple, equally accurate, alternative representations are available, may not be the best thing to do. Incidentally, reviewing Pullum and Ladusaw (1986) to help provide the definitive answer on

Re: ogonek vs. retroflex hook

2003-04-02 Thread Kenneth Whistler
Peter, Jim Allan wrote on 04/02/2003 12:27:07 PM: This fits a normal convention in American linguistics to use ogonek to signify a nasal. That isn't the only convention. I am finding several samples of typographic retroflex hook being used to indicate nasalisation of vowels. Jim Allan

Re: New contribution

2003-04-01 Thread Kenneth Whistler
N258A Proposal to encode two COMBINING HEART characters in the UCS by Michael Everson, Roozbeh Pournader, and John Cowan http://www.evertype.com/standards/iso10646/pdf/n258a-heartdot.pdf Given the date this was submitted and the contents of the proposal, may I guess that this is but a

Re: Characters for Cakchiquel

2003-03-28 Thread Kenneth Whistler
But I do find, in the vocabulary and index, words starting with tz are sorting after quatrillo con coma (it goes z, tresillo, quatrillo, quatrillo con coma, tz). So even for this text, a tz ligature is marginal. irrelevant to the

Re: Characters for Cakchiquel

2003-03-28 Thread Kenneth Whistler
Stefan asked: Michael Everson wrote: Shavian has graduated to encoded status, and Tengwar and Cirth will likely also do so. Really? I thought that it would not until Unicode 4.0 is published. The Unicode 4.0 release is imminent -- we are anticipating mid-April for finalization of the

Re: Annotation

2003-03-26 Thread Kenneth Whistler
Michael, According to the American Heritage Dictionary of the English Language, page 1303, in the list of symbols and signs, it indicates that a symbol similar to the per-mille sign can be used to indicate salinity. Nice annotation. Having said that, the etymology of the percent sign

Re: Help needed with Davanagari glyph

2003-03-21 Thread Kenneth Whistler
Does anyone know how to make the Devanagari glyph indicated here http://www.hotpeachpages.net/lang/defn1.html#Hindi i.e. the glyph I have drawn a rectangle around three samples of? If yes, please tell me. U+0936 DEVANAGARI LETTER SHA (although you have just circled the left half of the

RE: ANSI requires licence fees to use ISO language and country code?

2003-03-21 Thread Kenneth Whistler
Michael, A representative of ISO sent this to me today. I do not know about ANSI but for ISO/CS the quote given below from http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/index.html is certainly correct. We make a distinction between implementation and

Re: ANSI requires licence fees to use ISO language and country code?

2003-03-20 Thread Kenneth Whistler
ANSI has membership fees, accreditation fees, and a scheme for site licensing for access to standards documents. But I've never heard of a license fee for *use* of ISO 639 or ISO 3166 codes. Once you acquire the standard, you should be able to freely use it. That is how ISO standards work. Where

Re: ANSI requires licence fees to use ISO language and country code?

2003-03-20 Thread Kenneth Whistler
I'm guessing this may be related to the fact that ISO is now delivering ISO 3166-1/2 codes in the form of two Microsoft Access 2000 databases. (Although you can also order the standards without the database files.) http://www.iso.ch/iso/en/prods-services/iso3166ma/05database/index.html Trying to

Re: Re. and Rs. currency sign

2003-03-18 Thread Kenneth Whistler
Lateef Sagar Shaikh asked: For Rupees Rs. sign is used, and for Rupee Re. sign is used, whereas in Unicode only one code point is present for Rs. Shouldn't there be a separate place for Re. as well? No. Rather than using U+20A8 RUPEE SIGN, ordinary typographic practice would just be to use
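
As background to this answer, U+20A8 is only compatibility-equivalent to the two ordinary letters. The sketch below uses Python's standard `unicodedata` module to show what compatibility normalization does with it. It illustrates the character's properties and does not reconstruct the rest of the truncated reply.

```python
import unicodedata

RUPEE_SIGN = "\u20a8"

# NFC leaves the symbol alone; NFKC folds it to the plain letters.
nfc_form = unicodedata.normalize("NFC", RUPEE_SIGN)
nfkc_form = unicodedata.normalize("NFKC", RUPEE_SIGN)

print(unicodedata.name(RUPEE_SIGN))
print(repr(nfc_form), repr(nfkc_form))
```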

Re: geometric shapes

2003-03-18 Thread Kenneth Whistler
Pim Blokland asked: I've got a few questions about the use of geometric shapes, like squares and such. Some of these look very similar to one another, and I don't know which ones to use in which circumstances! Are there any guidelines on their use? Just as an example, let's look at the

Re: Re. and Rs. currency sign

2003-03-18 Thread Kenneth Whistler
Stefan wrote: Kenneth Whistler wrote: DM was widely used for Deutschmarks, dkr for Danish kroner, and so on before the switch to euros, for example. I've only seen Danish kroner abbreviated as kr or DKK, never as dkr. kr is the most common abbreviation in Denmark today; DKK

Re: List of ligatures for languages of the Indian subcontinent.

2003-03-17 Thread Kenneth Whistler
William Overington asked: And nobody out there is volunteering to do it. I was told that I could commission it. That statement by Michael Everson was not a *permission*, but merely a statement of fact. Anyone can commission any expert they like, under contract to produce whatever output or

Re: U+00D0, U+01b7 -- variants or distinct chars?

2003-03-17 Thread Kenneth Whistler
Peter, U+00D0: The glyph that appears in the code charts for U+00D0 is shown in LtnCapEth_DStrk.gif. Now, the African Reference Alphabet document that was produced at a conference in Niamey in 1978 proposed a small letter that looks like U+00F0 LATIN SMALL LETTER ETH, but the capital

Re: Unicode 4.0 chapter headings and numbering.

2003-03-14 Thread Kenneth Whistler
William Overington asked: I wonder if you could please say whether the Unicode 4.0 book will have the same chapter headings and numbering as the Unicode 3.0 book? They will be largely similar -- and identical for Chapters 1 through 5 -- but there are various reorganizations in the latter part

Re: New document.

2003-03-14 Thread Kenneth Whistler
Otto Stolz wrote: The two scans under http://www.rz.uni-konstanz.de/Antivirus/tests/li.png http://www.rz.uni-konstanz.de/Antivirus/tests/re.png are from the authoritative (until July 1996) book on German orthography: Duden Rechtschreibung der deutschen Sprache und der

Re: Allocation of Georgian Extended block

2003-03-12 Thread Kenneth Whistler
The reason is that the Myanmar block was given four empty columns because we already *know* of numerous characters that will need to be added to the Myanmar script to support Shan, Karen, Mon, and other minority languages written with the script. Ending the Myanmar block at U+109F (instead of

Re: Encoding: Unicode Quarterly Newsletter

2003-03-11 Thread Kenneth Whistler
We've asked. But you need to understand that publishers have their own rules and constraints. Paper is bought in huge quantities by publishers, and special purpose papers (such as lightweight, thin, high-opacity papers used in dictionaries) are expensive and carefully planned for. As important as

Re: Encoding: Unicode Quarterly Newsletter

2003-03-10 Thread Kenneth Whistler
Not to disagree publicly with Michael or Mark on this, but in the interests of accuracy, I should point out that if the rest mass of the Unicode 4.0 publication is assumed to be exactly 4.1 kg (which then would, indeed, also be the case on our moon, or even a Jovian moon), and ignoring any

Re: Unicode character transformation through XSLT

2003-03-10 Thread Kenneth Whistler
Well, I can't diagnose exactly what is going wrong, but Unicode character (\uFFE2\uFF80\uFF93) is a sequence of a full-width not sign, followed by a half-width katakana ta and a half-width katakana mo. What you are actually looking for is the UTF-8 sequence: 0xE2 0x80 0x93 which is the UTF-8
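
The three bytes 0xE2 0x80 0x93 are the UTF-8 encoding of U+2013 EN DASH. One way the reported characters could arise is if each byte were widened to 16 bits as a signed value. That is an assumption about the failure made here for illustration, since the message itself does not diagnose the cause. A minimal sketch:

```python
import unicodedata

utf8_bytes = bytes([0xE2, 0x80, 0x93])

# Correct interpretation: decode the three bytes together as UTF-8.
decoded = utf8_bytes.decode("utf-8")

def sign_extend_to_16_bits(byte):
    """Widen a byte to 16 bits as if it were a signed 8-bit value.

    Hypothetical model of the bug, not taken from the message.
    """
    signed = byte - 0x100 if byte >= 0x80 else byte
    return signed & 0xFFFF

# Faulty interpretation: treat every byte as its own 16-bit code unit.
misread = "".join(chr(sign_extend_to_16_bits(b)) for b in utf8_bytes)

print(unicodedata.name(decoded))
print([unicodedata.name(ch) for ch in misread])
```

Under that assumption the misread string is exactly the full-width not sign and two half-width katakana described in the reply.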

CGJ and ZWJ (was Re: Currency symbols)

2003-03-10 Thread Kenneth Whistler
Antonio asked: On 2003.02.25, 19:36, Asmus Freytag [EMAIL PROTECTED] wrote: At 12:55 PM 2/25/03 +, Anto'nio Martins-Tuva'lkin wrote: Most (all?) of them are composable, either by means of letter + slash (OSLI) or by ZWJ (for things like Pta or Pts, if anything), Using ZWJ

Re: Impossible combinations?

2003-03-02 Thread Kenneth Whistler
On Sun, 2 Mar 2003, Kevin Brown wrote: Does anyone know of a Latin-based language in which it is possible to have a lowercase immediately followed by an uppercase in the SAME word? In addition to the examples pointed out by Roozbeh and Michael, this pattern is growing increasingly common

Re: Unicode 4.0 BETA available for review

2003-02-27 Thread Kenneth Whistler
Frank Tang asked: This discussion has been centered around UTF-8. But I hope the corresponding rules apply to UTF-16 and UTF-32 for Unicode 4.0: . for UTF-32: occurrences of 'surrogates' are ill-formed. How about UTF-32 sequence which the 4 bytes represent value U+10 ?
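
The UTF-32 rule raised in this question can be probed with Python's built-in `utf-32-be` codec. This is a sketch of how one current decoder behaves, not a statement of the Unicode 4.0 text itself. `decode_utf32_unit` is a helper name made up here, and 0x110000 is a stand-in for an out-of-range value because the figure in the quoted question is truncated.

```python
import struct

def decode_utf32_unit(value):
    """Decode one big-endian 32-bit unit, or return None if it is rejected."""
    raw = struct.pack(">I", value)
    try:
        return raw.decode("utf-32-be")
    except UnicodeDecodeError:
        return None

for value in (0x41, 0x10FFFF, 0xD800, 0x110000):
    print(f"{value:#08x}", repr(decode_utf32_unit(value)))
```

Ordinary scalar values decode, while a surrogate code point and a value beyond U+10FFFF are both refused.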

UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-02-27 Thread Kenneth Whistler
Frank Tang responded to Kent Karlsson's response: The problem I need to deal with is not how to GENERATE that UTF-8, but how to handle the DATA when my code receives it. For example, when I receive a 10K UTF-8 file which has 1000 lines of text, if there is one UTF-8 sequence in line 990
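
The scenario in this question, a large file with a single bad sequence, comes down to a choice of error-handling policy in the receiving code. The sketch below contrasts two policies offered by Python's built-in UTF-8 decoder. The sample data and the helper name `first_error_offset` are made up for illustration.

```python
# Hypothetical input: three lines, the middle one containing a stray 0xFF.
data = b"line 989 ok\nline 990 has a bad byte: \xff\nline 991 ok\n"

def first_error_offset(raw):
    """Return the byte offset of the first ill-formed sequence, or None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None

# Policy 1: strict decoding reports where the problem is and stops.
offset = first_error_offset(data)

# Policy 2: substitute U+FFFD for the bad byte and keep the other text.
replaced = data.decode("utf-8", errors="replace")

print(offset)
print(replaced)
```

Which policy is appropriate depends on the application. The decoder only has to avoid interpreting the ill-formed bytes as characters.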

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-02-27 Thread Kenneth Whistler
, thus reconstructing all the gaps. Of course, there are much better approaches to self-correcting data transmission, but you get the idea. This would be a perfectly valid and conformant way to use UTF-8 data. tex Kenneth Whistler wrote: Absolutely. Error handling is a matter of software

Re: Unicode 4.0 BETA available for review

2003-02-27 Thread Kenneth Whistler
Stefan Persson suggested: Unicode 3.0 defined non-shortest UTF-8 as *irregular* code value sequences. There were two types: a. 0xC0 0x80 for U+ (instead of 0x00) b. 0xED 0xA0 0x80 0xED 0xB0 0x80 for U+1 (instead of 0xF0 0x90 0x80 0x80) Ah, but encoding NULL as a
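
The byte sequences listed in this message can be tried against a present-day strict decoder. This is a minimal sketch using Python's built-in UTF-8 codec, with `is_well_formed_utf8` as a helper name introduced here. It shows how that decoder treats the sequences today, not how Unicode 3.0 classified them.

```python
def is_well_formed_utf8(raw):
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

candidates = {
    "one-byte NUL": b"\x00",
    "two-byte form C0 80": b"\xc0\x80",
    "surrogate pair in UTF-8": b"\xed\xa0\x80\xed\xb0\x80",
    "four-byte form F0 90 80 80": b"\xf0\x90\x80\x80",
}

results = {label: is_well_formed_utf8(raw) for label, raw in candidates.items()}
for label, ok in results.items():
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```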

Re: please review the paper for me

2003-02-26 Thread Kenneth Whistler
Frank Tang wrote: I think that is a very common mistake people WILL make. Especially if they keep telling each other the wrong thing, and then rely on folklore about the standard as their source of information. The ultimate source of information about a standard is the standard itself. If

Re: Unicode 4.0 BETA available for review

2003-02-26 Thread Kenneth Whistler
Frank Tang continued: If you read through those definitions from Unicode 4.0 carefully, you will see that UTF-8 representing a noncharacter is perfectly valid, but UTF-8 representing an unpaired surrogate code point is ill-formed (and therefore disallowed). I see a hole here. How about
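
The contrast quoted here, noncharacters allowed and unpaired surrogates disallowed, can be seen from the encoding side with Python's built-in codec. In this small sketch `can_encode` is a helper name made up for illustration.

```python
def can_encode(text):
    """Return True if the string can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True

# A noncharacter round-trips through UTF-8 like any other scalar value.
noncharacter = "\ufffe"
encoded = noncharacter.encode("utf-8")
round_tripped = encoded.decode("utf-8")

# An unpaired surrogate code point has no well-formed UTF-8 encoding.
lone_surrogate = "\ud800"

print(encoded.hex(" "), round_tripped == noncharacter)
print(can_encode("\U0010ffff"), can_encode(lone_surrogate))
```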

Re: Unicode 4.0 BETA available for review

2003-02-25 Thread Kenneth Whistler
Frank Tang asked: so the UTF-8 sequence which represent U+FFFE U+ and U+{1-11}FFF{E,F} are considered legal in Unicode 4.0 Yes. Such sequences are also legal in Unicode 3.0, 3.1, and 3.2. The Unicode Standard, Version 3.0 specified, on p. 46: To ensure that round-trip transcoding is

Re: guarani sign

2003-02-24 Thread Kenneth Whistler
Don't all overwhelm the sites at once, but here is the documentation people are looking for: http://www.birdtheme.org/country/paraguay.html Paraguay has published a lot of stamps with bird themes. If you look at the 1983 series of South American birds, you will see that they were using Gs. for

Re: Unicode 4.0 BETA available for review

2003-02-24 Thread Kenneth Whistler
Frank Tang asked: I am working on updating the Mozilla UTF-8 code to incorporate the change of UTF-8 definition in Unicode 3.1 (make non-shortest form illegal, and make 5-6 octets illegal) and Unicode 3.2 (make irregular form illegal) now. I wonder do have any change of the UTF-8

Re: CJK Unified Ideographs Range

2003-02-21 Thread Kenneth Whistler
Andrew followed up: Maybe what I'm really trying to ask is, if sometime in the future we start to run out of space in the BMP, could U+9FB0 through U+9FFF be reallocated to some new script, or is the allocation of these 80 codepoints to the CJK block permanent and irrevocable ? Please study

Re: Hot Beverage font

2003-02-19 Thread Kenneth Whistler
I know y'all are having fun with this thread, but in case Andrew's inquiry is at least half-serious: But why is the Hot Beverage character listed under the heading Weather Symbol in the Miscellaneous Symbols code chart ? Does it rain tea and coffee in North Korea ? Or does the annotation can

Re: CJK Unified Ideographs Range

2003-02-19 Thread Kenneth Whistler
Andrew asked: I've asked this question before, but I've never had a satisfactory response, so I'll ask it again now that Unicode 4 is due to be released soon. Section 10.1 of the Unicode Standard, as well as Blocks-4.0.0.txt, give the range of the CJK Unified Ideographs block as U+4E00

RE: Never say never

2003-02-12 Thread Kenneth Whistler
Andy continued: In principle, at some point in the future, either the phonology or the orthography or both could evolve to the point where the entire constructs start to get handled as basic orthographic units (or letters) for Bengali, but it isn't really the place of the Unicode

RE: Never say never

2003-02-11 Thread Kenneth Whistler
Marco Cimarosti wrote: It has been repeated a lot of times that no more precomposed character will never ever ever ever be added. ... I trust the clarification from John Cowan helped on this -- there is no prohibition against adding characters with *compatibility* decomposition mappings,

RE: Never say never

2003-02-11 Thread Kenneth Whistler
Andy White wrote: And I today see that the precomposed character '0B71 ORIYA LETTER WA' has been added to the UCS4.0 charts http://www.unicode.org/charts/PDF/U40-0B00.pdf This is clearly a composition of ORIYA LETTER O and ORIYA LETTER VA (BA). People on the list today are playing a

Re: LATIN LETTER N WITH DIAERESIS?

2003-02-10 Thread Kenneth Whistler
António MARTINS-Tuválkin (with no diaeresis !) asked: Anyway, I noted once more that many Cyrillic letters I'd consider as base letter + diacritical composites are not decomposable according to Unicode. I planned to delve deeper into this, but is there a short answer for it? The short answer

Re: LATIN LETTER N WITH DIAERESIS?

2003-02-10 Thread Kenneth Whistler
John Cowan noted: So formal canonical decompositions are almost entirely confined to separable, accent-like diacritics (acute, grave, diaeresis, and so on). The only significant exceptions are the cedilla and ogonek, which attach smoothly to letter bottoms without otherwise distorting
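
One way to check the pattern described above is to look at what canonical decomposition (NFD) does to a few letters, using Python's standard unicodedata module. The sample letters are chosen for illustration only, the helper name canonical_parts is hypothetical, and the results depend on the Unicode data bundled with the Python build.

```python
import unicodedata


def canonical_parts(ch: str) -> list[str]:
    """Return the character names in the canonical decomposition of ch."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", ch)]


SAMPLES = {
    "c with cedilla": "\u00E7",
    "a with ogonek": "\u0105",
    "l with stroke": "\u0142",
    "Cyrillic zhe with descender": "\u0497",
}

for label, ch in SAMPLES.items():
    parts = canonical_parts(ch)
    status = "decomposes" if len(parts) > 1 else "atomic"
    print(f"{label}: {status}: {parts}")
```

Letters whose modification is an overlaid stroke or an attached descender come back as a single code point, while the cedilla and ogonek cases split into base letter plus combining mark.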

RE: discovering code points with embedded nulls

2003-02-05 Thread Kenneth Whistler
Erik followed up: From what I'm hearing from you all is that a null in UTF-8 is for termination and termination only. Is this correct? Not quite. A null byte (0x00) in UTF-8 is only a representation of the NULL character (U+0000). It can be present in UTF-8 for whatever purposes one might
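
The point that a 0x00 byte in UTF-8 only ever stands for the NULL character can be spot-checked with a short script. This is a sketch under stated assumptions: the function name code_points_with_null_byte is invented here, and by default it scans only the BMP (skipping surrogate code points, which cannot be encoded) to keep the run short.

```python
def code_points_with_null_byte(limit: int = 0x10000) -> list[int]:
    """List code points below limit whose UTF-8 encoding contains 0x00."""
    found = []
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points have no UTF-8 encoding
        if 0x00 in chr(cp).encode("utf-8"):
            found.append(cp)
    return found


# An embedded NULL character is carried through as a single 0x00 byte.
EMBEDDED = "abc\u0000def".encode("utf-8")

print([f"U+{cp:04X}" for cp in code_points_with_null_byte()])
print(EMBEDDED)
```

Code that treats 0x00 as a string terminator will therefore truncate such data at the embedded NULL, but it will never be misled by a 0x00 byte that belongs to some other character.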

Beta Coming (was Re: Public Review Issues update)

2003-02-04 Thread Kenneth Whistler
Doug Ewell noted: As for Issue #6, Unicode 4.0 Alpha data, there hasn't been much new to review so far. The first UnicodeData.txt file to contain the new character assignments in Unicode 4.0 was posted only a few hours ago! Eleven days might not be much time to check through 1200+ new

Re: compatibility between unicode 2.0 and 3.0

2003-02-03 Thread Kenneth Whistler
Erik Ostermueller asked: We have a large amount of C++ that currently has Unicode 2.0 support. Could you all help me figure out what types of operations will fail if we attempt to pass Unicode 3.0 thru this code? I can start the list off with -sorting -searching for text This

Re: urban legends just won't go away!

2003-01-30 Thread Kenneth Whistler
This is a simple example demonstrating my own personal method. //to upper case public char upper(int c) { return (char)((c >= 97 && c <= 122) ? VisitSewers(c) : c); } static int VisitSewers(int c) { return AlligatorByte(c); } static int AlligatorByte(int c) { // Remove

Re: LATIN LETTER N WITH DIAERESIS?

2003-01-28 Thread Kenneth Whistler
Curtis asked: I have a distinct memory of a precomposed Latin letter n with diaeresis (as in the band Spinal Tap), but now I can't find it. It doesn't matter to me whether it exists or not, other than helping me to understand my memory. Am I missing it? Did it exist once and is now gone?
