Jim Allan answered Alan Wood's question:
Alan Wood posted on U+23D0 VERTICAL LINE EXTENSION:
Is it intended as a Unicode replacement for Vertical arrow extender in
Symbol font?
Yes.
See http://std.dkuug.dk/JTC1/SC2/WG2/docs/n2508.htm for the proposal.
The Unicode manual should
Peter Kirk cited Paul Nelson:
On 23/07/2003 03:20, Paul Nelson (TYPOGRAPHY) wrote:
Please look at the definition of CGJ and other such characters.
Understand the differences between CGJ and ZWJ/ZWNJ.
This discussion is very disturbing to me because after reading through
the L2 document
I have been doing a little research into the defined properties of CGJ.
I note also that according to
http://www.unicode.org/book/preview/ch03.pdf it is defined in Unicode
4.0 as a Default Ignorable. Well, I am not surprised that some people
are confused ...
Yes, I'm not surprised,
William spilled another ocean of digital ink. Found bobbing
in that ocean was the comment:
Roozbeh and I assigned two unencoded characters for Afghanistan to
the PUA, and we encourage implementors to use them until such time as
the characters are encoded.
Yes. ... Now that at least one of
282 MES-2 is specified by the following ranges of code positions as
indicated for each row...
Philippe Verdy asked:
As most of these characters are canonically decomposable, shouldn't this
list include also the decomposed characters?
Why is row 03 so restricted? Shouldn't it include
Peter Kirk responded to Michael Everson:
What is this thread for? We're going to encode Phoenician. It is the
forerunner of Greek and Etruscan. Hebrew went its separate way. The
fact that there is a one-to-one correspondence isn't important. We
have that for Coptic and Greek too and we
At 10:34 -0700 2003-07-14, Peter Kirk wrote:
On 14/07/2003 09:04, Doug Ewell wrote:
* Michael Everson's and Roozbeh Pournader's provisional PUA assignments
for ARABIC PASHTO ZWARAKAY and AFGHANI SIGN, two legitimate characters
that cannot be represented in Unicode by any other means.
Peter Kirk asked:
So is there a real justification for separate alphabets here?
http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2311.pdf
And Michael Everson can, no doubt, provide further
justification beyond this sketch of how the roadmap has
been structured for this script family.
Note that when
Peter Kirk asked:
In Turkish and Azeri the sequences f - i and f - dotless i both occur,
and are fairly frequent. So it is inappropriate in these languages to
use fi ligatures in which the dot on the i is lost or invisible, at
least where the second character is a dotted i. Has any
and Philippe Verdy responded with another question:
Isn't there a Grapheme Disjoiner format control character to
force the absence of a ligature like fi, i.e. f, GDJ, i?
The answer to Philippe's rejoinder question is no, there is not
a Grapheme Disjoiner format control
Asmus wrote:
Unicode assigns the general category value, Sk, or Symbol, [k]urrency
to all characters whose *primary* function is to act as a currency symbol.
recte: Sc, or Symbol, [c]urrency
Sk is for Symbol, modifier, referring basically to spacing accents
and other similar
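The Sc/Sk distinction in this correction can be checked against the Unicode Character Database. What follows is a minimal sketch, assuming Python's standard `unicodedata` module; the sample characters are illustrative choices and do not come from the thread.

```python
import unicodedata

# Illustrative samples (not from the thread): two currency signs
# and two spacing accents.
samples = ["$", "\u20ac", "^", "`"]

def general_category(ch: str) -> str:
    """Return the two-letter General Category of a single character."""
    return unicodedata.category(ch)

for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {general_category(ch)}")
```

The values come from whichever UCD version ships with the interpreter, so they reflect current data rather than the 2003 files under discussion.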
Philippe Verdy responded to a question by SRIDHARAN Aravind:
How can I differentiate whether a given character in chinese is
simplified or traditional?
Normally you can't with Unicode/ISO10646:
They are unified now by the UniHan working group, to be used
for Traditional or Simplified
Karljürgen,
2. Consequently ANY OTHER solution than 'FIX the obvious mistake(s)' is a
kludge (contra Philippe's (?) recent comment). One *pays* for all kludges,
one way or the other.
Digital encoding of writing systems is a kludge. And boy, do we
seem to be paying for the Unicode version of
Philippe Verdy said:
I understand the frustration: if Unicode had not attempted to define
combining classes, which were not necessary to Unicode, all
existing combining characters would have been given a CC=0
(or all the same 220 or 230 value).
Uh, no.
Under this scheme, a, diaeresis,
Andrew West wrote:
I have to agree 100% with Peter on this. The potential fiasco with regards to
Mongolian Free Variation Selectors is another area where our grandchildren are
going to be weeping with despair if we are not careful.
Well, I doubt that our grandchildren will be quite *that*
Peter countered:
Could this finally be the missing killer ap for the CGJ?
It will be perfect to allow an application like XML to encode Hebrew
text using Unicode 4.0 rules (and before).
It is not perfect. CGJ is supposed to be significant (and kept in the
text) for a variety of
Peter responded:
Kenneth Whistler wrote on 06/26/2003 05:36:34 PM:
Why is making use of the existing behavior of existing characters
a groanable kludge, if it has the desired effect and makes
the required distinctions in text?
Why is it a kludge to insert some cc=0 control character
Elisha Berns asked:
It would appear from your answer that even after implementing the
algorithm to search the Unicode block coverage of a font, the actual
comparison data, that is which blocks to compare and how many code
points, is totally undefined. Is there any kind of standard for
Doug, Peter, and Michael already provided good responses to
this suggestion by William O, but here is a little further
clarification.
Well, certainly authority would be needed, yet I am suggesting that where a
few characters added into an established block are accepted, which is what
is
Jony took the words right out of my mouth:
How about RLM?
Jony
This already belongs, naturally, in the context of the Hebrew
text handling, which is going to have to handle bidi controls.
Another possibility to consider is U+2060 WORD JOINER, the
version of the zero width non-breaking space
Peter responded:
Ken Whistler wrote on 06/25/2003 06:57:56 PM:
People could consider, for example, representation
of the required sequence:
lamed, qamets, hiriq, final mem
as:
lamed, qamets, ZWJ, hiriq, final mem
So, we want to introduce yet *another* distinct
Michael wrote:
At 15:36 -0700 2003-06-26, Kenneth Whistler wrote:
I now like better the suggestions of RLM or WJ for this.
ZZZT. Thank you for playing.
RLM is for forcing the right behaviour for stops and parentheses and
question marks and so on. Introducing it between two
John,
At 03:36 PM 6/26/2003, Kenneth Whistler wrote:
Why is making use of the existing behavior of existing characters
a groanable kludge, if it has the desired effect and makes
the required distinctions in text? If there is not some
rendering system or font lookup showstopper here, I'm
John Hudson wrote:
At 03:52 PM 6/26/2003, Rick McGowan wrote:
I'll weigh in to agree with Ken here. The solution of cloning a whole set
of these things just to fix combining behavior is, to understate, not quite
nice.
No, but would be far from the not nicest thing in Unicode, and there's
Oh yeah, that reminds me. When are you going to propose the SUGUARO
SYMBOL? My wife's from Arizona; I'll back that one.
Recte SAGUARO. I lived in Tucson from junior high to my B.A. I guess
I would propose one if it were, as the SHAMROCK is, used to indicate
something in lexicography or
Peter asked:
How can things that are visually indistinguishable be lexically different?
chat (en)
chat (fr)
We don't encode the phonological distinctions between homographs; we
encode text.
But I agree that we encode text. Both words above, which are
*lexically* distinct, would have the
At 18:26 +0100 2003-06-25, Michael Everson wrote:
You'd like to think so. But Deprecate TIBETAN THINGY and add
TIBETAN THINGY BIS so that we can fix the problem is utterly
ridiculous.
And by that I mean, given the TWO standards Unicode and ISO/IEC
10646, adding duplicate characters is
John Hudson wrote:
In Biblical Hebrew, it is possible for more than one vowel to be attached
to a single consonant. This means that it is very important to maintain the
ordering of vowels applied to a single consonant. The Unicode Standard
assigns an individual combining class to every
John Hudson wrote:
At 02:36 PM 6/25/2003, Michael Everson wrote:
Write it up with glyphs and minimal pairs and people will see the problem,
if any. Or propose some solution. (That isn't add duplicate characters.)
Peter Constable has written this up and submitted a proposal to the UTC.
John Hudson wrote:
This idea of Hebrew vowels as 'fixed' marks is problematical, because in
Biblical Hebrew they are not fixed: they move relative to additional marks
(other vowels or cantillation marks).
It may be more *difficult* for applications to do correct rendering,
but there was
For example, the alleged problem of the vocalization order of
the Masoretes might be amenable to a much less drastic
solution. People could consider, for example, representation
of the required sequence:
lamed, qamets, hiriq, final mem
as:
lamed, qamets, ZWJ, hiriq, final mem
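The effect of that suggestion on normalization can be sketched as follows. This is an illustration only, assuming Python's standard `unicodedata` module; the variable names are mine. Qamets and hiriq have different non-zero combining classes, so canonical ordering swaps them, while an intervening character of class 0 (here ZWJ) blocks the swap.

```python
import unicodedata

LAMED, QAMETS, HIRIQ, FINAL_MEM = "\u05DC", "\u05B8", "\u05B4", "\u05DD"
ZWJ = "\u200D"  # ZERO WIDTH JOINER, combining class 0

plain = LAMED + QAMETS + HIRIQ + FINAL_MEM
with_zwj = LAMED + QAMETS + ZWJ + HIRIQ + FINAL_MEM

def codepoints(s: str) -> list:
    """Render a string as a list of U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in s]

# Combining classes of the two vowels (they differ, so reordering applies).
print(unicodedata.combining(QAMETS), unicodedata.combining(HIRIQ))

# Without ZWJ the vowels are reordered by canonical ordering.
print(codepoints(unicodedata.normalize("NFC", plain)))

# With ZWJ between them the original order survives normalization.
print(codepoints(unicodedata.normalize("NFC", with_zwj)))
```

This only demonstrates the normalization behaviour; whether renderers and fonts treat the ZWJ sequence acceptably is the separate question the thread is arguing about.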
Chris Fynn wrote:
In Unicode's UnicodeData.txt (
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt )
0F7E has a Canonical Combining Class Value (CCCV) of 0;
0F71 a CCCV of 129;
0F72 0F7A 0F7B 0F7C 0F7D and 0F80 a CCCV of 130;
0F74 a CCCV of 132;
and 0F82 and 0F83 have a CCCV of
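The combining classes listed in this post can be read back programmatically. A minimal sketch, assuming Python's standard `unicodedata` module; it reports the values from the UCD version bundled with the interpreter, which may be newer than the file cited in the post, and the helper name is mine.

```python
import unicodedata

# Code points named in the post above.
tibetan = [0x0F7E, 0x0F71, 0x0F72, 0x0F7A, 0x0F7B, 0x0F7C,
           0x0F7D, 0x0F80, 0x0F74, 0x0F82, 0x0F83]

def ccc_table(code_points):
    """Map each code point to its Canonical Combining Class."""
    return {cp: unicodedata.combining(chr(cp)) for cp in code_points}

for cp, ccc in ccc_table(tibetan).items():
    print(f"{cp:04X} {unicodedata.name(chr(cp))}: ccc={ccc}")
```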
Actually, there are a number of loose ends still, as it appears
that some of Rob Mount's questions were not actually answered.
I understand what you say about word formation, and
combining marks, and that the Alphabetic
classification should not be limited to Ls. But
30FC is of General
At 23:33 +0200 2003-06-23, Philippe Verdy wrote:
What about the many symbols used to signal how clothes can be cleaned,
And Michael Everson responded:
A well-defined semantic set that I think deserves encoding. :-)
If what you mean is:
http://www.waschsymbole.de/en/index.html
then some
Philippe Verdy,
But it's true that complex scripts like Han will be poorly rendered in Bold
or Italic... But does anyone actually want to read Han text with Bold
characters (or even worse slanted with Italic) ?
What is true is that use of italicized text is unusual
in Chinese or Japanese
John Cowan remarked:
I've never seen these particular U+02EA and U+02EB signs, but from the
names, I'd say U+02EA is Cantonese 33 tone, and U+02EB is 22; U+02e7 and
U+02e8 might then be used for 3 and 2, respectively.
They look weird. The U+02EB (yang) one looks like reversed or turned
Philippe Verdy said:
I can say that bulk+unsolicited makes it
fully qualifiable as SPAM.
And Theodore Smith countered:
No. I'd say spam also needs to be untargeted.
Also spammers don't tend to come on list, identified with
their full names and argue the relevance of their posts,
as
Philippe Verdy noted:
In the APL subblock of the Misc.Technical block,
The APL range (not subblock) of the Miscellaneous Technical block
is U+2336..U+237A, so the following characters are not part of that
APL range:
the character ⌟ (U+231F)
is also a small bottom-right corner operator, and
Philippe Verdy vamped:
For example I would not be shocked if a text using it was rendered with
a monospaced font, where the base line of the character cell shows
multiple tiny dots, that create a contiguous dotted line when multiple
U+2024 characters (one per display cell) are used
Michael,
As a typesetter on Mac OS X, I see no reason to abandon the use of
the three-dotted horizontal ellipsis character, Ken.
Nor do I. It is fine for ellipses...
And it was encoded for that. But in encodings which don't have
an ellipsis character, it is roughly comparable to a sequence
Philippe Verdy continued:
What surprises me the most in the Unicode spec is that it
both says that its purpose is to create arbitrary length
of leaders
As in plain text, as can be seen in Table of Content listings
in many RFCs, for example. (Which, however, use ASCII 0x2E for the
same
Kent:
Others gave references where it in most cases did NOT look at all like the
empty set symbol.
Gustav Leunbach (1973), Morphological Analysis as a Step in
Automated Syntactic Analysis of a
Text. http://acl.ldc.upenn.edu/C/C73/C73-2022.pdf
uses an empty set symbol to denote a morphological
António asked:
I've just downloaded the PDF files with 4.0 additions (U40-*.pdf). One
question: How is one supposed to tell apart the glyphs for U+1D29 and
U+1D18?... Or one isn't?... (OK, this question is probably more suited
to be posed to IPA, but.)
Visually, you usually couldn't, any
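Whatever the glyphs look like, the two code points remain distinct characters with distinct names. A small sketch, assuming Python's standard `unicodedata` module with a UCD of version 4.0 or later:

```python
import unicodedata

# The two code points asked about in the question above.
for cp in (0x1D18, 0x1D29):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})")
```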
Ben Dougall asked:
On Thursday, May 29, 2003, at 02:10 pm, Philippe Verdy wrote:
Interestingly, the French first-level quotation marks use what we call
chevrons (double angle brackets).
are they something that's in unicode? apart from the less than and
greater than symbols i can't
Philippe Verdy wrote:
Code positions 0xAB and 0xBB (in ISO-8859-1) are
canonically equivalent to Unicode U+00AB («) and
U+00BB (») code points.
One correction -- this has nothing to do with canonical equivalence.
This (as for all other ISO/IEC 8859-1 encoded characters)
is an example of
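The difference between a character-set mapping and canonical equivalence can be sketched like this, assuming Python's built-in `iso-8859-1` codec and the standard `unicodedata` module. Decoding is a mapping between two coded character sets; canonical equivalence is a relation between Unicode sequences, and neither guillemet has a canonical decomposition.

```python
import unicodedata

raw = bytes([0xAB, 0xBB])          # ISO/IEC 8859-1 code positions
text = raw.decode("iso-8859-1")    # mapping into Unicode code points
print([f"U+{ord(c):04X}" for c in text])

# Every 8859-1 byte maps to the Unicode code point with the same number.
assert all(ord(bytes([b]).decode("iso-8859-1")) == b for b in range(256))

# Normalization leaves both characters alone: no canonical mapping is involved.
print(unicodedata.normalize("NFD", text) == text)
```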
Philippe Verdy said:
So I think names in both Windows and this Hapax page come
from a ISO10646 normative reference file in French, and it
contains the names for Unicode3.2 characters (but still not
new characters added or modified in Unicode 4.0)
and then asked:
Also, as this alternate
Theodore Smith wrote:
My first reaction, is that the logos don't look like they compare to
other logos in terms of style. For example Mac OSX logos, XML logos,
and that generally do look more snazzy.
They were loosely modelled on the W3C HTML validation logo, which
is comparable, in some
Thomas Widmann continued:
[EMAIL PROTECTED] writes:
Yes, I think you're right that an annotation is best -- but only
if EMPTY SET is indeed the right character. I'm increasingly of
the opinion that a different character might be needed.
I would disagree.
As would I.
Oh
Philippe Verdy continued:
From: Mark Davis [EMAIL PROTECTED]
From: Anto'nio Martins-Tuva'lkin [EMAIL PROTECTED]
On 2003.05.25, 00:00, Philippe Verdy [EMAIL PROTECTED] wrote:
even if the Dutch language considers it as a single letter, in a
way similar to the Spanish ch
I see
Peter continued:
Ken Whistler wrote on 04/02/2003 03:54:10 PM:
That isn't the only convention. I am finding several samples of
typographic retroflex hook being used to indicate nasalisation of vowels.
Jim Allan is right. It is the *ogonek* which is used to signify
the nasalization
Peter,
Why you would feel that such user sense of the characters they
are using is belied by your analysis of the shape of the hooks
used in the IJAL font is beyond me.
I'm sorry I wasn't clearer. I was not referring to their status in terms of
defining characters. I was *only*
Peter,
Note that the example you posted also had an h-ogonek, so the
usage is not limited to vowels, per se.
Indeed.
(Although that particular
entity itself is a little bizarre, since you cannot really
nasalize a voiceless glottal fricative.
Then you'd be even more surprised
At 11:33 -0600 2003-04-02, [EMAIL PROTECTED] wrote:
John Hudson [EMAIL PROTECTED] wrote on 04/02/2003 11:28:28 AM:
Yes, I would consider those ogoneks. What do they signify in Dogrib?
Nasalisation?
Not yet sure, but waiting to find out.
I would imagine they are nasals as in
Peter quoted me:
As far as I know, the same completeness issue does not apply for the
retroflex and palatal hooks -- so for those, use of the preformed
base letters is probably the better recommendation, rather than use
of the non-spacing diacritics together with ligature tables in the fonts.
Jim Allan responded to Joe Becker:
Joe posted:
c. CEDILLAS AND HOOKS:
Two cedillas and two hooks are required as diacritical marks for
bibliographic transcription, and also for the proper representation of a
number of languages (as documented in ANSI Z39.47-1985 and ISO
Creating
palatal-hook v's, x's, k's, s's, and so on if they are not
in significant use and when multiple, equally accurate,
alternative representations are available, may not be the best
thing to do.
Incidentally, reviewing Pullum and Ladusaw (1986) to help
provide the definitive answer on
Peter,
Jim Allan wrote on 04/02/2003 12:27:07 PM:
This fits a normal convention in American linguistics to use ogonek to
signify a nasal.
That isn't the only convention. I am finding several samples of typographic
retroflex hook being used to indicate nasalisation of vowels.
Jim Allan
N258A Proposal to encode two COMBINING HEART characters in the UCS
by Michael Everson, Roozbeh Pournader, and John Cowan
http://www.evertype.com/standards/iso10646/pdf/n258a-heartdot.pdf
Given the date this was submitted and the contents of the
proposal, may I guess that this is but a
But I do find, in the vocabulary and
index, words starting with tz are sorting after quatrillo con coma (it
goes z, tresillo, quatrillo, quatrillo con coma, tz). So even for this
text, a tz ligature is marginal.
irrelevant to the
Stefan asked:
Michael Everson wrote:
Shavian has graduated to encoded status, and Tengwar and Cirth will
likely also do so.
Really? I thought that it would not until Unicode 4.0 is published.
The Unicode 4.0 release is imminent -- we are anticipating
mid-April for finalization of the
Michael,
According to the American Heritage Dictionary of the English
Language, page 1303, in the list of symbols and signs, it indicates
that a symbol similar to the per-mille sign can be used to indicate
salinity. Nice annotation.
Having said that, the etymology of the percent sign
Does anyone know how to make the Devanagari glyph indicated here
http://www.hotpeachpages.net/lang/defn1.html#Hindi i.e. the glyph I have
drawn a rectangle around three samples of? If yes, please tell me.
U+0936 DEVANAGARI LETTER SHA
(although you have just circled the left half of the
Michael,
A representative of ISO sent this to me today.
I do not know about ANSI but for ISO/CS the quote given below from
http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/index.html
is certainly correct.
We make a distinction between implementation and
ANSI has membership fees, accreditation fees, and a scheme for
site licensing for access to standards documents. But I've never
heard of a license fee for *use* of ISO 639 or ISO 3166 codes.
Once you acquire the standard, you should be able to freely
use it. That is how ISO standards work.
Where
I'm guessing this may be related to the fact that ISO is now
delivering ISO 3166-1/2 codes in the form of two
Microsoft Access 2000 databases. (Although you can also order
the standards without the database files.)
http://www.iso.ch/iso/en/prods-services/iso3166ma/05database/index.html
Trying to
Lateef Sagar Shaikh asked:
For Rupees Rs. sign is used, and for Rupee Re. sign is
used, whereas in Unicode only one code point is
present for Rs. Shouldn't there be a separate place
for Re. as well?
No. Rather than using U+20A8 RUPEE SIGN, ordinary typographic
practice would just be to use
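One property relevant to this answer is that U+20A8 carries a compatibility decomposition to two ordinary letters. A minimal sketch, assuming Python's standard `unicodedata` module:

```python
import unicodedata

RUPEE_SIGN = "\u20A8"

# The UCD decomposition field: a <compat> mapping to two Latin letters.
print(unicodedata.decomposition(RUPEE_SIGN))

# Compatibility normalization replaces the sign with those letters...
nfkc = unicodedata.normalize("NFKC", RUPEE_SIGN)
print(nfkc)

# ...while canonical normalization leaves the sign untouched.
nfc = unicodedata.normalize("NFC", RUPEE_SIGN)
print(nfc == RUPEE_SIGN)
```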
Pim Blokland asked:
I've got a few questions about the use of geometric shapes, like
squares and such.
Some of these look very similar to one another, and I don't know
which ones to use in which circumstances!
Are there any guidelines on their use?
Just as an example, let's look at the
Stefan wrote:
Kenneth Whistler wrote:
DM was widely used for Deutschmarks, dkr for Danish kroner,
and so on before the switch to euros, for example.
I've only seen Danish kroner abbreviated as kr or DKK, never as
dkr. kr is the most common abbreviation in Denmark today; DKK
William Overington asked:
And nobody out there is volunteering to do it.
I was told that I could commission it.
That statement by Michael Everson was not a *permission*, but
merely a statement of fact. Anyone can commission any expert
they like, under contract to produce whatever output or
Peter,
U+00D0: The glyph that appears in the code charts for U+00D0 is shown in
LtnCapEth_DStrk.gif. Now, the African Reference Alphabet document that was
produced at a conference in Niamey in 1978 proposed a small letter that
looks like U+00F0 LATIN SMALL LETTER ETH, but the capital
William Overington asked:
I wonder if you could please say whether the Unicode 4.0 book will have the
same chapter headings and numbering as the Unicode 3.0 book?
They will be largely similar -- and identical for Chapters 1 through
5 -- but there are various reorganizations in the latter part
Otto Stolz wrote:
The two scans under
http://www.rz.uni-konstanz.de/Antivirus/tests/li.png
http://www.rz.uni-konstanz.de/Antivirus/tests/re.png
are from the authoritative (until July 1996) book on German
orthography: Duden Rechtschreibung der deutschen Sprache
und der
The reason is that the Myanmar block was given four empty columns
because we already *know* of numerous characters that will need
to be added to the Myanmar script to support Shan, Karen, Mon,
and other minority languages written with the script. Ending the
Myanmar block at U+109F (instead of
We've asked. But you need to understand that publishers
have their own rules and constraints. Paper is bought in
huge quantities by publishers, and special purpose papers
(such as lightweight, thin, high-opacity papers used in
dictionaries) are expensive and carefully planned for.
As important as
Not to disagree publicly with Michael or Mark on this, but
in the interests of accuracy, I should point out that if the
rest mass of the Unicode 4.0 publication is assumed to be exactly
4.1 kg (which then would, indeed, also be the case on our
moon, or even a Jovian moon), and ignoring any
Well, I can't diagnose exactly what is going wrong, but
Unicode character (\uFFE2\uFF80\uFF93)
is a sequence of a full-width not sign, followed by a
half-width katakana ta and a half-width katakana mo.
What you are actually looking for is the UTF-8 sequence:
0xE2 0x80 0x93
which is the UTF-8
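The two readings of those three values can be compared directly. A sketch, assuming Python's built-in `utf-8` codec and the standard `unicodedata` module; the sign-extension step is only one plausible guess at how the garbled form arose, not something the post establishes.

```python
import unicodedata

utf8_bytes = bytes([0xE2, 0x80, 0x93])

# Correct reading: one character encoded in three UTF-8 bytes.
decoded = utf8_bytes.decode("utf-8")
print(f"U+{ord(decoded):04X}", unicodedata.name(decoded))

# Hypothetical failure mode (an assumption): each byte is treated as a
# signed 8-bit value and widened to 16 bits, giving 0xFF00 | byte.
garbled = "".join(chr(0xFF00 | b) for b in utf8_bytes)
print([f"U+{ord(c):04X} {unicodedata.name(c)}" for c in garbled])
```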
Antonio asked:
On 2003.02.25, 19:36, Asmus Freytag [EMAIL PROTECTED] wrote:
At 12:55 PM 2/25/03 +0000, Anto'nio Martins-Tuva'lkin wrote:
Most (all?) of them are composable, either by means of letter +
slash (OSLI) or by ZWJ (for things like Pta or Pts, if
anything),
Using ZWJ
On Sun, 2 Mar 2003, Kevin Brown wrote:
Does anyone know of a Latin-based language in which it is possible to
have a lowercase immediately followed by an uppercase in the SAME word?
In addition to the examples pointed out by Roozbeh and Michael,
this pattern is growing increasingly common
Frank Tang asked:
This discussion has been centered around UTF-8. But I hope the
corresponding rules apply to UTF-16 and UTF-32 for Unicode 4.0:
. for UTF-32: occurrences of 'surrogates' are ill-formed.
How about UTF-32 sequence which the 4 bytes represent value U+10 ?
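The UTF-32 rule can be probed with a strict decoder. A minimal sketch, assuming CPython's built-in `utf-32-le` codec, which rejects both surrogate values and values beyond U+10FFFF; the helper name is mine.

```python
import struct

def decodes_as_utf32(value: int) -> bool:
    """True if the 32-bit value is accepted as a single UTF-32 code unit."""
    data = struct.pack("<I", value)  # little-endian, no byte order mark
    try:
        data.decode("utf-32-le")
        return True
    except UnicodeDecodeError:
        return False

for value in (0x41, 0xD800, 0xDFFF, 0x10FFFF, 0x110000):
    print(f"0x{value:X}: {decodes_as_utf32(value)}")
```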
Frank Tang responded to Kent Karlsson's response:
The problem I need to deal with is not how to GENERATE this UTF-8, but how to
handle the DATA when my code receives it. For example, when I receive a
10K UTF-8 file which has 1000 lines of text, if there is one UTF-8
sequence in line 990
, thus reconstructing all the gaps. Of course,
there are much better approaches to self-correcting data
transmission, but you get the idea. This would be a perfectly
valid and conformant way to use UTF-8 data.
tex
Kenneth Whistler wrote:
Absolutely. Error handling is a matter of software
Stefan Persson suggested:
Unicode 3.0 defined non-shortest UTF-8 as *irregular* code value
sequences. There were two types:
a. 0xC0 0x80 for U+0000 (instead of 0x00)
b. 0xED 0xA0 0x80 0xED 0xB0 0x80 for U+10000 (instead of 0xF0 0x90 0x80
0x80)
Ah, but encoding NULL as a
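How a current strict decoder treats the two irregular sequences quoted above can be sketched as follows, assuming CPython's built-in `utf-8` codec. The helper name is mine, and the behaviour shown is that of this codec rather than of any Unicode 3.0-era implementation.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """True if a strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

overlong_null = bytes([0xC0, 0x80])                       # two-byte form of U+0000
surrogate_pair = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])  # D800 DC00 encoded separately
shortest_null = bytes([0x00])
shortest_u10000 = bytes([0xF0, 0x90, 0x80, 0x80])

for name, data in [("overlong NULL", overlong_null),
                   ("surrogate pair", surrogate_pair),
                   ("shortest NULL", shortest_null),
                   ("shortest U+10000", shortest_u10000)]:
    print(f"{name}: {is_well_formed_utf8(data)}")
```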
Frank Tang wrote:
I think that is a very common mistake people WILL make.
Especially if they keep telling each other the wrong thing,
and then rely on folklore about the standard as their source
of information.
The ultimate source of information about a standard is the
standard itself.
If
Frank Tang continued:
If you read through those definitions from Unicode 4.0 carefully,
you will see that UTF-8 representing a noncharacter is perfectly
valid, but UTF-8 representing an unpaired surrogate code point
is ill-formed (and therefore disallowed).
I see a hole here. How about
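The distinction drawn in that quotation, noncharacters being encodable while unpaired surrogates are not, can be sketched from the encoding side. This assumes CPython's built-in `utf-8` codec; the helper name is mine.

```python
def encodes_as_utf8(ch: str) -> bool:
    """True if a strict UTF-8 encoder accepts the code point."""
    try:
        ch.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

noncharacters = ["\uFFFE", "\uFFFF", "\U0001FFFE", "\U0010FFFF"]
lone_surrogates = ["\uD800", "\uDFFF"]

for ch in noncharacters + lone_surrogates:
    print(f"U+{ord(ch):04X}: {encodes_as_utf8(ch)}")
```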
Frank Tang asked:
so the UTF-8 sequences which represent U+FFFE, U+FFFF and U+{1-10}FFF{E,F}
are considered legal in Unicode 4.0
Yes. Such sequences are also legal in Unicode 3.0, 3.1, and 3.2.
The Unicode Standard, Version 3.0 specified, on p. 46:
To ensure that round-trip transcoding is
Don't all overwhelm the sites at once, but here is
the documentation people are looking for:
http://www.birdtheme.org/country/paraguay.html
Paraguay has published a lot of stamps with bird themes.
If you look at the 1983 series of South American birds,
you will see that they were using Gs. for
Frank Tang asked:
I am working on updating the Mozilla UTF-8 code to incorporate the change
of UTF-8 definition in Unicode 3.1 (make non-shortest form illegal,
and make 5-6 octets illegal) and Unicode 3.2 (make irregular form
illegal) now. I wonder do have any change of the UTF-8
Andrew followed up:
Maybe what I'm really trying to ask is, if sometime in the future we
start to run out of space in the BMP, could U+9FB0 through U+9FFF be
reallocated to some new script, or is the allocation of these 80 codepoints to
the CJK block permanent and irrevocable ?
Please study
I know y'all are having fun with this thread, but in
case Andrew's inquiry is at least half-serious:
But why is the Hot Beverage character listed under the heading Weather Symbol
in the Miscellaneous Symbols code chart ? Does it rain tea and coffee in North
Korea ? Or does the annotation can
Andrew asked:
I've asked this question before, but I've never had a satisfactory response, so
I'll ask it again now that Unicode 4 is due to be released soon.
Section 10.1 of the Unicode Standard, as well as Blocks-4.0.0.txt, give the
range of the CJK Unified Ideographs block as U+4E00
Andy continued:
In principle, at some point in the future, either the
phonology or the orthography or both could evolve to
the point where the entire constructs start to get handled
as basic orthographic units (or letters) for Bengali,
but it isn't really the place of the Unicode
Marco Cimarosti wrote:
It has been repeated a lot of times that no more precomposed character will
never ever ever ever be added. ...
I trust the clarification from John Cowan helped on this -- there
is no prohibition against adding characters with *compatibility*
decomposition mappings,
Andy White wrote:
And I today see that the precomposed character '0B71 ORIYA LETTER WA'
has been added to the UCS4.0 charts
http://www.unicode.org/charts/PDF/U40-0B00.pdf
This is clearly a composition of ORIYA LETTER O and ORIYA LETTER
VA (BA).
People on the list today are playing a
António MARTINS-Tuválkin (with no diaeresis !) asked:
Anyway, I noted once more that many cyrillic letters I'd consider as
base letter + diacritical composites are not decomposable according to
Unicode. I planned to dwell deeper into this, but is there a short
answer for it?
The short answer
John Cowan noted:
So formal canonical decompositions are almost entirely
confined to separable, accent-like diacritics (acute,
grave, diaeresis, and so on). The only significant exceptions are
the cedilla and ogonek, which attach smoothly to letter
bottoms without otherwise distorting
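The pattern described here can be observed in the decomposition field of the UCD. A minimal sketch, assuming Python's standard `unicodedata` module; the sample letters are illustrative choices, and the two without a decomposition (a stroke and a Cyrillic descender) are my own contrast cases, not examples from the post.

```python
import unicodedata

# Illustrative samples, not taken from the thread.
samples = {
    "\u00E9": "e with acute",
    "\u00E7": "c with cedilla",
    "\u0105": "a with ogonek",
    "\u00F8": "o with stroke",
    "\u049B": "Cyrillic ka with descender",
}

def canonical_decomposition(ch: str) -> str:
    """Return the UCD decomposition field ('' when there is none)."""
    return unicodedata.decomposition(ch)

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {canonical_decomposition(ch)!r}")
```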
Erik followed up:
From what I'm hearing from you all is that a null
in UTF-8 is for termination and termination only.
Is this correct?
Not quite. A null byte (0x00) in UTF-8 is only a
representation of the NULL character (U+0000). It can
be present in UTF-8 for whatever purposes one might
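A short sketch of that point, assuming Python's built-in `utf-8` codec: an embedded NULL is an ordinary character that round-trips through UTF-8, and it only acts as a terminator for code that chooses to treat it as one (simulated here by splitting at the first 0x00 byte).

```python
text = "abc\u0000def"              # NULL embedded in the middle of a string
encoded = text.encode("utf-8")

print(encoded)                     # the NULL is the single byte 0x00
print(encoded.decode("utf-8") == text)

# What a NUL-terminated string API would see: everything before the first 0x00.
c_string_view = encoded.split(b"\x00")[0]
print(c_string_view)
```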
Doug Ewell noted:
As for Issue #6, Unicode 4.0 Alpha data, there hasn't been much new to
review so far. The first UnicodeData.txt file to contain the new
character assignments in Unicode 4.0 was posted only a few hours ago!
Eleven days might not be much time to check through 1200+ new
Erik Ostermueller asked:
We have a large amount of C++ that currently has Unicode 2.0 support.
Could you all help me figure out what types of operations will fail
if we attempt to pass Unicode 3.0 thru this code?
I can start the list off with
-sorting
-searching for text
This
This is a simple example demonstrating my own personal method.
//to upper case
public char upper(int c)
{
    // 97..122 are the ASCII code points for 'a'..'z'
    return (char)((c >= 97 && c <= 122) ? VisitSewers(c) : c);
}
static int VisitSewers(int c)
{
    return AlligatorByte(c);
}
static int AlligatorByte(int c)
{
    // Remove
Curtis asked:
I have a distinct memory of a precomposed Latin letter n with diaeresis
(as in the band Spinal Tap), but now I can't find it. It doesn't matter
to me whether it exists or not, other than helping me to understand my
memory. Am I missing it? Did it exist once and is now gone?
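One way to test such a memory is to ask a normalizer whether the combining sequence composes to a single code point. A minimal sketch, assuming Python's standard `unicodedata` module; the u-diaeresis case is included only as a contrast that does compose.

```python
import unicodedata

n_diaeresis = "n\u0308"   # n + COMBINING DIAERESIS
u_diaeresis = "u\u0308"   # u + COMBINING DIAERESIS, for contrast

composed_n = unicodedata.normalize("NFC", n_diaeresis)
composed_u = unicodedata.normalize("NFC", u_diaeresis)

# A precomposed character would reduce the sequence to one code point.
print(len(composed_n), [f"U+{ord(c):04X}" for c in composed_n])
print(len(composed_u), [f"U+{ord(c):04X}" for c in composed_u])
```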