I remembered that I had done something with making a Unicode Poster some time
ago. Dusted it off, and posted the results.
Voila, every Unicode character in 4.0:
http://www.macchiato.com/unicode/UnicodeChart.zip
Columns: 256, Rows: 410
all unassigned rows are skipped (with double line showing
Unfortunately, charset names -- including IANA names -- are in general not
well-defined, in the sense that
- one can access a mapping table to/from Unicode/10646 for them
- that mapping table is guaranteed to represent what a vendor actually does in
conversion APIs.
Thus, what we base our aliases
Addison and I have been working on a proposed successor to RFC
3066 (language tags), which should be of interest to many people on this
list.
http://www.ietf.org/internet-drafts/draft-langtags-phillips-davis-01.txt
Feedback is welcome.
Mark
Note: we submitted a PDF version at the same time,
- Original Message -
From: Theodore H. Smith [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Tue, 2003 Nov 18 08:42
Subject: Re: Ternary search trees for Unicode dictionaries
Hi Mark,
Your tries are nice, however they are being used for single unicode
characters
We tend to use tries, which have very good performance characteristics. See
bits of unicode on my site: www.macchiato.com.
Mark
__
http://www.macchiato.com
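For readers unfamiliar with tries keyed on single code points, here is a minimal sketch of a two-stage lookup table, one common way such a trie is laid out. It is an illustration only, not the structure used in ICU or on the site above; the names `build_two_stage`, `lookup`, and the 256-entry block size are assumptions chosen for the example.

```python
BLOCK_BITS = 8                 # assumed block size: 256 code points per block
BLOCK_SIZE = 1 << BLOCK_BITS


def build_two_stage(values, default=0, limit=0x110000):
    """Build (index, blocks) from a dict mapping code point -> value.

    Identical blocks are stored once, so large runs of code points
    that share the default value cost almost nothing.
    """
    blocks = []        # distinct blocks, each a tuple of BLOCK_SIZE values
    block_ids = {}     # block contents -> position in `blocks`
    index = []         # one entry per block of the code space
    for start in range(0, limit, BLOCK_SIZE):
        block = tuple(values.get(cp, default)
                      for cp in range(start, start + BLOCK_SIZE))
        if block not in block_ids:
            block_ids[block] = len(blocks)
            blocks.append(block)
        index.append(block_ids[block])
    return index, blocks


def lookup(index, blocks, cp):
    """Two array accesses: the high bits pick the block, the low bits the slot."""
    return blocks[index[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)]


# Hypothetical property values for three code points.
index, blocks = build_two_stage({0x41: 1, 0x5D0: 2, 0x10400: 3})
print(lookup(index, blocks, 0x5D0), len(index), len(blocks))
```

The lookup cost is constant regardless of how many code points carry a value, which is the performance characteristic that makes this layout attractive for per-character properties.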
- Original Message -
From: Theodore H. Smith [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Mon,
Phillipe, instead of trying to sound authoritative by making up a whole-cloth
definition -- one that is completely and utterly wrong -- and thereby confuse
and mislead a beginner, you should either be silent or simply point the person
to the Unicode glossary:
Message -
From: Kent Karlsson [EMAIL PROTECTED]
To: 'Peter Kirk' [EMAIL PROTECTED]; 'Mark Davis' [EMAIL PROTECTED]
Cc: 'Unicode List' [EMAIL PROTECTED]; 'Roozbeh Pournader'
[EMAIL PROTECTED]
Sent: Mon, 2003 Nov 10 03:01
Subject: RE: ZWJ, ZWNJ, CGJ and combination
...
I would see this use
I agree -- this is pointless. The UTC has discussed this before, and I don't
think there is any chance that the UTC would add either:
(a) made-up hexadecimal digits that differ in shape from A-F, or
(b) glyphic clones of A-F that were hexadecimal digits.
Mark
__
]
To: Mark Davis [EMAIL PROTECTED]
Cc: Unicode List [EMAIL PROTECTED]
Sent: Sun, 2003 Nov 09 09:19
Subject: Re: ZWJ, ZWNJ, CGJ and combination
On 08/11/2003 17:09, Mark Davis wrote:
I agree with the first part of your analysis. By the phrase requesting
ligation
of combining characters
The UTC just approved a clarification
of the base character definition, as follows:
D13a Graphic character: a character with the General Categories of
Letter (L), Combining Mark (M), Number (N), Punctuation (P), Symbol (S), or
Space Separator (Zs).
Graphic characters specifically exclude
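The inclusion list in D13a can be checked mechanically. The sketch below uses Python's standard `unicodedata` module and implements only the categories named above; the exclusions are cut off in the text here, so they are not modeled. The function name `is_graphic` is an assumption for illustration, and the results depend on the Unicode version bundled with the Python in use.

```python
import unicodedata


def is_graphic(ch):
    """True if ch has General Category L*, M*, N*, P*, S*, or Zs."""
    category = unicodedata.category(ch)
    return category[0] in "LMNPS" or category == "Zs"


for ch in ("A", " ", "\u0301", "\n", "\u2028"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_graphic(ch))
```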
(followup) And for checking character properties without having to delve into
the UCD data files, try the ICU Demo at:
http://oss.software.ibm.com/cgi-bin/icu/ub/utf-8/?ch=200B
Mark
__
http://www.macchiato.com
- Original Message -
From: Peter Kirk
- Original Message -
From: Peter Kirk [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]
Cc: Unicode List [EMAIL PROTECTED]
Sent: Sat, 2003 Nov 08 16:09
Subject: Re: ZWJ, ZWNJ, CGJ and combination
On 08/11/2003 15:52, Mark Davis wrote:
The UTC just approved a clarification of the base
You are stating many things as if they were facts, when they are simply not
true. You should verify them against the definitions before stating them in such
a 'definitive' way.
Examples:
- VS1 is a combining character, and not a base character.
Thank you for the interesting thoughts. As I understand your suggestion,
and bearing in mind that dagesh (and the rare rafe) are also consonant
modifiers, you are effectively suggesting an order (already normalised):
consonant dagesh rafe shin/sin-dot CGJ right-meteg CGJ vowel accent CGJ
Check out ICU4J (http://oss.software.ibm.com/icu4j/).
There is a demo of transliteration at http://oss.software.ibm.com/cgi-bin/icu/tr.
For Cyrillic, we currently only do an ISO-based transliteration, but you can do
your own custom ones.
(The demo will store custom rules that people have
I want to caution people that the chart should *not* be taken as an exact guide.
The percentage of language speakers within a country, and the percent of GDP
ascribable to those language speakers are all pretty fuzzy. In addition, I had
excluded countries that were at or below 0.05% of world GDP,
__
http://www.macchiato.com
- Original Message -
From: Marco Cimarosti [EMAIL PROTECTED]
To: 'Mark Davis' [EMAIL PROTECTED]; [EMAIL PROTECTED];
[EMAIL PROTECTED]
Sent: Wed, 2003 Oct 22 02:17
Subject: RE: GDP by language
Mark Davis wrote:
BTW, some time ago I had
BTW, some time ago I had generated a pie chart of world GDP divided up by
language.
Someone on this list asked for a copy, so I posted it here in case others might
find it interesting:
http://www.macchiato.com/economy/GDP_PPP_by_language.pdf
Mark
__
It is PPP. (You get a very different pie chart with other measures of GDP, of
course).
Mark
__
http://www.macchiato.com
- Original Message -
From: Patrick Andries [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]; [EMAIL PROTECTED];
[EMAIL PROTECTED
multilingual populations.
Still, I think it is close enough to get a useful overall picture.
Mark
__
http://www.macchiato.com
- Original Message -
From: John Cowan [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent
I can even read Mark
Davis' signature - that is, it appears correctly, I'd love to know what
it means!
shiSyAdicchetparAjayam
shiSyAt from the student
icchet one should desire
parAjayam defeat
A teacher should wish to be defeated by his own student in scholarship
I got this from
I don't think it is quite that simple. Look at India, for example.
Mark
__
http://www.macchiato.com
- Original Message -
From: John Cowan [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Tue, 2003 Oct 21 12:36
Subject: Re
With respect to the issues we raised in
http://www.unicode.org/consortium/utc-positions.html, the IAB has taken the
following positions:
http://www.iab.org/documents/correspondance/2003-09-25-iso-cs-code.html
http://www.iab.org/documents/correspondance/2003-09-23-isocodes.html
IBM doesn't currently offer regular public ICU training. We do provide
overviews of ICU at the Unicode conferences (and will do so again at the
upcoming Atlanta GA (USA) meeting). If enough people at that conference are
interested, we may also be able to hold an ad hoc session there.
If
There is a minor update to http://www.unicode.org/reports/tr29/tr29-5.html to
use the new UCD property.
Mark
__
http://www.macchiato.com
Eppur si muove
The purpose of the Pattern Syntax characters is *not* to list everything that is
a symbol or punctuation mark. That exists independently. Think of them as
operators in the engine syntax, as ? or * are used today in Perl, or as
+, -, /, * could be used in math expressions.
The goal is to have a
Technical Report issues would be fine.
I think #1 is worth considering. For #2, see other message to Peter Kirk.
Mark
__
http://www.macchiato.com
Eppur si muove
- Original Message -
From: Marco Cimarosti [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent:
There is a new version of the default Unicode collation element table at:
http://www.unicode.org/reports/tr10/allkeys-4.0.0d5.txt
with corresponding charts at
http://www.unicode.org/charts/collation/beta/
Mark
__
http://www.macchiato.com
Eppur si muove
Agreed. Maybe we could have an [EMAIL PROTECTED] just so that people can
shift their conversations over there when they depart from discussions of
Unicode. Then people can discuss conventions for the price of a metric pint of
beer with hexadecimal euro number formats to their heart's content, and
be on those groups anyway.
Mark
__
http://www.macchiato.com
Eppur si muove
- Original Message -
From: Matitiahu Allouche [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Thursday, August 21, 2003 00:55
There is one open issue I'd like to draw people's attention to: whether to have
a narrow or broader approach to the whitespace in a pattern environment. The
narrower definition would be:
0009..000D ; Pattern_White_Space # CHARACTER TABULATION..CARRIAGE RETURN
(CR)
0020 ; Pattern_White_Space
I suspect your distinction is a bit too subtle to be useful. Having, for
example, an RLM only have an effect when adjacent to a space in a regular expression
would be pretty prone to error, especially since the character would be
invisible.
The reason for allowing LRM and RLM is to be able to make
,
then raised to the bidi list once there is more consensus.
Mark
__
http://www.macchiato.com
Eppur si muove
- Original Message -
From: Peter Kirk [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]
Cc: Matitiahu Allouche [EMAIL PROTECTED]; [EMAIL PROTECTED
Remember, this is *not* when using the pattern to parse, this is in constructing
the pattern itself.
Mark
__
http://www.macchiato.com
Eppur si muove
- Original Message -
From: Ben Dougall [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]
Cc: [EMAIL
Could the [Way OT] discussion be moved to an egroup or other forum?
Mark
__
http://www.macchiato.com
Eppur si muove
- Original Message -
From: Anto'nio Martins-Tuva'lkin [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, August 21, 2003 13:00
or not?
Mark
__
http://www.macchiato.com
Eppur si muove
- Original Message -
From: Matitiahu Allouche [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Tuesday, August 19, 2003 01:21
Subject
Yes, I am sick and tired of dealing with this horrible non-decimal measurement
system the US has for time: the number of units per other unit varies all across
the board: 60..61 : 1, 60 : 1, 24 : 1, 28..31 : 1, 12 : 1, 365..366 : 1 --
awful. At least with inches, feet, and miles, the number of feet
://www.unicode.org/reports/tr10/tr10-10.html for more
information.
Mark
__
http://www.macchiato.com
Eppur si muove
- Original Message -
From: Peter Kirk [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]
Cc: Matitiahu Allouche [EMAIL PROTECTED]; [EMAIL PROTECTED
!
Mark
__
http://www.macchiato.com
Eppur si muove
- Original Message -
From: Peter Kirk [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]
Cc: Matitiahu Allouche [EMAIL PROTECTED]; [EMAIL PROTECTED];
[EMAIL PROTECTED]; [EMAIL PROTECTED]; Joan Wardell
]
To: Mark Davis [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Monday, August 18, 2003 10:08
Subject: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)
I have submitted the following text on the Unicode Reporting form.
This report relates to the collation
There are also beta collation charts in:
http://www.unicode.org/charts/collation/beta/
Mark
__
http://www.macchiato.com
Eppur si muove
- Original Message -
From: Rick McGowan [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, August 15, 2003 19:27
comments below.
Mark
__
http://www.macchiato.com
Eppur si muove
- Original Message -
From: Michael (michka) Kaplan [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]; [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Sunday, August 17
Peter, in XML you really don't want to use attributes for any general
text; there are too many restrictions on the content. For example, we
never put translatable text into them. Attributes should really be
treated more like sequences of symbols, with a constrained syntax.
This is also not in
Moreover, as I wrote before, the wording in that one paragraph in 3.0
is not clearly stated, but it is clear from a reading of the rest of
the standard -- with numerous examples -- and from the UCD 3.0
properties, that space *is not* a format character, and *is* a
suitable base for combining
Some of this seems to be in reference to an earlier contention that
Text Boundaries (inc. Lines) break between the space and the
non-spacing mark. I think this was attributed to Phillipe.
[This may not be true: I don't actually read his email, because the
information content per line falls below
Questions [at http://www.unicode.org/faq/]. (Which I did.)
Now, if it is true, as Mark Davis suggests, that the Frequently Asked
Questions list at http://www.unicode.org/faq/ is unrelated to this list,
then:
(1) This should be made clear on the consortium's web page
(http://www.unicode.org
Where did you get the notion that space is not a base character? And
base characters include those that are not control or format
characters. Space is neither one.
The standard specifically states in a number of places that to exhibit
a combining mark in isolation you use a space (or NBSP).
Mark
I repeat again. Nothing on this list has any guarantee that it will be
seen by anyone in the UTC. If you want to submit a FAQ question that's
great -- and I strongly encourage it. But please use:
http://www.unicode.org/reporting.html to make sure it is tracked.
The same goes for comments from
As for oe-ligature, the
French representative to WG3 (or its predecessor) said that France
could
live without it.
Even worse; the story I heard was that the committee had planned from
the start to have Œ and œ in positions D7 and F7, but that late in the
process the representative from France
- Original Message -
From: Peter Kirk [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]
Cc: Unicode List [EMAIL PROTECTED]
Sent: Tuesday, August 05, 2003 14:50
Subject: Re: Display of Isolated Nonspacing Marks (was Re: Questions
on ZWNBS...)
On 05/08/2003 14:40, Mark Davis wrote:
Where
The ZWSP and Word Joiner (plus ZWNBSP in its discouraged usage) are
targeted specifically at encouraging or avoiding *line break*. Their
names may be misleading; people intending to use them for any other
function should carefully read the sections of the Unicode Standard
that discuss their usage.
I would remind the people interested in Hebrew issues that a list has
been set up for their benefit, and recommend that they use it.
Cf.
Darling Unicadetti...
By popular demand, considering the deluge of Biblical
Hebrew issues cropping up recently on the Unicode list,
I have created a new
ok, np
Mark
__
http://www.macchiato.com
Eppur si muove
- Original Message -
From: John Cowan [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]
Cc: Peter Kirk [EMAIL PROTECTED]; Ted Hopp
[EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Friday, August 01
Various people have demonstrated a certain amount of confusion around
the notion of a deterministic sort vs a deterministic comparison. This
is an important issue for Unicode sorting and string comparison, so I
put together some material into a tech note and passed it by the
editorial committee.
This depends on who you mean by we. It's not just you and me, Ted.
If
in discussions on this list a consensus is reached that this is the
best
way to go, then we have the top people in Unicode behind us and
We should make sure that you all understand that this email list is an
open discussion
muove
- Original Message -
From: Peter Kirk [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]
Cc: Ted Hopp [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Wednesday, July 30, 2003 18:19
Subject: Re: From [b-hebrew] Variant forms of vav with holem
On 30/07/2003 17:04, Mark Davis wrote:
We
Changing the canonical order is not going to happen. If you want to
read about the problems that that would cause, there has been plenty
written about it on this list if you consult the archives.
Mark
__
http://www.macchiato.com
Eppur si muove
- Original
Peter,
Effectively we'd be looking at some amendment to the normalization
algorithms to insert CGJ in certain enumerated contexts.
The standard normalization forms (NFC, NFD, NFKC, NFKD) will certainly
not change in this regard.
On the other hand, it would be possible to add additional
Peter,
This all depends on whether the UTC approves, at the upcoming meeting
in August, the proposal to extend the use of CGJ to allow for
inclusion within sequences of combining marks in order to prevent
reordering of those marks.
Of course, it could be used right now for that purpose, in the
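To make the reordering concrete, here is a small sketch using Python's standard `unicodedata` module. It shows what canonical ordering does to two Hebrew points on one consonant, and that a CGJ between them keeps their order. It reflects the Unicode data shipped with Python, not the status of the proposal discussed above; the choice of lamed, patah, and hiriq is only an illustrative assumption.

```python
import unicodedata

LAMED = "\u05DC"   # base consonant
PATAH = "\u05B7"   # combining class 17
HIRIQ = "\u05B4"   # combining class 14
CGJ = "\u034F"     # COMBINING GRAPHEME JOINER, combining class 0


def codepoints(text):
    """Format a string as a list of U+XXXX labels for display."""
    return [f"U+{ord(ch):04X}" for ch in text]


plain = LAMED + PATAH + HIRIQ
guarded = LAMED + PATAH + CGJ + HIRIQ

# Without CGJ, canonical ordering sorts the marks by combining class.
reordered = unicodedata.normalize("NFD", plain)
# With CGJ, the class-0 character blocks reordering across it.
kept = unicodedata.normalize("NFD", guarded)

print(codepoints(reordered))
print(codepoints(kept))
```

The standard normalization forms are unchanged in both cases; the CGJ simply sits between the marks as a character with combining class 0.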
Exactly. See http://www.unicode.org/faq/normalization.html#8, for
example. (Note: the last FAQ would change if the UTC accepts the
proposal for usage of CGJ.)
Mark
__
http://www.macchiato.com
Eppur si muove
- Original Message -
From: Peter Kirk [EMAIL
There is a new proposed draft TR available for public comment on
http://www.unicode.org/reports/tr31/.
This document describes specifications for recommended defaults for
the use of Unicode in the definitions of identifiers and in
pattern-based syntax. It incorporates the Identifier section of
and Azeri, was: Accented ij ligatures)
On Monday, July 14, 2003 5:34 AM, Mark Davis [EMAIL PROTECTED]
wrote:
...
Of course
Java already includes some parts of ICU, but other things are in
ICU4J are difficult now to integrate in Java, simply because IBM
forgot to modularize ICU so
...
Of course
Java already includes some parts of ICU, but other things are in
ICU4J are difficult now to integrate in Java, simply because IBM
forgot to modularize ICU so that it can be integrated slowly.
Accepting ICU4J as part of the core is a big decision choice,
because ICU4J is quite
Another consequence is that it separates the sequence into two
combining sequences, not one. Don't know if this is a serious problem,
especially since we are concerned with a limited domain with
non-modern usage, but I wanted to mention it.
Mark
__
1. I agree with Ken about the current lack of precedent for Cfs before
combining marks. Interestingly, we do have a proposal to do just
that, in
http://www.unicode.org/review/pr-9.pdf
However, note that the whole purpose of putting the Cf after the Ra is
to separate it from the halant, so
Michael, that is like saying move the bloody character or remove
the bloody character.
Mark
__
http://www.macchiato.com
Eppur si muove
- Original Message -
From: Michael Everson [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, June 25, 2003
this was the case
Someone might misread your statement. We did not change the combining
classes for Hebrew.
Mark
__
http://www.macchiato.com
Eppur si muove
- Original Message -
From: Michael (michka) Kaplan [EMAIL PROTECTED]
To: [EMAIL PROTECTED];
If you start on http://www.unicode.org/ and click on Start Here,
you'll get to a page about the Unicode Standard.
In the left-hand column, clicking on Versions of the Unicode Standard
will get you to http://www.unicode.org/standard/versions/.
In the left-hand column you will see the different
__
http://www.macchiato.com
Eppur si muove
- Original Message -
From: Mount, Rob (Robert F) [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Thursday, June 05, 2003 11:57
Subject: RE: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED
SOUND MARK
Thanks
for 2.x are archived?
If so, is that a good idea?
Barry
At 11:09 AM 6/5/2003 -0700, Mark Davis wrote:
If you start on http://www.unicode.org/ and click on Start Here,
you'll get to a page about the Unicode Standard.
In the left-hand column, clicking on Versions of the Unicode
Standard
A few items:
I agree with your main point, which is that UCS-2 is, for all
practical purposes, just a repertoire subset of UTF-16; the code units
and bit-width are the same.
Some Java classes that assume that the char arithmetic will
automatically roll after 16 bits are wrong. The JVM spec
Rick posted a message recently he intended as a personal contribution,
but it may have been interpreted as an official statement. Here is
some clarification of what he wrote.
1. His point about compliance and conformance was intended to indicate
that using the savvy logo would only indicate that
Well, I don't know who told you, but WORD JOINER only affects
linebreak behavior, not intercharacter spacing.
Mark
__
http://www.macchiato.com
Eppur si muove
- Original Message -
From: Anto'nio Martins-Tuva'lkin [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
One minor correction:
However, it's true that ECMAScript will allow you to create invalid
Unicode strings.
More precisely, ECMAScript (and other systems) will allow you to
create 16-bit Unicode strings that are not UTF-16.
See Section 2.7 in http://www.unicode.org/book/preview/ch02.pdf.
Mark
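The distinction between a 16-bit Unicode string and well-formed UTF-16 can be tested directly. The sketch below uses Python rather than ECMAScript, purely for illustration: it packs a sequence of 16-bit code units and asks the standard UTF-16 decoder whether they are well-formed. The helper name `is_well_formed_utf16` is an assumption.

```python
import struct


def is_well_formed_utf16(units):
    """True if the 16-bit code units form valid UTF-16 (no unpaired surrogates)."""
    data = struct.pack(f"<{len(units)}H", *units)   # little-endian 16-bit units
    try:
        data.decode("utf-16-le")
    except UnicodeDecodeError:
        return False
    return True


# 'A', an unpaired high surrogate, 'B': a 16-bit string that is not UTF-16.
print(is_well_formed_utf16([0x0041, 0xD800, 0x0042]))
# 'A' followed by a proper surrogate pair (U+10400).
print(is_well_formed_utf16([0x0041, 0xD801, 0xDC00]))
```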
Can you respond back to them with the information as to the languages
involved?
Mark
[EMAIL PROTECTED]
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799
- Original Message -
From: Michael Everson [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent:
By the way, a few people have been discussing possible solutions to some of
the problems with language codes (and their relation to locales), which may
be of interest to some people here. The discussion has just been switched to
http://www.alvestrand.no/mailman/listinfo/ietf-languages, which has
Some people asked me where they could see copies of my Unicode 4.0 slides
and ICU Overview slides from the Prague conference. I posted them on my
website, at http://www.macchiato.com/.
Once Steven gets back, we'll post copies of the LDML slides (and the ICU
Overview slides) on the ICU site.
Mark
-0799
- Original Message -
From: Doug Ewell [EMAIL PROTECTED]
To: Unicode Mailing List [EMAIL PROTECTED]; Mark Davis
[EMAIL PROTECTED]
Sent: Friday, March 28, 2003 17:31
Subject: Inherited-script characters
Last December, Mark Davis indicated that a passage similar to the
following would
Claude Tardif
[EMAIL PROTECTED]
To: Mark Davis/Cupertino/[EMAIL PROTECTED]
cc: [EMAIL
He keeps them all ;-)
Mark
[EMAIL PROTECTED]
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799
- Original Message -
From: Michael Everson [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, March 19, 2003 12:23
Subject: Re: Re. and Rs.
I do say that if a webpage has U+E000 defined as banana and I have it
defined as apple, that then their range U+E000-U+F8FF is a different PUA,
belonging to a different extension of unicode than my range U+E000-U+F8FF
It is *not* a different PUA. The PUA is defined to be simply a range of
code
The following UAXes have beta drafts available for review. These documents
are updated to the latest UTC decisions for Unicode 4.0.0. (#14 will have a
few more changes soon to account for UTC decisions made in the last
meeting.) Each document has a Modifications section that describes the
latest
If you have questions as to particular normalizations, I'd suggest looking
at the normalization charts on the Unicode website.
Mark
[EMAIL PROTECTED]
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799
- Original Message -
From: David J. Perry [EMAIL
This might be worth writing a Technical Note to start with; see
http://www.unicode.org/notes/
Mark
[EMAIL PROTECTED]
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799
- Original Message -
From: Frank da Cruz [EMAIL PROTECTED]
To: Pim Blokland
I want to point out two things.
1. UCA provides a mechanism for producing a deterministic sort (there
called semi-stable). See step 3.10
(http://www.unicode.org/reports/tr10/#Step_3).
2. A deterministic sort is actually not needed very often; people confuse
it with a stable sort. See
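A small sketch of the difference, in Python. A stable sort keeps items that compare equal in their input order, so the output can depend on the input order. A deterministic comparison never reports two different strings as equal, so the output is the same for any input order. The case-insensitive key and the code point tie-breaker below are assumptions chosen for illustration; they are not the UCA mechanism itself.

```python
def loose_key(s):
    """Comparison that treats 'a' and 'A' as equal."""
    return s.casefold()


def deterministic_key(s):
    """Same primary comparison, with the raw string as a final tie-breaker."""
    return (s.casefold(), s)


first = ["b", "A", "a"]
second = ["a", "b", "A"]

# Python's sort is stable: ties keep input order, so the results differ.
stable_first = sorted(first, key=loose_key)
stable_second = sorted(second, key=loose_key)

# With a tie-breaker there are no ties left, so the results agree.
det_first = sorted(first, key=deterministic_key)
det_second = sorted(second, key=deterministic_key)

print(stable_first, stable_second)
print(det_first, det_second)
```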
Well, maybe 3 things ;-)
Mark
[EMAIL PROTECTED]
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799
- Original Message -
From: Mark Davis [EMAIL PROTECTED]
To: Markus Scherer [EMAIL PROTECTED]; unicode
[EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent
No.
One cannot make such a black and white statement (correctly, at least). The
OED does use Cæsar, for example. While most people would consider it
slightly old-fashioned to use that form, it is done.
Mark
[EMAIL PROTECTED]
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
In the interests of internationalization, I suppose I should point
out that the weight of the Unicode 4.0 book, while 9 Lbs in the US,
will be 4.1 kg everywhere else in the world.
In the interests of precision:
- The weight would be 9 lb anywhere on the earth.
- The *mass* would be 4.1 kg,
- Original Message -
From: Asmus Freytag [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]; Kent Karlsson
[EMAIL PROTECTED]; 'Michael (michka) Kaplan' [EMAIL PROTECTED]
Cc: 'Yung-Fong Tang' [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Sunday, March 02, 2003 21:10
Subject: Re: UTF-8 Error Handling
that they can do more complex
processing.
Mark
[EMAIL PROTECTED]
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799
- Original Message -
From: Asmus Freytag [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]; Kent Karlsson
[EMAIL PROTECTED]; 'Michael
to.
Mark
[EMAIL PROTECTED]
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799
- Original Message -
From: Asmus Freytag [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]; Kent Karlsson
[EMAIL PROTECTED]; 'Michael (michka) Kaplan' [EMAIL PROTECTED]
Cc
I agree with Kent that it is somewhat less robust to simply remove
ill-formed sequences, since it removes any indication that the data was
corrupted. It is better either to signal an error or to insert some other
indication like a REPLACEMENT CHARACTER or SUB at that point. (And in my reading, C12a
does
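The three options can be compared side by side with Python's built-in UTF-8 decoder, used here only as an illustration. The function names are assumptions; the `errors` values `"strict"`, `"replace"`, and `"ignore"` are standard Python codec error handlers.

```python
def decode_strict(data):
    """Signal an error on ill-formed input (raises UnicodeDecodeError)."""
    return data.decode("utf-8", errors="strict")


def decode_replace(data):
    """Mark each ill-formed sequence with U+FFFD REPLACEMENT CHARACTER."""
    return data.decode("utf-8", errors="replace")


def decode_drop(data):
    """Silently remove ill-formed sequences, leaving no trace of corruption."""
    return data.decode("utf-8", errors="ignore")


corrupted = b"ab\xffcd"    # 0xFF can never appear in well-formed UTF-8

print(repr(decode_replace(corrupted)))
print(repr(decode_drop(corrupted)))
try:
    decode_strict(corrupted)
except UnicodeDecodeError as err:
    print("error at byte offset", err.start)
```

Only the silent-removal variant hides the fact that the data was damaged, which is the robustness concern raised above.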
And also RFC is FREE of charge but not Unicode standard itself.
The Unicode Standard *is* free of charge; the entire text is posted on
www.unicode.org.
Mark
[EMAIL PROTECTED]
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799
- Original Message -
We should remember that blocks do not necessarily contain a consistent set
of characters. See http://www.unicode.org/reports/tr18/#Character_Blocks. If
we really need space for characters, then we can allocate them in 'related'
blocks. (We also do not guarantee that block boundaries are
I have a chart at http://www.macchiato.com/unicode/composition_chart.html
that makes it pretty easy to find all those odd precomposed characters.
Mark
__
http://www.macchiato.com
► “Eppur si muove” ◄
- Original Message -
From: Curtis Clark [EMAIL
John, we've communicated a number of errors in the pinyin readings on
previous occasions. Since you said you were going to be looking at the
Mandarin readings, I just dumped a complete file of what we are currently
using so that you can look at it. (Since it is rather large for email, I
stored it
]
To: Unicode Mailing List [EMAIL PROTECTED]
Cc: Mark Davis [EMAIL PROTECTED]
Sent: Saturday, December 07, 2002 09:15
Subject: Re: Script of U+0951 .. U+0954
There were some errors in my suggested update to Scripts.txt. A
correction has been posted. Sorry about that.
Mark Davis mark dot davis at jtcsv
Those are fun -- if you like them I have a link to a site full on
www.macchiato.com (I also put up Hu's on First, if you haven't seen it).
Mark
__
http://www.macchiato.com
► “Eppur si muove” ◄
- Original Message -
From: [EMAIL PROTECTED]
To: [EMAIL
with MS people (and not only me, but also Pothana's designer), MS answered
that the Unicode standard seemed to imply that these accents apply to
Devanagari script only.
That is incorrect; all non-spacing marks should inherit the script of their
base character. We need to make this clear in
Ken is correct: the default properties are somewhat different for ideographs
than for PUAs. In addition, PUAs are a special case compared to other
characters; implementations are free, within very broad limits, to change
the default properties associated with a PUA code point to whatever is