These characters are purely coded for compatibility. Unicode does not distinguish
letters by the abbreviations that they happen to be used in. There is no difference in
semantics between the "g" in "go" vs. the "g" in "12g", nor between the "Å" in "Århus"
vs. the "Å" in "15Å", nor -- for that
, these data nevertheless have the
risk always run by corruption.
Mark
___
Mark Davis, IBM Center for Java Technology, Cupertino
(408) 777-5850 [fax: 5891], [EMAIL PROTECTED], [EMAIL PROTECTED]
http://maps.yahoo.com/py/maps.py?Pyt=Tmapaddr=10275+N.+De+Anzacsz=95014
Almost all international functions (upper-, lower-, titlecasing, case folding,
drawing, measuring, collation, transliteration, grapheme-, word-, linebreaks, etc.)
should take *strings* in the API, NOT single code-points. Single code-point APIs
almost always malfunction once you get outside of
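
A small illustration of why case mapping wants a string-in, string-out API. This is only a sketch: `upper_per_code_point` is a hypothetical helper written for this example, standing in for any API that maps one code point to exactly one code point.

```python
# Full case mapping can change the length of a string, so a function shaped
# "one code point in, one code point out" cannot express it.

def upper_per_code_point(text):
    """Hypothetical single-code-point approach: only 1:1 mappings survive."""
    out = []
    for ch in text:
        mapped = ch.upper()
        # A single-code-point API has nowhere to put a two-character result,
        # so the character is left unchanged here.
        out.append(mapped if len(mapped) == 1 else ch)
    return "".join(out)

for sample in ["\u00df", "\ufb01", "stra\u00dfe"]:
    print(repr(sample), "->", repr(sample.upper()),
          "vs per-code-point:", repr(upper_per_code_point(sample)))
```

The string-based `str.upper()` expands U+00DF to "SS", while the per-code-point version has to leave it alone.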
This is very much like how we did the multilingual content in
http://www.unicode.org/unicode/standard/WhatIsUnicode.html, which currently has
English, French, German, Italian, Russian, and Arabic; with more to follow.
Mark
Herman Ranes wrote:
[EMAIL PROTECTED] wrote:
I am mixing
In Asmus's defense, there are fewer recipients that will understand SCSU right now, so
one needs to be a bit more careful about slinging it around. On the other hand, for
anything outside of plain English, it is quite a handy mechanism for interchanging
Unicode text, so it can reduce memory
Such lists of translations for the glossary terms in Unicode would be quite
useful. If these are produced, be sure to request their addition to Useful
Resources on the Unicode site.
Mark
Antoine Leca wrote:
Patrick Andries wrote:
- Original Message -
From: "Marco Piovanelli"
We haven't used the notion of Planes and Groups. These actually derived, as far
as I can remember from early days in L2, from later-discarded mechanisms that
would let you swap in planes into the BMP. Thus it was important to distinguish
these levels. Planes and Groups are themselves not
I ALY FND ANMs HRD2 DL WTH. WD PFR NML WDS.
Michael Everson wrote:
At 07:53 -0800 2000-07-11, John H. Jenkins wrote:
At the same time, it would be nice to have a Unicodally correct way
of referring to planes 1 and 2, since there is an important boundary
between them.
Just use the
We are pleased to announce that the Unicode web site has been
redesigned to improve navigation and usability. Our new look features
a more accessible layout and color scheme, with related links in the
side bar on most pages to help you learn about other information
available on the site. Longer
Narrowing in on it, with one amendment. UTF-8 code units are 8 bits, so we
can't say that.
Mark
Becker, Joseph wrote:
| C1 says "A process shall interpret Unicode code values as 16-bit
| quantities."
DE I think the focus here was supposed to be on the fact that Unicode code
DE values are
Unicode has changed and evolved over the years. At this point, UCS-2 is a funny
beast, because it shares precisely the same encoding space as UTF-16. That is,
in code units there is absolutely no difference between them. The only real
difference is whether you interpret the code units in the
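
To make the point about identical code units concrete, here is a minimal sketch. The function names `as_ucs2` and `as_utf16` are made up for this example; the only assumption is that the difference lies in whether surrogate pairs are combined.

```python
import struct

def as_ucs2(units):
    """Treat every 16-bit code unit as a code point in its own right."""
    return list(units)

def as_utf16(units):
    """Interpret the same code units as UTF-16, pairing surrogates."""
    raw = struct.pack(f">{len(units)}H", *units)
    return [ord(ch) for ch in raw.decode("utf-16-be")]

units = [0x0041, 0xD800, 0xDD23]
print([hex(u) for u in as_ucs2(units)])   # three separate values
print([hex(u) for u in as_utf16(units)])  # two code points
```

The stored data is the same in both cases; only the reading of D800 DD23 differs.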
Because of its usage, ZWNBSP is extremely unlikely at the start of a file,
but that doesn't mean it can't occur. A question mark is also extremely
unlikely, as are many other characters. However, they can occur. Unicode
doesn't forbid any sequence of characters from occurring. Stripping, say,
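
As a concrete case of the stripping question, Python's standard codecs offer both behaviours. A short sketch: plain "utf-8" keeps a leading U+FEFF as a character, while "utf-8-sig" treats it as a signature and drops it.

```python
data = "\ufeffabc".encode("utf-8")

kept = data.decode("utf-8")          # U+FEFF is preserved as a character
stripped = data.decode("utf-8-sig")  # one leading U+FEFF is removed

print([hex(ord(ch)) for ch in kept])
print([hex(ord(ch)) for ch in stripped])
```

Which of the two is right depends on whether the initial U+FEFF was meant as a signature or as text, which the bytes alone cannot tell you.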
You could define a UTF that mapped scalar values below to the same as
UTF-8, and values above to a 6 byte value. It would *not* be UTF-8, but it
can be well defined.
If you look below D29 -- p. 46 at the first full paragraph -- you find that for
round tripping, UTFs are required to map
If you just want one or two characters, I have a chart webpage on my site
(www.macchiato.com). You type in the code number and ENTER, and it presents a
chart of 128 characters, with that character in green. Copy and paste, and here
it is.
女
[Visible if your mailer handles UTF-8]
Mark
[EMAIL
Interestingly for tax forms, the fallback mapping for many Windows encodings
has Lira (₤) converting up to pound (£), cf.
http://oss.software.ibm.com/icu/charset/CharMaps-HTML/windows-1252-2000.html.
There are some other interesting fallbacks there...
Mark
[EMAIL PROTECTED] wrote:
Mark Davis
m/products/jdk/1.1/docs/guide/intl/fontprop.html]
on how to edit them to add new fonts. (This may take some patience:
the description is not exactly straightforward.)
Edward Cherlin wrote:
At 6:41 AM -0800 7/25/2000, Mark Davis wrote:
The issue of how to get Java to display Unicode character
BTW, saw the following press release from Peoplesoft
PeopleSoft Implements Unicode
link to
http://checkers.peoplesoft.com/events.nsf/07dd07bae4e2a86b8825666700767bbf/f59d0dfabda3a051882569190047a690?OpenDocument
We do not currently have a character that would serve the purpose being discussed.
The functions of the ZWNBSP and ZWSP are to forbid/allow linebreak, which is
orthogonal to the issue of whether two characters form a grapheme. Although
graphemes shouldn't linebreak, not every pair of letters
Indic support is in IBM's JDK, I believe in 1.3.
Mark
Vinit Bhatt wrote:
Hi Addison,
Thanks for really descriptive and explanatory email.
It helped me a lot in grasping basics of Unicode and Internationalization.
I also got good link from the site you gave me. That is :-
Before people get either excited or dismayed by these two drafts, one
should note that they are simply drafts: it is by no means assured that
they will ever be approved, or used if approved.
Mark
Keld Jørn Simonsen wrote:
On date and time formatting:
The forthcoming ISO TR 14652 can
The Unicode 3.0.1 beta period is closing on August 25. We encourage
everyone who uses the Unicode Character Database files to download and
examine these files in detail. In particular, some files have recently
been added to this beta as directed by the Unicode Technical Committee
at its 84th
Thanks. That code does need to be fixed, once we get the time.
Oliver Steinau wrote:
I have a question concerning the CVTUTF.C file that is on the CD in the
Unicode 3.0 book. There's a piece of code which I don't think is correct...
Function ConvertUTF8toUTF16 contains the following piece
If some noble soul volunteers to act as a sports reporter, I'm sure we
can work up something. It's probably a bit much to web-cam it, but that
may come in the future.
Mark
Otto Stolz wrote:
On Thu, 31 Aug 2000 17:31:49 (GMT-0800), Sarasvati has written:
As part of the reception on Thursday,
Unicode 3.0.1 has been released. This version is described on
http://www.unicode.org/unicode/standard/versions/Unicode3.0.1.html,
and is linked from the Unicode home page. Here is a short excerpt from
that page:
Unicode 3.0.1 is an update version of Unicode 3.0. It does
not contain
, and there is a note explaining them. It might
not go into enough detail -- we can supply that in the FAQ:
http://www.unicode.org/unicode/faq/casemap_charprop.html
Mark
John Cowan wrote:
On Fri, 1 Sep 2000, Mark Davis wrote:
Unicode 3.0.1 has been released. This version is described on
http
In HTML or XML you always use the code point (e.g. UTF-32), not a series of
code units (UTF-8 or UTF-16). Thus you would use:
&#x10123;
not &#xD800;&#xDD23; from UTF-16
nor &#xF0;&#x90;&#x84;&#xA3; from UTF-8
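
The three forms above can be derived mechanically. A minimal sketch, where `numeric_character_reference` is a helper name invented for this example:

```python
def numeric_character_reference(ch):
    """Build a hexadecimal character reference from the code point itself."""
    return f"&#x{ord(ch):X};"

ch = "\U00010123"
raw16 = ch.encode("utf-16-be")
utf16_units = [int.from_bytes(raw16[i:i + 2], "big") for i in range(0, len(raw16), 2)]
utf8_bytes = list(ch.encode("utf-8"))

print(numeric_character_reference(ch))   # the form to use in HTML or XML
print([f"{u:04X}" for u in utf16_units]) # code units, not for use in references
print([f"{b:02X}" for b in utf8_bytes])  # bytes, not for use in references
```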
Mark
Brendan Murray/DUB/Lotus wrote:
How can one encode a surrogate character as an
Mark Davis wrote:
Hello all,
I have been trying to input unicode from a browser and store it in a database.
The problem is the different encodings used to represent the unicode.
The input text is in the UTF-8 format. I have read on the Microsoft support site
that SQL Server 7.0
Take a look at the Unicode FAQ on the web, at www.unicode.org
"Gary P. Grosso" wrote:
Hi Unicoders,
I am working on software to emit HTML in the encoding
and character set of the user's choice, from SGML/XML
documents which can contain any Plane 1 Unicode character.
The question is what
Good point. In the past, I have used "surrogate characters" to refer to the
characters encoded above , and surrogate code units to refer to the UTF-16
units D800-DFFF. However, I think that leads to confusion. Nobody has come up
with a good term for all characters above . "Plane 1-16
Not all code points are assigned (or even assignable) to characters. U+xx
is used to refer to code points, which range from 0 to 10. Of these code
points, some are assigned to characters (including regular characters, control
characters, format characters, and private use characters
I share the concern about combinatorial explosions. Look at Spanish, Arabic or
English, for example:
http://oss.software.ibm.com/developerworks/opensource/icu/localeexplorer/
I agree that de-*-sp1996 makes more sense. For us, the variant should go before
the country only if the variant is -- in
I'd like to remind everyone to look at the latest version of the Unicode
Standard, especially when looking at fine points. To cite Unicode 3.0.1
(http://www.unicode.org/unicode/standard/versions/Unicode3.0.1.html)
"Section 13.2 Controlling Ligatures, page 318: the text is superseded by the
I am curious why you feel so strongly that the Hebrew points should be ignored
in domain names. Prima facie, it seems that there is little harm in treating
them no differently from other characters. What problem would arise if the
domain was ABC.COM and I could not get it by typing AB*C.COM?
the last sentence. I had thought that the vowel marks were used to
get the exact pronunciation. If that is not true, it may be part of my
misunderstanding of the situation.
Jony
-Original Message-
From: Mark Davis [mailto:[EMAIL PROTECTED]]
Sent: Sunday, September 17, 2000 7:58 PM
", in TUC 3.0, p. 318.
On 2000-09-15 at 14:40 UCT, Mark Davis wrote:
I'd like to remind everyone to look at the latest version of the Unicode
Standard, especially when looking at fine points. To cite Unicode 3.0.1
(http://www.unicode.org/unicode/standard/versions/Unicode3.0
UCA (#10) already handles that. You will get a "fuzzy" compare if you
mask off less important weights, and you will get a much better ordering
than binary compare as well.
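
A rough approximation of the idea, for illustration only. This is not the UCA: it uses no collation weights at all, and simply drops combining marks after canonical decomposition, which is an assumption standing in for "masking off less important weights". The name `fuzzy_key` is invented here.

```python
import unicodedata

def fuzzy_key(text):
    """Crude stand-in for a primary-strength key: strip combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

pointed = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD"  # Hebrew word with points
unpointed = "\u05E9\u05DC\u05D5\u05DD"                  # same letters, no points

print(pointed == unpointed)                        # binary compare
print(fuzzy_key(pointed) == fuzzy_key(unpointed))  # "fuzzy" compare
```

A real implementation would compare UCA sort keys at reduced strength, which also gives a sensible ordering rather than just equality.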
Mark
Hart, Edwin F. wrote:
Is there a need for a "fuzzy" comparison where names with and without
points in Hebrew? Is
er scripts such as
Arabic?
Mark Davis replied
UCA (#10) already handles that. You will get a "fuzzy" compare if you
mask off less important weights, and you will get a much better ordering
than binary compare as well.
But then, why does the W3 Consortium want to *forbid* some Unic
If those can be confirmed, then the SpecialCasing file should be modified to add
them. Could you verify this in time for the next UTC?
Mark
Cathy Wissink wrote:
I believe Azeri also uses the dotless i/dotted i Turkish-style casing.
Cathy
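
The Turkish-style casing mentioned above can be sketched as follows. Python's built-in `str.upper()` is not language-sensitive, so the tailoring is done by hand; `upper_tr_az` is a hypothetical helper for this example, not a function from any library.

```python
def upper_tr_az(text):
    """Uppercase with dotted/dotless i tailoring (Turkish-style)."""
    # Dotted i maps to U+0130; dotless U+0131 maps to plain I.
    tailored = text.replace("i", "\u0130").replace("\u0131", "I")
    return tailored.upper()

sample = "i\u0131"
print([hex(ord(ch)) for ch in sample.upper()])      # untailored result
print([hex(ord(ch)) for ch in upper_tr_az(sample)]) # tailored result
```

The untailored mapping collapses both letters to I, which is exactly the distinction the language-specific entries are meant to preserve.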
-Original Message-
From: Carl W. Brown
There are a number of similarities between this XNS and IDN, so
http://www.ietf.org/internet-drafts/draft-ietf-idn-nameprep-00.txt would be
worth reading.
On locales: using them is dangerous for matching. The only reason to add
locale is if it were to make a difference which letters match. But
It would be more accurate to say that it does not support all of Unicode
3.0. Just using the phrase "doesn't support 3.0" suggests that it is not
compliant. A product can be compliant to a particular version of Unicode
while only supporting a subset of the characters.
Even compliant products
If there are specific areas where the BIDI algorithm has flaws, that should
be communicated to the UTC bidi subcommittee, ideally with a proposal to fix
the problem.
Mark
- Original Message -
From: "Michael (michka) Kaplan" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent:
Please take a look at www.unicode.org
- Original Message -
From: "Karambir Rohilla" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Tuesday, October 03, 2000 21:17
Subject: help me !!!
hello
Please help me anyone
what is UTF8 UTF16 ?
regard
karambir singh
Thanks to the industriousness of volunteer translators and to Magda and
Julie's editorial work, we have many more translations of "What is Unicode"
on www.unicode.org (all in UTF-8, of course).
Check out http://www.unicode.org/unicode/standard/WhatIsUnicode.html. If you
have problems displaying
UTF-8, UTF-16, and UTF-32 all support exactly the same character repertoire.
Please look at www.unicode.org, on the front page is a link to the FAQs.
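
A quick way to see the point about repertoire: any character that can be written in one of the three forms round-trips through the others. A minimal sketch, with an arbitrary choice of sample characters and the hypothetical helper name `round_trips`:

```python
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")

def round_trips(ch):
    """True if the character survives encode/decode in every UTF listed."""
    return all(ch.encode(enc).decode(enc) == ch for enc in ENCODINGS)

for ch in ["a", "\u00c5", "\u5973", "\U00010123"]:
    sizes = {enc: len(ch.encode(enc)) for enc in ENCODINGS}
    print(f"U+{ord(ch):04X}", round_trips(ch), sizes)
```

Only the number of bytes differs between the forms, not the set of characters they can represent.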
Mark
- Original Message -
From: "George Zeigler" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Thursday, October 05, 2000
For the purpose specified, isLatin1 should just test for <= 0xFF. After all,
one would not want to exclude TAB, CR or LF ☺
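
A sketch of such a test, assuming the intended check is simply "code point no greater than 0xFF", which is what admitting TAB, CR and LF implies. The snake_case name `is_latin1` is this example's spelling of the function discussed.

```python
def is_latin1(text):
    """True if every code point fits in the Latin-1 range 0x00..0xFF."""
    return all(ord(ch) <= 0xFF for ch in text)

print(is_latin1("col1\tcol2\r\n"))  # control characters are not excluded
print(is_latin1("\u00ff"))          # top of the range
print(is_latin1("\u0100"))          # first code point outside it
```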
Mark
- Original Message -
From: "John Cowan" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Thursday, October 05, 2000 10:33
Subject: Re: Correct
One of the main features of XML is that it has quite strict rules about how
to handle errors. The goal, I believe, is to ensure that we are not awash in
malformed files that have no clear interpretation.
And this is clearly an error: the acceptable code points are quite clearly
stated:
At least half of the names in the country can be pronounced one way or
another, depending on what the bearer of the name wishes.
Much the same in America; you very often don't know how someone's last name
is pronounced (or spelt):
Stein = shtyn? styn? steen?
- Original Message -
From:
Can someone write up a description of the proposed change, with the
attendant glyphs? There is a UTC meeting next week in San Diego, so now's
the time.
Mark
- Original Message -
From: "Antoine Leca" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Tuesday, October 31, 2000 3:15
Subject: Re: Normative vs Informative
At 00:04 -0700 2000-10-26, Mark Davis wrote:
I am leery of using normative your way unless we find strong evidence of
this.
Well, that's just wrong, Mark. (Sorry, it's beat-up Mark day I guess.)
Ken explained Normative and Infor
ICU has a list of these. If you take a look at
http://oss.software.ibm.com/icu/charset/CharMaps-HTML/windows-1252-2000.html
, for example, you will see some other interesting cases.
Mark
- Original Message -
From: "Michael (michka) Kaplan" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL
The Unicode Standard does define the rendering of such combinations, which
is in the absence of any other information to stack outwards.
Implementations that can't do that will either overstrike, or use some other
fallback rendering.
A sophisticated rendering will use positioning such as control
programmatically, the program is wrong.
Mark
- Original Message -
From: J. William Semich
To: Rick H Wesson ; Mark Davis
Cc: Unicore ; Unicode ; [EMAIL PROTECTED] ; w3c-i18n-ig
Sent: Wednesday, November 15, 2000 09:32
Subject: Re: [idn] Javascript code charts
That agrees with the results I get on http://www.macchiato.com/unicode/convert.html.
Mark
- Original Message -
From: J. William Semich
To: Mark Davis ; Rick H Wesson
Cc: Unicore ; Unicode ; w3c-i18n-ig
Sent: Wednesday, November 15, 2000 22:46
Subject: Re: [idn
We have found that it works pretty well to have a uchar32 datatype, with
uchar16 storage in strings. In ICU (C version) we use macros for efficient
access; in ICU (C++) version we use method calls, and for ICU (Java version)
we have a set of utility static methods (since we can't add to the Java
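
The general shape of "32-bit code point values over 16-bit storage" can be sketched as below. These are not ICU's macros or methods; `code_point_at` and `iter_code_points` are names invented for this illustration, and unpaired surrogates are simply passed through as-is, which is one possible policy among several.

```python
def code_point_at(units, index):
    """Return (code point, number of code units consumed) at index."""
    lead = units[index]
    if 0xD800 <= lead <= 0xDBFF and index + 1 < len(units):
        trail = units[index + 1]
        if 0xDC00 <= trail <= 0xDFFF:
            # Combine a lead/trail surrogate pair into one code point.
            return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00), 2
    return lead, 1

def iter_code_points(units):
    """Walk 16-bit storage, yielding full code point values."""
    i = 0
    while i < len(units):
        cp, width = code_point_at(units, i)
        yield cp
        i += width

storage = [0x0041, 0xD800, 0xDD23, 0x5973]
print([f"U+{cp:04X}" for cp in iter_code_points(storage)])
```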
I haven't had time to read this list recently, so here is a somewhat belated
response.
But, even if you do so, we are left with a "wrong" canonical decomposition:
1FBC;GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI;Lt;0;L;0391 0345;;;;N;;;;1FB3;
According to James' statement (which is not
The UTC will be using the terms "supplementary code points", "supplementary
characters" and "supplementary planes". The term it is "deprecating with
extreme prejudice" is "surrogate characters".
See http://www.unicode.org/glossary/ for more information.
Mark
- Original Message -
From:
These are good points.
TR 21 deliberately does not specify the language conventions for using
titlecase, which as you note will change the effect of its use (see
http://www.unicode.org/unicode/reports/tr21/#TitlecaseCaveats). Most
products will have some smarts, but also leave it up to the user
We would like to call two items to people's attention.
1. The Unicode Technical Committee has modified the definition of UTF-8 to
forbid conformant implementations from interpreting non-shortest forms for
BMP characters, and clarified some of the conformance clauses. For more
information, see
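
What rejecting non-shortest forms looks like in practice can be checked with a decoder that enforces it. A small sketch using Python's built-in UTF-8 codec, with `decodes_as_utf8` as a helper name chosen for this example:

```python
def decodes_as_utf8(data):
    """True if the byte sequence is accepted by a strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

shortest = b"\x2f"            # U+002F in its one-byte (shortest) form
overlong_2 = b"\xc0\xaf"      # same value padded out to two bytes
overlong_3 = b"\xe0\x80\xaf"  # same value padded out to three bytes

for candidate in (shortest, overlong_2, overlong_3):
    print(candidate.hex(), decodes_as_utf8(candidate))
```

Accepting the padded forms would let the same character slip past checks that only look for its shortest encoding.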
-
From: "G. Adam Stanislav" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Wednesday, November 29, 2000 22:42
Subject: Re: UTF-8 Corrigendum, new Glossary
At 21:08 29-11-2000 -0800, Mark Davis wrote:
1. The Unicode Technical Committee has modified the d
The soft hyphen is not sufficient, since in other languages the case where
two letters must be distinguished in collation may not fall on a syllable
boundary, or allow hyphenation between them.
The UTC looked at all the possible existing boundary-control characters;
none of them really work for
Have you tried looking at the Unicode home page, at "Display Problems", or
the FAQ "Unicode on the Web"?
- Original Message -
From: "sreekant" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Thursday, November 30, 2000 22:27
Subject: display problems on browser
hi,
I am
OTECTED]
Cc: "Unicode List" [EMAIL PROTECTED]
Sent: Friday, December 01, 2000 2:30 PM
Subject: Re: Transcriptions of "Unicode"
Sad to report, my browser (Netscape 4.7) shows the Yiddish as
Daw-key-nu-ye (It's left to right not rtl...)
I am using the Monotype Andal
"Unicode List" [EMAIL PROTECTED]
Sent: Friday, December 01, 2000 22:46
Subject: Re: Transcriptions of "Unicode"
Cool. Now if you also add LANG attributes, Mozilla/Netscape 6 will use
the fonts that have been set up for those languages. E.g.:
span lang="ja" title=&quo
ill use
the fonts that have been set up for those languages. E.g.:
<span lang="ja" title="Japanese">...</span>
Erik
Mark Davis wrote:
Done.
From: "Michael (michka) Kaplan" [EMAIL PROTECTED]
I would suggest adding a span title="{insert lang name
===
Globalization Engineering Consulting Services
On Sat, 2 Dec 2000, Mark Davis wrote:
Won't Mozilla pick fonts based on character code? The only ones in the list
that couldn't be deduced from that would be the Yiddish and the Chinese.
Mark
- Original Message --
As per the instructions of the Unicode Technical Committee, TR#22: Character
Mapping Markup Language (CharMapML) has been advanced from draft TR to full
TR. See http://www.unicode.org/unicode/reports/tr22/ for more information.
Note: The UTC intends to continue development of this TR to also
isplaying Unicode text (was Re: Transcriptions of "Unicode")
Mark Davis wrote:
Let's take an example.
- The page is UTF-8.
- It contains a mixture of German, dingbats and Hindi text.
- My locale is de_DE.
From your description, it sounds like Mozilla works as follows:
- The local
ember 12, 2000 09:01
Subject: Re: Transcriptions of Unicode
At 07:11 -0800 2000-12-12, Mark Davis wrote:
ARMENIAN
BULGARIAN
CHEROKEE
ETHIOPIC
GREEK
GUJARATI
GURMUKHI
INUKTITUT
OGHAM
RUNIC
RUSSIAN
SINHALA
UCAS
See http://www.egt.ie/standards/iso10646/pdf/junikod.pdf
Michael Everson ** E
That matches what I have on
http://www.macchiato.com/unicode/Unicode_transcriptions.html, right?
(circle?)
Mark
- Original Message -
From: "Michael (michka) Kaplan" [EMAIL PROTECTED]
To: "Mark Davis" [EMAIL PROTECTED]; "Unicode List" [EMAIL PROTECTED]
Sen
rom Maurice Bauhahn, but have some outstanding questions that need
to be resolved before attempting to roll the results into the table.
The resolution of Khmer sorting should also shed some light on
what to do with Myanmar, which shares a number of structural
similarities with Khmer.
Some s
In specific cases you may use one character conversion mapping instead of
two, but you should be very careful about that. See
http://www.unicode.org/unicode/reports/tr22/, especially "1.2.1 Best-Fit
Mappings"
Mark
- Original Message -
From: "Lars Marius Garshol" [EMAIL PROTECTED]
To:
ICU offers a reverse BIDI algorithm. (http://oss.software.ibm.com/icu/)
Mark
- Original Message -
From: "Roozbeh Pournader" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Cc: "Behdad Esfahbod" [EMAIL PROTECTED]
Sent: Monday, January 08, 2001 20:12
Subject: Reverse Bidi Algorithm
nal Message -
From: "Marco Cimarosti" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Friday, January 12, 2001 03:11
Subject: Re: Transcriptions of "Unicode"
Hallo everybody! I don't fully agree with Mark Davis' IPA transcription of "Unic
Thanks for your detailed note; I'll have to think it over.
...
But there's another inconsistency in the transcription: the vowels in the
first ("u-") and third ("-code") syllable are both phonemically long.
Either you put the length mark on both (recommended for *phonetic*
transcription), or
Unicode is always serialized in a UTF: UTF-8, UTF-16*, or UTF-32*. The
definition of each of these is invariant across systems: in UTF-8 an 'a' is
always stored as 0x61. There is a special UTF for use on EBCDIC systems.
Check out the technical reports and FAQs on www.unicode.org.
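
The invariance is easy to check directly. A minimal sketch that serializes 'a' in each form, using the explicit big-endian and little-endian codec names so that no byte order mark is added:

```python
FORMS = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")

# Bytes produced for the single character 'a' in each serialization.
serialized = {form: "a".encode(form) for form in FORMS}

for form, data in serialized.items():
    print(f"{form:10} {data.hex()}")
```

These byte sequences are fixed by the definition of each form and do not vary with the platform the code runs on.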
Mark
-
Yes, I have already proposed an agenda item for the next UTC, to get this
fix into 3.1.
Mark
___
Mark Davis, IBM GCoC, Cupertino
(408) 777-5850 [fax: 5891], [EMAIL PROTECTED], [EMAIL PROTECTED]
http://maps.yahoo.com/py/maps.py?Pyt=Tmapaddr=10275+N.+De+Anzacsz=95014
Roozbeh Pournader [EMAIL
BTW, we have settled on a term for characters with code points above .
See
http://www.unicode.org/glossary/#supplementary_character
http://www.unicode.org/glossary/#supplementary_code_point
Mark
- Original Message -
From: "David Starner" [EMAIL PROTECTED]
To: "Unicode List"
This appears to have bounced the first time I sent it.
- Original Message -
From: "Mark Davis" [EMAIL PROTECTED]
To: "Unicore" [EMAIL PROTECTED]; "Unicode" [EMAIL PROTECTED]
Sent: Monday, January 22, 2001 08:04
Subject: Time Intervals
After a reque
It doesn't add any value to insert joiners. Just add the IDS itself to the
font table.
Mark
- Original Message -
From: "John Cowan" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Wednesday, January 24, 2001 11:21
Subject: Re: Unicode 3.1: IDS and ZW(N)J
John Jenkins
Title: Unicode Benefits
Allows for multilingual documents
using any or all the languages you desire. Invoice or ticketing applications can
print native language names.
*"multilingual documents" are rare --
as most people understand the term 'documents'. What more people care about is
that
This is not an omission. This issue was debated at great length in the
Unicode technical committee, and the precise wording was agreed to by the
committee.
Mark
- Original Message -
From: "John Cowan" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Wednesday, January 31,
John,
It's interesting how we find ways to get around rules that bother us
This is a misrepresentation. The symbol was always intended to be the
Weierstrass elliptic function. It was misnamed, and is thus annotated with
the correct information. Nobody is winking.
... If I had read the
te format conversion
routine and noticed that ICU has no week based year support. Fortunately I
don't think my client needs it.
Carl
-Original Message-----
From: Mark Davis [mailto:[EMAIL PROTECTED]]
Sent: Thursday, January 25, 2001 9:18 PM
To: Carl W. Brown; Unicode List
Subject: Re: Time I
Did you not receive the GIF in the original message?
Mark
- Original Message -
From: "David Starner" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Thursday, February 01, 2001 11:01
Subject: Re: Property error for U+2118?
On Thu, Feb 01, 2001 at 10
The topic came up in a UTC meeting some time ago, a "UTF-8S". The motivation
was for performance (having a form that reproduces the binary order of
UTF-16). We have yet to see a formal proposal for this, though.
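
For background on the ordering issue, the sketch below shows where standard UTF-8 and UTF-16 disagree in binary order. It does not implement the proposed form, which had no formal definition at this point; `utf16_units` is a helper written for this example.

```python
def utf16_units(text):
    """Return the UTF-16 code units of a string as a list of integers."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

bmp = "\uff5e"       # a BMP character above the surrogate range
supp = "\U00010000"  # the first supplementary code point

by_utf8 = sorted([supp, bmp], key=lambda s: s.encode("utf-8"))
by_utf16 = sorted([supp, bmp], key=utf16_units)

print([f"U+{ord(s):04X}" for s in by_utf8])   # UTF-8 byte order
print([f"U+{ord(s):04X}" for s in by_utf16])  # UTF-16 code unit order
```

UTF-8 bytes sort in code point order, whereas in UTF-16 the surrogate pair D800 DC00 sorts below FF5E, so the two binary orders diverge for exactly this kind of pair.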
Mark
- Original Message -
From: "J M Sykes" [EMAIL PROTECTED]
To: "Unicode
It is the set of code points that can be addressed using surrogate code
points. For more information, see the glossary at www.unicode.org.
Mark
- Original Message -
From: "nikita k" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Tuesday, February 06, 2001 01:51
Subject:
as a Unicode Standard Annex. However, it
has not undergone final editorial review: it is not a stable document and
may not be used as reference material nor cited as a normative reference
from another document.
Mark
The whole principle of tagging individual
strings with NF* is a bit odd to me; not sure I like it. The K forms in
particular are really a folding operation, much like casing. I would not
expect to find a model where someone tagged every string in a database with its
Case, and then had some
I have not been following this discussion up until now. Typically the issue
with syllables is like that with word-sorting. With word sorting, no matter
what is in the second word, any difference in the first word swamps it.
Example:
ab xyz ghi
abc def ghi
In many cases, UCA does handle syllabic
Word break is *very* different than linebreak; see Chapter 5 of TUS, and the
Linebreak TR. For linebreak the only tricky language is Thai, since it
requires a dictionary lookup (much like hyphenation in English). Java (and
ICU) supply linebreak mechanisms as a part of the standard API. They also
Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I
recommended in my last message. The Unicode standard is online, as is the
TR. Both can be found by going to www.unicode.org, and selecting the right
topic. The TR in particular discusses the recommended approach to line break
I agree with Tex that the algorithm is small, if implemented in the
straightforward way. I also agree with his #1, #2, and #3. I will add two
things:
1. Where performance is important, and where people start adding options
(e.g. uppercase lowercase vs. the reverse), the implementation of
On Sun, 11 Feb 2001, Mark Davis wrote:
MD Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I
MD recommended in my last message. The Unicode standard is online, as is the
MD TR. Both can be found by going to www.unicode.org, and selecting the right
MD topic. The TR in
ot; [EMAIL PROTECTED]
Sent: Monday, February 12, 2001 20:30
Subject: Re: Korean linebreking and UTR14(was Re: extracting words)
On Mon, 12 Feb 2001, Mark Davis wrote:
Thank you for your answer.
Asmus Freytag is the one to talk to; he can look into this.
Do you think I should contact him
I am still missing
Bopomofo, Khmer, Mongolian, Myanmar, Sinhala, Syriac, Thaana
on http://www.macchiato.com/unicode/Unicode_transcriptions.html
If anyone could supply one of these, I would appreciate it.
Also, Ken suggested that the Bopomofo should be a Bopomofo transcription of
the Chinese
For those interested in collation, we have a new version of the ICU
collation design document on
http://oss.software.ibm.com/icu/develop/collation/. Feedback is welcome.
Mark
many comments
- Original Message -
From: "Tom Lord" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Wednesday, February 21, 2001 21:15
Subject: An Aburdly Brief Introduction to Unicode (was Re: Perception ...)
We've seen several posts about the perception that Unicode is
Message -
From: "John Cowan" [EMAIL PROTECTED]
To: "Mark Davis" [EMAIL PROTECTED]
Cc: "Unicode List" [EMAIL PROTECTED]
Sent: Friday, February 23, 2001 08:21
Subject: Re: An Aburdly Brief Introduction to Unicode (was Re: Perception
...)
Mark Davis wrote:
A _code_po
Ken has done a nice job of fleshing out the issues. I would add a bit to
that.
The glossary entry for "abstract character", as he points out, was inherited
from 10646.
"Abstract Character. A unit of information used for the organization,
control, or representation of textual data. (See
You can use the same collation sequence for two languages, even if they use
different sets of letters, as long as they don't *conflict*. For example,
you can't have Swedish and German with the same sequence, since they differ
in how they deal with a-umlaut. If there are any words x and y, both in