OK, just for fun
Quiz for Unicode Guru
Here is a quiz for Unicoders. It is not a hard quiz; everyone will
get it right eventually. So use a stopwatch to measure how long it
takes you to figure out the right answer.
Note: You can find information about Unicode and UTF-8 from
Looking at
http://www.unicode.org/review/
33 UTF Conversion Code Update 2004.06.08
The C language source code example for UTF conversions (ConvertUTF.c) has been
updated to version 1.2 and is being released for public review and
comment. This update
Surely no one on this
mailing list wants to see their XML treated as US-ASCII when the
data is really in UTF-8.
If I have an XML file like the following
<?xml version="1.0"?>
and send it over HTTP with the following Content-Type header:
Content-Type: text/xml;
(without
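To make the concern concrete, here is a minimal sketch of how a receiver that follows RFC 3023 might pick the charset for a text/xml response. The function name effective_charset and the "unknown" fallback are my own placeholders, not anything from a real HTTP stack; the only rule modeled is that text/xml without a charset parameter falls back to us-ascii.

```python
from email.message import Message


def effective_charset(content_type: str) -> str:
    """Return the charset a receiver following RFC 3023 would assume."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # None when no charset parameter is given
    if charset is not None:
        return charset
    # RFC 3023: text/xml with no charset parameter is treated as us-ascii,
    # whatever the XML declaration inside the document says.
    if msg.get_content_type() == "text/xml":
        return "us-ascii"
    return "unknown"  # hypothetical placeholder; other media types are not modeled


print(effective_charset("text/xml;"))
print(effective_charset("text/xml; charset=UTF-8"))
```

So the safe habit for the sender is to state the charset in the header explicitly instead of relying on the XML declaration.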
Is there any standards effort to standardize time zone IDs? I am not
talking about the time zone of a particular time (that can be
expressed as a GMT offset or handled by ISO 8601), but rather
about an ID that refers to a particular time zone / daylight saving
time rule.
Does anyone know who can fix
http://www.unicode.org/reports/index.html ?
All the links are broken.
Raymond Mercier wrote on 4/22/2004, 7:35 AM:
I enquired about the 'super font' created by a Beijing foundry,
http://font.founder.com.cn/english/web/index.htm, and am fairly
astonished at the prices, as you see from the attached.
The cost of producing these fonts is much higher than
I saw the announcement of the publication of
"ISO/IEC 10646: 2003, Information technology --
Universal Multiple-Octet Coded Character Set (UCS)"
From http://anubis.dkuug.dk/jtc1/sc2/open/02n3729.htm
I expect there is no difference from Unicode 4.0, am I right?
In case you want to test
your GB18030 font, you can use Netscape 7 (or the latest Mozilla) and then
visit my GB18030 test pages at
http://people.netscape.com/ftang/testscript/gb18030/gb18030.cgi?page=10
They should be page-for-page compatible with the paper copy of the GB18030-2000
standard. I also
Kenneth Whistler wrote on 4/22/2004, 3:26 PM:
Frank asked:
I expect there is no difference from Unicode 4.0, am I right?
Correct. Please see Appendix C of Unicode 4.0, p. 1348 and p. 1350,
which already explicitly makes this statement.
--Ken
I don't see ISO10646-2003 in the
Are you talking about
http://www.unicode.org/charts/unihangridindex.html
and
http://www.unicode.org/charts/unihanrsindex.html
?
Gary P. Grosso wrote on 4/14/2004, 1:18 PM:
Hi,
I am looking for an up-to-date, online version of the sort of thing
I see in the back of the printed Unicode
Be careful here: for Unicode support in the browser (at least
Netscape/Mozilla), the code forks between Windows 2000/XP and Win98/ME.
Philippe Verdy wrote on 3/23/2004, 5:39 AM:
From: Edward H. Trager [EMAIL PROTECTED]
Also, I would not bother testing Windows OSes prior to Windows
Chris Jacobs wrote on 3/15/2004, 10:08 PM:
- Original Message -
From: Kenneth Whistler [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Tuesday, March 16, 2004 2:28 AM
Subject: Re: in the NEW YORK TIMES today, report of a USA patent for a method to
Maybe I should file a US patent application for writing Arabic from left
to right to make it more simplified :) I guess that would have a higher
adoption rate than this font design patent, since most software
that does not support Bidi already implements it. :)
Mark E. Shoulson wrote on
Wow.
It does not seem to be a very new idea. A similar idea was used for Chinese 40
years ago and created the differences between Simplified Chinese and
Traditional Chinese.
Michael Everson wrote on 3/15/2004, 12:40 PM:
In the NEW YORK TIMES today
comes a report of a USA patent for a new version of
many different reasons you will see ? there.
Read my paper http://people.netscape.com/ftang/paper/unicode25/a302.htm
to see a list.
Manga wrote on 3/15/2004, 10:07 AM:
I use UTF-8 encoding in Java code to store multi-byte characters in the
db. When I retrieve the multi-byte characters
Mike Ayers wrote on 3/15/2004, 2:50 PM:
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Frank Yung-Fong Tang
Sent: Monday, March 15, 2004 11:16 AM
It does not seem to be a very new idea. A similar idea was used for
Chinese 40 years ago
Not sure how to find the information paper. But one way to check the
degree of support is to call GetStringTypeEx against some
characters defined in Unicode 2.0, 2.1, 3.0, 3.1, 3.2, and 4.0 and see whether
the returned results reflect what they should be.
Antoine Leca wrote on 3/5/2004, 8:35 AM:
Hi
You can also use 'nsconv', which comes with the Mozilla source code, with GB18030.
See http://www.mozilla.org/projects/l10n/mlp_tools.html for details.
Zhang Weiwu wrote on 3/5/2004, 6:43 AM:
Hello. I believe this must be a frequent question, but I googled around
and I didn't find a satisfying
BDF is also widely used,
although its quality and features are not that powerful these days.
Also, there are other "standards" for fonts:
1. Glyph set "standards": how to make sure one font contains all the
glyphs for a particular group of users. For example, WGL4 is a glyph set
standard from
Oh. This is the first time I have heard about this. Thanks for the
information. Does it also mean wchar_t is 4 bytes if __STDC_ISO_10646__
is defined? Or does it only mean wchar_t holds characters in ISO 10646
(which means it could be 2 bytes, 4 bytes, or more than that)?
Noah Levitt wrote on
not prevent someone from making it 16
bits or 64 bits when that macro is defined, right?
And what do the year and month mean?
On Mar 03, 2004, at 12:38, Frank Yung-Fong Tang wrote:
Oh. This is the first time I have heard about this. Thanks for the
information. Does it also mean wchar_t is 4
Clark Cox wrote on 3/3/2004, 4:33 PM:
[I swap the reply order to make my new question clearer]
And what do the year and month mean?
It indicates which version of ISO10646 is used by the implementation.
In the above example, it indicates whatever version was in effect in
December
I
Rick Cameron wrote on 3/1/2004, 2:13 PM:
Hi, all
This may be an FAQ, but I couldn't find the answer on unicode.org.
The reason is that there is "NO answer" to the question you ask.
It seems that most flavours of unix define wchar_t to be 4 bytes.
Depends on which UNIX
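Since the answer depends on the platform, the quickest thing is to ask the platform. A minimal sketch, assuming Python with the standard ctypes module is at hand (that is my assumption, not something from this thread); it reports the size of the C wchar_t that the local toolchain uses and says nothing about which ISO 10646 version is supported.

```python
import ctypes
import sys

# ctypes.c_wchar maps to the platform's C wchar_t type.
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)
print(f"{sys.platform}: wchar_t is {wchar_bytes} bytes")
```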
John Cowan wrote:
steve scripsit:
Could someone please clarify the difference between UTF8 and UTF16?
If it is possible to encode everything in UTF8 and it is more
efficient, what is the need for UTF16?
It is more efficient to PROCESS in UTF16.
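For the storage side of the question, here is a small sketch that just counts bytes per character in each form. The three sample characters are my own picks (an ASCII letter, a CJK ideograph, and the first supplementary code point); it says nothing about processing speed, which is the point made above.

```python
def encoded_sizes(ch: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # utf-16-le is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch in ("A", "\u4e2d", "\U00010000"):
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

ASCII text is smaller in UTF-8, most CJK text is smaller in UTF-16, and supplementary characters take 4 bytes in both.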
joe wrote:
(Hmm, in Russian mother language (maternij jazik) means something
*very* different.
Watch your language! ;-)
He wrote this in English, not Russian, right?
How can I watch Chinese (my language)?
Joe
As a native Chinese person, I believe:
1. The so-called eight basic strokes are very standard in concept.
But that is only 8.
2. They list 8 different variants for each of the 8 basic strokes. But
if you read that page carefully, it does not mean that there are only 8
variants for each stroke,
Yes, TEC. Look at developer.apple.com for the Text Encoding Converter.
Paramdeep Ahuja wrote:
Hi
Can anyone tell if there is any API available on the Mac to convert from
UTF-8 to UTF-16
thnx
-P
Consider CR and LF too.
Mark Davis wrote on 1/14/2004, 9:25 AM:
I'm not sure which suggested heuristic method you are referring to, but
you are bounding to conclusions. For example, one of the heuristics is
to judge what are more common characters when bytes are interpreted as if
Does Thai use CR and LF?
Peter Kirk wrote on 1/14/2004, 8:12 AM:
On 14/01/2004 07:16, John Burger wrote:
...
By the way, I still don't quite understand what's special about Thai.
Could someone elaborate?
I mentioned Thai because it is the only language I know of which does
John Burger wrote on 1/14/2004, 7:16 AM:
Mark E. Shoulson wrote:
If it's a heuristic we're after, then why split hairs and try to make
all the rules ourselves? Get a big ol' mess of training data in as
many languages as you can and hand it over to a class full of CS
graduate
Looks like an old idea that people in Taiwan gave up a long time
ago because the quality of the glyphs will never be
good enough.
Tom Emerson wrote on 1/2/2004, 6:06 PM:
The following paper, Chinese Character Synthesis using METAPOST, was
recently mentioned in a thread on the teTeX
Come on, take my joke. But that is a perfect example of a
language-specific variant glyph, right?
Michael Everson wrote:
At 17:13 -0800 2003-12-02, Frank Yung-Fong Tang wrote:
Come on, use language-specific glyph substitution on the last resort
font to show an Irish last resort glyph
Peter Kirk wrote:
On 02/12/2003 16:25, Frank Yung-Fong Tang wrote:
...
a barrier to proper internationalisation?
My opinion is the reverse: I think it is a strategy for proper
internationalization. Remember, people can always choose to stay with
ISO-8859-1 only or go to UTF-8
, it will be 1% of the effort for me
to fix it later, right? :)
Michael Everson wrote:
At 15:38 -0800 2003-12-03, Frank Yung-Fong Tang wrote:
I am encouraging QA to test MES-1 with UTF-8 instead of only ISO-8859-1.
I am encouraging products to ship with MES-1 support out of the box instead
than 10 scripts?
I think the value is that it shows people it is not an ASCII ?
question mark itself.
--
Frank Yung-Fong Tang
System Architect, International Development, AOL Interactive Services
AIM:yungfongta mailto:[EMAIL PROTECTED] Tel:650-937-2913
Yahoo! Msg: frankyungfongtan
Subject: Re: MS Windows and Unicode 4.0 ?
I'm interested in knowing whether the following features would soon be found
in Windows: fonts for scripts covered by Unicode 4.0, corresponding
rendering engine to display all Unicode 4.0 scripts
-8
gzip of SCSU
gzip of BOCU-1
gzip of Legacy encoding
John Jenkins wrote:
On Dec 1, 2003, at 4:24 PM, Frank Yung-Fong Tang wrote:
John, what 'cmap' format does Apple use in the Mac OS X
Devanagari and Bangla fonts?
The formats are irrelevant; the Mac supports all the 'cmap' subtable
formats for all subtables. For rendering complex
Michael Everson wrote:
At 14:23 -0800 2003-12-02, Frank Yung-Fong Tang wrote:
It's better than not knowing what range the thing is in. It helps the
user know he has received, say, Telugu data or whatever.
Only if the user knows what Telugu may look like. How many users other
Doug Ewell wrote:
Frank Yung-Fong Tang ytang0648 at aol dot com wrote:
Then, Frank, the Tcl implementation is *not valid UTF-8* and needs to be
fixed. Plain and simple. If a system like Tcl only supports the BMP,
that is its choice, but it *must not* accept non-shortest UTF-8 forms
Peter Kirk wrote:
On 02/12/2003 14:19, Frank Yung-Fong Tang wrote:
A better approach than asking "Does product X support Unicode 4.0?",
to which in some way you can always get a NO answer, is to
1. Define a smaller set of functionality (such as MES-1, MES-2, MES-3A)
2. Ask 'Does
://homepage..mac.com/jhjenkins/
Philippe Verdy wrote:
Frank Yung-Fong Tang writes:
But how about the UTF-16 vs UCS4 battle?
Forget it: nearly nobody uses UCS-4 except very internally for string
processing at the character level. For whole strings, nearly everybody
uses UTF-16 as it performs better with less
NT\CurrentVersion\LanguagePack]
SURROGATE=dword:00000002
[HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\International\Scripts\42]
IEFixedFontName=Code2001
IEPropFontName=Code2001
/code
Andrew
rendering, it cannot support them.
John H. Jenkins
John, what 'cmap' format does Apple use in the Mac OS X
Devanagari and Bangla fonts?
should also compare the same for
things like keyword searches and file systems even though it is
technically incorrect.
Carl
the questioning party is thinking must be given as a
part of said question.
Oh... really, what kind of Unicode support is there in Windows 2.0? (since you
said *any*)... No... I don't really care. Don't try to answer me.
with this
weird specification, ISCII. (If you don't think it is weird, look
at the E-1 Display Attributes section in Annex-E of ISCII, which is worse
than the E-2 Font Attributes I mentioned here.)
Frank Yung-Fong Tang wrote,
If you visit
http://people.netscape.com/ftang/testscript/gb18030/gb18030.cgi?page=596
and your machine has surrogate support installed correctly and a surrogate
font installed correctly, then you should see surrogate characters show up
and match the gif
Yung-Fong Tang
John 3:16 For God so loved the world that he gave his one and only Son,
that whoever believes in him shall not perish but have eternal life.
Does your
Michael (michka) Kaplan wrote:
From: Frank Yung-Fong Tang [EMAIL PROTECTED]
So, in summary, what is your conclusion about the quality of GB18030
support on IE6/Win2K? If you run the same test on Mozilla / Netscape
7.0, what is your conclusion about that quality of support
.
If you still think adding 4-byte UTF-8 support is 1% of the task,
then please join the Tcl project and help me fix that. I would appreciate your
efforts there, and I believe a lot of people will thank you for your
contribution.
Doug Ewell wrote:
Frank Yung-Fong Tang YTang0648 at aol dot com wrote
.
bandied about a lot.
It is shorthand for "Irn " because it is too hard for most people to type the "r" part. :) [and if your software can
save that string and retrieve it correctly later, 50% of the i18n problem is
addressed]
about fonts.
Could someone recommend a good tutorial or 'font creator' application
that addresses surrogate pairs?
Thanks,
Erik Ostermueller
Are you using Netscape 7 / Mozilla or IE?
If you use IE, then IE may have a bug there.
I think Mozilla should not have the problem, since I developed and tested it
myself.
[EMAIL PROTECTED] wrote:
.
Frank Yung-Fong Tang wrote,
If you visit
http://people.netscape.com/ftang
Philippe Verdy wrote:
From: Frank Yung-Fong Tang [EMAIL PROTECTED]
It is not that easy for you to go from "don't know beans about fonts" to
creating a test font that contains ... \u20050. If you are lucky, it
will take you several months, if not years. There are commercial font
tools
# ftxinstalledfonts
# ftxruler
# ftxvalidator
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage..mac.com/jhjenkins/
Hmm, a very stupid (but workable) way:
1. Use vi.
2. Type &#x + the Unicode hex value + ; for each character.
3. Save it as .html.
4. Open the file in a browser.
5. Copy the text.
6. Paste it into your software.
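If typing the references by hand in vi gets tedious, steps 1 to 3 can be scripted. A minimal sketch, assuming the references are hexadecimal numeric character references of the form &#x...; (the function names to_ncr and ncr_html are hypothetical, and the two sample characters are my own):

```python
import html


def to_ncr(text: str) -> str:
    """Spell out every character as a hexadecimal numeric character reference."""
    return "".join(f"&#x{ord(ch):X};" for ch in text)


def ncr_html(text: str) -> str:
    """Wrap the references in a tiny page that can be saved as .html."""
    return f"<html><body>{to_ncr(text)}</body></html>\n"


sample = "\u4e2d\U00020050"
print(to_ncr(sample))
print(ncr_html(sample), end="")

# What the browser will turn the references back into (steps 4 and 5).
assert html.unescape(to_ncr(sample)) == sample
```

Because the file contains only ASCII, it does not matter which encoding the editor or the browser assumes.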
We added GB18030 support to Mozilla, and also added 32-bit cmap support on
Windows to Mozilla, about a year ago. The Linux and Mac 32-bit cmap
support is a little bit behind
I think we first had GB18030 encoding support in Netscape 6.2.
You should be able to see whatever the
I think that depends on whether the application supports the newly defined UTF8_STRING
for selection or not.
The Linux version of Mozilla implements it, so it can copy/paste with recent
versions of xterm without problems.
Notice that UTF8_STRING was defined AFTER X11 R6 ICCCM.
See the spec in
Jain, Pankaj (MED, TCS) wrote:
Hi,
I am generating pound sign in html preview using XML XSLT transformation
and its working fine in windows using &#163; in XML but same thing is
not working in unix server.
What do you mean by "in unix server"? Displaying the text in a Unix xterm?
Or you are
URL, please?
Rick McGowan wrote:
The Unicode Public Review Issues page has been updated today.
Highlights:
Closed issue #1 (Language tag deprecation) without any change.
Updated some deadlines on other issues to June 1, 2003.
Added a document for issue #7 (tailored normalizations).
I think that is a hard problem.
First of all, take a look at
http://www.unicode.org/Public/4.0-Update/UCD-4.0.0d5b.html
and find the vertical one.
Second, anything which needs to be symmetrically swapped in Bidi probably needs to
be changed in the vertical form. (If they need to be changed in horizontal
Otto Stolz wrote:
The two scans under
http://www.rz.uni-konstanz.de/Antivirus/tests/li.png
http://www.rz.uni-konstanz.de/Antivirus/tests/re.png
are from the authoritative (until July 1996) book on German
orthography: Duden Rechtschreibung der deutschen Sprache
und der Fremdwörter / hrsg.
Which pinyin system is the rua in?
I use Simplified Chinese Win XP, and if I switch to the Full Spell (??) Simplified
Chinese IME and type rua', then I get (read this email in UTF-8)
which is U+633C.
I am not sure that is correct. At least, as a native Mandarin speaker,
that sound is not natural for me at
Dominikus Scherkl wrote:
Anyone know why the sort order is different under those two systems?
As I mentioned: a new feature, keeping numbers ordered numerically.
I wouldn't mind if they ALSO gave me a flag to control that behavior.
Numbers could be used for many different
Anyone know if there is a way to make them sort in the same
order?
Why should anybody want that?
Because users expect a cross-platform (or I should say cross-Windows-version)
product to display the same sorting order on Win98 and on WinXP.
For example, the Netscape7
Michael (michka) Kaplan wrote:
From: "Yung-Fong Tang" [EMAIL PROTECTED]
One of my colleagues asked me this question.
Not much to do with Unicode, though. Is it?
It would be a Unicode issue if the cause is that the new software tries to implement
http://unicode.o
We cannot use that. The function you mention compares two Unicode strings.
We need a function to "generate a sort key" from Unicode strings instead
of comparing two strings.
Michael (michka) Kaplan wrote:
From: "Yung-Fong Tang" [EMAIL PROTECTED]
One of
Doug got my point. What I care about is the "difference", not which one is
better.
Doug Ewell wrote:
Dominikus Scherkl Dominikus dot Scherkl at glueckkanja dot com wrote:
It is not deterministic string ordering
?!?
What's non-deterministic in numeric
I have not touched Java for years (probably 5 years)... so I could be wrong.
Jain, Pankaj (MED, TCS) wrote:
Hi ftang/james..
Thanks for the detailed
explanation, and now I know the root problem of my error.
I have following string
is in database as Long in which
check http://emr.cs.iit.edu/home/reingold/calendar-book/second-edition/
Paul Hastings wrote:
does anybody know of any java farsi calendar components? thanks.
Paul Hastings [EMAIL PROTECTED]
CTO Sustainable Development
on the same data and return the same
results.
Your colleague is mistaken.
MichKa
- Original Message -
From: "Yung-Fong Tang" [EMAIL PROTECTED]
To: "Michael (michka) Kaplan" [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Thursday, March 13, 2003 4:31 PM
Subject: Re: s
Mary McCarter wrote:
Hi Friends,
My phone (Motorola i550,i30sx,i85,i60c) doesn't show correctly the ó neither
&#243; and it shows the Ã³ instead of ó.
Is that a LATIN CAPITAL LETTER A WITH TILDE and a SUPERSCRIPT THREE?
ISO-8859-1 uses 0xC3 to encode LATIN CAPITAL LETTER A WITH TILDE
ISO-8859-1 uses 0xB3
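The two bytes named above are exactly the UTF-8 encoding of ó (the character that &#243; refers to), so the symptom looks like UTF-8 data being displayed as ISO-8859-1. A minimal sketch that reproduces it; this only demonstrates the byte arithmetic and is a guess about what the phone does, not a statement about its firmware:

```python
import unicodedata

original = "\u00f3"                    # LATIN SMALL LETTER O WITH ACUTE
utf8_bytes = original.encode("utf-8")  # the two bytes 0xC3 0xB3
misread = utf8_bytes.decode("iso-8859-1")

print(utf8_bytes.hex(" ").upper())
for ch in misread:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

If the names printed match what the phone shows, the fix is on the encoding label or conversion side, not in the data itself.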
I hope they can reduce the weight next time by changing the type of
paper. My Bible has about 500 pages more (about 1500+ pages) than the
Unicode 3.0 standard but is only 50% as thick. The same goes for my
Chinese/English dictionary.
Otto Stolz wrote:
Kenneth Whistler wrote:
we can
calculate the
John H. Jenkins wrote:
I certainly think it would be good published with a leather cover,
onion-skin paper, and gilt edges, yes. First we have to have Ken
divide it into verses, though.
I thought we already had verses divided in Chapter 3. That
C1-C13/D1-2 stuff
One of my colleagues asked me this question. We use LCMapStringW on WinXP
and LCMapStringA on Win98 (using LCMAP_SORTKEY), and we got
different sorting orders for the following
Example of message list ordering in Win98:
TESTING #1
TESTING #10
TESTING #100
TESTING #11
While, the message list
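The two orderings can be imitated without Windows at all. A minimal sketch in Python, assuming the newer behavior is "digit runs compare as numbers", as suggested elsewhere in this thread; numeric_aware_key is a hypothetical helper and makes no attempt to reproduce the real sort keys that LCMapStringA or LCMapStringW return.

```python
import re

titles = ["TESTING #1", "TESTING #10", "TESTING #100", "TESTING #11"]


def numeric_aware_key(text: str) -> list:
    """Split into text and digit runs so that digit runs compare as integers."""
    parts = re.split(r"(\d+)", text)
    return [int(part) if part.isdigit() else part for part in parts]


plain_order = sorted(titles)                           # character by character
numeric_order = sorted(titles, key=numeric_aware_key)  # digit runs as numbers

print(plain_order)
print(numeric_order)
```

Storing a key like this with each message and sorting on it everywhere is one way to get the same order on both systems, at the cost of not using the platform collation.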
Because the following processing got applied to your Unicode data:
1. Convert \u escapes to Unicode:
\uFFE2\uFF80\uFF93
becomes
three Unicode characters:
U+FFE2, U+FF80, U+FF93
This is OK.
2. A "throw away the high 8 bits" step got applied to your data, so
it became 3 bytes:
E2 80 93
3. And some code treats it as UTF-8
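The three steps can be replayed in a few lines. A minimal sketch in Python (my choice of language; the original data came from Java), where the masking with 0xFF stands in for whatever code threw away the high 8 bits:

```python
import unicodedata

# Step 1: the \u escapes have already become three Unicode characters.
chars = "\uFFE2\uFF80\uFF93"

# Step 2: keep only the low 8 bits of each character.
low_bytes = bytes(ord(ch) & 0xFF for ch in chars)

# Step 3: interpret the resulting bytes as UTF-8.
decoded = low_bytes.decode("utf-8")

print(low_bytes.hex(" ").upper())
print(f"U+{ord(decoded):04X} {unicodedata.name(decoded)}")
```

So three odd-looking characters collapse into a single U+2013, which is probably what the data was meant to contain in the first place.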
.10; Supplementary Private Use Area-B
Also, I doubt we should allow
E0000..E007F; Tags
to be used as NameStartChar
Frank Yung-Fong Tang
Ram Viswanadha wrote:
There is also some information at
http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html#Test_Results
Not sure if this is what you are looking
for.
Thanks. Not really. I am not looking into the
1. Open your file with N7 and change the encoding to UTF-8.
2. Select and copy all the text.
3. Paste it into the first textarea of the attached html file.
David Oftedal wrote:
Hello!
Sorry to make this a mass spam, but I need a program to convert UTF-8
to hex sequences. This is useful for embedding
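For anyone who would rather script this than go through a browser, a minimal sketch follows. The function names and the two output styles (space-separated hex, and \x escapes for pasting into source code) are my own assumptions about what "hex sequences" should look like.

```python
def utf8_hex(text: str, sep: str = " ") -> str:
    """Return the UTF-8 bytes of text as two-digit uppercase hex values."""
    return sep.join(f"{byte:02X}" for byte in text.encode("utf-8"))


def utf8_escaped(text: str) -> str:
    """Return the UTF-8 bytes of text as \\xNN escapes."""
    return "".join(f"\\x{byte:02X}" for byte in text.encode("utf-8"))


print(utf8_hex("\u00e9"))
print(utf8_escaped("\u2013"))
```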
Francois Yergeau wrote:
[EMAIL PROTECTED] wrote:
I remember there was some study showing that although UTF-8 encodes each
Japanese/Chinese character in 3 bytes, Japanese/Chinese usually uses
FEWER characters in writing to communicate information than
alphabet-based languages.
Francois Yergeau wrote:
http://www.unicode.org/iuc/iuc9/Friday2.html#b3
Reuters Compression Scheme for Unicode (RCSU)
Misha Wolf
Unfortunately, no information about German or Japanese. :(
It only has Chinese, Farsi, Urdu, Russian, Arabic, Hindi, Korean ,
Creole, Thai, French, Czech,
Thanks, everyone. But I want to point out that punctuation and the space itself
should also be considered in your future calculations. Japanese, Chinese, and
Thai do not use spaces between words, while Latin-based scripts (or Greek,
Korean, Cyrillic, Arabic, Armenian, Georgian, etc.) do use spaces, and when
used for estimate size,
I remember there was some study showing that although UTF-8 encodes each
Japanese/Chinese character in 3 bytes, Japanese/Chinese usually uses
FEWER characters in writing to communicate information than alphabet-based
languages.
Can anyone point me to such research? Martin, do you have some paper
-Fong Tang [EMAIL PROTECTED] wrote:
Thanks for letting me know. I guess I haven't spent enough time with www.unicode.org
these days :) When did you add those PDFs there? It used to have only partial
sessions available... but that is probably a story from several years ago
Roozbeh Pournader wrote:
On Thu, 27 Feb 2003, Mark Davis wrote:
Doug Ewell wrote:
Yung-Fong Tang ftang at netscape dot com wrote:
So... in the future, in order to ensure we have a good software
environment, we not only need to make the Unicode 4.0 clear, but also
need to speed up the revision of those RFCs.
But the Unicode
Kenneth Whistler wrote:
Think of it this way. Does anyone expect the ASCII standard to tell,
in detail, what a process should or should not do if it receives
data which purports to be ASCII, but which contains an 0x80 byte
in it? All the ASCII standard can really do is tell you that
0x80 is not
My test data generator at
http://people.netscape.com/ftang/testscript/arabic/arabic.html
can probably also help people look at the Arabic behavior.
Unfortunately, it is currently coded against Windows-1256 instead of
Unicode.
I think you have problems in both 1 and 2.
1. I think you are encoding it the wrong way; you probably should encode figure
2 by using
U+0644-U+0654-U+0627
and figure 3 by using
U+0644-U+0627-U+0654
2. I think there is also a font problem. From my tests, all the fonts shipped with
MS Windows do not
Stefan Persson wrote:
Kenneth Whistler wrote:
Unicode 3.0 defined non-shortest UTF-8 as *irregular* code value
sequences. There were two types:
a. 0xC0 0x80 for U+0000 (instead of 0x00)
b. 0xED 0xA0 0x80 0xED 0xB0 0x80 for U+10000 (instead of 0xF0 0x90
0x80 0x80)
Ah, but encoding NULL
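It is easy to check how a strict decoder treats the two byte sequences listed above. A minimal sketch using Python's built-in strict UTF-8 codec as the example decoder (my choice, not one discussed in this thread); is_well_formed_utf8 is a hypothetical helper name.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


candidates = {
    "00 (shortest form of U+0000)": b"\x00",
    "C0 80 (two-byte form of U+0000)": b"\xc0\x80",
    "F0 90 80 80 (shortest form of U+10000)": b"\xf0\x90\x80\x80",
    "ED A0 80 ED B0 80 (encoded surrogate pair)": b"\xed\xa0\x80\xed\xb0\x80",
}
for label, data in candidates.items():
    print(f"{label}: {is_well_formed_utf8(data)}")
```

A decoder that accepts the longer forms lets the same character arrive under two different byte spellings, which is why the later text stopped tolerating them.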
This discussion has been centered around UTF-8. But I hope the
corresponding rules apply to UTF-16 and UTF-32 for Unicode 4.0:
. for UTF-32: occurrences of 'surrogates' are ill-formed.
How about a UTF-32 sequence in which the 4 bytes represent the value U+10 ?
Is it considered ill-formed?
Kent Karlsson wrote:
The Unicode 4.0 text further strengthens Conformance Clause
C12, to make this crystal clear:
C12 When a process generates a code unit sequence which
purports to be in a Unicode character encoding form, it shall
not emit ill-formed code unit sequences.
C12a
Likewise, the Unicode Standard tells you what a well-formed
UTF-8 byte sequence is. But it is the software designer who has
to be smart about determining what his/her software will do when
it encounters an error condition and finds itself dealing
with a sequence which is ill-formed according to
I can keep answering these questions, but I can also assure
everyone that the UTC worked *very* hard this time around to
make the character encoding model much clearer in the Unicode 4.0
text, and to anticipate all these edge cases.
--Ken
The problem in the past came from two (or more
Not sure this is the right forum to discuss this issue. I found this "problem"
while debugging a UTF-8 email message.
When I looked into some email that we had problems with, I saw a Content-Type
header like the following:
Content-Type: text/html; charset="UTF-8"
As I
Kenneth Whistler wrote:
If you read through those definitions from Unicode 4.0 carefully,
you will see that UTF-8 representing a noncharacter is perfectly
valid, but UTF-8 representing an unpaired surrogate code point
is ill-formed (and therefore disallowed).
I see a hole here. How about
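The distinction quoted above (a noncharacter is fine, an unpaired surrogate is not) can be seen from the encoding side as well. A minimal sketch, again using Python's strict UTF-8 codec as a stand-in for a conformant encoder; can_encode_utf8 is a hypothetical helper, and the sample code points are my own picks.

```python
def can_encode_utf8(text: str) -> bool:
    """Return True if a strict UTF-8 encoder accepts every code point in text."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print(can_encode_utf8("\ufffe"))  # noncharacter U+FFFE
print(can_encode_utf8("\ufdd0"))  # noncharacter U+FDD0
print(can_encode_utf8("\ud800"))  # unpaired surrogate U+D800
print("\ufffe".encode("utf-8").hex(" ").upper())
```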
I think that is a very common mistake people WILL make.
Doug Ewell wrote:
Thanks to all who pointed out that noncharacters, unlike surrogate code
points, are NOT illegal or invalid in UTF-8 or any other CES. I don't
know why I said they were. (Bad brain! Bad, bad brain!)
-Doug Ewell
Fullerton,