From: "Marcin 'Qrczak' Kowalczyk" [EMAIL PROTECTED]
Why would UTF-16 be easier for internal processing than UTF-8?
Both are variable-length encodings.
Performance tuning is easier with UTF-16. You can optimize for
BMP characters, knowing that surrogate pairs are sufficiently uncommon
that it's OK for them to take a bail-out slow path.
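A minimal C sketch of that fast-path/slow-path split (the helper name
and shape are illustrative, not from any post): every code unit counts
as one character, and only a lead surrogate in 0xD800-0xDBFF diverts
to the rare branch that also consumes its trail unit.

    #include <stddef.h>
    #include <stdint.h>

    /* Count the code points in a UTF-16 buffer.  BMP units take the
       fast path; lead surrogates (0xD800-0xDBFF) are the uncommon
       case and bail out to skip their trail unit too.  Sketch only;
       assumes well-formed UTF-16. */
    size_t count_code_points(const uint16_t *s, size_t len)
    {
        size_t n = 0, i = 0;
        while (i < len) {
            n++;
            i += (s[i] >= 0xD800 && s[i] <= 0xDBFF) ? 2 : 1;
        }
        return n;
    }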
From: "Marcin 'Qrczak' Kowalczyk" [EMAIL PROTECTED]
Why would UTF-16 be easier for internal processing than UTF-8?
Both are variable-length encodings.
Performance tuning is easier with UTF-16. You can optimize for
BMP characters, knowing that surrogate pairs are sufficiently uncommon
that
From: "Marcin 'Qrczak' Kowalczyk" [EMAIL PROTECTED]
Why would UTF-16 be easier for internal processing than UTF-8?
Both are variable-length encodings.
Performance tuning is easier with UTF-16. You can optimize for
BMP characters, knowing that surrogate pairs are sufficiently uncommon
that
From: "Marcin 'Qrczak' Kowalczyk" [EMAIL PROTECTED]
Why would UTF-16 be easier for internal processing than UTF-8?
Both are variable-length encodings.
Performance tuning is easier with UTF-16. You can optimize for
BMP characters, knowing that surrogate pairs are sufficiently uncommon
that
Miikka-Markus,
I'd suggest that you write this up as a PDF document (with scanned
examples) and submit it to the UTC and WG2 for consideration.
From: Asmus Freytag [mailto:[EMAIL PROTECTED]]
Sent: Sunday, September 23, 2001 02:24 AM
The typical situation involves cases where large data sets are cached in
memory, for immediate access. Going to UTF-32 reduces the cache
effectively by a factor of two, with no comparable increase in processing
efficiency to balance out the extra cache misses.
For this situation you have a good point. For others, however, the
extra data space of UTF-32 is bound to be lower cost than having to check
every character for special meaning (i.e., a surrogate) before passing it on.
First, it is generally faster to test something in a cache than it is to
Andy Heninger writes:
Performance tuning is easier with UTF-16. You can optimize for
BMP characters, knowing that surrogate pairs are sufficiently uncommon
that it's OK for them take a bail-out slow path.
Sure, but if you are using UTF-16 (or any other multibyte encoding)
you lose the
I forwarded Carl's note to a Typewriter list, and received this response.
At 12:49 -0500 2001-09-24, Eric Fischer wrote:
Michael Everson [EMAIL PROTECTED] quotes Carl W. Brown:
This is logical. Originally typewriters had no 1 or 0. You could use
the letters l and O. They look the same so
Three fonts walk into a bar. The barman, wiping a glass, shakes his
head and says to them: I'll have none of your type in here.
Mike,
The typical situation involves cases where large data sets are cached in
memory, for immediate access. Going to UTF-32 reduces the cache
effectively by a factor of two, with no comparable increase in processing
efficiency to balance out the extra cache misses. This is because
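To put rough numbers on that point (the sizes here are assumed for
illustration, not taken from the thread): a cached data set of two
million mostly-BMP characters occupies about 4 MB as UTF-16 but 8 MB
as UTF-32, so a cache that holds the entire UTF-16 copy holds only
half of the UTF-32 copy, and every pass over the data pays the
difference in misses.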
-----Original Message-----
From: Michael Everson [mailto:[EMAIL PROTECTED]]
Three fonts walk into a bar. The barman, wiping a glass, shakes his
head and says to them: I'll have none of your type in here.
Gee, and I thought he was going to say:
Why the long face?
Tom,
Andy Heninger writes:
Performance tuning is easier with UTF-16. You can optimize for
BMP characters, knowing that surrogate pairs are sufficiently uncommon
that it's OK for them to take a bail-out slow path.
Sure, but if you are using UTF-16 (or any other multibyte encoding)
you
If you think you have the answer to all the problems, then you
don't know all the problems.
I tried to make a point, and apparently made it poorly. I will try
again. It seems that some people are arguing that UTF-16 is the ideal
solution for all computing, and that UTF-8 and
At 13:59 -0500 2001-09-24, Ayers, Mike wrote:
It seems that some people are arguing that UTF-16 is the ideal
solution for all computing, and that UTF-8 and UTF-32 exist only for
network transport.
I tend to think that because I have to make web pages using UTF-8, I
wish that I had better
From: Suzanne M. Topping [EMAIL PROTECTED]
From: Michael Everson [mailto:[EMAIL PROTECTED]]
Three fonts walk into a bar. The barman, wiping a glass, shakes his
head and says to them: I'll have none of your type in here.
Gee, and I thought he was going to say:
Why the long face?
Actually, there is no need of the digits at all! Why even have them?
With my Japanese IME, I simply type the number as words (Japanese number
words are much shorter than the English ones) and convert!
Or, you can say, why in English do we have the words "two", "three", etc., when we can
merely write
Carl W. Brown writes:
If you implement an array that is directly indexed by Unicode code point it
would have to have 1,114,112 entries, one per code point from 0 to
1,114,111. (I love the number) I don't think that many applications can
afford to have over a megabyte of storage per byte of
table width. If nothing else it would be
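The standard way around that cost is a two-stage lookup, sketched
below in C (the names stage1 and stage2 are assumptions for the
sketch, not Carl's tables): a first table maps each block of 256 code
points to one of a much smaller set of shared 256-entry blocks, so
identical blocks are stored only once.

    #include <stdint.h>

    #define BLOCK_SHIFT 8
    #define BLOCK_SIZE  (1 << BLOCK_SHIFT)

    /* stage1 has 0x110000 >> 8 = 4352 entries; stage2 holds only the
       distinct 256-entry blocks, typically a few dozen for a given
       character property.  Both tables are assumed to be generated
       offline from the Unicode data files. */
    extern const uint16_t stage1[0x110000 >> BLOCK_SHIFT];
    extern const uint8_t  stage2[][BLOCK_SIZE];

    uint8_t lookup(uint32_t cp)     /* cp <= 0x10FFFF */
    {
        return stage2[stage1[cp >> BLOCK_SHIFT]][cp & (BLOCK_SIZE - 1)];
    }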
Michael,
I was oversimplifying. If you look at the older teletype keyboards you
will notice that the shift is between letters (mono case) and figures. You
will also see three rows of keys. With 5-bit encoding you had a letters and
figures shift. If I remember correctly the space and carriage
From: Ayers, Mike [EMAIL PROTECTED]
Analyze problem. Pick solution. In that order.
Wiser advice was ne'er spoken, on *this* topic at least.
I wonder if there is some way that a policy decision can be made to declare
a moratorium on the whole *my* UTF is better than *your* UTF debate for a while?
From: Tom Emerson [EMAIL PROTECTED]
But if I have a text string, and that string is encoded in UTF-16, and
I want to access Unicode character values, then I cannot index that
string in constant time.
To find character n I have to walk all of the 16-bit values in that
string accounting for surrogates.
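In C the walk looks roughly like this (an illustrative helper, not
code from the thread): reaching code point n costs O(n) in UTF-16,
where a UTF-32 array needs only the single access s[n].

    #include <stddef.h>
    #include <stdint.h>

    /* Return the code-unit index at which code point n starts,
       consuming one extra unit whenever a lead surrogate opens a
       pair.  Assumes well-formed UTF-16; sketch only. */
    size_t utf16_offset_of(const uint16_t *s, size_t n)
    {
        size_t i = 0;
        while (n-- > 0)
            i += (s[i] >= 0xD800 && s[i] <= 0xDBFF) ? 2 : 1;
        return i;
    }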
Michael (michka) Kaplan writes:
To find character n I have to walk all of the 16-bit values in that
string accounting for surrogates. If I use UTF-32 I don't need to do
that. This very issue came up during the discussion of how to handle
surrogates in Python.
Would this not be the
Yung-Fong Tang wrote:
Basically GB18030 is designed to encode all of the Unicode BMP in an
encoding which is backward compatible with GB2312 and GBK.
Correction: to encode _all_ of Unicode, not just all Unicode BMP - GB 18030 covers
all 17 planes, not just the BMP.
markus
Markus Scherer wrote:
Yung-Fong Tang wrote:
Basically GB18030 is designed to encode all of the Unicode BMP in an
encoding which is backward compatible with GB2312 and GBK.
Correction: to encode _all_ of Unicode, not just all Unicode BMP - GB 18030
covers all 17 planes, not just the BMP.
Does
On Mon, Sep 24, 2001 at 06:18:19PM -0700, Yung-Fong Tang wrote:
Markus Scherer wrote:
Correction: to encode _all_ of Unicode, not just all Unicode BMP - GB 18030
covers all 17 planes, not just the BMP.
Does GB18030 DEFINE the mapping between GB18030 and the other 16 planes? I don't
Yung-Fong Tang writes:
Does GB18030 DEFINE the mapping between GB18030 and the other 16
planes? I don't think so, since Unicode has not defined them yet,
right?
Sure it does. We know what the code points are, even if they don't
have characters assigned to them yet. This allows GB18030 to
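The supplementary planes show why: GB18030 maps U+10000 through
U+10FFFF onto its four-byte sequences by a fixed linear rule, so no
per-character table is needed for the planes above the BMP. A C
sketch (the function name is an assumption):

    #include <stdint.h>

    /* Encode a supplementary code point (U+10000..U+10FFFF) as its
       four-byte GB18030 sequence.  U+10000 maps to 90 30 81 30 and
       the sequences simply count upward, with bytes 1 and 3 ranging
       over 0x81-0xFE (126 values) and bytes 2 and 4 over 0x30-0x39
       (10 values); U+10FFFF lands on E3 32 9A 35. */
    void gb18030_encode_supplementary(uint32_t cp, uint8_t out[4])
    {
        uint32_t ofs = cp - 0x10000;
        out[3] = 0x30 + ofs % 10;   ofs /= 10;
        out[2] = 0x81 + ofs % 126;  ofs /= 126;
        out[1] = 0x30 + ofs % 10;   ofs /= 10;
        out[0] = 0x90 + ofs;
    }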
In a message dated 2001-09-24 20:50:25 Pacific Daylight Time,
[EMAIL PROTECTED] writes:
Does GB18030 DEFINE the mapping between GB18030 and the other 16 planes?
I don't think so, since Unicode has not defined them yet, right?
Unicode defined all the planes, a long long time ago. It's