GB18030

2001-09-21 Thread Charlie Jolly
GB18030 

In what ways will this affect Unicode?

Does it contain anything that Unicode doesn't?


RE: numeric ordering

2001-09-21 Thread Karlsson Kent - keka



  1.  Is there another document/algorithm/table that does provide
  guidelines for sorting numbers within strings?  Something
  that deals with different scripts?
 
 ISO/IEC 14651 International String Ordering includes
 an informative annex on this topic. In particular, see
 C.2 Handling of numeral substrings in collation. The specific

C.3 in my copy...

 case of sorting multiple-part section numbering is not
 addressed in detail, 

...because that is subsumed under C.3.1 (Handling of 'ordinary'
numerals for natural numbers), when FULL STOP is also considered
to separate numerals rather than be part of them (which is usually
the case for natural-number numerals).
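
To make the idea concrete, here is a minimal C sketch (mine, not taken from
ISO/IEC 14651; the name natnumcmp is illustrative) of a comparison in which
runs of ASCII digits are ordered by numeric value rather than code point by
code point. Because FULL STOP is not a digit, multi-part section numbers fall
out of the same rule: "2.9.1" sorts before "2.10".

    #include <ctype.h>
    #include <string.h>

    /* Compare two ASCII strings so that embedded digit runs are ordered by
     * numeric value, e.g. "2.9.1" < "2.10" and "fig2" < "fig10".  Digit runs
     * are compared by length (after skipping leading zeros) and then
     * lexically, which yields numeric order without risking overflow. */
    int natnumcmp(const char *a, const char *b)
    {
        while (*a && *b) {
            if (isdigit((unsigned char)*a) && isdigit((unsigned char)*b)) {
                while (*a == '0') a++;          /* ignore leading zeros */
                while (*b == '0') b++;
                size_t la = 0, lb = 0;
                while (isdigit((unsigned char)a[la])) la++;
                while (isdigit((unsigned char)b[lb])) lb++;
                if (la != lb)
                    return la < lb ? -1 : 1;    /* longer run = larger number */
                int cmp = strncmp(a, b, la);
                if (cmp != 0)
                    return cmp;
                a += la;
                b += lb;
            } else {
                if (*a != *b)
                    return (unsigned char)*a < (unsigned char)*b ? -1 : 1;
                a++;
                b++;
            }
        }
        return (*a != 0) - (*b != 0);           /* shorter string sorts first */
    }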

(Teknisk norm nr. 34, Swedish Alphanumeric Sorting, [Swedish] Statskontoret,
1992, has a somewhat different approach to the same problem; however, that
document is only available in Swedish, does not go into details on this, and
even though it describes a multi-level ordering it does not fit well with the
UTR10/14651 framework...)

/Kent Karlsson

 but many similar kinds of problems
 are.
 
 --Ken
 
 




Re: GB18030

2001-09-21 Thread Thierry Sourbier
Charlie,

 In what ways will this affect Unicode?

 Does it contain anything that Unicode doesn't?

I suggest that you take a look at Markus Scherer's paper "GB 18030: A
mega-codepage"
http://www-106.ibm.com/developerworks/library/u-china.html

It will probably answer your question on the relationship between GB18030
and Unicode.

Cheers,
Thierry.


www.i18ngurus.com - Open Internationalization Resources Directory


Kana syllables

2001-09-21 Thread てんどうりゅうじ
The small letters are for making combinations like the one in my fake name: the
regular Ri and the small Yu make Ryu.
Some syllables require 2 katakana (or hiragana) symbols.

But the thing is, are "ra gyou" kana to be regarded as having R or L for their 
consonant?

You can get lots of 2-kana syllables. Like in the title "ranmafankurabu" where "fa" is 
a Fu with small A. (Actually, in Unicode names, the Fu is called Hu.)

Some Unicode names for kana do not reflect the pronunciation.
The kana Si is usually pronounced Shi, but I think it depends on your dialect of
Japanese. I think it could also be Si.

There are many sources of info on kana on the Web. Look one up.
Heck, I can't even sing fast enough in kana to keep up with the song.

じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town


Re: UTF-8 UCS-2/UTF-16 conversion for library use

2001-09-21 Thread Kenneth Whistler

Tree said:

 While the conversion between UTF-8 and UTF-16/UCS-2 is algorithmic and
 very fast, we need to remember that a buffer needs to be allocated to
 hold the converted result, and the data needs to be copied as things
 go in and out of the library.

Well, of course. But then I am mostly a C programmer, and tend to
think of these things in terms of preallocated static buffers that
get reused, or autoallocation on the stack, with just pointers
getting passed around to reduce data copies. With such methods,
for practical purposes, the conversions tend to be insignificant
compared to the rest of the work the API is usually engaged in.

But if you are doing object-oriented programming, it is always
a danger that you may end up multiplying your object constructions
needlessly, and to paraphrase Everett Dirksen, for the other
oldtimers out there, a billion nanoseconds here, a billion
nanoseconds there, eventually turn into real time. *hehe*

It is my impression, however, that most significant applications
tend, these days, to be I/O bound and/or network
transport bound, rather than compute bound. With a little care
in implementation, such things as string character set conversions
at interfaces do end up down in the noise, compared to the
other major issues that can affect overall performance and
throughput. Remember, these days we are dealing with gigahertz+
processors -- these are not your father's CPU's.

My point was that character set conversion at the interface to
a library -- particularly such conversions as UTF-8 <-> UTF-16
that don't even involve loading a resource table for conversion --
should not be seen as a significant barrier or performance
bottleneck. Looking for a UTF-8 library because it would
be more efficient to avoid conversions, even when a good
UTF-16 API is available, is misconstruing the problem and
(mostly) misplacing concern about performance.
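
(To make the "no resource table" point concrete: the UTF-16 to UTF-8 step is
pure bit arithmetic. A minimal sketch, not any particular library's code, of
encoding one code point; surrogate pairs are assumed to have been combined
into a code point by the caller.)

    #include <stddef.h>
    #include <stdint.h>

    /* Encode one Unicode code point as UTF-8; returns the number of bytes
     * written (1..4).  No lookup tables, just shifts and masks. */
    static size_t cp_to_utf8(uint32_t cp, unsigned char out[4])
    {
        if (cp < 0x80) {
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {
            out[0] = (unsigned char)(0xC0 | (cp >> 6));
            out[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp < 0x10000) {
            out[0] = (unsigned char)(0xE0 | (cp >> 12));
            out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        } else {
            out[0] = (unsigned char)(0xF0 | (cp >> 18));
            out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
    }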

 What is the real impact of this? I don't know: I haven't measured it
 myself. Obviously this could be handled a number of ways with various
 performance characteristics, but it does become an issue.

It's an issue, certainly, but to my mind, more one of a cultural
issue based on a somewhat dated set of worries, rather than a significant
performance issue.

I'm reminded somewhat of the clamor a decade ago about how
bad Unicode was because it would double the size of our
data stores. At the time, I was working on a computer with
a 20 megabyte hard disk, and (ooh!) a new, modern, 1-megabyte
floppy disk drive. Today, my home computer has a 45-*giga*byte
hard drive. I could spend the rest of my life trying to
create enough *text* data to fill a significant portion of
that drive. It is mostly populated with code images, libraries,
artwork and other graphics, web pages, music, and what not,
as are most people's hard disks, I surmise. We don't hear
much, anymore, about how wasteful Unicode is in its storage
of characters.

--Ken




RE: GB18030

2001-09-21 Thread Sampo Syreeni

On Fri, 21 Sep 2001, Carl W. Brown wrote:

 Most systems that handle GB18030 will want to convert it to Unicode first
 to reduce processing overhead.

Unless we start seeing Chinese software which is designed to utilize the
compatibility between 18030 and GBK -- font rendering apps and the influence
such OS level functionality tends to have on common APIs immediately come to
mind.

Besides, if the Chinese for any reason get bored enough with the Unicode
and/or ISO character allocation process, they might indeed start assigning
some of those extra code points in 18030. If this ever happens, the
incompatibility might well lead to a significant mass of software with 18030
as the primary character set.

 With GB18030 you sometimes have to check the first two bytes.
 UTF-8, for example, is an MBCS character set, but I can still walk backwards
 through a string and find character boundaries. With GB18030 I must start over
 from the beginning of the string to find the start of the previous character.

Actually I think the previous line feed will buy you a sync.

Still, that is a *very* bad thing, especially since we know that many of the
earlier ISO 2022-derived multibyte codings had problems with string search
and similar functionality, problems which were all but solved by UTF-8. It'd
be a real shame to see progress towards encodings which force people to again
devote time to something that has already been solved once.

 It is smaller than UTF-8 for Chinese and larger for anyone else.

But you'll have to concede that that is a significant point, especially if
people perceive UTF-8-coded Chinese as being unacceptably large compared to
existing Chinese encodings (GB, Big Five, now 18030). A billion people, and
so forth...

Sampo Syreeni, aka decoy, mailto:[EMAIL PROTECTED], gsm: +358-50-5756111
student/math+cs/helsinki university, http://www.iki.fi/~decoy/front





Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Markus Scherer

I would like to add that ICU 2.0 (in a few weeks) will have convenience functions for 
in-process string transformations:

UTF-16 <-> UTF-8
UTF-16 <-> UTF-32
UTF-16 <-> wchar_t*
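
For example, flattening a UTF-16 string to UTF-8 can then be a single call
with a caller-supplied buffer (a sketch assuming the ICU C API names
u_strToUTF8() and u_errorName() from unicode/ustring.h; the exact ICU 2.0
signatures may differ):

    #include <stdio.h>
    #include <unicode/ustring.h>

    int main(void)
    {
        /* "hello" with one non-ASCII character, as UTF-16 code units */
        static const UChar src[] = { 0x0068, 0x00E9, 0x006C, 0x006C, 0x006F, 0 };
        char dest[32];
        int32_t destLen = 0;
        UErrorCode status = U_ZERO_ERROR;

        /* NUL-terminated source (-1); no converter object or table needed */
        u_strToUTF8(dest, (int32_t)sizeof(dest), &destLen, src, -1, &status);
        if (U_FAILURE(status)) {
            fprintf(stderr, "conversion failed: %s\n", u_errorName(status));
            return 1;
        }
        printf("UTF-8 result is %d bytes\n", (int)destLen);
        return 0;
    }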

markus




RE: GB18030

2001-09-21 Thread Murray Sargent

I think I've figured out a way to find the beginning of a GB18030 character starting 
anywhere in a document. The algorithm is similar to finding the beginning of a DBCS 
character in that you scan backward until you find a byte that can only come at the 
start of a character. The main difference is that you check for being in four-byte 
characters first (those of the form HdHd, where H is a byte in the range 0x81 - 0xFE 
and d is an ASCII digit). If a four-byte character isn't involved (ordinary GB 
double-byte characters don't use d as a trail byte), you revert to the DBCS approach 
for handling the rest of GB18030. 
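
A sketch of the general resynchronization idea (an anchor-and-rescan variant,
not necessarily the exact backward scan described above; the function names
are illustrative): back up to a byte value that can never occur inside a
multi-byte GB18030 character, then decode forward to the target position.

    #include <stddef.h>

    /* Bytes 0x00-0x2F, 0x3A-0x3F and 0x7F never occur as non-initial bytes
     * (two-byte trail bytes are 0x40-0xFE except 0x7F; bytes 2 and 4 of
     * four-byte characters are digits 0x30-0x39), so they are always
     * complete single-byte characters -- unambiguous anchors.  Note that
     * LF and CR are anchors, hence "the previous line feed buys a sync". */
    static int gb18030_is_anchor(unsigned char b)
    {
        return b <= 0x2F || (b >= 0x3A && b <= 0x3F) || b == 0x7F;
    }

    /* Length of the character starting at s, assumed to be on a boundary. */
    static size_t gb18030_char_len(const unsigned char *s, size_t avail)
    {
        if (avail == 0 || s[0] <= 0x7F)
            return 1;
        if (avail >= 4 &&
            s[0] >= 0x81 && s[0] <= 0xFE && s[1] >= 0x30 && s[1] <= 0x39 &&
            s[2] >= 0x81 && s[2] <= 0xFE && s[3] >= 0x30 && s[3] <= 0x39)
            return 4;                          /* HdHd four-byte form */
        if (avail >= 2 && s[0] >= 0x81 && s[0] <= 0xFE)
            return 2;                          /* GBK-style two-byte form */
        return 1;                              /* ill-formed; resync next byte */
    }

    /* Offset of the start of the character containing buf[pos], assuming
     * buf[0] is on a character boundary (validation omitted for brevity). */
    size_t gb18030_char_start(const unsigned char *buf, size_t len, size_t pos)
    {
        size_t start = pos;
        while (start > 0 && !gb18030_is_anchor(buf[start]))
            start--;                           /* back up to an anchor */
        for (;;) {
            size_t n = gb18030_char_len(buf + start, len - start);
            if (start + n > pos)
                return start;                  /* pos falls inside this char */
            start += n;                        /* decode forward */
        }
    }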
 
This algorithm is handy when you want to stream in a file in chunks and need to know 
if a chunk ends in the middle of a character. One can also solve this particular 
problem by keeping track of character boundaries from the start of stream, but 
typically more processing is involved.
 
Murray

-Original Message- 
From: Carl W. Brown [mailto:[EMAIL PROTECTED]] 
Sent: Fri 2001/09/21 04:56 
To: Charlie Jolly; [EMAIL PROTECTED] 
Cc: 
Subject: RE: GB18030



Charlie,

GB18030 is designed to support all Unicode characters.  It has the capacity
to also encode additional characters.  I know of no plans to do so.

I don't think it will have much effect on Unicode.  Most systems that handle
GB18030 will want to convert it to Unicode first to reduce processing
overhead.  With most of the common MBCS code pages you can determine the
length of a character from its first byte.  With GB18030 you sometimes have
to check the first two bytes.  UTF-8, for example, is an MBCS character set,
but I can still walk backwards through a string and find character
boundaries.  With GB18030 I must start over from the beginning of the string
to find the start of the previous character.

It is smaller than UTF-8 for Chinese and larger for anyone else.

Carl

 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED]]On Behalf Of Charlie Jolly
 Sent: Friday, September 21, 2001 1:42 AM
 To: [EMAIL PROTECTED]
 Subject: GB18030


 GB18030

 In what ways will this affect Unicode?

 Does it contain anything that Unicode doesn't?












Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Yung-Fong Tang


Mozilla also uses Unicode internally and is cross-platform.
[EMAIL PROTECTED] wrote:

 For cross-platform software (NT, Solaris, HP, AIX), the only 3rd-party
 Unicode support I found so far is IBM ICU. It's very good support for
 cross-platform software internationalization. However, ICU internally uses
 UTF-16. For our application using UTF-8 as input and output, I have to
 convert from UTF-8 to UTF-16 before calling ICU functions (such as
 ucol_strcoll()). I'm worried about the performance overhead of this
 conversion.

Then... use Unicode internally in your software. Regardless of whether you use
UTF-8 or UCS2 as the data type in the interface, eventually some code needs to
convert it to UCS2 for most of the processing. Unless you use UCS2 internally,
you need to pay for the performance, either inside the library or in your own
code.


 Are there any other cross-platform 3rd-party Unicode supports with better
 UTF-8 handling? Thanks a lot.
 -Changjian Sun



Re: GB18030

2001-09-21 Thread Yung-Fong Tang

Basically, GB18030 is designed to encode all of the Unicode BMP in an encoding
which is backward compatible with GB2312 and GBK.

GB18030 came about because there are characters which are encoded in Unicode
but not in GB2312 or GBK.


Thierry Sourbier wrote:

 Charlie,

  In what ways will this affect Unicode?
 
  Does it contain anything that Unicode doesn't?

 I suggest that you take a look at Markus Scherer's paper "GB 18030: A
 mega-codepage"
 http://www-106.ibm.com/developerworks/library/u-china.html

 It will probably answer your question on the relationship between GB18030
 and Unicode.

 Cheers,
 Thierry.

 
 www.i18ngurus.com - Open Internationalization Resources Directory





Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Yung-Fong Tang



Markus Scherer wrote:

 I would like to add that ICU 2.0 (in a few weeks) will have convenience functions 
for in-process string transformations:

 UTF-16 <-> UTF-8
 UTF-16 <-> UTF-32
 UTF-16 <-> wchar_t*

Wait, be careful here. wchar_t is not an encoding, so in theory you cannot
convert between UTF-16 and wchar_t. You can, however, convert between UTF-16
and wchar_t* on Win32, since Microsoft declares UTF-16 to be the encoding for
wchar_t. However, that is not universally true: different platforms can choose
the size of wchar_t and the internal representation of wchar_t* according to
POSIX.



 markus





Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Markus Scherer

Yung-Fong Tang wrote:
  UTF-16 <-> wchar_t*
 
 Wait, be careful here. wchar_t is not an encoding, so in theory you cannot
 convert between UTF-16 and wchar_t. You can, however, convert between UTF-16
 and wchar_t* on Win32, since Microsoft declares UTF-16 to be the encoding
 for wchar_t. However, that is not universally true: different platforms can
 choose the size of wchar_t and the internal representation of wchar_t*
 according to POSIX.

I know. Don't get me started on the usefulness of wchar_t...
We handle this in our convenience function as best as we could figure out.
That's what makes it _convenient_ ;-)

[Granted, it might also not work everywhere, but it is better than nothing.]

markus




Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread David Starner

On Fri, Sep 21, 2001 at 04:16:50PM -0700, Yung-Fong Tang wrote:
 Then... use Unicode internally in your software. Regardless of whether you
 use UTF-8 or UCS2 as the data type in the interface, eventually some code
 needs to convert it to UCS2 for most of the processing. 

Why? UCS2 shouldn't be used at all, since it covers only the BMP. UTF-16 has
all the problems of UTF-8, just in a more limited form. If you can deal with
mixed 2-byte and 4-byte characters, you can also deal with 1-, 2-, 3- and
4-byte characters. 
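
(To illustrate the point: once characters are variable-width, stepping through
UTF-16 and UTF-8 is the same kind of code. A minimal sketch, with ill-formed
input deliberately not handled:)

    #include <stdint.h>

    /* Code units in the character whose first UTF-16 unit is c. */
    static int utf16_char_units(uint16_t c)
    {
        return (c >= 0xD800 && c <= 0xDBFF) ? 2 : 1;   /* lead surrogate? */
    }

    /* Code units (bytes) in the character whose first UTF-8 byte is c. */
    static int utf8_char_units(uint8_t c)
    {
        if (c < 0x80) return 1;
        if (c < 0xE0) return 2;     /* lead bytes 0xC0-0xDF */
        if (c < 0xF0) return 3;     /* lead bytes 0xE0-0xEF */
        return 4;                   /* lead bytes 0xF0-0xF4 */
    }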

-- 
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
When the aliens come, when the deathrays hum, when the bombers bomb,
we'll still be freakin' friends. - Freakin' Friends




RE: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Yves Arrouye

  UTF-16 <-> wchar_t*
 
 Wait, be careful here. wchar_t is not an encoding, so in theory you cannot
 convert between UTF-16 and wchar_t. You can, however, convert between UTF-16
 and wchar_t* on Win32, since Microsoft declares UTF-16 to be the encoding
 for wchar_t.

And he can also do the same between UTF-16 and UTF-32 for glibc-based programs,
since UTF-32 is the encoding for wchar_t on such platforms.

The way I read that was UTF-16 <-> UTF-(8*sizeof(wchar_t)). (Please don't
ask what happens when sizeof(wchar_t) is 3 or larger than 4, you know what I
mean :)). I guess the responsibility of this being a meaningful conversion
would be with the caller.

YA

PS: I don't know a way of knowing the encoding of wchar_t programmatically.
Is there one? That'd offer some interesting possibilities..
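
(One partial answer, as a sketch: C99 defines __STDC_ISO_10646__, which an
implementation defines when wchar_t values are ISO 10646 code points; combined
with sizeof(wchar_t) that at least distinguishes a UTF-32 wchar_t, as on glibc,
from everything else. Platforms that leave the macro undefined give no portable
guarantee at all.)

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
    #if defined(__STDC_ISO_10646__)
        printf("wchar_t holds ISO 10646 code points, %u bits wide\n",
               (unsigned)(sizeof(wchar_t) * 8));
    #else
        printf("wchar_t encoding is unspecified here (%u bits wide)\n",
               (unsigned)(sizeof(wchar_t) * 8));
    #endif
        return 0;
    }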