GB18030
In what ways will this affect Unicode? Does it contain anything that Unicode doesn't?
RE: numeric ordering
1. Is there another document/algorithm/table that does provide guidelines for sorting numbers within strings? Something that deals with different scripts? ISO/IEC 14651 International String Ordering includes an informative annex on this topic. In particular, see C.2, Handling of numeral substrings in collation. The specific case of sorting multiple-part section numbering is not addressed in detail, but many similar kinds of problems are. --Ken It is C.3 in my copy... The multiple-part section numbering case is not addressed in detail because it is subsumed under C.3.1 (Handling of 'ordinary' numerals for natural numbers), when one also considers FULL STOP to separate numerals rather than be part of them (which is usually the case for natural-number numerals). (Teknisk norm nr. 34, Swedish Alphanumeric Sorting, [Swedish] Statskontoret, 1992, takes a somewhat different approach to the same problem; however, that document is only available in Swedish, does not go into detail on this, and even though it describes a multi-level ordering it does not fit well with the UTR 10/14651 framework...) /Kent Karlsson
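Something along these lines illustrates the numeral-substring idea for plain ASCII section numbers (just an untested sketch of mine, not the ISO/IEC 14651 algorithm, and the function name is made up):

#include <ctype.h>
#include <string.h>

/* Compare two ASCII strings so that embedded runs of digits are
 * ordered by numeric value rather than by code point, e.g. so that
 * "2.9.1" sorts before "2.10.1".  Sketch only: leading zeros are
 * ignored, so numerically equal runs such as "7" and "007" compare
 * equal; a real collation would break such ties at a later level. */
static int numeric_aware_cmp(const char *a, const char *b)
{
    while (*a && *b) {
        if (isdigit((unsigned char)*a) && isdigit((unsigned char)*b)) {
            while (*a == '0') a++;        /* skip leading zeros         */
            while (*b == '0') b++;
            size_t la = strspn(a, "0123456789");
            size_t lb = strspn(b, "0123456789");
            if (la != lb)                 /* longer run = larger number */
                return la < lb ? -1 : 1;
            int c = memcmp(a, b, la);     /* equal lengths: lexical     */
            if (c != 0)
                return c;
            a += la;
            b += lb;
        } else {
            if (*a != *b)
                return (unsigned char)*a < (unsigned char)*b ? -1 : 1;
            a++;
            b++;
        }
    }
    return (*a != 0) - (*b != 0);
}

With this, "2.9.1" sorts before "2.10.1", which a plain byte-wise strcmp() gets wrong; fitting the same idea into the multi-level UTR 10/14651 framework is, of course, the part the annex actually discusses.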
Re: GB18030
Charlie, In what ways will this affect Unicode? Does it contain anything that Unicode doesn't? I suggest that you take a look at Markus Scherer's paper "GB 18030: A mega-codepage", http://www-106.ibm.com/developerworks/library/u-china.html It will probably answer your question on the relationship between GB18030 and Unicode. Cheers, Thierry. www.i18ngurus.com - Open Internationalization Resources Directory
Kana syllables
The small letters are for making syllables like the one in my fake name. The regular Ri and the small Yu make Ryu. Some syllables require 2 katakana (or hiragana) symbols. But the thing is, are "ra gyou" kana to be regarded as having R or L for their consonant? You can get lots of 2-kana syllables, like in the title "ranmafankurabu", where "fa" is a Fu with small A. (Actually, in Unicode names, the Fu is called Hu.) Some Unicode names for kana do not reflect the pronunciation. The kana Si is usually pronounced Shi, but I think it depends on your dialect of Japanese. I think it could also be Si. There are many sources of info on kana on the Web. Look one up. Heck, I can't even sing fast enough in kana to keep up with the song. じゅういっちゃん (Juuitchan) Well, I guess what you say is true, I could never be the right kind of girl for you, I could never be your woman - White Town
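For reference (my own addition, not part of the post above), the code points behind the combinations mentioned; the names are the official Unicode character names:

#include <stdio.h>

/* Unicode names use kunrei-style romanization (HU, SI), even though
 * the usual readings are "fu" and "shi". */
int main(void)
{
    static const struct { unsigned cp; const char *name; } kana[] = {
        { 0x30EA, "KATAKANA LETTER RI" },
        { 0x30E5, "KATAKANA LETTER SMALL YU" }, /* RI + SMALL YU = "ryu" */
        { 0x30D5, "KATAKANA LETTER HU" },       /* usually read "fu"     */
        { 0x30A1, "KATAKANA LETTER SMALL A" },  /* HU + SMALL A = "fa"   */
        { 0x30B7, "KATAKANA LETTER SI" },       /* usually read "shi"    */
    };
    size_t i;
    for (i = 0; i < sizeof kana / sizeof kana[0]; i++)
        printf("U+%04X  %s\n", kana[i].cp, kana[i].name);
    return 0;
}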
Re: UTF-8 UCS-2/UTF-16 conversion for library use
Tree said: While the conversion between UTF-8 and UTF-16/UCS-2 is algorithmic and very fast, we need to remember that a buffer needs to be allocated to hold the converted result, and the data needs to be copied as things go in and out of the library. Well, of course. But then I am mostly a C programmer, and tend to think of these things in terms of preallocated static buffers that get reused, or autoallocation on the stack, with just pointers getting passed around to reduce data copies. With such methods, for practical purposes, the conversions tend to be insignificant compared to the rest of the work the API is usually engaged in. But if you are doing object-oriented programming, it is always a danger that you may end up multiplying your object constructions needlessly, and to paraphrase Everett Dirksen, for the other oldtimers out there, a billion nanoseconds here, a billion nanoseconds there, eventually turn into real time. *hehe* It is my impression, however, that most significant applications tend, these days, to be I/O bound and/or network transport bound, rather than compute bound. With a little care in implementation, such things as string character set conversions at interfaces do end up down in the noise, compared to the other major issues that can affect overall performance and throughput. Remember, these days we are dealing with gigahertz+ processors -- these are not your father's CPUs. My point was that character set conversion at the interface to a library -- particularly such conversions as UTF-8 <-> UTF-16 that don't even involve loading a resource table for conversion -- should not be seen as a significant barrier or performance bottleneck. Looking for a UTF-8 library because it would be more efficient to avoid conversions, even when a good UTF-16 API is available, is misconstruing the problem and (mostly) misplacing concern about performance. What is the real impact of this? I don't know: I haven't measured it myself. Obviously this could be handled a number of ways with various performance characteristics, but it does become an issue. It's an issue, certainly, but to my mind, more a cultural issue based on a somewhat dated set of worries than a significant performance issue. I'm reminded somewhat of the clamor a decade ago about how bad Unicode was because it would double the size of our data stores. At the time, I was working on a computer with a 20 megabyte hard disk, and (ooh!) a new, modern, 1-megabyte floppy disk drive. Today, my home computer has a 45-*giga*byte hard drive. I could spend the rest of my life trying to create enough *text* data to fill a significant portion of that drive. It is mostly populated with code images, libraries, artwork and other graphics, web pages, music, and what not, as are most people's hard disks, I surmise. We don't hear much, anymore, about how wasteful Unicode is in its storage of characters. --Ken
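To make the preallocated-buffer point concrete, something like this is what I have in mind (a rough, untested sketch; do_something_utf16() is a made-up stand-in for whatever UTF-16 API is being called, and a real converter needs full validation plus a fallback for inputs larger than the buffer):

#include <stddef.h>
#include <stdint.h>

/* Placeholder for the UTF-16 based library call we really want to
 * make; the name is made up for this example. */
extern void do_something_utf16(const uint16_t *s, size_t len);

/* Sketch of the "convert at the interface, no heap allocation" idea:
 * decode UTF-8 into a fixed stack buffer of UTF-16 code units and pass
 * a pointer to the API.  Assumes the input is valid UTF-8 and fits in
 * the buffer; a real converter must also reject overlong forms,
 * surrogates, stray trail bytes and truncated sequences. */
void call_with_utf8(const unsigned char *src, size_t srclen)
{
    uint16_t buf[1024];               /* reused stack buffer, no malloc */
    size_t i = 0, o = 0;

    while (i < srclen && o + 2 <= sizeof buf / sizeof buf[0]) {
        uint32_t cp;
        size_t trail;
        unsigned char b = src[i++];

        if      (b < 0x80) { cp = b;        trail = 0; }   /* 1 byte  */
        else if (b < 0xE0) { cp = b & 0x1F; trail = 1; }   /* 2 bytes */
        else if (b < 0xF0) { cp = b & 0x0F; trail = 2; }   /* 3 bytes */
        else               { cp = b & 0x07; trail = 3; }   /* 4 bytes */

        if (i + trail > srclen)
            break;                              /* truncated at the end */
        while (trail-- > 0)
            cp = (cp << 6) | (src[i++] & 0x3F);

        if (cp < 0x10000) {
            buf[o++] = (uint16_t)cp;
        } else {                                /* surrogate pair       */
            cp -= 0x10000;
            buf[o++] = (uint16_t)(0xD800 + (cp >> 10));
            buf[o++] = (uint16_t)(0xDC00 + (cp & 0x3FF));
        }
    }
    do_something_utf16(buf, o);
}

No malloc, no object construction; the buffer lives on the stack and only pointers get passed around.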
RE: GB18030
On Fri, 21 Sep 2001, Carl W. Brown wrote: Most systems that handle GB18030 will want to convert it to Unicode first to reduce processing overhead. Unless we start seeing Chinese software which is designed to utilize the compatibility between 18030 and GBK -- font rendering apps and the influence such OS-level functionality tends to have on common APIs immediately come to mind. Besides, if the Chinese for any reason get bored enough with the Unicode and/or ISO character allocation process, they might indeed start assigning some of those extra code points in 18030. If this ever happens, the incompatibility might well lead to a significant mass of software with 18030 as the primary character set. With GB18030 you sometimes have to check the first two bytes. UTF-8, for example, is also an MBCS character set, but I can still step backwards through a string. With GB18030 I must start over from the beginning of the string to find the start of the previous character. Actually I think the previous line feed will buy you a sync. Still, that is a *very* bad thing, especially since we know that many of the earlier ISO 2022-derived multibyte codings had problems with string search and similar functionality which were all but solved by UTF-8. It'd be a real shame to see progress towards encodings which force people to again devote time to something that has already been solved once. It is smaller than UTF-8 for Chinese and larger for anyone else. But you'll have to concede that that is a significant point, especially if people perceive UTF-8-coded Chinese as being unacceptably large compared to existing Chinese encodings (GB, Big Five, now 18030). A billion people, and so forth... Sampo Syreeni, aka decoy, mailto:[EMAIL PROTECTED], gsm: +358-50-5756111 student/math+cs/helsinki university, http://www.iki.fi/~decoy/front
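For what it's worth, the "going backwards" property of UTF-8 that Carl mentions really is this cheap (a sketch, assuming well-formed input):

/* Step back to the start of the previous UTF-8 character.  Trail bytes
 * always look like 10xxxxxx, so at most three of them need to be
 * skipped; no rescan from the start of the string is ever needed.
 * Assumes p > start and well-formed UTF-8. */
static const unsigned char *utf8_prev(const unsigned char *start,
                                      const unsigned char *p)
{
    do {
        p--;
    } while (p > start && (*p & 0xC0) == 0x80);
    return p;
}

GB18030 has no such self-synchronizing property, which is why the previous line feed (or some other known boundary) ends up being the practical resync point.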
Re: 3rd-party cross-platform UTF-8 support
I would like to add that ICU 2.0 (in a few weeks) will have convenience functions for in-process string transformations: UTF-16 <-> UTF-8, UTF-16 <-> UTF-32, UTF-16 <-> wchar_t* markus
RE: GB18030
I think I've figured out a way to find the beginning of a GB18030 character starting anywhere in a document. The algorithm is similar to finding the beginning of a DBCS character in that you scan backward until you find a byte that can only come at the start of a character. The main difference is that you check for being in a four-byte character first (those of the form HdHd, where H is a byte in the range 0x81 - 0xFE and d is an ASCII digit). If a four-byte character isn't involved (ordinary GB doesn't use d as a trail byte), you revert to the DBCS approach for handling the rest of GB18030. This algorithm is handy when you want to stream in a file in chunks and need to know if a chunk ends in the middle of a character. One can also solve this particular problem by keeping track of character boundaries from the start of the stream, but typically more processing is involved. Murray -Original Message- From: Carl W. Brown [mailto:[EMAIL PROTECTED]] Sent: Fri 2001/09/21 04:56 To: Charlie Jolly; [EMAIL PROTECTED] Cc: Subject: RE: GB18030 Charlie, GB18030 is designed to support all Unicode characters. It has the capacity to also encode additional characters. I know of no plans to do so. I don't think it will have much effect on Unicode. Most systems that handle GB18030 will want to convert it to Unicode first to reduce processing overhead. With most of the common MBCS code pages you can determine the length of the character from the first byte. With GB18030 you sometimes have to check the first two bytes. UTF-8, for example, is also an MBCS character set, but I can still step backwards through a string. With GB18030 I must start over from the beginning of the string to find the start of the previous character. It is smaller than UTF-8 for Chinese and larger for anyone else. Carl -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Charlie Jolly Sent: Friday, September 21, 2001 1:42 AM To: [EMAIL PROTECTED] Subject: GB18030 In what ways will this affect Unicode? Does it contain anything that Unicode doesn't?
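Here is one way to code the backward resync Murray describes at the top of his message (a rough, untested sketch of mine, not necessarily his exact algorithm; it assumes well-formed GB18030 and that the buffer itself starts on a character boundary):

#include <stddef.h>

/* GB18030 characters are 1 byte (0x00-0x7F), 2 bytes (0x81-0xFE then
 * 0x40-0xFE except 0x7F) or 4 bytes (0x81-0xFE, 0x30-0x39, 0x81-0xFE,
 * 0x30-0x39).  Bytes in 0x00-0x2F and 0x3A-0x3F can never be trail
 * bytes, so any such byte - or the start of the buffer - is a
 * guaranteed character boundary to resync from. */
static size_t gb18030_char_start(const unsigned char *buf, size_t pos)
{
    size_t anchor = pos;

    /* Back up to a byte that can only be a complete character. */
    while (anchor > 0) {
        unsigned char b = buf[anchor - 1];
        if (b <= 0x2F || (b >= 0x3A && b <= 0x3F))
            break;
        anchor--;
    }

    /* Reclassify forward from the anchor, one character at a time. */
    while (anchor < pos) {
        unsigned char b = buf[anchor];
        size_t len;

        if (b <= 0x7F)
            len = 1;
        else if (anchor + 1 >= pos)
            break;                        /* lead byte right at the cut */
        else if (buf[anchor + 1] >= 0x30 && buf[anchor + 1] <= 0x39)
            len = 4;                      /* HdHd four-byte form        */
        else
            len = 2;                      /* ordinary GBK-style pair    */

        if (anchor + len > pos)
            break;                        /* pos falls inside this char */
        anchor += len;
    }
    return anchor;    /* start of the character containing (or at) pos */
}

For the chunk-streaming case, pos is the chunk length and pos - gb18030_char_start(buf, pos) bytes have to be carried over to the next chunk. In a long run of pure hanzi the backward scan can go a long way, which is where the "previous line feed buys you a sync" observation elsewhere in the thread comes in.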
Re: 3rd-party cross-platform UTF-8 support
Mozilla also uses Unicode internally and is cross-platform. [EMAIL PROTECTED] wrote: For cross-platform software (NT, Solaris, HP, AIX), the only 3rd-party Unicode support I found so far is IBM ICU. It provides very good support for cross-platform software internationalization. However, ICU internally uses UTF-16. For our application, which uses UTF-8 as input and output, I have to convert from UTF-8 to UTF-16 before calling ICU functions (such as ucol_strcoll()). I'm worried about the performance overhead of this conversion. Then... use Unicode internally in your software. Regardless of whether you use UTF-8 or UCS-2 as the data type in the interface, eventually some code needs to convert it to UCS-2 for most of the processing. Unless you use UCS-2 internally, you need to pay for the performance, either inside the library or in your own code. Are there any other cross-platform 3rd-party Unicode libraries with better UTF-8 handling? Thanks a lot. -Changjian Sun
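For what it's worth, the conversion the original poster is worried about can stay off the heap entirely once the ICU 2.0 convenience functions mentioned elsewhere in this thread are available. A rough, untested sketch (the 256-unit stack buffers and the error handling are placeholders; real code should check for buffer overflow and retry with a larger buffer):

#include <unicode/utypes.h>
#include <unicode/ustring.h>
#include <unicode/ucol.h>

/* Compare two NUL-terminated UTF-8 strings with an already-opened ICU
 * collator, converting to UTF-16 at the interface on the stack. */
UCollationResult compare_utf8(UCollator *coll,
                              const char *s1, const char *s2)
{
    UChar u1[256], u2[256];
    int32_t len1, len2;
    UErrorCode status = U_ZERO_ERROR;

    u_strFromUTF8(u1, 256, &len1, s1, -1, &status);
    u_strFromUTF8(u2, 256, &len2, s2, -1, &status);
    if (U_FAILURE(status))
        return UCOL_EQUAL;  /* placeholder error handling for the sketch */

    return ucol_strcoll(coll, u1, len1, u2, len2);
}

The collator itself would be opened once, with something like ucol_open("zh", &status), and reused; opening it per call would dwarf any conversion cost.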
Re: GB18030
Basically, GB18030 is designed to encode the whole Unicode BMP in an encoding which is backward compatible with GB2312 and GBK. GB18030 came about because there are characters which are encoded in Unicode but not in GB2312 or GBK. Thierry Sourbier wrote: Charlie, In what ways will this affect Unicode? Does it contain anything that Unicode doesn't? I suggest that you take a look at Markus Scherer's paper "GB 18030: A mega-codepage" http://www-106.ibm.com/developerworks/library/u-china.html It will probably answer your question on the relationship between GB18030 and Unicode. Cheers, Thierry. www.i18ngurus.com - Open Internationalization Resources Directory
Re: 3rd-party cross-platform UTF-8 support
Markus Scherer wrote: I would like to add that ICU 2.0 (in a few weeks) will have convenience functions for in-process string transformations: UTF-16 <-> UTF-8, UTF-16 <-> UTF-32, UTF-16 <-> wchar_t* -- markus Wait, be careful here. wchar_t is not an encoding, so in theory you cannot convert between UTF-16 and wchar_t. You can, however, convert between UTF-16 and wchar_t* on Win32, since Microsoft declares UTF-16 to be the encoding of wchar_t. But that is not universally true: according to POSIX, different platforms can choose the size of wchar_t and the internal representation of wchar_t*.
Re: 3rd-party cross-platform UTF-8 support
Yung-Fong Tang wrote: UTF-16 <-> wchar_t* Wait, be careful here. wchar_t is not an encoding, so in theory you cannot convert between UTF-16 and wchar_t. You can, however, convert between UTF-16 and wchar_t* on Win32, since Microsoft declares UTF-16 to be the encoding of wchar_t. But that is not universally true: according to POSIX, different platforms can choose the size of wchar_t and the internal representation of wchar_t*. I know. Don't get me started on the usefulness of wchar_t... We handle this in our convenience function as best as we could figure out. That's what makes it _convenient_ ;-) [Granted, it might also not work everywhere, but it is better than nothing.] markus
Re: 3rd-party cross-platform UTF-8 support
On Fri, Sep 21, 2001 at 04:16:50PM -0700, Yung-Fong Tang wrote: Then... use Unicode internally in your software. Regardless of whether you use UTF-8 or UCS-2 as the data type in the interface, eventually some code needs to convert it to UCS-2 for most of the processing. Why? UCS-2 shouldn't be used at all, since it only covers the BMP. UTF-16 has all the problems of UTF-8, just in a more limited way. If you can deal with mixed 2-byte and 4-byte characters, you can also deal with 1-, 2-, 3- and 4-byte characters. -- David Starner - [EMAIL PROTECTED] Pointless website: http://dvdeug.dhis.org When the aliens come, when the deathrays hum, when the bombers bomb, we'll still be freakin' friends. - Freakin' Friends
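Concretely, "dealing with mixed 2-byte and 4-byte characters" in UTF-16 just means checking for a surrogate when stepping through a string (a sketch, assuming well-formed input):

#include <stddef.h>
#include <stdint.h>

/* Read one code point from a UTF-16 string and advance the index,
 * pairing up surrogates as needed. */
static uint32_t utf16_next(const uint16_t *s, size_t len, size_t *i)
{
    uint32_t c = s[(*i)++];

    if (c >= 0xD800 && c <= 0xDBFF && *i < len) {
        uint32_t lo = s[(*i)++];                   /* trailing surrogate */
        c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
    }
    return c;
}

The UTF-8 equivalent needs a four-way length check instead of a two-way one, but the shape of the code is the same.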
RE: 3rd-party cross-platform UTF-8 support
UTF-16 <-> wchar_t* Wait, be careful here. wchar_t is not an encoding, so in theory you cannot convert between UTF-16 and wchar_t. You can, however, convert between UTF-16 and wchar_t* on Win32, since Microsoft declares UTF-16 to be the encoding of wchar_t. And one can also do the same between UTF-16 and UTF-32 for glibc-based programs, since UTF-32 is the encoding of wchar_t on such platforms. The way I read that was UTF-16 <-> UTF-(8*sizeof(wchar_t)). (Please don't ask what happens when sizeof(wchar_t) is 3 or larger than 4, you know what I mean :)). I guess the responsibility for this being a meaningful conversion would be with the caller. YA PS: I don't know a way of knowing the encoding of wchar_t programmatically. Is there one? That'd offer some interesting possibilities..
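A partial answer, as far as I know (my addition, not something anyone in the thread mentioned): C99 says an implementation defines __STDC_ISO_10646__ (as a date of the form yyyymmL) when wchar_t values are ISO/IEC 10646 code positions. Combined with sizeof(wchar_t), that at least distinguishes a UCS-4 wchar_t (e.g. glibc) from everything else; it does not identify UTF-16 or any other wchar_t encoding. A trivial check:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
#ifdef __STDC_ISO_10646__
    printf("wchar_t holds ISO 10646 code positions (%ld), sizeof = %u\n",
           (long)__STDC_ISO_10646__, (unsigned)sizeof(wchar_t));
#else
    printf("wchar_t encoding unknown, sizeof = %u\n",
           (unsigned)sizeof(wchar_t));
#endif
    return 0;
}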