RE: any unicode conversion tools?

2004-05-13 Thread Kent Karlsson
Peter Constable wrote: UTF-8 sequences, as originally defined, could be longer than four bytes, in order to address codepoints in the vast expanse of UCS-4 at U+110000..U+FFFFFFFF. Since the accepted code space has been constrained to U+0000..U+10FFFF, only four bytes are needed. There

UTF-8 nitpicking (was: RE: any unicode conversion tools?)

2004-05-13 Thread Kenneth Whistler
Kent, It's time to nitpick the nitpicker. ;-) 1. UCS-4, which is still defined by 10646 (but never by Unicode) is limited at U-7FFF U-7FFF (~ U7FFF ~ 7FFF ~ -7FFF [!]) The space in U-7FFF is a Swedishism, not specified in the standard. The U and the - are

Re: UTF-8 nitpicking (was: RE: any unicode conversion tools?)

2004-05-13 Thread jcowan
Kenneth Whistler scripsit: It was only with Unicode 3.0 (and the correlated 10646-1:2000) that this was rationalized to the Unicode definition of UTF-8 formally consisting of only 1-4 bytes sequences, while simultaneously the potential need for 5 and 6-byte sequences in 10646 was removed,
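
For readers who want to see where the 5- and 6-byte forms came from, here is a minimal sketch of the sequence lengths under the original (pre-restriction) UTF-8 scheme, as I understand it from RFC 2279. The function name `original_utf8_length` is made up for illustration and is not part of any library.

```python
def original_utf8_length(code_point: int) -> int:
    """Bytes needed for a code point under the original 1-6 byte UTF-8 scheme.

    Hypothetical helper for illustration only; the limits follow the
    payload bits available in each sequence length (7, 11, 16, 21, 26, 31).
    """
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    limits = [
        (0x7F, 1),
        (0x7FF, 2),
        (0xFFFF, 3),
        (0x1FFFFF, 4),
        (0x3FFFFFF, 5),
        (0x7FFFFFFF, 6),
    ]
    for upper_bound, length in limits:
        if code_point <= upper_bound:
            return length
    raise ValueError("beyond the 31-bit UCS-4 range")


for cp in (0x2F, 0x10FFFF, 0x200000, 0x7FFFFFFF):
    print(f"{cp:X}: {original_utf8_length(cp)} byte(s)")
```

Note that four bytes already reach past 10FFFF (up to 1FFFFF), so restricting the code space to 10FFFF is what makes the longer forms unnecessary.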

Re: UTF-8 nitpicking (was: RE: any unicode conversion tools?)

2004-05-13 Thread Kenneth Whistler
John Cowan asked: Tell us, O Keen-Eyed Peerer Into The Future: is there any hope that the code space above 10FFFF will ever be removed from 10646, so that the "Unicode's a subset of 10646" meme can be stomped once and for all? I grow weary of explaining this pointless difference. Anything is

Re: any unicode conversion tools?

2004-05-07 Thread Chris Jacobs
- Original Message - From: Chan Fook Sheng [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Friday, May 07, 2004 10:03 AM Subject: any unicode conversion tools? Hi I am looking for unicode, utf-8 conversion tools for windows platform, but can't find any on the web. can anyone

Re: any unicode conversion tools?

2004-05-07 Thread Philippe Verdy
From: Chan Fook Sheng [EMAIL PROTECTED] I am looking for unicode, utf-8 conversion tools for windows platform, but can't find any on the web. can anyone direct me to some links? for example: the / character is 47 in decimal, 2F in hex. it can be represented in UTF-8 format as: 1 byte:

Re: any unicode conversion tools?

2004-05-07 Thread John Cowan
Philippe Verdy scripsit: A free converter tool exists in the Java SDK for Windows: look for native2ascii. Beware of trying to use this as a general converter: it's meant only for Java code, or code from a closely related programming language. In particular, it treats strings inside or ''

Re: any unicode conversion tools?

2004-05-07 Thread Clark Cox
On May 07, 2004, at 08:08, Philippe Verdy wrote: From: Chan Fook Sheng [EMAIL PROTECTED] I am looking for unicode, utf-8 conversion tools for windows platform, but can't find any on the web. can anyone direct me to some links? for example: the / character is 47 in decimal, 2F in hex. it can

Re: any unicode conversion tools?

2004-05-07 Thread Rick McGowan
See also http://www.unicode.org/review/index.html#pri33 Rick

Re: any unicode conversion tools?

2004-05-07 Thread Jon Hanna
it can be represented in UTF-8 format as: 1 byte: still 2F 2 bytes: C0 AF (illegal) 3 bytes: E0 80 AF (illegal) Thanks for keeping the indication that the last two are illegal with UTF-8. But you should have better never listed them (even if there still exists some legacy
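
The point about the overlong forms being illegal can be checked quickly. This is only a sketch, assuming Python 3 and its built-in strict `utf-8` codec; the helper name `is_valid_utf8` is made up for illustration.

```python
# The three byte sequences for "/" (U+002F) quoted in the thread.
candidates = {
    "1 byte": bytes([0x2F]),
    "2 bytes": bytes([0xC0, 0xAF]),
    "3 bytes": bytes([0xE0, 0x80, 0xAF]),
}


def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


results = {label: is_valid_utf8(seq) for label, seq in candidates.items()}
for label, ok in results.items():
    print(label, "valid" if ok else "illegal")
```

Only the shortest form is accepted; a conforming decoder has to reject the longer ones rather than silently map them to "/".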

RE: any unicode conversion tools?

2004-05-07 Thread Peter Constable
UTF-8 encoded sequences can be up to 5 bytes long... How is that possible. I was under the impression that a UTF-8 sequence could never be more than 4 bytes (i.e. U+10FFFF becomes F4 8F BF BF). Philippe chastised Chan for mentioning illegal sequences, but then went on to make
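
The four-byte claim is easy to confirm. A small sketch, assuming Python 3, where `chr` accepts code points only up to 10FFFF; the variable names are illustrative.

```python
# Encode the highest code point in the Unicode code space.
highest = 0x10FFFF
encoded = chr(highest).encode("utf-8")
hex_bytes = " ".join(f"{b:02X}" for b in encoded)
print(hex_bytes)  # F4 8F BF BF

# Anything above the code space is rejected before encoding is even attempted.
try:
    chr(highest + 1)
    beyond_accepted = True
except ValueError:
    beyond_accepted = False
print(beyond_accepted)  # False
```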

Re: any unicode conversion tools?

2004-05-07 Thread Stefan Persson
Clark Cox wrote: Note also that UTF-8 encoded sequences can be up to 5 bytes long... How is that possible. I was under the impression that a UTF-8 sequence could never be more than 4 bytes (i.e. U+10FFFF becomes F4 8F BF BF). Unicode and ISO/IEC 10646 define UTF-8 differently; Unicode stops at

RE: any unicode conversion tools?

2004-05-07 Thread Peter Constable
UTF-8 sequences, as originally defined, could be longer than four bytes, in order to address codepoints in the vast expanse of UCS-4 at U+110000..U+FFFFFFFF. U+FFFFFFFF or U+7FFFFFFF? (not nit-picking, genuinely unsure). Thanks to Jon Hanna for catching this: it was U+7FFFFFFF. Peter