Re: C # character model

2000-06-27 Thread Markus Scherer
John O'Conner wrote: It appears that this new product is not adopting UTF-32...and is sticking with UTF-16 (or more appropriately UCS-2?). APIs use and return single 16-bit values. This certainly doesn't make surrogate-pair values easy to use. What influence, if any, does this have on the

Re: 2 dumb questions: Plane 14 and codepages

2000-06-30 Thread Markus Scherer
rm originate? It seems to have a hardware flavor, as if an old piece of display hardware had selectable ROMed fonts. i am guessing it is from printed manuals? Mike Newhall AltaVista markus scherer ibm (icu)

Re: Using Unicode in XML

2000-07-15 Thread Markus Scherer
please note that the xml spec has an "errata" list with 67 items that substantially update the spec. there, it now also recommends (though it does not force) for xml clients to recognize u+feff for utf-8 (the bytes ef bb bf) and many other byte combinations. there is a link to the errata at the

Re: Using Unicode in XML

2000-07-17 Thread Markus Scherer
"Michael (michka) Kaplan" wrote: there, it now also recommends (though it does not force) for xml clients to recognize u+feff for utf-8 (the bytes ef bb bf) and many other byte combinations. there is a link to the errata at the beginning of the xml spec. Where do you see this? The list

Re: FW: quick question about Wireless Application Protocol (WAP)

2000-07-17 Thread Markus Scherer
hi, i have finally looked around for an answer to this, i hope it is still relevant - in the specification document for the wireless markup language (wml) at http://www1.wapforum.org/tech/documents/SPEC-WML-19991104.pdf it says in chapter 6 that wml is based on xml and therefore uses xml's

Re: Using unicode in a Java program

2000-07-19 Thread Markus Scherer
William Overington wrote: (hexadecimal) 109. From something I saw a long time ago, before I started learning Java, I think that I need to put something like \u0109 into the program somewhere, though whether it is \u0109 or "\u0109" in quotation marks or whatever I do not know. you got it.

Re: Unicode FAQ addendum

2000-07-19 Thread Markus Scherer
John Cowan wrote: The new Unicode FAQ (like the old) supplies the panting world with John's Own Version of Unicode Conformance: some of the old ones seem to be pre-unicode 1.1. should they not be updated? 1) Unicode code units are 16 bits long; deal with it. this is true for the default

Re: Unicode FAQ addendum

2000-07-20 Thread Markus Scherer
Becker, Joseph wrote: terminology in an informal statement, I wouldn't have a problem with the simple update: 1) Unicode code units are not 8 bits long; deal with it. how about: 1) Unicode code units are not necessarily 8 bits long [wide], code points use 21 bits; deal with it. rationale:

Re: Question regarding bidirectional algorithm

2000-07-27 Thread Markus Scherer
David Tooke wrote: The bidirectional algorithm mentions mirrored glyphs. The reference code handles them by replacing these characters with their mirror image. Is this the preferred method of doing this? If so, is there any where in the Unicode database that correlates the two

IANA charset registration for SCSU

2000-08-11 Thread Markus Scherer
Hello, I proposed SCSU (as described in UTR 6) for registration as a charset with IANA (as "SCSU" with no aliases). Good news: it was approved on 2000-jul-19. Bad news: The publication of IANA registrations is currently being redesigned and re-staffed, therefore nothing has been and will be

Re: FW: Date Controls

2000-08-17 Thread Markus Scherer
we have a c/c++ library and java classes that might be interesting here: icu (icu4c) provides a lot of locale information. see http://oss.software.ibm.com/developerworks/opensource/icu/localeexplorer and http://oss.software.ibm.com/icu/ the java classes, together with what is in the jdk, are

IANA Character Set registration for SCSU

2000-09-05 Thread Markus Scherer
Hello, the IANA charset list is nicely refurbished, updated, and now lists SCSU. See below. markus Original Message From: "IANA" [EMAIL PROTECTED] Subject: Character Set registration To: [EMAIL PROTECTED] Dear Markus, We sincerely apologize for the delay. We have registered

Re: [unicode] More ways to encode U+FEFF (was: Re: Designing a

2000-09-06 Thread Markus Scherer
of this list, only UTF-EBCDIC is a viable encoding form. the others are either deprecated, never made it beyond draft, or are unofficial discussion pieces that never made it anywhere (i proposed one of them :-). if you detect all the big- and little-endian boms for the standard forms utf-8,

Re: Unicode on a website

2000-09-22 Thread Markus Scherer
plenty of people responded about trade-offs between utf-8 page size and conversion overhead - there is one more thing: scsu would work well as an html/xml encoding and is easily decoded without bulky tables. it can be similarly compact to language-specific codepages. so, how do we get scsu

need translation of gb 18030 standard

2000-09-29 Thread Markus Scherer
Hello, I am looking for an English (or German or possibly French) translation of the Chinese standard GB 18030 for the new mandated codepage. Is there anything available? For the purpose of implementing it, I would like to see the closest to first-hand information. Actually, where can I find

Re: Cyrillic -

2000-09-29 Thread Markus Scherer
hello, for fonts etc. have a look at http://www.unicode.org/unicode/onlinedat/resources.html for converting your pages to unicode, you would need some library or operating system api to do so. there are plenty around, but you would have to find out exactly what is the encoding of your pages.

locale data on the web

2000-09-29 Thread Markus Scherer
Hello, we are collecting locale data for a number of new locales for ICU. If you think you might have feedback for them, please have a look at the italics locale IDs in http://oss.software.ibm.com/developerworks/opensource/icu/localeexplorer/ The new locales include Afrikaans, Basque,

Re: lag time in Unicode implementations in OS, etc?

2000-10-12 Thread Markus Scherer
sorry for responding to an old thread - comment below. markus Chris Pratley wrote on 2000-oct-03: Surrogate support was not turned on by default in Win2000 because the Windows team was waiting for the standard to be finalized. It was also added late, so to reduce the potential impact they had

Re: lag time in Unicode implementations in OS, etc?

2000-10-12 Thread Markus Scherer
so, what is there to be turned on and off in win2k if surrogate pairs are already handled as single units? if fonts just don't contain mappings and glyphs for pairs, then the layout engine will ignore them anyway until fonts provide that data. markus John McConnell wrote: Windows 2000

Re: OT: Relevance of Locale data?

2000-10-31 Thread Markus Scherer
Hi, for Java and ICU locales, if you don't specify something, you will get the string from the default/root locale. This would typically be in English. If you don't want English _and_ don't want to create 15 locales, then how about using number strings, like "10" for "October", or roman

Re: Java and Unicode

2000-11-16 Thread Markus Scherer
Juliusz Chroboczek wrote: I believe that Java strings use UTF-8 internally. .class files use a _modified_ utf-8. at runtime, strings are always in 16-bit unicode. At any rate the internal implementation is not exposed to applications -- note that `length' is a method in class String (while

Re: UTF-8 Corrigendum, new Glossary

2000-11-30 Thread Markus Scherer
Kevin Bracey wrote: I find this silly. That creation of such forms would be forbidden I can see and agree to. But interpretation? I understand the reasoning when security is an issue. But why make it flat illegal? There are many applications where such a sequence poses no security danger.

gb 18030 mapping available

2000-12-01 Thread Markus Scherer
Hi all, Yesterday, I received the re-released mapping table for GB 18030. I updated the ICU implementation with this, in time for ICU 1.7. See http://oss.software.ibm.com/cvs/icu/~checkout~/icu/source/tools/makeconv/gb18030/gb18030.html#officialdata George and I also converted the mapping data

ICU 1.7 released

2000-12-15 Thread Markus Scherer
ICU 1.7 is released! See http://oss.software.ibm.com/icu/download/1.7/ Summary of changes: - Collation performance improved - New conversion support: + ISO-2022-JP/CN/KR with extensions + GB 18030 + HZ + UTF-32 (incomplete) - code/data library names contain version numbers - Debian

Re: UNICODE application on IBM Mainframe

2001-01-23 Thread Markus Scherer
I would like to add one item to this discussion: Recently, someone from the IBM S/390 group told me that they had decided to store and use Unicode on S/390 as UTF-8/16/32. They will not use UTF-EBCDIC. I am not aware of anyone inside or outside of IBM who does use UTF-EBCDIC. (There is another

Re: PDUTR #27: Unicode 3.1

2001-01-23 Thread Markus Scherer
ICU stores most UnicodeData.txt properties in its uprops.dat, currently some 23kB (Unicode 3.0). This does not include character names, which are in unames.dat, currently some 83kB. There is currently a bug about wrong properties for the last 1k chars in planes 15 and 16 (I will try to fix this

Re: An Aburdly Brief Introduction to Unicode (was Re: Perception ...)

2001-02-22 Thread Markus Scherer
Tom Lord wrote: Two code points represent non-characters. These are U+FFFE and U+FFFF. Programs are free to give these values special meaning internally. Unicode (2.0 and up?) has 34 non-characters at U+xxFFFE and U+xxFFFF where xx is 00, 01, .., 0F, 10. Unicode 3.1 is adding another 32
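The non-character counts in the post above can be checked with a small sketch. The 34 code points at U+xxFFFE and U+xxFFFF come straight from the post; the range U+FDD0..U+FDEF for the 32 added in Unicode 3.1 is supplied here from the standard, since the snippet is cut off before naming it. The function names are illustrative only.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the Unicode non-characters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % cp)
    # The 32 BMP non-characters added in Unicode 3.1 (range assumed from the standard).
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of each of the 17 planes: U+xxFFFE and U+xxFFFF.
    return (cp & 0xFFFE) == 0xFFFE


def count_noncharacters() -> int:
    """Count non-characters across the whole code space (planes 0..16)."""
    return sum(is_noncharacter(cp) for cp in range(0x110000))
```

Counting over all 17 planes gives 2 x 17 = 34 plane-final non-characters plus the 32 in the BMP block.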

Re: Unicode complaints

2001-03-15 Thread Markus Scherer
"Michael (michka) Kaplan" wrote: And your suggestion for characters that sort *differently* in different locales? You would want to add them multiple times? Obviously not. Locale-sensitive collation is an independent issue, and, of course, we provide it - now based on the UCA. For collation

Re: Microsoft Word Query

2001-03-19 Thread Markus Scherer
Sam Chapman wrote: character | UNICODE | converted to " 84 22 ... 85 2E 2E 2E ' 91 27 ' 92 27 " 93 22 " 94 22 - 96 2D - 97 2D (tm)99 28 74 6D 29 (tm) Note that these are _not_ Unicode

Re: Unicode encoding forms in web development

2001-03-20 Thread Markus Scherer
For HTML, use UTF-8. For XML, use UTF-8 or UTF-16. US-ASCII and ISO 8859-1 are also acceptable, either if your actual character needs are limited to their repertoires or with numeric character references. If you know the sender and receiver and you have a low-bandwidth application, consider

[unicode] Re: removing compromises from unicode (WCode)

2001-03-21 Thread Markus Scherer
John Cowan wrote: The result is a back-to-the-principles "WCode", nicely streamlined: - no compatibility or precomposed characters But less compact. Without precomposed characters, the overhead of conversion from old character sets grows considerably. True. Compactness was not a goal

Re: Sun's Java encodings vs IANA's character set registry

2001-04-13 Thread Markus Scherer
It looks to me like the "Cp" names might be IBM CCSIDs. For those, have a look at the "ibm-" names in ICU's alias table at http://oss.software.ibm.com/cvs/icu/~checkout~/icu/data/convrtrs.txt Note that ICU uses "cp" to mean Microsoft codepage numbers. Note also that even IBM changes some of

Re: Byte Order Marks

2001-04-19 Thread Markus Scherer
There is an RFC about UTF-16 that explains this: If the text is labeled by the protocol as charset=UTF-16 then the first two bytes are the byte order mark charset=UTF-16BE then it is big-endian and the first two bytes are just text charset=UTF-16LE then it is little-endian and the first two
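The three labeling cases in the post above can be sketched as follows. This is a minimal illustration, not ICU's behavior: the function name is made up, and the fallback to big-endian for an unmarked "UTF-16" stream is my reading of the RFC (RFC 2781), which matches the "assume BE" remark quoted in the follow-up post.

```python
def decode_labeled_utf16(data: bytes, charset: str) -> str:
    """Decode bytes according to the protocol-level charset label."""
    label = charset.upper()
    if label == "UTF-16BE":
        # Byte order is fixed by the label; a leading FE FF is just text (U+FEFF).
        return data.decode("utf-16-be")
    if label == "UTF-16LE":
        # Same for little-endian: a leading FF FE is kept as U+FEFF.
        return data.decode("utf-16-le")
    if label == "UTF-16":
        # Here the first two bytes are the byte order mark, if present.
        if data[:2] == b"\xfe\xff":
            return data[2:].decode("utf-16-be")
        if data[:2] == b"\xff\xfe":
            return data[2:].decode("utf-16-le")
        # Assumption: no BOM means big-endian.
        return data.decode("utf-16-be")
    raise ValueError("unsupported charset label: %r" % charset)
```

Note the difference for the same bytes: labeled "UTF-16", FE FF is consumed as a signature; labeled "UTF-16BE", it survives as a character in the text.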

Re: Byte Order Marks

2001-04-19 Thread Markus Scherer
Yves Arrouye wrote: If you don't have any clue about the byte order, but you know it is UTF-16, then assume BE. Then why is ICU mapping UTF-16 to UTF16_PlatformEndian and not UTF16_BigEndian? ICU does not do Unicode-signature or other encoding detection as part of a converter. When you

Re: Byte Order Marks

2001-04-20 Thread Markus Scherer
Yves, we are thinking about a general API for encoding detection that could initially just check for BOM/Unicode signatures. I believe we have a feature request for this already. Mark and I just brainstormed about what we may want an API to look like. The reason for doing what ICU is doing

Re: Unicode in a URL

2001-04-26 Thread Markus Scherer
Paul Deuter wrote: I am wondering if there isn't a need for the Unicode Spec to also dictate a way of encoding Unicode in an ASCII stream. Perhaps How many more ways do we need? To be 8-bit-friendly, we have UTF-8. To get everything into ASCII characters, we have UTF-7. W3C specifies to use

Re: UTC Agenda Item : UTF-8S

2001-05-15 Thread Markus Scherer
3 comments: 1. Binary order of UTF-16 strings compatible with binary order of UTF-8/32 is easily achieved using the fix-up described in my article on developerWorks (there is currently a problem with that site). Essentially, one rotates the 16-bit values so that the surrogates get to the top
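A rough sketch of the fix-up in comment 1, as I understand it from the one-sentence description: code units below the surrogate range stay put, surrogates move to the top of the 16-bit range, and U+E000..U+FFFF slide down into the gap. The exact offsets and all names here are my own choice, not taken from the developerWorks article.

```python
import struct


def utf16_units(s: str) -> list:
    """Return the UTF-16 code units of s as a list of ints."""
    raw = s.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))


def fixup(unit: int) -> int:
    """Rotate one code unit so that surrogates sort above everything else."""
    if unit < 0xD800:
        return unit            # below the surrogates: unchanged
    if unit < 0xE000:
        return unit + 0x2000   # surrogates D800..DFFF -> F800..FFFF
    return unit - 0x0800       # E000..FFFF -> D800..F7FF


def code_point_order_key(s: str) -> list:
    """Sort key over UTF-16 code units that yields code point order."""
    return [fixup(u) for u in utf16_units(s)]
```

Sorting with `key=code_point_order_key` should then agree with the binary order of the same strings in UTF-8 or UTF-32, whereas sorting on the raw code units puts supplementary characters before U+E000..U+FFFF.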

Re: [OT] bits and bytes

2001-05-18 Thread Markus Scherer
[EMAIL PROTECTED] wrote: the smallest and largest size code units ever used for representing character data? Teletype machines commonly use a 5-bit code (Baudot, International Alphabet Nr. 2). It has Shift-In/Shift-Out codes to switch between an alphabetic default level and a level with

Re: UCN (Java) notation beyond the BMP

2001-05-23 Thread Markus Scherer
[EMAIL PROTECTED] wrote: Is there a currently accepted format for Universal Character Names ... for the Unicode characters beyond U+FFFF? Not in Java. In C99, there is \U (8 hex digits). markus

Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-06-05 Thread Markus Scherer
Personally, I find it interesting to see which and how many characters are affected by the difference in binary ordering between UTF-8 and UTF-16. Affected are all code points in two ranges: U+e000..U+ffff and U+10000..U+10ffff. The second range contains assignments for characters that are

Re: UTF-8

2001-06-12 Thread Markus Scherer
Bill Kurmey wrote: Will the Unicode version of UTF-8 be registered with IANA and, if so, what will be its charset designation? I believe this question is based on a misunderstanding: 6-byte sequences have been mentioned in this discussion. The intended meaning was pairs of 3-byte sequences

ICU 1.8.1 and ICU4J 1.3.1 released - with new license

2001-06-13 Thread Markus Scherer
* License Change We are pleased to announce that the ICU projects are changing to the X open-source license. This change is at the request of many people involved in major open source software projects such as Linux, Perl, and Gnome. It allows ICU to be incorporated into a wide variety of

Re: FSS-UTF, UTF-2, UTF-8, and UTF-16

2001-06-18 Thread Markus Scherer
There is one statement that appears to want to be framed: Jianping Yang wrote: [...] When Unicode came to version 2.1, we found our AL24UTFFSS had trouble for 2.1 as Hangul's reallocation, and we could not simply update AL24UTFFSS to 2.1 definition as it would mess existing users' data in

Re: converting ISO 8859-1 character set text to ASCII (128)charactet set

2001-06-20 Thread Markus Scherer
cls raj wrote: We have a specific requirment of converting Latin -1 character set ( iso 8859-1 ) text to ASCII charactet set ( a set of only 128 characters). 8859-1 is a superset of ASCII (of US-ASCII, to be precise, but you seem to be using that). US-ASCII uses byte values 0..127 (7 bits),

The perfect solution for the UTF-8/16 discussion

2001-06-21 Thread Markus Scherer
Abolish all in-process Unicode encodings except UTF-16. If everyone uses the same encoding form then there is no problem with different string lengths, results of binary comparisons, etc. Once we are here, abolish all little-endian UTF-16 implementations. This will save a lot of byte swapping,

Re: UTF-17

2001-06-21 Thread Markus Scherer
Nice, but you have the same kind of shortest-form problem as in UTF-8: 38 30 30 30 30 30 30 30 could be mis-interpreted by a lenient decoder as U+. Ts, ts... At least it sorts binary in code point order. markus

Re: DUDE-8, a compression proposal

2001-06-25 Thread Markus Scherer
John Cowan wrote: 5. Emit all non-zero bytes. Do you mean omit leading zeroes and emit following bytes? You would not want to emit all but a middle byte, right? markus

Re: Unicode transliterations (and other operations)

2001-07-03 Thread Markus Scherer
Looks interesting. How are you approaching the complication that transliteration is between pairs of languages? I know what you mean: Gorbachev is Gorbatschow in German. I think that the rules that we have in ICU are probably English-centric where it makes a difference. Note that some of

Re: Unicode in Asia Question

2001-08-01 Thread Markus Scherer
You can go and try using a web server that works internally in 16-bit Unicode (UTF-16) and serves web pages in many languages in either UTF-8 (default) or many other codepages. (Now that ICU was mentioned already...) Go to

important notice about icu gb 18030 xml table

2001-08-22 Thread Markus Scherer
To anyone who is looking at the GB 18030 xml mapping file in our charset repository at http://oss.software.ibm.com/cvs/icu/charset/data/xml/gb-18030-2000.xml We found - thanks to Carl Brown's questions - that we had checked in an incorrect file, different from the correct file that we had

Re: Using a polyglot compatibility section in a DVB-MHP program

2001-08-28 Thread Markus Scherer
You are on the right track: ResourceBundle's allow you to store strings and all kinds of objects separate from the program code. Locale's are used to identify which resource bundles (among other things) to select, based on language and region/country, mostly. You will also need MessageFormat

ICU demos back online

2001-08-29 Thread Markus Scherer
Dear friends of ICU, The Locale Explorer and the Unicode Browser demos are back online, but at new URLs. Please visit http://oss.software.ibm.com/icu/demo/ . The transliteration demo is still not working on the new server. Sorry. markus

Re: UTF-8 on NT

2001-09-04 Thread Markus Scherer
No. On Windows NT/2000/XP/CE, everything is UTF-16 Unicode, for all locales. Locales and codepages are separate, as they should be. You should compile your programs with UNICODE and _UNICODE defined to use the native Unicode kernel functions. UTF-8 is not possible - as far as I know - as a

Re: PDUTR #26 posted

2001-09-17 Thread Markus Scherer
One technical nit: The document says: 2.1 c. The bit pattern 11110xxx is illegal in any CESU-8 byte, ... In fact, this should say The bit patterns 1111xxxx are illegal in ... The changes are subtle: one '0' replaced by 'x' - you want to forbid all bytes >=0xf0 (f0..ff), not just f0..f7. (The
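The difference between the two wordings is easy to see in code. The sketch below contrasts a check for only f0..f7 with the broader check for f0..ff that the post asks for; both function names are made up for illustration.

```python
def has_byte_f0_to_f7(data: bytes) -> bool:
    """Narrow rule: flags only the lead bytes f0..f7."""
    return any((b & 0xF8) == 0xF0 for b in data)


def has_byte_f0_to_ff(data: bytes) -> bool:
    """Broad rule: flags every byte f0..ff, as the post suggests."""
    return any((b & 0xF0) == 0xF0 for b in data)
```

A stray byte such as f8 or ff slips past the narrow check but is caught by the broad one. A supplementary character encoded as a pair of 3-byte sequences, which is how CESU-8 represents it, passes both.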

Re: UCS-2 to UTF-8 hex values

2001-09-19 Thread Markus Scherer
I have written some time ago a little C program that generates such a list for all of Unicode, not just the UCS-2 subset. It uses macros from a few ICU header files, but does not need the compiled ICU library. On Unixes, you may need to runConfigure, but on Windows it will work out of the

Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Markus Scherer
I would like to add that ICU 2.0 (in a few weeks) will have convenience functions for in-process string transformations: UTF-16 <-> UTF-8, UTF-16 <-> UTF-32, UTF-16 <-> wchar_t* markus

Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Markus Scherer
Yung-Fong Tang wrote: UTF-16 <-> wchar_t* Wait be careful here. wchar_t is not an encoding. So.. in theory, you cannot convert between UTF-16 and wchar_t. You, however, can convert between UTF-16 and wchar_t* ON win32 since microsoft declare UTF-16 as the encoding for wchar_t.

Re: GB18030

2001-09-24 Thread Markus Scherer
Yung-Fong Tang wrote: bascillay GB18030 is design to encode All Unicode BMP in a encoding which is backward compatable with GB2312 and GBK. Correction: to encode _all_ of Unicode, not just all Unicode BMP - GB 18030 covers all 17 planes, not just the BMP. markus

Re: plane business

2001-10-05 Thread Markus Scherer
Asmus Freytag wrote: Designation changed twice in Unicode, once to designate the surrogates, and once to designate the 32 characters on the BMP as non-characters. Designation also changed between Unicode 1.1 and 2.0 to move around the Private-Use and Hangul blocks, and to add the Plane-16/17

Re: plane business

2001-10-05 Thread Markus Scherer
Bernard Miller wrote: I don't understand this, the arabic non characters are supposed to REPRESENT the hidden non characters? no, they are unrelated and additional. markus

Re: Character encoding at the prompt

2001-10-25 Thread Markus Scherer
On the DOS prompt of Windows NT4/2000/XP, you should be able to get 16-bit Unicode with chcp 1. markus

Re: ISCII-Unicode Conversion

2001-11-06 Thread Markus Scherer
Below is Ken Whistler's historical table of codespace counts, extracted from the message Is 879,309 enough? and slightly refurbished, followed by the Unicode 3.2 codespace allocation map (cf. http://www.unicode.org/roadmaps ). Your terminology may differ, but here's what's used below: BMP

Re: USV to UTF-8 mapping

2001-11-14 Thread Markus Scherer
[EMAIL PROTECTED] wrote: ... Else if U+0800 <= U <= U+D7FF, or if U+E000 <= U <= U+FFFF, then C1 = U \ x1000 + xE0 C2 = (U mod x1000) \ x40 + x80 C3 = U mod x40 + x80 Else if U >= U+FFFF, then This looks like it includes U+FFFF in two branches. Well, you catch U+FFFF
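The three-byte branch quoted above translates almost directly into code, reading `\` as integer division and `mod` as remainder. This is only a sketch of that one branch; the function name is made up, and the upper bound U+FFFF is the standard limit for three-byte UTF-8 sequences.

```python
def utf8_three_bytes(u: int) -> bytes:
    """Encode a BMP code point from the three-byte range as UTF-8."""
    if not (0x0800 <= u <= 0xD7FF or 0xE000 <= u <= 0xFFFF):
        raise ValueError("U+%04X is not in the three-byte range" % u)
    c1 = u // 0x1000 + 0xE0            # top 4 bits
    c2 = (u % 0x1000) // 0x40 + 0x80   # middle 6 bits
    c3 = u % 0x40 + 0x80               # low 6 bits
    return bytes([c1, c2, c3])
```

For U+20AC this gives E2 82 AC, the same bytes as `"\u20ac".encode("utf-8")`. Surrogates (U+D800..U+DFFF) are excluded by the range test, matching the gap in the quoted condition.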

Re: GB18030

2001-09-27 Thread Markus Scherer
Yung-Fong Tang wrote: ... But you still need to know what U+4ff3a to define such mapping table, right? Wrong. You just need to know the mapping between code points, whether assigned, used, or whatever. ... So, whatever the software the user currently have today, without an upgrade (either

Re: surrogate at java's property file

2001-10-08 Thread Markus Scherer
For Java, the support for supplementary characters is actually better than one might think. It is true that the char type and the Character class only support 16-bit code units. However, storing UTF-16 strings in String objects and char[] arrays and passing code points as int's in non-JDK
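The approach described above, UTF-16 code units in strings and whole code points passed around as ints, rests on the surrogate-pair arithmetic. A language-neutral sketch in Python follows (the post itself is about Java; the function names are illustrative):

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a supplementary code point into (lead, trail) UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("U+%04X is not a supplementary code point" % cp)
    offset = cp - 0x10000
    lead = 0xD800 + (offset >> 10)      # high 10 bits
    trail = 0xDC00 + (offset & 0x3FF)   # low 10 bits
    return lead, trail


def from_surrogate_pair(lead: int, trail: int) -> int:
    """Combine a lead and a trail surrogate back into one code point."""
    if not (0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)
```

For example, U+1D11E splits into D834 DD1E, and the pair combines back to the same code point.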

ICU 2.0 released

2001-12-05 Thread Markus Scherer
ICU 2.0 is released - both ICU4C and ICU4J! For the full announcement please see the press release at http://oss.software.ibm.com/icu/press.html For details about the ICU4C 2.0 release see http://oss.software.ibm.com/icu/download/2.0/ For details about the ICU4J 2.0 release see

Re: Fun with GBK GB2312

2002-01-04 Thread Markus Scherer
We have published mapping data for Windows cp936 from the actual Windows 2000 converter API. This is probably more up to date and complete than what is listed on the unicode.org site. Of course, these tables also only show correspondences with Unicode, but a) they also show unidirectional

Re: FW: glyph bitmaps

2002-01-08 Thread Markus Scherer
Dear Andrew, In order to get useful responses, you will need to specify your request further: You will need to specify what set of characters/scripts/glyphs you need to display. For example, just Latin and/or Greek and/or Cyrillic etc. Note that East Asian scripts will display very poorly in a

Re: Fun with UDCs in Shift-JIS

2002-01-17 Thread Markus Scherer
Lars Marius Garshol wrote: I've just discovered that it seems that Shift-JIS encodes a number of User-Defined Characters in the 0xF040 to 0xFCFC range, and that these Yes, and every implementor may assign characters to them as they see fit. characters are used in web pages. Does anyone

New Unicode Encoding/Compression: BOCU-1

2002-02-06 Thread Markus Scherer
Hello, Mark Davis and I developed a concrete, MIME-friendly version of the BOCU algorithm that we presented earlier (http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html). We have a summary and spec with sample code at

Re: ICU website

2002-02-11 Thread Markus Scherer
The ICU website continues to be at http://oss.software.ibm.com/icu/ I suppose that there was a temporary networking problem somewhere. The fact that you sometimes see internal machine names like www-124 or www-126 is due to some misconfiguration. The ICU team does not itself control the server

Re: Smiles, faces, etc

2002-02-14 Thread Markus Scherer
Falkor wrote: Like 'em or hate 'em, those :) are here to stay. ...and there's at Probably, although the more people from outside the computer-tech world join in, the smaller percentage of people will use these, like my mother-in-law... They are already encoded in Unicode, using two or

Re: Unicode and end users - UTF-8B

2002-02-19 Thread Markus Scherer
Lars Kristan wrote: ... The same thing should work the other way around, store Windows filenames directly into a UTF-16 database and use UTF-8 <=> UTF-16 conversion for UNIX filenames. Hoping that some day most of the data will be UTF-8 makes this even more appealing. As for any data that is

Re: CRLF vs. LF (was Re: Unicode and end users)

2002-02-26 Thread Markus Scherer
Doug Ewell wrote: SC UniPad can read and write text files: - using LF, CR, CRLF, or LS (U+2028); Great, and I know about UniPad, but more people have Windows Notepad and other system-level editors. Why does UniPad not support NL and PS? One thing it cannot do is maintain different line

Re: CRLF vs. LF (was Re: Unicode and end users)

2002-02-27 Thread Markus Scherer
Doug Ewell wrote: Paragraph breaking implies that line breaking is also performed, and that the two are different somehow. LS and PS probably should not be treated as synonyms. Right, but we are talking about plain text editors here. I would expect a plain text editor to treat LS and PS

Re: All-kana documents

2002-03-04 Thread Markus Scherer
You could - use SCSU (UTR 6) - use BOCU-1 (http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/conversion/bocu1/bocu1.html) - invent your own... markus

Re: Offtopic : Unicode and Bengali

2002-03-05 Thread Markus Scherer
Martin Kochanski wrote: And (3) it is not yet realistic to expect that all software will be able to handle the non-1-to-1 relationship between characters and glyphs, and it won't be for some time This argument belongs into the 1980s, not into the 2000s It is one reason why Unicode has

Re: processing numeric strings

2002-03-15 Thread Markus Scherer
[EMAIL PROTECTED] wrote: I've got a question I asked about on a couple of other lists, but didn't get much response, so I thought I'd try here. One of our developers has asked me for input on a certain problem: Do I need to be able to work with numbers represented using digits/numbering

Re: Collation - last character?

2002-03-15 Thread Markus Scherer
How about U+10FFFF? It is a non-character, which gives it a high (unassigned character) weight in the UCA. It is the highest code point = the last character. It cannot be a Private-Use character, so few people will be tempted to tailor it to something other than its default UCA weight. It also

Re: Regd- ISCII to Unicode Converter!

2002-03-15 Thread Markus Scherer
ICU supports ISCII, except for the font-style attributes (like bold) which are not expressible in plain text. http://oss.software.ibm.com/cvs/icu/~checkout~/icu/source/data/mappings/convrtrs.txt http://oss.software.ibm.com/icu/ ISCII is algorithmic. The mapping part to/from Unicode is fairly

Re: Double-struck italic E for mathematics?

2002-03-22 Thread Markus Scherer
I say font! and markup! and I run and duck and cover... markus (Very personal opinion!)

Re: UTR#9: Bidirection and UTR#14: Line Breaking

2002-03-25 Thread Markus Scherer
Chookij Vanatham wrote: UTR#14:Line Breaking says that, Interpretation of line breaking properties in bidirectional text takes place before applying rule L1 of the Unicode Bidirectional Algorithm. UTR#9:Bidirectional says that, [at the Reordering Resolved Levels section], As opposed

Re: how can I write an arabic square root

2002-03-25 Thread Markus Scherer
munzir taha wrote: It's just a english square root symbol flipped horizontally. I think there should be one in the unicode, doesn't it? It is the task of the font and the layout engine to mirror certain character's glyphs in a right-to-left context. Sometimes there is another character

Re: how can I write an arabic square root- I think I've understood a little.

2002-04-01 Thread Markus Scherer
Eric Muller wrote: I believe that the current mirrored and mirrored glyph properties are useful only when no help can be obtained from the font; otherwise, the resolved directionality should be provided to the font, which should then select the appropriate shape for each and every

Re: accessing extended ranges

2002-04-02 Thread Markus Scherer
Addison Phillips [wM] wrote: ICU4J, the IBM opensource project, provides some UTF-16 support capabilities that suggest a possible solution, but there are seemingly intractable problems with the Character class and char data type (luckily most APIs in Java take int arguments for characters

Re: xml 1.0 and unicode ideograph ext a and ext b

2002-04-03 Thread Markus Scherer
I think you are looking for the next version of XML: http://www.w3.org/TR/xml-blueberry-req markus Yung-Fong Tang wrote: Any plan to change the xml specification to follow the newly updated Unicode 3.2 Standard ?

conversion performance: UTF-8 BOCU-1 SCSU

2002-04-04 Thread Markus Scherer
I have numbers for text size and conversion performance of BOCU-1 and SCSU relative to UTF-8. Quick summary: For Latin text, UTF-8 is best. For CJK, BOCU-1 and SCSU provide smaller size, with some speed trade-off. For other scripts, BOCU-1 and SCSU are much better than UTF-8 in both speed and

Re: how can I write an arabic square root- I think I've understood a little.

2002-04-04 Thread Markus Scherer
[EMAIL PROTECTED] wrote: This raises a question in my mind: how is an app to know whether the layout engine+font are smart enough? ... In other words, it seems to me that it must be agreed that an app should assume it is handled by Uniscribe/OT, or should assume that it is not. Yes, I

Re: conversion performance: UTF-8 BOCU-1 SCSU

2002-04-05 Thread Markus Scherer
With a small improvement to the single-byte loop, the BOCU-1 converter became faster. This improves especially the speed for Latin text conversion. Please see new numbers at http://oss.software.ibm.com/icu/dropbox/bocuperf.html markus

Re: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Markus Scherer
The reason for ICU's UTF-16 converter not trying to auto-detect the BOM is that this seems to be something that the _application_ has to decide, not the _converter_ that the application instantiates. This converter name is (currently) only a convenience alias for use the UTF-16 byte

Re: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Markus Scherer
Rick Cameron wrote: So the original statement was correct. If the file starts with FF FE, it must be a little-endian encoding; but you can't tell whether it's UTF-16 or UTF-32. If you know that it's UTF-16 and you just try to figure out the byte order, then FF FE is unambiguous. If you
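The ambiguity discussed above shows up as soon as one writes a signature sniffer: FF FE is a prefix of FF FE 00 00, so the longer signatures have to be tested first. A minimal sketch, with a made-up function name and covering only UTF-8/16/32:

```python
# Longest signatures first, because FF FE is a prefix of the UTF-32LE signature.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]


def sniff_signature(data: bytes):
    """Return the encoding suggested by a leading signature, or None."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None
```

This is a guess, not a proof: FF FE 00 00 is also valid UTF-16LE text that starts with a BOM followed by U+0000. If the caller already knows the text is UTF-16, as in the post, only the two-byte signatures should be considered and FF FE is unambiguous.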

UTF-7 signature

2002-04-11 Thread Markus Scherer
On 2002-apr-09, Shlomi Tal and Doug Ewell discussed on this list a UTF-7 signature byte sequence of +/v8- (which was news to me). (Subject MS/Unix BOM FAQ again (small fix)) I meditated some over this - +/v8 is the encoding of U+FEFF as the first code point in a text. So far, so good. The '-'

Re: UTF-7 signature

2002-04-11 Thread Markus Scherer
Shlomi Tal wrote: UTF-7, it shocked me how Greek Sokrates and S o k r a t e s (with spaces between each Greek letter in the latter) would have different encodings for the same Unicode characters. That is not unusual for stateful encodings. It's the same with BOCU-1 (not in this particular

Re: When was U+xxxx added?

2002-04-11 Thread Markus Scherer
ICU 2.1 will have an API for this, uchar.h/u_charAge(). markus Kenneth Whistler wrote: Frank asked: Given a Unicode encoding value U+xxxx (or whatever for non-BMP), how can I find out the version of the Unicode standard in which this character first appeared?

Re: MS/Unix BOM FAQ again (small fix)

2002-04-12 Thread Markus Scherer
George W Gerrity wrote: To expand on this, imagine there is a text file in some encoding on some medium created by a little-endian machine (say a DEC Vax or a Macintosh 68000), and it is to be accessed on a big-endian machine (any Intel 8080 -- Pentium architecture). Unless the two CPUs

Re: Thai word list

2002-04-18 Thread Markus Scherer
Doug Ewell wrote: The ICU package includes a sorted Thai word list in a UTF-8 file called th18057.txt. Since you may not wish to download the whole package and I don't know if the Thai file is available separately, I have uploaded it (for a limited time only) to: Note that ICU has CVS and

Re: Useful Resources - Another round of spring cleaning

2002-05-09 Thread Markus Scherer
Google+ : markus Magda Danish (Unicode) wrote: - Akkadian http://saturn.sron.ruu.nl/~jheise/akkadian/index.html http://www.sron.nl/~jheise/akkadian/ - Multilingual Project Gutenberg http://www.informatik.uni-hamburg.de/gutenb http://www.sharat.co.il/pg/ - USMARC to UNIVERSAL

Re: Encoding of symbols, and a lock/unlock pre-proposal

2002-05-20 Thread Markus Scherer
Personally, I find it counter-productive to add a hodge-podge of dingbats and miscellaneous symbols to Unicode, or any coded character set. They had practical uses when user interfaces and display systems could not handle icons and arbitrary images, but those times are long over. Witness the

cuneiform on the web (USA Today 20020521)

2002-05-22 Thread Markus Scherer
USA Today ran a story yesterday on efforts to get cuneiform tablets published on the web in dictionary, photographic and 3-D forms: http://www.usatoday.com/news/healthscience/science/anthro/2002-05-21-cuneiform.htm It mentions an encoding effort (which I am sure was mentioned on this list).

Re: [OT] Agreement and i18n (was RE: Language name questions)

2002-05-28 Thread Markus Scherer
Some suggestions below. markus Marco Cimarosti wrote: A very well-known example of this situation is the menu New of Windows Explorer (the program that is used to manipulate the file system under MS Windows systems). I am not sure much can be done in this case: The New/Nuovo is at a
