Re: Normalisation and font technology

2002-05-28 Thread Markus Scherer
John, you seem to say normalization but mean decomposition. Please note that there are several normalization forms, and the most popular one is NFC, typically using code points for precomposed characters. Your email suggests that MacOS is using NFD, which I find surprising. On the issue of
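The difference between the two forms mentioned above can be sketched with Python's standard `unicodedata` module. This is only an illustration; the sample character is an assumption chosen for the example, not something taken from the thread.

```python
import unicodedata

# Illustrative sample: LATIN SMALL LETTER E WITH ACUTE as one precomposed code point.
precomposed = "\u00e9"

# NFD decomposes it into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", precomposed)

# NFC composes the sequence back into the precomposed code point.
recomposed = unicodedata.normalize("NFC", decomposed)

print([f"U+{ord(c):04X}" for c in decomposed])
print(recomposed == precomposed)
```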

Re: Unicode in email

2002-05-28 Thread Markus Scherer
The human-readable part of the email address (the friendly name) can contain any character, while the internal or actual address is very limited. A posting to the unicode list a while ago has the following header lines (among others): From:

Re: Towards some more Private Use Area code points for

2002-05-31 Thread Markus Scherer
For borders and arbitrary logos/symbols, it sounds like the best would be to do what someone else suggested on this list a few days ago: Define markup to specify a font and a glyph ID in that font to display something without the need for a pseudo-character encoding for it. Something for

Re: from 4 to null (was: 3 big bidi bugs)

2002-05-31 Thread Markus Scherer
There are a Java and a C++ reference implementation linked from the Bidi TR. The Java one is straightforward (and slow), written so that you can read each rule in the TR and see in the source that it works as specified. The C++ code is verified to produce the same results as the Java code.

Re: utf-8 and databases

2002-07-09 Thread Markus Scherer
Databases use table definitions that usually define what encoding is used in which parts of the database. Encodings can be set per database, per table, or per column, and the definition syntax seems to vary widely among vendors and products. Generally, UTF-8 or 1208 or unicode or similar is

Re: Filesystem Encoding

2002-07-10 Thread Markus Scherer
Doug Ewell wrote: Not sure if this is relevant to your specific case, but I still use the command prompt (MS-DOS Prompt) a lot ... Interesting. I just tried the following: Windows 2000. New text document with Notepad, arbitrary contents. Save as AC06 0436.txt (Hangul letter + Cyrillic

Re: chcp 10000 (was: Filesystems)

2002-07-11 Thread Markus Scherer
a few years ago using the w versions of main(), printf(), etc., and they worked just fine. I think I switched the file mode of stdout to binary in those tools. markus Shlomi Tal wrote: Hello Markus Scherer. You wrote: chcp 1 to change the command prompt code page to UTF-16. But as far as I

Re: chcp 10000 (was: Filesystems)

2002-07-11 Thread Markus Scherer
Joseph Boyle wrote: Don't you need a fixed width font though? My W2K shows only Raster Fonts and Lucida Console when I try to change the command window font. Yes, I used Lucida Console, as I wrote originally. The command prompt window does not appear to accept duospace fonts, which would

Re: Is UniCode's Thai character representation is acceptable by TISI or not?

2002-07-17 Thread Markus Scherer
The SARA AM problem seems to be with the compatibility decomposition (NFKD and NFKC). NFK* change a lot of characters and strings - not just Thai - in various visible and functional ways and must be used with caution. markus Samphan Raruenrom wrote: Mark Davis wrote: - decomposition of
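A small sketch of the effect described above, using Python's `unicodedata` module rather than any tool named in the thread. It compares canonical and compatibility normalization of U+0E33 THAI CHARACTER SARA AM.

```python
import unicodedata

SARA_AM = "\u0e33"  # THAI CHARACTER SARA AM

# Canonical normalization leaves the character alone.
nfc = unicodedata.normalize("NFC", SARA_AM)

# Compatibility normalization replaces it with NIKHAHIT + SARA AA.
nfkc = unicodedata.normalize("NFKC", SARA_AM)
nfkd = unicodedata.normalize("NFKD", SARA_AM)

for label, value in (("NFC", nfc), ("NFKC", nfkc), ("NFKD", nfkd)):
    print(label, [f"U+{ord(c):04X}" for c in value])
```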

Re: Abstract character?

2002-07-23 Thread Markus Scherer
So far, the Unicode Standard has defined code points to be from the contiguous range of 0..0x10FFFF. Some definitions are fuzzy in the standard, with hopes of clarification in Unicode 4.0. It is true that UTF-16 cannot encode d800 dc00, but it can encode d800 0061 dc00. There are at least
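The point about surrogate code points can be tried out with a rough sketch. It relies on Python's `surrogatepass` error handler, which deliberately lets lone surrogates through the UTF-16 codec; the helper name `utf16_roundtrip` is hypothetical.

```python
def utf16_roundtrip(code_points):
    """Encode a list of code points as UTF-16BE and decode the bytes again."""
    text = "".join(chr(cp) for cp in code_points)
    data = text.encode("utf-16-be", "surrogatepass")
    back = data.decode("utf-16-be", "surrogatepass")
    return [ord(c) for c in back]

# The two code points D800 DC00 come back as the single code point U+10000,
# so the original sequence cannot be recovered.
print([hex(cp) for cp in utf16_roundtrip([0xD800, 0xDC00])])

# With a letter in between, the two surrogates stay unpaired and survive.
print([hex(cp) for cp in utf16_roundtrip([0xD800, 0x0061, 0xDC00])])
```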

Re: library for identifying equivalent sequences

2002-08-01 Thread Markus Scherer
Mark Davis wrote: We do have that in ICU 2.2. It is not a public interface (meaning that we will likely change the API before we make it public), but it is accessible if you want to test with it for now. See the ICU i18n library's caniter.h and caniter.cpp

Re: egyptian example -fixed

2002-08-12 Thread Markus Scherer
Tex, the presentation forms are marked with Bidi AL just like the normal Arabic characters. A conformant Bidi implementation must treat them the same. markus
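The Bidi classes can be checked with Python's `unicodedata` module. The two sample characters below are assumptions picked for illustration: a regular Arabic letter and one of its presentation forms.

```python
import unicodedata

alef = "\u0627"            # ARABIC LETTER ALEF
alef_isolated = "\ufe8d"   # ARABIC LETTER ALEF ISOLATED FORM

# Both report the same bidirectional class.
alef_class = unicodedata.bidirectional(alef)
isolated_class = unicodedata.bidirectional(alef_isolated)

print(alef_class, isolated_class)
```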

Re: Dots as far as the eye can see (formerly: Re: New version of TR29:)

2002-08-14 Thread Markus Scherer
Mark Davis wrote: Note that we have a gazillion other dots already: ... And these are just the obvious ones found with a quick search (and just for the single dots). There are probably more hiding out in little corners of scripts (it's a bit like Where's Waldo looking for them. To find

Re: The Unicode Technical Committee meeting in Redmond, Washington State, USA.

2002-08-21 Thread Markus Scherer
Boris Becker against Steffi Graf 6:4 4:6 6:7... There are UTC/L2 documents for the agenda, topics, action items, minutes, etc. whenever appropriate. William Overington wrote: This is not in the same news gathering league as having CNN and other Oh - markus

Re: Mercury News: Hawaiian on a Mac

2002-09-05 Thread Markus Scherer
Stefan Persson wrote: This links to a different page on the same server: http://www.cl.cam.ac.uk/~mgk25/unicode.html That page contains a strange UTF-8 table: ... The last two byte sequences are invalid. Markus Kuhn's page shows the original ISO 10646 definition. This necessarily

Re: ISRI SoEuro has just been created!!

2002-09-10 Thread Markus Scherer
Doug Ewell wrote: ... They are not necessarily intended to replace the established mechanisms, although I suspect the ICU team does intend BOCU to replace SCSU. ... Nope. They have different properties and are useful for different if overlapping applications. BOCU-1 was developed for

Re: French or German Unicode Names??

2002-09-17 Thread Markus Scherer
Not that I have anything against French or German(!), but beware of what you would do with a translation. Translated names are fine as an annotation. Character names are treated as identifiers of abstract characters. They do not necessarily describe the abstract character well, or even

Re: Small 's' with grave?

2002-09-26 Thread Markus Scherer
[EMAIL PROTECTED] wrote: A friend of a friend asked me if Unicode has a code for small s with a grave. U+0073 U+0300 Has it been added since 3.0? Thanks in advance. Afaik, there is not and will not be any new precomposed characters since Unicode 3.0 I think the policy is to not add new
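A quick check of the sequence given above, sketched with Python's `unicodedata` module. The second sample (e with grave) is an assumption added only for contrast.

```python
import unicodedata

s_grave = "\u0073\u0300"   # s + COMBINING GRAVE ACCENT
e_grave = "\u0065\u0300"   # e + COMBINING GRAVE ACCENT, for contrast

# No precomposed "s with grave" exists, so NFC keeps the two code points.
s_nfc = unicodedata.normalize("NFC", s_grave)

# A precomposed "e with grave" does exist, so NFC yields one code point.
e_nfc = unicodedata.normalize("NFC", e_grave)

print(len(s_nfc), len(e_nfc))
```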

Re: glyph selection for Unicode in browsers

2002-09-26 Thread Markus Scherer
Tex Texin wrote: However, a Japanese user might have to choose a Japanese font, if the Unicode font does not favor (and cannot be made to favor with language tags) Japanese renderings. So it's catch 22. They have native fonts because Unicode fonts are inadequate, but we can be relieved that

Unicode charsets registered with IANA

2002-09-30 Thread Markus Scherer
BOCU-1 is now an IANA-registered charset: http://www.iana.org/assignments/character-sets I thought it might be useful and interesting to show the list of Unicode charsets that are registered: Charset name, MIBenum, aliases (if any *) UTF-7(MIBenum 1012) UTF-8(MIBenum 106) UTF-16

Re: Historians- what is origin of i18n, l10n, etc.?

2002-10-10 Thread Markus Scherer
Barry Caplan wrote: There is a link with the story on the front page of www.i18n.com Nice story, similar to the one with Gary Miller. It seems like we have three stories of origin now (with mid-'80s DEC). The i18n.com version does not date the MIT meeting, does it? markus

Re: GCGID for U+03B8

2002-10-11 Thread Markus Scherer
Doug Ewell wrote: What is the correct IBM GCGID value for U+03B8 GREEK SMALL LETTER THETA? Is it GT61 or GT610002? I have an internal document that shows GT61 Theta Small - (see GT610001, GT610002) U3B8 GREEK SMALL LETTER THETA GT610001 Theta Small (Open Form) - (resembles SA50)

Re: Manchu/Mongolian in Unicode

2002-10-15 Thread Markus Scherer
Andrew C. West wrote: On Tue, 15 Oct 2002, Stefan Persson wrote: That font also includes some characters mapped to the PUA: A € sign, and several 漢 character, many of which look like radicals. Why? Is that something that's also required by that law? It's my experience that many fonts

Re: Character identities

2002-10-23 Thread Markus Scherer
David Starner wrote: First, is it compliant with Unicode for an Antiqua font to use an s glyph for ſ (U+017F)? It makes switching between Antiqua and Fraktur fonts possible, and it is arguably the glyph given to the middle s in modern Antiqua fonts. Likewise, ä is printed as a with e above in

Re: XML Suitable (was: Meeting minutes for UTC 92 in August)

2002-10-22 Thread Markus Scherer
Doug Ewell wrote: [92-C23] Consensus: Add a definition of XML Suitable and a recommendation that SCSU encoders should be XML Suitable. [L2/02-262] [92-A46] Action Item for Markus Scherer, Editorial Committee: Post a proposed update to Unicode Technical Standard #6 A Standard Compression Scheme

Re: New Charakter Proposal

2002-10-30 Thread Markus Scherer
Dominikus Scherkl wrote: My other suggestion (and the main reason to call the proposed character source failure indicator symbol (SFIS)) was intended especially for malformed UTF-8 input that has overlong encodings. In this special case a converter exactly knows which char is intended, but needs

Re: New Charakter Proposal

2002-11-01 Thread Markus Scherer
David Starner wrote: Chances are nearly 100% that overlong UTF-8 was a spoofing attempt, or the result of something other than a UTF-8 encoder. With the exception of overlong sequences for null (C0 80?), which Java generates in an attempt to avoid true nulls. I am aware of this one. This
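As an illustration of how a strict decoder treats the overlong null sequence mentioned above, here is a minimal sketch using Python's built-in UTF-8 codec. The helper name `is_well_formed_utf8` is hypothetical.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# C0 80 is the overlong two-byte form of U+0000; a strict decoder rejects it.
print(is_well_formed_utf8(b"\xc0\x80"))

# The shortest form of U+0000 is a single zero byte, which is accepted.
print(is_well_formed_utf8(b"\x00"))
```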

Re: Names for UTF-8 with and without BOM - pragmatic

2002-11-05 Thread Markus Scherer
Mark Davis wrote: Little probability that right double quote would appear at the start of a document either. Doesn't mean that you are free to delete it (*and* say that you are not modifying the contents). This points to a pragmatic way to deal with this issue: If software claims that it does

Re: Names for UTF-8 with and without BOM - pragmatic

2002-11-06 Thread Markus Scherer
Lars Kristan wrote: Markus Scherer wrote: If software claims that it does not modify the contents of a document *except* for initial U+FEFF then it can do with initial U+FEFF what it wants. If the whole discussion hinges on what is allowed *if software claims to not modify text* then one

charset name for IMAP mailbox encoding (mod UTF-7)?

2002-11-06 Thread Markus Scherer
Quick question: IMAP specifies a modified UTF-7 encoding for mailbox names. I imagine that this might be implemented in some applications as a converter. If so, what charset name is used for it? Is there a common one? If there is no commonly used charset name, then how about imap-mailbox-name?

Re: FW: ct, fj and blackletter ligatures

2002-11-07 Thread Markus Scherer
Dominikus Scherkl wrote: I don't believe that English readers encountering an fb ligature in the middle of the compound word 'goofball' are confused about where the syllables, and hence the subwords, end and begin. That may be because english doesn't use word-concatenations the way german do:

Re: Speaking of Plane 1 characters...

2002-11-11 Thread Markus Scherer
Michael (michka) Kaplan wrote: Michael, in answer to your request for a UTF-8 converter, that will have to be another day (its a bit more complicated, and I spend most of my time in UTF-16 and UTF-32 so I can't really pretend its work related). If you wanted to provide the code in VBScript or

Re: Media UI Symbols

2002-11-12 Thread Markus Scherer
Christoph Päper wrote: Moin, Selber moin :-) I've checked the existing chars, http://www.unicode.org/alloc/Pipeline.html and this year's thread titles of this mailing list, but didn't find characters to represent UI controls of media devices (or a proposal for including them for that matter)

Re: UTF-8 to Japanese encoding scheme for IBM Mainframe OS 390

2002-11-12 Thread Markus Scherer
sourav mazumder wrote: Need an urgent help regarding UTF-8 data conversion in IBM Mainframe 390. I have a data file in Windows system which contains Japanese characters encoded using UTF-8. I need to send this file to IBM Mainframe 390, where an application will read this data. In this context

Re: Double Byte Character Set (DBCS)

2002-11-13 Thread Markus Scherer
-Original Message- We are now looking to expand the market for this product into countries such as China. To achieve this I have been informed we need to enable our application for Double Byte Character Set (DBCS). DBCS is an old, pre-Unicode term for character sets with

Re: IBM AIX 5 and GB18030

2002-11-13 Thread Markus Scherer
xjliu_ca wrote: I have searched all the web on IBM about the support of GB18030 in OS AIX 4.3 and 5, but didn't find anything. I only can see they support GB2312 and GBK. Google found something for me: http://www-3.ibm.com/software/ts/mqseries/support/readme/aix530_read.html Search for 18030

Re: IBM AIX 5 and GB18030

2002-11-14 Thread Markus Scherer
Carl W. Brown wrote: Some Unix systems adapted faster because the later Unicode adopters used 32 bit Unicode characters making the job 100 times easier. Other companies like Microsoft took a very big gamble and implemented the code for surrogate support into Windows 2000 based on early drafts of

Re: IBM AIX 5 and GB18030

2002-11-14 Thread Markus Scherer
Jane Liu wrote: That may mean IBM AIX 5 support conversion between GB18030 and Unicode, but I don't see this is a system level of support because there is no locale names for GB18030 in the doc of AIX 5 : The GB 18030 standard requires software to be able to _read and write_ text in the GB18030

Re: IBM AIX 5 and GB18030

2002-11-15 Thread Markus Scherer
Michael Yau wrote: Markus, The standard does _not_ require to _process_ internally in GB18030. It is sufficient to have a converter and to process in Unicode, which does contain all of the characters. Just curious, do you have this in writing from the China standards body? I don't

Re: IBM AIX 5 and GB18030

2002-11-15 Thread Markus Scherer
Jane, you are right, I over-simplified. I tried to make the point that you need not _process_ text in GB18030 but that Unicode processing and conversion to/from GB18030 fulfills the requirement to be able to read and write GB18030 text. Yes, you need to have font support for all the characters

Re: Small Latin Letter m with Macron

2003-01-15 Thread Markus Scherer
David J. Perry wrote: The convention of using a horizontal line to mark an abbreviation, often the omission of m or n, goes back to the middle ages (if not earlier) and was often used in early printed books; apparently it has lived on in some handwriting, to judge from your post. ... I can

Re: newbie 18030 font question

2003-01-16 Thread Markus Scherer
GB 18030 is defined with a 1:1 mapping table to Unicode. It has large code spaces for user-defined characters, but the standard repertoire is the same as Unicode's. In practice, all modern browsers work internally with Unicode no matter what page charset is received. They all convert from the
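The round trip between Unicode and GB 18030 can be sketched with Python's built-in `gb18030` codec, used here as a stand-in for whatever converter a browser uses. The sample string is an assumption chosen to show one-, two-, and four-byte sequences.

```python
# Illustrative sample: an ASCII letter, a common Han character,
# and a supplementary-plane code point.
sample = "A\u4e2d\U00010000"

encoded = sample.encode("gb18030")
decoded = encoded.decode("gb18030")

# Byte length of each character in GB 18030.
lengths = [len(ch.encode("gb18030")) for ch in sample]

print(lengths)
print(decoded == sample)
```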

Re: ISO 639 arg - Esperanto

2003-01-22 Thread Markus Scherer
I am pleasantly surprised to see Esperanto on this list, even just in a quote :-) No, I don't claim to be proficient any more. Anto'nio Martins-Tuva'lkin wrote: I got the following reaction from a specialist on aragonese issues, Ferran Marin i Ramos [EMAIL PROTECTED]: Certe temas pri eraro,

Re: Greek fractions

2003-01-24 Thread Markus Scherer
Raymond Mercier wrote: The problem is rather: when are Unicode going to include the great many symbols covered in Betacode... Any characters make it into Unicode by someone - you? - writing complete, reasonable, convincing proposals that then make it through the committees and get approved in

Re: Arabic Presentation Forms

2003-02-06 Thread Markus Scherer
ICU has a function u_shapeArabic(): http://oss.software.ibm.com/icu/apiref/ushape_8h.html#a24 markus Mete Kural wrote: I need to figure out a method to convert Arabic Unicode text encoded in its normal form to Arabic Unicode text encoded in Arabic presentation forms. ...

Re: Plane 14 Tag Deprecation Issue (was Re: VS vs. P14 (was Re: IndicDevanagari Query))

2003-02-07 Thread Markus Scherer
William Overington wrote: Kenneth Whistler now states an opinion as to what the review is about and mentions a file PropList.txt of which I was previously unaware. Kenneth Whistler referred to a file that is part of the publicly and freely provided Unicode Character Database, showing various

Re: CJK test data

2003-02-07 Thread Markus Scherer
Michael (michka) Kaplan wrote: GB18030 does not define a specific standard for sorting (as far as I know, neither does GB13000). It is an encoding standard. GB 18030 certainly does not define sorting. It defines a CCS/CES based on a mapping table to/from Unicode/ISO 10646. GB 13000 is, as far

Re: Converting EBCDIC to Unicode

2003-02-11 Thread Markus Scherer
Doug Ewell wrote: SRIDHARAN Aravind ASridharan at covansys dot com wrote: How to convert EBCDIC data into Unicode? There are informative mapping tables available at: http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/ There are also various places where IBM publishes

Re: Never say never

2003-02-11 Thread Markus Scherer
Marco Cimarosti wrote: It has been repeated a lot of times that no more precomposed character will never ever ever ever be added. ... Stability requires that no more precomposed characters will be added that are equivalent to sequences of already-existing other characters. This is because it

Re: newbie: unicode (when used as a coding) = UTF16LE?

2003-02-13 Thread Markus Scherer
Tom Gewecke wrote: Aside from Tex Texin's experimental pages, does any one have url's of web sites done in UTF-16 rather than UTF-8? I don't, but would not mind having some examples for testing. In some of the ICU online demos you can choose the output charset. This should be interesting with

Re: Character display problem in browser

2003-02-17 Thread Markus Scherer
SRIDHARAN Aravind wrote: My database is Oracle and its character set is WE8ISO8859P1. In database, I have stored special Polish characters. First of all, the database character set is ISO 8859-1 which cannot represent special Polish characters. In all likelihood, you have taken a byte stream
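The limitation of ISO 8859-1 can be demonstrated with Python's codecs. This is a sketch only and says nothing about how Oracle itself behaves; the sample letter is an assumption.

```python
polish_letter = "\u0142"  # LATIN SMALL LETTER L WITH STROKE

# ISO 8859-1 has no code for this letter, so strict encoding fails.
try:
    polish_letter.encode("iso-8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

# ISO 8859-2 (Latin-2) covers Polish, so the same letter fits in one byte.
latin2_bytes = polish_letter.encode("iso-8859-2")

print(fits_latin1, len(latin2_bytes))
```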

Re: BOM's at Beginning of Web Pages?

2003-02-17 Thread Markus Scherer
I would like to add some information here without getting myself into the core of the discussion: HTML recognizes a lot fewer whitespace characters than Java or Unicode. Different people have different sets of whitespace characters. Unicode's White_Space property (PropList.txt) contains 24 code

Re: DBCS and Unicode 3.1

2003-02-17 Thread Markus Scherer
Michael (michka) Kaplan wrote: Well, DBCS means double byte character set and thus it is always two bytes. But its a theoretical definition since there are no actual DBCS code pages -- all of the ones that exist are MBCS (multibyte character set) since they support both one-byte and two-byte

Re: DBCS and Unicode 3.1

2003-02-18 Thread Markus Scherer
Jungshik Shin wrote: On Mon, 17 Feb 2003, Markus Scherer wrote: Other examples: There are EUC-JP (1/2/3 bytes per character) and EUC-CN (1/2/4 BpC) which are quite old (much older than GB 18030). Markus's fingers made a mistake here :-). It's EUC-TW (not EUC-CN) that encodes CNS 11643

Re: DBCS and Unicode 3.1

2003-02-18 Thread Markus Scherer
[EMAIL PROTECTED] wrote: Does anyone know of a way to process GB 18030 data in COBOL on MVS? You could try to call ICU4C from COBOL http://oss.software.ibm.com/icu/userguide/cobol.html ICU has a GB 18030 converter. markus

Re: RFC, 5-6 octets sequence in UTF8, non short form in UTF8

2003-02-18 Thread Markus Scherer
Frank, http://www.ietf.org/internet-drafts/draft-yergeau-rfc2279bis-03.txt addresses these, and version -04 of this draft will be public shortly. markus

Re: [REPOST, LONG] XML and tags (LONG) - SCSU for XML

2003-02-21 Thread Markus Scherer
Marco Cimarosti wrote: BTW, would it be possible to encode XML in SCSU? Yes. Any reasonable SCSU encoder will stay in the ASCII-compatible single-byte mode until it sees a character from beyond Latin-1. Thus the encoding declaration will be ASCII-readable. The next version of UTR #6 will say so

Re: [REPOST, LONG] XML and tags (LONG) - SCSU for XML

2003-02-21 Thread Markus Scherer
Martin Duerst wrote: - Is it *probable* that an XML processor decodes XML in SCSU? No, XML processors are only required to support UTF-8 and UTF-16. Many of them support other encodings, such as iso-8859-1,..., but support for SCSU is thin as far as I'm aware. Well, Xerces is a reasonably

Re: symbols for `born' and `died'

2003-02-24 Thread Markus Scherer
Werner LEMBERG wrote: ... Similarly, the year of marriage is depicted as two intertwined circles. How will this be represented in Unicode? Are there characters for it? For the marriage symbol, U+221E INFINITY should work fine - and quite appropriately. markus

Re: UTF-8 question

2003-02-25 Thread Markus Scherer
A UTF-x converter must handle non-characters like U+FFFE, U+FDD0, etc. Unicode 3.0 chapter 3.8 Transformations clause D29 defines this, and the text there and below spells out that non-characters and the like must be converted as well. The change since 3.0 only affects single-surrogate code
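A short sketch of the requirement above, using Python's built-in codecs as example converters. It checks that the two noncharacters named in the message survive a round trip through each UTF.

```python
noncharacters = "\ufffe\ufdd0"

results = {}
for codec in ("utf-8", "utf-16-be", "utf-32-be"):
    data = noncharacters.encode(codec)
    # A conformant converter passes noncharacters through unchanged.
    results[codec] = data.decode(codec) == noncharacters

utf8_bytes = noncharacters.encode("utf-8")
print(results)
print(utf8_bytes.hex())
```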

Re: Finding string with special characters

2003-02-25 Thread Markus Scherer
SRIDHARAN Aravind wrote: I just want to know whether a particular string from the source has got special characters. How can I make a dynamic check for it? Well, you usually use the methods on the String class to search for a matching character or substring, or methods to iterate through the code

Re: Finding string with special characters

2003-02-26 Thread Markus Scherer
It sounds like you don't know in what encoding you get your input, and you are munging the input bytes(?!) in a custom way. You need to identify the input encoding/charset and, in Java, instantiate an InputStreamReader with the correct encoding name. Then you get proper Unicode strings, and

Re: Unicode 4.0 BETA available for review

2003-02-26 Thread Markus Scherer
Yung-Fong Tang wrote: I see a hole here. How about UTF-8 representing a paired of surrogate code point with two 3 octets sequence instead of an one octets UTF-8 sequence? It should be ill-formed since it is non-shortest form also, right? But we really need to watch out the language used there
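The case raised above, a surrogate pair written as two three-byte sequences, can be tested against a strict decoder. The sketch below uses Python's UTF-8 codec as that decoder; the sample code point U+10000 is an assumption.

```python
# U+10000 as a surrogate pair (D800 DC00), each surrogate encoded separately
# as a three-byte sequence.
two_three_byte_sequences = b"\xed\xa0\x80\xed\xb0\x80"

try:
    two_three_byte_sequences.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False

# The well-formed UTF-8 encoding of the same code point.
well_formed = "\U00010000".encode("utf-8")

print(accepted, well_formed.hex())
```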

Re: UTF-8 Error Handling

2003-02-28 Thread Markus Scherer
Yung-Fong Tang wrote: Same thing for JIS x0208 (a TWO and only TWO bytes character set, not a variable length character set). If I am processing a ISO-2022-JP message and in the JIS x0208 mode and I got a 0x24 0xa8 I know the boundary of that problem is 16 bits, not 8 -bits nor 32 bits. Not

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-03 Thread Markus Scherer
I am not sure yet how far I want to get into this discussion... but this seems worth mentioning: Asmus Freytag wrote: The ideal case is one where the converter stops in a restartable configuration, allowing the client to implement (or ask for) a variety of error-recovery options. A nice

Re: Question about CollationTest_NON_IGNORABLE.txt NormalizationTest.txt

2003-03-10 Thread Markus Scherer
No takers for this question? Let me try... askq1 askq1 wrote: The CollationTest_NON_IGNORABLE.txt NormalizationTest.txt contain test-cases for sorting and normalization. The strings that are mentioned in these files follow a specific order: ... I want to know if these files are organized

Re: Unicode character transformation through XSLT

2003-03-11 Thread Markus Scherer
Kenneth Whistler wrote: Unicode character (\uFFE2\uFF80\uFF93) ... What you are actually looking for is the UTF-8 sequence: 0xE2 0x80 0x93 The 8-bit UTF-8 bytes E2 80 93 (all with the most significant bit set) get *sign-extended* to 16 bits, producing FFE2 FF80 FF93. It should suffice in a
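The sign-extension effect described above can be reproduced and undone in a few lines. This is a sketch: the helper name `sign_extend_byte_to_16` is hypothetical, and masking with 0xFF is one possible repair, not necessarily the one the message goes on to recommend.

```python
def sign_extend_byte_to_16(b: int) -> int:
    """Mimic widening a signed 8-bit value to an unsigned 16-bit one."""
    return b | 0xFF00 if b & 0x80 else b

raw = bytes([0xE2, 0x80, 0x93])

# Every byte has its top bit set, so each one picks up a leading FF.
extended = [sign_extend_byte_to_16(b) for b in raw]

# Masking off the upper byte recovers the original UTF-8 bytes.
repaired = bytes(unit & 0xFF for unit in extended)
text = repaired.decode("utf-8")

print([f"{unit:04X}" for unit in extended])
print(f"U+{ord(text):04X}")
```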

Re: ZWNJ Persian Collation

2003-03-12 Thread Markus Scherer
Roozbeh Pournader wrote: Well, anything that is completely ignored in collation creates problems with deterministic sorting. I don't think you mean deterministic. UCA is deterministic, it just sorts many strings as equal. There are certain words in Persian, with completely different meanings,

Re: Unicode library that provides versioned Unicode API?

2003-03-12 Thread Markus Scherer
ICU4C 2.6 (June/July) will support Unicode 4 but also provide an option for Unicode 3.2 normalization (with NormalizationCorrections.txt applied though). http://oss.software.ibm.com/icu/ http://oss.software.ibm.com/pipermail/icu/2003-March/005406.html We do not have any plans so far to do this

Re: Unicode character transformation through XSLT

2003-03-12 Thread Markus Scherer
Generally, try instantiating an InputStreamReader or similar from your input, with an explicit encoding=UTF8. That will perform the conversion from UTF-8 to the internal 16-bit Unicode that Java processes. Always use XYZReader classes for text input and XYZWriter classes for text output.
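The same reader pattern exists outside Java. As a rough Python analogue (not the Java API the message refers to), `io.TextIOWrapper` wraps a byte stream with an explicit encoding and hands back Unicode strings. The sample bytes are an assumption.

```python
import io

# Stand-in for a file or socket: a byte stream holding UTF-8 text.
byte_stream = io.BytesIO("na\u00efve \u2013 caf\u00e9".encode("utf-8"))

# Name the encoding explicitly instead of relying on a platform default.
reader = io.TextIOWrapper(byte_stream, encoding="utf-8")
text = reader.read()

print(len(text))
```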

per-character stories in a database

2003-03-13 Thread Markus Scherer
(from Re: geometric shapes) It has been suggested many times to build a database (list, document, XML, ...) where each designated/assigned code point and each character gets its story: Comments on the glyphs, from what codepage it was inherited, usage comments and examples, alternate names,

Re: Unicode character transformation through XSLT

2003-03-14 Thread Markus Scherer
Nooo - Java's old UTF functions do not process UTF-8! They are there for String serialization, a Java-internal format. Use the Java Reader/Writer classes instead of these old ones! See the Java tutorials on Internationalization: http://java.sun.com/docs/books/tutorial/i18n/text/convertintro.html

Re: UTF-24

2003-04-03 Thread Markus Scherer
Pim Blokland wrote: Why is there no UTF-24? Well, I once proposed UTF-20... See, these MathText characters take up a lot of space. No matter how you encode them; UTF-8, UTF-16 or UTF-32; they always are 4 bytes long. True for them alone, in those UTFs. Short of defining another Unicode encoding,
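The size claim is easy to verify with Python's codecs. The sample character, U+1D400 MATHEMATICAL BOLD CAPITAL A, is an assumption standing in for the characters the poster had in mind.

```python
math_bold_a = "\U0001d400"  # MATHEMATICAL BOLD CAPITAL A

# Byte length of the single character in each encoding form
# (explicit byte order, so no BOM is added).
sizes = {
    codec: len(math_bold_a.encode(codec))
    for codec in ("utf-8", "utf-16-be", "utf-32-be")
}

print(sizes)
```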

Re: javascript and unicode

2003-05-27 Thread Markus Scherer
Paul Hastings wrote: would it be correct to say that javascript natively supports unicode? ECMAScript, of which JavaScript and JScript are implementations, is defined on 16-bit Unicode scripts and using 16-bit Unicode strings. In other words, the basic encoding support is there, but there are

Re: book end or enclosing characters in most languages?

2003-05-30 Thread Markus Scherer
Ben Dougall wrote: On Wednesday, May 28, 2003, at 06:59 pm, Otto Stolz wrote: PS. In these two languages, the quote-marks are paired thusly: en_US: U+201C ... U+201D, and U+2018 ... U+2019 de_DE: U+201E ... U+201C, and U+201A ... U+2018 are they the right way round? so in german it'd be:

Re: book end or enclosing characters in most languages?

2003-05-30 Thread Markus Scherer
Ben Dougall wrote: So, there is not comprehensive list of openers vs. closers possible. so that's a 99 shaped quote on the baseline to open and, and a 99 high up to close. seems very odd to use 99 high or low to open, not a 66. but if that's how it is, that's how it is. Well, wait - I was

simple case mappings across UTF-8 length boundaries

2003-07-01 Thread Markus Scherer
FYI I wrote a little program for other standards activities to check which Unicode characters have simple lower-/uppercase mappings across UTF-8 length boundaries (0080, 0800, 10000). This is with Unicode 4 data. I thought some unicode subscribers might be interested in the result. Best

Re: Cases of signs? [RE: simple case mappings across UTF-8 lengthboundaries]

2003-07-01 Thread Markus Scherer
could that mean? From: Markus Scherer [mailto:[EMAIL PROTECTED] Sent: Tuesday, July 01, 2003 1:30 PM To: unicode Subject: simple case mappings across UTF-8 length boundaries U+2126 simple-lowercases to U+03c9 U+2126 is OHM SIGN U+212a simple-lowercases to U+006b U+212a is KELVIN SIGN U+212b simple
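The three mappings quoted above can be reproduced with a short sketch. Note the assumption: Python's `str.lower()` applies the full lowercase mapping, which for these three characters gives the same result as the simple mapping. The helper name `utf8_len` is hypothetical.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

signs = ["\u2126", "\u212a", "\u212b"]  # OHM SIGN, KELVIN SIGN, ANGSTROM SIGN

# (source, lowercase, UTF-8 length before, UTF-8 length after)
crossings = []
for sign in signs:
    lower = sign.lower()
    crossings.append((sign, lower, utf8_len(sign), utf8_len(lower)))

for sign, lower, before, after in crossings:
    print(f"U+{ord(sign):04X} -> U+{ord(lower):04X}: {before} -> {after} bytes")
```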

Re: ISO 639 duplicate codes - ICU4J modularization

2003-07-15 Thread Markus Scherer
ICU4J 2.6 provides build options out of the box to select certain functionalities. Please see the bullet Modularization on http://oss.software.ibm.com/icu4j/download/2.6/ markus

Re: Code Pages!

2003-07-24 Thread Markus Scherer
There are many codepages for Indic languages. Modern systems support Unicode. It is what Windows and MacOS X and Java and modern web browsers etc. use internally - everything else is supported via conversion, which can be problematic. The ISCII standard is byte-based and stateful. (Complicated

Re: Unicode Normalisaton Optimisation Experiments

2003-09-24 Thread Markus Scherer
Jon Hanna wrote: Hi, I'm currently experimenting with various trade-offs for Unicode normalisation code. Any comments on these (particularly of the that's insane, here's why, stop now! variety) would be welcome. You might want to look at, if not even use, the ICU open-source implementation:

Re: Unicode Normalisaton Optimisation Experiments

2003-09-25 Thread Markus Scherer
Peter Kirk wrote: On 25/09/2003 12:27, [EMAIL PROTECTED] wrote: It's not a reordering per se, as the first combining character is given the first opportunity to combine. Thanks for the clarification. In other words, yes, Unicode's NFC does perform discontiguous composition. Some things might be
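Discontiguous composition can be observed with Python's `unicodedata` module. The sample sequence is an assumption chosen for illustration: a base letter, a below-mark that has no precomposed form with it, then an above-mark that does.

```python
import unicodedata

# a + COMBINING GRAVE ACCENT BELOW (ccc 220) + COMBINING ACUTE ACCENT (ccc 230)
text = "a\u0316\u0301"

nfc = unicodedata.normalize("NFC", text)

# The acute composes with the base letter even though another mark
# sits between them; the below-mark is left in place after the result.
print([f"U+{ord(c):04X}" for c in nfc])
```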

Re: Unicode Normalisaton Optimisation Experiments

2003-09-25 Thread Markus Scherer
Peter Kirk wrote: On 25/09/2003 14:25, Markus Scherer wrote: In other words, yes, Unicode's NFC does perform discontiguous composition. Some things might be easier if only contiguous composition were used, but the current definition does give you the shortest strings. And this current

Re: Non-ascii string processing? - count display units

2003-10-07 Thread Markus Scherer
You might want to look at East Asian Width http://unicode.org/reports/tr11/ for an approximation of the green-screen width of a string. To be absolutely precise, you need feedback from your green-screen layout engine and its font, of course, like you do for a graphical display. markus Edward
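A minimal sketch of the approximation suggested above, using the East Asian Width property exposed by Python's `unicodedata`. The function name `approx_columns` is hypothetical, and the sketch counts every other character as one column, so it ignores combining marks and control characters.

```python
import unicodedata

def approx_columns(text: str) -> int:
    """Approximate display width: Wide and Fullwidth characters take two columns."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )

print(approx_columns("abc"))
print(approx_columns("\u6f22\u5b57"))  # two Han characters
print(approx_columns("\uff21"))        # FULLWIDTH LATIN CAPITAL LETTER A
```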

Re: Euro Currency for UK

2003-10-09 Thread Markus Scherer
I think Addison is on the right track here. I would like to point to ICU sample code for this kind of thing: http://oss.software.ibm.com/cvs/icu/~checkout~/icu/source/samples/numfmt/main.cpp See the code there from setNumberFormatCurrency_2_6 on down (the preceding code is for older ICU

Re: Canonical equivalence in rendering: mandatory or recommended?

2003-10-15 Thread Markus Scherer
Jill Ramonsky wrote: I had to write an API for my employer last year to handle some aspects of Unicode. We normalised everything to NFD, not NFC (but that's easier, not harder). Nonetheless, all the string handling routines were not allowed to assume that the input was in NFD, but they had to

Re: Canonical equivalence in rendering: mandatory or recommended?

2003-10-15 Thread Markus Scherer
Philippe Verdy wrote: ... In fact, to further optimize and reduce the memory footprint of Java strings, in fact I choosed to store the String in a array of bytes with UTF-8, instead of an array of chars with UTF-16. The internal representation is This does or does not save space and time depending
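How the space trade-off depends on the text can be seen by comparing encoded sizes. The two sample strings are assumptions for illustration, and the sketch measures only encoded length, not the time cost that the message also mentions.

```python
samples = {
    "ascii": "hello",
    "han": "\u6f22\u5b57",
}

# (UTF-8 bytes, UTF-16 bytes) for each sample, without a BOM.
sizes = {
    name: (len(text.encode("utf-8")), len(text.encode("utf-16-be")))
    for name, text in samples.items()
}

# ASCII text is smaller in UTF-8; Han text is smaller in UTF-16.
print(sizes)
```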

Re: unicode on Linux

2003-10-23 Thread Markus Scherer
Stefan Persson wrote: Stephane Bortzmeyer wrote: I do not agree. It would mean *each* application has to normalize because it cannot rely on the kernel. It has huge security implications (two file names with the same name in NFC, so visually impossible to distinguish, but two different string of

Re: FW: Web Form: Other Question, Problem, or Feedback

2003-10-23 Thread Markus Scherer
If this is in C/C++ and your text is in Unicode, and you convert to a legacy (non-Unicode) codepage, then you could use the ICU conversion API. It has an option to turn non-mappable characters into numeric character references for HTML/XML. Please see
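The same idea, replacing unmappable characters with numeric character references, is sketched here with Python's `xmlcharrefreplace` error handler. This is not the ICU API the message refers to, only an analogous mechanism; the sample word is an assumption.

```python
text = "\u0142\u00f3d\u017a"  # a Polish word with two letters outside ISO 8859-1

# Characters that the target codepage lacks become decimal NCRs;
# characters it does have are encoded normally.
encoded = text.encode("iso-8859-1", errors="xmlcharrefreplace")

print(encoded)
```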

Re: unicode on Linux

2003-10-28 Thread Markus Scherer
You should use Unicode internally - UTF-16 when you use ICU or most other libraries and software. Externally, that is for protocols and files and other data exchange, you need to identify (input: determine; output: label) the encoding of the data and convert between it and Unicode. If you can

Re: osmanya script transliteration

2003-10-29 Thread Markus Scherer
[EMAIL PROTECTED] wrote: is it possible to design a program that takes the vaLue of the osmanya script and compare it with the somali latin script. then afterwards, displaying the equivalent. Generally, yes - this is called script transliteration. You could try this online at

Re: UAX #29 beta update (text breaks): apostrophe ./. H

2003-10-29 Thread Markus Scherer
Like German heute (=today) where the eu sounds like the oy in Spanish hoy? hui=hoy=heu(te)... Neat! markus Michael Everson wrote: At 23:07 +0100 2003-10-27, Philippe Verdy wrote: The historic French word hui is now completely obsoleted, and commonly found only in the single expression

Re: unicode on Linux

2003-10-29 Thread Markus Scherer
Philippe Verdy wrote: the input:determine strategy will work fine for UTF-8 or SCSU, provided that the leading BOM is explicitly encoded. ... With determine I do not mean to restrict to checking for a BOM. There are several ways to determine the input charset, depending on the protocol and
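
To make the point concrete that a BOM check is only one of several inputs, here is a minimal Python sketch. The function `determine_encoding`, its parameters, and the order of precedence (an explicit label from the protocol first, then a BOM, then a fallback) are assumptions for illustration, not a rule taken from the message or from any standard.

```python
import codecs

# UTF-32 BOMs are listed before UTF-16 ones because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def determine_encoding(data: bytes, declared: str = None,
                       fallback: str = "utf-8") -> str:
    """Pick an encoding name for incoming bytes.

    Assumed precedence (hypothetical): a label supplied by the protocol
    or file format wins, then a leading BOM, then the fallback.
    """
    if declared:
        return declared
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return fallback
```

A real application would plug in whatever its protocol provides as the `declared` value, for example a charset parameter in a header.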

Re: Collation contractions and reordering, was: Hebrew composition model, with cantillation marks

2003-11-03 Thread Markus Scherer
I suggest you try it out - http://oss.software.ibm.com/cgi-bin/icu/lx/en_US/utf-8/?_=heEXPLORE_CollationElements= ICU implements the UCA, including discontiguous contractions. markus Peter Kirk wrote: On 03/11/2003 07:01, Kent Karlsson wrote: However, the UCA does ignore differences between

Re: Collation contractions and reordering, was: Hebrew composition model, with cantillation marks

2003-11-04 Thread Markus Scherer
Peter Kirk wrote: On 03/11/2003 15:26, Markus Scherer wrote: I suggest you try it out - http://oss.software.ibm.com/cgi-bin/icu/lx/en_US/utf-8/?_=heEXPLORE_CollationElements= ICU implements the UCA, including discontiguous contractions. Thank you, Markus. Unfortunately the results are barely

Re: charset=utf8 and Mac mailers

2003-11-04 Thread Markus Scherer
[EMAIL PROTECTED] wrote: We are talking about charset value for the internet protocol here. It is a special narrow field of charset name. The value used by Internet protocol are defined by a well defined process- http://www.faqs.org/rfcs/rfc2278.html RFC 2278 - IANA Charset Registration

Re: UTF-16 inside UTF-8

2003-11-06 Thread Markus Scherer
I would like to comment on several statements that I have seen in this thread - - Migrating from UCS-2 to UTF-16: Doable, and has been done for many applications and libraries. - Difficult to handle UTF-16? Use ICU - it handles all of Unicode for collation, regular expressions, string
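
The practical difference between UCS-2 and UTF-16 is the handling of surrogate pairs for supplementary code points. The following Python sketch, which does not use ICU, shows the idea; the helper names `utf16_units` and `count_code_points` are hypothetical, and the counting function assumes well-formed UTF-16.

```python
def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def count_code_points(units: list) -> int:
    """Count code points in well-formed UTF-16 by skipping trail surrogates."""
    return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)


# U+10480 lies outside the BMP, so it needs a surrogate pair.
units = utf16_units("a\U00010480")
```

Code written for UCS-2 would treat the three code units here as three characters; UTF-16-aware code counts two.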

Re: Handy table of combining character classes

2003-11-11 Thread Markus Scherer
John Cowan wrote: Here's a little table of the combining classes, showing the value, the number of characters in the class, and a handy name (typically the one used in the Unicode Standard, or a CODE POINT NAME if there is only one; sometimes of my own invention). This is already published with
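
A tally like the one described in the quoted message can be approximated with Python's standard `unicodedata.combining`, which returns the canonical combining class of a character. The function name `combining_class_counts` is hypothetical, and the counts depend on the Unicode version bundled with the Python interpreter, so they will not necessarily match the table being discussed.

```python
import unicodedata
from collections import Counter


def combining_class_counts() -> Counter:
    """Count characters per non-zero canonical combining class."""
    counts = Counter()
    for cp in range(0x110000):
        ccc = unicodedata.combining(chr(cp))
        if ccc:  # class 0 (not reordered) is left out of the tally
            counts[ccc] += 1
    return counts


acute_class = unicodedata.combining("\u0301")      # COMBINING ACUTE ACCENT
dot_below_class = unicodedata.combining("\u0323")  # COMBINING DOT BELOW
```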

Re: FW: Web Form: Other Question, Problem, or Feedback

2003-11-14 Thread Markus Scherer
Try a) &#x2510; etc. b) Use an application to find those characters, copy them, and paste them into your HTML editor. For this you need to use a Unicode charset for your HTML document, see http://www.unicode.org/faq/unicode_web.html#9 Possible applications to use to find and copy the

Re: Unicode dictionary coding? UTF8, UTF32, etc

2003-11-14 Thread Markus Scherer
Theodore H. Smith wrote: Can someone give me some advice? If I was to write a dictionary class for Unicode, would I be better off writing it using a b-tree, or hash-bin system? Or maybe an array of pointers to arrays system? See John's reply. Tries of some sort should be good. I think there was
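
As a rough illustration of the trie suggestion, here is a minimal Python sketch of a dictionary keyed by Unicode strings, branching on one code point per level. The class name `Trie` and its methods are made up for the example; it is a plain uncompressed trie and not the specific structure that the thread or ICU uses.

```python
class Trie:
    """A simple trie mapping strings to values, one code point per level."""

    def __init__(self):
        self.children = {}     # code point (1-char str) -> child Trie
        self.terminal = False  # True if a key ends at this node
        self.value = None

    def insert(self, key: str, value) -> None:
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.terminal = True
        node.value = value

    def lookup(self, key: str, default=None):
        node = self
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return default
        return node.value if node.terminal else default


words = Trie()
words.insert("heute", "today")
words.insert("heu", "hay")
```

Keys that share a prefix share nodes, so lookup cost grows with the key length rather than with the number of entries.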
