Re: Roundtripping Solved

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Peter Kirk <[EMAIL PROTECTED]> writes: > Jill, again your solution is ingenious. But would it not work just > as well for Lars' purposes to use, instead of your string of > random characters, just ONE reserved code point followed by U+0xx? > Instead of asking the UTC to allocate a specific code

Re: Roundtripping in Unicode

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > OK, strcpy does not need to interpret UTF-8. But strchr probably should. No. Its argument is a byte, even though it's passed as type int. By "byte" here I mean "C char value, which is an octet in virtually all modern C implementations; the C standard doe

Re: Roundtripping in Unicode

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
"Arcane Jill" <[EMAIL PROTECTED]> writes: > Unix makes it possible for /you/ to change /your/ locale - but by > your reasoning, this is an error, unless all other users do so > simultaneously. Not necessarily: you can change the locale as long as it uses the same default encoding. By "error" I m

Re: Roundtripping in Unicode

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > Now, it is true that data from two applications using this technique can > become intermixed. But this is not something we should fear. On the > contrary, this is why I do want to standardize the approach. Because in most > cases what will happen is exact

Re: Roundtripping Solved

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
"Arcane Jill" <[EMAIL PROTECTED]> writes: > OBSERVATION - Requirement (4) is not met absolutely, however, > the probability of the UTF-8 encoding of this sequence occurring > "accidentally" at an arbitrary offset in an arbitrary octet stream > is approximately one in 2^384; Assuming that the distrib

Unicode filenames and other external strings on Unix - existing practice

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
I describe here languages which exclusively use Unicode strings. Some languages have both byte strings and Unicode strings (e.g. Python); there, byte strings are generally used for strings exchanged with the OS, and the programmer is responsible for the conversion if they wish to use Unicode. I consi

Re: Roundtripping in Unicode

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
"Arcane Jill" <[EMAIL PROTECTED]> writes: > If so, Marcin, what exactly is the error, and whose fault is it? It's an error to use locales with different encodings on the same system. -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/

Re: Roundtripping in Unicode

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > Hm, here lies the catch. According to UTC, you need to keep > processing the UNIX filenames as BINARY data. And, also according > to UTC, any UTF-8 function is allowed to reject invalid sequences. > Basically, you are not supposed to use strcpy to pro

Re: Roundtripping in Unicode

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
"Arcane Jill" <[EMAIL PROTECTED]> writes: > OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 -> > NOT-UTF-16 -> NOT-UTF-8 But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 -> NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an awkward way which would h

Re: Roundtripping in Unicode

2004-12-13 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > And once we understand that things are manageable and not as > frightening as it seems at first, then we can stop using this as an > argument against introducing 128 codepoints. People who will find > them useful should and will bother with the consequence

Re: Roundtripping in Unicode

2004-12-13 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > But, as I once already said, you can do it with UTF-8, you simply > keep the invalid sequences as they are, and really handle them > differently only when you actually process them or display them. UTF-8 is painful to process in the first place. You are

Re: Roundtripping in Unicode

2004-12-12 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: >> Please make up your mind: either they are valid and programs are >> required to accept them, or they are invalid and programs are required >> to reject them. > > I don't know what they should be called. The fact is there shouldn't be any. > And that cur

Re: Nicest UTF

2004-12-12 Thread Marcin 'Qrczak' Kowalczyk
"Philippe Verdy" <[EMAIL PROTECTED]> writes: > It's hard to create a general model that will work for all scripts > encoded in Unicode. There are too many differences. So Unicode just > appears to standardize a higher level of processing with combining > sequences and normalization forms that are

Re: Nicest UTF

2004-12-12 Thread Marcin 'Qrczak' Kowalczyk
"D. Starner" <[EMAIL PROTECTED]> writes: >> But demanding that each program which searches strings checks for >> combining classes is I'm afraid too much. > > How is it any different from a case-insensitive search? We started from string equality, which somehow changed into searching. Default st

Re: Nicest UTF

2004-12-12 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > My my, you are assuming all files are in the same encoding. Yes. Otherwise nothing shows filenames correctly to the user. > And what about all the references to the files in scripts? > In configuration files? Such files rarely use non-ASCII characters.

Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > All assigned codepoints do roundtrip even in my concept. > But unassigned codepoints are not valid data. Please make up your mind: either they are valid and programs are required to accept them, or they are invalid and programs are required to reject the

Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: >> It's essential that any UTF-n can be translated to any other without >> loss of data. Because it allows using an implementation of the given >> functionality which represents data in any form, not necessarily the >> form we have at hand, as long as corr

Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > The other name for this is roundtripping. Currently, Unicode allows > a roundtrip UTF-16=>UTF-8=>UTF-16. For any data. But there are > several reasons why a UTF-8=>UTF-16(32)=>UTF-8 roundtrip is more > valuable, even if it means that the other roundtrip i

Re: Nicest UTF

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
"D. Starner" <[EMAIL PROTECTED]> writes: >> > This implies that every programmer needs an in-depth knowledge of >> > Unicode to handle simple strings. >> >> There is no way to avoid that. > > Then there's no way that we're ever going to get reliable Unicode > support. This is probably true. I

Re: Nicest UTF

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
"Philippe Verdy" <[EMAIL PROTECTED]> writes: [...] > This was later amended in an errata for XML 1.0 which now says that > the list of code points whose use is *discouraged* (but explicitly > *not* forbidden) for the "Char" production is now: [...] Ugh, it's a mess... IMHO Unicode is partially t

Re: Nicest UTF

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
John Cowan <[EMAIL PROTECTED]> writes: >> > The XML/HTML core syntax is defined with fixed behavior of some >> > individual characters like '&', '<', quotation marks, and with special >> > behavior for spaces. >> >> The point is: what "characters" mean in this sentence. Code points? >> Combining

Re: Nicest UTF

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
"Philippe Verdy" <[EMAIL PROTECTED]> writes: > The XML/HTML core syntax is defined with fixed behavior of some > individual characters like '&', '<', quotation marks, and with special > behavior for spaces. The point is: what "characters" mean in this sentence. Code points? Combining character se

Re: Nicest UTF

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
"D. Starner" <[EMAIL PROTECTED]> writes: >> String equality in a programming language should not treat composed >> and decomposed forms as equal. Not this level of abstraction. > > This implies that every programmer needs an in-depth knowledge of > Unicode to handle simple strings. There is no way

Re: When to validate?

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
"Arcane Jill" <[EMAIL PROTECTED]> writes: > Here's something that's been bothering me. Suppose I write a function - > let's call it trim(), which removes leading and trailing spaces from a > string, represented as one of the UTFs. If I've understood this > correctly, I'm supposed to validate the

Re: Nicest UTF

2004-12-08 Thread Marcin 'Qrczak' Kowalczyk
John Cowan <[EMAIL PROTECTED]> writes: >> String equality in a programming language should not treat composed >> and decomposed forms as equal. Not this level of abstraction. > > Well, that assumes that there's a special "string equality" predicate, > as distinct from just having various predicate

Re: Nicest UTF

2004-12-08 Thread Marcin 'Qrczak' Kowalczyk
"D. Starner" <[EMAIL PROTECTED]> writes: > The semantics there are surprising, but that's true no matter what you > do. An NFC string + an NFC string may not be NFC; the resulting text > doesn't have N+M graphemes. Which implies that automatically NFC-ing strings as they are processed would be a
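
The quoted observation (an NFC string plus an NFC string may not be NFC) can be checked with a small sketch. This is only an illustration using Python's standard unicodedata module; the helper name is_nfc and the sample strings are assumptions, not something from the thread.

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """Return True if s is already in Normalization Form C."""
    return unicodedata.normalize("NFC", s) == s


# Hypothetical sample data: a base letter and a lone combining acute accent.
left = "e"
right = "\u0301"

# Each piece is NFC on its own, but the concatenation is not:
# normalizing it merges the two code points into U+00E9.
joined = left + right
renormalized = unicodedata.normalize("NFC", joined)
```

Here joined has two code points while renormalized has one, which is why normalizing automatically at every step would change lengths and indices under the program's feet.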

Re: Invalid UTF-8 sequences

2004-12-08 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > Quite close. Except for the fact that: > * U+EE93 is represented in UTF-32 as 0xEE93 > * U+EE93 is represented in UTF-16 as 0xEE93 > * U+EE93 is represented in UTF-8 as 0x93 (_NOT_ 0xEE 0xBA 0x93) Then it would be impossible to represent sequences li
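
For reference, the standard encodings of U+EE93 that the quoted proposal departs from can be reproduced with Python's built-in codecs. This is a small sketch; the variable names are illustrative only.

```python
# U+EE93 is a private-use code point in the Basic Multilingual Plane.
ch = "\uee93"

# Big-endian forms are used so that no byte order mark is prepended.
utf32 = ch.encode("utf-32-be")  # one 32-bit code unit
utf16 = ch.encode("utf-16-be")  # one 16-bit code unit
utf8 = ch.encode("utf-8")       # three bytes: lead byte plus two continuation bytes
```

In standard UTF-8 the result is the three-byte sequence 0xEE 0xBA 0x93, which is exactly the form the quoted proposal would replace with the single byte 0x93.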

Re: If only MS Word was coded this well

2004-12-08 Thread Marcin 'Qrczak' Kowalczyk
"Theodore H. Smith" <[EMAIL PROTECTED]> writes: >>> It's because code points have variable lengths in bytes, so >>> extracting individual characters is almost meaningless > > Same with UTF-16 and UTF-32. A character is multiple code-points, > remember? (decomposed chars?) > Nope. I've done tons o

Re: Nicest UTF

2004-12-08 Thread Marcin 'Qrczak' Kowalczyk
"D. Starner" <[EMAIL PROTECTED]> writes: > You could hide combining characters, which would be extremely useful if > we were just using Latin and Cyrillic scripts. It would need a separate API for examining the contents of a combining character. You can't avoid the sequence of code points comple

Re: Nicest UTF

2004-12-06 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: >> This is simply what you have to do. You cannot convert the data >> into Unicode in a way that says "I don't know how to convert this >> data into Unicode." You must either convert it properly, or leave >> the data in its original encoding (properly marke

Re: Nicest UTF

2004-12-05 Thread Marcin 'Qrczak' Kowalczyk
"Philippe Verdy" <[EMAIL PROTECTED]> writes: > The question is why you would need to extract the nth codepoint so > blindly. For example I'm scanning a string backwards (to remove '\n' at the end, to find and display the last N lines of a buffer, to find the last '/' or last '.' in a file name).
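
Backward scanning of the kind described (for example, removing a trailing '\n') can be sketched for UTF-8 as follows. The function names are hypothetical and the sketch assumes the buffer holds valid UTF-8; it is not taken from the message.

```python
def prev_code_point_start(buf: bytes, i: int) -> int:
    """Return the index of the first byte of the code point that ends just before i.

    Assumes buf is valid UTF-8 and 0 < i <= len(buf).
    """
    if i <= 0:
        raise ValueError("no code point before index 0")
    i -= 1
    # Continuation bytes have the bit pattern 10xxxxxx; skip back over them.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i


def strip_trailing_newline(buf: bytes) -> bytes:
    """Remove one trailing '\\n', located by stepping back one code point."""
    if buf:
        start = prev_code_point_start(buf, len(buf))
        if buf[start:] == b"\n":
            return buf[:start]
    return buf
```

Stepping back costs at most a few bytes per code point in UTF-8; the same operation has no cheap equivalent in a stateful encoding such as SCSU.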

Re: Nicest UTF

2004-12-05 Thread Marcin 'Qrczak' Kowalczyk
"Philippe Verdy" <[EMAIL PROTECTED]> writes: >> The point is that indexing should better be O(1). > > SCSU is also O(1) in terms of indexing complexity... It is not. You can't extract the nth code point without scanning the previous n-1 code points. > But individual characters do not always have
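
The cost of extracting the nth code point from a variable-width encoding can be illustrated with a sketch. It uses UTF-8 rather than SCSU, purely because UTF-8 is simpler to show; the function name is hypothetical and valid input is assumed.

```python
def nth_code_point(buf: bytes, n: int) -> str:
    """Return the n-th code point (0-based) of valid UTF-8 data.

    The loop has to walk over every earlier code point, so the cost
    grows with n instead of being constant.
    """
    count = -1
    start = None
    for i, b in enumerate(buf):
        if (b & 0xC0) != 0x80:  # ASCII byte or lead byte: a new code point starts
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return buf[start:i].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return buf[start:].decode("utf-8")
```

With a fixed-width representation the same lookup is a single array access, which is the property the reply is asking for.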

Re: Nicest UTF

2004-12-04 Thread Marcin 'Qrczak' Kowalczyk
"Philippe Verdy" <[EMAIL PROTECTED]> writes: > There's nothing that requires the string storage to use the same > "exposed" array, The point is that indexing should better be O(1). Not having a constant size per code point requires one of three things: 1. Using opaque iterators instead of intege

Re: Nicest UTF

2004-12-03 Thread Marcin 'Qrczak' Kowalczyk
"Philippe Verdy" <[EMAIL PROTECTED]> writes: > Decoding SCSU is very straightforward, But not for random access by code point index, which is needed by many string APIs.

Re: Nicest UTF

2004-12-02 Thread Marcin 'Qrczak' Kowalczyk
"Arcane Jill" <[EMAIL PROTECTED]> writes: > Oh for a chip with 21-bit wide registers! Not 21-bit but 20.087462841250343-bit :-)
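
The figure appears to be the base-2 logarithm of the number of Unicode code points (U+0000 through U+10FFFF). A quick check, assuming that reading of the joke:

```python
import math

# Number of code points in the Unicode code space: 17 planes of 65536 each.
CODE_POINTS = 0x10FFFF + 1

# Bits needed, as a real number, to distinguish that many values.
bits = math.log2(CODE_POINTS)
```

The result agrees with the quoted value to within floating-point rounding, so 21 bits is simply that number rounded up to a whole bit.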

Re: Nicest UTF

2004-12-01 Thread Marcin 'Qrczak' Kowalczyk
"Theodore H. Smith" <[EMAIL PROTECTED]> writes: > Assuming you had no legacy code. And no "handy" libraries either, [...] > What would be the nicest UTF to use? For internals of my language Kogut I've chosen a mixture of ISO-8859-1 and UTF-32. Normalized, i.e. a string with characters which fit in

Re: Unicode & IDNs

2004-11-09 Thread Marcin 'Qrczak' Kowalczyk
"Donald Z. Osborn" <[EMAIL PROTECTED]> writes: > Is anyone aware of URLs that use extended Latin characters as examples? http://żółw.pl/

Re: bit notation in ISO-8859-x is wrong

2004-10-10 Thread Marcin 'Qrczak' Kowalczyk
[EMAIL PROTECTED] (James Kass) writes: [...] > If there are eight bits, why shouldn't they be bits one > through eight? Because then the number of a bit doesn't correspond to the exponent of its weight, so I don't even know in which order they are specified (as many people order bits backwards,

Re: XML and Unicode interoperability comes before HTML or even SGML (was: Combining across markup?)

2004-08-14 Thread Marcin 'Qrczak' Kowalczyk
In a message of Sat, 14-08-2004, 12:35 +0200, Philippe Verdy wrote: > Simply because, for both Unicode and ISO/IEC 10646, the character > model includes the fact that ANY base character forms a combining > character sequence with ANY following combining character or ZW(N)J > character. Shouldn

Re: Combining across markup?

2004-08-12 Thread Marcin 'Qrczak' Kowalczyk
In a message of Thu, 12-08-2004, 13:00 -0400, John Cowan wrote: > > Even better yet: Have the W3C rephrase their demand that no element > > should start with a defective sequence (when considered in separate) > > as that no *block-level* element should etc., and leave things like > > , and ot

RE: Combining across markup? (Was: RE: sign for anti-neutrino - greek nu with diacritical line above workaround ?)

2004-08-10 Thread Marcin 'Qrczak' Kowalczyk
In a message of Tue, 10-08-2004, 18:33 +0100, Jon Hanna wrote: > By the rules of XML replacing ≯ with U+226F would mean the document was > no longer well-formed. Really? I don't have an XML spec handy, but character references like ̸ can't be processed before parsing tags, because &60; is the

Re: Microsoft Unicode Article Review

2004-08-06 Thread Marcin 'Qrczak' Kowalczyk
In a message of Thu, 05-08-2004, 15:52 -0500, John Tisdale wrote: > Yet, if you are working with an application that must parse and > manipulate text at the byte-level, the costliness of variable length > encoding will probably outweigh the benefits of ASCII compatibility. > In such a case the

Re: UAX 15 hangul composition

2004-08-03 Thread Marcin 'Qrczak' Kowalczyk
In a message of Tue, 03-08-2004, 13:47 +0200, Theo Veenker wrote: > Don't know if this has been asked/reported before, but is the example code > for hangul composition in UAX 15 correct? I reported it a month ago and got a response stating that "This has been forwarded to the right people, and

Re: Umlaut and Tréma, was: Variation selectors and vowel marks

2004-07-23 Thread Marcin 'Qrczak' Kowalczyk
In a message of Fri, 23-07-2004, 18:01 +0200, Philipp Reichmuth wrote: > However, to return to the original problem, I don't remember ever having > seen data where it would be necessary to distinguish between trema and > diaeresis in the data itself. A similar issue: a Polish encyclopaedia I

Re: Folding algorithm and canonical equivalence

2004-07-18 Thread Marcin 'Qrczak' Kowalczyk
In a message of Sat, 17-07-2004, 16:46 -0700, Asmus Freytag wrote: > I wonder whether that's truly intended, or whether it could be replaced > by a combination of > > AccentFolding > OtherDiacriticFolding > > where AccentFolding removes *all* nonspacing marks following Latin, Greek > or Cyri

Re: Looking for transcription or transliteration standards latin->arabic

2004-07-10 Thread Marcin 'Qrczak' Kowalczyk
In a message of Fri, 09-07-2004, 19:34 -0700, Asmus Freytag wrote: > o-slash, can be analyzed as o and slash, even though that's not done > canonically in Unicode. Allowing users outside Scandinavia to perform > fuzzy searches for words with this character is useful. > > In this view of fol

Re: Looking for transcription or transliteration standards latin->arabic

2004-07-06 Thread Marcin 'Qrczak' Kowalczyk
In a message of Tue, 06-07-2004, 10:50 +0100, Peter Kirk wrote: > I guess another similar change would be Danzig -> Gdansk, but > I don't know where the initial G came from so possibly the Polish form > is older than the German. A name with initial "Gd" is older than with "D": http://ency

Error in Hangul composition code

2004-07-05 Thread Marcin 'Qrczak' Kowalczyk
says: int SIndex = last - SBase; if (0 <= SIndex && SIndex < SCount && (SIndex % TCount) == 0) { int TIndex = ch - TBase; if (0 <= TIndex && TIndex <= TCount) { // make syllable of
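
For comparison, here is a minimal sketch of pairwise Hangul composition using the bounds found in the current Unicode Standard's sample code, where the trailing-consonant test is 0 < TIndex < TCount. The snippet above is truncated, so exactly which bound the report targets is an assumption; the function name compose_pair is hypothetical.

```python
# Constants from the Hangul syllable composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588
S_COUNT = L_COUNT * N_COUNT  # 11172


def compose_pair(last: str, ch: str):
    """Return the composed syllable for (last, ch), or None if they do not compose."""
    # Case 1: leading consonant + vowel -> LV syllable.
    l_index = ord(last) - L_BASE
    if 0 <= l_index < L_COUNT:
        v_index = ord(ch) - V_BASE
        if 0 <= v_index < V_COUNT:
            return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT)
    # Case 2: LV syllable + trailing consonant -> LVT syllable.
    s_index = ord(last) - S_BASE
    if 0 <= s_index < S_COUNT and s_index % T_COUNT == 0:
        t_index = ord(ch) - T_BASE
        # T_BASE itself is one below the first trailing consonant, so index 0
        # and index T_COUNT must both be excluded.
        if 0 < t_index < T_COUNT:
            return chr(ord(last) + t_index)
    return None
```

With the inclusive bounds shown in the quoted code, U+11A7 and U+11C3 would be accepted as trailing consonants, although neither composes under NFC.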

Re: Shape of the US Dollar Sign

2001-09-28 Thread Marcin 'Qrczak' Kowalczyk
Fri, 28 Sep 2001 09:58:39 -0600, Jim Melton <[EMAIL PROTECTED]> writes: > I believe this is nothing but a font/glyph/presentation issue. A font for text mode I once made had the dollar like this: . . . . . . . . . . . . # . # . . . . . . # . # . . . . . # # # # # . . . # # . # . # # .

Re: Any tools to convert HTML unicode to JAVA unicode

2001-09-22 Thread Marcin 'Qrczak' Kowalczyk
Wed, 19 Sep 2001 03:47:59 -0700 (PDT), MindTerm <[EMAIL PROTECTED]> writes: > I would like to ask any tools to convert HTML > unicode ( e.g. & # n n n n ) to JAVA unicode ( e.g. \u > n n n n ) ? Here is a Perl program which does this: perl -pe 'BEGIN {sub java ($) {sprintf "\\u%04x", $_[0]}}
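
Since the Perl one-liner is cut off here, the same conversion can be sketched in Python. This is not a reconstruction of the original program: the names html_to_java and java_escape are hypothetical, and emitting a surrogate pair for code points above U+FFFF is an assumption about what Java source expects.

```python
import re

# Matches decimal (&#321;) and hexadecimal (&#x141;) numeric character references.
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def java_escape(cp: int) -> str:
    """Format one code point as Java \\uXXXX escape(s)."""
    if cp > 0xFFFF:
        # Java escapes are 16-bit, so supplementary code points need a surrogate pair.
        cp -= 0x10000
        high = 0xD800 + (cp >> 10)
        low = 0xDC00 + (cp & 0x3FF)
        return f"\\u{high:04x}\\u{low:04x}"
    return f"\\u{cp:04x}"


def html_to_java(text: str) -> str:
    """Replace every numeric character reference in text with a Java escape."""
    def repl(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return java_escape(cp)
    return _NCR.sub(repl, text)
```

The sketch does not validate that the referenced number is a legal code point; values above 0x10FFFF would produce meaningless escapes.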

Re: 3rd-party cross-platform UTF-8 support

2001-09-22 Thread Marcin 'Qrczak' Kowalczyk
Thu, 20 Sep 2001 12:46:49 -0700 (PDT), Kenneth Whistler <[EMAIL PROTECTED]> writes: > If you are expecting better performance from a library that takes UTF-8 > API's and then does all its internal processing in UTF-8 *without* > converting to UTF-16, then I think you are mistaken. UTF-8 is a bad >

Re: CESU-8 vs UTF-8

2001-09-16 Thread Marcin 'Qrczak' Kowalczyk
Sun, 16 Sep 2001 01:14:06 -0700, Carl W. Brown <[EMAIL PROTECTED]> writes: > If it can be demonstrated that there is a real need for an encoding > like CESU-8 then it should be very different from UTF-8. How does > SCSU for example sort? SCSU encoding is non-deterministic and its representations

Re: PDUTR #26 posted

2001-09-14 Thread Marcin 'Qrczak' Kowalczyk
Thu, 13 Sep 2001 12:52:04 -0700, Asmus Freytag <[EMAIL PROTECTED]> writes: > UTF-32 does have the same byte order issues as UTF-16, except that > byte order is recognizable without a BOM. UTF-8 would be used for external communication almost exclusively. Especially as it's compatible with ASCII a

Re: PDUTR #26 posted

2001-09-14 Thread Marcin 'Qrczak' Kowalczyk
Two things I forgot to add: Thu, 13 Sep 2001 12:52:04 -0700, Asmus Freytag <[EMAIL PROTECTED]> writes: >>IMHO Unicode would have been a better standard if UTF-16 >>hadn't existed. > > Decidedly not. In fact, Unicode would not be widely implemented today. It's much simpler to migrate from byte e

Re: PDUTR #26 posted

2001-09-13 Thread Marcin 'Qrczak' Kowalczyk
Wed, 12 Sep 2001 11:08:41 -0700, Julie Doll Allen <[EMAIL PROTECTED]> writes: > Proposed Draft Unicode Technical Report #26: Compatibility Encoding > Scheme for UTF-16: 8-Bit (CESU-8) is now available at: > http://www.unicode.org/unicode/reports/tr26/ IMHO Unicode would have been a better standar

Re: [OT] o-circumflex

2001-09-10 Thread Marcin 'Qrczak' Kowalczyk
Mon, 10 Sep 2001 10:47:48 +0200, Marco Cimarosti <[EMAIL PROTECTED]> writes: > It's as weird as some Italian names for German cities: Aquisgrana > for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di > Baviera) for München. Interesting that Polish names of these cities are more like It

Re: Nonsense in http://www.unicode.org/Public/PROGRAMS/CVTUTF/CVTUTF.C?

2001-08-25 Thread Marcin 'Qrczak' Kowalczyk
Wed, 22 Aug 2001 15:59:15 -0700, Michael (michka) Kaplan <[EMAIL PROTECTED]> writes: >> Functions ConvertUCS4toUTF8 and ConvertUTF8toUCS4 use surrogates >> in UCS4. In particular ConvertUTF8toUCS4 converts a character above >> U+ into two UCS4 words. Why is this absurd there?! > > UCS-4 has n

Nonsense in http://www.unicode.org/Public/PROGRAMS/CVTUTF/CVTUTF.C?

2001-08-22 Thread Marcin 'Qrczak' Kowalczyk
Functions ConvertUCS4toUTF8 and ConvertUTF8toUCS4 use surrogates in UCS4. In particular ConvertUTF8toUCS4 converts a character above U+ into two UCS4 words. Why is this absurd there?! -- __("< Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/ \__/ ^^ SY

Re: COMMERCIAL AT

2001-07-15 Thread Marcin 'Qrczak' Kowalczyk
Sat, 14 Jul 2001 11:51:29 +0100, Michael Everson <[EMAIL PROTECTED]> writes: > References to animals are the most common. Germans, Dutch, Finns, > Hungarians, Poles and South Africans see it as a monkey tail. Indeed it's commonly called "monkey" in Polish (in parallel with "at"), but some call

Re: More about SCSU (was: Re: A UTF-8 based News Service)

2001-07-13 Thread Marcin 'Qrczak' Kowalczyk
Fri, 13 Jul 2001 03:01:10 EDT, [EMAIL PROTECTED] <[EMAIL PROTECTED]> writes: > Unfortunately, you don't hear much about SCSU, and in particular > the Unicode Consortium doesn't really seem to promote it much > (although they may be trying to avoid the "too many UTF's" syndrome). SCSU doesn't look

Re: Terms "constructed script", "invented script" (was: FW: Re: Shavian)

2001-07-11 Thread Marcin 'Qrczak' Kowalczyk
7 Jul 2001 11:01:18 GMT, Marcin 'Qrczak' Kowalczyk <[EMAIL PROTECTED]> writes: > I put a sample at <http://qrczak.ids.net.pl/vi-001.gif> Now I put a prettier version there: with variable line width, serifs, and by a slightly improved sizing engine (enlargement of rounde

Re: Terms "constructed script", "invented script" (was: FW: Re: Shavian)

2001-07-07 Thread Marcin 'Qrczak' Kowalczyk
In a message dated 2001-07-06 0:31:39 Pacific Daylight Time, [EMAIL PROTECTED] writes: > I wonder: why aren't languages with simple syllabic structures > written in hiragana? It seems to be built for them. I am using my own script inspired by hiragana 10 years ago for writing Polish. It looks

Re: validity of lone surrogates

2001-07-03 Thread Marcin 'Qrczak' Kowalczyk
Tue, 3 Jul 2001 11:19:05 +0100, Michael Everson <[EMAIL PROTECTED]> writes: >>I would be glad if the resolution allowed UTF-8 and UTF-32 encoders and >>decoders to not worry about surrogates at all. Please leave surrogate >>issues to UTF-16. > > But what if I want to put up a Web page in Etruscan

Re: validity of lone surrogates (was Re: Unicode surrogates: just say no!)

2001-07-03 Thread Marcin 'Qrczak' Kowalczyk
Tue, 3 Jul 2001 01:50:56 -0700, Michael (michka) Kaplan <[EMAIL PROTECTED]> writes: >> It's a pity that UTF-16 doesn't encode characters up to U+F, such >> that code points corresponding to lone surrogates can be encoded as >> pairs of surrogates. > > Unfortunately, we would then be stuck wit

Re: validity of lone surrogates (was Re: Unicode surrogates: just say no!)

2001-07-03 Thread Marcin 'Qrczak' Kowalczyk
27 Jun 2001 13:38:33 +0100, Gaute B Strokkenes <[EMAIL PROTECTED]> writes: > I would be indebted if any of the experts who hang out on the > unicode list could sort out this confusion. I would be glad if the resolution allowed UTF-8 and UTF-32 encoders and decoders to not worry about surrogates a

Re: How does Python Unicode treat surrogates?

2001-06-25 Thread Marcin 'Qrczak' Kowalczyk
Mon, 25 Jun 2001 07:24:28 -0700, Mark Davis <[EMAIL PROTECTED]> writes: > In most people's experience, it is best to leave the low level interfaces > with indices in terms of code units, then supply some utility routines that > tell you information about code points. It's yet better to work on ch

Re: How will software source code represent 21 bit unicode characters?

2001-04-17 Thread Marcin 'Qrczak' Kowalczyk
Tue, 17 Apr 2001 07:33:16 +0100, William Overington <[EMAIL PROTECTED]> writes: > In Java source code one may currently represent a 16 bit unicode character > by using \uhhhh where each h is any hexadecimal character. > > How will Java, and maybe other languages, represent 21 bit unicode > chara

Re: Latin digraph characters

2001-03-03 Thread Marcin 'Qrczak' Kowalczyk
Wed, 28 Feb 2001 13:35:17 -0800 (GMT-0800), Pierpaolo BERNARDI <[EMAIL PROTECTED]> writes: > The initial character of the name is transliterated as CH in English, > TCH in French, TSCH in German, C or CI in Italian, C WITH CARON in the > official Russian transliteration. And CZ in Polish. --

Re: [OT] Unicode-compatible SQL?

2001-02-05 Thread Marcin 'Qrczak' Kowalczyk
Mon, 5 Feb 2001 08:20:43 -0800 (GMT-0800), Mark Davis <[EMAIL PROTECTED]> writes: > The topic came up in a UTC meeting some time ago, a "UTF-8S". The > motivation was for performance (having a form that reproduces the > binary order of UTF-16). This is unfair: it slows down the conversion UTF-8 <
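
The ordering difference behind the "UTF-8S" motivation can be seen with two sample characters. This is a small sketch with arbitrarily chosen code points: binary comparison of UTF-8 follows code point order, while binary comparison of UTF-16 code units does not once supplementary characters are involved.

```python
# Two sample code points: one near the top of the BMP, one supplementary.
a = "\uff5e"      # U+FF5E
b = "\U00010000"  # U+10000, encoded in UTF-16 as the surrogate pair D800 DC00

# Python compares str values by code point.
code_point_order = a < b

# Byte-wise comparison of the UTF-8 forms agrees with code point order.
utf8_order = a.encode("utf-8") < b.encode("utf-8")

# Byte-wise comparison of big-endian UTF-16 is the same as comparing
# 16-bit code units, and 0xFF5E sorts after the lead surrogate 0xD800.
utf16_order = a.encode("utf-16-be") < b.encode("utf-16-be")
```

Here code_point_order and utf8_order are both true while utf16_order is false, which is the mismatch a UTF-16-ordered variant of UTF-8 was meant to paper over.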

Re: Transcriptions of "Unicode"

2001-01-29 Thread Marcin 'Qrczak' Kowalczyk
Fri, 12 Jan 2001 07:28:18 -0800 (GMT-0800), Mark Davis <[EMAIL PROTECTED]> writes: > According to the references I have, the prefix "uni" is directly from > Latin while the word "code" is through French. The Indo-European would > have been *oi-no-kau-do ("give one strike"): *kau apparently being >

Re: Transcriptions of "Unicode"

2001-01-29 Thread Marcin 'Qrczak' Kowalczyk
Mon, 15 Jan 2001 13:09:47 -0800 (GMT-0800), G. Adam Stanislav <[EMAIL PROTECTED]> writes: > I would not be surprised if speakers of certain Slavic languages even > changed the SPELLING to Unikod (with an acute over the [o]), as they > have done with other imported words (such as futbal for footba

Re: Teletext mappings

2001-01-27 Thread Marcin 'Qrczak' Kowalczyk
Sun, 21 Jan 2001 09:29:56 -0800 (GMT-0800), Rob Hardy <[EMAIL PROTECTED]> writes: > > [Polish set] contains the line > > 0x5B 0x01B5 # LATIN CAPITAL LETTER Z WITH STROKE > > should supposedly be > > 0x5B 0x017B # LATIN CAPITAL LETTER Z WITH DOT ABOVE > > My teletext spec definitely has a Z with

Re: [novice question] Characters to languages mapping list?

2000-11-10 Thread Marcin 'Qrczak' Kowalczyk
Fri, 10 Nov 2000 06:49:49 -0800 (GMT-0800), Adam Twardoch <[EMAIL PROTECTED]> writes: > I'm looking for electronically available resources which show > character repertoires used in various languages. I need it for > orientational purposes only, so the list doesn't have to be 100% > correct. I co

Re: Character properties

2000-10-24 Thread Marcin 'Qrczak' Kowalczyk
Mon, 23 Oct 2000 09:48:52 +0100, [EMAIL PROTECTED] <[EMAIL PROTECTED]> writes: > > isDigit:Nd > > isHexDigit: '0'..'9', 'A'..'F', 'a'..'f' > > isDecDigit: '0'..'9' > > isOctDigit: '0'..'7' > > The definition "Nd" is what I would have proposed for isDecDigit. The name isDecDigit is confusing
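
The four predicates quoted above can be sketched as follows. The thread concerns Haskell; this is only a Python rendering of the quoted definitions, with hypothetical snake_case names, relying on unicodedata.category for the general category.

```python
import unicodedata


def is_digit(c: str) -> bool:
    """Any character whose general category is Nd (decimal digit, any script)."""
    return unicodedata.category(c) == "Nd"


def is_dec_digit(c: str) -> bool:
    """ASCII decimal digits only: '0'..'9'."""
    return "0" <= c <= "9"


def is_oct_digit(c: str) -> bool:
    """ASCII octal digits only: '0'..'7'."""
    return "0" <= c <= "7"


def is_hex_digit(c: str) -> bool:
    """ASCII hexadecimal digits: '0'..'9', 'A'..'F', 'a'..'f'."""
    return is_dec_digit(c) or "A" <= c <= "F" or "a" <= c <= "f"
```

For example, ARABIC-INDIC DIGIT THREE (U+0663) satisfies is_digit but not is_dec_digit, which is the distinction the naming discussion turns on.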

Re: Character properties

2000-10-21 Thread Marcin 'Qrczak' Kowalczyk
Wed, 11 Oct 2000 07:15:05 -0800 (GMT-0800), Mark Davis <[EMAIL PROTECTED]> writes: > Here is my take on the way Unicode general categories should be > mapped to POSIX ones. Reiterated, here is my compilation of mapping of properties proposed for Haskell: isAssigned: all except Cs, Cn isControl:

Re: Character properties

2000-10-08 Thread Marcin 'Qrczak' Kowalczyk
Wed, 4 Oct 2000 18:48:17 -0700 (PDT), Kenneth Whistler <[EMAIL PROTECTED]> writes: > It is quite clear that many important character properties cannot > be deduced from the General Category values in UnicodeData.txt alone. What a pity. Especially as it does work for some properties and I would li

Re: Character properties

2000-09-23 Thread Marcin 'Qrczak' Kowalczyk
Fri, 22 Sep 2000 22:11:44 -0800 (GMT-0800), Roozbeh Pournader <[EMAIL PROTECTED]> writes: > intToDigit should look at the locale to select the preferred digit > form, I think. Sorry, it cannot apply to Haskell, because it's a functional language. It must work the same way all the time, unless it

Re: Character properties

2000-09-21 Thread Marcin 'Qrczak' Kowalczyk
Thu, 21 Sep 2000 23:55:24 +0330 (IRT), Roozbeh Pournader <[EMAIL PROTECTED]> writes: > > isDigit intentionally recognizes ASCII digits only. IMHO it's more > > often needed and this is what the Haskell 98 Report says. (But I > > don't follow the report in some other cases.) > > Would you please

Character properties

2000-09-21 Thread Marcin 'Qrczak' Kowalczyk
I am trying to improve character properties handling in the language Haskell. What should the following functions return, i.e. what is the most standard/natural/preferred mapping between Unicode character categories and predicates like isalpha etc.? What else should be provided? Here are definitions t