Re: Roundtripping Solved
Arcane Jill [EMAIL PROTECTED] writes:

> OBSERVATION - Requirement (4) is not met absolutely, however, the
> probability of the UTF-8 encoding of this sequence occurring accidentally
> at an arbitrary offset in an arbitrary octet stream is approximately one
> in 2^384;

Assuming that the distribution of sequences of characters is uniform. But
it's not! As soon as you start using this encoding somewhere, the
probability of this sequence appearing rises dramatically. If you convert
UTF-8 -> UTF-32 using modified rules, and UTF-32 -> UTF-8 using standard
rules, then you get this sequence without waiting 2^340 years.

--
 __(  Marcin Kowalczyk
 \__/ [EMAIL PROTECTED]
  ^^  http://qrnik.knm.org.pl/~qrczak/
Re: Roundtripping in Unicode
Arcane Jill [EMAIL PROTECTED] writes:

> Unix makes it possible for /you/ to change /your/ locale - but by your
> reasoning, this is an error, unless all other users do so simultaneously.

Not necessarily: you can change the locale as long as it uses the same
default encoding.

By error I mean a bad idea. The system does not prevent you from changing
the locale to a different encoding. But then you are on your own and
various things can break: terminal output will be mangled, you can't enter
characters used in a different encoding from the keyboard, text files will
be illegible, and Unicode programs which process texts may reject your
data or even filenames. If you still need to change encodings, it's safer
to use ASCII-only filenames.

> This situation is temporary.

Well, it may last 10 more years or so, but it will probably gradually
improve.

First, more protocols and file formats are becoming aware of character
encodings and either label them explicitly or use a known encoding
(generally some Unicode encoding scheme). This applies especially to
protocols for data interchange over the Internet: WWW, email, usenet,
modern instant messaging protocols like Jabber. Some old protocols remain
encoding-ignorant, e.g. irc and finger. GNOME 1 used the locale encoding;
GNOME 2 uses UTF-8. Copying and pasting text in X window now has a
separate API which uses UTF-8. While the irc protocol doesn't specify the
encoding, the irssi client can now recode texts itself to conform to the
customs of particular channels.

Second, UTF-8 is becoming more usable as the default encoding specified by
the locale. I don't use it now because too many things still break, but
it's improving: there are things which didn't work just a few years ago
and work now. Terminal emulators in X widely support UTF-8 mode now. The
curses library now has a working wide character API. Emacs and vi work in
UTF-8 (Emacs still has problems). Readline now works in UTF-8. Localized
messages (gettext) are now recoded automatically.

Other programs still don't work. Bash works, while zsh and ksh don't. Most
full-screen text programs use the narrow character curses API and don't
work in UTF-8. The brokenness of interactive interpreters of various
languages varies.

BTW, in the wide character curses API (the only way curses can work in a
UTF-8 terminal), characters are expressed as sequences of wchar_t (base
char + some combining chars, possibly double width). This means that you
must somehow translate filenames to this representation in order to
display them - same as with a Unicode-based GUI. It's meaningless to
render arbitrary bytes on the terminal, and you can't force curses to emit
the original byte sequences which form filenames (which would be a bad
idea for control characters anyway). By legitimizing non-UTF-8 filenames
in a UTF-8 system you increase the problems such applications must
overcome: not only do they have to show control characters somehow, but
also invalid UTF-8.

> But it goes beyond that. Copy a file onto a floppy disc and then
> physically take that floppy disc to a different Unix machine and log on
> as guest and insert the disc ... Will the filename look the same?

Depends on the filesystem and the way it is mounted. For example if it's
FAT with long filenames (which I think is the usual format for floppies
even on Unix), filenames can be recoded by the kernel: you specify the
encoding to present filenames in and the encoding of short names. I don't
know what happens with filenames which are not expressible in the selected
encoding.

In this way filenames may automatically convert between systems which use
different default encodings, preserving the character semantics rather
than the byte representation. Of course file contents will not be
converted.

--
 __(  Marcin Kowalczyk
 \__/ [EMAIL PROTECTED]
  ^^  http://qrnik.knm.org.pl/~qrczak/
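[Editorial aside: the translation step described above - rendering a
filename which is not valid UTF-8 in a displayable form instead of
emitting the raw bytes - can be sketched in Python, which is not one of
the tools discussed in this thread; the filename is a made-up example.]

```python
# Sketch (Python, illustration only; the filename is made up):
# a filename containing bytes that are not valid UTF-8 must be
# translated into something displayable. Here each undecodable
# byte becomes a literal backslash escape instead of being sent
# raw to the terminal.
name = b"report\xb1\xe6.txt"                      # not valid UTF-8
shown = name.decode("utf-8", "backslashreplace")  # 'report\xb1\xe6.txt'
                                                  # with literal backslashes
```

The original bytes never reach the terminal; control characters and
invalid sequences both end up as printable escapes.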
Re: Roundtripping in Unicode
Lars Kristan [EMAIL PROTECTED] writes:

> OK, strcpy does not need to interpret UTF-8. But strchr probably should.

No. Its argument is a byte, even though it's passed as type int. By byte
here I mean a C char value, which is an octet in virtually all modern C
implementations; the C standard doesn't guarantee this but POSIX does.

Many C functions are not suitable for processing UTF-8, or are suitable
only as long as we consider all non-ASCII characters opaque bags of bytes.
For example isalpha takes a byte, toupper transforms a byte to a byte, and
strncpy copies up to n bytes even if that cuts in the middle of a UTF-8
character.

There are wide character versions like iswalpha and towupper. But then
data must be converted from a sequence of char to a sequence of wchar_t.
Standard and semi-standard functions which do this conversion for UTF-8
reject invalid UTF-8 (they all have a means of reporting errors). The
assumption that wchar_t has something to do with Unicode is not as common
as the assumption that char means bytes. I don't know whether FreeBSD
finally changed their wchar_t to Unicode. And it can be UTF-32 (Unix) or
UTF-16 (Windows).

> But then all languages are supposed to provide functions for processing
> opaque strings in addition to their Unicode functions.

Yes, IMHO all general-purpose languages should support processing arrays
of bytes, in addition to Unicode strings. It's not clear however what the
API for filenames should look like, especially if they wish to be portable
to Windows.

> But sooner or later you need to incorporate the filename in some UTF-8
> text. An error report, for example.

While it's not clear what a well-behaved application should do by default,
in order to be 100% robust and preserve all information you must change
the usual conventions anyway. Remember that any byte except \0 and / is
valid in a filename, so you must either escape some characters, or delimit
the filename with \0, or prefix it with the length, or something like
this. Backup software should do this and not pay attention to the locale.
But for end-user software like an image viewer, processing arbitrary
filenames is less important.

> What are stdin, stdout and argv (command line parameters) when a process
> is running in a UTF-8 locale?

Technically they are binary (command line arguments must not contain zero
bytes). Users expect stdin and stdout to be treated as text or binary
depending on the program, while command line arguments are generally
interpreted as text or filenames.

--
 __(  Marcin Kowalczyk
 \__/ [EMAIL PROTECTED]
  ^^  http://qrnik.knm.org.pl/~qrczak/
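[Editorial aside: the strncpy hazard mentioned above can be illustrated in
Python, standing in for C's byte semantics; the helper name is mine.]

```python
# Illustration (Python standing in for C byte semantics): copying or
# truncating a UTF-8 string at an arbitrary *byte* boundary, as strncpy
# does, can cut a multibyte character in half and leave invalid UTF-8.
def is_valid_utf8(data: bytes) -> bool:
    """Validate by attempting a strict decode, as the conversion
    functions to wchar_t mentioned above would."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

raw = "café".encode("utf-8")   # b'caf\xc3\xa9' - 5 bytes, 'é' takes two
truncated = raw[:4]            # byte-wise cut in the middle of 'é'
```

Here `is_valid_utf8(raw)` holds but `is_valid_utf8(truncated)` does not:
the truncated copy is no longer well-formed UTF-8.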
Re: Roundtripping Solved
Peter Kirk [EMAIL PROTECTED] writes:

> Jill, again your solution is ingenious. But would it not work just as
> well for Lars' purposes to use, instead of your string of random
> characters, just ONE reserved code point followed by U+00xx? Instead of
> asking the UTC to allocate a specific code point for this (which it
> probably will not do), he can use either U+FFFE or U+FFFF, which are
> intended for process internal uses, but are not permitted for
> interchange. Let's call the chosen non-character INVALID.
>
> Perhaps what is needed is a shift of viewpoint, not a big technical
> change. Don't call it a UTF. Call it escaping. Don't reserve 128 code
> points. Use an existing but rare code point to prefix a byte escaped
> among code points, and escape the escape if it's found in the original.
> Perhaps the character could be ESC (27) or SUB (26), followed by U+00nn.

Well, a viewpoint shift doesn't solve all problems: it's still dangerous
for interoperability. If the programmer doesn't do anything special when
writing filenames to a file, then instead of an error which indicates that
the goal doesn't have a natural solution, he gets an escaped string which
will not be understood by other applications which don't use this
convention. If the filename is passed to a part of the program which
doesn't use this convention, then it will break too.

If something cannot be done reliably, it's better to signal the problem
immediately than to hide it and misbehave later.

--
 __(  Marcin Kowalczyk
 \__/ [EMAIL PROTECTED]
  ^^  http://qrnik.knm.org.pl/~qrczak/
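[Editorial aside: the escaping convention proposed above can be sketched
in Python. The prefix choice (SUB, U+001A) is one of the post's own
suggestions; the function names are mine, and this is not a standardized
scheme - which is exactly the interoperability concern raised in the
reply.]

```python
# Hypothetical sketch of the escaping convention discussed above:
# a rare prefix character (here SUB, U+001A) marks an escaped raw
# byte, and the prefix itself is doubled when it occurs in the
# decoded text. Function names are mine, not from any standard.
ESC = "\x1a"

def escape_bytes(data: bytes) -> str:
    """Decode UTF-8 where possible; represent each invalid byte as
    ESC + U+00nn, and double any literal ESC ("escape the escape")."""
    out, i = [], 0
    while i < len(data):
        ch = None
        for n in (1, 2, 3, 4):            # try 1..4 byte characters
            try:
                ch = data[i:i+n].decode("utf-8")
                break
            except UnicodeDecodeError:
                ch = None
        if ch is None:                    # invalid byte: escape it
            out.append(ESC + chr(data[i]))
            i += 1
        else:                             # valid character
            out.append(ESC + ESC if ch == ESC else ch)
            i += n
    return "".join(out)

def unescape(text: str) -> bytes:
    """Invert escape_bytes, recovering the original byte string."""
    out, i = bytearray(), 0
    while i < len(text):
        if text[i] == ESC:
            nxt = text[i + 1]
            out.extend(b"\x1a" if nxt == ESC else bytes([ord(nxt)]))
            i += 2
        else:
            out.extend(text[i].encode("utf-8"))
            i += 1
    return bytes(out)
```

The roundtrip holds for arbitrary byte strings, and valid UTF-8 without
the prefix character passes through unchanged - but, as the reply notes,
only among programs that agree on this private convention.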
Re: Roundtripping in Unicode
Arcane Jill [EMAIL PROTECTED] writes:

> OBSERVATION - Roundtripping is possible in the direction
> NOT-UTF-8 -> NOT-UTF-16 -> NOT-UTF-8

But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 ->
NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an awkward
way which would happen to exclude those subsequences of non-characters
which would form a valid UTF-8 fragment.

Unicode has the following property. Consider sequences of valid Unicode
characters: from the range U+0000..U+10FFFF, excluding non-characters
(i.e. U+nFFFE and U+nFFFF for n from 0 to 0x10, and U+FDD0..U+FDEF) and
surrogates. Any such sequence can be encoded in any UTF-n, and nothing
else is expected from UTF-n. With the exception of the set of
non-characters being irregular and IMHO too large (why exclude
U+FDD0..U+FDEF?!), and a weird top limit caused by UTF-16, this gives a
precise and unambiguous set of values for which encoders and decoders are
supposed to work. Well, except for the non-obvious treatment of a BOM (at
which level should it be stripped? does this include UTF-8?).

A variant of UTF-8 which includes all byte sequences yields a much less
regular set of abstract string values. Especially if we consider that
11101111 10111111 10111110 binary (the UTF-8 encoding of U+FFFE) is not
valid UTF-8, as much as 0xFFFE is not valid UTF-16 (it's a reversed BOM;
it must be invalid in order for a BOM to fulfill its role).

Question: should a new programming language which uses Unicode for string
representation allow non-characters in strings? Argument for allowing
them: otherwise they are completely useless, except U+FFFE for BOM
detection. Argument for disallowing them: they make UTF-n inappropriate
for serialization of arbitrary strings, and thus non-standard extensions
of UTF-n must be used for serialization.

--
 __(  Marcin Kowalczyk
 \__/ [EMAIL PROTECTED]
  ^^  http://qrnik.knm.org.pl/~qrczak/
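[Editorial aside: the set of "valid Unicode characters" defined above can
be written as a predicate; a minimal sketch in Python, with a function
name of my choosing.]

```python
def is_valid_unicode_char(cp: int) -> bool:
    """True for the code points the post counts as valid characters:
    U+0000..U+10FFFF, minus surrogates (U+D800..U+DFFF), minus
    non-characters (U+FDD0..U+FDEF plus U+nFFFE/U+nFFFF on each of
    the 17 planes)."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    if 0xD800 <= cp <= 0xDFFF:        # surrogates
        return False
    if 0xFDD0 <= cp <= 0xFDEF:        # the irregular non-character block
        return False
    if cp & 0xFFFE == 0xFFFE:         # U+nFFFE and U+nFFFF, n = 0..0x10
        return False
    return True
```

This excludes exactly 66 non-characters (32 in U+FDD0..U+FDEF and 2 per
plane times 17 planes) plus 2048 surrogates.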
Re: Roundtripping in Unicode
Lars Kristan [EMAIL PROTECTED] writes:

> Hm, here lies the catch. According to UTC, you need to keep processing
> the UNIX filenames as BINARY data. And, also according to UTC, any UTF-8
> function is allowed to reject invalid sequences. Basically, you are not
> supposed to use strcpy to process filenames.

No: strcpy passes raw bytes; it does not interpret them according to the
locale. It's not a UTF-8 function.

--
 __(  Marcin Kowalczyk
 \__/ [EMAIL PROTECTED]
  ^^  http://qrnik.knm.org.pl/~qrczak/
Re: Roundtripping in Unicode
Arcane Jill [EMAIL PROTECTED] writes:

> If so, Marcin, what exactly is the error, and whose fault is it?

It's an error to use locales with different encodings on the same system.

--
 __(  Marcin Kowalczyk
 \__/ [EMAIL PROTECTED]
  ^^  http://qrnik.knm.org.pl/~qrczak/
Unicode filenames and other external strings on Unix - existing practice
I describe here languages which exclusively use Unicode strings. Some
languages have both byte strings and Unicode strings (e.g. Python); there
byte strings are generally used for strings exchanged with the OS, and the
programmer is responsible for the conversion if he wishes to use Unicode.

I consider situations where the encoding is implicit. For I/O of file
contents it's always possible to set the encoding explicitly somehow.

Corrections are welcome. This is mostly based on experimentation.

Java (Sun)
----------
Strings are UTF-16. Filenames are assumed to be in the locale encoding.
a) Interpreting. Bytes which cannot be converted are replaced by U+FFFD.
b) Creating. Characters which cannot be converted are replaced by ?.
Command line arguments and standard I/O are treated in the same way.

Java (GNU)
----------
Strings are UTF-16. Filenames are assumed to be in Java-modified UTF-8.
a) Interpreting. If a filename cannot be converted, a directory listing
   contains a null instead of a string object.
b) Creating. All Java characters are representable in Java-modified
   UTF-8. Obviously not all potential filenames can be represented.
Command line arguments are interpreted according to the locale. Bytes
which cannot be converted are skipped.
Standard I/O works in ISO-8859-1 by default. Obviously all input is
accepted. On output characters above U+00FF are replaced by ?.

C# (mono)
---------
Strings are UTF-16. Filenames use the list of encodings from the
MONO_EXTERNAL_ENCODINGS environment variable, with UTF-8 implicitly added
at the end. These encodings are tried in order.
a) Interpreting. If a filename cannot be converted, it's skipped in a
   directory listing.
   The documentation says that if a filename, a command line argument
   etc. looks like valid UTF-8, it is treated as such first, and
   MONO_EXTERNAL_ENCODINGS is consulted only in remaining cases. The
   reality seems to not match this (mono-1.0.5).
b) Creating. If UTF-8 is used, non-characters are converted to
   pseudo-UTF-8, U+0000 throws an exception (System.ArgumentException:
   Path contains invalid chars), paired surrogates are treated correctly,
   and an isolated surrogate causes an internal error:
      ** ERROR **: file strenc.c: line 161 (mono_unicode_to_external):
      assertion failed: (utf8!=NULL)
      aborting...
Command line arguments are treated in the same way, except that if an
argument cannot be converted, the program dies at start:
   [Invalid UTF-8]
   Cannot determine the text encoding for argument 1 (xxx\xb1\xe6\xea).
   Please add the correct encoding to MONO_EXTERNAL_ENCODINGS and try
   again.
Console.WriteLine emits UTF-8. Paired surrogates are treated correctly;
non-characters and unpaired surrogates are converted to pseudo-UTF-8.
Console.ReadLine interprets text as UTF-8. Bytes which cannot be
converted are skipped.

Perl
----
Depending on the convention used by a particular function and on imported
packages, a Perl string is treated either as Perl-modified Unicode (with
character values up to 32 bits or 64 bits depending on the architecture)
or as an unspecified locale encoding. It has two internal
representations: ISO-8859-1 and Perl-modified UTF-8 (with an extended
range).
If every Perl string is assumed to be a Unicode string, then filenames
are effectively ISO-8859-1.
a) Interpreting. Characters up to 0xFF are used.
b) Creating. If the filename has no characters above 0xFF, it is
   converted to ISO-8859-1. Otherwise it is converted to Perl-modified
   UTF-8 (all characters, not just those above 0xFF).
Command line arguments and standard I/O are treated in the same way, i.e.
ISO-8859-1 on input and a mixture of ISO-8859-1 and UTF-8 on output,
depending on the contents. This behavior is modifiable by importing
various packages and using interpreter invocation flags. When Perl is
told that command line arguments are UTF-8, the behavior for strings
which cannot be converted is inconsistent: sometimes the string is
treated as ISO-8859-1, sometimes an error is signalled.

Haskell
-------
Haskell nominally uses Unicode. There is no conversion framework
standardized or implemented yet though. Implementations which support
more than 256 characters currently assume ISO-8859-1 for filenames,
command line arguments and all I/O, taking the lowest 8 bits of a
character code on output.

Common Lisp: Clisp
------------------
The Common Lisp standard doesn't say anything about string encoding. In
Clisp strings are UTF-32 (internally optimized as UCS-2 and ISO-8859-1
when possible). Any character code up to U+10FFFF is allowed, including
non-characters and isolated surrogates.
Filenames are assumed to be in the locale encoding.
a) Interpreting. If a byte cannot be converted, an exception is thrown.
b) Creating. If a character cannot be converted, an exception is thrown.

Kogut (my language; this is the current state - can be changed)
---------------------------------------------------------------
Strings are UTF-32 (internally optimized as ISO-8859-1 when possible).
Currently any
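[Editorial aside: three strategies recur in the survey above - replace
with U+FFFD, skip, or throw. They correspond to decoder error handlers; a
sketch in Python (which is not one of the surveyed implementations),
reusing the invalid bytes from the mono example.]

```python
# Sketch (Python, not one of the surveyed languages) of the three
# recurring strategies, using the invalid bytes from the mono example.
bad = b"xxx\xb1\xe6\xea"

# Replace undecodable bytes with U+FFFD, like Java (Sun):
replaced = bad.decode("utf-8", "replace")     # 'xxx' + three U+FFFD

# Skip undecodable bytes, like GNU Java's command line handling:
skipped = bad.decode("utf-8", "ignore")       # 'xxx'

# Throw an exception, like Clisp:
try:
    bad.decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True
```

Note that the "replace" and "ignore" strategies are lossy: distinct
invalid inputs can produce the same output string.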
Re: Roundtripping in Unicode
Lars Kristan [EMAIL PROTECTED] writes:

> But, as I once already said, you can do it with UTF-8, you simply keep
> the invalid sequences as they are, and really handle them differently
> only when you actually process them or display them.

UTF-8 is painful to process in the first place. You are making it even
harder by demanding that all functions which process UTF-8 do something
sensible for bytes which don't form valid UTF-8. They can't even
temporarily convert it to UTF-32 for internal processing for convenience.

> Listing files in a directory should not signal anything. It MUST return
> all files and it should also return them in a way that this list can be
> used to access each of the files.

Which implies that they can't be interpreted as UTF-8.

By masking an error you are not encouraging users to fix it. Using
non-UTF-8 filenames in a UTF-8 locale is IMHO an error.

> Let's start with UTF-8 usernames. This is a likely scenario, since I
> think UTF-8 will typically be used in network communication. If you
> store the usernames in UTF-16, the conversion will signal an error and
> you will not have any users with invalid UTF-8 sequences, nor will any
> invalid sequence be able to match any user. If you later on start
> comparing users somewhere else, in UTF-8, then you must not only strcmp
> them, but also validate each string. This is just a fact and I am not
> complaining about it.

If usernames are supposed to be UTF-8, and in fact they are not, then
it's normal that some software will signal an error instead of processing
them. The proper way is to fix the username database, not to change
programs.

> The interesting thing is that if you do start using my conversion, you
> can actually get rid of the need to validate UTF-8 strings in the first
> scenario. That of course means you will allow users with invalid UTF-8
> sequences, but if one determines that this is acceptable (or even
> desired), then it makes things easier. But the choice is yours.

For me it's not acceptable, so I will not support declaring it valid.

--
 __(  Marcin Kowalczyk
 \__/ [EMAIL PROTECTED]
  ^^  http://qrnik.knm.org.pl/~qrczak/
Re: Roundtripping in Unicode
Lars Kristan [EMAIL PROTECTED] writes:

> And once we understand that things are manageable and not as frightening
> as it seems at first, then we can stop using this as an argument against
> introducing 128 codepoints. People who will find them useful should and
> will bother with the consequences. Others don't need to and can
> roundtrip them as today.

A person who is against them can't ignore a motion to introduce them,
because if they are introduced, other people and programs will start
feeding our programs arbitrary byte sequences labeled as UTF-8, expecting
them to accept the data.

> So, interpreting the 128 codepoints as 'recreate the original byte
> sequence' is an option.

Which guarantees that different programs will have different views of the
validity and meaning of the same data labeled with the same encoding.
Long live standardization.

> Even I will do the same where I just want to represent Unicode in UTF-8.
> I will only use this conversion in certain places.

So it's not just different programs, but even the same program in
different places. Great...

> The fact that my conversion actually produces UTF-8 from most of Unicode
> points does not mean it produced UTF-8.

Increasing the number of encodings means more opportunities for
mislabeling and for using the wrong libraries to process data (which works
in most cases, so the error is not detected immediately), and a harder
life for programs which aim at supporting all data. Think further than the
immediate moment where many people are performing a transition from
something to UTF-8.

Look what happened with the interpretation of HTML in web browsers. If the
standard had from the beginning stood firmly at disallowing guessing what
a malformed HTML was supposed to mean, then people would have learned how
to produce correct HTML and the interpretation would be unambiguous. But
browsers tried to accept arbitrary contents and interpret the parts of
HTML they found there, guessing how errors should be resolved, being
friendly to careless webmasters. The effect is that too often they
submitted a webpage after checking that it works in their browser, but in
fact it had basic syntax errors. Other browsers interpreted the errors
differently, and the page was inaccessible or looked bad. When designing
XML, they learned from this mistake:
http://www.xml.com/axml/target.html#dt-fatal
http://www.xml.com/axml/notes/Draconian.html

That's why people here reject balkanization of UTF-8 by introducing
variations with subtle differences, like Java-modified UTF-8.

Inaccessible filenames are something we shouldn't accept. All your
discussion of non-empty empty directories is just approaching the problem
from the wrong end. One should fix the root cause, not the consequences.
The root cause is that users and programs use different encodings in
different places, and thus Unix filenames can't be unambiguously and
context-freely interpreted as character sequences. Unfortunately it's
hard to fix.

--
 __(  Marcin Kowalczyk
 \__/ [EMAIL PROTECTED]
  ^^  http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
Lars Kristan [EMAIL PROTECTED] writes:

> My my, you are assuming all files are in the same encoding.

Yes. Otherwise nothing shows filenames correctly to the user.

> And what about all the references to the files in scripts? In
> configuration files?

Such files rarely use non-ASCII characters. Non-ASCII characters are
primarily used in names of documents created explicitly by the user.

> Soft links?

They can be fixed automatically.

> If you want to break things, this is definitely the way to do it.

Using non-ASCII filenames is risky to begin with. Existing tools don't
have a good answer to what should happen with these files when the default
encoding used by the user changes, or when a user using a different
encoding tries to access them. As long as everybody uses the same encoding
and files use it too, things work. When the assumption is false, something
will break.

> You mean, various programs will break at various points of time, instead
> of working correctly from the beginning? So far nothing broke.

Because all the programs are in UTF-8. This doesn't imply that they won't
break. You are talking about filenames which are *not* UTF-8, with the
locale set to UTF-8. Mozilla doesn't show such filenames in a directory
listing. You may consider it a bug, but this is a fact. Producing
non-UTF-8 HTML labeled as UTF-8 would be wrong too.

There is no good solution to the problem of filenames encoded in different
encodings. Handling such filenames is incompatible with using Unicode to
process strings. You have to go back to passing arrays of bytes with
ambiguous interpretation of non-ASCII characters, and live with
inconveniences like displaying garbage for non-ASCII filenames and broken
sorting. Mixing any two incompatible filename encodings on the same file
system is a bad idea.

> As soon as you realize you cannot convert filenames to UTF-8, you will
> see that all you can do is start adding new ones in UTF-8. Or forget
> about Unicode.

I'm not using a UTF-8 locale yet, because too many programs don't support
it. I'm using ISO-8859-2. But almost all filenames are ASCII.

--
 __(  Marcin Kowalczyk
 \__/ [EMAIL PROTECTED]
  ^^  http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
D. Starner [EMAIL PROTECTED] writes:

>> But demanding that each program which searches strings checks for
>> combining classes is, I'm afraid, too much.
>
> How is it any different from a case-insensitive search?

We started from string equality, which somehow changed into searching.
Default string equality is case-sensitive.

Searching for an arbitrary substring entered by a user should use
user-friendly rules which fold various minor differences like
decomposition, case and soft hyphens, but it's a rare task, and changing
its rules generally affects convenience rather than correctness.

String equality is used for internal and important operations like lookup
in a dictionary (not necessarily of strings ever viewed by the user),
comparing XML tags, filenames, mail headers, program identifiers,
hyperlink addresses etc. These should be unambiguous, simple and fast.
Computing approximate equivalence by folding minor differences must be
done explicitly when needed, as mandated by relevant protocols and
standards, not forced as the default.

>>>> Does \n followed by a combining code point start a new line?
>>>
>>> The Standard says no, that's a defective combining sequence.
>>
>> Is there *any* program which behaves this way?
>
> I misstated that; it's a new line followed by a defective combining
> sequence.

What is the definition of combining sequences? It doesn't matter that
accented backslashes don't occur in practice.

>> I do care for unambiguous, consistent and simple rules.
>
> So do I; and the only unambiguous, consistent and simple rule that won't
> give users hell is that ba never matches b. Any programs for end-users
> must follow that rule.

Please give a precise definition of string equality. What representation
of strings does it need - a sequence of code points or something else?
Are all strings valid and comparable? Are there operations which give
different results for equal strings?

If string equality folded the difference between precomposed and
decomposed characters, then the API should hide that difference in other
places as well; otherwise string equality is not the finest distinction
between string values but some arbitrary equivalence relation.

>> My current implementation doesn't support filenames which can't be
>> encoded in the current default encoding.
>
> The right thing to do, IMO, would be to support filenames as byte
> strings, and let the programmer convert them back and forth between
> character strings, knowing that it won't roundtrip.

Perhaps. Unfortunately it makes filename processing harder, e.g. you
can't store them in *text* files processed through a transparent
conversion between their encoding and Unicode. In effect we must go back
from manipulating context-insensitive character sequences to manipulating
byte sequences with context-dependent interpretation. We can't even sort
filenames using Unicode collation algorithms, but must use algorithms
which are capable of processing both strings in the locale's encoding and
arbitrary byte sequences at the same time. This is much more complicated
than using Unicode algorithms alone.

What is worse, in Windows the primary representation of filenames is
Unicode, so programs which carefully use APIs based on byte sequences for
processing filenames will be less general than Unicode-based APIs when
the program is ported to Windows.

The computing world is slowly migrating from processing byte sequences in
ambiguous encodings to processing Unicode strings, often represented by
byte sequences in explicitly labeled encodings. There are relics where
the new paradigm doesn't fit well, like Unix filenames, but sticking to
the old paradigm means that programs will continue to support mixing
scripts poorly or not at all.

--
 __(  Marcin Kowalczyk
 \__/ [EMAIL PROTECTED]
  ^^  http://qrnik.knm.org.pl/~qrczak/
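[Editorial aside: the precomposed/decomposed point above can be made
concrete in Python - code-point equality distinguishes the two forms, and
folding the difference is an explicit normalization step, not the
default.]

```python
import unicodedata

precomposed = "\u00e9"       # 'é' as one code point (U+00E9)
decomposed = "e\u0301"       # 'e' + combining acute (U+0301)

# Default string equality is the finest distinction: the two differ.
differ = precomposed != decomposed

# Folding the difference is an explicit, opt-in operation:
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed
```

Both forms render identically to the user, which is exactly why the choice
of when to fold must be deliberate.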
Re: Nicest UTF
Philippe Verdy [EMAIL PROTECTED] writes:

> It's hard to create a general model that will work for all scripts
> encoded in Unicode. There are too many differences. So Unicode just
> appears to standardize a higher level of processing, with combining
> sequences and normalization forms that better approach the linguistics
> and semantics of the scripts. Consider this level as an intermediate
> tool that will help simplify the identification of processing units.

While rendering and user input may use evolving rules with complex
specifications and implementations which depend on the environment and the
user's configuration (actually there is no other choice: this is
inherently complicated for some scripts), string processing in a
programming language should have a stable base with well-defined and
easy-to-remember semantics which doesn't depend on too many settable
preferences and version variations.

The more complex rules a protocol demands (case-insensitive programming
language identifiers, compared after normalization, after bidi processing,
with soft hyphens removed, etc.), the more tools will implement it
incorrectly - usually with subtle errors which don't manifest until
someone tries to process an unusual name (e.g. a documentation generation
tool will produce hyperlinks with dangling links, because a WWW server
does not perform sufficient transformations of addresses).

--
 __(  Marcin Kowalczyk
 \__/ [EMAIL PROTECTED]
  ^^  http://qrnik.knm.org.pl/~qrczak/
Re: Roundtripping in Unicode
Lars Kristan [EMAIL PROTECTED] writes:

>> Please make up your mind: either they are valid and programs are
>> required to accept them, or they are invalid and programs are required
>> to reject them.
>
> I don't know what they should be called. The fact is there shouldn't be
> any. And that current software should treat them as valid. So, they are
> not valid but cannot (and must not) be validated. As stupid as it
> sounds. I am sure one of the standardizers will find a Unicodally
> correct way of putting it.

I am sure they will not.

There is a tension to migrate from processing strings in terms of bytes in
some vaguely specified encoding to processing them in terms of code points
of a known encoding, or even further: combining character sequences,
graphemes etc. 20 years ago the distinction was moot: a byte was a
character, except in some specialized programs for handling CJK. Today,
when latin names with accented characters mixed with cyrillic names are
not displayed correctly or not sorted according to the lexicographic
conventions of some culture, the program can be considered broken.

Unfortunately supporting this requires changing the paradigm. A font with
256 characters and a byte-based rendering engine is not enough for a
display, and for sorting it's no longer enough to compare a byte at a
time. You are trying to stick with processing byte sequences, carefully
preserving the storage format instead of preserving the meaning in terms
of Unicode characters. This leads to less robust software which is not
certain about the encoding of the texts it processes and thus can't apply
algorithms like case mapping without risking meaningless damage to the
text.

> Today, two invalid UTF-8 strings compare the same in UTF-16, after a
> valid conversion (using a single replacement char, U+FFFD) and they
> compare different in their original form,

Conversion should signal an error by default. Replacing errors by U+FFFD
should be done only when the data is processed purely for showing it to
the user, without any further processing, i.e. when it's better to show
the text partially even if we know that it's corrupted.

> Either you do everything in UTF-8, or everything in UTF-16.

Not always, but typically.

> If comparisons are not always done in the same UTF, then you need to
> validate. And not validate while converting, but validate on its own.
> And now many designers will remember that they didn't. So, all UTF-8
> programs (of that kind) will need to be fixed. Well, might as well adopt
> my broken conversion and fix all UTF-16 programs. Again, of that kind,
> not all in general, so there are few. And even those would not be all
> affected. It would depend on which conversion is used where. Things
> could be worked out. Even if we would start changing all the
> conversions. Even more so if a new conversion is added and only used
> when specifically requested.

I don't understand anything of this.

> I cannot afford not to access the files.

Then you have two choices:
- Don't use Unicode.
- Pretend that filenames are encoded in ISO-8859-1, and represent them as
  a sequence of code points U+0001..U+00FF. They will not be displayed
  correctly but the information will be preserved.

--
 __(  Marcin Kowalczyk
 \__/ [EMAIL PROTECTED]
  ^^  http://qrnik.knm.org.pl/~qrczak/
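[Editorial aside: both points above - that U+FFFD replacement collapses
distinct invalid strings, and that the ISO-8859-1 pretence preserves
information exactly - can be checked in Python; the filename is a made-up
example.]

```python
# Two distinct invalid UTF-8 byte strings become equal after a lossy
# conversion using the replacement character U+FFFD:
s1, s2 = b"\xb1", b"\xe6"
collapsed = (s1.decode("utf-8", "replace") == s2.decode("utf-8", "replace"))

# The ISO-8859-1 pretence: every byte maps to a single code point in
# U+0000..U+00FF, so the roundtrip preserves the bytes exactly, even
# though non-ASCII bytes may display as the wrong characters.
name = b"zo\xeb\xb1"                  # made-up non-UTF-8 filename
roundtrips = (name.decode("iso-8859-1").encode("iso-8859-1") == name)
```

The first conversion loses the distinction between the inputs; the second
loses nothing, at the cost of displaying garbage for non-ASCII bytes.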
Re: Nicest UTF
Philippe Verdy [EMAIL PROTECTED] writes: [...] This was later amended in an errata for XML 1.0 which now says that the list of code points whose use is *discouraged* (but explicitly *not* forbidden) for the Char production is now: [...] Ugh, it's a mess... IMHO Unicode is partially to blame, by introducing various kinds of holes in code point numbering (non-characters, surrogates), by not being clear when the unit of processing should be a code point and when a combining character sequence, and earlier by pushing UTF-16 as the fundamental representation of text (which led to such horrible descriptions as http://www.xml.com/axml/notes/Surrogates.html). XML is just an example of a standard which must decide: A. What is the unit of text processing? (code point? combining character sequence? something else? hopefully it would not be a UTF-16 unit) B. Which (sequences of) characters are valid when present in the raw source, i.e. what does UTF-n really mean? C. Which (sequences of) characters can be formed by specifying a character number? A programming language must do the same. Kogut, the language I'm designing and developing, uses Unicode as its string representation, but the details can still be changed. I want to have rules which are correct as far as Unicode is concerned, and which are simple enough to be practical (e.g. if a standard forced me to make the conversion from code point number to actual character contextual, or if it forced me to unconditionally unify precomposed and decomposed characters, then I quit and won't support a broken standard). Internal text processing in a programming language can be more permissive than an application of such processing like XML parsing: if a particular character is valid in UTF-8 but XML disallows it, everything is fine, it can be rejected at some stage. It must not be more restrictive however, as that would make it impossible to implement XML parsing in terms of string processing. Regarding A, I see three choices: 1.
A string is a sequence of code points. 2. A string is a sequence of combining character sequences. 3. A string is a sequence of code points, but it's encouraged to process it in groups of combining character sequences. I'm afraid that anything other than a mixture of 1 and 3 is too complicated to be widely used. Almost everybody is representing strings either as code points, or as even lower-level units like UTF-16 units. And while 2 is nice from the user's point of view, it's a nightmare from the programmer's point of view: - Unicode character properties (like general category, character name, digit value) are defined in terms of code points. Choosing 2 would immediately require two-stage processing: a string is a sequence of sequences of code points. - Unicode algorithms (like collation, case mapping, normalization) are specified in terms of code points. - Data exchange formats (UTF-n) are always closer to code points than to combining character sequences. - Code points have a finite domain, so you can make dictionaries indexed by code points; for combining character sequences we would be forced to make functions which *compute* the relevant property based on the structure of such a sequence. I don't believe 2 is workable at all. The question is how to make 3 convenient enough to be used more often. Unfortunately it's much harder than 1, unless strings used completely different iteration protocols from other sequences. I have no idea how to make 3 convenient. Regarding B in the context of a programming language (not XML), chapter 3.9 of the Unicode standard version 4.0 excludes only surrogates: it does not exclude non-characters like U+. But non-characters must be excluded somewhere, because otherwise U+FFFE at the beginning would be mistaken for a BOM. I'm confused. Regarding C, I'm confused too. Should a function which returns the character of the given number accept surrogates? I guess no. Should it accept non-characters? I don't know.
I only know that it should not accept values above 0x10. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
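As one data point for question C: Python (used here purely for illustration, not as an endorsement of its choices) accepts any number up to the top of the Unicode range in its code-point constructor, including surrogates and non-characters:

```python
# chr() accepts any code point number up to 0x10FFFF, including
# surrogates (e.g. 0xD800) and non-characters (e.g. 0xFFFF):
assert chr(0xD800) == "\ud800"
assert chr(0xFFFF) == "\uffff"

# Values above the Unicode range are rejected:
try:
    chr(0x110000)
    raise AssertionError("chr accepted a value above the Unicode range")
except ValueError:
    pass
```

So in this design the answer to "should it accept surrogates?" is yes at construction time, with rejection deferred to the UTF-n encoders.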
Re: Roundtripping in Unicode
Lars Kristan [EMAIL PROTECTED] writes: It's essential that any UTF-n can be translated to any other without loss of data, because it allows using an implementation of the given functionality which represents data in any form, not necessarily the form we have at hand, as far as correctness is concerned. Avoiding conversion should matter only for efficiency, not for correctness. When I am talking about roundtrip, I speak of arbitrary data, not just valid data. You want to declare all byte sequences as valid. And thus valid data is no longer preserved on round trip, because different UTFs are able to encode different sequences of code points. Roundtrip for valid data is of course essential and needs to be preserved. Your proposal does not do this. Unpaired surrogates are not valid UTF-16, and there are no surrogates in UTF-8 at all, so there is no point in trying to preserve UTF-16 which is not really UTF-16. Actually, there is a point. It is just that you fail to understand it. But then, you needn't worry about it, since it is outside of your area of interest. I would worry if my programs no longer accepted what Unicode considers valid UTF-n. And I would worry if rules defined by Unicode made U+ encodable as UTF-n, U+ encodable too, but the sequence U+ U+ not encodable (because UTF-n would no longer be usable as a format for serialization of arbitrary strings of valid code points). I would also worry if an API, file format or network protocol intended for use by various programs required a non-standard variant of UTF-n, because I couldn't use standard UTF-n encoding and decoding functions to interoperate with it. I indeed don't worry in what way you abuse UTF-n, as long as it's not an official Unicode standard and it's not widely used in practice. If UTC takes 128 unassigned codepoints and declares them to be a new set of surrogates, you needn't worry either (your valid data will still convert to any UTF).
No, because it would remove the responsibility not to generate such data and add the responsibility to accept it, and thus some programs which are not currently broken would be broken under the changed rules. Unless you have a strict validator which already validates unpaired surrogates. But you don't. I am pretty sure about it. I use the system-supplied iconv() which does not accept anything which can be described as unpaired surrogates. If a user encounters corrupt data and cannot process it with your program, she (she is 'politically correct', but in this case can be seen as sexism) will blame it on the program, not the data. I don't care. This has been discussed mails back. UNIX filenames are already 'submitted'. Once you set your locale to UTF-8, you have labelled them all as UTF-8. Suggestions? Convert them to valid UTF-8 (provided the locales used on the system use UTF-8 as the encoding; otherwise keep them in the locale's encoding). -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
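The distinction drawn here is easy to demonstrate with any standard codec; a Python sketch (the sample string is arbitrary):

```python
# Valid code point sequences round-trip through every UTF:
s = "Za\u017c\u00f3\u0142\u0107 \U0001F600"
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    assert s.encode(enc).decode(enc) == s

# An unpaired surrogate is not a Unicode scalar value, so there is
# nothing to round-trip: standard codecs refuse to serialize it.
lone = "\ud800"
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    try:
        lone.encode(enc)
        raise AssertionError("serialized an unpaired surrogate")
    except UnicodeEncodeError:
        pass
```

This is the sense in which round-tripping of valid data is already guaranteed, while "round-tripping" unpaired surrogates is not a property of any conforming UTF.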
Re: Nicest UTF
D. Starner [EMAIL PROTECTED] writes: This implies that every programmer needs an in-depth knowledge of Unicode to handle simple strings. There is no way to avoid that. Then there's no way that we're ever going to get reliable Unicode support. This is probably true. I wonder whether things could have been done significantly better, or whether it's an inherent complexity of text. Just curious; it doesn't help with the reality. If the runtime automatically performed NFC on input, then a part of a program which is supposed to pass a string unmodified would sometimes modify it. Similarly with NFD. No. By the same logic you used above, I can expect the programmer to understand their tools, and if they need to pass strings unmodified, they shouldn't load them using methods that normalize the string. That's my point: if he normalizes, he does this explicitly. If a standard (a programming language, XML, whatever) specifies that identifiers should be normalized before comparison, a program should do this. If it specifies that Cf characters are to be ignored, then a program should comply. A standard doesn't have to specify such things however, so a programming language shouldn't do too much automatically. It's easier to apply a transformation than to undo a transformation applied automatically. Sometimes things get ambiguous. So one day ś is matched by s and one day ś isn't? That's absolutely wrong behavior; the program must serve the user, not the programmer. If I use grep to search for a combining acute, I bet it will currently match cases where it's a separate combining character but will not match precomposed characters. Do you say that this should be changed? Hey, Linux grep matches only a single byte by ., even in a UTF-8 locale. Now, I can agree that this should be changed. But demanding that each program which searches strings check combining classes is, I'm afraid, too much. Does \n followed by a combining code point start a new line?
The Standard says no, that's a defective combining sequence. Is there *any* program which behaves this way? How useful is a rule in a standard which nobody obeys? Does a double quote followed by a combining code point start a string literal? That would depend on your language. I'd prefer no, but it's obvious many have made other choices. Since my language is young and almost doesn't have users, I can even change decisions made earlier: I'm not constrained by compatibility yet. But if the lexical structure of the program worked in terms of combining character sequences, it would have to be somehow supported by generic string processing functions, and it would have to work consistently for all lexical features. For example */ followed by a combining accent would not end a comment, an accented backslash would not need escaping in a string literal, and something unambiguous would have to be done with an accented newline. Such rules would be harder to support with most text processing tools. I know no language in which searching for a backslash in a string would not find an accented backslash. It doesn't matter that accented backslashes don't occur in practice. I do care for unambiguous, consistent and simple rules. Does a slash followed by a combining code point separate subdirectory names? In Unix, yes; that's because filenames in Unix are byte streams with the byte 0x2F acting as a path separator. My current implementation doesn't support filenames which can't be encoded in the current default encoding. The encoding can be changed from within a program (perhaps locally during execution of some code). So one can process any Unix filename by temporarily setting the encoding to Latin1. It's unfortunate that the default setting is more restrictive than the OS, but I have found no sensible alternative other than encouraging processing strings in their transportation encoding.
Anyway, if a string *is* accepted as a file name, the program's idea about directory separators is the same as the OS (as long as we assume Unix; I don't yet provide any OS-generic pathname handling). If the program assumed that an accented slash is not a directory separator, I expect possible security holes (the program thinks that a string doesn't include slashes, but from the OS point of view it does). The rules you are offering are only simple and unambiguous to the programmer; they appear completely random to the end user. And yours are the opposite :-) -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
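The grep behaviour guessed at above is easy to check with any code-point-based regex engine; a sketch in Python (its re module, like most such engines, matches code points, not combining character sequences):

```python
import re
import unicodedata

precomposed = "\u015b"   # 'ś' as a single code point
decomposed = "s\u0301"   # 's' + combining acute (U+0301)

# A search for the combining acute finds it only where it is
# a separate code point:
assert re.search("\u0301", decomposed)
assert not re.search("\u0301", precomposed)

# Likewise, a search for plain 's' misses the precomposed form:
assert re.search("s", decomposed)
assert not re.search("s", precomposed)

# Normalizing both sides first gives a consistent answer:
nfd = unicodedata.normalize("NFD", precomposed)
assert re.search("\u0301", nfd)
```

This is the ambiguity at issue: whether matching should apply canonical equivalence, or an explicit normalization step should be the caller's job.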
Re: Roundtripping in Unicode
Lars Kristan [EMAIL PROTECTED] writes: The other name for this is roundtripping. Currently, Unicode allows a roundtrip UTF-16 -> UTF-8 -> UTF-16. For any data. But there are several reasons why a UTF-8 -> UTF-16(32) -> UTF-8 roundtrip is more valuable, even if it means that the other roundtrip is no longer guaranteed: It's essential that any UTF-n can be translated to any other without loss of data, because it allows using an implementation of the given functionality which represents data in any form, not necessarily the form we have at hand, as far as correctness is concerned. Avoiding conversion should matter only for efficiency, not for correctness. Let me go a bit further. A UTF-16 -> UTF-8 -> UTF-16 roundtrip is only required for valid codepoints other than the surrogates. But it also works for surrogates unless you explicitly and intentionally break it. Unpaired surrogates are not valid UTF-16, and there are no surrogates in UTF-8 at all, so there is no point in trying to preserve UTF-16 which is not really UTF-16. I would opt for the latter (i.e. keep it working), according to my statement (in the thread When to validate) that validation should be separated from other processing, where possible. Surely it should be separated: validation is only necessary when data are passed from the external world to our system. Internal operations should not produce invalid data from valid data. You don't have to check at each point whether data is valid. You can assume that it is always valid, as long as the combination of the programming language, libraries and the program is not broken. Some languages make it easier to ensure that strings are valid, to the point that they guarantee it (they don't offer any way to construct an invalid string).
Unfortunately many languages don't: they say that they represent strings in UTF-8 or UTF-16, but they are unsafe; they do nothing to prevent constructing an array of words which is not valid UTF-8 or UTF-16 and passing it to functions which assume that it is. Blame these languages, not the definitions of UTF-n. A UTF-32 -> UTF-8 -> UTF-32 roundtrip is similar, except that 16-8-16 works even with concatenation, while 32-8-32 can be broken with concatenation. It always works as long as the data was really UTF-32 in the first place. A word with a value of 0xD800 is not UTF-32. All this is known and presents no problems, or - only problems that can be kept under control. So, by introducing another set of 128 'surrogates', we don't get a new type of problem, just another instance of a well known one. Nonsense. UTF-8, UTF-16 and UTF-32 are interchangeable, and you would like to break this. No way. On the other hand, UTF-8 -> UTF-16 -> UTF-8 as well as UTF-8 -> UTF-32 -> UTF-8 can both be achieved, with no exceptions. This is something no other roundtrip can offer at the moment. But they do! An isolated byte with the highest bit set is not UTF-8, so there is no point in converting it to UTF-16 and back. On top of it, I repeatedly stressed that it is UTF-8 data that has the highest probability of any of the following: * contains portions that are not UTF-8 * is not really UTF-8, but the user has UTF-8 set as the default encoding * is not really UTF-8, but was marked as such * a transmission error not only changes data but also creates invalid sequences In these cases the data is broken and the damage should be signalled as soon as possible, so the submitter can know this and correct it. Alternatively you keep the original byte sequence, but don't pretend that it's UTF-8. Delete the erroneous UTF-8 label instead of changing the data. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
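The claims about what is not UTF-8 or UTF-32 can be verified with any strict decoder; in Python, for instance:

```python
import struct

# An isolated byte with the high bit set, or a truncated multibyte
# sequence, is not UTF-8:
for bad in (b"\x93", b"\xc3", b"\xe2\x82"):
    try:
        bad.decode("utf-8")
        raise AssertionError("accepted invalid UTF-8")
    except UnicodeDecodeError:
        pass

# A 32-bit word holding 0xD800 is not UTF-32 either:
try:
    struct.pack("<I", 0xD800).decode("utf-32-le")
    raise AssertionError("accepted a surrogate word as UTF-32")
except UnicodeDecodeError:
    pass
```

A strict decoder, rather than the data structure itself, is what enforces the boundary between "byte sequence" and "UTF-n text".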
Re: Roundtripping in Unicode
Lars Kristan [EMAIL PROTECTED] writes: All assigned codepoints do roundtrip even in my concept. But unassigned codepoints are not valid data. Please make up your mind: either they are valid and programs are required to accept them, or they are invalid and programs are required to reject them. Furthermore, I was proposing this concept to be used, but not unconditionally. So, you can, possibly even should, keep using whatever you are using. So you prefer to make programs misbehave in unpredictable ways (when they pass data from a component which uses relaxed rules to a component which uses strict rules) rather than have a clear and unambiguous notion of valid UTF-8? Perhaps I can convert mine, but I cannot convert all filenames on a user's system. Then you can't access his files. With your proposal you couldn't either, because you don't make them valid unconditionally. Some programs would access them and some would break, and it's not clear what should be fixed: programs or filenames. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: When to validate?
Arcane Jill [EMAIL PROTECTED] writes: Here's something that's been bothering me. Suppose I write a function - let's call it trim(), which removes leading and trailing spaces from a string, represented as one of the UTFs. If I've understood this correctly, I'm supposed to validate the input, yes? What do you mean by validate? -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
D. Starner [EMAIL PROTECTED] writes: String equality in a programming language should not treat composed and decomposed forms as equal. Not at this level of abstraction. This implies that every programmer needs an in-depth knowledge of Unicode to handle simple strings. There is no way to avoid that. If the runtime automatically performed NFC on input, then a part of a program which is supposed to pass a string unmodified would sometimes modify it. Similarly with NFD. You can't expect each and every program which compares strings to perform normalization (e.g. the Linux kernel with filenames). Perhaps if there were a single normalization format which everybody agreed to, and unnormalized strings were never used for data interchange (if UTF-8 were specified so as to disallow unnormalized data, etc.), things would be different. But Unicode treats both composed and decomposed representations as valid. IMHO splitting into graphemes is the job of a rendering engine, not of a function which extracts a part of a string which matches a regex. So S should _sometimes_ match an accented S? Again, I feel extended misery of explaining to people why things aren't working right coming on. Well, otherwise things get ambiguous, similarly to these XML issues. Does \n followed by a combining code point start a new line? Does a double quote followed by a combining code point start a string literal? Does a slash followed by a combining code point separate subdirectory names? An iterator which delivers whole combining character sequences out of a sequence of code points can be used. You can also manipulate strings as arrays of combining character sequences. But if you insist that this is the primary string representation, you become incompatible with most programs which have different ideas about delimited strings. You can't expect each and every program to check combining classes of processed characters. It's hard enough to convince them that a character is not the same as a byte.
I expect breakage of XML-based protocols if implementations are actually changed to conform to these rules (I bet they don't now). Really? In what cases are you storing isolated combining code points in XML as text? In case I want to circumvent security or deliberately cause a piece of software to misbehave. Robustness requires unambiguous and simple rules. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
Philippe Verdy [EMAIL PROTECTED] writes: The XML/HTML core syntax is defined with fixed behavior of some individual characters like '<', '>', quotation marks, and with special behavior for spaces. The point is: what characters mean in this sentence. Code points? Combining character sequences? Something else? -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
John Cowan [EMAIL PROTECTED] writes: The XML/HTML core syntax is defined with fixed behavior of some individual characters like '<', '>', quotation marks, and with special behavior for spaces. The point is: what characters mean in this sentence. Code points? Combining character sequences? Something else? Neither. Unicode characters. What does 'Unicode characters' mean? -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
John Cowan [EMAIL PROTECTED] writes: The XML/HTML core syntax is defined with fixed behavior of some individual characters like '<', '>', quotation marks, and with special behavior for spaces. The point is: what characters mean in this sentence. Code points? Combining character sequences? Something else? Neither. Unicode characters. http://www.w3.org/TR/2000/REC-xml-20001006#charsets implies that the appropriate level for parsing XML is code points. In particular XML allows a combining character directly after '<'. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
D. Starner [EMAIL PROTECTED] writes: You could hide combining characters, which would be extremely useful if we were just using Latin and Cyrillic scripts. It would need a separate API for examining the contents of a combining character. You can't avoid the sequence of code points completely. It would yield surprising semantics: for example if you concatenate a string with N+1 possible positions of an iterator with a string with M+1 positions, you don't necessarily get a string with N+M+1 positions, because there can be combining characters at the border. It's simpler to overlay various grouping styles on top of a sequence of code points than to start with automatically combined combining characters and process inwards and outwards from there (sometimes looking inside characters, sometimes grouping them even more). It would impose complexity in cases where it's not needed. Most of the time you don't care which code points are combining and which are not, for example when you compose a text file from many pieces (constants and parts filled in by users) or when parsing (if a string is specified as ending with a double quote, then programs will in general treat a double quote followed by a combining character as an end marker). I believe code points are the appropriate general-purpose unit of string processing. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
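The iterator-position argument can be made concrete with a naive grouping sketch in Python (a simplification of real grapheme segmentation, using only the combining class from unicodedata):

```python
import unicodedata

def groups(s):
    """Naive grouping: a combining mark (combining class != 0)
    attaches to the preceding group; a leading mark forms a
    defective group of its own."""
    out = []
    for cp in s:
        if out and unicodedata.combining(cp):
            out[-1] += cp
        else:
            out.append(cp)
    return out

a = "e"         # one group
b = "\u0301"    # a bare combining acute: also one (defective) group
assert len(groups(a)) == 1
assert len(groups(b)) == 1
# Concatenation merges groups at the border: 1 + 1 groups become 1.
assert groups(a + b) == ["e\u0301"]
```

Group counts are not additive under concatenation, which is exactly why a grouped representation makes a poor primary string type.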
Re: If only MS Word was coded this well
Theodore H. Smith [EMAIL PROTECTED] writes: It's because code points have variable lengths in bytes, so extracting individual characters is almost meaningless Same with UTF-16 and UTF-32. A character is multiple code-points, remember? (decomposed chars?) Nope. I've done tons of UTF-8 string processing. I've even done a case insensitive word-frequency measuring algorithm on UTF-8. It runs blastingly fast, because I can do the processing with bytes. Ah, so first you say that a character means a base code point plus a number of combining code points, and then you admit that your program actually processes strings in terms of even lower-level units: bytes of the UTF-8 encoding? Why don't you treat a string as a sequence of base-code-point-plus-combining-code-points items? Answer: because often this grouping is irrelevant, as in your example of word statistics. Code point grouping is more important: Unicode algorithms are typically described in terms of code points. It just requires you to understand the actual logic of UTF-8 well enough to know that you can treat it as bytes, most of the time. When I implemented the word boundary algorithm from Unicode, I was glad that I could do it in terms of UTF-32 and ISO-8859-1 instead of UTF-8, even though I do understand the logic of UTF-8. As for isspace... sure there is a UTF-8 non-byte space. I don't understand. If a string is exposed as a sequence of UTF-8 units, it makes no sense to ask whether a particular unit isspace. And it makes no sense to ask this about a whole string either. It would have to be a function which works in terms of some iterator over strings. Well, some things do work in terms of positions inside strings, for example word boundaries. But people are used to thinking about isspace as a property of a *character*, whatever exactly the language means by this concept. My language means a Unicode code point, for the conceptual simplicity of the string as seen by the language.
My case insensitive utf-8 word frequency counter (which runs blastingly fast) however didn't find this to be any problem. It dealt with all sorts of non-single-byte word breaks :o) It appears to run at about 3MB/second on my laptop, which involves, for every word, doing a word check on the entire previous collection of words. I happen to have written a case insensitive word frequency counter as an example in my language, to test some Unicode algorithms. It uses the word boundary algorithm to specify words; a segment between boundaries must include a character of class L* or N* in order to be counted as a word. It maintains subcounts of case-sensitive forms of a case-insensitive word (implemented as a hash table of hash tables of integers). It converts input using iconv(), i.e. from an arbitrary locale encoding supported by the system. It was not written with speed in mind. It has 24 lines, 10 of which are formatting the output (statistics about the 20 most common words). http://cvs.sourceforge.net/viewcvs.py/kokogut/kokogut/tests/WordStat.ko?view=markup It's written in a dynamically typed language, with dynamic dispatches and higher order functions everywhere, where all values except small integers are pointers, with immutable strings. Each line separately is divided into words; a subsequence of spaces is materialized as a string object before the program checks that there are no letters nor numbers in it and thus it's not a word. It processed 4.8MB in 3.2s on my machine (Athlon 2000, 1.25GHz), which I think is good enough under these conditions. This input happens to be ASCII (a mailbox) but the program didn't know beforehand that it's ASCII. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
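A rough Python analogue of the counter described here (hypothetical code, not the Kokogut original; it uses a crude \w+ segmentation where the original uses the Unicode word boundary algorithm):

```python
import re
from collections import Counter, defaultdict

def word_stats(text):
    """Case-insensitive word counts, with sub-counts of the
    case-sensitive forms (the hash-table-of-hash-tables design
    described above)."""
    counts = defaultdict(Counter)
    # crude word segmentation; the original uses UAX #29 boundaries
    for word in re.findall(r"\w+", text):
        counts[word.casefold()][word] += 1
    return counts

stats = word_stats("The cat the CAT a cat")
assert sum(stats["cat"].values()) == 3          # cat, CAT, cat
assert stats["the"] == Counter({"The": 1, "the": 1})
```

Note that casefold() operates on code points, which is why neither this sketch nor the original needs byte-level access to the encoding.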
Re: Invalid UTF-8 sequences
Lars Kristan [EMAIL PROTECTED] writes: Quite close. Except for the fact that: * U+EE93 is represented in UTF-32 as 0xEE93 * U+EE93 is represented in UTF-16 as 0xEE93 * U+EE93 is represented in UTF-8 as 0x93 (_NOT_ 0xEE 0xBA 0x93) Then it would be impossible to represent sequences like U+ U+EEBA U+EE93 in UTF-8, and conversion UTF-32 - UTF-8 - UTF-32 would not round-trip. Concatenation of UTF-8-encoded strings would not be equivalent to UTF-8-encoding of the concatenation of code points. This is broken. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
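For comparison, the byte-preserving scheme Python later standardized (PEP 383, the 'surrogateescape' error handler) avoids exactly this breakage: undecodable bytes become lone surrogates U+DC80..U+DCFF rather than ordinary code points, the mapping is applied only when that handler is explicitly requested, and the intermediate strings are deliberately not valid Unicode:

```python
raw = b"valid \xff\xfe bytes"   # not valid UTF-8
s = raw.decode("utf-8", errors="surrogateescape")
assert "\udcff\udcfe" in s      # each bad byte 0xNN becomes U+DCNN

# The original bytes come back exactly, so the byte round trip
# through the string type is lossless:
assert s.encode("utf-8", errors="surrogateescape") == raw

# But a strict encoder rejects the intermediate string, so the
# interchange of valid UTF-8 is unaffected:
try:
    s.encode("utf-8")
    raise AssertionError("strictly encoded a lone surrogate")
except UnicodeEncodeError:
    pass
```

Because the escape code points can never appear in strictly decoded text, concatenation and re-encoding of valid strings keep their usual semantics, which is the property the quoted proposal loses.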
Re: Nicest UTF
D. Starner [EMAIL PROTECTED] writes: The semantics there are surprising, but that's true no matter what you do. An NFC string + an NFC string may not be NFC; the resulting text doesn't have N+M graphemes. Which implies that automatically NFC-ing strings as they are processed would be a bad idea. They can be NFC-ed at the end of processing if the consumer of this data demands it. Especially if other consumers would want NFD. String equality in a programming language should not treat composed and decomposed forms as equal. Not at this level of abstraction. IMHO splitting into graphemes is the job of a rendering engine, not of a function which extracts a part of a string which matches a regex. If you do so with a language that includes '≮', you violate the Unicode standard, because '<' followed by &#824; (U+0338, combining long solidus overlay) and &#8814; ('≮') are canonically equivalent. I think that Unicode tries to push the implications of equivalence too far. They are supposed to be equivalent when they are actual characters. What if they are numeric character references? Should <&#824; (7 characters) represent a valid plain-text character or be a broken opening tag? Note that if it's a valid plain-text character, it's impossible to represent isolated combining code points in XML, and thus it's impossible to use XML for transportation of data which allows isolated combining code points (except by introducing custom escaping of course, e.g. transmitting decimal numbers instead of characters). I expect breakage of XML-based protocols if implementations are actually changed to conform to these rules (I bet they don't now). OTOH if it's not a valid plain-text character, then conversion between numeric character references and actual characters gets more hairy. I'll see if I have time after finals to pound out a basic API that implements this, in Ada or Lisp or something. My language is quite similar to Lisp semantically.
Implementing an API which works in terms of graphemes over an API which works in terms of code points is more sane than the converse, which suggests that the core API should use code points if both APIs are sometimes needed at all. While I'm not obsessed with efficiency, it would be nice if changing the API would not slow down string processing too much. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
John Cowan [EMAIL PROTECTED] writes: String equality in a programming language should not treat composed and decomposed forms as equal. Not at this level of abstraction. Well, that assumes that there's a special string equality predicate, as distinct from just having various predicates that DWIM. No, I meant the default generic equality predicate when applied to two strings. It's a broken opening tag. Ok, so it's the conversion from raw text to escaped character references which should treat combining characters specially. What about a letter with combining acute which doesn't have a precomposed form? A broken opening tag or a valid text character? What about &#65;ACUTE, where ACUTE stands for a combining acute? Is this A with acute, or a broken character reference which ends with an accented semicolon? If it's a broken character reference, then what about A&#769;? (769 is the code for combining acute, if I'm not mistaken.) If *this* is A with acute, then it's inconsistent: here combining accents are processed after resolving numeric character references, and previously it was in the opposite order. OTOH if this is something else, then it's impossible to represent letters without precomposed forms with numeric character references. The general trouble is that numeric character references can only encode individual code points rather than graphemes (is this a correct term for a non-combining code point with a sequence of combining code points?). So if XML is supposed to be treated as a sequence of graphemes, weird effects arise in the above boundary cases... -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
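The equivalences under discussion can be checked with any normalization library; with Python's unicodedata, for example:

```python
import unicodedata

# '<' + U+0338 (combining long solidus overlay) composes under NFC
# to the single character U+226E (NOT LESS-THAN):
assert unicodedata.normalize("NFC", "<\u0338") == "\u226e"

# 'A' + U+0301 (combining acute, decimal 769) composes to U+00C1:
assert unicodedata.normalize("NFC", "A\u0301") == "\u00c1"

# But default string equality does not apply canonical equivalence:
assert "A\u0301" != "\u00c1"
```

This is the split argued for above: equivalence is available through an explicit normalization step, while the default equality predicate compares code point sequences as-is.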
Re: Nicest UTF
Lars Kristan [EMAIL PROTECTED] writes: This is simply what you have to do. You cannot convert the data into Unicode in a way that says I don't know how to convert this data into Unicode. You must either convert it properly, or leave the data in its original encoding (properly marked, preferably). Here lies the problem. Suppose you have a document in UTF-8, which somehow got corrupted and now contains a single invalid sequence. Are you proposing that this document needs to be stored separately? He is not proposing that. Everything else in the database would be stored in UTF-16, but now one must add the capability to store this document separately. No, it can be stored in UTF-16 or whatever else is used. Except the corrupted part of course, but it's corrupted, and thus useless, so it doesn't matter what happens with it. Now suppose you have a UNIX filesystem, containing filenames in a legacy encoding (possibly even more than one). If one wants to switch to UTF-8 filenames, what is one supposed to do? Convert all filenames to UTF-8? Yes. Who will do that? A system administrator (because he has access to all files). And when? When the owners of the computer system decide to switch to UTF-8. Will all users agree? It depends on who decides about such things. Either they don't have a voice, or they agree and the change is made, or they don't agree and the change is not made. What's the point? Should all filenames that do not conform to UTF-8 be declared invalid? What do you mean by invalid? They are valid from the point of view of the OS, but they will not work with reasonable applications which use Unicode internally. If you keep all processing in UTF-8, then this is a decision you can postpone. You mean, various programs will break at various points in time, instead of working correctly from the beginning? If it's broken, fix it, instead of applying patches which will sometimes hide the fact that it's broken, and sometimes not.
I didn't encourage users to mix UTF-8 filenames and Latin 1 filenames. Do you want to discourage them? Mixing any two incompatible filename encodings on the same file system is a bad idea. IMHO, preserving data is more important, but so far it seems it is not a goal at all. With a simple argument - that Unicode only defines how to process Unicode data. Understandably so, but this doesn't mean it needs to remain so. If you don't know the encoding and want to preserve the values of bytes, then don't convert it to Unicode. Well, you may have a wrong assumption here. You probably think that I convert invalid sequences into PUA characters and keep them as such in UTF-8. That is not the case. Any invalid sequences in UTF-8 are left as they are. If they need to be converted to UTF-16, then the PUA is used. If they are then converted to UTF-8, they are converted back to their original bytes, hence the incorrect sequences are re-created. This does not make sense. If you want to preserve the bytes instead of working in terms of characters, don't convert it at all - keep the original byte stream. One more example of data loss that arises from your approach: If a single bit is changed in UTF-16 or UTF-32, that is all that will happen (in more than 99% of the cases). If a single bit changes in UTF-8, you risk that the entire character will be dropped or replaced with U+FFFD. But funnily, only if it ever gets converted to UTF-16 or UTF-32. Not that this is a major problem on its own, but it indicates that there is something fishy in there. If you change one bit in a file compressed by gzip, you might not be able to recover any part of it. What's the point? The UTF-x were not designed to minimize the impact of corruption of encoded bytes.
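As an aside: the byte-preserving roundtrip Lars describes is close in spirit to what Python later standardized (PEP 383) as the `surrogateescape` error handler, which smuggles invalid bytes through decoding as lone surrogates and restores them on encoding. A sketch of that behavior, not of his exact proposal:

```python
raw = b"abc\xff\xfedef"          # not valid UTF-8 in the middle

s = raw.decode("utf-8", errors="surrogateescape")
# each bad byte B becomes the lone surrogate U+DC00+B in the decoded string
assert s[3] == "\udcff" and s[4] == "\udcfe"

# encoding back with the same handler recreates the original bytes exactly
assert s.encode("utf-8", errors="surrogateescape") == raw
```

The resulting string is not valid Unicode (it contains unpaired surrogates), which is precisely the objection raised in the thread.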
If you want to preserve the text despite occasional corruption, use a higher level protocol for this (if I remember correctly, RAR can add additional information to an archive which allows the data to be recovered even if parts of the archive, entire blocks, have been lost). There was a discussion on nul characters not so long ago. Many text editors do not properly preserve nul characters in text files. But it is definitely a nice thing if they do. While preserving nul characters only has a limited value, preserving invalid sequences in text files could be crucial. An editor should alert the user that the file is not encoded in a particular encoding or that it's corrupted, instead of trying to guess which characters were supposed to be there. If it's supposed to edit binary files too, it should work on the bytes instead of decoded characters. A UTF-8 based editor can easily do this. A UTF-16 based editor cannot do it at all. If you say that UTF-16 is not intended for such a purpose, then so be it. But this also means that UTF-8 is superior. It's much easier with CP-1252, which shows that it's superior to UTF-8 :-) Yes, it is not related much. Except for the fact I was trying to see if UTF-32
Re: Nicest UTF
Philippe Verdy [EMAIL PROTECTED] writes: The point is that indexing should better be O(1). SCSU is also O(1) in terms of indexing complexity... It is not. You can't extract the nth code point without scanning the previous n-1 code points. But individual characters do not always have any semantic. For languages, the relevant unit is almost always the grapheme cluster, not the character (so not its code point...). How do you determine the semantics of a grapheme cluster? Answer: by splitting it into code points. A code point is atomic, it's not split any further, because there is a finite number of them. When a string is exchanged with another application or network computer or the OS, it always uses some encoding which is closer to code points than to grapheme clusters, no matter if it's UTF-8 or UTF-16 or ISO-8859-something. If the string was originally stored as an array of grapheme clusters, it would have to be translated to code points before further conversion. Which representation will be the best is left to implementers, but I really think that compressed schemes are often introduced to increase application performance and reduce the needed resources both in memory and for I/O, but also in networking where interoperability across systems and bandwidth optimization are also important design goals... UTF-8 is much better for interoperability than SCSU, because it's already widely supported and SCSU is not. It's also easier to add support for UTF-8 than for SCSU. UTF-8 is stateless, SCSU is stateful - this is very important. UTF-8 is easier to encode and decode. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
Philippe Verdy [EMAIL PROTECTED] writes: The question is why you would need to extract the nth codepoint so blindly. For example I'm scanning a string backwards (to remove '\n' at the end, to find and display the last N lines of a buffer, to find the last '/' or last '.' in a file name). SCSU in general supports traversal only forwards. But remember the context in which this discussion was introduced: which UTF would be the best to represent (and store) large sets of immutable strings. The discussion about indexes in substrings is not relevant in that context. It is relevant. A general purpose string representation should support at least a bidirectional iterator, or preferably efficient random access. Neither is possible with SCSU. * * * Now consider scanning forwards. We want to strip the beginning of a string. For example the string is an irc message prefixed with a command, and we want to take the message only for further processing. We have found the end of the prefix and we want to produce a string from this position to the end (a copy, since strings are immutable). With any stateless encoding a suitable library function will compute the length of the result, allocate memory, and do an equivalent of memcpy. With SCSU it's not possible to copy the string without analysing it, because the prefix might have changed the state, so the suffix is not correct when treated as a standalone string. If the stripped part is short and the remaining part is long, it might pay off to scan the part we want to strip and take the memcpy shortcut if the prefix did not change the state (which is probably the common case). But in general we must recompress the whole copied part! We can't even precalculate its physical size. Decompressing into temporary memory would negate the benefits of a compressed encoding, so we had better decompress and compress in parallel into a dynamically resizing buffer. This is ridiculously complex compared to a memcpy.
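The statelessness argument is easy to demonstrate: with UTF-8, a substring taken at a character boundary is itself valid UTF-8 and can be produced by a plain byte copy. A sketch (the irc-style message is made up):

```python
msg = "PRIVMSG #ch :żółte światło"      # hypothetical irc-style line
data = msg.encode("utf-8")

# strip everything up to and including " :" with a byte-level search and slice;
# no re-encoding is needed because UTF-8 carries no shift state
body = data[data.index(b" :") + 2:]
assert body.decode("utf-8") == "żółte światło"
```

With SCSU the same slice could land in the middle of a windowed mode set up by the stripped prefix, so the suffix would have to be recompressed, exactly as described above.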
The *only* advantage of SCSU is that it takes little space. Although in most programs most strings are ASCII, and SCSU never beats ISO-8859-1, which is what the implementation of my language uses for strings with no characters above U+00FF, so it usually does not even have this advantage. Disadvantages are everywhere else: every operation which looks at the contents of a string or produces contents of a string is more complex. Some operations can't be supported at all with the same asymptotic complexity, so the API would have to be changed as well to use opaque iterators instead of indices. It's more complicated both for internal processing and for interoperability (unless the other end understands SCSU too, which is unlikely). Plain immutable character arrays are not completely universal either (e.g. they are not sufficient for the buffer of a text editor), but they are appropriate as the default representation for common cases: for representing filenames, URLs, email addresses, computer language identifiers, command line option names, lines of a text file, messages in a dialog in a GUI, names of columns of a database table etc. Most strings are short, and thus performing a physical copy when extracting a substring is not disastrous. But the complexity of SCSU is too bad. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
Philippe Verdy [EMAIL PROTECTED] writes: There's nothing that requires the string storage to use the same exposed array, The point is that indexing should better be O(1). Not having a constant size per code point requires one of three things: 1. Using opaque iterators instead of integer indices. 2. Exposing a different unit in the API. 3. Living with the fact that indexing is not O(1) in general; perhaps with clever caching it's good enough in common cases. Although all three choices can work, I would prefer to avoid them. If I had to, I would probably choose 1. But for now I've chosen a representation based on code points. Anyway, each time you use an index to access some components of a String, the returned value is not an immutable String, but a mutable character or code unit or code point, from which you can build *other* immutable Strings No, individual characters are immutable in almost every language. Assignment to a character variable can be thought of as changing the reference to point to a different character object, even if it's physically implemented by overwriting the raw character code. When you do that, the returned character or code unit or code point does not guarantee that you'll build valid Unicode strings. In fact, such a character-level interface is not enough to work with and transform Strings (for example it does not work to perform correct transformation of lettercase, or to manage grapheme clusters). This is a different issue. Indeed transformations like case mapping work in terms of strings, but in order to implement them you must split a string into some units of bounded size (code points, bytes, etc.). All non-trivial string algorithms boil down to working on individual units, because conditionals and dispatch tables must be driven by finite sets. Any unit of a bounded size is technically workable, but they are not equally convenient. Most algorithms are specified in terms of code points, so I chose code points as the basic unit in the API.
In fact in my language there is no separate character type: a code point extracted from a string is represented by a string of length 1. It doesn't change the fact that indexing a string by code point index should run in constant time, and thus using UTF-8 internally would be a bad idea unless we implement one of the three points above. Once you realize that, which UTF you use to handle immutable String objects is not important, because it becomes part of the blackbox implementation of String instances. The black box must provide enough tools to implement any algorithm specified in terms of characters, an algorithm which was not already provided as a primitive by the language. Algorithms generally scan strings sequentially, but in order to store positions to come back to them later you must use indices or some iterators. Indices are simpler (and in my case more efficient). Using SCSU for such a String blackbox can be a good option if this effectively helps to store many strings in a compact (for global performance) but still very fast (for transformations) representation. I disagree. SCSU can be a separate type to be used explicitly, but it's a bad idea for the default string representation. Most strings are short, and thus constant factors and simplicity matter more than the amount of storage. And you wouldn't save much storage anyway: as I said, in my representation strings which contain only characters U+0000..U+00FF are stored one byte per character. The majority of strings in average programs is ASCII. In general what I don't like in SCSU is that there is no obvious compression algorithm which makes good use of its various features. Each compression algorithm is either not as powerful as it could be, or is extremely slow (trying various choices), or is extremely complicated (trying only sensible paths).
Unfortunately, the immutable String implementations in Java or C# or Python do not allow the application designer to decide which representation will be the best (they are implemented as concrete classes instead of virtual interfaces with possible multiple implementations, as they should be; the alternative to interfaces would have been class-level methods allowing the application to negotiate the tuning parameters with the blackbox class implementation). Some functions accept any sequence of characters. Other functions accept only standard strings. The question is how often to use each style. Choosing the first option increases flexibility but adds an overhead in the common case. For example case mapping of a string would have to either perform dispatching at each step, or be implemented twice. Currently it's implemented for strings only, in C, and thus avoids calling a generic indexing function and other overheads. At some time I will probably implement it again, to work for arbitrary sequences of characters, but it's more work for effects that I don't currently need, so it's not a priority.
Re: Nicest UTF
Philippe Verdy [EMAIL PROTECTED] writes: Decoding SCSU is very straightforward, But not for random access by code point index, which is needed by many string APIs. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Nicest UTF
Arcane Jill [EMAIL PROTECTED] writes: Oh for a chip with 21-bit wide registers! Not 21-bit but 20.087462841250343-bit :-) -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
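The figure is just the base-2 logarithm of the number of code points, U+0000 through U+10FFFF; a one-line check (Python used for the arithmetic only):

```python
import math

# 0x10FFFF is the last code point, so there are 0x110000 of them in total
bits = math.log2(0x10FFFF + 1)
assert abs(bits - 20.087462841250343) < 1e-9
```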
Re: Nicest UTF
Theodore H. Smith [EMAIL PROTECTED] writes: Assuming you had no legacy code. And no handy libraries either, [...] What would be the nicest UTF to use? For the internals of my language Kogut I've chosen a mixture of ISO-8859-1 and UTF-32, normalized: a string whose characters all fit in the narrow form is always stored in the narrow form. I've chosen representations with fixed-size code points because nothing beats the simplicity of accessing characters by index, and the most natural thing to index by is a code point. Strings are immutable, so there is no need to upgrade or downgrade a string in place, so having two representations doesn't hurt that much. Since the majority of strings is ASCII, using UTF-32 for everything would be wasteful. Mutable and resizable character arrays use UTF-32 only. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
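A sketch of such a normalized dual representation (a hypothetical class, not Kogut's actual implementation; the narrow form here is Latin-1, the wide form UTF-32):

```python
import array

class Str:
    """Immutable string stored narrow (Latin-1) when possible, else UTF-32."""

    def __init__(self, text):
        cps = [ord(c) for c in text]
        if all(cp <= 0xFF for cp in cps):
            self._data = bytes(cps)             # 1 byte per code point
        else:
            self._data = array.array("I", cps)  # 4 bytes per code point

    def __len__(self):
        return len(self._data)

    def __getitem__(self, i):                   # O(1) code point access
        return chr(self._data[i])

narrow = Str("hello")
wide = Str("hello \u0119\U0001D11E")
assert isinstance(narrow._data, bytes)
assert isinstance(wide._data, array.array)
assert wide[6] == "\u0119" and wide[7] == "\U0001D11E"
```

Because instances are immutable, the narrow/wide decision is made once at construction and never revisited, which is the point made above.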
Re: Unicode IDNs
Donald Z. Osborn [EMAIL PROTECTED] writes: Is anyone aware of URLs that use extended Latin characters as examples? http://w.pl/ -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: bit notation in ISO-8859-x is wrong
[EMAIL PROTECTED] (James Kass) writes: [...] If there are eight bits, why shouldn't they be bits one through eight? Because then the number of a bit doesn't correspond to the exponent of its weight, so I don't even know in which order they are specified (as many people order bits backwards, i.e. from the most significant). -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: XML and Unicode interoperability comes before HTML or even SGML (was: Combining across markup?)
In a message of Sat, 14-08-2004, 12:35 +0200, Philippe Verdy wrote: Simply because, for both Unicode and ISO/IEC 10646, the character model includes the fact that ANY base character forms a combining character sequence with ANY following combining character or ZW(N)J character. Shouldn't the grapheme cluster boundary and word boundary rules in http://www.unicode.org/reports/tr29/ handle ZW(N)J? -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Combining across markup?
In a message of Thu, 12-08-2004, 13:00 -0400, John Cowan wrote: Even better yet: Have the W3C rephrase their demand that no element should start with a defective sequence (when considered separately) as that no *block-level* element should etc., and leave things like <span>, <i> and other in-line elements free to start with a combining character (provided that the said in-line container is not the first within a block-level element, of course). The trouble with that idea is that in XML generally we don't know what is a block-level element: elements are just elements, and it's up to rendering routines whether they appear as block, inline, or not at all. So if on that level of abstraction it is not known whether it would make sense or not for the higher layers, it should be permitted in all cases. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
RE: Combining across markup? (Was: RE: sign for anti-neutrino - greek nu with diacritical line above - workaround?)
In a message of Tue, 10-08-2004, 18:33 +0100, Jon Hanna wrote: By the rules of XML replacing &#x338; with U+226F would mean the document was no longer well-formed. Really? I don't have an XML spec handy, but character references like &#x338; can't be processed before parsing tags, because &#60; is the literal character <, not the start of a tag. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
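This is indeed how conforming parsers behave; a quick check with Python's `xml.etree` (any XML parser would do):

```python
import xml.etree.ElementTree as ET

# &#60; resolves to a literal "<" in character data, never to a tag start
root = ET.fromstring("<a>&#60;b&#62;</a>")
assert root.text == "<b>"
```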
Re: Microsoft Unicode Article Review
In a message of Thu, 05-08-2004, 15:52 -0500, John Tisdale wrote: Yet, if you are working with an application that must parse and manipulate text at the byte level, the costliness of variable-length encoding will probably outweigh the benefits of ASCII compatibility. In such a case the fixed length of UCS-2 will usually prove the better choice. This is why Windows NT and subsequent Microsoft operating systems, SQL Server 7 (and subsequent ones), XML, Java, COM, ODBC, OLEDB and the .NET framework are all built on UCS-2 Unicode encoding. At least some of them use UTF-16, not UCS-2, e.g. Java 1.5. I wonder if it's not most of them actually. At least in theory. The uniform length of UCS provides a good foundation when it comes to complex data manipulation. And thus this point does not apply to them (unless you count apps which break for characters outside the BMP). There are other technical differences between these standards that you may want to consider that are beyond the scope of this article (such as how UTF-16 supports surrogate pairs but UCS-2 does not). I don't like perpetuating the myth that Unicode is a 16-bit encoding and UCS-2 can represent all Unicode characters. Yes, in some places you mention that there are also some characters above the first 64k, but the general impression from the article is that UCS-2 is one of equally-functional representations of Unicode, while in fact this is the only representation which doesn't cover all code points. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
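The UCS-2/UTF-16 difference in one example: a supplementary-plane character needs a surrogate pair in UTF-16, which UCS-2 simply cannot express. A sketch:

```python
ch = "\U0001D11E"                    # MUSICAL SYMBOL G CLEF, outside the BMP

# UTF-16 encodes it as the surrogate pair D834 DD1E (two 16-bit units)
assert ch.encode("utf-16-be") == b"\xd8\x34\xdd\x1e"

# manual derivation of the pair, as specified for UTF-16
v = ord(ch) - 0x10000
hi, lo = 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)
assert (hi, lo) == (0xD834, 0xDD1E)
```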
Re: UAX 15 hangul composition
In a message of Tue, 03-08-2004, 13:47 +0200, Theo Veenker wrote: Don't know if this has been asked/reported before, but is the example code for hangul composition in UAX 15 correct? I reported it a month ago and got a response stating that "This has been forwarded to the right people, and they are looking into it." The TIndex <= TCount should be TIndex < TCount, I think. Right. Also, 0 <= TIndex should be 0 < TIndex. IMO the example would be clearer if the Hangul_Syllable_Type property were used. I prefer to have formulas rather than tables for something which can be computed in a simple way. Recently I implemented some Unicode algorithms in a way which resulted in static linking of the relevant code into many programs. So it was important to make the executable size small, which meant that I had to invent some compressed representation of various tables, and to prefer formulas. I used a Hangul_Syllable_Type table before I realized that this data can be computed. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Umlaut and Tréma, was: Variation selectors and vowel marks
In a message of Fri, 23-07-2004, 18:01 +0200, Philipp Reichmuth wrote: However, to return to the original problem, I don't remember ever having seen data where it would be necessary to distinguish between trema and diaeresis in the data itself. A similar issue: a Polish encyclopaedia I have from 1985 sorts words with Ó differently depending on whether it is the Polish Ó (sorted between O and P, like other Polish letters are sorted after the letters without accents) or foreign (folded with O, like other foreign accents are folded). It's typeset in the same way:
MOQUETTE
MÓR [mo:r], city in Hungary
MORA
MÓRA [mo:ro] Ferenc, Hungarian writer
MORACZEWSKA
[...]
MOŻNOWŁADZTWO
MÓR (a Polish word)
[...]
MÓŻDŻEK (a Polish word)
MPHAHLELE
-- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Folding algorithm and canonical equivalence
In a message of Sat, 17-07-2004, 16:46 -0700, Asmus Freytag wrote: I wonder whether that's truly intended, or whether it could be replaced by a combination of AccentFolding and OtherDiacriticFolding, where AccentFolding removes *all* nonspacing marks following Latin, Greek or Cyrillic letters, and we would remove from DiacriticFolding all cases that are already handled by accent folding. I don't think folding cyrillic short I to I would be right. While graphically it's a combining mark, semantically it would be like folding I with J. What are the purposes of this folding? -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: Looking for transcription or transliteration standards latin- arabic
In a message of Fri, 09-07-2004, 19:34 -0700, Asmus Freytag wrote: o-slash, can be analyzed as o and slash, even though that's not done canonically in Unicode. Allowing users outside Scandinavia to perform fuzzy searches for words with this character is useful. In this view of folding, language-specific fuzzy searches would be tailored (usually by being based on collation information, rather than on generic diacritic folding). In Polish, letters with diacritics are sorted after the corresponding letters without them. Omitting diacritics is an error, even though text without them is generally readable. They are removed when the given protocol requires or encourages ASCII (e.g. filenames to be used in URLs, login names, variable names in programming languages, ancient computer systems). There is no alternate spelling scheme like German AE/OE/UE/SS. Polish letters are never folded when sorting lexicographically. This applies to Ó in the same way as to the other eight letters. Foreign diacritics are always folded though; at least I don't remember seeing any other case. I think Ø would be folded together with O in an encyclopaedia if this is a foreign O with some accent, unrelated to the Polish Ó which is a separate letter (can you suggest some non-Polish word starting with Ø which could be found in an encyclopaedia?). But there are cases when I would prefer to fold Polish diacritics in searches. It's basically every case when you are not sure that all stored data is using diacritics, for example in generic WWW searching. There are still people who don't use diacritics in usenet and email, or in entries in guest books and other unprofessional web content. There are even sometimes people who insist that Polish letters *should not* be used in usenet and email because some computer systems can't handle them. Diacritics are rare on IRC (because the IRC protocol doesn't distinguish between CP-1250, ISO-8859-2 and UTF-8) and with instant messengers (because of laziness).
This is why for searching archives of unknown data it's generally better to fold them. As far as I know, the default UCA folds these letters except Ł, and the standard Polish tailoring doesn't fold any Polish letter. While not folding them in searching is technically correct and nobody would be surprised that they are not folded, it's often more useful to fold them, and people would be pleasantly surprised if they don't have to repeat the search with omitted diacritics. If one wants to find data containing a word, rather than collect statistics about usage of a word with and without diacritics, it's very rare that folding does any harm. Hmm, it's not that simple. When I'm searching for JĘZYK (an existing word), I will be happy to find occurrences of JEZYK too (a non-existing word, which must have had its diacritics stripped), but it makes no sense to return JEŻYK (another existing word). It's not just making the letters equivalent. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
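The JĘZYK/JEŻYK point can be made concrete with a naive decomposition-based folding (a sketch; a real search engine would use tailored collation instead):

```python
import unicodedata

def fold(s):
    # NFD-decompose, then drop all combining marks: naive accent folding
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(c))

assert fold("JĘZYK") == "JEZYK"     # the match we want (stripped diacritics)
assert fold("JEŻYK") == "JEZYK"     # the unwanted collision with another word
```

Both words fold to the same key, so a folding search for one necessarily returns the other as well.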
Re: Looking for transcription or transliteration standards latin- arabic
In a message of Tue, 06-07-2004, 10:50 +0100, Peter Kirk wrote: I guess another similar change would be Danzig - Gdansk, but I don't know where the initial G came from, so possibly the Polish form is older than the German. A name with initial Gd is older than one with D: http://encyclopedia.thefreedictionary.com/Gdansk http://en.wikipedia.org/wiki/Gda%C5%84sk#Names but Wikipedia now has a hot dispute about what it should call the city: http://en.wikipedia.org/wiki/Talk:Gdansk/Naming_convention -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Error in Hangul composition code
http://www.unicode.org/reports/tr15/ says:

    int SIndex = last - SBase;
    if (0 <= SIndex && SIndex < SCount && (SIndex % TCount) == 0) {
        int TIndex = ch - TBase;
        if (0 <= TIndex && TIndex <= TCount) {
            // make syllable of form LVT
            last += TIndex;
            result.setCharAt(result.length()-1, last); // reset last
            continue; // discard ch
        }
    }

But there is no character at TBase == U+11A7. TBase is put one code point below the first trailing consonant, because TIndex == 0 as computed from SIndex % TCount generally means that there is no trailing consonant. Also, the character at TBase + TCount doesn't compose with LV. Adding a count to a base points to the first code point *after* the range. So the condition should be: if (0 < TIndex && TIndex < TCount). -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
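With the corrected bounds, the LV+T composition can be checked against a normalization library (Python's NFC used as the reference implementation here):

```python
import unicodedata

TBase, TCount = 0x11A7, 28

lv = "\uAC00"        # HANGUL SYLLABLE GA, an LV syllable
t = "\u11A8"         # HANGUL JONGSEONG KIYEOK, a trailing consonant

TIndex = ord(t) - TBase
assert 0 < TIndex < TCount           # the corrected condition holds

# composing LV with T yields the LVT syllable at lv + TIndex (U+AC01, GAG)
composed = chr(ord(lv) + TIndex)
assert unicodedata.normalize("NFC", lv + t) == composed
```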
Re: Shape of the US Dollar Sign
Fri, 28 Sep 2001 09:58:39 -0600, Jim Melton [EMAIL PROTECTED] pisze: I believe this is nothing but a font/glyph/presentation issue. A font for text mode I once made had the dollar like this: . . . . . . . . . . . . # . # . . . . . . # . # . . . . . # # # # # . . . # # . # . # # . . # # . # . . . . . # # . # . . . . . . # # # . . . . . . . . # # # . . . . . . # . # # . . . . . # . # # . . # # . # . # # . . . # # # # # . . . . . # . # . . . . . . # . # . . . . . . . . . . . . -- __( Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/ \__/ ^^ SYGNATURA ZASTPCZA QRCZAK
Re: 3rd-party cross-platform UTF-8 support
Thu, 20 Sep 2001 12:46:49 -0700 (PDT), Kenneth Whistler [EMAIL PROTECTED] writes: If you are expecting better performance from a library that takes UTF-8 API's and then does all its internal processing in UTF-8 *without* converting to UTF-16, then I think you are mistaken. UTF-8 is a bad form for much of the kind of internal processing that ICU has to do for all kinds of things -- particularly for collation weighting, for example. Any library worth its salt would *first* convert to UTF-16 (or UTF-32) internally, anyway, before doing any significant semantic manipulation of the characters. Why would UTF-16 be easier for internal processing than UTF-8? Both are variable-length encodings. -- __(" Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/ \__/ ^^ SYGNATURA ZASTPCZA QRCZAK
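The symmetry is easy to show: outside the BMP, UTF-16 is just as multi-unit as UTF-8. A sketch:

```python
ch = "\U0001D11E"                    # a supplementary-plane character

# both encodings need more than one code unit for it
assert len(ch.encode("utf-16-be")) // 2 == 2   # two 16-bit units
assert len(ch.encode("utf-8")) == 4            # four 8-bit units
```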
Re: Any tools to convert HTML unicode to JAVA unicode
Wed, 19 Sep 2001 03:47:59 -0700 (PDT), MindTerm [EMAIL PROTECTED] writes: I would like to ask for any tools to convert HTML unicode ( e.g. &#nnnn; ) to JAVA unicode ( e.g. \unnnn ) ? Here is a Perl program which does this: perl -pe 'BEGIN {sub java ($) {sprintf "\\u%04x", $_[0]}} s/&#x([0-9A-Fa-f]+);/java hex $1/ge; s/&#(\d+);/java $1/ge' -- __(" Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/ \__/ ^^ SYGNATURA ZASTPCZA QRCZAK
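An equivalent of the one-liner in Python, for readers without Perl at hand (same two substitutions, hex references first; the function name is made up):

```python
import re

def html_to_java(s):
    # &#xHHHH; (hex) then &#NNNN; (decimal) -> \uHHHH Java escapes
    s = re.sub(r"&#x([0-9A-Fa-f]+);",
               lambda m: "\\u%04x" % int(m.group(1), 16), s)
    s = re.sub(r"&#(\d+);",
               lambda m: "\\u%04x" % int(m.group(1), 10), s)
    return s

# note: code points above U+FFFF would need a surrogate pair in Java
# \u notation; ignored here, just like in the original one-liner
assert html_to_java("&#x4E2D;&#25991;") == "\\u4e2d\\u6587"
```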
Re: CESU-8 vs UTF-8
Sun, 16 Sep 2001 01:14:06 -0700, Carl W. Brown [EMAIL PROTECTED] writes: If it can be demonstrated that there is a real need for an encoding like CESU-8 then it should be very different from UTF-8. How does SCSU for example sort? SCSU encoding is non-deterministic and its representations can't be compared lexicographically at all (logically equal strings might compare unequal). Ehh, we wouldn't have the problem with CESU-8 now if Unicode hadn't been described as a 16-bit encoding in the past. I still think that UTF-16 was a big mistake. Too bad that it still affects people who avoid it. We can't change the past, but I hope that at least UTF-8 processing can be done without treating surrogates in any special way. Surrogates are relevant only for UTF-16; by not using UTF-16 you should be free of surrogate issues, except for having a silly unused area in character numbers and a silly highest character number. Please don't spread UTF-16 madness where it doesn't belong. -- __(" Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/ \__/ ^^ SYGNATURA ZASTPCZA QRCZAK
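On sorting: UTF-8 compared bytewise preserves code point order, while UTF-16 does not (code units E000..FFFF sort above surrogates), which is the very mismatch CESU-8 exists to paper over. A small demonstration:

```python
a, b = "\uFB01", "\U0001D11E"        # U+FB01 < U+1D11E in code point order

# UTF-8 byte strings compare in the same order as the code points
assert a.encode("utf-8") < b.encode("utf-8")

# UTF-16 byte strings do not: unit FB01 sorts above the D834 high surrogate
assert not (a.encode("utf-16-be") < b.encode("utf-16-be"))
```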
Re: PDUTR #26 posted
Thu, 13 Sep 2001 12:52:04 -0700, Asmus Freytag [EMAIL PROTECTED] writes: UTF-32 does have the same byte order issues as UTF-16, except that byte order is recognizable without a BOM. UTF-8 would be used for external communication almost exclusively. Especially as it's compatible with ASCII and thus fits nicely into existing protocols. Since you speak of internal processing: One software architect I spoke with brought this to a nice point: With UTF-16 I can put twice the data in my in-memory hash table and have *on average* the same 1:1 character code:code point characteristics for processing. That's a win-win. Only if you manage to process characters above U+FFFF correctly. It's so easy to make processing efficient and wrong. UTF-8, while even more compressed for European data (it's 50% larger than utf-16 for ideographs), uses multi-code element encoding for all but ASCII, But UTF-16 also uses multi-code element encoding! For program complexity it doesn't matter how often it occurs if variable-length encoding has to be handled anyway. You can't take a character from a string by random index in either case, for example. Since most operations are perforce exposed to its variable length, unlike UTF-16 processing, which can be optimized for the much more frequent 1-unit case, How optimized? By maintaining a flag telling whether all characters fit under U+10000 and using separate routines for these cases? It's yet more efficient to forget about UTFs and store characters in 8, 16 or 32 bits, whatever is the first which fits. Forget about surrogates. It's simpler. utf-8 cannot as readily be used as internal format. It's as easy as UTF-16. Unless you want a broken implementation which treats surrogates as pairs of characters. It's as broken as treating multibyte sequences of UTF-8 as separate characters.
Unicode limited to UTF-8 and UTF-32 would be a lot less attractive and you would not have seen it implemented in Windows, Office and other high volume platforms as early and as widespread as it has been. I don't use Windows. I use UTF-8 much more often than UTF-16 (but still rarely). -- __(" Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/ \__/ ^^ SYGNATURA ZASTPCZA QRCZAK
Re: PDUTR #26 posted
Wed, 12 Sep 2001 11:08:41 -0700, Julie Doll Allen [EMAIL PROTECTED] writes: Proposed Draft Unicode Technical Report #26: Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is now available at: http://www.unicode.org/unicode/reports/tr26/ IMHO Unicode would have been a better standard if UTF-16 hadn't existed. Just UTF-8 and UTF-32, code points in the range U+0000..U+7FFFFFFF, no surrogates, no confusion about "how many bits is Unicode", an ASCII-compatible encoding in most external transmissions, uniform width for internal processing, and practically no byte ordering issues. Much simpler. -- __(" Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/ \__/ ^^ SYGNATURA ZASTPCZA QRCZAK
Re: [OT] o-circumflex
Mon, 10 Sep 2001 10:47:48 +0200, Marco Cimarosti [EMAIL PROTECTED] writes: It's as weird as some Italian names for German cities: Aquisgrana for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di Baviera) for München. Interesting that the Polish names of these cities are more like the Italian than the German: Akwizgran, Augsburg, Moguncja, Monachium. København is Kopenhaga, again more like other foreign forms than the Danish. -- __(" Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/ \__/ ^^ SYGNATURA ZASTPCZA QRCZAK
Re: Nonsense in http://www.unicode.org/Public/PROGRAMS/CVTUTF/CVTUTF.C?
Wed, 22 Aug 2001 15:59:15 -0700, Michael (michka) Kaplan [EMAIL PROTECTED] writes:

Functions ConvertUCS4toUTF8 and ConvertUTF8toUCS4 use surrogates in UCS4. In particular ConvertUTF8toUCS4 converts a character above U+FFFF into two UCS4 words.

Why is this absurd there?! UCS-4 has no knowledge of surrogate code points or their significance; it is a purely algorithmic conversion. Not sure why the results would be so surprising, given this?

I don't understand. I'm talking about characters above U+FFFF, not about characters from the range U+D800..DFFF. They are represented as themselves in UCS-4. But the said routine represents them as pairs of surrogates.
Re: COMMERCIAL AT
Sat, 14 Jul 2001 11:51:29 +0100, Michael Everson [EMAIL PROTECTED] writes: References to animals are the most common. Germans, Dutch, Finns, Hungarians, Poles and South Africans see it as a monkey tail.

Indeed it's commonly called "monkey" in Polish (in parallel with "at"), but some call it "elephant's ear".
Re: More about SCSU (was: Re: A UTF-8 based News Service)
Fri, 13 Jul 2001 03:01:10 EDT, [EMAIL PROTECTED] writes: Unfortunately, you don't hear much about SCSU, and in particular the Unicode Consortium doesn't really seem to promote it much (although they may be trying to avoid the "too many UTFs" syndrome).

SCSU doesn't look very nice to me. The idea is OK, but it's just too complicated. The various proposals for encoding differences or XORs between consecutive characters are IMHO technically better: much simpler to implement, and they work as well.
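A minimal sketch of the difference-coding idea mentioned above (my own illustration, not any concrete proposal): encoding each code point as its delta from the previous one turns text drawn from one small alphabet block into a stream of small numbers that a generic compressor handles well.

```haskell
-- Illustrative difference coding over code points.  Runs of characters
-- from one alphabet block become small deltas after the first character.
deltaEncode :: [Int] -> [Int]
deltaEncode xs = zipWith (-) xs (0 : xs)

deltaDecode :: [Int] -> [Int]
deltaDecode = tail . scanl (+) 0

main :: IO ()
main =
  -- U+0105, U+0107, U+0119: Polish letters from one Latin Extended-A run
  print (deltaEncode [0x105, 0x107, 0x119])
```

Unlike SCSU, this needs no window state machine; the round trip is a two-liner.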
Re: Terms constructed script, invented script (was: FW: Re: Shavian)
7 Jul 2001 11:01:18 GMT, Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] writes: I put a sample at http://qrczak.ids.net.pl/vi-001.gif

Now I have put a prettier version there: with variable line width, serifs, a slightly improved sizing engine (enlargement of rounded parts, to make them look the same size as straight parts, happens locally instead of only at the top and bottom of a letter), and with all dots looking exactly the same thanks to rounding the coordinates of their centers to whole pixels (or whole pixels and a half, in the case of an even dot size).

I still can't have serifs on the ends of slanted lines, but those occur only in ASCII shapes, not in my script, so I'm not sure that I want them badly enough. Serifs are really triangles, so they look like traditional serifs only at small pixel sizes like this one.

It would be nice to be able to draw it with TeX, but I don't know TeX well enough. I will not reimplement the whole of Metafont myself either :-)
Re: Terms constructed script, invented script (was: FW: Re: Shavian)
In a message dated 2001-07-06 0:31:39 Pacific Daylight Time, [EMAIL PROTECTED] writes: I wonder: why aren't languages with simple syllabic structures written in hiragana? It seems to be built for them.

I have been using my own script, inspired by hiragana, for writing Polish for 10 years. It looks very different; I only liked the idea of having letters for consonant+vowel pairs, and stretched it a bit.

I put a sample at http://qrczak.ids.net.pl/vi-001.gif (resolution suitable for printing at 300 dpi). For example the subject says: Re: vi (Re: O wyższości znaku zachęty nad GUI), i.e. Re: vi (Re: About the superiority of the command-line prompt over GUI), which has only 11 letters between the second Re: and GUI.

I won't dare propose encoding it in Unicode. The number of users is approaching two. But technically it's an interesting script with a non-trivial rendering engine. I implemented the rendering engine and a translator from standard Polish orthography (not perfect due to ambiguities in our orthography - I modified the orthography a little to resolve them). I did it to practice reading. Before, I could only practice writing - it's hard to read what you have just written, because you remember what you wrote!

Letters are composed from core characters by the engine. There are 35 consonants, 8 normal vowels, 1 extra vowel, a joiner, and a non-joiner. They produce an unbounded number of letters.

(1) Adjacent consonants are joined up to some limit (2 is a good choice, but there is no semantic difference here), and they are joined with the following vowel if present (this is mandatory).

(2) A consonant+vowel pair must be split if it straddles the border between a prefix and a stem or the like. Such pairs are also split in some foreign words to force correct pronunciation (the pronunciation of a consonant sometimes depends on the following vowel and vice versa). The non-joiner is used to encode such splitting in the stream of core characters. 
(3) The default (greedy) splitting of chunks of consonants is not always perfect, e.g. when it would join the final part of a prefix with the beginning of the stem. The joiner and non-joiner are used to prevent or force splitting at certain points between consonants. Forced joining overrides the limit on joined consonants.

(4) Any two letters can be joined by writing one above the other with a dot between them. This is never required by the orthography but is sometimes good style, e.g. in the "od" prefix and in diphthongs. The joiner is used to encode that.

Finally, there are cases where a consonant+vowel pair is split according to (2) and then joined according to (4). I encode such a case with joiner + non-joiner + joiner. I think there is already a similar practice in Unicode, used for Arabic ligatures.

Actually I'm not even using PUA characters but an ASCII-based escaping scheme, because I don't have an editor capable of editing text in such a script. But simple non-joined letters put in a font, with the ability to directly edit joiners and non-joiners, would be technically workable. The meaning of a text file would then be unambiguous modulo the PUA assignment (the ASCII-based escaping is a hack).
Re: validity of lone surrogates
Tue, 3 Jul 2001 11:19:05 +0100, Michael Everson [EMAIL PROTECTED] writes:

I would be glad if the resolution allowed UTF-8 and UTF-32 encoders and decoders to not worry about surrogates at all. Please leave surrogate issues to UTF-16.

But what if I want to put up a Web page in Etruscan?

UTF-8 and UTF-32 handle characters above U+FFFF with no problem. I mean: forget about surrogates, i.e. about encoding those characters as pairs of words in the range 0xD800..DFFF, in encodings other than UTF-16. For those encodings U+D800..DFFF are just code points like any others; they encode the whole contiguous range U+0000..10FFFF (the maximum would be U+7FFFFFFF if the idea of UTF-16 hadn't been pushed so hard).
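For reference, the UTF-8 algorithm itself is uniform over the whole code point range. A sketch (mine): note that today's standard forbids encoding U+D800..DFFF in UTF-8, which is exactly the special-casing the post argues against; this version treats every code point alike.

```haskell
-- Plain UTF-8 encoding, uniform over all code points up to U+10FFFF.
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Word (Word8)

utf8 :: Int -> [Word8]
utf8 c
  | c < 0x80    = [fromIntegral c]
  | c < 0x800   = [0xC0 .|. f (c `shiftR` 6), cont c]
  | c < 0x10000 = [0xE0 .|. f (c `shiftR` 12), cont (c `shiftR` 6), cont c]
  | otherwise   = [ 0xF0 .|. f (c `shiftR` 18), cont (c `shiftR` 12)
                  , cont (c `shiftR` 6), cont c ]
  where
    f = fromIntegral
    cont x = 0x80 .|. fromIntegral (x .&. 0x3F)  -- continuation byte

main :: IO ()
main = print (utf8 0x10FFFF)
```

Nothing in the algorithm distinguishes U+D800..DFFF; the restriction is a conformance rule layered on top, not a property of the encoding.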
Re: validity of lone surrogates (was Re: Unicode surroga tes: just say no!)
27 Jun 2001 13:38:33 +0100, Gaute B Strokkenes [EMAIL PROTECTED] writes: I would be indebted if any of the experts who hang out on the unicode list could sort out this confusion.

I would be glad if the resolution allowed UTF-8 and UTF-32 encoders and decoders to not worry about surrogates at all. Please leave surrogate issues to UTF-16.

It's a pity that UTF-16 doesn't encode characters up to U+FFFFF, such that code points corresponding to lone surrogates could be encoded as pairs of surrogates.
Re: validity of lone surrogates (was Re: Unicode surrogates: just say no!)
Tue, 3 Jul 2001 01:50:56 -0700, Michael (michka) Kaplan [EMAIL PROTECTED] writes:

It's a pity that UTF-16 doesn't encode characters up to U+FFFFF, such that code points corresponding to lone surrogates could be encoded as pairs of surrogates.

Unfortunately, we would then be stuck with what happens when two such surrogate surrogates are next to each other.

There is no problem with that.

Encoding: a character U+0000..D7FF or U+E000..FFFF is encoded as a single 16-bit word. A character U+D800..DFFF or U+10000..FFFFF is encoded as two 16-bit words: 0xD800 + (ch >> 10) and 0xDC00 + (ch & 0x3FF).

Decoding: a word 0x0000..D7FF or 0xE000..FFFF stands for itself. Otherwise a word 0xD800..DBFF must be followed by a word 0xDC00..DFFF, and the code obtained from them must be in the range U+D800..DFFF or U+10000..FFFFF. The word stream is invalid in other cases (unpaired surrogates, or surrogates which encode a character that could have been encoded as a single word).

This gives an unambiguous mapping of all code points U+0000..U+FFFFF to single or double 16-bit words. The code space has exactly 20 bits. Code points corresponding to surrogates could even be allocated for real characters.

Unicode issues would be simpler if UTF-16 as defined today did not exist. UTF-16 spreads its ugliness to other encoding forms, and many people think that Unicode implies 16 bits per character. There is a tendency to use UTF-16 internally and ignore characters above U+FFFF, treating surrogates as real characters which must come in pairs in order to encode glyphs.

I suppose that we are stuck with UTF-16 forever, so please at least don't spread surrogates to UTF-8 and UTF-32, which don't need to treat the range U+D800..DFFF in any special way. It was hard enough for me to accept that the code point space ends at a funny address, U+10FFFF. UTF-8 was so nice at 31 bits.
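The scheme described above can be written down directly. This is a sketch of the post's hypothetical 20-bit encoding, NOT standard UTF-16 (note there is no 0x10000 offset: the pair carries the raw 20-bit value):

```haskell
-- Hypothetical 20-bit UTF-16 variant from the post, not standard UTF-16:
-- every code point U+0000..FFFFF is encodable, including "lone surrogates".
import Data.Bits (shiftL, shiftR, (.&.))
import Data.Word (Word16)

encode20 :: Int -> [Word16]
encode20 c
  | c < 0xD800 || (c >= 0xE000 && c <= 0xFFFF) = [fromIntegral c]
  | c <= 0xFFFFF = [ 0xD800 + fromIntegral (c `shiftR` 10)
                   , 0xDC00 + fromIntegral (c .&. 0x3FF) ]
  | otherwise = error "out of the 20-bit code space"

decode20 :: [Word16] -> [Int]
decode20 [] = []
decode20 (w:ws)
  | w < 0xD800 || w >= 0xE000 = fromIntegral w : decode20 ws
decode20 (hi:lo:ws)
  | hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF
  , c >= 0xD800 && (c <= 0xDFFF || c >= 0x10000)  -- reject overlong pairs
  = c : decode20 ws
  where c = (fromIntegral (hi - 0xD800) `shiftL` 10) + fromIntegral (lo - 0xDC00)
decode20 _ = error "invalid word stream"

main :: IO ()
main = print (decode20 (concatMap encode20 [0x41, 0xD800, 0xFFFFF]))
```

Two adjacent pairs pose no ambiguity because a high word 0xD800..DBFF can only begin a pair and a low word 0xDC00..DFFF can only end one, just as in real UTF-16.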
Re: How does Python Unicode treat surrogates?
Mon, 25 Jun 2001 07:24:28 -0700, Mark Davis [EMAIL PROTECTED] writes: In most people's experience, it is best to leave the low-level interfaces with indices in terms of code units, then supply some utility routines that tell you information about code points.

It's better yet to work on characters instead of code units internally, i.e. to use UTF-whatever only for interaction with the external world. Unfortunately some languages made the mistake of using only 16 bits per character, and in them it's not easy.
Re: How will software source code represent 21 bit unicode characters?
Tue, 17 Apr 2001 07:33:16 +0100, William Overington [EMAIL PROTECTED] writes:

In Java source code one may currently represent a 16-bit Unicode character by using \uhhhh where each h is any hexadecimal character. How will Java, and maybe other languages, represent 21-bit Unicode characters?

In Haskell the character U+FFFD can be written thus (inside a character or string literal): \65533 \xFFFD \o177775. Such escape sequences can have any number of digits. The sequence \& expands to the empty string and is used to protect a numeric escape from the following text if it begins with a digit.

May I, with permission, start a discussion by suggesting that \uhhhh, \vhhhhh and \whhhhhh would be good formats. Programmers could then enter Unicode characters into software source code using \u and four hexadecimal characters, or \v and five hexadecimal characters, or \w and six hexadecimal characters, as convenient for any particular character.

This conflicts with the usage of \v as vertical tab.
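These escape forms can be checked directly in any Haskell implementation; the three numeric notations denote the same character, and \& terminates a numeric escape before a following digit:

```haskell
-- Haskell numeric escapes: decimal, hexadecimal and octal forms of U+FFFD,
-- and the empty escape \& separating an escape from a trailing digit.
main :: IO ()
main = do
  print ("\65533" == "\xFFFD")    -- decimal vs hex, same character
  print ("\o177775" == "\xFFFD")  -- octal vs hex, same character
  print "\120\&5"                 -- 'x' then '5', not the escape \1205
```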
Re: Latin digraph characters
Wed, 28 Feb 2001 13:35:17 -0800 (GMT-0800), Pierpaolo BERNARDI [EMAIL PROTECTED] writes: The initial character of the name is transliterated as CH in English, TCH in French, TSCH in German, C or CI in Italian, C WITH CARON in the official Russian transliteration.

And CZ in Polish.
Re: [OT] Unicode-compatible SQL?
Mon, 5 Feb 2001 08:20:43 -0800 (GMT-0800), Mark Davis [EMAIL PROTECTED] writes: The topic came up in a UTC meeting some time ago, a "UTF-8S". The motivation was performance (having a form that reproduces the binary order of UTF-16).

This is unfair: it slows down the conversion UTF-8 <-> UTF-32. In both cases the speed difference is almost none, and it's a big portability problem. I hope that such trash will not be accepted.
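The incompatibility UTF-8S was meant to paper over is easy to exhibit (a sketch): plain UTF-8 sorts by code point, but UTF-16 code-unit order puts supplementary characters (surrogates 0xD800..0xDBFF) before U+E000..FFFF.

```haskell
-- Code-point order and UTF-16 code-unit order disagree exactly for
-- supplementary characters vs U+E000..FFFF.
import Data.Bits (shiftR, (.&.))
import Data.Word (Word16)

utf16Units :: Int -> [Word16]
utf16Units c
  | c < 0x10000 = [fromIntegral c]
  | otherwise   = let c' = c - 0x10000
                  in [ 0xD800 + fromIntegral (c' `shiftR` 10)
                     , 0xDC00 + fromIntegral (c' .&. 0x3FF) ]

main :: IO ()
main = do
  print (compare (0xE000 :: Int) 0x10000)                   -- LT: code-point order
  print (compare (utf16Units 0xE000) (utf16Units 0x10000))  -- GT: unit order flips
```

UTF-8S would change the byte serialization just to reproduce the second ordering, which is the portability objection raised above.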
Re: Transcriptions of Unicode
Mon, 15 Jan 2001 13:09:47 -0800 (GMT-0800), G. Adam Stanislav [EMAIL PROTECTED] writes: I would not be surprised if speakers of certain Slavic languages even changed the SPELLING to Unikod (with an acute over the [o]), as they have done with other imported words (such as futbal for football).

That is what we often do in Polish newsgroups, even though it's very unofficial; I don't expect Unicode or Unikod in dictionaries soon. Without the acute over the [o], though, which would mean a different thing. Actually "kod" in Polish means "code".
Re: Transcriptions of Unicode
Fri, 12 Jan 2001 07:28:18 -0800 (GMT-0800), Mark Davis [EMAIL PROTECTED] writes: According to the references I have, the prefix "uni" is directly from Latin while the word "code" is through French. The Indo-European would have been *oi-no-kau-do ("give one strike"): *kau apparently being related to such English words as: hew, haggle, hoe, hag, hay, hack, caudad, caudal, caudate, caudex, coda, codex, codicil, coward, incus, and Kova (personal name: 'smith').

Oh, so my surname is related to Unicode? :-) "Kowal" means "smith" in Polish.
Re: Teletext mappings
Sun, 21 Jan 2001 09:29:56 -0800 (GMT-0800), Rob Hardy [EMAIL PROTECTED] writes:

[Polish set] contains the line 0x5B 0x01B5 # LATIN CAPITAL LETTER Z WITH STROKE, which should supposedly be 0x5B 0x017B # LATIN CAPITAL LETTER Z WITH DOT ABOVE. My teletext spec definitely has a Z with a stroke.

In Polish, capital Z with dot above is sometimes rendered with a stroke instead of the dot. It's just a glyph variant; the meaning is exactly the same. The letter should be consistently encoded as Z WITH DOT ABOVE even if it's rendered with a stroke.
Re: Character properties
Mon, 23 Oct 2000 09:48:52 +0100, [EMAIL PROTECTED] writes:

isDigit: Nd; isHexDigit: '0'..'9', 'A'..'F', 'a'..'f'; isDecDigit: '0'..'9'; isOctDigit: '0'..'7'

The definition "Nd" is what I would have proposed for isDecDigit.

The name isDecDigit is confusing indeed... isAsciiDigit? But it would be inconsistent with the rest.

In general, I would consider any script's digits for decimal and octal numbers. Not so for hex numbers, which are probably strictly bound to computer programming languages and, hence, to the Latin script.

Octal digits are bound to programming languages as much as hex digits. I'm not sure about the names for Nd and '0'..'9', but I think there is no need for a separate Nd-less-than-8 alongside '0'..'7'; '0'..'7' is enough - it is used in programming languages and formats with C-like string escapes.

What is the meaning of isDigit? The intuitive meaning would be "any kind of digit, as defined by the three specific functions below".

Any kind of digit which forms numbers in the positional decimal system, convertible to an integer by the standard function digitToInt. Actually digitToInt also understands 'A'..'F' and 'a'..'f' as hex digits.

So, I would say:

This does not provide any name for '0'..'9', nor for '0'..'9' + 'A'..'F' + 'a'..'f'. Since they are commonly used in existing formats and programming languages, I'm afraid it's not enough. OTOH there should not be too many variants that nobody will use.

isUpper: Lu, Lt; isLower: Ll

I would say that "Lt" letters are *both* uppercase and lowercase.

An interesting point of view! It looks strange, but I must think about it. Some derived tests would become incorrect ("all letters are lowercase" must no longer be checked by "all isLower" but by "not . any isUpper").

Or alternatively, if you can (and wish to) add a new API entry:

I think this phenomenon is too rare to deserve a separate entry. It would not be used in practice by most people. 
Re: Character properties
Wed, 11 Oct 2000 07:15:05 -0800 (GMT-0800), Mark Davis [EMAIL PROTECTED] writes: Here is my take on the way Unicode general categories should be mapped to POSIX ones.

Reiterated, here is my compilation of the mapping of properties proposed for Haskell:

isAssigned: all except Cs, Cn
isControl:  Cc, Cf
isPrint:    L*, M*, N*, P*, S*, Zs, Co
isSpace:    Zs (except U+00A0, U+202F), TAB, LF, VT, FF, CR
isGraph:    L*, M*, N*, P*, S*, Co
isPunct:    P*
isSymbol:   S*
isAlphaNum: L*, M*, N*
isDigit:    Nd
isHexDigit: '0'..'9', 'A'..'F', 'a'..'f'
isDecDigit: '0'..'9'
isOctDigit: '0'..'7'
isAlpha:    L*, M*
isUpper:    Lu, Lt
isLower:    Ll
isLatin1:   U+0000..U+00FF
isAscii:    U+0000..U+007F
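A sketch of how part of this table can be expressed with GHC's Data.Char.generalCategory (the primed names are mine, to avoid clashing with the Prelude; the mapping follows the list above):

```haskell
import Data.Char (generalCategory, GeneralCategory(..))

isDigit' :: Char -> Bool
isDigit' c = generalCategory c == DecimalNumber   -- Nd

isDecDigit', isOctDigit', isHexDigit' :: Char -> Bool
isDecDigit' c = c >= '0' && c <= '9'
isOctDigit' c = c >= '0' && c <= '7'
isHexDigit' c = isDecDigit' c
             || c >= 'A' && c <= 'F' || c >= 'a' && c <= 'f'

isUpper', isLower' :: Char -> Bool
isUpper' c = generalCategory c `elem` [UppercaseLetter, TitlecaseLetter]  -- Lu, Lt
isLower' c = generalCategory c == LowercaseLetter                         -- Ll

main :: IO ()
main = print ( isDigit' '\x0663'     -- ARABIC-INDIC DIGIT THREE: category Nd
             , isDecDigit' '\x0663'  -- but not an ASCII decimal digit
             , isUpper' '\x01C5' )   -- U+01C5 is a titlecase letter (Lt)
```

This makes the isDigit/isDecDigit distinction concrete: a non-ASCII Nd digit satisfies the former but not the latter.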
Re: Character properties
Wed, 4 Oct 2000 18:48:17 -0700 (PDT), Kenneth Whistler [EMAIL PROTECTED] writes:

It is quite clear that many important character properties cannot be deduced from the General Category values in UnicodeData.txt alone.

What a pity. Especially as it does work for some properties, and I would like to avoid having too many arbitrary data sources.

isControl c = c < ' ' || c >= '\x7F' && c <= '\x9F'

This is fine if isControl is aimed at the ISO control codes associated with the ISO 2022 framework. However, Unicode introduces a number of other control functions encoded with characters, and it depends on what you want the property API to be sensitive to. An obvious example is the set of bidirectional format control characters.

The precise meaning is to be decided too. I think that isControl should be more or less the complement of isPrint, modulo unassigned characters and surrogates. They should tell which characters may be output unescaped by programs like ls (GNU ls uses isprint), or be legal in the source of some languages or text file formats. While isPrint characters are definitely safe for output, isControl would be ones that should not occur in plain text and should always be filtered out in some way before display (unless handled explicitly, like \n \t \f); for characters in neither class it depends on the application which side it wants to err on... I'm not sure if this makes sense.

On the linux-utf8 mailing list I've got conflicting responses about U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR. Should they be plain control characters, or ones in the "third" class without clear status?

isPrint = category is other than [Zl,Zp,Cc,Cf,Cs,Co,Cn]

It probably isn't a good idea to include Co (Other, private use) in the exclusion set for isPrint. In most typical usage, if a user-defined character is assigned, it will be a printable character.

I was told the same on linux-utf8, and for Cf as well. 
Cf surprised me, and I was told that programs like ls should not avoid outputting Cf characters. Hmm...

isSpace = one of "\t\n\r\f\v" || category is one of [Zs,Zl,Zp]

You need to decide whether this is for space per se or for whitespace (as you have defined it).

I think whitespace - places where it is safe to break a line into words, or stuff allowed between identifiers in some file formats or programming languages (those which say "any Unicode whitespace character", e.g. Haskell source).

I was told that I should exclude U+00A0 NO-BREAK SPACE and U+202F NARROW NO-BREAK SPACE because of the application to line breaking. They are excluded from is[w]space in the newest glibc.

Depending on your system, you may have to add U+0085 as well.

I have never heard of U+0085 being used anywhere... What is it for?

isGraph = isPrint c && not (isSpace c)
isPunct = isGraph c && not (isAlphaNum c)

This is closer to a definition of something like isSymbol, rather than isPunct.

I was told the same on linux-utf8, and thus now I have separate isPunct and isSymbol (despite the standard C library, which puts both into is[w]punct).

isAlphaNum = category is one of [Lu,Ll,Lt,Nd,Nl,No,Lm,Lo]

This is definitely wrong. See isAlpha below, which has the same problem.

This seems to be the biggest problem (and the only real problem): the number of exceptions from any category-based predicate is large.

The issue is that many scripts have combining characters which are fully alphabetic. Their General Category is typically Mc. You cannot omit those from an isAlpha or isAlphaNum and get the right results.

IMHO isAlpha[Num] should tell which characters form words to be used as identifiers in various contexts. This is one of the predicates important for Haskell source, not only for its library. I quickly wrote Perl programs to compare PropList's Alphabetic + Ideographic with the subsets derived from categories. 
Based on the categories L* + Mc + Nl, the exception list is still large: twenty Lm characters, two Mc characters, and 229 out of 447 Mn characters - nearly half! - are excluded. European accents are excluded, but many marks from scripts that I don't know at all are included. It is not obvious why characters like U+073F SYRIAC RWAHA and U+0902 DEVANAGARI SIGN ANUSVARA are included, while U+0742 SYRIAC RUKKAKHA and U+093C DEVANAGARI SIGN NUKTA are excluded. I still don't know how to do this in an elegant way.

Others pointed out the problem with this: isASCIIDigit ≠ isDigit.

OK, this is fixed.

Perhaps there are important character classes that I omitted altogether.
Re: Character properties
Fri, 22 Sep 2000 22:11:44 -0800 (GMT-0800), Roozbeh Pournader [EMAIL PROTECTED] writes:

intToDigit should look at the locale to select the preferred digit form, I think.

Sorry, that cannot apply to Haskell, because it's a functional language: the function must work the same way all the time, unless it had a different interface. I am going to have isDigit and isAsciiDigit. A framework for generic locale-dependent behavior is not designed yet. The implementation of the conversion between the default locale-dependent byte encoding and Unicode will of course depend on the locale internally - in its current design this is allowed. There is no external interface for manual locale setting yet. Well, process-wide locale setting is against the Haskell style, but I see no other convenient interface...

What about the definitions of the other character predicates? They came partially from my head, so they may be incorrect or "incomplete".

* * *

What are the best ways to implement the conversion between the default locale-dependent byte encoding and Unicode on various platforms? Especially the ones to which the Glasgow Haskell Compiler is currently ported:

* i386-unknown-{linux,freebsd,netbsd,cygwin32,mingw32}
* sparc-sun-solaris2
* hppa1.1-hp-hpux{9,10}

I was told on the linux-utf8 mailing list that since the assumption that wchar_t is Unicode is non-portable, the recommended generic way is to use iconv, and to carry an iconv implementation (like libiconv) for platforms where it's not available. I don't like this very much, but it's probably indeed the best way on Unices - and something Windows-specific on Windows?
Re: Character properties
Thu, 21 Sep 2000 23:55:24 +0330 (IRT), Roozbeh Pournader [EMAIL PROTECTED] writes:

isDigit intentionally recognizes ASCII digits only. IMHO it's more often needed, and this is what the Haskell 98 Report says. (But I don't follow the report in some other cases.)

Would you please give me some URL?

http://www.haskell.org/definition/ - the Haskell 98 Library Report, module Char.

I disagree with the isDigit case, simply because my main language, Persian, uses alternate digits when written.

Do they form numbers in the same way as ASCII digits? Does the Unicode character database provide a way to tell which digits form numbers in this way (decimal, "big-endian")? Do you think that they (and digits from other languages) should be recognized as numbers in sources for programming languages that generally accept foreign letters in identifiers? (I don't know what the Haskell gurus would say to that.) What about isOctDigit and isHexDigit?

Haskell provides digitToInt and intToDigit, which currently deal with ASCII digits and the hexadecimal "digits" A..F a..f. If isDigit accepted foreign digits, it would make sense to extend digitToInt to convert them too - but obviously not intToDigit.

BTW, for using foreign alphabets in identifiers, Haskell divides identifiers into two classes based on the case of the first letter, similarly to Prolog, SML, OCaml and Clean. This is a problem for alphabets without cases, and I'm not sure what should be done about it. Haskell 98 says that letters which are not lowercase should be considered uppercase. I don't agree with that, and my library extension/change proposal allows characters which are isAlpha but neither isLower nor isUpper. When carried over to Haskell sources, it's not obvious how to classify identifiers starting with such letters.
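For reference, GHC's digitToInt and intToDigit behave as described above: digitToInt accepts the hexadecimal letters in both cases, while intToDigit produces lowercase for values 10..15.

```haskell
-- digitToInt accepts '0'..'9', 'a'..'f' and 'A'..'F';
-- intToDigit maps 0..15 back, using lowercase letters.
import Data.Char (digitToInt, intToDigit)

main :: IO ()
main = do
  print (map digitToInt "7fA")    -- hex letters are accepted
  print (map intToDigit [7, 15])  -- produces '7' and 'f'
```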
Character properties
I am trying to improve character-property handling in the language Haskell. What should the following functions return, i.e. what is the most standard/natural/preferred mapping between Unicode character categories and predicates like isalpha etc.? What else should be provided? Here are the definitions that I use currently:

isControl c  = c < ' ' || c >= '\x7F' && c <= '\x9F'
isPrint c    = category is other than [Zl,Zp,Cc,Cf,Cs,Co,Cn]
isSpace c    = one of "\t\n\r\f\v" || category is one of [Zs,Zl,Zp]
isGraph c    = isPrint c && not (isSpace c)
isPunct c    = isGraph c && not (isAlphaNum c)
isAlphaNum c = category is one of [Lu,Ll,Lt,Nd,Nl,No,Lm,Lo]
isHexDigit c = isDigit c || c >= 'A' && c <= 'F' || c >= 'a' && c <= 'f'
isDigit c    = c >= '0' && c <= '9'
isOctDigit c = c >= '0' && c <= '7'
isAlpha c    = category is one of [Lu,Ll,Lt,Lm,Lo]
isUpper c    = category is one of [Lu,Lt]
isLower c    = category is Ll
isLatin1 c   = c <= '\xFF'
isAscii c    = c < '\x80'

isDigit intentionally recognizes ASCII digits only. IMHO it's more often needed, and this is what the Haskell 98 Report says. (But I don't follow the report in some other cases.)

Titlecase could be handled too. Even then I think that isUpper should be True for titlecase letters (so it's usable for testing whether the first letter of a word is uppercase), and there should be a separate function for category Lu only (for testing whether all characters are uppercase).