Re: [Haskell-cafe] PROPOSAL: New efficient Unicode string library.
On 26 Sep 2007, at 7:05 pm, Johan Tibell wrote:

> If UTF-16 is what's used by everyone else (how about Java? Python?) I
> think that's a strong reason to use it. I don't know Unicode well
> enough to say otherwise.

Java uses 16-bit variables to hold characters. This is SOLELY for historical reasons, not because it is a good choice. The history is a bit funny: the ISO 10646 group were working away defining a 31-bit character set, and the industry screamed blue murder about how this was going to ruin the economy, bring back the Dark Ages, &c., and promptly set up the Unicode consortium to define a 16-bit character set that could do the same job.

Early versions of Unicode had only about 30,000 characters, after heroic (and not entirely appreciated) efforts at unifying Chinese characters as used in China with those used in Japan and those used in Korea. They also lumbered themselves (so that they would have a fighting chance of getting Unicode adopted) with a round-trip conversion policy, namely that it should be possible to take characters using ANY current encoding standard, convert them to Unicode, and then convert back to the original encoding with no loss of information. This led to failure of unification: there are two versions of Å (one for ordinary use, one for Ångströms), two versions of mu (one for Greek, one for micron), three complete copies of ASCII, &c.

However, 16 bits really is not enough. Here's a table from
http://www.unicode.org/versions/Unicode5.0.0/

    Graphic        98,884
    Format            140
    Control            65
    Private Use   137,468
    Surrogate       2,048
    Noncharacter       66
    Reserved      875,441

Excluding Private Use and Reserved, I make that 101,203 currently defined codes. That's nearly 1.5 times the number that would fit in 16 bits. Java has had to deal with this, don't think it hasn't.
For example, where Java had one set of functions referring to characters in strings by position, it now has two complete sets: one using *which 16-bit code* (which is fast) and one using *which actual Unicode character* (which is slow). The key point is that the second set is *always* slow, even when there are no characters outside the Basic Multilingual Plane.

One Smalltalk system I sometimes use has three complete string implementations (all characters fit in a byte, all characters fit in 16 bits, some characters require more) and dynamically switches from narrow strings to wide strings behind your back. In a language with read-only strings, that makes a lot of sense; it's just a pity Smalltalk isn't one.

If you want to minimise conversion effort when talking to the operating system, files, and other programs, UTF-8 is probably the way to go. (That's on Unix. For Windows it might be different.) If you want to minimise the effort of recognising character boundaries while processing strings, 32-bit characters are the way to go. If you want to be able to index into a string efficiently, they are the *only* way to go. Solaris bit the bullet many years ago; Sun C compilers jumped straight from an 8-bit wchar_t to 32-bit without ever stopping at 16.

16-bit characters *used* to be a reasonable compromise, but aren't any longer. Unicode keeps on growing: there were 1,349 new characters from Unicode 4.1 to Unicode 5.0 (IIRC), and there are lots more scripts in the pipeline. (What the heck _is_ Tangut, anyway?)

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
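The point about positional access can be made concrete with a small Haskell sketch (an illustration written for this thread, not code from Java or any library discussed here). In UTF-16, a character outside the BMP occupies two 16-bit code units (a surrogate pair), so finding the n-th *character* requires a linear scan, while the n-th *code unit* is a constant-time array access:

```haskell
import Data.Bits (shiftL, (.&.))
import Data.Char (chr)
import Data.Word (Word16)

-- Decode one code point from a list of UTF-16 code units.
-- Assumes well-formed input (no lone surrogates).
decode1 :: [Word16] -> (Char, [Word16])
decode1 [] = error "decode1: empty input"
decode1 (u : rest)
  | u >= 0xD800, u <= 0xDBFF        -- high surrogate: pair with the
  , (l : rest') <- rest             -- following low surrogate
  = ( chr ( 0x10000
          + (fromIntegral (u .&. 0x3FF) `shiftL` 10)
          + fromIntegral (l .&. 0x3FF) )
    , rest' )
  | otherwise = (chr (fromIntegral u), rest)

-- Indexing by *character* must walk from the start: O(n),
-- even when the string happens to contain only BMP characters.
charAtCodePoint :: Int -> [Word16] -> Char
charAtCodePoint 0 us = fst (decode1 us)
charAtCodePoint n us = charAtCodePoint (n - 1) (snd (decode1 us))
```

For example, U+1D11E (MUSICAL SYMBOL G CLEF) is stored as the pair 0xD834 0xDD1E, so in `[0xD834, 0xDD1E, 0x61]` the character at code-point index 1 is 'a', even though 0x61 sits at code-unit index 2.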
Re: [Haskell-cafe] PROPOSAL: New efficient Unicode string library.
On 9/27/07, ok [EMAIL PROTECTED] wrote:

> (What the heck _is_ Tangut, anyway?)

http://en.wikipedia.org/wiki/Tangut_language

Juanma
Re: [Haskell-cafe] PROPOSAL: New efficient Unicode string library.
> I'll look over the proposal more carefully when I get time, but the
> most important issue is to not let the storage type leak into the
> interface.

Agreed.

> From an implementation point of view, UTF-16 is the most efficient
> representation for processing Unicode. It's the native Unicode
> representation for Windows, Mac OS X, and the ICU open source i18n
> library. UTF-8 is not very efficient for anything except English. Its
> most valuable property is compatibility with software that thinks of
> character strings as byte arrays, and in fact that's why it was
> invented.

If UTF-16 is what's used by everyone else (how about Java? Python?) I think that's a strong reason to use it. I don't know Unicode well enough to say otherwise.
Re: [Haskell-cafe] PROPOSAL: New efficient Unicode string library.
On Wed, 2007-09-26 at 09:05 +0200, Johan Tibell wrote:

> > I'll look over the proposal more carefully when I get time, but the
> > most important issue is to not let the storage type leak into the
> > interface.
>
> Agreed.
>
> > From an implementation point of view, UTF-16 is the most efficient
> > representation for processing Unicode. It's the native Unicode
> > representation for Windows, Mac OS X, and the ICU open source i18n
> > library. UTF-8 is not very efficient for anything except English.
> > Its most valuable property is compatibility with software that
> > thinks of character strings as byte arrays, and in fact that's why
> > it was invented.
>
> If UTF-16 is what's used by everyone else (how about Java? Python?) I
> think that's a strong reason to use it. I don't know Unicode well
> enough to say otherwise.

I disagree. I realize I'm a dissenter in this regard, but my position is: excellent Unix support first, portability second, excellent support for Win32/MacOS a distant third. That seems to be the opposite of every other language's position. Unix absolutely needs UTF-8 for backward compatibility.

jcc
Re: [Haskell-cafe] PROPOSAL: New efficient Unicode string library.
In message [EMAIL PROTECTED] Jonathan Cast [EMAIL PROTECTED] writes:

> I disagree. I realize I'm a dissenter in this regard, but my position
> is: excellent Unix support first, portability second, excellent
> support for Win32/MacOS a distant third. That seems to be the
> opposite of every language's position. Unix absolutely needs UTF-8
> for backward compatibility.

I think you're talking about different things: internal vs external representations. Certainly we must support UTF-8 as an external representation. The choice of internal representation is independent of that; it could be [Char] or some memory-efficient packed format in a standard encoding like UTF-8, UTF-16 or UTF-32.

The choice depends mostly on ease of implementation and performance. Some formats are easier/faster to process, but there are also conversion costs, so in some use cases there is a performance benefit to the internal representation being the same as the external representation.

So the obvious choices of internal representation are UTF-8 and UTF-16. UTF-8 has the advantage of being the same as a common external representation, so conversion is cheap (we only need to validate rather than copy). UTF-8 is more compact for western languages but less compact for eastern languages, compared to UTF-16. On the other hand, UTF-8 is a more complex encoding in the common cases than UTF-16: in the common case UTF-16 is effectively fixed width. According to the ICU implementors this has speed advantages (probably due to branch prediction and smaller code size).

One solution is to do both and benchmark them.

Duncan
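The "complexity in the common case" contrast can be sketched in a few lines of Haskell (an illustration for this thread, under the assumption of well-formed input): determining the length of the sequence starting at a given code unit needs four branches for UTF-8 but only one test for UTF-16, and for UTF-16 the answer is 1 for every BMP character.

```haskell
import Data.Word (Word8, Word16)

-- UTF-8: the lead byte determines the sequence length; four cases,
-- and even plain Latin text mixes one- and two-byte sequences.
utf8SeqLen :: Word8 -> Int
utf8SeqLen b
  | b < 0x80  = 1                                 -- ASCII
  | b < 0xC0  = error "utf8SeqLen: continuation byte"
  | b < 0xE0  = 2                                 -- U+0080..U+07FF
  | b < 0xF0  = 3                                 -- U+0800..U+FFFF
  | otherwise = 4                                 -- beyond the BMP

-- UTF-16: a single test, and the answer is 1 unless the unit is a
-- high surrogate, i.e. for all BMP characters.
utf16SeqLen :: Word16 -> Int
utf16SeqLen u
  | u >= 0xD800 && u <= 0xDBFF = 2                -- surrogate pair
  | otherwise                  = 1                -- the common case
```

For instance, '€' (U+20AC) begins with lead byte 0xE2 in UTF-8 (a three-byte sequence) but is a single unit, 0x20AC, in UTF-16.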
Re: [Haskell-cafe] PROPOSAL: New efficient Unicode string library.
On Wed, 2007-09-26 at 18:46 +0100, Duncan Coutts wrote:

> I think you're talking about different things: internal vs external
> representations. [...] One solution is to do both and benchmark them.

OK, right.

jcc
Re: [Haskell-cafe] PROPOSAL: New efficient Unicode string library.
Hi, thanks for the proposal.

Why are only questions connected with conversion considered? An i18n library should provide a number of other services, such as normalization, comparison, sorting, etc. Furthermore, it's not so easy to keep such a library up to date. Why not simply make bindings to the IBM ICU library (http://www-306.ibm.com/software/globalization/icu/index.jsp), which is an up-to-date Unicode implementation?

Vitaliy.

2007/9/25, Johan Tibell [EMAIL PROTECTED]:

> Dear haskell-cafe, I would like to propose a new, ByteString like,
> Unicode string library which can be used where both efficiency
> (currently offered by ByteString) and i18n support (currently offered
> by vanilla Strings) are needed. [...]
> http://haskell.org/haskellwiki/UnicodeByteString
Re: [Haskell-cafe] PROPOSAL: New efficient Unicode string library.
I'll look over the proposal more carefully when I get time, but the most important issue is to not let the storage type leak into the interface.

From an implementation point of view, UTF-16 is the most efficient representation for processing Unicode. It's the native Unicode representation for Windows, Mac OS X, and the ICU open source i18n library. UTF-8 is not very efficient for anything except English. Its most valuable property is compatibility with software that thinks of character strings as byte arrays, and in fact that's why it was invented.

UTF-32 is conceptually cleaner, but characters outside the BMP (Basic Multilingual Plane) are rare in actual text, so UTF-16 turns out to be the best combination of space and time efficiency.

Deborah

On Sep 24, 2007, at 3:52 PM, Johan Tibell wrote:

> Dear haskell-cafe, I would like to propose a new, ByteString like,
> Unicode string library which can be used where both efficiency
> (currently offered by ByteString) and i18n support (currently offered
> by vanilla Strings) are needed. [...]
> http://haskell.org/haskellwiki/UnicodeByteString
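The space/time trade-off Deborah describes follows from how UTF-16 encodes a Char, which can be sketched in Haskell (a minimal illustration for this thread, not code from ICU or any proposed library; it assumes its argument is not a lone surrogate):

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- Encode one Char as UTF-16 code units. Every BMP character, i.e.
-- nearly all characters in actual text, is a single 16-bit unit;
-- only code points above U+FFFF need a surrogate pair.
toUTF16 :: Char -> [Word16]
toUTF16 c
  | n < 0x10000 = [fromIntegral n]                       -- common case
  | otherwise   = [ 0xD800 + fromIntegral (m `shiftR` 10) -- high surrogate
                  , 0xDC00 + fromIntegral (m .&. 0x3FF) ] -- low surrogate
  where
    n = ord c
    m = n - 0x10000
```

So typical text costs 16 bits per character rather than UTF-32's fixed 32, while still being fixed-width in practice; e.g. 'A' and '€' each take one unit, and only a character like U+1D11E takes two.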
[Haskell-cafe] PROPOSAL: New efficient Unicode string library.
Dear haskell-cafe,

I would like to propose a new, ByteString-like, Unicode string library which can be used where both efficiency (currently offered by ByteString) and i18n support (currently offered by vanilla Strings) are needed.

I wrote a skeleton draft today, but I'm a bit tired so I didn't get all the details down. Nevertheless I think it's fleshed out enough for some initial feedback. If I can get the important parts nailed down before Hackathon I could hack on it there. Apologies for not getting everything we discussed on #haskell into the first draft. It'll get in there eventually.

Bring out your Unicode kung-fu!

http://haskell.org/haskellwiki/UnicodeByteString

Cheers,

Johan Tibell
Re: [Haskell-cafe] PROPOSAL: New efficient Unicode string library.
Johan Tibell wrote:

> Dear haskell-cafe, I would like to propose a new, ByteString like,
> Unicode string library which can be used where both efficiency
> (currently offered by ByteString) and i18n support (currently offered
> by vanilla Strings) are needed. [...]
> http://haskell.org/haskellwiki/UnicodeByteString

Have you looked at my CompactString library [1]? It essentially does exactly this, with one extension: the type is parameterized over the encoding. From the discussion on #haskell it would seem that some people consider this unforgivable, while others consider it essential. In my opinion flexibility should be more important; you can always restrict things later.

For the common case where the encoding doesn't matter there is Data.CompactString.UTF8, which provides an un-parameterized type. I called this type 'CompactString' as well, which might be a bit unfortunate. I don't like the name UnicodeString, since it suggests that the normal String somehow doesn't support Unicode. This module could be made more prominent. Maybe Data.CompactString could be the specialized type, while Data.CompactString.Parameterized supports different encodings.

A word of warning: the library is still in the alpha stage of development. I don't fully trust it myself yet :)

[1] http://twan.home.fmf.nl/compact-string/

Twan
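The encoding-parameterization idea can be sketched with a phantom type (hypothetical names invented for this illustration; this is not the actual CompactString API): the type parameter records the encoding, so strings in different encodings cannot be mixed by accident, while the class dispatches the per-character serialisation.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- The phantom parameter 'e' records the encoding at the type level.
newtype CompactString e = CS { bytes :: [Word8] }
  deriving (Eq, Show)

-- Each encoding says how to serialise one Char; the value argument
-- is only a proxy used for dispatch.
class Encoding e where
  encodeChar :: e -> Char -> [Word8]

data UTF8 = UTF8

instance Encoding UTF8 where
  encodeChar _ c = utf8Bytes (ord c)
    where
      utf8Bytes n
        | n < 0x80    = [fi n]
        | n < 0x800   = [0xC0 .|. fi (n `shiftR` 6), cont n]
        | n < 0x10000 = [ 0xE0 .|. fi (n `shiftR` 12)
                        , cont (n `shiftR` 6), cont n ]
        | otherwise   = [ 0xF0 .|. fi (n `shiftR` 18)
                        , cont (n `shiftR` 12)
                        , cont (n `shiftR` 6), cont n ]
      fi m   = fromIntegral m :: Word8
      cont m = 0x80 .|. (fi m .&. 0x3F)

-- Packing is generic over the encoding; the result's type remembers
-- which encoding was used.
fromString :: Encoding e => e -> String -> CompactString e
fromString e = CS . concatMap (encodeChar e)
```

A UTF-16 instance would be added the same way, and a function of type `CompactString UTF8 -> ...` then refuses a `CompactString UTF16` at compile time, which is the safety argument for the parameterized design.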