Re: [Haskell-cafe] PROPOSAL: New efficient Unicode string library.

2007-09-27 Thread ok

On 26 Sep 2007, at 7:05 pm, Johan Tibell wrote:

If UTF-16 is what's used by everyone else (how about Java? Python?) I
think that's a strong reason to use it. I don't know Unicode well
enough to say otherwise.


Java uses 16-bit variables to hold characters.
This is SOLELY for historical reasons, not because it is a good choice.
The history is a bit funny:  the ISO 10646 group were working away
defining a 31-bit character set, and the industry screamed blue murder
about how this was going to ruin the economy, bring back the Dark Ages,
&c, and promptly set up the Unicode consortium to define a 16-bit
character set that could do the same job.  Early versions of Unicode
had only about 30 000 characters, after heroic (and not entirely
appreciated) efforts at unifying Chinese characters as used in China
with those used in Japan and those used in Korea.  They also lumbered
themselves (so that they would have a fighting chance of getting
Unicode adopted) with a round trip conversion policy, namely that it
should be possible to take characters using ANY current encoding
standard, convert them to Unicode, and then convert back to the original
encoding with no loss of information.  This led to failure of unification:
there are two versions of Å (one for ordinary use, one for Angstroms),
two versions of mu (one for Greek, one for micron), three complete copies
of ASCII, &c.  However, 16 bits really is not enough.

Here's a table from http://www.unicode.org/versions/Unicode5.0.0/

Graphic           98,884
Format               140
Control               65
Private Use      137,468
Surrogate          2,048
Noncharacter          66
Reserved         875,441

Excluding Private Use and Reserved, I make that 101,203 currently
defined codes.  That's just over 1.5 times the number that would fit in
16 bits.
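
The arithmetic can be checked with a few lines of Haskell. This is only a sketch: the counts are copied from the table above, and the names (definedCodes, totalDefined, and so on) are illustrative, not taken from any library.

```haskell
-- Code counts from the Unicode 5.0 table quoted above,
-- leaving out Private Use and Reserved.
definedCodes :: [(String, Int)]
definedCodes =
  [ ("Graphic",      98884)
  , ("Format",         140)
  , ("Control",         65)
  , ("Surrogate",     2048)
  , ("Noncharacter",    66)
  ]

-- Total number of currently defined codes.
totalDefined :: Int
totalDefined = sum (map snd definedCodes)

-- Number of distinct values a 16-bit code can take.
sixteenBitCapacity :: Int
sixteenBitCapacity = 2 ^ (16 :: Int)

-- How many times over the defined codes fill a 16-bit space.
ratio :: Double
ratio = fromIntegral totalDefined / fromIntegral sixteenBitCapacity
```

Evaluating totalDefined gives 101203, and ratio comes out a little above 1.5.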

Java has had to deal with this, don't think it hasn't.  For example,
where Java had one set of functions referring to characters in strings
by position, it now has two complete sets:  one to use *which 16-bit
code* (which is fast) and one to use *which actual Unicode character*
(which is slow).  The key point is that the second set is *always*
slow even when there are no characters outside the basic multilingual
plane.
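
To make the difference concrete, here is a minimal Haskell sketch of the two kinds of indexing. It models a UTF-16 string as a plain list of Word16 (a real library would use a packed array), and the names unitAt and codePointAt are made up for this illustration; they are not Java's methods or any existing Haskell API.

```haskell
import Data.Word (Word16)

-- A UTF-16 string modelled as a list of 16-bit code units.
type Utf16 = [Word16]

isHighSurrogate, isLowSurrogate :: Word16 -> Bool
isHighSurrogate u = u >= 0xD800 && u <= 0xDBFF
isLowSurrogate  u = u >= 0xDC00 && u <= 0xDFFF

-- "Which 16-bit code": with a packed array this would be a direct lookup;
-- no knowledge of the other code units is needed.
unitAt :: Utf16 -> Int -> Maybe Word16
unitAt us i
  | i < 0     = Nothing
  | otherwise = case drop i us of
      (u:_) -> Just u
      []    -> Nothing

-- "Which actual Unicode character": every unit before the target has to be
-- inspected, because any of them might be half of a surrogate pair.
-- That scan is needed even if the string turns out to contain no pairs.
-- An unpaired surrogate is returned as its own value.
codePointAt :: Utf16 -> Int -> Maybe Int
codePointAt us n
  | n < 0     = Nothing
  | otherwise = go us n
  where
    go (hi:lo:rest) k
      | isHighSurrogate hi && isLowSurrogate lo =
          if k == 0 then Just (combine hi lo) else go rest (k - 1)
    go (u:rest) k
      | k == 0    = Just (fromIntegral u)
      | otherwise = go rest (k - 1)
    go [] _ = Nothing

    -- Standard UTF-16 surrogate pair arithmetic.
    combine hi lo =
      0x10000 + (fromIntegral hi - 0xD800) * 0x400 + (fromIntegral lo - 0xDC00)
```

For the units [0x0041, 0xD834, 0xDD1E, 0x0042], index 2 by code unit is the low surrogate 0xDD1E, while index 2 by character is 0x42.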

One Smalltalk system I sometimes use has three complete string
implementations (all characters fit in a byte, all characters fit
in 16 bits, some characters require more) and dynamically switches
from narrow strings to wide strings behind your back.  In a language
with read-only strings, that makes a lot of sense; it's just a pity
Smalltalk isn't one.
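
In a language with immutable strings the same idea is easy to sketch, because the width only has to be chosen once, when the string is built. The following Haskell is an illustration under that assumption; FlexString and its constructors are hypothetical names, lists stand in for packed arrays, and this is not how the Smalltalk system is implemented.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8, Word16, Word32)

-- Three fixed-width representations behind one type.
data FlexString
  = Narrow8 [Word8]    -- every character fits in a byte
  | Wide16  [Word16]   -- every character fits in 16 bits
  | Wide32  [Word32]   -- some character needs more
  deriving (Eq, Show)

-- Pick the narrowest representation that can hold every character.
pack :: String -> FlexString
pack s
  | widest <= 0xFF   = Narrow8 (map (fromIntegral . ord) s)
  | widest <= 0xFFFF = Wide16  (map (fromIntegral . ord) s)
  | otherwise        = Wide32  (map (fromIntegral . ord) s)
  where
    -- Largest code point in the string (0 for the empty string).
    widest = maximum (0 : map ord s)

-- Callers never need to know which representation was chosen.
unpack :: FlexString -> String
unpack (Narrow8 ws) = map (chr . fromIntegral) ws
unpack (Wide16  ws) = map (chr . fromIntegral) ws
unpack (Wide32  ws) = map (chr . fromIntegral) ws
```

Since a value never changes after pack, there is no need to widen a string in place later, which is the awkward part in a language with mutable strings.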

If you want to minimize conversion effort when talking to the operating
system, files, and other programs, UTF-8 is probably the way to go.
(That's on Unix.  For Windows it might be different.)

If you want to minimize the effort of recognising character boundaries
while processing strings, 32-bit characters are the way to go.  If you
want to be able to index into a string efficiently, they are the *only*
way to go.  Solaris bit the bullet many years ago; Sun C compilers
jumped straight from 8-bit wchar_t to 32-bit without ever stopping at 16.
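
For comparison, this is roughly what recognising character boundaries costs in UTF-8. The sketch assumes well-formed input and uses a list of Word8 in place of a packed buffer; isContinuation, charCount and charOffset are illustrative names only.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- In UTF-8 every byte of the form 10xxxxxx continues a character;
-- any other byte starts a new one.
isContinuation :: Word8 -> Bool
isContinuation b = b .&. 0xC0 == 0x80

-- Number of characters in well-formed UTF-8: count the starting bytes.
charCount :: [Word8] -> Int
charCount = length . filter (not . isContinuation)

-- Byte offset at which the n-th character (counting from 0) starts.
-- This has to scan from the beginning of the string, whereas with
-- 32-bit characters the offset is simply 4 * n.
charOffset :: [Word8] -> Int -> Maybe Int
charOffset bytes n
  | n < 0     = Nothing
  | otherwise = case drop n starts of
      (i:_) -> Just i
      []    -> Nothing
  where
    starts = [i | (i, b) <- zip [0 ..] bytes, not (isContinuation b)]
```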


16-bit characters *used* to be a reasonable compromise, but aren't any
longer.  Unicode keeps on growing.  There were 1,349 new characters
from Unicode 4.1 to Unicode 5.0 (IIRC).  There are lots more scripts
in the pipeline.  (What the heck _is_ Tangut, anyway?)




___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] PROPOSAL: New efficient Unicode string library.

2007-09-27 Thread Juanma Barranquero
On 9/27/07, ok wrote:

 (What the heck _is_ Tangut, anyway?)

http://en.wikipedia.org/wiki/Tangut_language

 Juanma


Re: [Haskell-cafe] PROPOSAL: New efficient Unicode string library.

2007-09-26 Thread Johan Tibell
 I'll look over the proposal more carefully when I get time, but the
 most important issue is to not let the storage type leak into the
 interface.

Agreed,

  From an implementation point of view, UTF-16 is the most efficient
 representation for processing Unicode. It's the native Unicode
 representation for Windows, Mac OS X, and the ICU open source i18n
 library. UTF-8 is not very efficient for anything except English. Its
 most valuable property is compatibility with software that thinks of
 character strings as byte arrays, and in fact that's why it was
 invented.

If UTF-16 is what's used by everyone else (how about Java? Python?) I
think that's a strong reason to use it. I don't know Unicode well
enough to say otherwise.


Re: [Haskell-cafe] PROPOSAL: New efficient Unicode string library.

2007-09-26 Thread Jonathan Cast
On Wed, 2007-09-26 at 09:05 +0200, Johan Tibell wrote:
  I'll look over the proposal more carefully when I get time, but the
  most important issue is to not let the storage type leak into the
  interface.
 
 Agreed,
 
   From an implementation point of view, UTF-16 is the most efficient
  representation for processing Unicode. It's the native Unicode
  representation for Windows, Mac OS X, and the ICU open source i18n
  library. UTF-8 is not very efficient for anything except English. Its
  most valuable property is compatibility with software that thinks of
  character strings as byte arrays, and in fact that's why it was
  invented.
 
 If UTF-16 is what's used by everyone else (how about Java? Python?) I
 think that's a strong reason to use it. I don't know Unicode well
 enough to say otherwise.

I disagree.  I realize I'm a dissenter in this regard, but my position
is: excellent Unix support first, portability second, excellent support
for Win32/MacOS a distant third.  That seems to be the opposite of every
language's position.  Unix absolutely needs UTF-8 for backward
compatibility.

jcc




Re: [Haskell-cafe] PROPOSAL: New efficient Unicode string library.

2007-09-26 Thread Duncan Coutts
Jonathan Cast writes:
 On Wed, 2007-09-26 at 09:05 +0200, Johan Tibell wrote:

  If UTF-16 is what's used by everyone else (how about Java? Python?) I
  think that's a strong reason to use it. I don't know Unicode well
  enough to say otherwise.
 
 I disagree.  I realize I'm a dissenter in this regard, but my position
 is: excellent Unix support first, portability second, excellent support
 for Win32/MacOS a distant third.  That seems to be the opposite of every
 language's position.  Unix absolutely needs UTF-8 for backward
 compatibility.

I think you're talking about different things, internal vs external 
representations.

Certainly we must support UTF-8 as an external representation. The choice of
internal representation is independent of that. It could be [Char] or some
memory efficient packed format in a standard encoding like UTF-8,16,32. The
choice depends mostly on ease of implementation and performance. Some formats
are easier/faster to process but there are also conversion costs so in some use
cases there is a performance benefit to the internal representation being the
same as the external representation.

So, the obvious choices of internal representation are UTF-8 and UTF-16. UTF-8
has the advantage of being the same as a common external representation so
conversion is cheap (only need to validate rather than copy). UTF-8 is more
compact for western languages but less compact for eastern languages compared to
UTF-16. UTF-8 is a more complex encoding in the common cases than UTF-16. In the
common case UTF-16 is effectively fixed width. According to the ICU implementors
this has speed advantages (probably due to branch prediction and smaller code 
size).
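
The space trade-off is easy to quantify per character. The sketch below computes encoded sizes from the standard UTF-8 and UTF-16 code point ranges; the function names are illustrative and it measures size only, not processing speed.

```haskell
import Data.Char (ord)

-- Bytes needed to encode one character in UTF-8.
utf8Bytes :: Char -> Int
utf8Bytes c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- Bytes needed to encode one character in UTF-16:
-- one 16-bit unit inside the BMP, a surrogate pair outside it.
utf16Bytes :: Char -> Int
utf16Bytes c
  | ord c < 0x10000 = 2
  | otherwise       = 4

-- (UTF-8 size, UTF-16 size) of a whole string, in bytes.
sizes :: String -> (Int, Int)
sizes s = (sum (map utf8Bytes s), sum (map utf16Bytes s))
```

For example, sizes "hello" is (5,10), while the three CJK characters "\x65E5\x672C\x8A9E" give (9,6).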

One solution is to do both and benchmark them.

Duncan


Re: [Haskell-cafe] PROPOSAL: New efficient Unicode string library.

2007-09-26 Thread Jonathan Cast
On Wed, 2007-09-26 at 18:46 +0100, Duncan Coutts wrote:
 Jonathan Cast writes:
  On Wed, 2007-09-26 at 09:05 +0200, Johan Tibell wrote:
 
   If UTF-16 is what's used by everyone else (how about Java? Python?) I
   think that's a strong reason to use it. I don't know Unicode well
   enough to say otherwise.
  
  I disagree.  I realize I'm a dissenter in this regard, but my position
  is: excellent Unix support first, portability second, excellent support
  for Win32/MacOS a distant third.  That seems to be the opposite of every
  language's position.  Unix absolutely needs UTF-8 for backward
  compatibility.
 
 I think you're talking about different things, internal vs external 
 representations.
 
 Certainly we must support UTF-8 as an external representation. The choice of
 internal representation is independent of that. It could be [Char] or some
 memory efficient packed format in a standard encoding like UTF-8,16,32. The
 choice depends mostly on ease of implementation and performance. Some formats
 are easier/faster to process but there are also conversion costs so in some
 use cases there is a performance benefit to the internal representation being
 the same as the external representation.
 
 So, the obvious choices of internal representation are UTF-8 and UTF-16. UTF-8
 has the advantage of being the same as a common external representation so
 conversion is cheap (only need to validate rather than copy). UTF-8 is more
 compact for western languages but less compact for eastern languages compared
 to UTF-16. UTF-8 is a more complex encoding in the common cases than UTF-16.
 In the common case UTF-16 is effectively fixed width. According to the ICU
 implementors this has speed advantages (probably due to branch prediction and
 smaller code size).
 
 One solution is to do both and benchmark them.

OK, right.

jcc



Re: [Haskell-cafe] PROPOSAL: New efficient Unicode string library.

2007-09-25 Thread Vitaliy Akimov
Hi, thanks for the proposal.
Why are only questions connected with conversion considered?
An i18n library should provide a number of other services such as
normalization, comparison, sorting, etc.
Furthermore, it's not so easy to keep such a library up to date.
Why not simply make bindings to the IBM ICU library
(http://www-306.ibm.com/software/globalization/icu/index.jsp),
which is an up-to-date Unicode implementation?

Vitaliy.

2007/9/25, Johan Tibell:
 Dear haskell-cafe,

 I would like to propose a new, ByteString like, Unicode string library
 which can be used where both efficiency (currently offered by
 ByteString) and i18n support (currently offered by vanilla Strings)
 are needed. I wrote a skeleton draft today but I'm a bit tired so I
 didn't get all the details. Nevertheless I think it's fleshed out enough
 for some initial feedback. If I can get the important parts nailed
 down before Hackathon I could hack on it there.

 Apologies for not getting everything we discussed on #haskell down in
 the first draft. It'll get in there eventually.

 Bring out your Unicode kung-fu!

 http://haskell.org/haskellwiki/UnicodeByteString

 Cheers,

 Johan Tibell



Re: [Haskell-cafe] PROPOSAL: New efficient Unicode string library.

2007-09-25 Thread Deborah Goldsmith
I'll look over the proposal more carefully when I get time, but the  
most important issue is to not let the storage type leak into the  
interface.
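
One way to read that in Haskell terms is an abstract type: the module exports the type name and its operations but not the constructor, so the storage can change without breaking callers. The sketch below is an assumption about what such an interface could look like, not the proposal's actual API; UString, pack, unpack and ulength are hypothetical names, and the 32-bit storage is chosen only to keep the example short.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word32)

-- In a real library this would sit in its own module, with an export
-- list that leaves out the constructor, for example:
--   module UnicodeString (UString, pack, unpack, ulength) where

-- Hypothetical storage choice: one 32-bit unit per character.
-- Callers cannot see this, so it could later become UTF-8 or UTF-16.
newtype UString = UString [Word32]

-- Build a UString from an ordinary String.
pack :: String -> UString
pack = UString . map (fromIntegral . ord)

-- Convert back to an ordinary String.
unpack :: UString -> String
unpack (UString ws) = map (chr . fromIntegral) ws

-- Length in characters, not in storage units.
ulength :: UString -> Int
ulength (UString ws) = length ws
```

Note that every function in the interface speaks in terms of characters; nothing returns a byte count or a code unit.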


From an implementation point of view, UTF-16 is the most efficient  
representation for processing Unicode. It's the native Unicode  
representation for Windows, Mac OS X, and the ICU open source i18n  
library. UTF-8 is not very efficient for anything except English. Its  
most valuable property is compatibility with software that thinks of  
character strings as byte arrays, and in fact that's why it was  
invented.


UTF-32 is conceptually cleaner, but characters outside the BMP (Basic  
Multilingual Plane) are rare in actual text, so UTF-16 turns out to  
be the best combination of space and time efficiency.


Deborah

On Sep 24, 2007, at 3:52 PM, Johan Tibell wrote:


Dear haskell-cafe,

I would like to propose a new, ByteString like, Unicode string library
which can be used where both efficiency (currently offered by
ByteString) and i18n support (currently offered by vanilla Strings)
are needed. I wrote a skeleton draft today but I'm a bit tired so I
didn't get all the details. Nevertheless I think it's fleshed out enough
for some initial feedback. If I can get the important parts nailed
down before Hackathon I could hack on it there.

Apologies for not getting everything we discussed on #haskell down in
the first draft. It'll get in there eventually.

Bring out your Unicode kung-fu!

http://haskell.org/haskellwiki/UnicodeByteString

Cheers,

Johan Tibell




[Haskell-cafe] PROPOSAL: New efficient Unicode string library.

2007-09-24 Thread Johan Tibell
Dear haskell-cafe,

I would like to propose a new, ByteString like, Unicode string library
which can be used where both efficiency (currently offered by
ByteString) and i18n support (currently offered by vanilla Strings)
are needed. I wrote a skeleton draft today but I'm a bit tired so I
didn't get all the details. Nevertheless I think it's fleshed out enough
for some initial feedback. If I can get the important parts nailed
down before Hackathon I could hack on it there.

Apologies for not getting everything we discussed on #haskell down in
the first draft. It'll get in there eventually.

Bring out your Unicode kung-fu!

http://haskell.org/haskellwiki/UnicodeByteString

Cheers,

Johan Tibell


Re: [Haskell-cafe] PROPOSAL: New efficient Unicode string library.

2007-09-24 Thread Twan van Laarhoven

Johan Tibell wrote:


Dear haskell-cafe,

I would like to propose a new, ByteString like, Unicode string library
which can be used where both efficiency (currently offered by
ByteString) and i18n support (currently offered by vanilla Strings)
are needed. I wrote a skeleton draft today but I'm a bit tired so I
didn't get all the details. Nevertheless I think it's fleshed out enough
for some initial feedback. If I can get the important parts nailed
down before Hackathon I could hack on it there.

Apologies for not getting everything we discussed on #haskell down in
the first draft. It'll get in there eventually.

Bring out your Unicode kung-fu!

http://haskell.org/haskellwiki/UnicodeByteString


Have you looked at my CompactString library[1]? It essentially does 
exactly this, with one extension: the type is parameterized over the 
encoding. From the discussion on #haskell it would seem that some people 
consider this unforgivable, while others consider it essential.


In my opinion flexibility should be more important; you can always 
restrict things later. For the common case where encoding doesn't matter 
there is Data.CompactString.UTF8, which provides an un-parameterized 
type. I called this type 'CompactString' as well, which might be a bit 
unfortunate. I don't like the name UnicodeString, since it suggests that 
the normal string somehow doesn't support unicode. This module could be 
made more prominent. Maybe Data.CompactString could be the specialized 
type, while Data.CompactString.Parameterized supports different encodings.
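
For readers unfamiliar with the idea, a type parameterized over its encoding can be sketched with a phantom type parameter. The code below is an illustration only and is not the CompactString API: Encoded, Encoding, UTF8, UTF16BE and byteLength are hypothetical names, and a list of Word8 stands in for a packed buffer.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Empty types used only as type-level tags for the encoding.
data UTF8
data UTF16BE

-- The parameter enc does not appear in the stored data; it only
-- records, in the type, how the bytes are to be interpreted.
newtype Encoded enc = Encoded [Word8] deriving (Eq, Show)

-- Each encoding says how to turn a String into bytes.
class Encoding enc where
  encode :: String -> Encoded enc

instance Encoding UTF8 where
  encode = Encoded . concatMap utf8Char

instance Encoding UTF16BE where
  encode = Encoded . concatMap utf16beChar

-- Works for any encoding without knowing which one it is.
byteLength :: Encoded enc -> Int
byteLength (Encoded bs) = length bs

-- UTF-8 bytes for one character.
utf8Char :: Char -> [Word8]
utf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    hi s   = fromIntegral (n `shiftR` s)
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Big-endian UTF-16 bytes for one character.
utf16beChar :: Char -> [Word8]
utf16beChar c
  | n < 0x10000 = unit n
  | otherwise   = unit (0xD800 + (m `shiftR` 10)) ++ unit (0xDC00 + (m .&. 0x3FF))
  where
    n      = ord c
    m      = n - 0x10000
    unit u = [fromIntegral (u `shiftR` 8), fromIntegral (u .&. 0xFF)]
```

The encoding is picked by a type annotation, for example encode "abc" :: Encoded UTF8, and the type checker then refuses to mix an Encoded UTF8 with an Encoded UTF16BE. An un-parameterized type for the common case is just a synonym such as type CompactUTF8 = Encoded UTF8 (again a made-up name).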


A word of warning: The library is still in the alpha stage of 
development. I don't fully trust it myself yet :)


[1] http://twan.home.fmf.nl/compact-string/

Twan
