Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString

2007-02-09 Thread Duncan Coutts
On Thu, 2007-02-08 at 17:01 -0800, John Meacham wrote:
 On Tue, Feb 06, 2007 at 03:16:17PM +0900, shelarcy wrote:
  I'm afraid that this fantasy is broken again, just like the earlier
  belief, once trusted by people in Europe and America, that UCS-2
  without surrogate pairs covers every language.
 
 UCS-2 is a disaster in every way. someone had to say it. :)
 
 everything should be ascii, utf8 or ucs-4 or migrating to it.

Apparently UTF-16 (which is like UCS-2 but covers all code points) is a
good internal format. It is more compact than UTF-32 in almost all cases
and a less complex encoding than UTF-8. So it's faster than either
UTF-32 (because of data-density) or UTF-8 (because of the encoding
complexity). The downside compared to UTF-32 is that it is a more
complex encoding so the code is harder to write (but apparently it
doesn't affect performance much because characters outside the BMP are
very rare).

The ICU lib uses UTF-16 internally I believe, though I can't at the
moment find on their website the bit where they explain why they use
UTF-16 rather than -8 or -32.

http://icu.sourceforge.net/


Btw, when it comes to all these encoding names, I find it helpful to
maintain the fiction that there's no such thing (any more) as UCS-N,
there's only UTF-8, 16 and 32. This is also what the Unicode consortium
tries to encourage.

My view is that we should just provide all three:
Data.PackedString.UTF8
Data.PackedString.UTF16
Data.PackedString.UTF32

that all provide the same interface. This wouldn't actually be too much
code to write since most of it can re-use the streams code, so the only
difference is the single implementation per-encoding of:
stream   :: PackedString -> Stream Char
unstream :: Stream Char -> PackedString

and then get fusion for free of course.
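
A minimal sketch of the per-encoding pair described above, assuming the usual stream-fusion `Stream`/`Step` shape. `PackedString` here is a hypothetical newtype over a list of UTF-16 code units standing in for a real packed buffer, and `streamList`, `unstreamList`, `mapS` and `upperPS` are illustrative helpers rather than part of any existing library.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Data.Bits (shiftL, shiftR, (.&.))
import Data.Char (chr, ord, toUpper)
import Data.Word (Word16)

-- One step of a stream: finished, skip ahead, or yield an element.
data Step s a = Done | Skip s | Yield a s

-- A stream is a step function plus a hidden seed state.
data Stream a = forall s. Stream (s -> Step s a) s

-- Hypothetical UTF-16 packed string. A list of code units stands in
-- for the unboxed buffer a real implementation would use.
newtype PackedString = PackedString [Word16]
  deriving (Eq, Show)

-- Decode UTF-16 code units into a stream of Chars.
-- A surrogate that is not part of a valid pair is passed through as-is.
stream :: PackedString -> Stream Char
stream (PackedString ws0) = Stream next ws0
  where
    next (hi : lo : ws)
      | isHigh hi && isLow lo =
          Yield (chr (0x10000 + (off hi 0xD800 `shiftL` 10) + off lo 0xDC00)) ws
    next (w : ws) = Yield (chr (fromIntegral w)) ws
    next []       = Done
    isHigh w = w >= 0xD800 && w < 0xDC00
    isLow w  = w >= 0xDC00 && w < 0xE000
    off w base = fromIntegral w - base :: Int

-- Encode a stream of Chars as UTF-16 code units.
unstream :: Stream Char -> PackedString
unstream (Stream next s0) = PackedString (go s0)
  where
    go s = case next s of
      Done       -> []
      Skip s'    -> go s'
      Yield c s' -> units (ord c) ++ go s'
    units n
      | n < 0x10000 = [fromIntegral n]
      | otherwise   =
          let m = n - 0x10000
          in [ fromIntegral (0xD800 + (m `shiftR` 10))
             , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]

-- Encoding-independent helpers, written once against Stream.
streamList :: [a] -> Stream a
streamList xs0 = Stream next xs0
  where
    next []       = Done
    next (x : xs) = Yield x xs

unstreamList :: Stream a -> [a]
unstreamList (Stream next s0) = go s0
  where
    go s = case next s of
      Done       -> []
      Skip s'    -> go s'
      Yield a s' -> a : go s'

mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done       -> Done
      Skip s'    -> Skip s'
      Yield a s' -> Yield (f a) s'

-- A Char-level operation that never mentions the encoding.
upperPS :: PackedString -> PackedString
upperPS = unstream . mapS toUpper . stream
```

Only `stream` and `unstream` know about surrogate pairs; a UTF-8 or UTF-32 variant would replace those two functions and reuse the rest. Actual fusion would additionally need rewrite rules, which this sketch leaves out.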

I have proposed this task as an MSc project in my department. Hopefully
we'll get a student to pick this up.

Duncan

___
Haskell mailing list
Haskell@haskell.org
http://www.haskell.org/mailman/listinfo/haskell


Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString

2007-02-09 Thread Deborah Goldsmith

On Thu, 2007-02-08 at 17:01 -0800, John Meacham wrote:

UCS-2 is a disaster in every way. someone had to say it. :)


UCS-2 has been deprecated for many years.



everything should be ascii, utf8 or ucs-4 or migrating to it.


UCS-4 has also been deprecated for many years. The main forms of  
Unicode in use are UTF-16, UTF-8, and (less frequently) UTF-32.


On Feb 9, 2007, at 6:02 AM, Duncan Coutts wrote:
Apparently UTF-16 (which is like UCS-2 but covers all code points) is a
good internal format. It is more compact than UTF-32 in almost all cases
and a less complex encoding than UTF-8. So it's faster than either
UTF-32 (because of data-density) or UTF-8 (because of the encoding
complexity). The downside compared to UTF-32 is that it is a more
complex encoding so the code is harder to write (but apparently it
doesn't affect performance much because characters outside the BMP are
very rare).


UTF-16 is never less compact than UTF-32. The worst case of UTF-16 is  
that it is the same size as UTF-32. This only happens when a string  
consists entirely of characters from the supplementary planes.
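
The size comparison can be checked with a short sketch. The function names below (`utf8Bytes`, `utf16Bytes`, `utf32Bytes`, `sizes`) are made up for illustration; the byte counts follow the standard definitions of the three encoding forms.

```haskell
import Data.Char (ord)

-- Bytes needed for one code point in each encoding form.
utf8Bytes, utf16Bytes, utf32Bytes :: Char -> Int
utf8Bytes c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c
-- One 16-bit unit inside the BMP, a surrogate pair outside it.
utf16Bytes c = if ord c < 0x10000 then 2 else 4
utf32Bytes _ = 4

-- Total size of a string as (UTF-8, UTF-16, UTF-32) byte counts.
sizes :: String -> (Int, Int, Int)
sizes s = (sum (map utf8Bytes s), sum (map utf16Bytes s), sum (map utf32Bytes s))
```

For example, `sizes "hello"` is `(5, 10, 20)`, while a string made only of supplementary-plane characters such as `"\x20000\x1D300"` gives `(8, 8, 8)`, the worst case where UTF-16 and UTF-32 are the same size.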


The ICU lib uses UTF-16 internally I believe, though I can't at the
moment find on their website the bit where they explain why they use
UTF-16 rather than -8 or -32.


http://icu.sourceforge.net/userguide/unicodeBasics.html

UTF-16 is the native Unicode encoding for ICU, Microsoft Windows, and  
Mac OS X.



Btw, when it comes to all these encoding names, I find it helpful to
maintain the fiction that there's no such thing (any more) as UCS-N,
there's only UTF-8, 16 and 32. This is also what the Unicode consortium
tries to encourage.


It's not a fiction. :-) UCS-2 and UCS-4 are *deprecated* as a result of
the merger between Unicode and ISO 10646, which limited the code point
space to [0..0x10FFFF]. In addition to UTF-8, UTF-16, and UTF-32,
there's SCSU, a compressed form used in some applications. See:


http://www.unicode.org/reports/tr17/


My view is that we should just provide all three:
Data.PackedString.UTF8
Data.PackedString.UTF16
Data.PackedString.UTF32

that all provide the same interface. This wouldn't actually be too much
code to write since most of it can re-use the streams code, so the only
difference is the single implementation per-encoding of:
stream   :: PackedString -> Stream Char
unstream :: Stream Char -> PackedString

and then get fusion for free of course.


I agree that all three should be supported. UTF-16 is used in Windows  
and Mac OS X, and UTF-8 is widely used on Unix platforms (and at the  
BSD level of Mac OS X). UTF-32 matches the Char type in Haskell, and  
is used for wchar_t on some platforms. SCSU can be handled the same  
as a non-Unicode encoding (e.g., like GB2312 or Shift JIS).


Note that some Unicode algorithms require the ability to back up in a  
stream of code points, so that may be a consideration in the design  
(maybe they could be implemented in Haskell in a way that doesn't  
require that; I'm still learning, so I'm not sure yet). And regular  
expression processing requires essentially random access (same  
Haskell-fu considerations apply).
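
As a rough illustration of what "backing up" costs in a variable-width encoding, here is a sketch for UTF-8, assuming well-formed input held in a plain list of bytes (a real packed string would index a buffer instead). `isContinuation` and `prevStart` are hypothetical names.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- UTF-8 continuation bytes have the bit pattern 10xxxxxx.
isContinuation :: Word8 -> Bool
isContinuation b = b .&. 0xC0 == 0x80

-- Given an offset i > 0 that sits on a code point boundary, return the
-- offset where the previous code point starts, by skipping backwards
-- over continuation bytes.
prevStart :: [Word8] -> Int -> Int
prevStart bytes i = go (i - 1)
  where
    go j
      | j > 0 && isContinuation (bytes !! j) = go (j - 1)
      | otherwise                            = j
```

Stepping back never inspects more than four bytes for valid UTF-8, so backing up is cheap; random access by character index is the operation that stays linear.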


I have proposed this task as an MSc project in my department. Hopefully
we'll get a student to pick this up.


I hope so!

ICU has a BSD/MIT-style license, so feel free to steal whatever is  
appropriate from there. I would love to see Haskell support Unicode  
operations like locale-sensitive collation, text boundary analysis,  
and more on both [Char] and packed strings.


Deborah



Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString

2007-02-08 Thread John Meacham
On Tue, Feb 06, 2007 at 03:16:17PM +0900, shelarcy wrote:
 I'm afraid that this fantasy is broken again, just like the earlier
 belief, once trusted by people in Europe and America, that UCS-2
 without surrogate pairs covers every language.

UCS-2 is a disaster in every way. someone had to say it. :)

everything should be ascii, utf8 or ucs-4 or migrating to it.

John

-- 
John Meacham - ⑆repetae.net⑆john⑈


Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString

2007-02-08 Thread John Meacham
On Mon, Feb 05, 2007 at 01:14:26PM +0100, Twan van Laarhoven wrote:
 The reason for inventing my own encoding is that it is easier to use and 
 takes less space than UTF-8. The only advantage UTF-8 has is that it can 
 be read and written directly. I guess this is a trade off, faster 
 manipulation and smaller storage compared to simpler and faster io. I 
 have not benchmarked it either way, so it is just guesswork for now.

I would highly highly recommend using utf8. inventing new formats
without very clear and pervasive benefits is just not good practice and
I wouldn't want to see it in standard libraries.

the ability for conversion between utf8 and ascii bytestrings and
compactstrings being a nop should not be underestimated. 

not to mention that utf8 was designed so things like sorting a raw
bytestring with utf8 in it produces the exact same result as decoding
it, then sorting it. a _very_ large win for the 'Ord' instance for
CompactString.
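
The ordering property can be demonstrated with a small sketch. `encodeChar` and `encodeUtf8` below are hand-written for illustration (assuming input without surrogate code points), not taken from any library in this thread.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.List (sort, sortOn)
import Data.Word (Word8)

-- Encode one Char as UTF-8 (1 to 4 bytes).
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- Leading payload bits, which fit below the tag bits in every case.
    hi k   = fromIntegral (n `shiftR` k)
    -- Continuation byte: tag 10 plus six payload bits.
    cont k = 0x80 .|. (fromIntegral (n `shiftR` k) .&. 0x3F)

encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeChar

-- Sorting by encoded bytes versus sorting by code points.
sortByBytes, sortByChars :: [String] -> [String]
sortByBytes = sortOn encodeUtf8
sortByChars = sort
```

For a sample such as `["z", "\xE9", "\x20AC", "\x1D11E", "a", "ab", "\xFFFF"]` both functions return the same order, so an `Ord` instance could compare the raw bytes without decoding.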

and it is not just files, foreign functions in utf8 locales often take
or return strings as arguments, being able to just call those directly
with the bytestring contents is also a big win. 


John

-- 
John Meacham - ⑆repetae.net⑆john⑈


Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString

2007-02-05 Thread Chris Kuklewicz
Twan van Laarhoven wrote:
 Hello all,
 
 I would like to announce my attempt at making a Unicode version of
 Data.ByteString. The library is named Data.CompactString to avoid
 conflict with other (Fast)PackedString libraries.
 
 The library uses a variable length encoding (1 to 3 bytes) of Chars into
 Word8s, which are then stored in a ByteString.

Can I be among the first to ask that any Unicode variant of ByteString use a
recognized encoding?

You have invented a new encoding:

 -- Reading/writing chars
 --
 -- Uses a custom encoding which looks like UTF8, but is slightly more efficient.
 -- It requires at most 3 bytes, as opposed to 4 for UTF8.
 --
 -- Encoding looks like
 --                    0zzzzzzz -> 0zzzzzzz
 --           00yyyyyy yzzzzzzz -> 1xxxxxxx 1yyyyyyy
 --  000xxxxx xxyyyyyy yzzzzzzz -> 1xxxxxxx 0yyyyyyy 1zzzzzzz
 --
 -- The reasoning behind the tag bits is that this allows the char to be read
 -- both forwards and backwards.

 -- | Write a character and return the size needed
 pokeCharFun :: Char -> (Int, Ptr Word8 -> IO ())
 pokeCharFun c = case ord c of
   x | x < 0x80   -> (1, \p ->    poke        p   $ fromIntegral  x )
     | x < 0x4000 -> (2, \p -> do poke        p   $ fromIntegral (x `shiftR`  7) .|. 0x80
                                  pokeByteOff p 1 $ fromIntegral  x              .|. 0x80 )
     | otherwise  -> (3, \p -> do poke        p   $ fromIntegral (x `shiftR` 14) .|. 0x80
                                  pokeByteOff p 1 $ fromIntegral (x `shiftR`  7) .&. 0x7f
                                  pokeByteOff p 2 $ fromIntegral  x              .|. 0x80 )
 {-# INLINE pokeCharFun #-}

 -- | Write a character and return the size used
 pokeChar :: Ptr Word8 -> Char -> IO Int
 pokeChar p c = case pokeCharFun c of (l,f) -> f p >> return l
 {-# INLINE pokeChar #-}

 -- | Write a character, ending at the given pointer, and return the size used
 pokeCharRev :: Ptr Word8 -> Char -> IO Int
 pokeCharRev p c = case pokeCharFun c of (l,f) -> f (p `plusPtr` (1-l)) >> return l
 {-# INLINE pokeCharRev #-}

In reading all the poke/peek functions I did not see anything that your tag bits
accomplish that the tag bits in utf-8 do not, except that you want to write only
a single routine for the poke/peek forwards and backwards operations instead of
two routines.  It is definitely more compact in the worst case, and more Once
And Only Once, but at a very high cost of incompatibility.
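
To make the trade-off concrete, here is a pure sketch of the quoted 1-to-3-byte scheme on lists of bytes, with decoding from the front and from the back. It is reconstructed from the quoted `pokeCharFun` only; the names `encodeChar`, `decodeFirst`, `decodeLast`, `low7` and `result` are illustrative and not part of Data.CompactString.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Encode one Char: 7 payload bits per byte, top bit used as a tag.
encodeChar :: Char -> [Word8]
encodeChar c
  | x < 0x80   = [byte x]
  | x < 0x4000 = [byte (x `shiftR` 7) .|. 0x80, byte x .|. 0x80]
  | otherwise  = [ byte (x `shiftR` 14) .|. 0x80
                 , byte (x `shiftR` 7) .&. 0x7F
                 , byte x .|. 0x80 ]
  where
    x = ord c
    byte :: Int -> Word8
    byte = fromIntegral

-- Payload bits of one byte.
low7 :: Word8 -> Int
low7 w = fromIntegral (w .&. 0x7F)

-- Reject values outside the Char range instead of letting chr fail.
result :: Int -> [Word8] -> Maybe (Char, [Word8])
result n rest
  | n <= 0x10FFFF = Just (chr n, rest)
  | otherwise     = Nothing

-- Decode the first Char of a byte list, returning the remaining bytes.
decodeFirst :: [Word8] -> Maybe (Char, [Word8])
decodeFirst bytes = case bytes of
  a : rest         | a < 0x80  -> result (low7 a) rest
  a : b : rest     | b >= 0x80 -> result (low7 a `shiftL` 7 .|. low7 b) rest
  a : b : c : rest | c >= 0x80 ->
    result (low7 a `shiftL` 14 .|. low7 b `shiftL` 7 .|. low7 c) rest
  _ -> Nothing

-- Decode the last Char. Input and remainder are in reverse byte order.
decodeLast :: [Word8] -> Maybe (Char, [Word8])
decodeLast rev = case rev of
  z : rest         | z < 0x80  -> result (low7 z) rest
  z : y : rest     | y >= 0x80 -> result (low7 y `shiftL` 7 .|. low7 z) rest
  z : y : x : rest | x >= 0x80 ->
    result (low7 x `shiftL` 14 .|. low7 y `shiftL` 7 .|. low7 z) rest
  _ -> Nothing
```

The two decoders are mirror images of each other, which is the "Once And Only Once" benefit of the tag bits. UTF-8 can also be read backwards, but it needs a separate routine that scans over continuation bytes.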

One of the biggest wins with a Unicode ByteString will be the ability to
transfer the buffer directly to and from the disk and network.  Your code will
always need the data to be rewritten both incoming and outgoing.

The most ideal case would be the ability to load different encodings via import
statements while using the same API.


Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString

2007-02-05 Thread Twan van Laarhoven

Chris Kuklewicz wrote:


Can I be among the first to ask that any Unicode variant of ByteString use a
recognized encoding?

snip

In reading all the poke/peek functions I did not see anything that your tag bits
accomplish that the tag bits in utf-8 do not, except that you want to write only
a single routine for the poke/peek forwards and backwards operations instead of
two routines.  It is definitely more compact in the worst case, and more Once
And Only Once, but at a very high cost of incompatibility.


The reason for inventing my own encoding is that it is easier to use and 
takes less space than UTF-8. The only advantage UTF-8 has is that it can 
be read and written directly. I guess this is a trade off, faster 
manipulation and smaller storage compared to simpler and faster io. I 
have not benchmarked it either way, so it is just guesswork for now.


Fortunately the entire library can be easily converted to use a 
different encoding by just changing the peekChar/pokeChar functions.



One of the biggest wins with a Unicode ByteString will be the ability to
transfer the buffer directly to and from the disk and network.  Your code will
always need the data to be rewritten both incoming and outgoing.

The most ideal case would be the ability to load different encodings via import
statements while using the same API.


I was hoping that there would be only a single string type, with 
different encodings handled by functions:

  encode :: CompactString -> ByteString
  decode :: ByteString -> CompactString

This is important if it is not known beforehand how a file is encoded. 
For example, on Windows Unicode files are either UTF-8 or UTF-16, 
identified by a byte order mark.
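
A sketch of the detection step such a `decode` function could start with, working on a plain list of bytes. The `Encoding` type and `detectBOM` are hypothetical names introduced here, not part of Data.CompactString.

```haskell
import Data.Word (Word8)

data Encoding = UTF8 | UTF16LE | UTF16BE
  deriving (Eq, Show)

-- Look for an encoded byte order mark (U+FEFF) at the start of the input.
-- On success, return the detected encoding and the bytes after the mark.
-- Caveat: a UTF-32LE mark (FF FE 00 00) also begins with FF FE, so a
-- caller that must handle UTF-32 has to check for it first.
detectBOM :: [Word8] -> Maybe (Encoding, [Word8])
detectBOM (0xEF : 0xBB : 0xBF : rest) = Just (UTF8, rest)
detectBOM (0xFF : 0xFE : rest)        = Just (UTF16LE, rest)
detectBOM (0xFE : 0xFF : rest)        = Just (UTF16BE, rest)
detectBOM _                           = Nothing
```

When no mark is present the caller still has to pick a default, which is why a single string type with explicit `encode`/`decode` functions keeps that decision in one place.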


Twan


Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString

2007-02-05 Thread shelarcy
Hello Twan,

On Mon, 05 Feb 2007 08:46:35 +0900, Twan van Laarhoven [EMAIL PROTECTED] 
wrote:
 I would like to announce my attempt at making a Unicode version of
 Data.ByteString. The library is named Data.CompactString to avoid
 conflict with other (Fast)PackedString libraries.

How about adding an abstraction layer?

Spencer Janssen tried to provide an abstraction layer for a Unicode ByteString
in last year's Summer of Code project.
It has no Unicode support, but it supplies a good layer, the Stringable class.

http://code.google.com/soc/haskell/appinfo.html?csaid=B934AEBE95120AB2
http://darcs.haskell.org/SoC/fps-soc/
http://darcs.haskell.org/SoC/fps-soc-aug21/


 The library uses a variable length encoding (1 to 3 bytes) of Chars into
 Word8s, which are then stored in a ByteString. The structure is very
 much based on Data.ByteString, most of the implementation is copied from
 there. Hopefully this means that fusion rules could be copied as well.

UTF-8 also uses 4 to 6 byte encodings now.
CJK Unified Ideographs Extension B, Tai Xuan Jing Symbols, Musical Symbols,
etc. use 4 byte encodings.

Many Haskell UTF-8 libraries don't support encodings longer than 3 bytes.
But Takusen's implementation supports them correctly.

http://darcs.haskell.org/takusen/Foreign/C/UTF8.hs
http://www.haskell.org/pipermail/libraries/2007-February/006841.html

How about supporting 4 to 6 byte encodings?


Best Regards,

-- 
shelarcy shelarcycapella.freemail.ne.jp
http://page.freett.com/shelarcy/


Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString

2007-02-05 Thread Chris Kuklewicz
shelarcy wrote:
 Hello Twan,
 
 On Mon, 05 Feb 2007 08:46:35 +0900, Twan van Laarhoven [EMAIL PROTECTED] 
 wrote:
 I would like to announce my attempt at making a Unicode version of
 Data.ByteString. The library is named Data.CompactString to avoid
 conflict with other (Fast)PackedString libraries.
 
 How about adding an abstraction layer?
 
 Spencer Janssen tried to provide an abstraction layer for a Unicode ByteString
 in last year's Summer of Code project.
 It has no Unicode support, but it supplies a good layer, the Stringable class.
 
 http://code.google.com/soc/haskell/appinfo.html?csaid=B934AEBE95120AB2
 http://darcs.haskell.org/SoC/fps-soc/
 http://darcs.haskell.org/SoC/fps-soc-aug21/
 
 
 The library uses a variable length encoding (1 to 3 bytes) of Chars into
 Word8s, which are then stored in a ByteString. The structure is very
 much based on Data.ByteString, most of the implementation is copied from
 there. Hopefully this means that fusion rules could be copied as well.
 
 UTF-8 also uses 4 to 6 byte encodings now.
 CJK Unified Ideographs Extension B, Tai Xuan Jing Symbols, Musical Symbols,
 etc. use 4 byte encodings.

Looking at several sources, it seems you are incorrect.

Haskell Chars go up to Unicode code point 1114111 (decimal), or 0x10FFFF (hexadecimal).
These are encoded by UTF-8 in 1, 2, 3, or 4 bytes.

CJK Unified Ideographs Extension B starts at 131072 or 0x20000
Tai Xuan Jing Symbols start at 119552 or 0x1d300

These are all within the official utf-8 encoding scheme.
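
The ranges can be written down as a small sketch. `utf8Length` is an illustrative name; the boundaries are those of the current UTF-8 definition in RFC 3629.

```haskell
-- Number of bytes UTF-8 uses for a code point, or Nothing if the value
-- cannot be encoded (negative, a surrogate, or above 0x10FFFF).
utf8Length :: Int -> Maybe Int
utf8Length n
  | n < 0                     = Nothing
  | n < 0x80                  = Just 1
  | n < 0x800                 = Just 2
  | n >= 0xD800 && n < 0xE000 = Nothing
  | n < 0x10000               = Just 3
  | n <= 0x10FFFF             = Just 4
  | otherwise                 = Nothing
```

Both 131072 (0x20000) and 119552 (0x1D300) fall in the last valid range and take 4 bytes, as does `fromEnum (maxBound :: Char)`, which is 1114111.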

 
 Many Haskell UTF-8 libraries don't support encodings longer than 3 bytes.

UTF-8 uses 1, 2, 3, or 4 bytes.  Anything that does not support 4 bytes does not
support UTF-8.

 But Takusen's implementation supports them correctly.

Takusen does have unreachable dead code to serialize a Char as (ord c :: Int)
of up to 31 bits into as many as 6 bytes.  But it does decode up to 6 bytes into
31 bits and then tries to chr the result from Int to Char.  Decoding that many
bits is not consistent with the UTF-8 standard.

 
 http://darcs.haskell.org/takusen/Foreign/C/UTF8.hs
 http://www.haskell.org/pipermail/libraries/2007-February/006841.html
 
 How about supporting 4 to 6 byte encodings?

UTF-8 is a 4 byte encoding.  There is no valid UTF-8 5 or 6 byte encoding.

 
 
 Best Regards,
 



Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString

2007-02-05 Thread Alistair Bayley

On 05/02/07, Chris Kuklewicz [EMAIL PROTECTED] wrote:

shelarcy wrote:



 Many Haskell UTF-8 libraries don't support encodings longer than 3 bytes.

UTF-8 uses 1, 2, 3, or 4 bytes.  Anything that does not support 4 bytes does not
support UTF-8.


Well, some of them are probably a bit dated; they likely supported an
older version of the standard.



 But Takusen's implementation supports them correctly.

Takusen does have unreachable dead code to serialize a Char as (ord c :: Int)
of up to 31 bits into as many as 6 bytes.  But it does decode up to 6 bytes into
31 bits and then tries to chr the result from Int to Char.  Decoding that many
bits is not consistent with the UTF-8 standard.
UTF-8 is a 4 byte encoding.  There is no valid UTF-8 5 or 6 byte encoding.


Chris is right here, in that Takusen's decoder is incorrect w.r.t. the
standard, in allowing up to 6 bytes to encode a single char. If it was
correct, it would reject 5 and 6 byte sequences. I copied the extended
conversion from HXT's code, which was the most correct UTF8 library I
had seen so far (it just didn't marshal directly from a CString, which
was what I was after).
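
A strict decoder can reject the old 5 and 6 byte forms as soon as it sees the lead byte. The sketch below is not Takusen's or HXT's code; `seqLength` is a hypothetical helper that follows the lead-byte ranges of RFC 3629.

```haskell
import Data.Word (Word8)

-- Sequence length announced by a UTF-8 lead byte, or Nothing if the
-- byte can never start a valid sequence.
seqLength :: Word8 -> Maybe Int
seqLength b
  | b < 0x80  = Just 1
  | b < 0xC2  = Nothing  -- continuation byte, or overlong 2-byte lead (C0, C1)
  | b < 0xE0  = Just 2
  | b < 0xF0  = Just 3
  | b < 0xF5  = Just 4
  | otherwise = Nothing  -- F5..FF, including the old 5/6-byte leads F8..FD
```

A complete decoder would also have to validate the continuation bytes and reject overlong forms, surrogates, and values above 0x10FFFF; this only covers the first byte.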

Turns out darcs has the most accurate UTF8 en + de-coders:
 http://abridgegame.org/cgi-bin/darcs.cgi/darcs/UTF8.lhs?c=annotate

There's nothing stopping the Unicode consortium from expanding the
range of codepoints, is there? Or have they said that'll never happen?

Alistair


Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString

2007-02-05 Thread David Menendez
Alistair Bayley writes:

 On 05/02/07, Chris Kuklewicz [EMAIL PROTECTED] wrote:

  UTF-8 is a 4 byte encoding.  There is no valid UTF-8 5 or 6 byte
  encoding.
 
 Chris is right here, in that Takusen's decoder is incorrect w.r.t. the
 standard, in allowing up to 6 bytes to encode a single char.

snip 

 There's nothing stopping the Unicode consortium from expanding the
 range of codepoints, is there? Or have they said that'll never happen?

I believe they have. In particular, UTF-16 only supports code points up
to 10FFFF.

From http://en.wikipedia.org/wiki/Universal_Character_Set:

 the UCS stops at 10FFFF and ISO/IEC 10646 has stated that all future
 assignments of characters will also take place in that range
[...]
 ISO 10646 was limited to contain as many characters as could be
 encoded by UTF-16 and no more, that is, a little over a million
 characters instead of over 2,000 million
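
The arithmetic behind those figures, as a small sketch (the names are illustrative):

```haskell
-- Code points reachable by UTF-16: the BMP plus 1024 * 1024 surrogate pairs.
codePoints :: Integer
codePoints = 0x10000 + 0x400 * 0x400

-- Highest code point, 0x10FFFF.
maxCodePoint :: Integer
maxCodePoint = codePoints - 1

-- Encodable characters once the 2048 surrogate code points are excluded.
scalarValues :: Integer
scalarValues = codePoints - 0x800

-- Size of the original 31-bit UCS-4 space.
oldUcs4 :: Integer
oldUcs4 = 2 ^ (31 :: Int)
```

Here `codePoints` is 1114112, `scalarValues` is 1112064 ("a little over a million"), and `oldUcs4` is 2147483648 ("over 2,000 million").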
-- 
David Menendez [EMAIL PROTECTED] | In this house, we obey the laws
http://www.eyrie.org/~zednenem  |of thermodynamics!


Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString

2007-02-05 Thread shelarcy
On Tue, 06 Feb 2007 00:25:45 +0900, Chris Kuklewicz [EMAIL PROTECTED] wrote:
 UTF-8 also uses 4 to 6 byte encodings now.
 CJK Unified Ideographs Extension B, Tai Xuan Jing Symbols, Musical Symbols,
 etc. use 4 byte encodings.

 Looking at several sources, it seems you are incorrect.

 Haskell Chars go up to Unicode code point 1114111 (decimal), or 0x10FFFF (hexadecimal).
 These are encoded by UTF-8 in 1, 2, 3, or 4 bytes.

I see. I confused Unicode support with character set support.
I'm sorry about that.

UCS-4 can represent code points greater than 1114111.
So if we wanted to support the full UCS-4 range, we would need the
5 and 6 byte encodings that RFC 2279 described.

http://www.rfc-editor.org/rfc/rfc2279.txt

But unfortunately UTF-16 can only reach code point 1114111
(0x10FFFF), and the Unicode Consortium adheres to UTF-16.
So 5 and 6 byte encodings, and 4 byte encodings of code points
above 1114111, are invalid now.

http://www.rfc-editor.org/rfc/rfc3629.txt
(RFC3629 says This memo obsoletes and replaces RFC 2279.)

And the Haskell implementation uses only the valid range. I forgot
about that.


I'm afraid that this fantasy is broken again, just like the earlier
belief, once trusted by people in Europe and America, that UCS-2
without surrogate pairs covers every language.

-- 
shelarcy shelarcycapella.freemail.ne.jp
http://page.freett.com/shelarcy/


[Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString

2007-02-04 Thread Twan van Laarhoven

Hello all,

I would like to announce my attempt at making a Unicode version of 
Data.ByteString. The library is named Data.CompactString to avoid 
conflict with other (Fast)PackedString libraries.


The library uses a variable length encoding (1 to 3 bytes) of Chars into 
Word8s, which are then stored in a ByteString. The structure is very 
much based on Data.ByteString, most of the implementation is copied from 
there. Hopefully this means that fusion rules could be copied as well.


This is kind of a pre-release, many functions are still missing, and I 
have not benchmarked yet. I am releasing this in the hopes of getting 
some feedback on the general idea.


Homepage:  http://twan.home.fmf.nl/compact-string/
Haddock:   http://twan.home.fmf.nl/compact-string/doc/html/
Source:    darcs get http://twan.home.fmf.nl/repos/compact-string

Twan van Laarhoven