Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString
On Thu, 2007-02-08 at 17:01 -0800, John Meacham wrote:
> On Tue, Feb 06, 2007 at 03:16:17PM +0900, shelarcy wrote:
> > I'm afraid that fantasy is broken again, just like the belief, once trusted by people in Europe and America, that UCS-2 without surrogate pairs covers all languages.
>
> UCS-2 is a disaster in every way. someone had to say it. :)
>
> everything should be ascii, utf8 or ucs-4 or migrating to it.

Apparently UTF-16 (which is like UCS-2 but covers all code points) is a good internal format. It is more compact than UTF-32 in almost all cases and a less complex encoding than UTF-8. So it's faster than either UTF-32 (because of data density) or UTF-8 (because of the encoding complexity). The downside compared to UTF-32 is that it is a more complex encoding, so the code is harder to write (but apparently it doesn't affect performance much because characters outside the BMP are very rare).

The ICU lib uses UTF-16 internally I believe, though I can't at the moment find on their website the bit where they explain why they use UTF-16 rather than -8 or -32.

http://icu.sourceforge.net/

Btw, when it comes to all these encoding names, I find it helpful to maintain the fiction that there's no such thing (any more) as UCS-N, there's only UTF-8, 16 and 32. This is also what the Unicode consortium tries to encourage.

My view is that we should just provide all three:

    Data.PackedString.UTF8
    Data.PackedString.UTF16
    Data.PackedString.UTF32

that all provide the same interface. This wouldn't actually be too much code to write since most of it can re-use the streams code, so the only difference is the single implementation per encoding of:

    stream   :: PackedString -> Stream Char
    unstream :: Stream Char -> PackedString

and then get fusion for free of course.

I have proposed this task as an MSc project in my department. Hopefully we'll get a student to pick this up.

Duncan
___
Haskell mailing list
Haskell@haskell.org
http://www.haskell.org/mailman/listinfo/haskell
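To make the stream/unstream proposal concrete, here is a minimal sketch of how one encoding could plug into a shared interface. It is not the actual streams library: the `Step` and `Stream` types, the `PackedUTF32` stand-in (a plain list of code units instead of a real packed buffer), and `mapPacked` are all assumptions made for illustration.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Data.Char (chr, ord)
import Data.Word (Word32)

-- Hypothetical stand-in for a packed UTF-32 buffer; a real library
-- would wrap a ByteString or a raw pointer instead of a list.
newtype PackedUTF32 = PackedUTF32 [Word32]
  deriving (Eq, Show)

-- One step of a stream: finished, advance without output, or yield.
data Step s a = Done | Skip s | Yield a s

-- A stream is a step function together with a hidden state.
data Stream a = forall s. Stream (s -> Step s a) s

-- The only per-encoding code: decoding into a stream of Chars ...
stream :: PackedUTF32 -> Stream Char
stream (PackedUTF32 ws) = Stream next ws
  where
    next []         = Done
    next (w : rest) = Yield (chr (fromIntegral w)) rest

-- ... and encoding a stream of Chars back into the packed form.
unstream :: Stream Char -> PackedUTF32
unstream (Stream next s0) = PackedUTF32 (go s0)
  where
    go s = case next s of
      Done       -> []
      Skip s'    -> go s'
      Yield c s' -> fromIntegral (ord c) : go s'

-- An encoding-independent operation, written once against Stream.
mapS :: (Char -> Char) -> Stream Char -> Stream Char
mapS f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done       -> Done
      Skip s'    -> Skip s'
      Yield c s' -> Yield (f c) s'

-- The packed-string API is then a thin wrapper around the stream version.
mapPacked :: (Char -> Char) -> PackedUTF32 -> PackedUTF32
mapPacked f = unstream . mapS f . stream
```

Under this design, a UTF-8 or UTF-16 module would only swap in its own `stream` and `unstream`; operations such as `mapS` stay shared. Fusion itself would need rewrite rules, which this sketch leaves out.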
Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString
On Thu, 2007-02-08 at 17:01 -0800, John Meacham wrote:
> UCS-2 is a disaster in every way. someone had to say it. :)

UCS-2 has been deprecated for many years.

> everything should be ascii, utf8 or ucs-4 or migrating to it.

UCS-4 has also been deprecated for many years. The main forms of Unicode in use are UTF-16, UTF-8, and (less frequently) UTF-32.

On Feb 9, 2007, at 6:02 AM, Duncan Coutts wrote:
> Apparently UTF-16 (which is like UCS-2 but covers all code points) is a good internal format. It is more compact than UTF-32 in almost all cases and a less complex encoding than UTF-8. So it's faster than either UTF-32 (because of data density) or UTF-8 (because of the encoding complexity). The downside compared to UTF-32 is that it is a more complex encoding, so the code is harder to write (but apparently it doesn't affect performance much because characters outside the BMP are very rare).

UTF-16 is never less compact than UTF-32. The worst case of UTF-16 is that it is the same size as UTF-32. This only happens when a string consists entirely of characters from the supplementary planes.

> The ICU lib uses UTF-16 internally I believe, though I can't at the moment find on their website the bit where they explain why they use UTF-16 rather than -8 or -32.

http://icu.sourceforge.net/userguide/unicodeBasics.html

UTF-16 is the native Unicode encoding for ICU, Microsoft Windows, and Mac OS X.

> Btw, when it comes to all these encoding names, I find it helpful to maintain the fiction that there's no such thing (any more) as UCS-N, there's only UTF-8, 16 and 32. This is also what the Unicode consortium tries to encourage.

It's not a fiction. :-) UCS-2 and UCS-4 are *deprecated*, by the merger between Unicode and ISO 10646 that limited the code point space to [0..0x10FFFF].

In addition to UTF-8, UTF-16, and UTF-32, there's SCSU, a compressed form used in some applications. See:

http://www.unicode.org/reports/tr17/

> My view is that we should just provide all three:
>
>     Data.PackedString.UTF8
>     Data.PackedString.UTF16
>     Data.PackedString.UTF32
>
> that all provide the same interface. This wouldn't actually be too much code to write since most of it can re-use the streams code, so the only difference is the single implementation per encoding of:
>
>     stream   :: PackedString -> Stream Char
>     unstream :: Stream Char -> PackedString
>
> and then get fusion for free of course.

I agree that all three should be supported. UTF-16 is used in Windows and Mac OS X, and UTF-8 is widely used on Unix platforms (and at the BSD level of Mac OS X). UTF-32 matches the Char type in Haskell, and is used for wchar_t on some platforms. SCSU can be handled the same as a non-Unicode encoding (e.g., like GB2312 or Shift JIS).

Note that some Unicode algorithms require the ability to back up in a stream of code points, so that may be a consideration in the design (maybe they could be implemented in Haskell in a way that doesn't require that; I'm still learning, so I'm not sure yet). And regular expression processing requires essentially random access (the same Haskell-fu considerations apply).

> I have proposed this task as an MSc project in my department. Hopefully we'll get a student to pick this up.

I hope so! ICU has a BSD/MIT-style license, so feel free to steal whatever is appropriate from there. I would love to see Haskell support Unicode operations like locale-sensitive collation, text boundary analysis, and more on both [Char] and packed strings.

Deborah
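The size claim above (UTF-16 is never larger than UTF-32, and equal only for supplementary-plane text) can be checked with a small sketch. The function names `encodeUtf16`, `utf16Bytes` and `utf32Bytes` are made up for this illustration, and the encoder does not reject lone surrogate code points, which a real library would have to handle.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- Encode one Char as UTF-16 code units: one unit inside the BMP,
-- a surrogate pair (high, low) for the supplementary planes.
encodeUtf16 :: Char -> [Word16]
encodeUtf16 c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))
                  , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]
  where
    n = ord c
    m = n - 0x10000   -- 20-bit offset into the supplementary planes

-- Storage in bytes for each encoding form (no BOM counted).
utf16Bytes, utf32Bytes :: String -> Int
utf16Bytes = sum . map ((2 *) . length . encodeUtf16)
utf32Bytes = (4 *) . length
```

For example, `utf16Bytes "abc"` is 6 against 12 for `utf32Bytes "abc"`, while a string made only of characters such as U+1D300 or U+20000 costs 4 bytes per character in both.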
Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString
On Tue, Feb 06, 2007 at 03:16:17PM +0900, shelarcy wrote:
> I'm afraid that fantasy is broken again, just like the belief, once trusted by people in Europe and America, that UCS-2 without surrogate pairs covers all languages.

UCS-2 is a disaster in every way. someone had to say it. :)

everything should be ascii, utf8 or ucs-4 or migrating to it.

John

--
John Meacham - ⑆repetae.net⑆john⑈
Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString
On Mon, Feb 05, 2007 at 01:14:26PM +0100, Twan van Laarhoven wrote:
> The reason for inventing my own encoding is that it is easier to use and takes less space than UTF-8. The only advantage UTF-8 has is that it can be read and written directly. I guess this is a trade-off: faster manipulation and smaller storage compared to simpler and faster IO. I have not benchmarked it either way, so it is just guesswork for now.

I would highly highly recommend using utf8. inventing new formats without very clear and pervasive benefits is just not good practice and I wouldn't want to see it in standard libraries. the ability for conversion between utf8 and ascii bytestrings and compactstrings to be a nop should not be underestimated.

not to mention that utf8 was designed so that sorting raw bytestrings containing utf8 produces exactly the same result as decoding them and then sorting. a _very_ large win for the 'Ord' instance for CompactString.

and it is not just files: foreign functions in utf8 locales often take or return strings as arguments, and being able to just call those directly with the bytestring contents is also a big win.

John

--
John Meacham - ⑆repetae.net⑆john⑈
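The ordering property mentioned above can be illustrated with a small sketch: comparing UTF-8 byte sequences lexicographically gives the same answer as comparing the decoded code points. The encoder below is a plain textbook version written for this note (the names `encodeUtf8Char`, `encodeUtf8` and `sameOrder` are assumptions, not an existing library API), and it does not reject surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one Char as UTF-8 (1 to 4 bytes).
encodeUtf8Char :: Char -> [Word8]
encodeUtf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. top 6,  cont 0]
  | n < 0x10000 = [0xE0 .|. top 12, cont 6,  cont 0]
  | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- payload bits that go into the lead byte
    top k  = fromIntegral (n `shiftR` k)
    -- continuation byte: tag 10xxxxxx plus six payload bits
    cont k = 0x80 .|. fromIntegral ((n `shiftR` k) .&. 0x3F)

encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeUtf8Char

-- True when byte-wise comparison agrees with code point comparison.
sameOrder :: String -> String -> Bool
sameOrder a b = compare a b == compare (encodeUtf8 a) (encodeUtf8 b)
```

`sameOrder` should hold for any pair of strings, including pairs that straddle the BMP boundary such as `"\xFFFF"` and `"\x10000"`, which is what makes a byte-wise `Ord` instance possible for UTF-8 data.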
Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString
Twan van Laarhoven wrote:
> Hello all,
>
> I would like to announce my attempt at making a Unicode version of Data.ByteString. The library is named Data.CompactString to avoid conflict with other (Fast)PackedString libraries.
>
> The library uses a variable length encoding (1 to 3 bytes) of Chars into Word8s, which are then stored in a ByteString.

Can I be among the first to ask that any Unicode variant of ByteString use a recognized encoding?

You have invented a new encoding:

    -- Reading/writing chars
    --
    -- Uses a custom encoding which looks like UTF8, but is slightly more efficient.
    -- It requires at most 3 bytes, as opposed to 4 for UTF8.
    --
    -- Encoding looks like
    --                   0zzzzzzz -> 0zzzzzzz
    --          00yyyyyy yzzzzzzz -> 1xxxxxxx 1yyyyyyy
    -- 000xxxxx xxyyyyyy yzzzzzzz -> 1xxxxxxx 0yyyyyyy 1zzzzzzz
    --
    -- The reasoning behind the tag bits is that this allows the char to be read both forwards
    -- and backwards.

    -- | Write a character and return the size needed
    pokeCharFun :: Char -> (Int, Ptr Word8 -> IO ())
    pokeCharFun c = case ord c of
        x | x < 0x80   -> (1, \p ->    poke p $ fromIntegral x )
          | x < 0x4000 -> (2, \p -> do poke p $ fromIntegral (x `shiftR` 7) .|. 0x80
                                       pokeByteOff p 1 $ fromIntegral x .|. 0x80 )
          | otherwise  -> (3, \p -> do poke p $ fromIntegral (x `shiftR` 14) .|. 0x80
                                       pokeByteOff p 1 $ fromIntegral (x `shiftR` 7) .&. 0x7f
                                       pokeByteOff p 2 $ fromIntegral x .|. 0x80 )
    {-# INLINE pokeCharFun #-}

    -- | Write a character and return the size used
    pokeChar :: Ptr Word8 -> Char -> IO Int
    pokeChar p c = case pokeCharFun c of
        (l,f) -> f p >> return l
    {-# INLINE pokeChar #-}

    -- | Write a character and return the size used
    pokeCharRev :: Ptr Word8 -> Char -> IO Int
    pokeCharRev p c = case pokeCharFun c of
        (l,f) -> f (p `plusPtr` (1-l)) >> return l
    {-# INLINE pokeCharRev #-}

In reading all the poke/peek functions I did not see anything that your tag bits accomplish that the tag bits in UTF-8 do not, except that you want to write only a single routine for the poke/peek forwards and backwards operations instead of two routines.

It is definitely more compact in the worst case, and more Once And Only Once, but at a very high cost of incompatibility.

One of the biggest wins with a Unicode ByteString will be the ability to transfer the buffer directly to and from the disk and network. Your code will always need the data to be rewritten both incoming and outgoing.

The most ideal case would be the ability to load different encodings via import statements while using the same API.
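As a way to see what the tag bits buy, here is a pure model of the custom encoding, derived from the bit operations in the quoted `pokeCharFun`. It works on lists of bytes rather than pointers, and the names `encodeChar`, `decodeFirst`, `decodeLastRev` and `decodeAll` are invented for this sketch; they are not part of Data.CompactString. The decoders accept non-canonical (overlong) sequences, which a real implementation might want to reject.

```haskell
import Data.Bits (shiftL, shiftR, testBit, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Encode one Char in 1 to 3 bytes, 7 payload bits per byte.
-- The last byte of a multi-byte character is tagged (high bit set);
-- the middle byte of a 3-byte character is not.
encodeChar :: Char -> [Word8]
encodeChar c
  | x < 0x80   = [lo7 x]
  | x < 0x4000 = [tag (lo7 (x `shiftR` 7)), tag (lo7 x)]
  | otherwise  = [tag (lo7 (x `shiftR` 14)), lo7 (x `shiftR` 7), tag (lo7 x)]
  where
    x     = ord c
    lo7 n = fromIntegral (n .&. 0x7F) :: Word8
    tag b = b .|. 0x80

tagged :: Word8 -> Bool
tagged b = testBit b 7

-- The 7 payload bits of a byte.
val :: Word8 -> Int
val b = fromIntegral (b .&. 0x7F)

-- Reject values outside the Char range instead of crashing in chr.
done :: Int -> [Word8] -> Maybe (Char, [Word8])
done n rest
  | n <= 0x10FFFF = Just (chr n, rest)
  | otherwise     = Nothing

-- Read the first character of a buffer (forwards).
decodeFirst :: [Word8] -> Maybe (Char, [Word8])
decodeFirst bytes = case bytes of
  a : rest         | not (tagged a) -> done (val a) rest
  a : b : rest     | tagged b       -> done (val a `shiftL` 7 .|. val b) rest
  a : b : c : rest | tagged c       ->
    done (val a `shiftL` 14 .|. val b `shiftL` 7 .|. val c) rest
  _                                 -> Nothing

-- Read the last character, given the buffer in reverse byte order.
decodeLastRev :: [Word8] -> Maybe (Char, [Word8])
decodeLastRev bytes = case bytes of
  c : rest         | not (tagged c) -> done (val c) rest
  b : a : rest     | tagged a       -> done (val a `shiftL` 7 .|. val b) rest
  c : b : a : rest | tagged a       ->
    done (val a `shiftL` 14 .|. val b `shiftL` 7 .|. val c) rest
  _                                 -> Nothing

-- Decode a whole buffer, failing on malformed input.
decodeAll :: [Word8] -> Maybe String
decodeAll [] = Just []
decodeAll bs = do
  (c, rest) <- decodeFirst bs
  (c :) <$> decodeAll rest
```

The two decoders are mirror images of each other, which is the symmetry the tag bits were designed for. UTF-8 can also be read backwards (continuation bytes are recognizable), but the forward and backward routines differ.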
Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString
Chris Kuklewicz wrote:
> Can I be among the first to ask that any Unicode variant of ByteString use a recognized encoding?
>
> [snip]
>
> In reading all the poke/peek functions I did not see anything that your tag bits accomplish that the tag bits in UTF-8 do not, except that you want to write only a single routine for the poke/peek forwards and backwards operations instead of two routines.
>
> It is definitely more compact in the worst case, and more Once And Only Once, but at a very high cost of incompatibility.

The reason for inventing my own encoding is that it is easier to use and takes less space than UTF-8. The only advantage UTF-8 has is that it can be read and written directly. I guess this is a trade-off: faster manipulation and smaller storage compared to simpler and faster IO. I have not benchmarked it either way, so it is just guesswork for now.

Fortunately the entire library can be easily converted to use a different encoding by just changing the peekChar/pokeChar functions.

> One of the biggest wins with a Unicode ByteString will be the ability to transfer the buffer directly to and from the disk and network. Your code will always need the data to be rewritten both incoming and outgoing.
>
> The most ideal case would be the ability to load different encodings via import statements while using the same API.

I was hoping that there would be only a single string type, with different encodings handled by functions:

    encode :: CompactString -> ByteString
    decode :: ByteString -> CompactString

This is important if it is not known beforehand how a file is encoded. For example, on Windows, Unicode files are either UTF-8 or UTF-16, identified by a byte order mark.

Twan
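The byte order mark check mentioned above could look roughly like the following sketch. The `Encoding` type and `detectBOM` are hypothetical names, the input is a plain list of bytes rather than a ByteString, and UTF-32 marks are deliberately ignored (a UTF-32 little-endian mark starts with the same two bytes as the UTF-16 one).

```haskell
import Data.Word (Word8)

-- The encodings this sketch can tell apart.
data Encoding = UTF8 | UTF16LE | UTF16BE
  deriving (Eq, Show)

-- Inspect the leading bytes; on a match, return the encoding
-- together with the remaining bytes after the mark.
detectBOM :: [Word8] -> Maybe (Encoding, [Word8])
detectBOM (0xEF : 0xBB : 0xBF : rest) = Just (UTF8, rest)
detectBOM (0xFF : 0xFE : rest)        = Just (UTF16LE, rest)
detectBOM (0xFE : 0xFF : rest)        = Just (UTF16BE, rest)
detectBOM _                           = Nothing
```

A `decode` function in the single-string-type design would dispatch on the result and fall back to some default, or to a caller-supplied encoding, when no mark is present.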
Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString
Hello Twan,

On Mon, 05 Feb 2007 08:46:35 +0900, Twan van Laarhoven wrote:
> I would like to announce my attempt at making a Unicode version of Data.ByteString. The library is named Data.CompactString to avoid conflict with other (Fast)PackedString libraries.

How about adding an abstraction layer? Spencer Janssen tried to provide an abstraction layer for a Unicode ByteString in last year's Summer of Code project. It has no Unicode support, but it supplies a good layer, the Stringable class.

http://code.google.com/soc/haskell/appinfo.html?csaid=B934AEBE95120AB2
http://darcs.haskell.org/SoC/fps-soc/
http://darcs.haskell.org/SoC/fps-soc-aug21/

> The library uses a variable length encoding (1 to 3 bytes) of Chars into Word8s, which are then stored in a ByteString. The structure is very much based on Data.ByteString, most of the implementation is copied from there. Hopefully this means that fusion rules could be copied as well.

UTF-8 also uses 4 to 6 byte encodings now. CJK Unified Ideographs Extension B, Tai Xuan Jing Symbols, Musical Symbols, etc. use a 4 byte encoding.

Many Haskell UTF-8 libraries don't support encodings longer than 3 bytes, but Takusen's implementation supports them correctly.

http://darcs.haskell.org/takusen/Foreign/C/UTF8.hs
http://www.haskell.org/pipermail/libraries/2007-February/006841.html

How about supporting 4 to 6 byte encodings?

Best Regards,

--
shelarcy shelarcycapella.freemail.ne.jp
http://page.freett.com/shelarcy/
Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString
shelarcy wrote:
> Hello Twan,
>
> On Mon, 05 Feb 2007 08:46:35 +0900, Twan van Laarhoven wrote:
> > I would like to announce my attempt at making a Unicode version of Data.ByteString. The library is named Data.CompactString to avoid conflict with other (Fast)PackedString libraries.
>
> How about adding an abstraction layer? Spencer Janssen tried to provide an abstraction layer for a Unicode ByteString in last year's Summer of Code project. It has no Unicode support, but it supplies a good layer, the Stringable class.
>
> http://code.google.com/soc/haskell/appinfo.html?csaid=B934AEBE95120AB2
> http://darcs.haskell.org/SoC/fps-soc/
> http://darcs.haskell.org/SoC/fps-soc-aug21/
>
> > The library uses a variable length encoding (1 to 3 bytes) of Chars into Word8s, which are then stored in a ByteString. The structure is very much based on Data.ByteString, most of the implementation is copied from there. Hopefully this means that fusion rules could be copied as well.
>
> UTF-8 also uses 4 to 6 byte encodings now. CJK Unified Ideographs Extension B, Tai Xuan Jing Symbols, Musical Symbols, etc. use a 4 byte encoding.

Looking at several sources, it seems you are incorrect.

Haskell Chars go up to Unicode 1114111 (decimal) or 0x10FFFF (hexadecimal). These are encoded by UTF-8 in 1, 2, 3, or 4 bytes.

CJK Unified Ideographs Extension B starts at 131072 or 0x20000.
Tai Xuan Jing Symbols start at 119552 or 0x1D300.

These are all within the official UTF-8 encoding scheme.

> Many Haskell UTF-8 libraries don't support encodings longer than 3 bytes,

UTF-8 uses 1, 2, 3, or 4 bytes. Anything that does not support 4 bytes does not support UTF-8.

> but Takusen's implementation supports them correctly.

Takusen does have unreachable dead code to serialize a Char as (ord c :: Int) of up to 31 bits into as many as 6 bytes. But it also decodes up to 6 bytes into 31 bits and tries to chr this from Int to Char. Decoding that many bits is not consistent with the UTF-8 standard.

> http://darcs.haskell.org/takusen/Foreign/C/UTF8.hs
> http://www.haskell.org/pipermail/libraries/2007-February/006841.html
>
> How about supporting 4 to 6 byte encodings?

UTF-8 is an encoding of at most 4 bytes. There is no valid 5 or 6 byte UTF-8 encoding.

> Best Regards,
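The byte counts above follow directly from the code point ranges. As a small sketch (the name `utf8Length` is made up here, not taken from any of the libraries discussed), the encoded length of a Haskell Char can be computed like this:

```haskell
import Data.Char (ord)

-- Number of bytes UTF-8 needs for one code point.
-- Because maxBound :: Char is U+10FFFF, four bytes always suffice.
utf8Length :: Char -> Int
utf8Length c
  | n < 0x80    = 1   -- U+0000  .. U+007F
  | n < 0x800   = 2   -- U+0080  .. U+07FF
  | n < 0x10000 = 3   -- U+0800  .. U+FFFF
  | otherwise   = 4   -- U+10000 .. U+10FFFF
  where
    n = ord c
```

Both U+1D300 (Tai Xuan Jing Symbols) and U+20000 (CJK Extension B) fall in the last range, so each takes 4 bytes.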
Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString
On 05/02/07, Chris Kuklewicz wrote:
> shelarcy wrote:
> > Many Haskell UTF-8 libraries don't support encodings longer than 3 bytes,
>
> UTF-8 uses 1, 2, 3, or 4 bytes. Anything that does not support 4 bytes does not support UTF-8.

Well, some of them are probably a bit dated; they likely supported an older version of the standard.

> > but Takusen's implementation supports them correctly.
>
> Takusen does have unreachable dead code to serialize a Char as (ord c :: Int) of up to 31 bits into as many as 6 bytes. But it also decodes up to 6 bytes into 31 bits and tries to chr this from Int to Char. Decoding that many bits is not consistent with the UTF-8 standard.
>
> UTF-8 is an encoding of at most 4 bytes. There is no valid 5 or 6 byte UTF-8 encoding.

Chris is right here, in that Takusen's decoder is incorrect w.r.t. the standard, in allowing up to 6 bytes to encode a single char. If it were correct, it would reject 5 and 6 byte sequences.

I copied the extended conversion from HXT's code, which was the most correct UTF8 library I had seen so far (it just didn't marshal directly from a CString, which was what I was after). Turns out darcs has the most accurate UTF8 en + de-coders:

http://abridgegame.org/cgi-bin/darcs.cgi/darcs/UTF8.lhs?c=annotate

There's nothing stopping the Unicode consortium from expanding the range of codepoints, is there? Or have they said that'll never happen?

Alistair
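A decoder that follows the current standard can reject 5 and 6 byte sequences as soon as it sees the lead byte. The sketch below shows one way to classify lead bytes under RFC 3629; `utf8SeqLen` is a hypothetical helper written for this note, not code from Takusen, HXT or darcs, and a full decoder would still have to validate the continuation bytes and the overlong and surrogate ranges.

```haskell
import Data.Word (Word8)

-- Sequence length implied by a UTF-8 lead byte, or Nothing when the
-- byte can never start a valid sequence.
utf8SeqLen :: Word8 -> Maybe Int
utf8SeqLen b
  | b < 0x80               = Just 1
  | b >= 0xC2 && b <= 0xDF = Just 2
  | b >= 0xE0 && b <= 0xEF = Just 3
  | b >= 0xF0 && b <= 0xF4 = Just 4
  -- 0x80..0xBF are continuation bytes, 0xC0 and 0xC1 could only start
  -- overlong forms, and 0xF5..0xFF include the old 5 and 6 byte lead
  -- bytes (0xF8..0xFD) from RFC 2279.
  | otherwise              = Nothing
```

With this classification, the old lead bytes `0xF8` and `0xFC` map to `Nothing`, so the extended sequences never reach the stage where bits are assembled into a Char.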
Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString
Alistair Bayley writes:
> On 05/02/07, Chris Kuklewicz wrote:
> > UTF-8 is an encoding of at most 4 bytes. There is no valid 5 or 6 byte UTF-8 encoding.
>
> Chris is right here, in that Takusen's decoder is incorrect w.r.t. the standard, in allowing up to 6 bytes to encode a single char.
>
> [snip]
>
> There's nothing stopping the Unicode consortium from expanding the range of codepoints, is there? Or have they said that'll never happen?

I believe they have. In particular, UTF-16 only supports code points up to 10FFFF.

From http://en.wikipedia.org/wiki/Universal_Character_Set:

    the UCS stops at 10FFFF and ISO/IEC 10646 has stated that all future assignments of characters will also take place in that range [...] ISO 10646 was limited to contain as many characters as could be encoded by UTF-16 and no more, that is, a little over a million characters instead of over 2,000 million

--
David Menendez                  | In this house, we obey the laws
http://www.eyrie.org/~zednenem  | of thermodynamics!
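The 10FFFF limit can be derived from the UTF-16 surrogate ranges. The arithmetic below is a worked check rather than library code; the constant names are invented for illustration, and the count includes the surrogate code points themselves.

```haskell
-- Each surrogate pair combines one of 1024 high surrogates
-- (D800..DBFF) with one of 1024 low surrogates (DC00..DFFF).
surrogatePairs :: Integer
surrogatePairs = (0xDBFF - 0xD800 + 1) * (0xDFFF - 0xDC00 + 1)

-- The pairs address the code points directly after the BMP (0..FFFF).
maxUtf16CodePoint :: Integer
maxUtf16CodePoint = 0xFFFF + surrogatePairs

-- Size of the code space reachable by UTF-16, against the 31-bit
-- space of the original UCS-4.
utf16CodePoints, ucs4CodePoints :: Integer
utf16CodePoints = maxUtf16CodePoint + 1
ucs4CodePoints  = 2 ^ (31 :: Int)
```

This gives 1024 * 1024 = 1048576 supplementary code points, a maximum of 0x10FFFF, and 1114112 code points in total, which matches "a little over a million" against the 2147483648 of a 31-bit space.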
Re: [Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString
On Tue, 06 Feb 2007 00:25:45 +0900, Chris Kuklewicz wrote:
> > UTF-8 also uses 4 to 6 byte encodings now. CJK Unified Ideographs Extension B, Tai Xuan Jing Symbols, Musical Symbols, etc. use a 4 byte encoding.
>
> Looking at several sources, it seems you are incorrect.
>
> Haskell Chars go up to Unicode 1114111 (decimal) or 0x10FFFF (hexadecimal). These are encoded by UTF-8 in 1, 2, 3, or 4 bytes.

I see. I confused Unicode support with charset support. I'm sorry about that.

UCS-4 can support code points greater than 1114111. So if we want to support the full UCS-4 range, we must support 5 and 6 byte encodings as RFC 2279 described.

http://www.rfc-editor.org/rfc/rfc2279.txt

But unfortunately UTF-16 can support code points only up to 1114111, and the Unicode Consortium adheres to UTF-16. So 5 and 6 byte encodings, and 4 byte encodings of code points above 1114111, are invalid now.

http://www.rfc-editor.org/rfc/rfc3629.txt

(RFC 3629 says "This memo obsoletes and replaces RFC 2279.")

And Haskell implementations use only the valid range. I forgot about that.

I'm afraid that fantasy is broken again, just like the belief, once trusted by people in Europe and America, that UCS-2 without surrogate pairs covers all languages.

--
shelarcy shelarcycapella.freemail.ne.jp
http://page.freett.com/shelarcy/
[Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString
Hello all,

I would like to announce my attempt at making a Unicode version of Data.ByteString. The library is named Data.CompactString to avoid conflict with other (Fast)PackedString libraries.

The library uses a variable length encoding (1 to 3 bytes) of Chars into Word8s, which are then stored in a ByteString. The structure is very much based on Data.ByteString, most of the implementation is copied from there. Hopefully this means that fusion rules could be copied as well.

This is kind of a pre-release, many functions are still missing, and I have not benchmarked yet. I am releasing this in the hopes of getting some feedback on the general idea.

Homepage: http://twan.home.fmf.nl/compact-string/
Haddock:  http://twan.home.fmf.nl/compact-string/doc/html/
Source:   darcs get http://twan.home.fmf.nl/repos/compact-string

Twan van Laarhoven