Re: [Haskell-cafe] Has character changed in GHC 6.8?
On 1/22/08, Ian Lynagh <[EMAIL PROTECTED]> wrote:
> On Tue, Jan 22, 2008 at 03:59:24PM +, Magnus Therning wrote:
> > Yes, of course, stupid me. But it is still the UTF-8 representation
> > of "ö", not Latin-1, and this brings me back to my original question:
> > is this an intentional change in 6.8?
>
> Yes (in 6.8.2, to be precise). It's in the release notes:
>
>   http://www.haskell.org/ghc/docs/6.8.2/html/users_guide/release-6-8-2.html
>
>   "GHCi now treats all input as unicode, except for the Windows
>   console, where we do the correct conversion from the current code
>   page."

Excellent news. One step closer to sanity when it comes to character
encodings on the command line :-)

/M
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Has character changed in GHC 6.8?
Peter Verswyvelen <[EMAIL PROTECTED]> writes:

> > Prelude Data.Char> map ord "ö"
> > [195,182]
> > Prelude Data.Char> length "ö"
> > 2
> >
> > there are actually 2 bytes there, but your terminal is showing them
> > as one character.
>
> So let's all switch to unicode ASAP and leave that horrible
> multi-byte-string-thing behind us?

You are being ironic, I take it? Unicode by its nature implies
multi-byte chars; it's just a question of how they are encoded: UTF-8
(one or more bytes, variable), UTF-16 (two or four, variable), or UCS-4
(or should it be UTF-32? - four bytes, fixed). The problem here is that
while terminal software has been UTF-8 for some time, GHC only recently
caught up.

-k
--
If I haven't seen further, it is by standing in the footprints of giants
Re: [Haskell-cafe] Has character changed in GHC 6.8?
Ketil Malde wrote:
> > So let's all switch to unicode ASAP and leave that horrible
> > multi-byte-string-thing behind us?
>
> You are being ironic, I take it?

No, I just used the wrong terminology. When I said unicode, I actually
meant UCS-x, and with multi-byte-string-thing I meant VARIABLE-length,
sorry about that. I find variable-length chars much harder to use and
reason about than fixed-length characters. UTF-x is a form of
compression, which is understandable, but it is IMHO a burden (since it
does not allow random access to the n-th character).

Now I'm getting a bit confused here. To summarize, what encoding does
GHC 6.8.2 use for [Char]? UCS-32?

BTW: According to Wikipedia, UCS-4 and UTF-32 are functionally
equivalent.

Cheers,
Peter
Re: [Haskell-cafe] Has character changed in GHC 6.8?
Peter Verswyvelen wrote:
> Now I'm getting a bit confused here. To summarize, what encoding does
> GHC 6.8.2 use for [Char]? UCS-32?

How dare you! Such a personal question! This is none of your business.

I jest, but the point is sound: the internal storage of Char is GHC's
business, and it should not leak to the programmer. All the programmer
needs to know is that Char is capable of storing unicode characters.
GHC might choose some custom storage method, including making Char an
ADT behind the scenes, or whatever it likes. Other haskell compilers or
interpreters are free to choose their own representation. In practice,
I believe that for GHC it's a wchar, which is typically a 32-bit
character with reasonably efficient libc support.

What *does* matter to the programmer is what encodings putStr and
getLine use. AFAIK, they use the lower 8 bits of the unicode code
point, which is almost functionally equivalent to Latin-1.

Jules
Re: [Haskell-cafe] Has character changed in GHC 6.8?
Peter Verswyvelen <[EMAIL PROTECTED]> writes:

> No, I just used the wrong terminology. When I said unicode, I actually
> meant UCS-x,

You might as well say UCS-4; nobody uses UCS-2 anymore. It's been
replaced by UTF-16, which gives you the complexity of UTF-8 without
being compact (for 99% of existing data), endianness-indifferent, or
backwards compatible with ASCII.

> and with multi-byte-string-thing I meant VARIABLE-length, sorry about
> that. I find variable-length chars much harder to use and reason
> about than fixed-length characters. UTF-x is a form of compression,
> which is understandable, but it is IMHO a burden (since it does not
> allow random access to the n-th character).

Do you really need that, though? Most formats I know of with enough
structure that you can pick up records by offset either encode the
offsets somewhere, or are restricted to ASCII, or both.

> Now I'm getting a bit confused here. To summarize, what encoding does
> GHC 6.8.2 use for [Char]? UCS-32?

Internally, a Haskell Char is Unicode, and stores a code point as a
32-bit (well, actually 21-bit or something) value. One Char, one code
point. ByteString stores 8-bit chars, and the Char8 interface chops off
the top bits, essentially projecting code points down to the ISO-8859-1
(Latin-1) subset.

Externally, it depends on what IO library you use. As for the command
line, Ian's post links to:
http://www.haskell.org/ghc/docs/6.8.2/html/users_guide/release-6-8-2.html

-k
--
If I haven't seen further, it is by standing in the footprints of giants
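Ketil's "one Char, one code point" description and the Char8-style
low-byte projection can be spot-checked with a small sketch. The helper
`latin1Byte` is my own (it re-implements the projection by hand rather
than calling the bytestring package):

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Sketch (mine, not the bytestring API) of the projection Ketil
-- describes: keep only the low 8 bits of each code point, i.e. the
-- ISO-8859-1 (Latin-1) subset.
latin1Byte :: Char -> Word8
latin1Byte = fromIntegral . ord

main :: IO ()
main = do
  print (map ord "\246")                -- '\246' is 'ö': one code point
  print (map latin1Byte "A\246\10003")  -- '\10003' (U+2713) truncates
```

Running it prints `[246]` and `[65,246,19]`: the ö stays a single code
point, while U+2713 loses its top bits (10003 mod 256 = 19).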
Re: [Haskell-cafe] Has character changed in GHC 6.8?
On Jan 23, 2008 11:56 AM, Jules Bean <[EMAIL PROTECTED]> wrote:
> Peter Verswyvelen wrote:
> > Now I'm getting a bit confused here. To summarize, what encoding
> > does GHC 6.8.2 use for [Char]? UCS-32?
[snip]
> What *does* matter to the programmer is what encodings putStr and
> getLine use. AFAIK, they use the lower 8 bits of the unicode code
> point, which is almost functionally equivalent to Latin-1.

Which is terrible! You should have to be explicit about what encoding
you expect. Python 3000 does it right.

-- Johan
Re: [Haskell-cafe] Has character changed in GHC 6.8?
Johan Tibell wrote:
> On Jan 23, 2008 11:56 AM, Jules Bean <[EMAIL PROTECTED]> wrote:
> > What *does* matter to the programmer is what encodings putStr and
> > getLine use. AFAIK, they use the lower 8 bits of the unicode code
> > point, which is almost functionally equivalent to Latin-1.
>
> Which is terrible! You should have to be explicit about what encoding
> you expect. Python 3000 does it right.

No arguments there. Presumably there wasn't a sufficiently good answer
available in time for haskell98.

Jules
Re: [Haskell-cafe] Has character changed in GHC 6.8?
On Jan 23, 2008 12:13 PM, Jules Bean <[EMAIL PROTECTED]> wrote:
> Presumably there wasn't a sufficiently good answer available in time
> for haskell98.

Will there be one for haskell prime?
Re: [Haskell-cafe] Has character changed in GHC 6.8?
> > > > What *does* matter to the programmer is what encodings putStr
> > > > and getLine use. AFAIK, they use the lower 8 bits of the
> > > > unicode code point, which is almost functionally equivalent to
> > > > Latin-1.
> > >
> > > Which is terrible! You should have to be explicit about what
> > > encoding you expect. Python 3000 does it right.
> >
> > Presumably there wasn't a sufficiently good answer available in
> > time for haskell98.
>
> Will there be one for haskell prime?

The I/O library needs an overhaul, but I'm not sure how to do this in a
backwards compatible manner, which would probably be required for
inclusion in Haskell'. One could, like Python 3000, break backwards
compatibility. I'm not sure about the implications of doing this. Maybe
introducing a new System.IO.Unicode module would be an option. If one
wants to keep the interface but change the semantics slightly, one
could define e.g. getChar as:

    getChar :: IO Char
    getChar = getWord8 >>= decodeChar latin1

assuming Latin-1 is what's used now. The benefit would be that if the
input is not in Latin-1, an exception could be thrown rather than
returning a Char representing the wrong Unicode code point.

I recommend reading about the Python I/O system overhaul for Python
3000, which is outlined in PEP 3116:
http://www.python.org/dev/peps/pep-3116/

My proposal is for I/O functions to specify the encoding they use if
they accept or return Chars (and Strings). If they deal in terms of
bytes (e.g. socket functions) they should accept and return Word8s.
Optionally, text I/O functions could default to the system locale
setting.

-- Johan
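For comparison, later versions of base (shipped with GHC 6.12 and up,
so after this thread) let a program pick a handle's text encoding
explicitly via `System.IO.hSetEncoding`, which is roughly the
per-handle version of what Johan asks for. A minimal sketch:

```haskell
import System.IO

-- Choose the handle's encoding explicitly instead of inheriting
-- whatever the ambient locale happens to be.
main :: IO ()
main = do
  hSetEncoding stdout utf8
  putStrLn "h\233llo"   -- '\233' is 'é'; written out as UTF-8 bytes
```

With `utf8` selected, the output bytes are the same regardless of the
system locale.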
Re: [Haskell-cafe] Has character changed in GHC 6.8?
Johan Tibell wrote:
> The I/O library needs an overhaul, but I'm not sure how to do this in
> a backwards compatible manner, which would probably be required for
> inclusion in Haskell'. One could, like Python 3000, break backwards
> compatibility. I'm not sure about the implications of doing this.
> Maybe introducing a new System.IO.Unicode module would be an option.
> If one wants to keep the interface but change the semantics slightly,
> one could define e.g. getChar as:
>
>     getChar :: IO Char
>     getChar = getWord8 >>= decodeChar latin1
>
> assuming Latin-1 is what's used now. The benefit would be that if the
> input is not in Latin-1, an exception could be thrown rather than
> returning a Char representing the wrong Unicode code point.

I'm not sure what you mean here. All 256 possible values have a
meaning. I did say "lower 8 bits of the unicode code point, which is
almost functionally equivalent to Latin-1". IIUC, it's Latin-1 plus the
two control-character ranges. There are no decoding errors for
haskell98's getChar.

> My proposal is for I/O functions to specify the encoding they use if
> they accept or return Chars (and Strings). If they deal in terms of
> bytes (e.g. socket functions) they should accept and return Word8s.

I would be more inclined to suggest they default to a particular
well-understood encoding, almost certainly UTF-8. Another interface
could give access to other encodings.

> Optionally, text I/O functions could default to the system locale
> setting.

That is a disastrous idea. Please read the other flamewars^Wdiscussions
on this list about this subject :) One was started by a certain Johan
Tibell :)

http://haskell.org/pipermail/haskell-cafe/2007-September/031724.html
http://haskell.org/pipermail/haskell-cafe/2007-September/032195.html

Jules
Re: [Haskell-cafe] Has character changed in GHC 6.8?
> > The benefit would be that if the input is not in Latin-1, an
> > exception could be thrown rather than returning a Char representing
> > the wrong Unicode code point.
>
> I'm not sure what you mean here. All 256 possible values have a
> meaning.

You're of course right. So we don't have a problem here. Maybe I was
thinking of an encoding (7-bit ASCII?) where some of the 256 values are
invalid.

> > My proposal is for I/O functions to specify the encoding they use
> > if they accept or return Chars (and Strings). If they deal in terms
> > of bytes (e.g. socket functions) they should accept and return
> > Word8s.
>
> I would be more inclined to suggest they default to a particular
> well-understood encoding, almost certainly UTF-8. Another interface
> could give access to other encodings.

That might be a good option. However, it would be nice if beginners
could write simple console programs using System.IO and have them work
correctly even if their system's encoding is not byte compatible with
UTF-8. People who do I/O over the network etc. need to be more careful
and should specify the encoding used. How would a UTF-8 default work on
different Windows versions?

> > Optionally, text I/O functions could default to the system locale
> > setting.
>
> That is a disastrous idea.

I'm not sure about that, as long as decode is called on the input to
make sure that it's a valid encoding given the input bytes. Same point
as above. What I would like to avoid is having to write:

    main = do
      putStrLn systemLocalEncoding "What's your name?"
      name <- getLine systemLocalEncoding
      putStrLn systemLocalEncoding $ "Hi " ++ name ++ "!"

I guess we could solve this by putting the functions in different
modules:

    System.IO                  -- requires explicit encoding
    System.IO.DefaultEncoding  -- implicit use of system locale setting

and have the modules export the same functions. Another option would be
to include the fact that encoding is implied in the name of the
function. Maybe we should start by giving some type signatures and
function names. That often helps my thinking. I'll try to write
something down when I get home from work.

-- Johan
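As a starting point for the "type signatures and function names"
exercise, here is one hypothetical shape the encoding-explicit style
could take. Every name here is made up for illustration (no such API
exists), and only a Latin-1 codec is implemented:

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical encoding-explicit API: every text function takes the
-- codec it should use, so nothing is decided by an ambient locale.
data Encoding = Encoding
  { encodeE :: String -> [Word8]
  , decodeE :: [Word8] -> String
  }

latin1 :: Encoding
latin1 = Encoding
  { encodeE = map (fromIntegral . ord)   -- truncates above U+00FF
  , decodeE = map (chr . fromIntegral)
  }

main :: IO ()
main = do
  let bytes = encodeE latin1 "Hi!"
  print bytes                     -- the wire format is explicit
  putStrLn (decodeE latin1 bytes)
```

A real library would add UTF-8 and friends, and make `encodeE` fail (or
substitute) on code points the codec cannot represent instead of
truncating.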
Re: [Haskell-cafe] Has character changed in GHC 6.8?
On 1/23/08, Johan Tibell <[EMAIL PROTECTED]> wrote:
[..]
> My proposal is for I/O functions to specify the encoding they use if
> they accept or return Chars (and Strings). If they deal in terms of
> bytes (e.g. socket functions) they should accept and return Word8s.
> Optionally, text I/O functions could default to the system locale
> setting.

Yes, this reflects my recent experience: Char is not a good
representation for an 8-bit byte. This thread came out of my attempt to
add a module to dataenc[1] that would make base64-string[2] obsolete.
As you probably can guess, I came to the conclusion that a function for
data encoding with type 'String -> String' is plain wrong. :-)

/M

[1]: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/dataenc-0.10.2
[2]: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/base64-string-0.1
Re: [Haskell-cafe] Has character changed in GHC 6.8?
Johan Tibell wrote:
> The I/O library needs an overhaul, but I'm not sure how to do this in
> a backwards compatible manner, which would probably be required for
> inclusion in Haskell'. One could, like Python 3000, break backwards
> compatibility. I'm not sure about the implications of doing this.
> Maybe introducing a new System.IO.Unicode module would be an option.

There are already some libraries that attempt to create a new string
and I/O library for Haskell, based on Unicode, with a separation of
byte semantics and character semantics. See for example Streams [1] or
CompactString [2].

Regards,
Reinier

[1]: http://haskell.org/haskellwiki/Library/Streams
[2]: http://twan.home.fmf.nl/compact-string/
Re: [Haskell-cafe] Has character changed in GHC 6.8?
Johan Tibell <[EMAIL PROTECTED]> writes:

> > > The benefit would be that if the input is not in Latin-1, an
> > > exception could be thrown rather than returning a Char
> > > representing the wrong Unicode code point.
> >
> > I'm not sure what you mean here. All 256 possible values have a
> > meaning.

OTOH, going the other way could be more troublesome; I'm not sure that
outputting a truncated value is what you want.

> You're of course right. So we don't have a problem here. Maybe I was
> thinking of an encoding (7-bit ASCII?) where some of the 256 values
> are invalid.

Well - each byte can be converted to the equivalent code point, but
0x80-0x9F are control characters, and some of those are left undefined.
Perhaps instead of truncating on output, we should map code points
> 0xFF to such a value? E.g. 0x81 is undefined in both Unicode and
Windows-1252.

-k
--
If I haven't seen further, it is by standing in the footprints of giants
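Ketil's suggestion, as a sketch. The function name and the choice of
0x81 as the marker byte follow the post; nothing like this exists in
base, so everything here is illustrative:

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Instead of silently truncating on output, send every code point
-- that does not fit in one byte to 0x81 (which the post notes is
-- unassigned in Windows-1252).
toOutputByte :: Char -> Word8
toOutputByte c
  | ord c > 0xFF = 0x81                 -- marker for unrepresentable
  | otherwise    = fromIntegral (ord c)

main :: IO ()
main = print (map toOutputByte "A\246\10003")
```

Here 'A' and '\246' pass through unchanged while '\10003' (U+2713)
becomes the 0x81 marker, so the output is `[65,246,129]` rather than a
misleading truncated byte.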
Re: [Haskell-cafe] Has character changed in GHC 6.8?
On Jan 23, 2008 2:11 PM, Magnus Therning <[EMAIL PROTECTED]> wrote:
> Yes, this reflects my recent experience: Char is not a good
> representation for an 8-bit byte. This thread came out of my attempt
> to add a module to dataenc[1] that would make base64-string[2]
> obsolete. As you probably can guess, I came to the conclusion that a
> function for data encoding with type 'String -> String' is plain
> wrong. :-)

Yes. Functions that deal with bytes shouldn't use Char. Char should be
seen as an ADT representing Unicode code points. It has nothing to do
with bytes.

-- Johan
[Haskell-cafe] Has character changed in GHC 6.8?
I vaguely remember that in GHC 6.6 code like this

    length $ map ord "a string"

was able to generate a different answer than

    length "a string"

At the time I thought that the encoding (in my case UTF-8) was "leaking
through". After switching to GHC 6.8 the behaviour seems to have
changed, and mapping 'ord' on a string results in a list of ints
representing the Unicode code points rather than the encoding:

    map ord "åäö"
    [229,228,246]

Is this the case, or is there something strange going on with character
encodings?

I was hoping that this would mean that 'chr . ord' would basically be a
no-op, but no such luck:

    chr . ord $ 'å'
    '\229'

What would I have to do to get an 'å' from '229'?

/M
--
Magnus Therning (OpenPGP: 0xAB4DFBA4)
magnus@therning.org Jabber: magnus.therning@gmail.com
http://therning.org/magnus

What if I don't want to obey the laws? Do they throw me in jail with
the other bad monads? -- Daveman
Re: [Haskell-cafe] Has character changed in GHC 6.8?
> chr . ord $ 'å'
> '\229'
>
> What would I have to do to get an 'å' from '229'?

It seems you already have it; 'å' is the same as '\229'. But IO output
is still 8-bit, so when you ask ghci to print 'å', you get '\229'. You
can use the utf8-string library (from hackage).
Re: [Haskell-cafe] Has character changed in GHC 6.8?
2008/1/22, Magnus Therning <[EMAIL PROTECTED]>:
> I vaguely remember that in GHC 6.6 code like this
>
>     length $ map ord "a string"
>
> was able to generate a different answer than
>
>     length "a string"

I guess it's not very difficult to prove that

    ∀ f xs. length xs == length (map f xs)

even in the presence of seq.

-- Felipe.
Re: [Haskell-cafe] Has character changed in GHC 6.8?
On Tue, 2008-01-22 at 09:29 +, Magnus Therning wrote:
> I vaguely remember that in GHC 6.6 code like this
>
>     length $ map ord "a string"
>
> was able to generate a different answer than
>
>     length "a string"

That seems unlikely.

> At the time I thought that the encoding (in my case UTF-8) was
> "leaking through". After switching to GHC 6.8 the behaviour seems to
> have changed, and mapping 'ord' on a string results in a list of ints
> representing the Unicode code points rather than the encoding:

Yes. GHC 6.8 treats .hs files as UTF-8, where it previously treated
them as Latin-1.

> map ord "åäö"
> [229,228,246]
>
> Is this the case, or is there something strange going on with
> character encodings?

That's what we'd expect. Note that GHCi still uses Latin-1. This will
change in GHC 6.10.

> I was hoping that this would mean that 'chr . ord' would basically be
> a no-op, but no such luck:
>
> chr . ord $ 'å'
> '\229'
>
> What would I have to do to get an 'å' from '229'?

Easy!

    Prelude> 'å' == '\229'
    True
    Prelude> 'å' == Char.chr 229
    True

Remember, when you type:

    Prelude> 'å'

what you really get is:

    Prelude> putStrLn (show 'å')

So perhaps what is confusing you is the Show instance for Char, which
converts Char -> String into a portable ascii representation.

Duncan
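Duncan's point, as a runnable check: the round trip really is the
identity, and only Show's escaping makes it look otherwise. (The
`hSetEncoding` line uses today's base library, so the raw character
prints correctly under any locale.)

```haskell
import Data.Char (chr, ord)
import System.IO

main :: IO ()
main = do
  hSetEncoding stdout utf8
  print ('\229' == chr (ord '\229'))  -- chr . ord is a no-op: True
  print '\229'                        -- show escapes non-ASCII: '\229'
  putStrLn ['\229']                   -- the character itself: å
```

The second and third lines print the same Char two ways: `print` goes
through `show`, which escapes to the portable `'\229'`, while
`putStrLn` emits the character itself.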
Re: [Haskell-cafe] Has character changed in GHC 6.8?
On Tue, 2008-01-22 at 12:56 +0300, Miguel Mitrofanov wrote:
> > chr . ord $ 'å'
> > '\229'
> >
> > What would I have to do to get an 'å' from '229'?
>
> It seems you already have it; 'å' is the same as '\229'.

Yes.

> But IO output is still 8-bit, so when you ask ghci to print 'å', you
> get '\229'.

No. :-) If you 'print' it you get:

    print 'å' = putStrLn (show 'å') = putStrLn "\229"

This has nothing to do with 8-bit IO. It's just what 'show' does for
Char. If you...

    putStrLn "å"

then you do get the low 8 bits being printed. But that's not what is
going on above.

> You can use the utf8-string library (from hackage).

    import qualified Codec.Binary.UTF8.String as UTF8
    putStrLn (UTF8.encodeString "å")

or just:

    import qualified System.IO.UTF8 as UTF8
    UTF8.putStrLn "å"

Duncan
Re: [Haskell-cafe] Has character changed in GHC 6.8?
On Tue, 22 Jan 2008, Duncan Coutts wrote:
> > At the time I thought that the encoding (in my case UTF-8) was
> > "leaking through". After switching to GHC 6.8 the behaviour seems
> > to have changed, and mapping 'ord' on a string results in a list of
> > ints representing the Unicode code points rather than the encoding:
>
> Yes. GHC 6.8 treats .hs files as UTF-8, where it previously treated
> them as Latin-1.

Can this be controlled by an option?
Re: [Haskell-cafe] Has character changed in GHC 6.8?
On Tue, 2008-01-22 at 13:48 +0100, Henning Thielemann wrote:
> > Yes. GHC 6.8 treats .hs files as UTF-8, where it previously treated
> > them as Latin-1.
>
> Can this be controlled by an option?

From the GHC manual:

  "GHC assumes that source files are ASCII or UTF-8 only, other
  encodings are not recognised. However, invalid UTF-8 sequences will
  be ignored in comments, so it is possible to use other encodings
  such as Latin-1, as long as the non-comment source code is ASCII
  only."

There is no option to have GHC assume a different encoding. You can use
something like iconv to convert .hs files from another encoding into
UTF-8.

Duncan
Re: [Haskell-cafe] Has character changed in GHC 6.8?
On 1/22/08, Duncan Coutts <[EMAIL PROTECTED]> wrote:
> On Tue, 2008-01-22 at 09:29 +, Magnus Therning wrote:
> > I vaguely remember that in GHC 6.6 code like this
> >
> >     length $ map ord "a string"
> >
> > was able to generate a different answer than
> >
> >     length "a string"
>
> That seems unlikely.

Unlikely, yes, yet I get the following in GHCi (ghc 6.6.1, the version
currently in Debian Sid):

    > map ord "a"
    [97]
    > map ord "ö"
    [195,182]

Funky, isn't it? ;-)

> Easy!
>
>     Prelude> 'å' == '\229'
>     True
>     Prelude> 'å' == Char.chr 229
>     True
>
> Remember, when you type:
>
>     Prelude> 'å'
>
> what you really get is:
>
>     Prelude> putStrLn (show 'å')
>
> So perhaps what is confusing you is the Show instance for Char, which
> converts Char -> String into a portable ascii representation.

Have you tried putting any of this into GHCi (6.6.1)? Any line with 'å'
results in the following for me:

    > 'å'
    <interactive>:1:2: lexical error in string/character literal at
    character '\165'
    > "å"
    "\195\165"

Somewhat disappointing. GHCi 6.8.2 does perform better though.

/M
Re: [Haskell-cafe] Has character changed in GHC 6.8?
On Tue, Jan 22, 2008 at 03:16:15PM +, Magnus Therning wrote:
> Unlikely, yes, yet I get the following in GHCi (ghc 6.6.1, the
> version currently in Debian Sid):
>
>     > map ord "a"
>     [97]
>     > map ord "ö"
>     [195,182]

In 6.6.1:

    Prelude Data.Char> map ord "ö"
    [195,182]
    Prelude Data.Char> length "ö"
    2

There are actually 2 bytes there, but your terminal is showing them as
one character.

Thanks
Ian
Re[2]: [Haskell-cafe] Has character changed in GHC 6.8?
Hello Duncan,

Tuesday, January 22, 2008, 1:36:44 PM, you wrote:
> Yes. GHC 6.8 treats .hs files as UTF-8, where it previously treated
> them as Latin-1.

afair, it was changed since 6.6

--
Best regards,
Bulat                          mailto:[EMAIL PROTECTED]
Re: [Haskell-cafe] Has character changed in GHC 6.8?
Ian Lynagh wrote:
> In 6.6.1:
>
>     Prelude Data.Char> map ord "ö"
>     [195,182]
>     Prelude Data.Char> length "ö"
>     2
>
> there are actually 2 bytes there, but your terminal is showing them
> as one character.

Still, that seems weird to me. A Haskell Char is a Unicode character.
An "ö" is either one character (Unicode code point 0xF6), which, in
UTF-8, is coded as two bytes, or a combination of an 'o' with an umlaut
(Unicode code point 776). But because the last value here is not 776,
the "ö" should just be one character. I'd suspect that the
two-character string comes from the terminal speaking UTF-8 to GHC
expecting Latin-1. GHC 6.8 expects UTF-8, so all is fine.

On my MacBook (OS X 10.4), 'ö' also immediately expands to \303\266
when I type it in my terminal, even outside GHCi. That suggests that
the terminal program doesn't handle Unicode and immediately escapes
weird characters.

Regards,
Reinier
Re: [Haskell-cafe] Has character changed in GHC 6.8?
Ian Lynagh wrote:
> Prelude Data.Char> map ord "ö"
> [195,182]
> Prelude Data.Char> length "ö"
> 2
>
> there are actually 2 bytes there, but your terminal is showing them
> as one character.

So let's all switch to unicode ASAP and leave that horrible
multi-byte-string-thing behind us?

Cheers,
Peter
Re: [Haskell-cafe] Has character changed in GHC 6.8?
On 1/22/08, Ian Lynagh <[EMAIL PROTECTED]> wrote:
> In 6.6.1:
>
>     Prelude Data.Char> map ord "ö"
>     [195,182]
>     Prelude Data.Char> length "ö"
>     2
>
> there are actually 2 bytes there, but your terminal is showing them
> as one character.

Yes, of course, stupid me. But it is still the UTF-8 representation of
"ö", not Latin-1, and this brings me back to my original question: is
this an intentional change in 6.8?

    > map ord "ö"
    [246]
    > map ord "åɓｚ𝐀"
    [229,595,65370,119808]

6.8 produces Unicode code points rather than a particular encoding.

/M
Re: [Haskell-cafe] Has character changed in GHC 6.8?
Magnus Therning wrote:
> Yes, of course, stupid me. But it is still the UTF-8 representation
> of "ö", not Latin-1, and this brings me back to my original question:
> is this an intentional change in 6.8?
>
>     > map ord "ö"
>     [246]
>
> 6.8 produces Unicode code points rather than a particular encoding.

The key point here is that this has nothing to do with GHC. GHC's
behaviour has not changed in this regard. This is about GHCi! [And, to
some extent, the behaviour of whatever shell / terminal emulator you
run ghci in.]

Sounds like a pedantic difference, but it's not. The difference here is
what GHCi is feeding into your haskell code when you type the sequence
"ö" at a ghci prompt, rather than anything different about the
underlying behaviour of map, ord, length, show, putStr. map, ord,
length, show and putStr have not changed from 6.6 to 6.8.

I don't have 6.8 handy myself, but from your demonstration it would
appear that 6.8's ghci correctly understands whatever input encoding is
being used in whatever terminal environment you are choosing to run
ghci within. Whereas 6.6's ghci was using a single-byte terminal
approach, and your terminal environment was encoding "ö" as two
characters.

Jules
Re: [Haskell-cafe] Has character changed in GHC 6.8?
On Tue, 2008-01-22 at 07:45 -0200, Felipe Lessa wrote:
> I guess it's not very difficult to prove that
>
>     ∀ f xs. length xs == length (map f xs)
>
> even in the presence of seq.

This is the free theorem of length. For it to be wrong, parametric
polymorphism would have to be incorrectly implemented. Even seq makes
no difference (in this case).
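The free theorem can't be established by testing, of course, but a
quick spot-check over the strings from this thread is easy (plain
assertions, no QuickCheck; `holdsFor` is just a name for this sketch):

```haskell
import Data.Char (ord)

-- length only inspects the list's spine, never its elements, so
-- mapping any total function over the list preserves it.
holdsFor :: String -> Bool
holdsFor xs = length xs == length (map ord xs)

main :: IO ()
main = print (all holdsFor ["", "a string", "\229\228\246"])
```

The last string is "åäö" written with escapes; since 6.8 each of its
three characters is one Char, so both sides count 3.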
Re: [Haskell-cafe] Has character changed in GHC 6.8?
On Tue, Jan 22, 2008 at 03:59:24PM +, Magnus Therning wrote:
> Yes, of course, stupid me. But it is still the UTF-8 representation
> of "ö", not Latin-1, and this brings me back to my original question:
> is this an intentional change in 6.8?

Yes (in 6.8.2, to be precise). It's in the release notes:

  http://www.haskell.org/ghc/docs/6.8.2/html/users_guide/release-6-8-2.html

  "GHCi now treats all input as unicode, except for the Windows
  console, where we do the correct conversion from the current code
  page."

Thanks
Ian