Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-23 Thread Magnus Therning
On 1/22/08, Ian Lynagh [EMAIL PROTECTED] wrote:

 On Tue, Jan 22, 2008 at 03:59:24PM +, Magnus Therning wrote:
 
  Yes, of course, stupid me.  But it is still the UTF-8 representation of
 ö,
  not Latin-1, and this brings me back to my original question, is this an
  intentional change in 6.8?

 Yes (in 6.8.2, to be precise).

 It's in the release notes:

 http://www.haskell.org/ghc/docs/6.8.2/html/users_guide/release-6-8-2.html
 GHCi now treats all input as unicode, except for the Windows console
 where we do the correct conversion from the current code page.


Excellent news.  One step closer to sanity when it comes to character
encodings on the command line :-)

/M
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-23 Thread Ketil Malde
Peter Verswyvelen [EMAIL PROTECTED] writes:

 Prelude Data.Char> map ord "ö"
 [195,182]
 Prelude Data.Char> length "ö"
 2

 there are actually 2 bytes there, but your terminal is showing them as
 one character.

 So let's all switch to unicode ASAP and leave that horrible
 multi-byte-string-thing behind us?

You are being ironic, I take it?

Unicode by its nature implies multi-byte chars; it's just a question
of how they are encoded: UTF-8 (one or more bytes, variable), UTF-16
(two or four, variable), or UCS-4 (or should it be UTF-32? - four
bytes, fixed).  The problem here is that while terminal software has
been UTF-8 for some time, GHC only recently caught up.
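For concreteness, here is a minimal sketch (not from the thread) of the variable-width point: a toy UTF-8 encoder for code points below 0x10000, ignoring surrogates and the 4-byte form for brevity.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- UTF-8 encoding of a single code point below 0x10000:
-- one byte below 0x80, two below 0x800, three otherwise.
utf8Bytes :: Char -> [Word8]
utf8Bytes c
  | n < 0x80  = [fromIntegral n]
  | n < 0x800 = [0xC0 .|. fromIntegral (n `shiftR` 6), cont n]
  | otherwise = [0xE0 .|. fromIntegral (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  where
    n = ord c
    cont m = 0x80 .|. fromIntegral (m .&. 0x3F)  -- continuation byte: 10xxxxxx

main :: IO ()
main = print (map utf8Bytes "a\246")   -- "aö": [[97],[195,182]]
```

So one Char can turn into one, two, or three bytes, which is exactly why byte counts and character counts disagree.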

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants


Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-23 Thread Peter Verswyvelen

Ketil Malde wrote:

So let's all switch to unicode ASAP and leave that horrible
multi-byte-string-thing behind us?



You are being ironic, I take it?
  
No, I just used the wrong terminology. When I said unicode, I actually meant 
UCS-x, and with multi-byte-string-thing I meant VARIABLE-length, sorry 
about that. I find variable-length chars so much harder to use and 
reason about than fixed-length characters. UTF-x is a form of 
compression, which is understandable, but it is IMHO a burden (since it 
does not allow random access to the n-th character).


Now I'm getting a bit confused here. To summarize, what encoding does 
GHC 6.8.2 use for [Char]? UCS-32?


BTW: According the Wikipedia, UCS-4 and UTF-32 are functionally equivalent.

Cheers,
Peter



Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-23 Thread Jules Bean

Peter Verswyvelen wrote:

Now I'm getting a bit confused here. To summarize, what encoding does 
GHC 6.8.2 use for [Char]? UCS-32?


How dare you! Such a personal question! This is none of your business.

I jest, but the point is sound: the internal storage of Char is ghc's 
business, and it should not leak to the programmer. All the programmer 
needs to know is that Char is capable of storing unicode characters. GHC 
might choose some custom storage method, including making Char an ADT 
behind the scenes, or whatever it likes. Other haskell compilers or 
interpreters are free to choose their own representation.


In practice, I believe that for GHC it's a wchar, which is typically a 
32bit character with reasonably efficient libc support.


What *does* matter to the programmer is what encodings putStr and 
getLine use. AFAIK, they use lower 8 bits of unicode code point which 
is almost functionally equivalent to latin-1.


Jules



Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-23 Thread Ketil Malde
Peter Verswyvelen [EMAIL PROTECTED] writes:

 No I just used wrong terminology. When I said unicode, I actually meant UCS-x,

You might as well say UCS-4; nobody uses UCS-2 anymore.  It's been
replaced by UTF-16, which gives you the complexity of UTF-8 without
being compact (for 99% of existing data), endianness-indifferent, or
backwards compatible with ASCII.

 and with multi-byte-string-thing I meant VARIABLE-length, sorry about that. I
 find variable length chars so much harder to use and reason about than the
 fixed length characters. UTF-x is a form of compression, which is
 understandable, but it is IMHO a burden (since it does not allow random access
 to the n-th character)

Do you really need that, though?  Most formats I know with enough structure
that you can pick up records by offset either encode the offsets
somewhere, or are restricted to ASCII, or both.

 Now I'm getting a bit confused here. To summarize, what encoding does GHC 
 6.8.2
 use for [Char]? UCS-32?

Internally, a Haskell Char is Unicode: it stores a code point as a
32-bit (well, actually 21-bit or so) value.  One Char, one code
point.

ByteString stores 8-bit chars, and the Char8 interface chops off the
top bits, essentially projecting codepoints down to the ISO-8859-1
(latin1) subset.
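The "chops off the top bits" projection can be sketched without ByteString at all (an illustrative approximation of Char8's behaviour, not its actual code):

```haskell
import Data.Bits ((.&.))
import Data.Char (chr, ord)

-- Keep only the low 8 bits of the code point, Char8-style.
chop :: Char -> Char
chop = chr . (.&. 0xFF) . ord

main :: IO ()
main = do
  print (ord (chop '\246'))  -- 'ö' is 246: inside Latin-1, unchanged
  print (ord (chop '\595'))  -- 'ɓ' is 595: silently becomes 83, i.e. 'S'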

Externally, it depends on what IO library you use.

As for the command line, Ian's post links to:
  http://www.haskell.org/ghc/docs/6.8.2/html/users_guide/release-6-8-2.html

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants


Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-23 Thread Johan Tibell
On Jan 23, 2008 11:56 AM, Jules Bean [EMAIL PROTECTED] wrote:
 Peter Verswyvelen wrote:

  Now I'm getting a bit confused here. To summarize, what encoding does
  GHC 6.8.2 use for [Char]? UCS-32?

 [snip]

 What *does* matter to the programmer is what encodings putStr and
 getLine use. AFAIK, they use lower 8 bits of unicode code point which
 is almost functionally equivalent to latin-1.

Which is terrible! You should have to be explicit about what encoding
you expect. Python 3000 does it right.

-- Johan


Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-23 Thread Jules Bean

Johan Tibell wrote:

On Jan 23, 2008 11:56 AM, Jules Bean [EMAIL PROTECTED] wrote:

Peter Verswyvelen wrote:


Now I'm getting a bit confused here. To summarize, what encoding does
GHC 6.8.2 use for [Char]? UCS-32?

[snip]

What *does* matter to the programmer is what encodings putStr and
getLine use. AFAIK, they use lower 8 bits of unicode code point which
is almost functionally equivalent to latin-1.


Which is terrible! You should have to be explicit about what encoding
you expect. Python 3000 does it right.


No arguments there.

Presumably there wasn't a sufficiently good answer available in time for 
haskell98.


Jules


Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-23 Thread david48
On Jan 23, 2008 12:13 PM, Jules Bean [EMAIL PROTECTED] wrote:

 Presumably there wasn't a sufficiently good answer available in time for
 haskell98.

Will there be one for haskell prime ?


Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-23 Thread Johan Tibell
What *does* matter to the programmer is what encodings putStr and
getLine use. AFAIK, they use lower 8 bits of unicode code point which
is almost functionally equivalent to latin-1.
  
   Which is terrible! You should have to be explicit about what encoding
   you expect. Python 3000 does it right.
 
  Presumably there wasn't a sufficiently good answer available in time for
  haskell98.

 Will there be one for haskell prime ?

The I/O library needs an overhaul but I'm not sure how to do this in a
backwards compatible manner which probably would be required for
inclusion in Haskell'. One could, like Python 3000, break backwards
compatibility. I'm not sure about the implications of doing this.
Maybe introducing a new System.IO.Unicode module would be an option.

If one wants to keep the interface but change the semantics slightly
one could define e.g. getChar as:

getChar :: IO Char
getChar = getWord8 >>= decodeChar latin1

Assuming latin-1 is what's used now.

The benefit would be that if the input is not in latin-1 an exception
could be thrown rather than returning a Char representing the wrong
Unicode code point.

I recommend reading about the Python I/O system overhaul for Python
3000 which is outlined in PEP 3116
http://www.python.org/dev/peps/pep-3116/

My proposal is for I/O functions to specify the encoding they use if
they accept or return Chars (and Strings). If they deal in terms of
bytes (e.g. socket functions) they should accept and return Word8s.
Optionally, text I/O functions could default to the system locale
setting.
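As a sketch of that shape (the Encoding type and decode function here are invented for illustration, not an actual library API): decoding returns Maybe so an invalid byte sequence can be reported instead of silently producing the wrong code point.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

data Encoding = Latin1 | Utf8

-- Decode a byte sequence to text; Nothing signals an invalid sequence.
-- Only the 1- and 2-byte UTF-8 forms are handled in this sketch.
decode :: Encoding -> [Word8] -> Maybe String
decode Latin1 = Just . map (chr . fromIntegral)
decode Utf8   = go
  where
    go []       = Just []
    go (b : bs)
      | b < 0x80              = (chr (fromIntegral b) :) <$> go bs
      | b >= 0xC2 && b < 0xE0 = case bs of
          c : rest | c >= 0x80 && c < 0xC0 ->
            let n = (fromIntegral b - 0xC0) * 64 + (fromIntegral c - 0x80)
            in (chr n :) <$> go rest
          _ -> Nothing
      | otherwise             = Nothing  -- 3/4-byte forms omitted here

main :: IO ()
main = do
  print (decode Utf8 [195, 182])   -- the two bytes of "ö" decode to one Char
  print (decode Utf8 [182])        -- a lone continuation byte is rejected
  print (decode Latin1 [182])      -- every byte is valid Latin-1
```

Socket-style functions would then traffic in [Word8] (or ByteString), and text functions would take an Encoding argument.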

-- Johan


Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-23 Thread Jules Bean

Johan Tibell wrote:

What *does* matter to the programmer is what encodings putStr and
getLine use. AFAIK, they use lower 8 bits of unicode code point which
is almost functionally equivalent to latin-1.

Which is terrible! You should have to be explicit about what encoding
you expect. Python 3000 does it right.

Presumably there wasn't a sufficiently good answer available in time for
haskell98.

Will there be one for haskell prime ?


The I/O library needs an overhaul but I'm not sure how to do this in a
backwards compatible manner which probably would be required for
inclusion in Haskell'. One could, like Python 3000, break backwards
compatibility. I'm not sure about the implications of doing this.
Maybe introducing a new System.IO.Unicode module would be an option.

If one wants to keep the interface but change the semantics slightly
one could define e.g. getChar as:

getChar :: IO Char
getChar = getWord8 >>= decodeChar latin1

Assuming latin-1 is what's used now.

The benefit would be that if the input is not in latin-1 an exception
could be thrown rather than returning a Char representing the wrong
Unicode code point.


I'm not sure what you mean here. All 256 possible values have a meaning.

I did say 'lower 8 bits of unicode code point which is almost 
functionally equivalent to latin-1.'


IIUC, it's latin-1 plus the two control-character ranges.

There are no decoding errors for haskell98's getChar.


My proposal is for I/O functions to specify the encoding they use if
they accept or return Chars (and Strings). If they deal in terms of
bytes (e.g. socket functions) they should accept and return Word8s.


I would be more inclined to suggest they default to a particular 
well-understood encoding, almost certainly UTF-8. Another interface could 
give access to other encodings.



Optionally, text I/O functions could default to the system locale
setting.


That is a disastrous idea.

Please read the other flamewars^Wdiscussions on this list about this 
subject :) One was started by a certain Johann Tibell :)


http://haskell.org/pipermail/haskell-cafe/2007-September/031724.html

http://haskell.org/pipermail/haskell-cafe/2007-September/032195.html

Jules



Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-23 Thread Johan Tibell
  The benefit would be that if the input is not in latin-1 an exception
  could be thrown rather than returning a Char representing the wrong
  Unicode code point.

 I'm not sure what you mean here. All 256 possible values have a meaning.

You're of course right. So we don't have a problem here. Maybe I was
thinking of an encoding (7-bit ASCII?) where some of the 256 values
are invalid.

  My proposal is for I/O functions to specify the encoding they use if
  they accept or return Chars (and Strings). If they deal in terms of
  bytes (e.g. socket functions) they should accept and return Word8s.

 I would be more inclined to suggest they default to a particular well
 understood encoding, almost certainly UTF-8. Another interface could give
 access to other encodings.

That might be a good option. However, it would be nice if beginners
could write simple console programs using System.IO and have them work
correctly even if their system's encoding is not byte compatible with
UTF-8. People who do I/O over the network etc. need to be more careful
and should specify the encoding used. How would a UTF-8 default work
on different Windows versions?

  Optionally, text I/O functions could default to the system locale
  setting.

 That is a disastrous idea.

I'm not sure about that as long as decode is called on the input to
make sure that it's a valid encoding given the input bytes. Same point
as above. What I would like to avoid is having to write:

main = do
  putStrLn systemLocalEncoding "What's your name?"
  name <- getLine systemLocalEncoding
  putStrLn systemLocalEncoding $ "Hi " ++ name ++ "!"

I guess we could solve this by putting the functions in different modules:

System.IO  -- requires explicit encoding
System.IO.DefaultEncoding  -- implicit use of system locale setting

And have the modules export the same functions. Another option would
be to include the fact that encoding is implied in the name of the
function. Maybe we should start by giving some type signatures and
function names. That often helps my thinking. I'll try to write
something down when I get home from work.

-- Johan


Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-23 Thread Magnus Therning
On 1/23/08, Johan Tibell [EMAIL PROTECTED] wrote:
[..]

My proposal is for I/O functions to specify the encoding they use if
 they accept or return Chars (and Strings). If they deal in terms of
 bytes (e.g. socket functions) they should accept and return Word8s.
 Optionally, text I/O functions could default to the system locale
 setting.


Yes, this reflects my recent experience, Char is not a good representation
for an 8-bit byte.  This thread came out of my attempt to add a module to
dataenc[1] that would make base64-string[2] obsolete.  As you probably can
guess I came to the conclusion that a function for data encoding with type
'String -> String' is plain wrong. :-)

/M

[1]:
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/dataenc-0.10.2
[2]:
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/base64-string-0.1


Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-23 Thread Reinier Lamers

Johan Tibell wrote:

What *does* matter to the programmer is what encodings putStr and
getLine use. AFAIK, they use lower 8 bits of unicode code point which
is almost functionally equivalent to latin-1.
  

Which is terrible! You should have to be explicit about what encoding
you expect. Python 3000 does it right.


Presumably there wasn't a sufficiently good answer available in time for
haskell98.
  

Will there be one for haskell prime ?



The I/O library needs an overhaul but I'm not sure how to do this in a
backwards compatible manner which probably would be required for
inclusion in Haskell'. One could, like Python 3000, break backwards
compatibility. I'm not sure about the implications of doing this.
Maybe introducing a new System.IO.Unicode module would be an option.

There are already some libraries that attempt to create a new string and
I/O library for Haskell, based on Unicode, with a separation of byte
semantics and character semantics. See for example Streams [1] or
CompactString [2].

Regards,
Reinier

[1]: http://haskell.org/haskellwiki/Library/Streams
[2]: http://twan.home.fmf.nl/compact-string/





Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-23 Thread Ketil Malde
Johan Tibell [EMAIL PROTECTED] writes:

 The benefit would be that if the input is not in latin-1 an exception
 could be thrown rather than returning a Char representing the wrong
 Unicode code point.

 I'm not sure what you mean here. All 256 possible values have a meaning.

OTOH, going the other way could be more troublesome, I'm not sure that
outputting a truncated value is what you want.

 You're of course right. So we don't have a problem here. Maybe I was
 thinking of an encoding (7-bit ASCII?) where some of the 256 values
 are invalid.

Well - each byte can be converted to the equivalent code point, but
0x80-0x9F are control characters, and some of those are left
undefined.  Perhaps instead of truncating on output, we should map
code points > 0xFF to such a value?  E.g. 0x81 is undefined in both
Unicode and Windows 1252. 

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants


Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-23 Thread Johan Tibell
On Jan 23, 2008 2:11 PM, Magnus Therning [EMAIL PROTECTED] wrote:
 Yes, this reflects my recent experience, Char is not a good representation
 for an 8-bit byte.  This thread came out of my attempt to add a module to
 dataenc[1] that would make base64-string[2] obsolete.  As you probably can
 guess I came to the conclusion that a function for data encoding with type
 'String -> String' is plain wrong. :-)

Yes. Functions that deal with bytes shouldn't use Char. Char should be
seen as an ADT representing Unicode code points. It has nothing to do
with bytes.

-- Johan


[Haskell-cafe] Has character changed in GHC 6.8?

2008-01-22 Thread Magnus Therning
I vaguely remember that in GHC 6.6 code like this

  length $ map ord "a string"

being able to generate a different answer than

  length "a string"

At the time I thought that the encoding (in my case UTF-8) was “leaking
through”.  After switching to GHC 6.8 the behaviour seems to have
changed, and mapping 'ord' on a string results in a list of ints
representing the Unicode code point rather than the encoding:

  > map ord "åäö"
  [229,228,246]

Is this the case, or is there something strange going on with character
encodings?

I was hoping that this would mean that 'chr . ord' would basically be a
no-op, but no such luck:

  > chr . ord $ 'å'
  '\229'

What would I have to do to get an 'å' from '229'?

/M

-- 
Magnus Therning (OpenPGP: 0xAB4DFBA4)
magnus@therning.org Jabber: magnus.therning@gmail.com
http://therning.org/magnus

What if I don't want to obey the laws? Do they throw me in jail with
the other bad monads?
 -- Daveman





Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-22 Thread Miguel Mitrofanov
chr . ord $ 'å'
   '\229'
 What would I have to do to get an 'å' from '229'?

It seems you already have it; 'å' is the same as '\229'. But IO output is still 
8-bit, so when you ask ghci to print 'å', you get '\229'. You can use the 
utf8-string library (from Hackage).


Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-22 Thread Felipe Lessa
2008/1/22 Magnus Therning [EMAIL PROTECTED]:
 I vaguely remember that in GHC 6.6 code like this
 
   length $ map ord "a string"
 
 being able to generate a different answer than
 
   length "a string"

I guess it's not very difficult to prove that

 ∀ f xs.  length xs == length (map f xs)

even in the presence of seq.
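The law is easy to spot-check (a test, not the proof):

```haskell
import Data.Char (ord)

-- length is parametric in the element type, so mapping cannot change it.
prop_mapPreservesLength :: (a -> b) -> [a] -> Bool
prop_mapPreservesLength f xs = length xs == length (map f xs)

main :: IO ()
main = print (all (prop_mapPreservesLength ord) ["", "a string", "\229\228\246"])
```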

-- 
Felipe.


Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-22 Thread Duncan Coutts

On Tue, 2008-01-22 at 09:29 +, Magnus Therning wrote:
 I vaguely remember that in GHC 6.6 code like this
 
   length $ map ord "a string"
 
 being able to generate a different answer than
 
   length "a string"

That seems unlikely.

 At the time I thought that the encoding (in my case UTF-8) was “leaking
 through”.  After switching to GHC 6.8 the behaviour seems to have
 changed, and mapping 'ord' on a string results in a list of ints
 representing the Unicode code point rather than the encoding:

Yes. GHC 6.8 treats .hs files as UTF-8 where it previously treated them
as Latin-1.

 map ord "åäö"
   [229,228,246]
 
 Is this the case, or is there something strange going on with character
 encodings?

That's what we'd expect. Note that GHCi still uses Latin-1. This will
change in GHC-6.10.

 I was hoping that this would mean that 'chr . ord' would basically be a
 no-op, but no such luck:
 
chr . ord $ 'å'
   '\229'
 
 What would I have to do to get an 'å' from '229'?

Easy!

Prelude> 'å' == '\229'
True
Prelude> 'å' == Char.chr 229
True

Remember, when you type:
Prelude> 'å'

what you really get is:
Prelude> putStrLn (show 'å')

So perhaps what is confusing you is the Show instance for Char which
converts Char -> String into a portable ASCII representation.
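A short illustration of that difference (the last line's bytes depend on your terminal's encoding, so its output is left unasserted):

```haskell
main :: IO ()
main = do
  print '\229'             -- goes through show: prints '\229'
  putStrLn (show '\229')   -- exactly the same thing
  putStrLn ['\229']        -- sends the character itself, no escaping
```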

Duncan



Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-22 Thread Duncan Coutts

On Tue, 2008-01-22 at 12:56 +0300, Miguel Mitrofanov wrote:
 chr . ord $ 'å'
'\229'
  What would I have to do to get an 'å' from '229'?
 
 It seems you already have it; 'å' is the same as '\229'.

Yes.

  But IO output is still 8-bit, so when you ask ghci to print 'å', you get 
 '\229'.

No. :-)

if you 'print' it you get:

print 'å'
= putStrLn (show 'å')
= putStrLn "'\229'"

this has nothing to do with 8-bit IO. It's just what 'show' does for
Char.

If you..
putStrLn "å"
then you do get the low 8 bits being printed. But that's not what is
going on above.

 You can use the utf8-string library (from Hackage).

import qualified Codec.Binary.UTF8.String as UTF8
putStrLn (UTF8.encodeString "å")

or just:

import qualified System.IO.UTF8 as UTF8
UTF8.putStrLn "å"


Duncan



Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-22 Thread Henning Thielemann

On Tue, 22 Jan 2008, Duncan Coutts wrote:

  At the time I thought that the encoding (in my case UTF-8) was “leaking
  through”.  After switching to GHC 6.8 the behaviour seems to have
  changed, and mapping 'ord' on a string results in a list of ints
  representing the Unicode code point rather than the encoding:

 Yes. GHC 6.8 treats .hs files as UTF-8 where it previously treated them
 as Latin-1.

Can this be controlled by an option?


Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-22 Thread Duncan Coutts

On Tue, 2008-01-22 at 13:48 +0100, Henning Thielemann wrote:
 On Tue, 22 Jan 2008, Duncan Coutts wrote:
 
   At the time I thought that the encoding (in my case UTF-8) was “leaking
   through”.  After switching to GHC 6.8 the behaviour seems to have
   changed, and mapping 'ord' on a string results in a list of ints
   representing the Unicode code point rather than the encoding:
 
  Yes. GHC 6.8 treats .hs files as UTF-8 where it previously treated them
  as Latin-1.
 
 Can this be controlled by an option?

From the GHC manual:

GHC assumes that source files are ASCII or UTF-8 only, other
encodings are not recognised. However, invalid UTF-8 sequences
will be ignored in comments, so it is possible to use other
encodings such as Latin-1, as long as the non-comment source
code is ASCII only.

There is no option to have GHC assume a different encoding. You can use
something like iconv to convert .hs files from another encoding into
UTF-8.

Duncan



Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-22 Thread Magnus Therning
On 1/22/08, Duncan Coutts [EMAIL PROTECTED] wrote:


 On Tue, 2008-01-22 at 09:29 +, Magnus Therning wrote:
  I vaguely remember that in GHC 6.6 code like this
 
length $ map ord "a string"
 
  being able to generate a different answer than
 
length "a string"

 That seems unlikely.


Unlikely yes, yet I get the following in GHCi (ghc 6.6.1, the version
currently in Debian Sid):

 > map ord "a"
[97]
 > map ord "ö"
[195,182]

Funky, isn't it? ;-)

Easy!

 Prelude> 'å' == '\229'
 True
 Prelude> 'å' == Char.chr 229
 True

 Remember, when you type:
 Prelude> 'å'

 what you really get is:
 Prelude> putStrLn (show 'å')

 So perhaps what is confusing you is the Show instance for Char which
 converts Char -> String into a portable ASCII representation.


Have you tried putting any of this into GHCi (6.6.1)?  Any line with 'å'
results in the following for me:

 > 'å'
interactive:1:2: lexical error in string/character literal at character
'\165'
 > "å"
"\195\165"

Somewhat disappointing.  GHCi 6.8.2 does perform better though.

/M


Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-22 Thread Ian Lynagh
On Tue, Jan 22, 2008 at 03:16:15PM +, Magnus Therning wrote:
 On 1/22/08, Duncan Coutts [EMAIL PROTECTED] wrote:
 
 
   On Tue, 2008-01-22 at 09:29 +, Magnus Therning wrote:
    I vaguely remember that in GHC 6.6 code like this
   
  length $ map ord "a string"
   
    being able to generate a different answer than
   
  length "a string"
 
  That seems unlikely.
 
 
 Unlikely yes, yet I get the following in GHCi (ghc 6.6.1, the version
 currently in Debian Sid):
 
  > map ord "a"
 [97]
  > map ord "ö"
 [195,182]

In 6.6.1:

Prelude Data.Char> map ord "ö"
[195,182]
Prelude Data.Char> length "ö"
2

there are actually 2 bytes there, but your terminal is showing them as
one character.


Thanks
Ian



Re[2]: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-22 Thread Bulat Ziganshin
Hello Duncan,

Tuesday, January 22, 2008, 1:36:44 PM, you wrote:

 Yes. GHC 6.8 treats .hs files as UTF-8 where it previously treated them
 as Latin-1.

afair, it was changed since 6.6


-- 
Best regards,
 Bulat  mailto:[EMAIL PROTECTED]



Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-22 Thread Reinier Lamers

Ian Lynagh wrote:

On Tue, Jan 22, 2008 at 03:16:15PM +, Magnus Therning wrote:

 On 1/22/08, Duncan Coutts [EMAIL PROTECTED] wrote:

  On Tue, 2008-01-22 at 09:29 +, Magnus Therning wrote:

   I vaguely remember that in GHC 6.6 code like this

 length $ map ord "a string"

   being able to generate a different answer than

 length "a string"

  That seems unlikely.

 Unlikely yes, yet I get the following in GHCi (ghc 6.6.1, the version
 currently in Debian Sid):

  > map ord "a"
 [97]
  > map ord "ö"
 [195,182]

In 6.6.1:

Prelude Data.Char> map ord "ö"
[195,182]
Prelude Data.Char> length "ö"
2

there are actually 2 bytes there, but your terminal is showing them as
one character.
Still, that seems weird to me. A Haskell Char is a Unicode character. An 
"ö" is either one character (Unicode point 0xF6, which UTF-8 encodes as 
two bytes) or a combination of an 'o' with a combining umlaut (Unicode 
point 776). Since what we have here is not 776, the "ö" should be just 
one character. I'd suspect the two-character string comes from the 
terminal speaking UTF-8 to a GHC expecting Latin-1. GHC 6.8 expects 
UTF-8, so all is fine.
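The two representations can be compared directly ('\246' is the precomposed ö, '\776' the combining diaeresis):

```haskell
import Data.Char (ord)

main :: IO ()
main = do
  print (map ord "\246")     -- precomposed: one code point, [246]
  print (map ord "o\776")    -- o + combining mark: [111,776]
  print (length "o\776")     -- two Chars, though it may render as one glyph
```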


On my MacBook (OS X 10.4), 'ö' also immediately expands to \303\266 
when I type it in my terminal, even outside GHCi. That suggests that the 
terminal program doesn't handle Unicode and immediately escapes weird 
characters.


Regards,
Reinier


Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-22 Thread Peter Verswyvelen

Ian Lynagh wrote:

Prelude Data.Char> map ord "ö"
[195,182]
Prelude Data.Char> length "ö"
2

there are actually 2 bytes there, but your terminal is showing them as
one character.
  
So let's all switch to unicode ASAP and leave that horrible 
multi-byte-string-thing behind us?


Cheers,
Peter






Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-22 Thread Magnus Therning
On 1/22/08, Ian Lynagh [EMAIL PROTECTED] wrote:

 On Tue, Jan 22, 2008 at 03:16:15PM +, Magnus Therning wrote:
  On 1/22/08, Duncan Coutts [EMAIL PROTECTED] wrote:
  
   On Tue, 2008-01-22 at 09:29 +, Magnus Therning wrote:
    I vaguely remember that in GHC 6.6 code like this
   
   length $ map ord "a string"
   
    being able to generate a different answer than
   
   length "a string"
  
   That seems unlikely.
 
  Unlikely yes, yet I get the following in GHCi (ghc 6.6.1, the version
  currently in Debian Sid):
 
   > map ord "a"
  [97]
   > map ord "ö"
  [195,182]
 
 In 6.6.1:
 
 Prelude Data.Char> map ord "ö"
 [195,182]
 Prelude Data.Char> length "ö"
 2
 
 there are actually 2 bytes there, but your terminal is showing them as
 one character.


Yes, of course, stupid me.  But it is still the UTF-8 representation of "ö",
not Latin-1, and this brings me back to my original question, is this an
intentional change in 6.8?

 > map ord "ö"
[246]
 > map ord "åɓz퐀"
[229,595,65370,119808]

6.8 produces Unicode code points rather than a particular encoding.

/M


Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-22 Thread Jules Bean

Magnus Therning wrote:
Yes, of course, stupid me.  But it is still the UTF-8 representation of 
"ö", not Latin-1, and this brings me back to my original question, is 
this an intentional change in 6.8?


  > map ord "ö"
 [246]
  > map ord "åɓz퐀"
 [229,595,65370,119808]

 6.8 produces Unicode code points rather than a particular encoding.


The key point here is this has nothing to do with GHC.

GHC's behaviour has not changed in this regard.

This is about GHCi! [And, to some extent, the behaviour of whatever 
shell / terminal emulator you run ghci in]


Sounds like a pedantic difference, but it's not.

The difference here is what GHCi is feeding into your haskell code when 
you type the sequence "ö" at a ghci prompt, rather than anything 
different about the underlying behaviour of map, ord, length, show, or 
putStr. Those functions have not changed from 6.6 to 6.8.


I don't have 6.8 handy myself, but from your demonstration it would 
appear that 6.8's ghci correctly understands whatever input encoding is 
being used in whatever terminal environment you are choosing to run ghci 
within.


Whereas, 6.6's ghci was using a single-byte terminal approach, and your 
terminal environment was encoding ö as two characters.


Jules


Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-22 Thread Derek Elkins
On Tue, 2008-01-22 at 07:45 -0200, Felipe Lessa wrote:
 2008/1/22 Magnus Therning [EMAIL PROTECTED]:
  I vaguely remember that in GHC 6.6 code like this
 
length $ map ord "a string"
 
  being able to generate a different answer than
 
length "a string"
 
 I guess it's not very difficult to prove that
 
  ∀ f xs.  length xs == length (map f xs)
 
 even in the presence of seq.

This is the free theorem of length.  For it to be wrong, parametric
polymorphism would have to be incorrectly implemented.  Even seq makes
no difference (in this case).



Re: [Haskell-cafe] Has character changed in GHC 6.8?

2008-01-22 Thread Ian Lynagh
On Tue, Jan 22, 2008 at 03:59:24PM +, Magnus Therning wrote:
 
 Yes, of course, stupid me.  But it is still the UTF-8 representation of "ö",
 not Latin-1, and this brings me back to my original question, is this an
 intentional change in 6.8?

Yes (in 6.8.2, to be precise).

It's in the release notes:

http://www.haskell.org/ghc/docs/6.8.2/html/users_guide/release-6-8-2.html
GHCi now treats all input as unicode, except for the Windows console
where we do the correct conversion from the current code page.


Thanks
Ian
