Re: [Haskell-cafe] invalid character encoding

2005-03-21 Thread Ross Paterson
On Sun, Mar 20, 2005 at 04:34:12AM +0000, Ian Lynagh wrote:
 On Sun, Mar 20, 2005 at 01:33:44AM +0000, [EMAIL PROTECTED] wrote:
  On Sat, Mar 19, 2005 at 07:14:25PM +0000, Ian Lynagh wrote:
   Is there anything LC_CTYPE can be set to that will act like C/POSIX but
   accept 8-bit bytes as chars too?
  
  en_GB.iso88591 (or indeed any .iso88591 locale) will match the old
  behaviour (and the GHC behaviour).
 
 This works for me with en_GB.iso88591 (or en_GB), but not en_US.iso88591
 (or en_US). My /etc/locale.gen contains:
 
 en_GB ISO-8859-1
 en_GB.ISO-8859-15 ISO-8859-15
 en_GB.UTF-8 UTF-8
 
 So is there anything that /always/ works?

Since systems may have no locale other than C/POSIX, no.

  Yes, I don't see how to avoid this when using mbtowc() to do the
  conversion: it makes no distinction between a bad byte sequence and an
  incomplete one.
 
 Perhaps you could use mbrtowc instead?

Indeed.  Thanks for pointing it out.


Re: [Haskell-cafe] invalid character encoding

2005-03-21 Thread Glynn Clements

John Meacham wrote:

  I'm not suggesting inventing conventions. I'm suggesting leaving such
  issues to the application programmer who, unlike the library
  programmer, probably has enough context to be able to reliably
  determine the correct encoding in any specific instance.
 
 But the whole point of Foreign.C.String is to interface to existing C
 code. And one of the most common conventions of said interfaces is to
 represent strings in the current locale, which is why locale-honoring
 conversion routines are useful.

My point is that most C functions which accept or return char*s will
work regardless of whether those char*s can be decoded according to
the current locale. E.g.

while (d = readdir(dir), d)
{
    stat(d->d_name, &st);
    ...
}

will stat() every filename in the directory regardless of whether or
not the filenames are valid in the locale's encoding.

The Haskell equivalent using FilePath (i.e. String),
getDirectoryContents etc currently only works because the char* ->
String conversions are hardcoded to ISO-8859-1, which is infallible
and reversible. If it used e.g. UTF-8, it would fail on any filename
which wasn't valid UTF-8 even though it never actually needs to know
the string of characters which the filename represents.

The same applies to reading filenames from argv[] and passing them to
open() etc. This is one of the most common idioms in Unix programming,
and it doesn't care about encodings at all. Again, it would cease to
work reliably in Haskell if the automatic char* -> String conversions
in getArgs etc started using the locale.

I'm not arguing about *how* char* -> String conversions should be
performed so much as arguing about *whether* these conversions should
be performed. The conversion issues are only problems because the
conversions are being done at all.

-- 
Glynn Clements [EMAIL PROTECTED]


Re: [Haskell-cafe] invalid character encoding

2005-03-20 Thread Keean Schupke
One thing I don't like about this automatic conversion is that it is 
hidden magic - and could catch people out. Let's say I don't want to use 
it... How can I do the following
(ie what are the new API calls):

   Open a file with a name that is invalid in the current locale (say a 
zip disc from a computer with a different locale setting).

   Open a file with contents in an unknown encoding.
   What are the new binary API calls for file IO?
   What type is returned from 'getChar' on a binary file? Should it 
even be called getChar? What about getWord8 (getWord16, getWord32, etc.)?

   Does the encoding translation occur just on the filename or the 
contents as well? What if I have an encoded filename with binary 
contents, and vice versa?

   Keean.
(I guess I now have to rewrite a lot of file IO code!)


Re: [Haskell-cafe] invalid character encoding

2005-03-20 Thread ross
On Sun, Mar 20, 2005 at 12:59:52PM +0000, Keean Schupke wrote:
 How can I do the following (ie what are the new API calls):
 
Open a file with a name that is invalid in the current locale (say a 
 zip disc from a computer with a different locale setting).

A new API is needed for this.

Open a file with contents in an unknown encoding.
 
What are the new binary API calls for file IO?

see System.IO

    What type is returned from 'getChar' on a binary file? Should it 
 even be called getChar? What about getWord8 (getWord16, getWord32, etc.)?

Char, of course.  And yes, it's not ideal.  There's also a byte array
interface.

 (I guess I now have to rewrite a lot of file IO code!)

If it was doing binary I/O on H98 Handles, it already needed rewriting.
There's nothing to be done for filenames until a new API emerges.


Re: [Haskell-cafe] invalid character encoding

2005-03-19 Thread Einar Karttunen
Wolfgang Thaller [EMAIL PROTECTED] writes:
 In what way is ISO-2022 non-reversible? Is it possible that an ISO-2022 
 file name that is converted to Unicode cannot be converted back any 
 more (assuming you know for sure that it was ISO-2022 in the first 
 place)?

I am no expert on ISO-2022 so the following may contain errors,
please correct if it is wrong.

ISO-2022 -> Unicode is always possible.
Unicode -> ISO-2022 should also always be possible, but it is a relation,
not a function. This means there are (infinitely?) many ways of encoding
a particular Unicode string in ISO-2022.

ISO-2022 works by providing escape sequences to switch between different
character sets. One can freely use these escapes in almost any way one
wishes. Also, ISO-2022 distinguishes the same character in
Japanese/Chinese/Korean - which Unicode does not.

See here for more info on the topic:
http://www.ecma-international.org/publications/files/ecma-st/ECMA-035.pdf


Also, trusting the system locale for everything is problematic and makes
things quite unbearable for I18N. E.g. on my desktop 95% of things run
with ISO-8859-1, 3% of things use UTF-8 and a few apps use EUC-JP...

Using filenames as opaque blobs causes the least problems. If the
program wishes to display them in a graphical environment then they have
to be converted to a string, but very many apps never display the
filenames...

- Einar Karttunen


Re: [Haskell-cafe] invalid character encoding

2005-03-19 Thread David Roundy
On Sat, Mar 19, 2005 at 12:55:54PM +0100, Marcin 'Qrczak' Kowalczyk wrote:
 Glynn Clements [EMAIL PROTECTED] writes:
  The point is that a single program often generates multiple streams of
  text, possibly for different audiences (e.g. humans and machines).
  Different streams may require different conventions (encodings,
  numeric formats, collating orders), but may use the same functions.
 
 A single program has a single stdout and a single filesystem. The
 contexts which use the locale encoding don't need multiple encodings.

That's not true, there could be many filesystems, each of which uses a
different encoding for the filenames.  In the case of removable media, this
scenario isn't even unlikely.
-- 
David Roundy
http://www.darcs.net


Re: [Haskell-cafe] invalid character encoding

2005-03-19 Thread Glynn Clements

Einar Karttunen wrote:

  In what way is ISO-2022 non-reversible? Is it possible that an ISO-2022 
  file name that is converted to Unicode cannot be converted back any 
  more (assuming you know for sure that it was ISO-2022 in the first 
  place)?
 
 I am no expert on ISO-2022 so the following may contain errors,
 please correct if it is wrong.
 
 ISO-2022 -> Unicode is always possible.
 Unicode -> ISO-2022 should also always be possible, but it is a relation,
 not a function. This means there are (infinitely?) many ways of encoding
 a particular Unicode string in ISO-2022.
 
 ISO-2022 works by providing escape sequences to switch between different
 character sets. One can freely use these escapes in almost any way one
 wishes.

Exactly.

Moreover, while there are an infinite number of equivalent
representations in theory (you can add as many redundant switching
sequences as you wish), there are multiple plausible equivalent
representations in practice.

-- 
Glynn Clements [EMAIL PROTECTED]


Re: [Haskell-cafe] invalid character encoding

2005-03-19 Thread Glynn Clements

Marcin 'Qrczak' Kowalczyk wrote:

  I'm talking about standard (XSI) curses, which will just pass
  printable (non-control) bytes straight to the terminal. If your
  terminal uses CP437 (or some other non-standard encoding), you can
  just pass the appropriate bytes to waddstr() etc and the corresponding
  characters will appear on the terminal.
 
 Which terminal uses CP437?

Most software terminal emulators can use any encoding. Traditional
comms packages tend to support this (including their own VGA font if
necessary) because of its widespread use on BBSes which were targeted
at MS-DOS systems.

There exist hardware terminals (I can't name specific models, but I
have seen them in use) which support this, specifically for use with
MS-DOS systems.

 Linux console doesn't, except temporarily after switching the mapping
 to builtin CP437 (but this state is not used by curses) or after
 loading CP437 as the user map (nobody does this, and it won't work
 properly with all characters from the range 0x80-0x9F anyway).

I *still* encounter programs written for the linux console which
assume that the built-in CP437 font is being used (if you use an
ISO-8859-1 font, you get dialogs with accented characters where you
would expect line-drawing characters).

  You can treat it as immutable. Just don't call setlocale with
  different arguments again.
 
  Which limits you to a single locale. If you are using the locale's
  encoding, that limits you to a single encoding.
 
 There is no support for changing the encoding of a terminal on the fly
 by programs running inside it.

If you support multiple terminals with different encodings, and the
library uses the global locale settings to determine the encoding, you
need to switch locale every time you write to a different terminal.

  The point is that a single program often generates multiple streams of
  text, possibly for different audiences (e.g. humans and machines).
  Different streams may require different conventions (encodings,
  numeric formats, collating orders), but may use the same functions.
 
 A single program has a single stdout and a single filesystem. The
 contexts which use the locale encoding don't need multiple encodings.
 
 Multiple encodings are needed e.g. for exchanging data with other
 machines for the network, for reading contents of text files after the
 user has specified an encoding explicitly etc. In these cases an API
 with explicitly provided encoding should be used.

An API which is used for reading and writing text files or sockets is
just as applicable to stdin/stdout.

   The current locale mechanism is just a way of avoiding the issues
   as much as possible when you can't get away with avoiding them
   altogether.
  
  It's a way to communicate the encoding of the terminal, filenames,
  strerror, gettext etc.
 
  It's *a* way, but it's not a very good way. It sucks when you can't
  apply a single convention to everything.
 
 It's not so bad as to justify inventing our own conventions and forcing
 users to configure the encoding of Haskell programs separately.

I'm not suggesting inventing conventions. I'm suggesting leaving such
issues to the application programmer who, unlike the library
programmer, probably has enough context to be able to reliably
determine the correct encoding in any specific instance.

  Unicode has no viable competition.
 
  There are two viable alternatives. Byte strings with associated
  encodings and ISO-2022.
 
 ISO-2022 is an insanely complicated brain-damaged mess. I know it's
 being used in some parts of the world, but the sooner it will die,
 the better.

ISO-2022 has advantages and disadvantages relative to UTF-8. I don't
want to go on about the specifics here because they aren't
particularly relevant. What's relevant is that it isn't likely to
disappear any time soon.

A large part of the world already has a universal encoding which works
well enough; they don't *need* UTF-8, and aren't going to rebuild
their IT infrastructure from scratch for the sake of it.

-- 
Glynn Clements [EMAIL PROTECTED]


Re: [Haskell-cafe] invalid character encoding

2005-03-19 Thread Keean Schupke
David Roundy wrote:
That's not true, there could be many filesystems, each of which uses a
different encoding for the filenames.  In the case of removable media, this
scenario isn't even unlikely.
 

I agree - I can quite easily see the situation occurring where a student 
(say from Japan) brings in a zip-disk or USB key formatted with a 
Japanese filename encoding, that I need to read on my computer (with a 
UK locale).

Also, can different windows have different encodings? I might have a web 
browser (written in Haskell?) running and have windows with several 
different encodings open at the same time, whilst saving things on 
filesystems with differing encodings.

Keean.


Re: [Haskell-cafe] invalid character encoding

2005-03-19 Thread Mark Carroll
On Sat, 19 Mar 2005, David Roundy wrote:

 That's not true, there could be many filesystems, each of which uses a
 different encoding for the filenames.  In the case of removable media, this
 scenario isn't even unlikely.

The nearest desktop machine to me right now has in its directory structure
filesystems that use different encodings. So, yes, it's probably not all
that rare.

Mark.

-- 
Haskell vacancies in Columbus, Ohio, USA: see http://www.aetion.com/jobs.html


Re: [Haskell-cafe] invalid character encoding

2005-03-19 Thread Glynn Clements

Wolfgang Thaller wrote:

  Of course, it's quite possible that the only test cases will be people
  using UTF-8-only (or even ASCII-only) systems, in which case you won't
  see any problems.
 
 I'm kind of hoping that we can just ignore a problem that is so rare 
 that a large and well-known project like GTK2 can get away with 
 ignoring it.

1. The filename issues in GTK-2 are likely to be a major problem in
CJK locales, where filenames which don't match the locale (which is
seldom UTF-8) are common.

2. GTK's filename handling only really applies to file selector
dialogs. Most other uses of filenames in a GTK-based application don't
involve GTK; they use the OS API functions which just deal with byte
strings.

3. GTK is a GUI library. Most of the text which it deals with is going
to be rendered, so it *has* to be interpreted as characters. Treating
it as blobs of data won't work. IOW, on the question of whether or not
to interpret byte strings as character strings, GTK is at the far end
of the scale.

 Also, IIRC, Java strings are supposed to be unicode, too - 
 how do they deal with the problem?

Files are represented by instances of the File class:

http://java.sun.com/j2se/1.5.0/docs/api/java/io/File.html

An abstract representation of file and directory pathnames.

You can construct Files from Strings, and convert Files to Strings. 

The File class includes two sets of directory enumeration methods:
list() returns an array of Strings, while listFiles() returns an array
of Files.

The documentation for the File class doesn't mention encoding issues
at all. However, with that interface, it would be possible to
enumerate and open filenames which cannot be decoded.

  So we can't do Unicode-based I18N because there exist a few unix
  systems with messed-up file systems?
 
  Declaring such systems to be messed up won't make the problems go
  away. If a design doesn't work in reality, it's the fault of the
  design, not of reality.
 
 In general, yes. But we're not talking about all of reality here, we're 
 talking about one small part of reality - the question is, can the part 
 of reality where the design doesn't work be ignored?

Sure, you *can* ignore it; K&R C ignored everything other than ASCII.
If you limit yourself to locales which use the Roman alphabet (i.e.
ISO-8859-N for N=1/2/3/4/9/15), you can get away with a lot.

Most such users avoid encoding issues altogether by dropping the
accents and sticking to ASCII, at least when dealing with files which
might leave their system.

To get a better idea, you would need to consult users whose language
doesn't use the Roman alphabet, e.g. CJK or Cyrillic. Unfortunately,
you don't usually find too many of them on lists such as this.

I'm only familiar with one OSS project which has a sizeable CJK user
base, and that's XEmacs (whose I18N revolves around ISO-2022, and most
of the documentation is in Japanese). Even there, there are separate
mailing lists for English and Japanese, and the two seldom
communicate.

 I think that if we wait long enough, the filename encoding problems 
 will become irrelevant and we will live in an ideal world where unicode 
 actually works. Maybe next year, maybe only in ten years.

Maybe not even then. If Unicode really solved encoding problems, you'd
expect the CJK world to be the first adopters, but they're actually
the least eager; you are more likely to find UTF-8 in an
English-language HTML page or email message than a Japanese one.

-- 
Glynn Clements [EMAIL PROTECTED]


Re: [Haskell-cafe] invalid character encoding

2005-03-19 Thread Marcin 'Qrczak' Kowalczyk
Wolfgang Thaller [EMAIL PROTECTED] writes:

 Also, IIRC, Java strings are supposed to be unicode, too -
 how do they deal with the problem?

Java (Sun)
--

Filenames are assumed to be in the locale encoding.

a) Interpreting. Bytes which cannot be converted are replaced by U+FFFD.

b) Creating. Characters which cannot be converted are replaced by ?.

Command line arguments and standard I/O are treated in the same way.


Java (GNU)
--

Filenames are assumed to be in Java-modified UTF-8.

a) Interpreting. If a filename cannot be converted, a directory listing
   contains a null instead of a string object.

b) Creating. All Java characters are representable in Java-modified UTF-8.
   Obviously not all potential filenames can be represented.

Command line arguments are interpreted according to the locale.
Bytes which cannot be converted are skipped.

Standard I/O works in ISO-8859-1 by default. Obviously all input is
accepted. On output characters above U+00FF are replaced by ?.


C# (mono)
-

Filenames use the list of encodings from the MONO_EXTERNAL_ENCODINGS
environment variable, with UTF-8 implicitly added at the end. These
encodings are tried in order.

a) Interpreting. If a filename cannot be converted, it's skipped in
   a directory listing.

   The documentation says that if a filename, a command line argument
   etc. looks like valid UTF-8, it is treated as such first, and
   MONO_EXTERNAL_ENCODINGS is consulted only in remaining cases.
   The reality seems to not match this (mono-1.0.5).

b) Creating. If UTF-8 is used, U+0000 throws an exception
   (System.ArgumentException: Path contains invalid chars), paired
   surrogates are treated correctly, and an isolated surrogate causes
   an internal error:
** ERROR **: file strenc.c: line 161 (mono_unicode_to_external): assertion 
failed: (utf8!=NULL)
aborting...

Command line arguments are treated in the same way, except that if an
argument cannot be converted, the program dies at start:
[Invalid UTF-8]
Cannot determine the text encoding for argument 1 (xxx\xb1\xe6\xea).
Please add the correct encoding to MONO_EXTERNAL_ENCODINGS and try again.

Console.WriteLine emits UTF-8. Paired surrogates are treated
correctly, unpaired surrogates are converted to pseudo-UTF-8.

Console.ReadLine interprets text as UTF-8. Bytes which cannot be
converted are skipped.

-- 
   __("<  Marcin Kowalczyk
   \__/    [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/


Re: [Haskell-cafe] invalid character encoding

2005-03-19 Thread Ian Lynagh
On Wed, Mar 16, 2005 at 11:55:18AM +0000, Ross Paterson wrote:
 On Wed, Mar 16, 2005 at 03:54:19AM +0000, Ian Lynagh wrote:
  Do you have a list of functions which behave differently in the new
  release to how they did in the previous release?
  (I'm not interested in changes that will affect only whether something
  compiles, not how it behaves given it compiles both before and after).
 
 I got lost in the negatives here.  It affects all Haskell 98 primitives
 that do character I/O, or that exchange C strings with the C library.

In the below, it looks like there is a bug in getDirectoryContents.

Also, the error from w.hs is going to stdout, not stderr.

Most importantly, though: is there any way to remove this file without
doing something like an FFI import of unlink?

Is there anything LC_CTYPE can be set to that will act like C/POSIX but
accept 8-bit bytes as chars too?


(in the POSIX locale)
$ echo 'import Directory; main = getDirectoryContents "." >>= print' > q.hs
$ runhugs q.hs 
[".","..","q.hs"]
$ touch 1`printf "\xA2"`
$ runhugs q.hs
runhugs: Error occurred

ERROR - Garbage collection fails to reclaim sufficient space


$ echo 'import Directory; main = removeFile "1\xA2"' > w.hs
$ runhugs w.hs

Program error: 1?: Directory.removeFile: does not exist (file does not exist)
$ strace -o strace.out runhugs w.hs > /dev/null
$ grep unlink strace.out | head -c 14 | hexdump -C
00000000  75 6e 6c 69 6e 6b 28 22  31 3f 22 29 20 20        |unlink("1?")  |
0000000e
$ strace -o strace2.out rm 1*
$ grep unlink strace2.out | head -c 14 | hexdump -C
00000000  75 6e 6c 69 6e 6b 28 22  31 a2 22 29 20 20        |unlink("1.")  |
0000000e
$ 



Now consider this e.hs:


import IO

main = do hWaitForInput stdin 1
          putStrLn "Input is ready"
          r <- hReady stdin
          print r
          c <- hGetChar stdin
          print c
          putStrLn "Done!"


$ { printf "\xC2\xC2\xC2\xC2\xC2\xC2\xC2"; sleep 30; } | runhugs e.hs
Input is ready
True

Program error: stdin: IO.hGetChar: protocol error (invalid character encoding)
$ 

It takes 30 seconds for this error to be printed. This shows two issues:
First of all, I think you should be giving an error as soon as you have
a prefix that is the start of no character. Second, hReady now only
guarantees hGetChar won't block on a binary mode handle, but I guess
there is not much we can do except document that (short of some hideous
hacks).


Thanks
Ian



Re: [Haskell-cafe] invalid character encoding

2005-03-19 Thread Wolfgang Thaller
Also, IIRC, Java strings are supposed to be unicode, too -
how do they deal with the problem?
Files are represented by instances of the File class:
[...]
The documentation for the File class doesn't mention encoding issues
at all.
... which led me to conclude that they don't deal with the problem 
properly.

I think that if we wait long enough, the filename encoding problems
will become irrelevant and we will live in an ideal world where
unicode actually works. Maybe next year, maybe only in ten years.
Maybe not even then. If Unicode really solved encoding problems, you'd
expect the CJK world to be the first adopters, but they're actually
the least eager; you are more likely to find UTF-8 in an
English-language HTML page or email message than a Japanese one.
Hmm, that's possibly because English-language users can get away with 
just marking their ASCII files as UTF-8. But I'm not arguing about files 
or HTML pages here, I'm only concerned with filenames. I prefer unicode 
nowadays because I was born within a hundred kilometers of the border 
between ISO-8859-1 and ISO-8859-2. I need 8859-1 for German-language 
texts, but as soon as I write about where I went for vacation, I need a 
few 8859-2 characters. So 8-bit encodings didn't cut it, and nobody 
ever tried to sell ISO-2022 to me, so unicode was the only alternative.

So you've now convinced me that there is a considerable number of 
computers using ISO-2022, where there's more than one way to encode the 
same text (how do people use this from the command line??). There are 
also multi-user systems where the users don't agree on a single 
encoding. I still reserve the right to call those systems messed-up, 
but that's just my personal opinion and reality couldn't care less 
about what I think.

So, as I don't want to stick with the status quo forever (lists of 
bytes that pretend to be lists of unicode chars, even on platforms 
where unicode is used anyway), how about we get to work - what do we 
want?

I don't think we want a type class here, a plain (abstract) data type 
will do:

 data File
Obviously, we'll need conversion from and to C strings. On Mac OS X, 
they'd be guaranteed to be in UTF-8.

 withFilePathCString :: String -> (CString -> IO a) -> IO a
 fileFromCString :: CString -> IO File
We will need functions for converting to and from unicode strings. I'm 
pretty sure that we want to keep those functions pure, otherwise 
they'll be very annoying to use.

 fileFromPath :: String -> File
Any impure operations that might be needed to decide how to encode the 
file name will have to be delayed until the File is actually used.

 fileToPath :: File -> String
Same here: any impure operation necessary to convert the File to a 
unicode string needs to be done when the file is created.

What about failure? If you go from String to File, errors should be 
reported when you actually access the file. At an earlier time, you 
can't know whether the file name is valid (e.g. if you mount a 
classic HFS volume on Mac OS X, you can only create files there whose 
names can be represented in the volume's file name encoding - but you 
only find that out once you try to create a file).

For going from File to String, I'm not so sure, but I would be very 
annoyed if I had to deal with a Maybe String return type on platforms 
where it will always succeed. Maybe there should be separate functions 
for different purposes - i.e. for display, you'd use a File -> String 
function that will silently use '?'s when things can't be decoded, but 
in other situations you might use a File -> Maybe String function and 
check for Nothing.

If people want to implement more sophisticated ways of decoding file 
names than can be provided by the library, they'd get the C string and 
do the same things.

Of course, there should also be lots of other useful functions that 
make it more or less unnecessary to deal with path names directly in 
most cases.

Thoughts?
Cheers,
Wolfgang


Re: [Haskell-cafe] invalid character encoding

2005-03-19 Thread ross
On Sat, Mar 19, 2005 at 07:14:25PM +0000, Ian Lynagh wrote:
 In the below, it looks like there is a bug in getDirectoryContents.

Yes, now fixed in CVS.

 Also, the error from w.hs is going to stdout, not stderr.

It's a nuisance, but no one has got around to changing it.

 Most importantly, though: is there any way to remove this file without
 doing something like an FFI import of unlink?
 
 Is there anything LC_CTYPE can be set to that will act like C/POSIX but
 accept 8-bit bytes as chars too?

en_GB.iso88591 (or indeed any .iso88591 locale) will match the old
behaviour (and the GHC behaviour).

Indeed it's possible to have filenames (under POSIX, anyway) that H98
programs can't touch (under Hugs).  That pretty much follows from
the Haskell definition FilePath = String.  The other thread under this
subject has touched on the need for an (additional) API using an abstract
FilePath type.

 Now consider this e.hs:
 
 
 import IO
 
 main = do hWaitForInput stdin 1
           putStrLn "Input is ready"
           r <- hReady stdin
           print r
           c <- hGetChar stdin
           print c
           putStrLn "Done!"
 
 
 $ { printf "\xC2\xC2\xC2\xC2\xC2\xC2\xC2"; sleep 30; } | runhugs e.hs
 Input is ready
 True
 
 Program error: stdin: IO.hGetChar: protocol error (invalid character 
 encoding)
 $ 
 
 It takes 30 seconds for this error to be printed. This shows two issues:
 First of all, I think you should be giving an error as soon as you have
 a prefix that is the start of no character. Second, hReady now only
 guarantees hGetChar won't block on a binary mode handle, but I guess
 there is not much we can do except document that (short of some hideous
 hacks).

Yes, I don't see how to avoid this when using mbtowc() to do the
conversion: it makes no distinction between a bad byte sequence and an
incomplete one.


Re: [Haskell-cafe] invalid character encoding

2005-03-19 Thread John Meacham
On Sat, Mar 19, 2005 at 03:04:04PM +0000, Glynn Clements wrote:
 I'm not suggesting inventing conventions. I'm suggesting leaving such
 issues to the application programmer who, unlike the library
 programmer, probably has enough context to be able to reliably
 determine the correct encoding in any specific instance.

But the whole point of Foreign.C.String is to interface to existing C
code. And one of the most common conventions of said interfaces is to
represent strings in the current locale, which is why locale-honoring
conversion routines are useful.

I don't think anyone is arguing that this is the end-all of charset
conversion, far from it. A general conversion library and parameterized
conversion routines are also needed for many of the reasons you said,
and will probably appear at some point. I have my own iconv interface
which I used for my initial implementation of with/peekCString etc. and
I am sure other people have written their own, eventually one will be
standardized. A general conversion facility has been on the wishlist for
a long time.

However, at the moment, the FFI is tackling a much simpler goal of
interfacing with existing C code, and non-parameterized locale-honoring
conversion routines are extremely useful for that. Even if we had a nice
generalized conversion routine, a simple locale-honoring front end would
be a very useful interface because it is so commonly needed when
interfacing to C code.

However, I am sure everyone would be happy if a nice cabalized general
charset conversion library appeared... I have the start of one here, which
should work on any POSIXy system, even if wchar_t is not unicode (no
windows support though)
 http://repetae.net/john/recent/out/HsLocale.html

John

-- 
John Meacham - repetae.netjohn 


Re: [Haskell-cafe] invalid character encoding

2005-03-19 Thread Dimitry Golubovsky
Glynn Clements wrote:

To get a better idea, you would need to consult users whose language
doesn't use the roman alphabet, e.g. CJK or cyrillic. Unfortunately,
you don't usually find too many of them on lists such as this.

In Russia, we still have multiple one-byte encodings for Cyrillic: KOI-8 
(Unix), CP1251 (Windows), and the increasingly obsolete CP866 (MS-DOS, 
OS/2). Regarding filenames, I am sure Windows stores them in Unicode 
regardless of locale (I tried various chcp numbers in a console window, 
printing a directory containing filenames in Russian and in German 
together; it showed the characters outside the locale-based codepage as 
question marks, but showed everything with chcp 65001, which is 
Unicode). AFAIK Unix users do not create files named in Russian very 
often, whereas Windows users do this frequently.

Dimitry  Golubovsky
Middletown, CT


Re: [Haskell-cafe] invalid character encoding

2005-03-19 Thread Ian Lynagh
On Sun, Mar 20, 2005 at 01:33:44AM +0000, [EMAIL PROTECTED] wrote:
 On Sat, Mar 19, 2005 at 07:14:25PM +0000, Ian Lynagh wrote:
 
  Most importantly, though: is there any way to remove this file without
  doing something like an FFI import of unlink?
  
  Is there anything LC_CTYPE can be set to that will act like C/POSIX but
  accept 8-bit bytes as chars too?
 
 en_GB.iso88591 (or indeed any .iso88591 locale) will match the old
 behaviour (and the GHC behaviour).

This works for me with en_GB.iso88591 (or en_GB), but not en_US.iso88591
(or en_US). My /etc/locale.gen contains:

en_GB ISO-8859-1
en_GB.ISO-8859-15 ISO-8859-15
en_GB.UTF-8 UTF-8

So is there anything that /always/ works?

 Indeed it's possible to have filenames (under POSIX, anyway) that H98
 programs can't touch (under Hugs).  That pretty much follows from
 the Haskell definition FilePath = String.  The other thread under this
 subject has touched on the need for an (additional) API using an abstract
 FilePath type.

Hmm. I can't say I'm convinced by all this without having something like
that API.

 Yes, I don't see how to avoid this when using mbtowc() to do the
 conversion: it makes no distinction between a bad byte sequence and an
 incomplete one.

Perhaps you could use mbrtowc instead?

My manpage says

If the n bytes starting at s do not contain a complete multibyte
character, mbrtowc returns (size_t)(-2). This can happen even if
n >= MB_CUR_MAX, if the multibyte string contains redundant shift
sequences.

If the multibyte string starting at s contains an invalid multibyte
sequence before the next complete character, mbrtowc returns
(size_t)(-1) and sets errno to EILSEQ. In this case, the effects on *ps
are undefined.

For both functions my manpage says

CONFORMING TO
   ISO/ANSI C, UNIX98


Thanks
Ian



Re: [Haskell-cafe] invalid character encoding

2005-03-18 Thread Marcin 'Qrczak' Kowalczyk
Glynn Clements [EMAIL PROTECTED] writes:

  If you provide wrapper functions which take String arguments,
  either they should have an encoding argument or the encoding should
  be a mutable per-terminal setting.
 
 There is already a mutable setting. It's called locale.

 It isn't a per-terminal setting.

A separate setting would force users to configure an encoding just
for the purposes of Haskell programs, as if the configuration wasn't
already too fragmented. It's unwise to propose a new standard when an
existing standard works well enough.

  It is possible for curses to be used with a terminal which doesn't
  use the locale's encoding.
 
 No, it will break under the new wide character curses API,

 Or expose the fact that the WC API is broken, depending upon your POV.

It's the only curses API which allows writing full-screen programs in
UTF-8 mode.

  Also, it's quite common to use non-standard encodings with terminals
  (e.g. codepage 437, which has graphic characters beyond the ACS_* set
  which terminfo understands).
 
 curses don't support that.

 Sure it does. You pass the appropriate bytes to waddstr() etc and they
 get sent to the terminal as-is.

It doesn't support that, and it will switch the terminal mode to the user
encoding (which is usually ISO-8859-x) at the first opportunity, e.g. after
an ACS_* macro was used, or maybe even at initialization.

curses support two families of encodings: the current locale encoding
and ACS. The locale encoding may be UTF-8 (works only with wide
character API).

 For compatibility the default locale is "C", but new programs
 which are prepared for I18N should do setlocale(LC_CTYPE, "")
 and setlocale(LC_MESSAGES, "").

 In practice, you end up continuously calling setlocale(LC_CTYPE, "")
 and setlocale(LC_CTYPE, "C"), depending upon whether the text is meant
 to be human-readable (locale-dependent) or a machine-readable format
 (locale-independent, i.e. C locale).

I wrote LC_CTYPE, not LC_ALL. LC_CTYPE doesn't affect %f formatting,
it only affects the encoding of texts emitted by gettext (including
strerror) and the meaning of isalpha, toupper etc.

 The LC_* environment variables are the parameters for the encoding.

 But they are only really parameters at the exec() level.

This is usually the right place to specify it. It's rare that they
are even set separately for the given program - usually they are
per-system or per-user.

 Once the program starts, the locale settings become global mutable
 state. I would have thought that, more than anyone else, the
 readership of this list would understand what's bad about that
 concept.

You can treat it as immutable. Just don't call setlocale with
different arguments again.

 Another problem with having a single locale: if a program isn't
 working, and you need to communicate with its developers, you will
 often have to run the program in an English locale just so that you
 will get error messages which the developers understand.

You don't need to change LC_CTYPE for that. Just set LC_MESSAGES.

 Then how would a Haskell program know what encoding to use for
 stdout messages?

 It doesn't necessarily need to. If you are using message catalogues,
 you just read bytes from the catalogue and write them to stdout.

gettext uses the locale to choose the encoding. Messages are
internally stored as UTF-8 but emitted in the locale encoding.

You are using the semantics I'm advocating without knowing that...

 How would it know how to interpret filenames for graphical
 display?

 An option menu on the file selector is one option; heuristics are
 another.

Heuristics won't distinguish various ISO-8859-x from each other.

An option menu on the file selector is user-unfriendly because users
don't want to configure it for each program separately. They want to
set it in one place and expect it to work everywhere.

Currently there are two such places: the locale, and
G_FILENAME_ENCODING (or older G_BROKEN_FILENAMES) for glib. It's
unwise to introduce yet another convention, and it would be a horrible
idea to make it per-program.

 At least Gtk-1 would attempt to display the filename; you would get
 the odd question mark but at least you could select the file;

Gtk+2 also attempts to display the filename. It can be opened
even though the filename has inconvertible characters escaped.

 The current locale mechanism is just a way of avoiding the issues
 as much as possible when you can't get away with avoiding them
 altogether.

It's a way to communicate the encoding of the terminal, filenames,
strerror, gettext etc.

 Unicode has been described (accurately, IMHO) as Esperanto for
 computers. Both use the same approach to try to solve essentially the
 same problem. And both will be about as successful in the long run.

Unicode has no viable competition.
Esperanto had English.

-- 
   __("<  Marcin Kowalczyk
   \__/    [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/

Re: [Haskell-cafe] invalid character encoding

2005-03-18 Thread Glynn Clements

Wolfgang Thaller wrote:

  If you try to pretend that I18N comes down to shoe-horning everything
  into Unicode, you will turn the language into a joke.
 
 How common will those problems you are describing be by the time this 
 has been implemented?
 How common are they even now?

Right now, GHC assumes ISO-8859-1 whenever it has to automatically
convert between String and CString. Conversions to and from ISO-8859-1
cannot fail, and encoding and decoding are exact inverses.

OK, so the intermediate string will be nonsense if ISO-8859-1 isn't
the correct encoding, but that doesn't actually matter a lot of the
time; frequently, you're just grabbing a blob of data from one
function and passing it to another.

The problems will only appear once you start dealing with fallible or
non-reversible encodings such as UTF-8 or ISO-2022. If and when that
happens, I guess we'll find out how common the problems are. Of
course, it's quite possible that the only test cases will be people
using UTF-8-only (or even ASCII-only) systems, in which case you won't
see any problems.

 I haven't yet encountered a unix box where the file names were not in 
 the system locale encoding. On all reasonably up-to-date Linux boxes 
 that I've seen recently, they were in UTF-8 (and the system locale 
 agreed).

I've encountered boxes where multiple encodings were used; primarily
web and FTP servers which were shared amongst multiple clients. Each
client used whichever encoding(s) they felt like. IIRC, the most
common non-ASCII encoding was MS-DOS codepage 850 (the clients were
mostly using Windows 3.1 at that time).

I haven't done sysadmin for a while, so I don't know the current
situation, but I don't think that the world has switched to UTF-8 in
the mean time. [Most of the non-ASCII filenames which I've seen
recently have been either ISO-8859-1 or Win-12XX; I haven't seen much
UTF-8.]

 On both Windows and Mac OS X, filenames are stored in Unicode, so it is 
 always possible to convert them to unicode.
 So we can't do Unicode-based I18N because there exist a few unix 
 systems with messed-up file systems?

Declaring such systems to be messed up won't make the problems go
away. If a design doesn't work in reality, it's the fault of the
design, not of reality.

  Haskell's Unicode support is a joke because the API designers tried to
  avoid the issues related to encoding with wishful thinking (i.e. you
  open a file and you magically get Unicode characters out of it).
 
 OK, that part is purely wishful thinking, but assuming that filenames 
 are text that can be represented in Unicode is wishful thinking that 
 corresponds to 99% of reality.
 So why can't the remaining 1 percent of reality be fixed instead?

The issue isn't whether the data can be represented as Unicode text,
but whether you can convert it to and from Unicode without problems.
To do this, you need to know the encoding, you need to store the
encoding so that you can convert the wide string back to a byte
string, and the encoding needs to be reversible.

-- 
Glynn Clements [EMAIL PROTECTED]


Re: [Haskell-cafe] invalid character encoding

2005-03-18 Thread Glynn Clements

Marcin 'Qrczak' Kowalczyk wrote:

   If you provide wrapper functions which take String arguments,
   either they should have an encoding argument or the encoding should
   be a mutable per-terminal setting.
  
  There is already a mutable setting. It's called locale.
 
  It isn't a per-terminal setting.
 
 A separate setting would force users to configure an encoding just
 for the purposes of Haskell programs, as if the configuration wasn't
 already too fragmented.

encoding <- localeEncoding
Curses.setupTerm encoding handle

Not a big deal.

 It's unwise to propose a new standard when an existing standard
 works well enough.

Existing standard? The standard curses API deals with bytes; encodings
don't come into it. AFAIK, the wide-character curses API isn't yet a
standard.

   It is possible for curses to be used with a terminal which doesn't
   use the locale's encoding.
  
  No, it will break under the new wide character curses API,
 
  Or expose the fact that the WC API is broken, depending upon your POV.
 
 It's the only curses API which allows writing full-screen programs in
 UTF-8 mode.

All the more reason to fix it.

And where does UTF-8 come into it? I would have expected it to use
wide characters throughout.

   Also, it's quite common to use non-standard encodings with terminals
   (e.g. codepage 437, which has graphic characters beyond the ACS_* set
   which terminfo understands).
  
  curses don't support that.
 
  Sure it does. You pass the appropriate bytes to waddstr() etc and they
  get sent to the terminal as-is.
 
  It doesn't support that, and it will switch the terminal mode to the user
  encoding (which is usually ISO-8859-x) at the first opportunity, e.g. after
  an ACS_* macro was used, or maybe even at initialization.
 
 curses support two families of encodings: the current locale encoding
 and ACS. The locale encoding may be UTF-8 (works only with wide
 character API).

I'm talking about standard (XSI) curses, which will just pass
printable (non-control) bytes straight to the terminal. If your
terminal uses CP437 (or some other non-standard encoding), you can
just pass the appropriate bytes to waddstr() etc and the corresponding
characters will appear on the terminal.

ACS_* codes are a completely separate issue; they allow you to use
line graphics in addition to a full 8-bit character set (e.g. 
ISO-8859-1). If you only need ASCII text, you can use the other 128
codes for graphics characters and never use the ACS_* macros or the
acsc capability.

  For compatibility the default locale is "C", but new programs
  which are prepared for I18N should do setlocale(LC_CTYPE, "")
  and setlocale(LC_MESSAGES, "").
 
  In practice, you end up continuously calling setlocale(LC_CTYPE, "")
  and setlocale(LC_CTYPE, "C"), depending upon whether the text is meant
  to be human-readable (locale-dependent) or a machine-readable format
  (locale-independent, i.e. C locale).
 
 I wrote LC_CTYPE, not LC_ALL. LC_CTYPE doesn't affect %f formatting,
 it only affects the encoding of texts emitted by gettext (including
 strerror) and the meaning of isalpha, toupper etc.

Sorry, I'm confusing two cases here. With LC_CTYPE, the main reason
for continuous switching is when using wcstombs(). printf() uses
LC_NUMERIC, which is switched between the C locale and the user's
locale.

  Once the program starts, the locale settings become global mutable
  state. I would have thought that, more than anyone else, the
  readership of this list would understand what's bad about that
  concept.
 
 You can treat it as immutable. Just don't call setlocale with
 different arguments again.

Which limits you to a single locale. If you are using the locale's
encoding, that limits you to a single encoding.

  Another problem with having a single locale: if a program isn't
  working, and you need to communicate with its developers, you will
  often have to run the program in an English locale just so that you
  will get error messages which the developers understand.
 
 You don't need to change LC_CTYPE for that. Just set LC_MESSAGES.

I'm starting to think that you're misunderstanding on purpose. Again.

The point is that a single program often generates multiple streams of
text, possibly for different audiences (e.g. humans and machines).
Different streams may require different conventions (encodings,
numeric formats, collating orders), but may use the same functions.

Those functions need to obtain the conventions from somewhere, and
that means either parameters or state.

Having dealt with state (libc's locale mechanism), I would rather have
parameters.

  Then how would a Haskell program know what encoding to use for
  stdout messages?
 
  It doesn't necessarily need to. If you are using message catalogues,
  you just read bytes from the catalogue and write them to stdout.
 
 gettext uses the locale to choose the encoding. Messages are
 internally stored as UTF-8 but emitted in the locale encoding.

It didn't use to be that 

Re: [Haskell-cafe] invalid character encoding

2005-03-18 Thread Wolfgang Thaller
Glynn Clements wrote:
OK, so the intermediate string will be nonsense if ISO-8859-1 isn't
the correct encoding, but that doesn't actually matter a lot of the
time; frequently, you're just grabbing a blob of data from one
function and passing it to another.
Yes. Of course, this also means that Strings representing non-ASCII 
filenames will *always* be nonsense on Mac OS X and other UTF8-based 
platforms.

The problems will only appear once you start dealing with fallible or
non-reversible encodings such as UTF-8 or ISO-2022.
In what way is ISO-2022 non-reversible? Is it possible that an ISO-2022 
file name that is converted to Unicode cannot be converted back any 
more (assuming you know for sure that it was ISO-2022 in the first 
place)?

Of course, it's quite possible that the only test cases will be people
using UTF-8-only (or even ASCII-only) systems, in which case you won't
see any problems.
I'm kind of hoping that we can just ignore a problem that is so rare 
that a large and well-known project like GTK2 can get away with 
ignoring it. Also, IIRC, Java strings are supposed to be unicode, too - 
how do they deal with the problem?

So we can't do Unicode-based I18N because there exist a few unix
systems with messed-up file systems?
Declaring such systems to be messed up won't make the problems go
away. If a design doesn't work in reality, it's the fault of the
design, not of reality.
In general, yes. But we're not talking about all of reality here, we're 
talking about one small part of reality - the question is, can the part 
of reality where the design doesn't work be ignored?

For example, as soon as we use any kind of path names in our APIs, we 
are ignoring reality on good old Classic Mac OS (may it rest in 
peace). Path names don't always uniquely denote a file there (although
they do most of the time). People writing cross-platform software have 
been ignoring this fact for a long time now.

I think that if we wait long enough, the filename encoding problems 
will become irrelevant and we will live in an ideal world where unicode 
actually works. Maybe next year, maybe only in ten years. And while we 
are arguing about how far we are from that ideal world, we should think 
about alternatives. The current hack is really just a hack, and I don't 
want to see this hack become the new accepted standard.

Do we have other alternatives? Preferably something that provides other 
advantages over a unicode String than just making things work on 
systems that many users never encounter, otherwise almost no one will 
bother to use it. So maybe we should start looking for _other_ reasons 
to represent file names and paths by an abstract datatype or something?

Cheers,
Wolfgang


Re: [Haskell-cafe] invalid character encoding

2005-03-17 Thread Ian Lynagh
On Thu, Mar 17, 2005 at 06:22:25AM +0000, Ian Lynagh wrote:
 
 [in brief: hugs' (hPutStr h) now behaves differently to
  (mapM_ (hPutChar h)), and ghc writes the empty string for both when
  told to write \128]

Ah, Malcolm's commit messages have just reminded me of the finaliser
changes requiring hflushes in new ghc, so it's just the hugs output that
confuses me now.


Thanks
Ian



Re: [Haskell-cafe] invalid character encoding

2005-03-17 Thread Ross Paterson
On Thu, Mar 17, 2005 at 06:22:25AM +0000, Ian Lynagh wrote:
 On Tue, Mar 15, 2005 at 10:44:28AM +0000, Ross Paterson wrote:
  You can select binary I/O using the openBinaryFile and hSetBinaryMode
  functions from System.IO.  After that, the Chars you get from that Handle
  are actually bytes.
 
 What about the ones sent to it?
 Are all the following results intentional?
 Am I doing something stupid?

No, I was.  Output primitives other than hPutChar were ignoring binary
mode (and Hugs has more of these things as primitives than GHC does).
Now fixed in CVS (rev. 1.95 of src/char.c).


Re: [Haskell-cafe] invalid character encoding

2005-03-17 Thread Ross Paterson
On Thu, Mar 17, 2005 at 06:22:25AM +0000, Ian Lynagh wrote:
 Incidentally, make check in CVS hugs said:
 
 cd tests && sh testScript | egrep -v '^--( |-)'
 ./../src/hugs +q -w -pHugs: static/mod154.hs < /dev/null
 expected stdout not matched by reality
 *** static/Loaded.output  Fri Jul 19 22:41:51 2002
 --- /tmp/runtest11949.3  Thu Mar 17 05:46:05 2005
 ***************
 *** 1,2 ****
 ! Type :? for help
   Hugs:[Leaving Hugs]
 --- 1,3 ----
 ! ERROR "static/mod154.hs" - Conflicting exports of entity "sort"
 ! *** Could refer to Data.List.sort or M.sort
   Hugs:[Leaving Hugs]

This is a documented bug (though the notes in tests ought to mention
this too).


Re: [Haskell-cafe] invalid character encoding

2005-03-17 Thread Glynn Clements

Marcin 'Qrczak' Kowalczyk wrote:

 Glynn Clements [EMAIL PROTECTED] writes:
 
  It should be possible to specify the encoding explicitly.
 
  Conversely, it shouldn't be possible to avoid specifying the
  encoding explicitly.
 
 What encoding should a binding to readline or curses use?
 
 Curses in C comes in two flavors: the traditional byte version and a
 wide character version. The second version is easy if we can assume
 that wchar_t is Unicode, but it's not always available and until
 recently in ncurses it was buggy. Let's assume we are using the byte
 version. How to encode strings?

The (non-wchar) curses API functions take byte strings (char*), so the
Haskell bindings should take CString or [Word8] arguments. If you
provide wrapper functions which take String arguments, either they
should have an encoding argument or the encoding should be a mutable
per-terminal setting.

 A terminal uses an ASCII-compatible encoding. Wide character version
 of curses convert characters to the locale encoding, and byte version
 passes bytes unchanged. This means that if a Haskell binding to the
 wide character version does the obvious thing and passes Unicode
 directly, then an equivalent behavior can be obtained from the byte
 version (only limited to 256-character encodings) by using the locale
 encoding.

I don't know enough about the wchar version of curses to comment on
that.

I do know that, to work reliably, the normal (byte) version of curses
needs to pass printable bytes through unmodified.

It is possible for curses to be used with a terminal which doesn't use
the locale's encoding. Specifically, a single process may use curses
with multiple terminals with differing encodings, e.g. an airport
public information system displaying information in multiple
languages.

Also, it's quite common to use non-standard encodings with terminals
(e.g. codepage 437, which has graphic characters beyond the ACS_* set
which terminfo understands).

 The locale encoding is the right encoding to use for conversion of the
 result of strerror, gai_strerror, msg member of gzip compressor state
 etc. When an I/O error occurs and the error code is translated to a
 Haskell exception and then shown to the user, why would the application
 need to specify the encoding and how?

Because the application may be using multiple locales/encodings.
Having had to do this in C (i.e. repeatedly calling setlocale() to
select the correct encoding), I would much prefer to have been able to
pass the locale as a parameter.

[The most common example is printf("%f"). You need to use the C locale
(decimal point) for machine-readable text but the user's locale
(locale-specific decimal separator) for human-readable text. This
isn't directly related to encodings per se, but a good example of why
parameters are preferable to state.]

  If application code doesn't want to use the locale's encoding, it
  shouldn't be shoe-horned into doing so because a library developer
  decided to duck the encoding issues by grabbing whatever encoding
  was readily to hand (i.e. the locale's encoding).
 
 If a C library is written with the assumption that texts are in the
 locale encoding, a Haskell binding to such library should respect that
 assumption.

C libraries which use the locale do so as a last resort. K&R C
completely ignored I18N issues. ANSI C added the locale mechanism as
a hack to provide minimal I18N support while maintaining backward
compatibility, in a minimally-intrusive manner.

The only reason that the C locale mechanism isn't a major nuisance is
that you can largely ignore it altogether. Code which requires real
I18N can use other mechanisms, and code which doesn't require any I18N
can just pass byte strings around and leave encoding issues to code
which actually has enough context to handle them correctly.

 Only some libraries allow working with different, explicitly specified
 encodings. Many libraries don't, especially if the texts are not the
 core of the library functionality but error messages.

And most such libraries just treat text as byte strings. They don't
care about their interpretation, or even whether or not they are valid
in the locale's encoding.

-- 
Glynn Clements [EMAIL PROTECTED]
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] invalid character encoding

2005-03-17 Thread Glynn Clements

John Meacham wrote:

It doesn't affect functions added by the hierarchical libraries,
i.e. those functions are safe only with the ASCII subset. (There is
a vague plan to make Foreign.C.String conform to the FFI spec,
which mandates locale-based encoding, and thus would change all
those, but it's still up in the air.)
   
Hmm. I'm not convinced that automatically converting to the current
locale is the ideal behaviour (it'd certainly break all my programs!).
Certainly a function for converting into the encoding of the current
locale would be useful for many users but it's important to be able to
know the encoding with certainty.
   
   It should only be the default, not the only option.
  
  I'm not sure that it should be available at all.
  
   It should be possible to specify the encoding explicitly.
  
  Conversely, it shouldn't be possible to avoid specifying the encoding
  explicitly.
  
  Personally, I wouldn't provide an all-in-one "convert String to
  CString using locale's encoding" function, just in case anyone was
  tempted to actually use it.
 
 But this is exactly what is needed for most C library bindings.

I very much doubt that "most" is accurate.

C functions which take a char* fall into three main cases:

1. Unspecified encoding, i.e. it's a string of bytes, not characters.

2. Locale's encoding, as determined by nl_langinfo(CODESET);
essentially, whatever was set with setlocale(LC_CTYPE), defaulting to
C/POSIX if setlocale() hasn't been called.

3. Fixed encoding, e.g. UTF-8, ISO-2022, US-ASCII (or EBCDIC on IBM
mainframes).

Historically, library functions have tended to fall into category 1
unless they *need* to know the interpretation of a given byte or
sequence of bytes (e.g. ctype.h), in which case they fall into
category 2. Most of libc falls into category 1, with a minority of
functions in category 2.
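
strlen is a typical category-1 function; a binding which respects that
takes bytes, not Chars (untested sketch):

import Data.Word (Word8)
import Foreign.C.String (CString)
import Foreign.C.Types (CSize(..))
import Foreign.Marshal.Array (withArray0)
import Foreign.Ptr (castPtr)

-- strlen counts bytes up to the NUL; it never needs to know what
-- characters (if any) those bytes represent.
foreign import ccall unsafe "string.h strlen"
  c_strlen :: CString -> IO CSize

byteLength :: [Word8] -> IO CSize
byteLength bs = withArray0 0 bs (c_strlen . castPtr)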

Code which is designed to handle multiple languages simultaneously is
more likely to fall into category 3, using one of the universal
encodings (typically ISO-2022 in southeast Asia and UTF-8 elsewhere).

E.g. Gtk-2.x uses UTF-8 almost exclusively, although you can force the
use of the locale's encoding for filenames (if you have filenames in
multiple encodings, you lose; filenames using the wrong encoding
simply don't appear in file selectors).

-- 
Glynn Clements [EMAIL PROTECTED]
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] invalid character encoding

2005-03-17 Thread Marcin 'Qrczak' Kowalczyk
Glynn Clements [EMAIL PROTECTED] writes:

 The (non-wchar) curses API functions take byte strings (char*),
 so the Haskell bindings should take CString or [Word8] arguments.

Programmers will not want to use such an interface. When they want to
display a string, it will be in the Haskell String type.

And it prevents having a single Haskell interface which uses either
the narrow or wide version of curses interface, depending on what is
available.

 If you provide wrapper functions which take String arguments,
 either they should have an encoding argument or the encoding should
 be a mutable per-terminal setting.

There is already a mutable setting. It's called locale.

 I don't know enough about the wchar version of curses to comment on
 that.

It uses wcsrtombs or equivalents to display characters. And the
reverse to interpret keystrokes.

 It is possible for curses to be used with a terminal which doesn't
 use the locale's encoding.

No, it will break under the new wide character curses API, and it will
confuse programs which use the old narrow character API.

The user (or the administrator) is responsible for matching the locale
encoding with the terminal encoding.

 Also, it's quite common to use non-standard encodings with terminals
 (e.g. codepage 437, which has graphic characters beyond the ACS_* set
 which terminfo understands).

curses doesn't support that.

 The locale encoding is the right encoding to use for conversion of the
 result of strerror, gai_strerror, msg member of gzip compressor state
 etc. When an I/O error occurs and the error code is translated to a
 Haskell exception and then shown to the user, why would the application
 need to specify the encoding and how?

 Because the application may be using multiple locales/encodings.

But strerror always returns messages in the locale encoding.
Just like Gtk+2 always accepts texts in UTF-8.

For compatibility the default locale is C, but new programs
which are prepared for I18N should do setlocale(LC_CTYPE, "")
and setlocale(LC_MESSAGES, "").

There are places where the encoding is settable independently,
or stored explicitly. For them Haskell should have withCString /
peekCString / etc. with an explicit encoding. And there are
places which use the locale encoding instead of having a separate
switch.
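
I mean something of this shape (just a sketch, not a concrete
proposal; the Encoding record is invented here):

import Data.Word (Word8)
import Foreign.C.String (CString)
import Foreign.Marshal.Array (peekArray0, withArray0)
import Foreign.Ptr (castPtr)

-- An encoding bundles both directions; a real design would also
-- carry error behaviour for unencodable or malformed input.
data Encoding = Encoding
  { encode :: String -> [Word8]
  , decode :: [Word8] -> String
  }

withCStringEnc :: Encoding -> String -> (CString -> IO a) -> IO a
withCStringEnc e s act = withArray0 0 (encode e s) (act . castPtr)

peekCStringEnc :: Encoding -> CString -> IO String
peekCStringEnc e p = fmap (decode e) (peekArray0 0 (castPtr p))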

 [The most common example is printf("%f"). You need to use the C
 locale (decimal point) for machine-readable text but the user's
 locale (locale-specific decimal separator) for human-readable text.

This is a different thing, and it is what IMHO C did wrong.

 This isn't directly related to encodings per se, but a good example
 of why parameters are preferable to state.]

The LC_* environment variables are the parameters for the encoding.
There is no other convention to pass the encoding to be used for
textual output to stdout for example.

 C libraries which use the locale do so as a last resort.

No, they do it by default.

 The only reason that the C locale mechanism isn't a major nuisance
 is that you can largely ignore it altogether.

Then how would a Haskell program know what encoding to use for stdout
messages? How would it know how to interpret filenames for graphical
display?

Do you want to invent a separate mechanism for communicating that, so
that an administrator has to set up a dozen of environment variables
and teach each program separately about the encoding it should assume
by default? We had this mess 10 years ago, and parts of it are still
alive today - you must sometimes configure xterm or Emacs
separately, but it's becoming more common for programs to use the
system-supplied setting and not need separate configuration.

 Code which requires real I18N can use other mechanisms, and code
 which doesn't require any I18N can just pass byte strings around and
 leave encoding issues to code which actually has enough context to
 handle them correctly.

Haskell can't just pass byte strings around without turning the
Unicode support into a joke (which it is now).

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] invalid character encoding

2005-03-17 Thread Marcin 'Qrczak' Kowalczyk
Glynn Clements [EMAIL PROTECTED] writes:

 E.g. Gtk-2.x uses UTF-8 almost exclusively, although you can force the
 use of the locale's encoding for filenames (if you have filenames in
 multiple encodings, you lose; filenames using the wrong encoding
 simply don't appear in file selectors).

Actually they do appear, even though you can't type their names
from the keyboard. The name shown in the GUI used to be escaped in
different ways by different programs or even different places in one
program (question marks, %hex escapes, \oct escapes), but recently
they added some functions to glib to make the behavior uniform.

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] invalid character encoding

2005-03-17 Thread Keean Schupke
I cannot help feeling that all this multi-language support is a mess.
All strings should be coded in a universal encoding (like UTF8) so that
the code for a character is the same independent of locale.

It seems stupid that the locale affects the character encodings... the 
code for an 'a' should be the same all over the world... as should the 
code for a particular japanese character.

In other words the locale should have no effect on character encodings;
it should select between multi-lingual error messages which are supplied
as distinct strings for each region.

While we may have to inter-operate with 'C' code, we could have a 
Haskell library that does things properly.

   Keean.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] invalid character encoding

2005-03-17 Thread Glynn Clements

Marcin 'Qrczak' Kowalczyk wrote:

  If you provide wrapper functions which take String arguments,
  either they should have an encoding argument or the encoding should
  be a mutable per-terminal setting.
 
 There is already a mutable setting. It's called locale.

It isn't a per-terminal setting.

  It is possible for curses to be used with a terminal which doesn't
  use the locale's encoding.
 
 No, it will break under the new wide character curses API,

Or expose the fact that the WC API is broken, depending upon your POV.

 and it will confuse programs which use the old narrow character API.

It has no effect on the *byte* API. Characters don't come into it.

 The user (or the administrator) is responsible for matching the locale
 encoding with the terminal encoding.

Which is rather hard to do if you have multiple encodings.

  Also, it's quite common to use non-standard encodings with terminals
  (e.g. codepage 437, which has graphic characters beyond the ACS_* set
  which terminfo understands).
 
 curses don't support that.

Sure it does. You pass the appropriate bytes to waddstr() etc and they
get sent to the terminal as-is. Curses doesn't have ACS_* macros for
those characters, but it doesn't mean that you can't use them.

  The locale encoding is the right encoding to use for conversion of the
  result of strerror, gai_strerror, msg member of gzip compressor state
  etc. When an I/O error occurs and the error code is translated to a
  Haskell exception and then shown to the user, why would the application
  need to specify the encoding and how?
 
  Because the application may be using multiple locales/encodings.
 
 But strerror always returns messages in the locale encoding.

Sorry, I misread that paragraph. I replied to "why would ..." without
thinking about the context.

When you know that a string is in the locale's encoding, you need to
use it for the conversion. In that case you need to do the conversion
(or at least record the actual encoding) immediately, in case the
locale gets switched.
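
For strerror that means converting at the call site, e.g. (sketch;
peekCString here stands for whatever locale-aware decoder is adopted,
and copying immediately also matters because strerror's buffer may be
reused by the next call):

import Foreign.C.String (CString, peekCString)
import Foreign.C.Types (CInt(..))

foreign import ccall unsafe "string.h strerror"
  c_strerror :: CInt -> IO CString

-- Decode while the locale that produced the message is still current.
errnoMessage :: CInt -> IO String
errnoMessage e = c_strerror e >>= peekCString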

 Just like Gtk+2 always accepts texts in UTF-8.

Unfortunately. The text probably originated in an encoding other than
UTF-8, and will probably end up getting displayed using a font which
is indexed using the original encoding (rather than e.g. UCS-2/4). 
Converting to Unicode then back again just introduces the potential
for errors. [Particularly for CJK where, due to Han unification,
Chinese characters may mutate into Japanese characters, or vice-versa. 
Fortunately, that doesn't seem to have started any wars. Yet.]

 For compatibility the default locale is C, but new programs
 which are prepared for I18N should do setlocale(LC_CTYPE, "")
 and setlocale(LC_MESSAGES, "").

In practice, you end up continuously calling setlocale(LC_CTYPE, "")
and setlocale(LC_CTYPE, "C"), depending upon whether the text is meant
to be human-readable (locale-dependent) or a machine-readable format
(locale-independent, i.e. C locale).
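
From Haskell the dance looks even worse, because it has to go through
the FFI (sketch; 0 is glibc's value of LC_CTYPE, so even the constant
is platform-specific, and error handling is omitted):

import Control.Exception (bracket)
import Foreign.C.String (CString, peekCString, withCString)
import Foreign.C.Types (CInt(..))
import Foreign.Ptr (nullPtr)

foreign import ccall unsafe "locale.h setlocale"
  c_setlocale :: CInt -> CString -> IO CString

-- Run an action with LC_CTYPE temporarily set to "C", restoring the
-- previous setting afterwards: global state, so not even thread-safe.
withCLocale :: IO a -> IO a
withCLocale act = bracket save restore (const act)
  where
    save = do
      old <- c_setlocale 0 nullPtr >>= peekCString  -- query current
      _ <- withCString "C" (c_setlocale 0)
      return old
    restore old = withCString old (c_setlocale 0)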

  [The most common example is printf("%f"). You need to use the C
  locale (decimal point) for machine-readable text but the user's
  locale (locale-specific decimal separator) for human-readable text.
 
 This is a different thing, and it is what IMHO C did wrong.

It's a different example of the same problem. I agree that C did it
wrong; I'm objecting to the implication that Haskell should make the
same mistakes.

  This isn't directly related to encodings per se, but a good example
  of why parameters are preferable to state.]
 
 The LC_* environment variables are the parameters for the encoding.

But they are only really parameters at the exec() level.

Once the program starts, the locale settings become global mutable
state. I would have thought that, more than anyone else, the
readership of this list would understand what's bad about that
concept.

 There is no other convention to pass the encoding to be used for
 textual output to stdout for example.

That's up to the application. Environment variables are a convenience;
there's no reason why you can't have a command-line switch to select
the encoding. For more complex applications, you often have
user-selectable options and/or encodings specified in the data which
you handle.

Another problem with having a single locale: if a program isn't
working, and you need to communicate with its developers, you will
often have to run the program in an English locale just so that you
will get error messages which the developers understand.

  C libraries which use the locale do so as a last resort.
 
 No, they do it by default.

By default, libc uses the C locale. setlocale() includes a convenience
option to use the LC_* variables. Other libraries may or may not use
the locale settings, and plenty of code will misbehave if the locale
is wrong (e.g. using fprintf("%f") without explicitly setting the C
locale first will do the wrong thing if you're trying to generate
VRML/DXF/whatever files).

Beyond that, libc uses the locale mechanism because it was the

Re: [Haskell-cafe] invalid character encoding

2005-03-17 Thread Glynn Clements

Marcin 'Qrczak' Kowalczyk wrote:

  E.g. Gtk-2.x uses UTF-8 almost exclusively, although you can force the
  use of the locale's encoding for filenames (if you have filenames in
  multiple encodings, you lose; filenames using the wrong encoding
  simply don't appear in file selectors).
 
 Actually they do appear, even though you can't type their names
 from the keyboard. The name shown in the GUI used to be escaped in
 different ways by different programs or even different places in one
 program (question marks, %hex escapes \oct escapes), but recently
 they added some functions to glib to make the behavior uniform.

In the last version of Gtk-2.x which I tried, invalid filenames are
just omitted from the list. Gtk-1.x displayed them (I think with
question marks, but it may have been a box).

I've just tried with a more recent version (2.6.2); the default
behaviour is similar, although you can now get around the issue by
using G_FILENAME_ENCODING=ISO-8859-1. Of course, if your locale is
a long way from ISO-8859-1, that isn't a particularly good solution.

The best test case would be a system used predominantly by Japanese,
where (apparently) it's common to have a mixture of both EUC-JP and
Shift-JIS filenames (occasionally wrapped in ISO-2022, but usually
raw).

-- 
Glynn Clements [EMAIL PROTECTED]
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


RE: [Haskell-cafe] invalid character encoding

2005-03-16 Thread Simon Marlow
On 16 March 2005 03:54, Ian Lynagh wrote:

 On Tue, Mar 15, 2005 at 10:44:28AM +, Ross Paterson wrote:
 On Mon, Mar 14, 2005 at 07:38:09PM -0600, John Goerzen wrote:
 I've got some gzip (and Ian Lynagh's Inflate) code that breaks
 under the new hugs with: 
 
  handle: IO.getContents: protocol error (invalid character encoding)
 
 What is going on, and how can I fix it?
 
 A Haskell 98 Handle is a character stream, and doesn't support binary
 I/O.  This would have bitten you sooner or later on systems that do
 CRLF conversion, but Hugs is now much stricter, because character
 streams now use the encoding determined by the current locale (for
 the C locale, that means ASCII only).
 
 Do you have a list of functions which behave differently in the new
 release to how they did in the previous release?
 (I'm not interested in changes that will affect only whether something
 compiles, not how it behaves given it compiles both before and after).
 
 Simons, Malcolm, are there any such functions in the new ghc/nhc98?
 
 Also, are you all agreed that the hugs interpretation of the report is
 correct, and thus ghc at least is buggy in this respect? (I'm afraid I
 haven't been able to test nhc98 yet).

GHC (and nhc98) assumes a locale of ISO8859-1 for I/O.  You could
consider that to be a bug, I suppose.  We don't plan to do anything
about it in the context of the current IO library, at least.

Cheers,
Simon
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] invalid character encoding

2005-03-16 Thread Ross Paterson
On Wed, Mar 16, 2005 at 03:54:19AM +, Ian Lynagh wrote:
 Do you have a list of functions which behave differently in the new
 release to how they did in the previous release?
 (I'm not interested in changes that will affect only whether something
 compiles, not how it behaves given it compiles both before and after).

I got lost in the negatives here.  It affects all Haskell 98 primitives
that do character I/O, or that exchange C strings with the C library.

It doesn't affect functions added by the hierarchical libraries, i.e.
those functions are safe only with the ASCII subset.  (There is a vague
plan to make Foreign.C.String conform to the FFI spec, which mandates
locale-based encoding, and thus would change all those, but it's still
up in the air.)

 Finally, the hugs behaviour seems a little odd to me. The below shows 4
 cases where iconv complains when asked to convert utf8 to utf8, but hugs
 only gives an error in one of them. In the others it just truncates the
 input. Is this really correct? It also seems to behave the same for me
 regardless of whether I export LC_CTYPE to en_GB.UTF-8 or C.

It's a bug: an unrecognized encoding at the end of the input was being
ignored instead of triggering the exception.  Now fixed in CVS
(rev. 1.14 of src/char.c if anyone's backporting).  It was an accident
of this example that the behaviour in all locales was the same.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] invalid character encoding

2005-03-16 Thread Duncan Coutts
On Wed, 2005-03-16 at 11:55 +, Ross Paterson wrote:
 On Wed, Mar 16, 2005 at 03:54:19AM +, Ian Lynagh wrote:
  Do you have a list of functions which behave differently in the new
  release to how they did in the previous release?
  (I'm not interested in changes that will affect only whether something
  compiles, not how it behaves given it compiles both before and after).
 
 I got lost in the negatives here.  It affects all Haskell 98 primitives
 that do character I/O, or that exchange C strings with the C library.
 
 It doesn't affect functions added by the hierarchical libraries, i.e.
 those functions are safe only with the ASCII subset.  (There is a vague
 plan to make Foreign.C.String conform to the FFI spec, which mandates
 locale-based encoding, and thus would change all those, but it's still
 up in the air.)

Hmm. I'm not convinced that automatically converting to the current
locale is the ideal behaviour (it'd certainly break all my programs!).
Certainly a function for converting into the encoding of the current
locale would be useful for many users but it's important to be able to
know the encoding with certainty. For example some libraries (eg Gtk+)
take all strings in UTF-8 irrespective of the current locale (it does
locale-dependent conversions on IO etc but the internal representation
is always UTF8). We do the conversion to UTF8 on the Haskell side and so
produce a byte string which we marshal using the FFI CString functions. 

If the implementations get fixed to conform to the FFI spec, I suppose
we could roll our own version of withCString that marshals [Word8] ->
char*.
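
i.e. something like this (untested sketch):

import Data.Word (Word8)
import Foreign.C.String (CString)
import Foreign.Marshal.Array (withArray0)
import Foreign.Ptr (castPtr)

-- Marshal bytes we have already encoded (e.g. to UTF-8 for Gtk+) as
-- a NUL-terminated char*, with no locale conversion anywhere.
withBytesAsCString :: [Word8] -> (CString -> IO a) -> IO a
withBytesAsCString bs act = withArray0 0 bs (act . castPtr)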

Duncan

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] invalid character encoding

2005-03-16 Thread Duncan Coutts
On Wed, 2005-03-16 at 13:09 +, Duncan Coutts wrote:
 On Wed, 2005-03-16 at 11:55 +, Ross Paterson wrote:

  It doesn't affect functions added by the hierarchical libraries, i.e.
  those functions are safe only with the ASCII subset.  (There is a vague
  plan to make Foreign.C.String conform to the FFI spec, which mandates
  locale-based encoding, and thus would change all those, but it's still
  up in the air.)
 
 Hmm. I'm not convinced that automatically converting to the current
 locale is the ideal behaviour (it'd certainly break all my programs!).
 Certainly a function for converting into the encoding of the current
 locale would be useful for many users but it's important to be able to
 know the encoding with certainty. For example some libraries (eg Gtk+)
 take all strings in UTF-8 irrespective of the current locale (it does
 locale-dependent conversions on IO etc but the internal representation
 is always UTF8). We do the conversion to UTF8 on the Haskell side and so
 produce a byte string which we marshal using the FFI CString functions. 

Silly me! There are C marshaling functions that are specified to do just
this but I never noticed them before!

withCAString and similar functions treat Haskell Strings as byte
strings.

Duncan

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] invalid character encoding

2005-03-16 Thread Marcin 'Qrczak' Kowalczyk
Duncan Coutts [EMAIL PROTECTED] writes:

 It doesn't affect functions added by the hierarchical libraries,
 i.e. those functions are safe only with the ASCII subset. (There is
 a vague plan to make Foreign.C.String conform to the FFI spec,
 which mandates locale-based encoding, and thus would change all
 those, but it's still up in the air.)

 Hmm. I'm not convinced that automatically converting to the current
 locale is the ideal behaviour (it'd certainly break all my programs!).
 Certainly a function for converting into the encoding of the current
 locale would be useful for many users but it's important to be able to
 know the encoding with certainty.

It should only be the default, not the only option. It should be
possible to specify the encoding explicitly.

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] invalid character encoding

2005-03-16 Thread Glynn Clements

Marcin 'Qrczak' Kowalczyk wrote:

  It doesn't affect functions added by the hierarchical libraries,
  i.e. those functions are safe only with the ASCII subset. (There is
  a vague plan to make Foreign.C.String conform to the FFI spec,
  which mandates locale-based encoding, and thus would change all
  those, but it's still up in the air.)
 
  Hmm. I'm not convinced that automatically converting to the current
   locale is the ideal behaviour (it'd certainly break all my programs!).
   Certainly a function for converting into the encoding of the current
   locale would be useful for many users but it's important to be able to
  know the encoding with certainty.
 
 It should only be the default, not the only option.

I'm not sure that it should be available at all.

 It should be possible to specify the encoding explicitly.

Conversely, it shouldn't be possible to avoid specifying the encoding
explicitly.

Personally, I wouldn't provide an all-in-one "convert String to
CString using locale's encoding" function, just in case anyone was
tempted to actually use it.

The decision as to the encoding belongs in application code; not in
(most) libraries, and definitely not in the language.

[Libraries dealing with file formats or communication protocols which
mandate a specific encoding are an exception. But they will be using a
fixed encoding, not the locale's encoding.]

If application code chooses to use the locale's encoding, it can
retrieve it then pass it as the encoding argument to any applicable
functions.
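
Retrieving it is simple enough via the FFI (sketch; 14 is glibc's
value of CODESET, so the constant is platform-specific and should
really come from the header):

import Foreign.C.String (CString, peekCAString)
import Foreign.C.Types (CInt(..))

foreign import ccall unsafe "langinfo.h nl_langinfo"
  c_nl_langinfo :: CInt -> IO CString

-- Returns e.g. "UTF-8", "ISO-8859-1" or "ANSI_X3.4-1968".
localeEncodingName :: IO String
localeEncodingName = c_nl_langinfo 14 >>= peekCAString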

If application code doesn't want to use the locale's encoding, it
shouldn't be shoe-horned into doing so because a library developer
decided to duck the encoding issues by grabbing whatever encoding was
readily to hand (i.e. the locale's encoding).

-- 
Glynn Clements [EMAIL PROTECTED]
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] invalid character encoding

2005-03-16 Thread Marcin 'Qrczak' Kowalczyk
Glynn Clements [EMAIL PROTECTED] writes:

 It should be possible to specify the encoding explicitly.

 Conversely, it shouldn't be possible to avoid specifying the
 encoding explicitly.

What encoding should a binding to readline or curses use?

Curses in C comes in two flavors: the traditional byte version and a
wide character version. The second version is easy if we can assume
that wchar_t is Unicode, but it's not always available and until
recently in ncurses it was buggy. Let's assume we are using the byte
version. How to encode strings?

A terminal uses an ASCII-compatible encoding. The wide character
version of curses converts characters to the locale encoding, and the
byte version passes bytes unchanged. This means that if a Haskell
binding to the wide character version does the obvious thing and
passes Unicode directly, then an equivalent behavior can be obtained
from the byte version (only limited to 256-character encodings) by
using the locale encoding.

The locale encoding is the right encoding to use for conversion of the
result of strerror, gai_strerror, msg member of gzip compressor state
etc. When an I/O error occurs and the error code is translated to a
Haskell exception and then shown to the user, why would the application
need to specify the encoding and how?

 If application code doesn't want to use the locale's encoding, it
 shouldn't be shoe-horned into doing so because a library developer
 decided to duck the encoding issues by grabbing whatever encoding
 was readily to hand (i.e. the locale's encoding).

If a C library is written with the assumption that texts are in the
locale encoding, a Haskell binding to such library should respect that
assumption.

Only some libraries allow working with different, explicitly specified
encodings. Many libraries don't, especially if the texts are not the
core of the library functionality but error messages.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] invalid character encoding

2005-03-16 Thread John Meacham
On Wed, Mar 16, 2005 at 05:13:25PM +, Glynn Clements wrote:
 
 Marcin 'Qrczak' Kowalczyk wrote:
 
   It doesn't affect functions added by the hierarchical libraries,
   i.e. those functions are safe only with the ASCII subset. (There is
   a vague plan to make Foreign.C.String conform to the FFI spec,
   which mandates locale-based encoding, and thus would change all
   those, but it's still up in the air.)
  
   Hmm. I'm not convinced that automatically converting to the current
   locale is the ideal behaviour (it'd certainly break all my programs!).
   Certainly a function for converting into the encoding of the current
   locale would be useful for many users but it's important to be able to
   know the encoding with certainty.
  
  It should only be the default, not the only option.
 
 I'm not sure that it should be available at all.
 
  It should be possible to specify the encoding explicitly.
 
 Conversely, it shouldn't be possible to avoid specifying the encoding
 explicitly.
 
  Personally, I wouldn't provide an all-in-one "convert String to
  CString using locale's encoding" function, just in case anyone was
  tempted to actually use it.

But this is exactly what is needed for most C library bindings. Which is
why I had to write my own and proposed it to the FFI. Most C libraries
expect char * to be in the standard encoding of the current locale.
When a binding explicitly uses another encoding, then great,  we can use
different marshaling functions. In any case, we need tools to be able to
conform to the common cases of ascii-only (withCAString) and current
locale (withCString).

withUTF8String would be a nice addition, but is much less important to
come standard as it can easily be written by end users, unlike locale
specific versions which are necessarily system dependent.
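
For example, an end user could write it along these lines (untested
sketch; surrogates and other error cases are ignored):

import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)
import Foreign.C.String (CString)
import Foreign.Marshal.Array (withArray0)
import Foreign.Ptr (castPtr)

utf8Encode :: String -> [Word8]
utf8Encode = concatMap (enc . ord)
  where
    enc c
      | c < 0x80    = [fromIntegral c]
      | c < 0x800   = [0xC0 .|. hi 6,  lo 0]
      | c < 0x10000 = [0xE0 .|. hi 12, lo 6, lo 0]
      | otherwise   = [0xF0 .|. hi 18, lo 12, lo 6, lo 0]
      where
        hi n = fromIntegral (c `shiftR` n)            -- leading bits
        lo n = 0x80 .|. (fromIntegral (c `shiftR` n) .&. 0x3F)

withUTF8String :: String -> (CString -> IO a) -> IO a
withUTF8String s act = withArray0 0 (utf8Encode s) (act . castPtr)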

John

-- 
John Meacham - ⑆repetae.net⑆john
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] invalid character encoding

2005-03-16 Thread Marcin 'Qrczak' Kowalczyk
John Meacham [EMAIL PROTECTED] writes:

 In any case, we need tools to be able to conform to the common cases
 of ascii-only (withCAString) and current locale (withCString).

 withUTF8String would be a nice addition, but is much less important to
 come standard as it can easily be written by end users, unlike locale
 specific versions which are necessarily system dependent.

IMHO the encoding should be a parameter of an extended variant of
withCString (and peekCString etc.).

We need a framework for implementing encoders/decoders first.
A problem with designing the framework is that it should support
both pure Haskell conversions and C functions like iconv which work
on arrays. We must also provide a way to signal errors.

A bonus is a way to handle errors coming from another recoder without
causing it to fail completely. That way one could add a fallback for
unrepresentable characters, e.g. HTML entities or approximations with
stripped accents.
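
As a crude sketch of the per-character half of this (the array/iconv
side and incremental input are omitted; all the names are invented):

import Data.Char (ord)
import Data.Word (Word8)

-- An encoder may fail per character instead of failing outright...
type CharEncoder = Char -> Maybe [Word8]

-- ...so a fallback policy composes from the outside.
encodeWith :: (Char -> [Word8]) -> CharEncoder -> String -> [Word8]
encodeWith fallback enc = concatMap step
  where step c = maybe (fallback c) id (enc c)

-- One possible fallback: numeric HTML entities, as mentioned above.
htmlEntity :: Char -> [Word8]
htmlEntity c = map (fromIntegral . ord) ("&#" ++ show (ord c) ++ ";")

latin1 :: CharEncoder
latin1 c
  | ord c < 0x100 = Just [fromIntegral (ord c)]
  | otherwise     = Nothing

Then encodeWith htmlEntity latin1 degrades gracefully instead of
raising an exception on the first unencodable character.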

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] invalid character encoding

2005-03-16 Thread Ian Lynagh
On Tue, Mar 15, 2005 at 10:44:28AM +, Ross Paterson wrote:
 
 You can select binary I/O using the openBinaryFile and hSetBinaryMode
 functions from System.IO.  After that, the Chars you get from that Handle
 are actually bytes.

What about the ones sent to it?
Are all the following results intentional?
Am I doing something stupid?

[in brief: hugs' (hPutStr h) now behaves differently to
 (mapM_ (hPutChar h)), and ghc writes the empty string for both when
 told to write "\128"]

Running the following with new ghc 6.4 and hugs 20050308 or 20050317:

echo 'import System.IO; import System.Environment; main = do [o] <- getArgs; ho <- openBinaryFile o WriteMode; hPutStr ho "\128"' > run1.hs
echo 'import System.IO; import System.Environment; main = do [o] <- getArgs; ho <- openBinaryFile o WriteMode; mapM_ (hPutChar ho) "\128"' > run2.hs
runhugs run1.hs hugs1
runhugs run2.hs hugs2
runghc run1.hs ghc1
runghc run2.hs ghc2
ls -l hugs1 hugs2 ghc1 ghc2
for f in hugs1 hugs2 ghc1 ghc2; do echo $f; hexdump -C $f; done

gives:

-rw-r--r--  1 igloo igloo 0 Mar 17 06:15 ghc1
-rw-r--r--  1 igloo igloo 0 Mar 17 06:15 ghc2
-rw-r--r--  1 igloo igloo 1 Mar 17 06:15 hugs1
-rw-r--r--  1 igloo igloo 1 Mar 17 06:15 hugs2
hugs1
00000000  3f                                                |?|
00000001
hugs2
00000000  80                                                |.|
00000001
ghc1
ghc2

With ghc 6.2.2 and hugs November 2003 I get:

-rw-r--r--  1 igloo igloo 1 Mar 17 06:16 ghc1
-rw-r--r--  1 igloo igloo 1 Mar 17 06:16 ghc2
-rw-r--r--  1 igloo igloo 1 Mar 17 06:16 hugs1
-rw-r--r--  1 igloo igloo 1 Mar 17 06:16 hugs2
hugs1
00000000  80                                                |.|
00000001
hugs2
00000000  80                                                |.|
00000001
ghc1
00000000  80                                                |.|
00000001
ghc2
00000000  80                                                |.|
00000001


Incidentally, make check in CVS hugs said:

cd tests && sh testScript | egrep -v '^--( |-)'
./../src/hugs +q -w -pHugs: static/mod154.hs > /dev/null
expected stdout not matched by reality
*** static/Loaded.output  Fri Jul 19 22:41:51 2002
--- /tmp/runtest11949.3   Thu Mar 17 05:46:05 2005
***************
*** 1,2 ****
! Type :? for help
  Hugs:[Leaving Hugs]
--- 1,3 ----
! ERROR "static/mod154.hs" - Conflicting exports of entity "sort"
! *** Could refer to Data.List.sort or M.sort
  Hugs:[Leaving Hugs]


Thanks
Ian

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] invalid character encoding

2005-03-15 Thread Ross Paterson
On Mon, Mar 14, 2005 at 07:38:09PM -0600, John Goerzen wrote:
 I've got some gzip (and Ian Lynagh's Inflate) code that breaks under
 the new hugs with:
 
  handle: IO.getContents: protocol error (invalid character encoding)
 
 What is going on, and how can I fix it?

A Haskell 98 Handle is a character stream, and doesn't support binary
I/O.  This would have bitten you sooner or later on systems that do CRLF
conversion, but Hugs is now much stricter, because character streams now
use the encoding determined by the current locale (for the C locale, that
means ASCII only).

You can select binary I/O using the openBinaryFile and hSetBinaryMode
functions from System.IO.  After that, the Chars you get from that Handle
are actually bytes.
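
For example (fromIntegral . ord is safe here precisely because each
Char from a binary-mode Handle is one byte):

import Data.Char (ord)
import Data.Word (Word8)
import System.IO (IOMode(ReadMode), hGetContents, openBinaryFile)

readBytes :: FilePath -> IO [Word8]
readBytes path = do
  h <- openBinaryFile path ReadMode
  s <- hGetContents h                  -- each Char is one byte here
  return (map (fromIntegral . ord) s)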
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] invalid character encoding

2005-03-15 Thread John Goerzen
On Tue, Mar 15, 2005 at 10:44:28AM +, Ross Paterson wrote:
 On Mon, Mar 14, 2005 at 07:38:09PM -0600, John Goerzen wrote:
  I've got some gzip (and Ian Lynagh's Inflate) code that breaks under
  the new hugs with:
  
   handle: IO.getContents: protocol error (invalid character encoding)
  
  What is going on, and how can I fix it?
 
 A Haskell 98 Handle is a character stream, and doesn't support binary
 I/O.  This would have bitten you sooner or later on systems that do CRLF

Yes, probably so..

 conversion, but Hugs is now much stricter, because character streams now
 use the encoding determined by the current locale (for the C locale, that
 means ASCII only).

Hmm, this seems to be completely undocumented.  So yes, I'll try using
openBinaryFile, but the docs I have seen still talk only about CRLF and
^Z.

Anyway, I'm intrested in this new feature (I assume GHC 6.4 has it as
well?)  Would it, for instance, automatically convert from Latin-1 to
UTF-16 on read, and the inverse on write?  Or to/from UTF-8?

Thanks,

-- John
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] invalid character encoding

2005-03-15 Thread Ross Paterson
On Tue, Mar 15, 2005 at 08:12:48AM -0600, John Goerzen wrote:
  [...] but Hugs is now much stricter, because character streams now
  use the encoding determined by the current locale (for the C locale, that
  means ASCII only).
 
 Hmm, this seems to be completely undocumented.

It's mentioned in the release history in the User's Guide, which refers
to section 3.3 for (some) more details.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] invalid character encoding

2005-03-15 Thread Ian Lynagh
On Tue, Mar 15, 2005 at 10:44:28AM +, Ross Paterson wrote:
 On Mon, Mar 14, 2005 at 07:38:09PM -0600, John Goerzen wrote:
  I've got some gzip (and Ian Lynagh's Inflate) code that breaks under
  the new hugs with:
  
   handle: IO.getContents: protocol error (invalid character encoding)
  
  What is going on, and how can I fix it?
 
 A Haskell 98 Handle is a character stream, and doesn't support binary
 I/O.  This would have bitten you sooner or later on systems that do CRLF
 conversion, but Hugs is now much stricter, because character streams now
 use the encoding determined by the current locale (for the C locale, that
 means ASCII only).

Do you have a list of functions which behave differently in the new
release to how they did in the previous release?
(I'm not interested in changes that will affect only whether something
compiles, not how it behaves given it compiles both before and after).

Simons, Malcolm, are there any such functions in the new ghc/nhc98?

Also, are you all agreed that the hugs interpretation of the report is
correct, and thus ghc at least is buggy in this respect? (I'm afraid I
haven't been able to test nhc98 yet).

Finally, the hugs behaviour seems a little odd to me. The below shows 4
cases where iconv complains when asked to convert utf8 to utf8, but hugs
only gives an error in one of them. In the others it just truncates the
input. Is this really correct? It also seems to behave the same for me
regardless of whether I export LC_CTYPE to en_GB.UTF-8 or C.


Thanks
Ian


printf "\x00\x7F" > inp1
printf "\x00\x80" > inp2
printf "\x00\xC4" > inp3
printf "\xFF\xFF" > inp4
printf "\xb1\x41\x00\x03\x65\x6d\x70\x74\x79\x00\x03\x00\x00\x00\x00\x00" > inp5
echo 'main = do xs <- getContents; print xs' > run.hs
for i in `seq 1 5`; do runhugs run.hs < inp$i; done
for i in `seq 1 5`; do runghc6 run.hs < inp$i; done
for i in `seq 1 5`; do echo $i; iconv -f utf8 -t utf8 < inp$i; done

which gives me the following output:

$ for i in `seq 1 5`; do runhugs run.hs < inp$i; done
\NUL\DEL
\NUL
\NUL


Program error: stdin: IO.getContents: protocol error (invalid character encoding)
$ for i in `seq 1 5`; do runghc6 run.hs < inp$i; done
\NUL\DEL
\NUL\128
\NUL\196
\255\255
\177A\NUL\ETXempty\NUL\ETX\NUL\NUL\NUL\NUL\NUL
$ for i in `seq 1 5`; do echo $i; iconv -f utf8 -t utf8 < inp$i; done
1
2
iconv: illegal input sequence at position 1
3
iconv: incomplete character or shift sequence at end of buffer
4
iconv: illegal input sequence at position 0
5
iconv: illegal input sequence at position 0
$ 


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] invalid character encoding

2005-03-14 Thread John Goerzen
I've got some gzip (and Ian Lynagh's Inflate) code that breaks under
the new hugs with:

 handle: IO.getContents: protocol error (invalid character encoding)

What is going on, and how can I fix it?

Thanks,
John
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe