Re: [Haskell-cafe] invalid character encoding
On Sun, Mar 20, 2005 at 04:34:12AM, Ian Lynagh wrote:
> On Sun, Mar 20, 2005 at 01:33:44AM, [EMAIL PROTECTED] wrote:
> > On Sat, Mar 19, 2005 at 07:14:25PM, Ian Lynagh wrote:
> > > Is there anything LC_CTYPE can be set to that will act like C/POSIX
> > > but accept 8-bit bytes as chars too?
> > en_GB.iso88591 (or indeed any .iso88591 locale) will match the old
> > behaviour (and the GHC behaviour).
> This works for me with en_GB.iso88591 (or en_GB), but not
> en_US.iso88591 (or en_US). My /etc/locale.gen contains:
>
>	en_GB ISO-8859-1
>	en_GB.ISO-8859-15 ISO-8859-15
>	en_GB.UTF-8 UTF-8
>
> So is there anything that /always/ works?

Since systems may have no locale other than C/POSIX, no.

> > Yes, I don't see how to avoid this when using mbtowc() to do the
> > conversion: it makes no distinction between a bad byte sequence and
> > an incomplete one.
> Perhaps you could use mbrtowc instead?

Indeed. Thanks for pointing it out.

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] invalid character encoding
John Meacham wrote:
> > > I'm not suggesting inventing conventions. I'm suggesting leaving
> > > such issues to the application programmer who, unlike the library
> > > programmer, probably has enough context to be able to reliably
> > > determine the correct encoding in any specific instance.
> > But the whole point of Foreign.C.String is to interface to existing
> > C code. And one of the most common conventions of said interfaces is
> > to represent strings in the current locale,
> Which is why locale-honoring conversion routines are useful.

My point is that most C functions which accept or return char*s will
work regardless of whether those char*s can be decoded according to the
current locale. E.g.

	while (d = readdir(dir), d)
	{
		stat(d->d_name, &st);
		...
	}

will stat() every filename in the directory, regardless of whether or
not the filenames are valid in the locale's encoding.

The Haskell equivalent using FilePath (i.e. String),
getDirectoryContents etc currently only works because the char* <->
String conversions are hardcoded to ISO-8859-1, which is infallible and
reversible. If it used e.g. UTF-8, it would fail on any filename which
wasn't valid UTF-8, even though it never actually needs to know the
string of characters which the filename represents.

The same applies to reading filenames from argv[] and passing them to
open() etc. This is one of the most common idioms in Unix programming,
and it doesn't care about encodings at all. Again, it would cease to
work reliably in Haskell if the automatic char* <-> String conversions
in getArgs etc started using the locale.

I'm not arguing about *how* char* <-> String conversions should be
performed so much as arguing about *whether* these conversions should be
performed. The conversion issues are only problems because the
conversions are being done at all.

-- 
Glynn Clements [EMAIL PROTECTED]
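[Editor's note: Glynn's claim that a hardcoded ISO-8859-1 conversion is
"infallible and reversible" is easy to demonstrate. The sketch below uses
hypothetical helper names, not GHC's actual internals: every byte maps to
exactly one code point below 0x100 and back, so no byte string can fail
to decode.]

```haskell
-- A minimal sketch of why a fixed ISO-8859-1 (Latin-1) conversion can
-- never fail and always round-trips. The names are illustrative only.
import Data.Word (Word8)
import Data.Char (chr, ord)

-- Every byte is a valid Latin-1 character: decoding is total.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Only safe for characters below '\x100'; filenames that came in via
-- decodeLatin1 always satisfy this, so the round trip is exact.
encodeLatin1 :: String -> [Word8]
encodeLatin1 = map (fromIntegral . ord)

main :: IO ()
main = print (decodeLatin1 (encodeLatin1 "caf\xE9") == "caf\xE9")
-- prints True
```

By contrast, a UTF-8 decode of an arbitrary byte string can fail, which
is exactly the failure mode Glynn describes for directory listings.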
Re: [Haskell-cafe] invalid character encoding
One thing I don't like about this automatic conversion is that it is
hidden magic - and could catch people out. Let's say I don't want to use
it... How can I do the following (ie what are the new API calls)?

 - Open a file with a name that is invalid in the current locale (say a
   zip disc from a computer with a different locale setting).
 - Open a file with contents in an unknown encoding. What are the new
   binary API calls for file IO?
 - What type is returned from 'getChar' on a binary file? Should it even
   be called getChar? What about getWord8 (getWord16, getWord32 etc...)?
 - Does the encoding translation occur just on the filename, or on the
   contents as well? What if I have an encoded filename with binary
   contents, and vice versa?

Keean.

(I guess I now have to rewrite a lot of file IO code!)
Re: [Haskell-cafe] invalid character encoding
On Sun, Mar 20, 2005 at 12:59:52PM, Keean Schupke wrote:
> How can I do the following (ie what are the new API calls):
> Open a file with a name that is invalid in the current locale (say a
> zip disc from a computer with a different locale setting).

A new API is needed for this.

> Open a file with contents in an unknown encoding. What are the new
> binary API calls for file IO?

See System.IO.

> What type is returned from 'getChar' on a binary file. Should it even
> be called getChar? what about getWord8 (getWord16, getWord32 etc...)

Char, of course. And yes, it's not ideal. There's also a byte array
interface.

> (I guess I now have to rewrite a lot of file IO code!)

If it was doing binary I/O on H98 Handles, it already needed rewriting.
There's nothing to be done for filenames until a new API emerges.
Re: [Haskell-cafe] invalid character encoding
Wolfgang Thaller [EMAIL PROTECTED] writes:
> In what way is ISO-2022 non-reversible? Is it possible that an
> ISO-2022 file name that is converted to Unicode cannot be converted
> back any more (assuming you know for sure that it was ISO-2022 in the
> first place)?

I am no expert on ISO-2022, so the following may contain errors; please
correct it if it is wrong.

ISO-2022 -> Unicode is always possible. Unicode -> ISO-2022 should also
always be possible, but it is a relation, not a function: there are an
infinite(?) number of ways of encoding a particular unicode string in
ISO-2022. ISO-2022 works by providing escape sequences to switch between
different character sets, and one can freely use these escapes in almost
any way one wishes. Also, ISO-2022 makes a difference between the same
character in Japanese/Chinese/Korean - which Unicode does not do. See
here for more info on the topic:

http://www.ecma-international.org/publications/files/ecma-st/ECMA-035.pdf

Also, trusting the system locale for everything is problematic and makes
things quite unbearable for I18N. E.g. on my desktop 95% of things run
with iso-8859-1, 3% of things use utf-8 and a few apps use EUC-JP...

Using filenames as opaque blobs causes the least problems. If the
program wishes to display them in a graphical environment then they have
to be converted to a string, but very many apps never display the
filenames.

- Einar Karttunen
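[Editor's note: the "relation, not a function" point can be shown with a
toy decoder. The sketch below understands only one ISO-2022 escape, the
ESC ( B designation of ASCII into G0, and treats all other bytes as
ASCII; real ISO-2022 is vastly richer, so this is an illustration of the
redundancy argument, nothing more.]

```haskell
-- Toy ISO-2022 fragment: an ESC ( B sequence re-designates ASCII, which
-- is redundant if we are already in ASCII, so two distinct byte strings
-- decode to the same text.
decode :: String -> String
decode ('\x1B':'(':'B':rest) = decode rest  -- redundant designation
decode (c:rest)              = c : decode rest
decode []                    = []

main :: IO ()
main = do
  print (decode "AB" == "AB")               -- prints True
  print (decode "\x1B(BA\x1B(BB" == "AB")   -- prints True
```

Since many encoded forms map to one text, a byte-exact Unicode ->
ISO-2022 inverse cannot exist in general, which is Einar's point.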
Re: [Haskell-cafe] invalid character encoding
On Sat, Mar 19, 2005 at 12:55:54PM +0100, Marcin 'Qrczak' Kowalczyk wrote:
> Glynn Clements [EMAIL PROTECTED] writes:
> > The point is that a single program often generates multiple streams
> > of text, possibly for different audiences (e.g. humans and machines).
> > Different streams may require different conventions (encodings,
> > numeric formats, collating orders), but may use the same functions.
> A single program has a single stdout and a single filesystem. The
> contexts which use the locale encoding don't need multiple encodings.

That's not true: there could be many filesystems, each of which uses a
different encoding for the filenames. In the case of removable media,
this scenario isn't even unlikely.

-- 
David Roundy
http://www.darcs.net
Re: [Haskell-cafe] invalid character encoding
Einar Karttunen wrote:
> > In what way is ISO-2022 non-reversible? Is it possible that an
> > ISO-2022 file name that is converted to Unicode cannot be converted
> > back any more (assuming you know for sure that it was ISO-2022 in
> > the first place)?
> I am no expert on ISO-2022 so the following may contain errors, please
> correct if it is wrong. ISO-2022 -> Unicode is always possible. Also
> Unicode -> ISO-2022 should be always possible, but is a relation not a
> function. This means there are an infinite(?) number of ways of
> encoding a particular unicode string in ISO-2022. ISO-2022 works by
> providing escape sequences to switch between different character sets.
> One can freely use these escapes in almost any way you wish.

Exactly. Moreover, while there are an infinite number of equivalent
representations in theory (you can add as many redundant switching
sequences as you wish), there are multiple plausible equivalent
representations in practice.

-- 
Glynn Clements [EMAIL PROTECTED]
Re: [Haskell-cafe] invalid character encoding
Marcin 'Qrczak' Kowalczyk wrote:
> > I'm talking about standard (XSI) curses, which will just pass
> > printable (non-control) bytes straight to the terminal. If your
> > terminal uses CP437 (or some other non-standard encoding), you can
> > just pass the appropriate bytes to waddstr() etc and the
> > corresponding characters will appear on the terminal.
> Which terminal uses CP437? Most software terminal emulators can use
> any encoding.

Traditional comms packages tend to support this (including their own VGA
font if necessary) because of its widespread use on BBSes which were
targeted at MS-DOS systems. There exist hardware terminals (I can't name
specific models, but I have seen them in use) which support this,
specifically for use with MS-DOS systems.

> Linux console doesn't, except temporarily after switching the mapping
> to builtin CP437 (but this state is not used by curses) or after
> loading CP437 as the user map (nobody does this, and it won't work
> properly with all characters from the range 0x80-0x9F anyway).

I *still* encounter programs written for the Linux console which assume
that the built-in CP437 font is being used (if you use an ISO-8859-1
font, you get dialogs with accented characters where you would expect
line-drawing characters).

> You can treat it as immutable. Just don't call setlocale with
> different arguments again.

Which limits you to a single locale. If you are using the locale's
encoding, that limits you to a single encoding.

> There is no support for changing the encoding of a terminal on the fly
> by programs running inside it.

If you support multiple terminals with different encodings, and the
library uses the global locale settings to determine the encoding, you
need to switch locale every time you write to a different terminal.

> > The point is that a single program often generates multiple streams
> > of text, possibly for different audiences (e.g. humans and
> > machines). Different streams may require different conventions
> > (encodings, numeric formats, collating orders), but may use the same
> > functions.
> A single program has a single stdout and a single filesystem. The
> contexts which use the locale encoding don't need multiple encodings.
> Multiple encodings are needed e.g. for exchanging data with other
> machines over the network, for reading contents of text files after
> the user has specified an encoding explicitly etc. In these cases an
> API with explicitly provided encoding should be used.

An API which is used for reading and writing text files or sockets is
just as applicable to stdin/stdout.

> > The current locale mechanism is just a way of avoiding the issues as
> > much as possible when you can't get away with avoiding them
> > altogether.
> It's a way to communicate the encoding of the terminal, filenames,
> strerror, gettext etc.

It's *a* way, but it's not a very good way.

> It sucks when you can't apply a single convention to everything. It's
> not so bad as to justify inventing our own conventions and forcing
> users to configure the encoding of Haskell programs separately.

I'm not suggesting inventing conventions. I'm suggesting leaving such
issues to the application programmer who, unlike the library programmer,
probably has enough context to be able to reliably determine the correct
encoding in any specific instance.

> Unicode has no viable competition.

There are two viable alternatives: byte strings with associated
encodings, and ISO-2022.

> ISO-2022 is an insanely complicated brain-damaged mess. I know it's
> being used in some parts of the world, but the sooner it will die, the
> better.

ISO-2022 has advantages and disadvantages relative to UTF-8. I don't
want to go on about the specifics here because they aren't particularly
relevant. What's relevant is that it isn't likely to disappear any time
soon. A large part of the world already has a universal encoding which
works well enough; they don't *need* UTF-8, and aren't going to rebuild
their IT infrastructure from scratch for the sake of it.

-- 
Glynn Clements [EMAIL PROTECTED]
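[Editor's note: the per-stream alternative Glynn is arguing for - an
encoding that is a property of each stream rather than process-global
locale state - can be sketched with the per-Handle interface GHC later
gained in System.IO. It did not exist at the time of this thread; it is
shown here only to make the design concrete.]

```haskell
-- Sketch: each Handle carries its own encoding, so one program can
-- serve two "audiences" with different encodings at the same time,
-- with no global setlocale-style state.
import System.IO

main :: IO ()
main = do
  h <- openFile "demo.txt" WriteMode
  hSetEncoding h latin1        -- this handle encodes ISO-8859-1...
  hPutStrLn h "caf\xE9"
  hClose h
  hSetEncoding stdout utf8     -- ...while stdout independently uses UTF-8
  putStrLn "caf\xE9"
```

With this design, switching between a "human-readable" and a
"machine-readable" stream never requires touching shared mutable state.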
Re: [Haskell-cafe] invalid character encoding
David Roundy wrote:
> That's not true, there could be many filesystems, each of which uses a
> different encoding for the filenames. In the case of removable media,
> this scenario isn't even unlikely.

I agree - I can quite easily see the situation occurring where a student
(say from Japan) brings in a zip-disk or USB key formatted with a
Japanese filename encoding that I need to read on my computer (with a UK
locale).

Also, can different windows have different encodings? I might have a web
browser (written in Haskell?) running, and have windows with several
different encodings open at the same time, whilst saving things on
filesystems with differing encodings.

Keean.
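[Editor's note: one pragmatic answer to the removable-media scenario is
to try a list of decoders in order, which is the policy mono applies via
MONO_EXTERNAL_ENCODINGS (described later in this thread). A sketch, with
two hypothetical stand-in decoders - strict ASCII, then Latin-1:]

```haskell
-- Try each decoder in turn and keep the first success.
import Data.Word (Word8)
import Data.Char (chr)
import Data.Maybe (listToMaybe, mapMaybe)

type Decoder = [Word8] -> Maybe String

asciiD, latin1D :: Decoder
asciiD bs | all (< 0x80) bs = Just (map (chr . fromIntegral) bs)
          | otherwise       = Nothing
latin1D = Just . map (chr . fromIntegral)   -- never fails

decodeFirst :: [Decoder] -> [Word8] -> Maybe String
decodeFirst ds bs = listToMaybe (mapMaybe ($ bs) ds)
```

In practice the decoder list would come from user configuration (one
list per filesystem, say), not be hardwired as here.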
Re: [Haskell-cafe] invalid character encoding
On Sat, 19 Mar 2005, David Roundy wrote:
> That's not true, there could be many filesystems, each of which uses a
> different encoding for the filenames. In the case of removable media,
> this scenario isn't even unlikely.

The nearest desktop machine to me right now has in its directory
structure filesystems that use different encodings. So, yes, it's
probably not all that rare.

Mark.

-- 
Haskell vacancies in Columbus, Ohio, USA: see http://www.aetion.com/jobs.html
Re: [Haskell-cafe] invalid character encoding
Wolfgang Thaller wrote:
> > Of course, it's quite possible that the only test cases will be
> > people using UTF-8-only (or even ASCII-only) systems, in which case
> > you won't see any problems.
> I'm kind of hoping that we can just ignore a problem that is so rare
> that a large and well-known project like GTK2 can get away with
> ignoring it.

1. The filename issues in GTK-2 are likely to be a major problem in CJK
locales, where filenames which don't match the locale (which is seldom
UTF-8) are common.

2. GTK's filename handling only really applies to file selector dialogs.
Most other uses of filenames in a GTK-based application don't involve
GTK; they use the OS API functions, which just deal with byte strings.

3. GTK is a GUI library. Most of the text which it deals with is going
to be rendered, so it *has* to be interpreted as characters. Treating it
as blobs of data won't work. IOW, on the question of whether or not to
interpret byte strings as character strings, GTK is at the far end of
the scale.

> Also, IIRC, Java strings are supposed to be unicode, too - how do they
> deal with the problem?

Files are represented by instances of the File class:

	http://java.sun.com/j2se/1.5.0/docs/api/java/io/File.html

	An abstract representation of file and directory pathnames.

You can construct Files from Strings, and convert Files to Strings. The
File class includes two sets of directory enumeration methods: list()
returns an array of Strings, while listFiles() returns an array of
Files. The documentation for the File class doesn't mention encoding
issues at all. However, with that interface, it would be possible to
enumerate and open filenames which cannot be decoded.

> So we can't do Unicode-based I18N because there exist a few unix
> systems with messed-up file systems?

Declaring such systems to be messed up won't make the problems go away.

> > If a design doesn't work in reality, it's the fault of the design,
> > not of reality.
> In general, yes. But we're not talking about all of reality here,
> we're talking about one small part of reality - the question is, can
> the part of reality where the design doesn't work be ignored?

Sure, you *can* ignore it; K&R C ignored everything other than ASCII. If
you limit yourself to locales which use the Roman alphabet (i.e.
ISO-8859-N for N = 1/2/3/4/9/15), you can get away with a lot. Most such
users avoid encoding issues altogether by dropping the accents and
sticking to ASCII, at least when dealing with files which might leave
their system.

To get a better idea, you would need to consult users whose language
doesn't use the Roman alphabet, e.g. CJK or Cyrillic. Unfortunately, you
don't usually find too many of them on lists such as this. I'm only
familiar with one OSS project which has a sizeable CJK user base, and
that's XEmacs (whose I18N revolves around ISO-2022, and most of whose
documentation is in Japanese). Even there, there are separate mailing
lists for English and Japanese, and the two seldom communicate.

> I think that if we wait long enough, the filename encoding problems
> will become irrelevant and we will live in an ideal world where
> unicode actually works. Maybe next year, maybe only in ten years.

Maybe not even then. If Unicode really solved encoding problems, you'd
expect the CJK world to be the first adopters, but they're actually the
least eager; you are more likely to find UTF-8 in an English-language
HTML page or email message than a Japanese one.

-- 
Glynn Clements [EMAIL PROTECTED]
Re: [Haskell-cafe] invalid character encoding
Wolfgang Thaller [EMAIL PROTECTED] writes:
> Also, IIRC, Java strings are supposed to be unicode, too - how do they
> deal with the problem?

Java (Sun)
----------
Filenames are assumed to be in the locale encoding.
a) Interpreting: bytes which cannot be converted are replaced by U+FFFD.
b) Creating: characters which cannot be converted are replaced by "?".
Command line arguments and standard I/O are treated in the same way.

Java (GNU)
----------
Filenames are assumed to be in Java-modified UTF-8.
a) Interpreting: if a filename cannot be converted, a directory listing
   contains a null instead of a string object.
b) Creating: all Java characters are representable in Java-modified
   UTF-8. Obviously not all potential filenames can be represented.
Command line arguments are interpreted according to the locale. Bytes
which cannot be converted are skipped.
Standard I/O works in ISO-8859-1 by default. Obviously all input is
accepted. On output, characters above U+00FF are replaced by "?".

C# (mono)
---------
Filenames use the list of encodings from the MONO_EXTERNAL_ENCODINGS
environment variable, with UTF-8 implicitly added at the end. These
encodings are tried in order.
a) Interpreting: if a filename cannot be converted, it's skipped in a
   directory listing. The documentation says that if a filename, a
   command line argument etc. looks like valid UTF-8, it is treated as
   such first, and MONO_EXTERNAL_ENCODINGS is consulted only in the
   remaining cases. The reality seems to not match this (mono-1.0.5).
b) Creating: if UTF-8 is used, U+ throws an exception
   (System.ArgumentException: Path contains invalid chars), paired
   surrogates are treated correctly, and an isolated surrogate causes an
   internal error:

	** ERROR **: file strenc.c: line 161 (mono_unicode_to_external):
	assertion failed: (utf8!=NULL)
	aborting...

Command line arguments are treated in the same way, except that if an
argument cannot be converted, the program dies at start:

	[Invalid UTF-8]
	Cannot determine the text encoding for argument 1 (xxx\xb1\xe6\xea).
	Please add the correct encoding to MONO_EXTERNAL_ENCODINGS and try
	again.

Console.WriteLine emits UTF-8. Paired surrogates are treated correctly,
unpaired surrogates are converted to pseudo-UTF-8. Console.ReadLine
interprets text as UTF-8. Bytes which cannot be converted are skipped.

-- 
Marcin Kowalczyk
[EMAIL PROTECTED]
http://qrnik.knm.org.pl/~qrczak/
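[Editor's note: Sun Java's "interpret what you can, replace undecodable
bytes with U+FFFD" policy is simple to sketch. The decoder below handles
only ASCII and two-byte UTF-8 sequences to keep it short; a real decoder
also needs the 3- and 4-byte forms and overlong-sequence checks.]

```haskell
-- Decode UTF-8 bytes to a String, substituting U+FFFD for anything
-- undecodable (Sun-Java-style), rather than failing outright.
import Data.Word (Word8)
import Data.Char (chr)
import Data.Bits ((.&.), (.|.), shiftL)

decodeReplace :: [Word8] -> String
decodeReplace [] = []
decodeReplace (b:bs)
  | b < 0x80 = chr (fromIntegral b) : decodeReplace bs   -- ASCII
  | b >= 0xC2 && b <= 0xDF                               -- 2-byte lead
  , (c:rest) <- bs
  , c >= 0x80 && c <= 0xBF                               -- continuation
  = chr ((fromIntegral (b .&. 0x1F) `shiftL` 6)
           .|. fromIntegral (c .&. 0x3F))
      : decodeReplace rest
  | otherwise = '\xFFFD' : decodeReplace bs              -- undecodable
```

The trade-off is the one discussed throughout this thread: the result is
always a valid String, but the conversion is no longer reversible.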
Re: [Haskell-cafe] invalid character encoding
On Wed, Mar 16, 2005 at 11:55:18AM, Ross Paterson wrote:
> On Wed, Mar 16, 2005 at 03:54:19AM, Ian Lynagh wrote:
> > Do you have a list of functions which behave differently in the new
> > release to how they did in the previous release? (I'm not interested
> > in changes that will affect only whether something compiles, not how
> > it behaves given it compiles both before and after).
> I got lost in the negatives here. It affects all Haskell 98 primitives
> that do character I/O, or that exchange C strings with the C library.

In the below, it looks like there is a bug in getDirectoryContents.
Also, the error from w.hs is going to stdout, not stderr.

Most importantly, though: is there any way to remove this file without
doing something like an FFI import of unlink? Is there anything LC_CTYPE
can be set to that will act like C/POSIX but accept 8-bit bytes as chars
too?

(in the POSIX locale)

$ echo 'import Directory; main = getDirectoryContents "." >>= print' > q.hs
$ runhugs q.hs
[".","..","q.hs"]
$ touch 1`printf "\xA2"`
$ runhugs q.hs
runhugs: Error occurred
ERROR - Garbage collection fails to reclaim sufficient space
$ echo 'import Directory; main = removeFile "1\xA2"' > w.hs
$ runhugs w.hs
Program error: 1?: Directory.removeFile: does not exist (file does not exist)
$ strace -o strace.out runhugs w.hs > /dev/null
$ grep unlink strace.out | head -c 14 | hexdump -C
00000000  75 6e 6c 69 6e 6b 28 22  31 3f 22 29 20 20        |unlink("1?")  |
0000000e
$ strace -o strace2.out rm 1*
$ grep unlink strace2.out | head -c 14 | hexdump -C
00000000  75 6e 6c 69 6e 6b 28 22  31 a2 22 29 20 20        |unlink("1.")  |
0000000e
$

Now consider this e.hs:

	import IO
	main = do hWaitForInput stdin 1
	          putStrLn "Input is ready"
	          r <- hReady stdin
	          print r
	          c <- hGetChar stdin
	          print c
	          putStrLn "Done!"

$ { printf "\xC2\xC2\xC2\xC2\xC2\xC2\xC2"; sleep 30; } | runhugs e.hs
Input is ready
True
Program error: stdin: IO.hGetChar: protocol error (invalid character encoding)
$

It takes 30 seconds for this error to be printed. This shows two issues:
First of all, I think you should be giving an error as soon as you have
a prefix that is the start of no character. Second, hReady now only
guarantees hGetChar won't block on a binary mode handle, but I guess
there is not much we can do except document that (short of some hideous
hacks).

Thanks
Ian
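[Editor's note: Ian's first point is that \xC2 on its own is merely an
*incomplete* UTF-8 sequence, while \xC2\xC2 can never become valid, so
the error could be raised at the second byte instead of blocking for 30
seconds. The distinction can be sketched like this (2-byte sequences
only; the names are hypothetical, not Hugs internals):]

```haskell
-- Classify a byte buffer as a complete character, an incomplete prefix
-- (keep waiting for input), or definitely invalid (fail immediately).
import Data.Word (Word8)

data Validity = Complete | Incomplete | Invalid
  deriving (Eq, Show)

classify :: [Word8] -> Validity
classify []     = Incomplete              -- no input yet: keep waiting
classify (b:bs)
  | b < 0x80  = Complete                  -- ASCII
  | b >= 0xC2 && b <= 0xDF =              -- lead byte of 2-byte sequence
      case bs of
        []    -> Incomplete               -- may still become valid
        (c:_) | c >= 0x80, c <= 0xBF -> Complete
              | otherwise            -> Invalid   -- fail now, don't block
  | otherwise = Invalid                   -- lone continuation byte, etc.

main :: IO ()
main = mapM_ (print . classify)
             [[0xC2], [0xC2, 0xC2], [0xC2, 0xA2], [0x41]]
```

A decoder built on this would report Ian's \xC2\xC2 stream as an error
at the second byte, while still waiting patiently on a bare \xC2.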
Re: [Haskell-cafe] invalid character encoding
> > Also, IIRC, Java strings are supposed to be unicode, too - how do
> > they deal with the problem?
> Files are represented by instances of the File class: [...] The
> documentation for the File class doesn't mention encoding issues at
> all.

... which led me to conclude that they don't deal with the problem
properly.

> > I think that if we wait long enough, the filename encoding problems
> > will become irrelevant and we will live in an ideal world where
> > unicode actually works. Maybe next year, maybe only in ten years.
> Maybe not even then. If Unicode really solved encoding problems, you'd
> expect the CJK world to be the first adopters, but they're actually
> the least eager; you are more likely to find UTF-8 in an
> English-language HTML page or email message than a Japanese one.

Hmm, that's possibly because English-language users can get away with
just marking their ASCII files as UTF-8. But I'm not arguing files or
HTML pages here, I'm only concerned with filenames.

I prefer unicode nowadays because I was born within a hundred kilometers
of the border between ISO-8859-1 and ISO-8859-2. I need 8859-1 for
German-language texts, but as soon as I write about where I went for
vacation, I need a few 8859-2 characters. So 8-bit encodings didn't cut
it, and nobody ever tried to sell ISO-2022 to me, so unicode was the
only alternative.

So you've now convinced me that there is a considerable number of
computers using ISO-2022, where there's more than one way to encode the
same text (how do people use this from the command line??). There are
also multi-user systems where the users don't agree on a single
encoding. I still reserve the right to call those systems messed-up, but
that's just my personal opinion and reality couldn't care less about
what I think.

So, as I don't want to stick with the status quo forever (lists of bytes
that pretend to be lists of unicode chars, even on platforms where
unicode is used anyway), how about we get to work - what do we want?

I don't think we want a type class here, a plain (abstract) data type
will do:

	data File

Obviously, we'll need conversion from and to C strings. On Mac OS X,
they'd be guaranteed to be in UTF-8.

	withFilePathCString :: String -> (CString -> IO a) -> IO a
	fileFromCString :: CString -> IO File

We will need functions for converting to and from unicode strings. I'm
pretty sure that we want to keep those functions pure, otherwise they'll
be very annoying to use.

	fileFromPath :: String -> File

Any impure operations that might be needed to decide how to encode the
file name will have to be delayed until the File is actually used.

	fileToPath :: File -> String

Same here: any impure operation necessary to convert the File to a
unicode string needs to be done when the File is created.

What about failure? If you go from String to File, errors should be
reported when you actually access the file. At an earlier time, you
can't know whether the file name is valid (e.g. if you mount a classic
HFS volume on Mac OS X, you can only create files there whose names can
be represented in the volume's file name encoding - but you only find
that out once you try to create a file).

For going from File to String, I'm not so sure, but I would be very
annoyed if I had to deal with a Maybe String return type on platforms
where it will always succeed. Maybe there should be separate functions
for different purposes - i.e. for display, you'd use a File -> String
function that will silently use '?'s when things can't be decoded, but
in other situations you might use a File -> Maybe String function and
check for Nothing.

If people want to implement more sophisticated ways of decoding file
names than can be provided by the library, they'd get the C string and
do the same things.

Of course, there should also be lots of other useful functions that make
it more or less unnecessary to deal with path names directly in most
cases.

Thoughts?

Cheers,
Wolfgang
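[Editor's note: Wolfgang's proposal can be fleshed out in a few lines.
The sketch below uses his hypothetical names; the decoders are toy
ASCII-only stand-ins, and a real implementation would consult an actual
encoding. The key design point survives even in the toy: the file name
is carried as raw bytes, and decoding is separate and allowed to fail.]

```haskell
-- An opaque File: the bytes the OS gave us, untouched.
import Data.Word (Word8)
import Data.Char (chr, ord)

newtype File = File [Word8]

-- Checked decoding, for callers who need to know about failure.
-- (Toy: accepts only ASCII; a real version would use an encoding.)
fileToPath :: File -> Maybe String
fileToPath (File bs)
  | all (< 0x80) bs = Just (map (chr . fromIntegral) bs)
  | otherwise       = Nothing

-- Lossy decoding for display: never fails, substitutes '?',
-- much like the Java behaviour surveyed earlier in the thread.
fileToPathLossy :: File -> String
fileToPathLossy (File bs) =
  map (\b -> if b < 0x80 then chr (fromIntegral b) else '?') bs

-- Pure construction; any encoding errors are deferred until the
-- File is actually used, as Wolfgang proposes.
fileFromPath :: String -> File
fileFromPath = File . map (fromIntegral . ord)

main :: IO ()
main = print (fileToPathLossy (File [0x61, 0xA2]))  -- prints "a?"
```

This gives both of Wolfgang's suggested flavours: a `File -> Maybe
String` for careful callers and a total `File -> String` for display.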
Re: [Haskell-cafe] invalid character encoding
On Sat, Mar 19, 2005 at 07:14:25PM, Ian Lynagh wrote:
> In the below, it looks like there is a bug in getDirectoryContents.

Yes, now fixed in CVS.

> Also, the error from w.hs is going to stdout, not stderr.

It's a nuisance, but no one has got around to changing it.

> Most importantly, though: is there any way to remove this file without
> doing something like an FFI import of unlink? Is there anything
> LC_CTYPE can be set to that will act like C/POSIX but accept 8-bit
> bytes as chars too?

en_GB.iso88591 (or indeed any .iso88591 locale) will match the old
behaviour (and the GHC behaviour).

Indeed it's possible to have filenames (under POSIX, anyway) that H98
programs can't touch (under Hugs). That pretty much follows from the
Haskell definition of FilePath = String. The other thread under this
subject has touched on the need for an (additional) API using an
abstract FilePath type.

> Now consider this e.hs:
>
>	import IO
>	main = do hWaitForInput stdin 1
>	          putStrLn "Input is ready"
>	          r <- hReady stdin
>	          print r
>	          c <- hGetChar stdin
>	          print c
>	          putStrLn "Done!"
>
> $ { printf "\xC2\xC2\xC2\xC2\xC2\xC2\xC2"; sleep 30; } | runhugs e.hs
> Input is ready
> True
> Program error: stdin: IO.hGetChar: protocol error (invalid character
> encoding)
> $
>
> It takes 30 seconds for this error to be printed. This shows two
> issues: First of all, I think you should be giving an error as soon as
> you have a prefix that is the start of no character. Second, hReady
> now only guarantees hGetChar won't block on a binary mode handle, but
> I guess there is not much we can do except document that (short of
> some hideous hacks).

Yes, I don't see how to avoid this when using mbtowc() to do the
conversion: it makes no distinction between a bad byte sequence and an
incomplete one.
Re: [Haskell-cafe] invalid character encoding
On Sat, Mar 19, 2005 at 03:04:04PM, Glynn Clements wrote:
> > > I'm not suggesting inventing conventions. I'm suggesting leaving
> > > such issues to the application programmer who, unlike the library
> > > programmer, probably has enough context to be able to reliably
> > > determine the correct encoding in any specific instance.
> > But the whole point of Foreign.C.String is to interface to existing
> > C code. And one of the most common conventions of said interfaces is
> > to represent strings in the current locale,

Which is why locale-honoring conversion routines are useful. I don't
think anyone is arguing that this is the end-all of charset conversion,
far from it. A general conversion library and parameterized conversion
routines are also needed, for many of the reasons you said, and will
probably appear at some point. I have my own iconv interface which I
used for my initial implementation of with/peekCString etc., and I am
sure other people have written their own; eventually one will be
standardized. A general conversion facility has been on the wishlist for
a long time.

However, at the moment, the FFI is tackling a much simpler goal of
interfacing with existing C code, and non-parameterized, locale-honoring
conversion routines are extremely useful for that. Even if we had a nice
generalized conversion routine, a simple locale-honoring front end would
be a very useful interface, because it is so commonly needed when
interfacing to C code.

However, I am sure everyone would be happy if a nice cabalized general
charset conversion library appeared... I have the start of one here,
which should work on any POSIXy system, even if wchar_t is not unicode
(no windows support though):

	http://repetae.net/john/recent/out/HsLocale.html

	John

-- 
John Meacham - repetae.net - john
Re: [Haskell-cafe] invalid character encoding
Glynn Clements wrote:
> To get a better idea, you would need to consult users whose language
> doesn't use the roman alphabet, e.g. CJK or cyrillic. Unfortunately,
> you don't usually find too many of them on lists such as this.

In Russia, we still have multiple one-byte encodings for Cyrillic: KOI-8
(Unix), CP1251 (Windows), and the increasingly obsolete CP866 (MSDOS,
OS/2).

Regarding filenames, I am sure Windows stores them in Unicode regardless
of locale. (I tried various chcp numbers in a console window, printing a
directory containing filenames in Russian and in German together: it
showed the characters outside the locale-based codepage as question
marks, but showed everything with chcp 65001, which is Unicode.)

AFAIK Unix users do not create files named in Russian very often, while
Windows users do this frequently.

Dimitry Golubovsky
Middletown, CT
Re: [Haskell-cafe] invalid character encoding
On Sun, Mar 20, 2005 at 01:33:44AM, [EMAIL PROTECTED] wrote:
> On Sat, Mar 19, 2005 at 07:14:25PM, Ian Lynagh wrote:
> > Most importantly, though: is there any way to remove this file
> > without doing something like an FFI import of unlink? Is there
> > anything LC_CTYPE can be set to that will act like C/POSIX but
> > accept 8-bit bytes as chars too?
> en_GB.iso88591 (or indeed any .iso88591 locale) will match the old
> behaviour (and the GHC behaviour).

This works for me with en_GB.iso88591 (or en_GB), but not en_US.iso88591
(or en_US). My /etc/locale.gen contains:

	en_GB ISO-8859-1
	en_GB.ISO-8859-15 ISO-8859-15
	en_GB.UTF-8 UTF-8

So is there anything that /always/ works?

> Indeed it's possible to have filenames (under POSIX, anyway) that H98
> programs can't touch (under Hugs). That pretty much follows from the
> Haskell definition of FilePath = String. The other thread under this
> subject has touched on the need for an (additional) API using an
> abstract FilePath type.

Hmm. I can't say I'm convinced by all this without having something like
that API.

> Yes, I don't see how to avoid this when using mbtowc() to do the
> conversion: it makes no distinction between a bad byte sequence and an
> incomplete one.

Perhaps you could use mbrtowc instead? My manpage says

	If the n bytes starting at s do not contain a complete multibyte
	character, mbrtowc returns (size_t)(-2). This can happen even if
	n >= MB_CUR_MAX, if the multibyte string contains redundant shift
	sequences.

	If the multibyte string starting at s contains an invalid multibyte
	sequence before the next complete character, mbrtowc returns
	(size_t)(-1) and sets errno to EILSEQ. In this case, the effects on
	*ps are undefined.

For both functions my manpage says

	CONFORMING TO
	ISO/ANSI C, UNIX98

Thanks
Ian
Re: [Haskell-cafe] invalid character encoding
Glynn Clements [EMAIL PROTECTED] writes: If you provide wrapper functions which take String arguments, either they should have an encoding argument or the encoding should be a mutable per-terminal setting. There is already a mutable setting. It's called locale. It isn't a per-terminal setting. A separate setting would force users to configure an encoding just for the purposes of Haskell programs, as if the configuration wasn't already too fragmented. It's unwise to propose a new standard when an existing standard works well enough. It is possible for curses to be used with a terminal which doesn't use the locale's encoding. No, it will break under the new wide character curses API, Or expose the fact that the WC API is broken, depending upon your POV. It's the only curses API which allows one to write full-screen programs in UTF-8 mode. Also, it's quite common to use non-standard encodings with terminals (e.g. codepage 437, which has graphic characters beyond the ACS_* set which terminfo understands). curses don't support that. Sure it does. You pass the appropriate bytes to waddstr() etc and they get sent to the terminal as-is. It doesn't support that and it will switch the terminal mode to user encoding (which is usually ISO-8859-x) on a first occasion, e.g. after an ACS_* macro was used, or maybe even at initialization. curses support two families of encodings: the current locale encoding and ACS. The locale encoding may be UTF-8 (works only with the wide character API). For compatibility the default locale is C, but new programs which are prepared for I18N should do setlocale(LC_CTYPE, "") and setlocale(LC_MESSAGES, ""). In practice, you end up continuously calling setlocale(LC_CTYPE, "") and setlocale(LC_CTYPE, "C"), depending upon whether the text is meant to be human-readable (locale-dependent) or a machine-readable format (locale-independent, i.e. C locale). I wrote LC_CTYPE, not LC_ALL. 
LC_CTYPE doesn't affect %f formatting; it only affects the encoding of texts emitted by gettext (including strerror) and the meaning of isalpha, toupper etc. The LC_* environment variables are the parameters for the encoding. But they are only really parameters at the exec() level. This is usually the right place to specify it. It's rare that they are even set separately for the given program - usually they are per-system or per-user. Once the program starts, the locale settings become global mutable state. I would have thought that, more than anyone else, the readership of this list would understand what's bad about that concept. You can treat it as immutable. Just don't call setlocale with different arguments again. Another problem with having a single locale: if a program isn't working, and you need to communicate with its developers, you will often have to run the program in an English locale just so that you will get error messages which the developers understand. You don't need to change LC_CTYPE for that. Just set LC_MESSAGES. Then how would a Haskell program know what encoding to use for stdout messages? It doesn't necessarily need to. If you are using message catalogues, you just read bytes from the catalogue and write them to stdout. gettext uses the locale to choose the encoding. Messages are internally stored as UTF-8 but emitted in the locale encoding. You are using the semantics I'm advocating without knowing it... How would it know how to interpret filenames for graphical display? An option menu on the file selector is one option; heuristics are another. Heuristics won't distinguish the various ISO-8859-x encodings from each other. An option menu on the file selector is user-unfriendly because users don't want to configure it for each program separately. They want to set it in one place and expect it to work everywhere. Currently there are two such places: the locale, and G_FILENAME_ENCODING (or the older G_BROKEN_FILENAMES) for glib. 
It's unwise to introduce yet another convention, and it would be a horrible idea to make it per-program. At least Gtk-1 would attempt to display the filename; you would get the odd question mark but at least you could select the file; Gtk+2 also attempts to display the filename. It can be opened even though the filename has inconvertible characters escaped. The current locale mechanism is just a way of avoiding the issues as much as possible when you can't get away with avoiding them altogether. It's a way to communicate the encoding of the terminal, filenames, strerror, gettext etc. Unicode has been described (accurately, IMHO) as Esperanto for computers. Both use the same approach to try to solve essentially the same problem. And both will be about as successful in the long run. Unicode has no viable competition. Esperanto had English. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: [Haskell-cafe] invalid character encoding
Wolfgang Thaller wrote: If you try to pretend that I18N comes down to shoe-horning everything into Unicode, you will turn the language into a joke. How common will those problems you are describing be by the time this has been implemented? How common are they even now? Right now, GHC assumes ISO-8859-1 whenever it has to automatically convert between String and CString. Conversions to and from ISO-8859-1 cannot fail, and encoding and decoding are exact inverses. OK, so the intermediate string will be nonsense if ISO-8859-1 isn't the correct encoding, but that doesn't actually matter a lot of the time; frequently, you're just grabbing a blob of data from one function and passing it to another. The problems will only appear once you start dealing with fallible or non-reversible encodings such as UTF-8 or ISO-2022. If and when that happens, I guess we'll find out how common the problems are. Of course, it's quite possible that the only test cases will be people using UTF-8-only (or even ASCII-only) systems, in which case you won't see any problems. I haven't yet encountered a unix box where the file names were not in the system locale encoding. On all reasonably up-to-date Linux boxes that I've seen recently, they were in UTF-8 (and the system locale agreed). I've encountered boxes where multiple encodings were used; primarily web and FTP servers which were shared amongst multiple clients. Each client used whichever encoding(s) they felt like. IIRC, the most common non-ASCII encoding was MS-DOS codepage 850 (the clients were mostly using Windows 3.1 at that time). I haven't done sysadmin for a while, so I don't know the current situation, but I don't think that the world has switched to UTF-8 in the mean time. [Most of the non-ASCII filenames which I've seen recently have been either ISO-8859-1 or Win-12XX; I haven't seen much UTF-8.] On both Windows and Mac OS X, filenames are stored in Unicode, so it is always possible to convert them to unicode. 
So we can't do Unicode-based I18N because there exist a few unix systems with messed-up file systems? Declaring such systems to be messed up won't make the problems go away. If a design doesn't work in reality, it's the fault of the design, not of reality. Haskell's Unicode support is a joke because the API designers tried to avoid the issues related to encoding with wishful thinking (i.e. you open a file and you magically get Unicode characters out of it). OK, that part is purely wishful thinking, but assuming that filenames are text that can be represented in Unicode is wishful thinking that corresponds to 99% of reality. So why can't the remaining 1% of reality be fixed instead? The issue isn't whether the data can be represented as Unicode text, but whether you can convert it to and from Unicode without problems. To do this, you need to know the encoding, you need to store the encoding so that you can convert the wide string back to a byte string, and the encoding needs to be reversible. -- Glynn Clements [EMAIL PROTECTED]
Re: [Haskell-cafe] invalid character encoding
Marcin 'Qrczak' Kowalczyk wrote: If you provide wrapper functions which take String arguments, either they should have an encoding argument or the encoding should be a mutable per-terminal setting. There is already a mutable setting. It's called locale. It isn't a per-terminal setting. A separate setting would force users to configure an encoding just for the purposes of Haskell programs, as if the configuration wasn't already too fragmented. encoding <- localeEncoding; Curses.setupTerm encoding handle Not a big deal. It's unwise to propose a new standard when an existing standard works well enough. Existing standard? The standard curses API deals with bytes; encodings don't come into it. AFAIK, the wide-character curses API isn't yet a standard. It is possible for curses to be used with a terminal which doesn't use the locale's encoding. No, it will break under the new wide character curses API, Or expose the fact that the WC API is broken, depending upon your POV. It's the only curses API which allows one to write full-screen programs in UTF-8 mode. All the more reason to fix it. And where does UTF-8 come into it? I would have expected it to use wide characters throughout. Also, it's quite common to use non-standard encodings with terminals (e.g. codepage 437, which has graphic characters beyond the ACS_* set which terminfo understands). curses don't support that. Sure it does. You pass the appropriate bytes to waddstr() etc and they get sent to the terminal as-is. It doesn't support that and it will switch the terminal mode to user encoding (which is usually ISO-8859-x) on a first occasion, e.g. after an ACS_* macro was used, or maybe even at initialization. curses support two families of encodings: the current locale encoding and ACS. The locale encoding may be UTF-8 (works only with the wide character API). I'm talking about standard (XSI) curses, which will just pass printable (non-control) bytes straight to the terminal. 
If your terminal uses CP437 (or some other non-standard encoding), you can just pass the appropriate bytes to waddstr() etc and the corresponding characters will appear on the terminal. ACS_* codes are a completely separate issue; they allow you to use line graphics in addition to a full 8-bit character set (e.g. ISO-8859-1). If you only need ASCII text, you can use the other 128 codes for graphics characters and never use the ACS_* macros or the acsc capability. For compatibility the default locale is C, but new programs which are prepared for I18N should do setlocale(LC_CTYPE, "") and setlocale(LC_MESSAGES, ""). In practice, you end up continuously calling setlocale(LC_CTYPE, "") and setlocale(LC_CTYPE, "C"), depending upon whether the text is meant to be human-readable (locale-dependent) or a machine-readable format (locale-independent, i.e. C locale). I wrote LC_CTYPE, not LC_ALL. LC_CTYPE doesn't affect %f formatting; it only affects the encoding of texts emitted by gettext (including strerror) and the meaning of isalpha, toupper etc. Sorry, I'm confusing two cases here. With LC_CTYPE, the main reason for continuous switching is when using wcstombs(). printf() uses LC_NUMERIC, which is switched between the C locale and the user's locale. Once the program starts, the locale settings become global mutable state. I would have thought that, more than anyone else, the readership of this list would understand what's bad about that concept. You can treat it as immutable. Just don't call setlocale with different arguments again. Which limits you to a single locale. If you are using the locale's encoding, that limits you to a single encoding. Another problem with having a single locale: if a program isn't working, and you need to communicate with its developers, you will often have to run the program in an English locale just so that you will get error messages which the developers understand. You don't need to change LC_CTYPE for that. Just set LC_MESSAGES. 
I'm starting to think that you're misunderstanding on purpose. Again. The point is that a single program often generates multiple streams of text, possibly for different audiences (e.g. humans and machines). Different streams may require different conventions (encodings, numeric formats, collating orders), but may use the same functions. Those functions need to obtain the conventions from somewhere, and that means either parameters or state. Having dealt with state (libc's locale mechanism), I would rather have parameters. Then how would a Haskell program know what encoding to use for stdout messages? It doesn't necessarily need to. If you are using message catalogues, you just read bytes from the catalogue and write them to stdout. gettext uses the locale to choose the encoding. Messages are internally stored as UTF-8 but emitted in the locale encoding. It didn't use to be that
Re: [Haskell-cafe] invalid character encoding
Glynn Clements wrote: OK, so the intermediate string will be nonsense if ISO-8859-1 isn't the correct encoding, but that doesn't actually matter a lot of the time; frequently, you're just grabbing a blob of data from one function and passing it to another. Yes. Of course, this also means that Strings representing non-ASCII filenames will *always* be nonsense on Mac OS X and other UTF8-based platforms. The problems will only appear once you start dealing with fallible or non-reversible encodings such as UTF-8 or ISO-2022. In what way is ISO-2022 non-reversible? Is it possible that an ISO-2022 filename that is converted to Unicode cannot be converted back any more (assuming you know for sure that it was ISO-2022 in the first place)? Of course, it's quite possible that the only test cases will be people using UTF-8-only (or even ASCII-only) systems, in which case you won't see any problems. I'm kind of hoping that we can just ignore a problem that is so rare that a large and well-known project like GTK2 can get away with ignoring it. Also, IIRC, Java strings are supposed to be Unicode, too - how do they deal with the problem? So we can't do Unicode-based I18N because there exist a few unix systems with messed-up file systems? Declaring such systems to be messed up won't make the problems go away. If a design doesn't work in reality, it's the fault of the design, not of reality. In general, yes. But we're not talking about all of reality here, we're talking about one small part of reality - the question is, can the part of reality where the design doesn't work be ignored? For example, as soon as we use any kind of path names in our APIs, we are ignoring reality on good old Classic Mac OS (may it rest in peace). Path names don't always uniquely denote a file there (although they do most of the time). People writing cross-platform software have been ignoring this fact for a long time now. 
I think that if we wait long enough, the filename encoding problems will become irrelevant and we will live in an ideal world where Unicode actually works. Maybe next year, maybe only in ten years. And while we are arguing about how far we are from that ideal world, we should think about alternatives. The current hack is really just a hack, and I don't want to see this hack become the new accepted standard. Do we have other alternatives? Preferably something that provides other advantages over a Unicode String than just making things work on systems that many users never encounter, otherwise almost no one will bother to use it. So maybe we should start looking for _other_ reasons to represent file names and paths by an abstract datatype or something? Cheers, Wolfgang
Re: [Haskell-cafe] invalid character encoding
On Thu, Mar 17, 2005 at 06:22:25AM +, Ian Lynagh wrote: [in brief: hugs' (hPutStr h) now behaves differently to (mapM_ (hPutChar h)), and ghc writes the empty string for both when told to write \128] Ah, Malcolm's commit messages have just reminded me of the finaliser changes requiring hflushes in new ghc, so it's just the hugs output that confuses me now. Thanks Ian
Re: [Haskell-cafe] invalid character encoding
On Thu, Mar 17, 2005 at 06:22:25AM +, Ian Lynagh wrote: On Tue, Mar 15, 2005 at 10:44:28AM +, Ross Paterson wrote: You can select binary I/O using the openBinaryFile and hSetBinaryMode functions from System.IO. After that, the Chars you get from that Handle are actually bytes. What about the ones sent to it? Are all the following results intentional? Am I doing something stupid? No, I was. Output primitives other than hPutChar were ignoring binary mode (and Hugs has more of these things as primitives than GHC does). Now fixed in CVS (rev. 1.95 of src/char.c).
Re: [Haskell-cafe] invalid character encoding
On Thu, Mar 17, 2005 at 06:22:25AM +, Ian Lynagh wrote: Incidentally, make check in CVS hugs said:

cd tests
sh testScript | egrep -v '^--( |-)'
./../src/hugs +q -w -pHugs: static/mod154.hs /dev/null
expected stdout not matched by reality
*** static/Loaded.output Fri Jul 19 22:41:51 2002
--- /tmp/runtest11949.3 Thu Mar 17 05:46:05 2005
***
*** 1,2
! Type :? for help
  Hugs:[Leaving Hugs]
--- 1,3
! ERROR static/mod154.hs - Conflicting exports of entity sort
! *** Could refer to Data.List.sort or M.sort
  Hugs:[Leaving Hugs]

This is a documented bug (though the notes in tests ought to mention this too).
Re: [Haskell-cafe] invalid character encoding
Marcin 'Qrczak' Kowalczyk wrote: Glynn Clements [EMAIL PROTECTED] writes: It should be possible to specify the encoding explicitly. Conversely, it shouldn't be possible to avoid specifying the encoding explicitly. What encoding should a binding to readline or curses use? Curses in C comes in two flavors: the traditional byte version and a wide character version. The second version is easy if we can assume that wchar_t is Unicode, but it's not always available and until recently in ncurses it was buggy. Let's assume we are using the byte version. How to encode strings? The (non-wchar) curses API functions take byte strings (char*), so the Haskell bindings should take CString or [Word8] arguments. If you provide wrapper functions which take String arguments, either they should have an encoding argument or the encoding should be a mutable per-terminal setting. A terminal uses an ASCII-compatible encoding. Wide character version of curses convert characters to the locale encoding, and byte version passes bytes unchanged. This means that if a Haskell binding to the wide character version does the obvious thing and passes Unicode directly, then an equivalent behavior can be obtained from the byte version (only limited to 256-character encodings) by using the locale encoding. I don't know enough about the wchar version of curses to comment on that. I do know that, to work reliably, the normal (byte) version of curses needs to pass printable bytes through unmodified. It is possible for curses to be used with a terminal which doesn't use the locale's encoding. Specifically, a single process may use curses with multiple terminals with differing encodings, e.g. an airport public information system displaying information in multiple languages. Also, it's quite common to use non-standard encodings with terminals (e.g. codepage 437, which has graphic characters beyond the ACS_* set which terminfo understands). 
The locale encoding is the right encoding to use for conversion of the result of strerror, gai_strerror, the msg member of gzip compressor state etc. When an I/O error occurs and the error code is translated to a Haskell exception and then shown to the user, why would the application need to specify the encoding and how? Because the application may be using multiple locales/encodings. Having had to do this in C (i.e. repeatedly calling setlocale() to select the correct encoding), I would much prefer to have been able to pass the locale as a parameter. [The most common example is printf(%f). You need to use the C locale (decimal point) for machine-readable text but the user's locale (locale-specific decimal separator) for human-readable text. This isn't directly related to encodings per se, but a good example of why parameters are preferable to state.] If application code doesn't want to use the locale's encoding, it shouldn't be shoe-horned into doing so because a library developer decided to duck the encoding issues by grabbing whatever encoding was readily to hand (i.e. the locale's encoding). If a C library is written with the assumption that texts are in the locale encoding, a Haskell binding to such a library should respect that assumption. C libraries which use the locale do so as a last resort. K&R C completely ignored I18N issues. ANSI C added the locale mechanism as a hack to provide minimal I18N support while maintaining backward compatibility, in a minimally-intrusive manner. The only reason that the C locale mechanism isn't a major nuisance is that you can largely ignore it altogether. Code which requires real I18N can use other mechanisms, and code which doesn't require any I18N can just pass byte strings around and leave encoding issues to code which actually has enough context to handle them correctly. Only some libraries allow working with different, explicitly specified encodings. 
Many libraries don't, especially if the texts are not the core of the library functionality but error messages. And most such libraries just treat text as byte strings. They don't care about their interpretation, or even whether or not they are valid in the locale's encoding. -- Glynn Clements [EMAIL PROTECTED]
Re: [Haskell-cafe] invalid character encoding
John Meacham wrote: It doesn't affect functions added by the hierarchical libraries, i.e. those functions are safe only with the ASCII subset. (There is a vague plan to make Foreign.C.String conform to the FFI spec, which mandates locale-based encoding, and thus would change all those, but it's still up in the air.) Hmm. I'm not convinced that automatically converting to the current locale is the ideal behaviour (it'd certainly break all my programs!). Certainly a function for converting into the encoding of the current locale would be useful for many users, but it's important to be able to know the encoding with certainty. It should only be the default, not the only option. I'm not sure that it should be available at all. It should be possible to specify the encoding explicitly. Conversely, it shouldn't be possible to avoid specifying the encoding explicitly. Personally, I wouldn't provide an all-in-one convert String to CString using locale's encoding function, just in case anyone was tempted to actually use it. But this is exactly what is needed for most C library bindings. I very much doubt that most is accurate. C functions which take a char* fall into three main cases: 1. Unspecified encoding, i.e. it's a string of bytes, not characters. 2. Locale's encoding, as determined by nl_langinfo(CODESET); essentially, whatever was set with setlocale(LC_CTYPE), defaulting to C/POSIX if setlocale() hasn't been called. 3. Fixed encoding, e.g. UTF-8, ISO-2022, US-ASCII (or EBCDIC on IBM mainframes). Historically, library functions have tended to fall into category 1 unless they *need* to know the interpretation of a given byte or sequence of bytes (e.g. ctype.h), in which case they fall into category 2. Most of libc falls into category 1, with a minority of functions in category 2. 
Code which is designed to handle multiple languages simultaneously is more likely to fall into category 3, using one of the universal encodings (typically ISO-2022 in southeast Asia and UTF-8 elsewhere). E.g. Gtk-2.x uses UTF-8 almost exclusively, although you can force the use of the locale's encoding for filenames (if you have filenames in multiple encodings, you lose; filenames using the wrong encoding simply don't appear in file selectors). -- Glynn Clements [EMAIL PROTECTED]
Re: [Haskell-cafe] invalid character encoding
Glynn Clements [EMAIL PROTECTED] writes: The (non-wchar) curses API functions take byte strings (char*), so the Haskell bindings should take CString or [Word8] arguments. Programmers will not want to use such an interface. When they want to display a string, it will be in the Haskell String type. And it prevents having a single Haskell interface which uses either the narrow or wide version of the curses interface, depending on what is available. If you provide wrapper functions which take String arguments, either they should have an encoding argument or the encoding should be a mutable per-terminal setting. There is already a mutable setting. It's called locale. I don't know enough about the wchar version of curses to comment on that. It uses wcsrtombs or equivalents to display characters. And the reverse to interpret keystrokes. It is possible for curses to be used with a terminal which doesn't use the locale's encoding. No, it will break under the new wide character curses API, and it will confuse programs which use the old narrow character API. The user (or the administrator) is responsible for matching the locale encoding with the terminal encoding. Also, it's quite common to use non-standard encodings with terminals (e.g. codepage 437, which has graphic characters beyond the ACS_* set which terminfo understands). curses don't support that. The locale encoding is the right encoding to use for conversion of the result of strerror, gai_strerror, the msg member of gzip compressor state etc. When an I/O error occurs and the error code is translated to a Haskell exception and then shown to the user, why would the application need to specify the encoding and how? Because the application may be using multiple locales/encodings. But strerror always returns messages in the locale encoding. Just like Gtk+2 always accepts texts in UTF-8. For compatibility the default locale is C, but new programs which are prepared for I18N should do setlocale(LC_CTYPE, "") and setlocale(LC_MESSAGES, ""). 
There are places where the encoding is settable independently, or stored explicitly. For them Haskell should have withCString / peekCString / etc. with an explicit encoding. And there are places which use the locale encoding instead of having a separate switch. [The most common example is printf(%f). You need to use the C locale (decimal point) for machine-readable text but the user's locale (locale-specific decimal separator) for human-readable text. This is a different thing, and it is what IMHO C did wrong. This isn't directly related to encodings per se, but a good example of why parameters are preferable to state.] The LC_* environment variables are the parameters for the encoding. There is no other convention to pass the encoding to be used for textual output to stdout, for example. C libraries which use the locale do so as a last resort. No, they do it by default. The only reason that the C locale mechanism isn't a major nuisance is that you can largely ignore it altogether. Then how would a Haskell program know what encoding to use for stdout messages? How would it know how to interpret filenames for graphical display? Do you want to invent a separate mechanism for communicating that, so that an administrator has to set up a dozen environment variables and teach each program separately about the encoding it should assume by default? We had this mess 10 years ago, and parts of it are still alive today - you must sometimes configure xterm or Emacs separately, but it's becoming more common that programs know to use the system-supplied setting and don't have to be configured separately. Code which requires real I18N can use other mechanisms, and code which doesn't require any I18N can just pass byte strings around and leave encoding issues to code which actually has enough context to handle them correctly. Haskell can't just pass byte strings around without turning the Unicode support into a joke (which it is now). 
-- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: [Haskell-cafe] invalid character encoding
Glynn Clements [EMAIL PROTECTED] writes: E.g. Gtk-2.x uses UTF-8 almost exclusively, although you can force the use of the locale's encoding for filenames (if you have filenames in multiple encodings, you lose; filenames using the wrong encoding simply don't appear in file selectors). Actually they do appear, even though you can't type their names from the keyboard. The name shown in the GUI used to be escaped in different ways by different programs or even different places in one program (question marks, %hex escapes \oct escapes), but recently they added some functions to glib to make the behavior uniform. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: [Haskell-cafe] invalid character encoding
I cannot help feeling that all this multi-language support is a mess. All strings should be coded in a universal encoding (like UTF8) so that the code for a character is the same independant of locale. It seems stupid that the locale affects the character encodings... the code for an 'a' should be the same all over the world... as should the code for a particular japanese character. In other words the locale should have no affect on character encodings, it should select between multi-lingual error messages which are supplied as distinct strings for each region. While we may have to inter-operate with 'C' code, we could have a Haskell library that does things properly. Keean. Marcin 'Qrczak' Kowalczyk wrote: Glynn Clements [EMAIL PROTECTED] writes: The (non-wchar) curses API functions take byte strings (char*), so the Haskell bindings should take CString or [Word8] arguments. Programmers will not want to use such interface. When they want to display a string, it will be in Haskell String type. And it prevents having a single Haskell interface which uses either the narrow or wide version of curses interface, depending on what is available. If you provide wrapper functions which take String arguments, either they should have an encoding argument or the encoding should be a mutable per-terminal setting. There is already a mutable setting. It's called locale. I don't know enough about the wchar version of curses to comment on that. It uses wcsrtombs or eqiuvalents to display characters. And the reverse to interpret keystrokes. It is possible for curses to be used with a terminal which doesn't use the locale's encoding. No, it will break under the new wide character curses API, and it will confuse programs which use the old narrow character API. The user (or the administrator) is responsible for matching the locale encoding with the terminal encoding. Also, it's quite common to use non-standard encodings with terminals (e.g. 
codepage 437, which has graphic characters beyond the ACS_* set which terminfo understands). curses don't support that. The locale encoding is the right encoding to use for conversion of the result of strerror, gai_strerror, msg member of gzip compressor state etc. When an I/O error occurs and the error code is translated to a Haskell exception and then shown to the user, why would the application need to specify the encoding and how? Because the application may be using multiple locales/encodings. But strerror always returns messages in the locale encoding. Just like Gtk+2 always accepts texts in UTF-8. For compatibility the default locale is C, but new programs which are prepared for I18N should do setlocale(LC_CTYPE, ) and setlocale(LC_MESSAGES, ). There are places where the encoding is settable independently, or stored explicitly. For them Haskell should have withCString / peekCString / etc. with an explicit encoding. And there are places which use the locale encoding instead of having a separate switch. [The most common example is printf(%f). You need to use the C locale (decimal point) for machine-readable text but the user's locale (locale-specific decimal separator) for human-readable text. This is a different thing, and it is what IMHO C did wrong. This isn't directly related to encodings per se, but a good example of why parameters are preferable to state.] The LC_* environment variables are the parameters for the encoding. There is no other convention to pass the encoding to be used for textual output to stdout for example. C libraries which use the locale do so as a last resort. No, they do it by default. The only reason that the C locale mechanism isn't a major nuisance is that you can largely ignore it altogether. Then how would a Haskell program know what encoding to use for stdout messages? How would it know how to interpret filenames for graphical display? 
Do you want to invent a separate mechanism for communicating that, so that an administrator has to set up a dozen environment variables and teach each program separately about the encoding it should assume by default? We had this mess 10 years ago, and parts of it are still alive today - you must sometimes configure xterm or Emacs separately, but it's becoming more common that programs know to use the system-supplied setting and don't have to be configured separately. Code which requires real I18N can use other mechanisms, and code which doesn't require any I18N can just pass byte strings around and leave encoding issues to code which actually has enough context to handle them correctly. Haskell can't just pass byte strings around without turning the Unicode support into a joke (which it is now). ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] invalid character encoding
Marcin 'Qrczak' Kowalczyk wrote: If you provide wrapper functions which take String arguments, either they should have an encoding argument or the encoding should be a mutable per-terminal setting. There is already a mutable setting. It's called the locale. It isn't a per-terminal setting. It is possible for curses to be used with a terminal which doesn't use the locale's encoding. No, it will break under the new wide character curses API, Or expose the fact that the WC API is broken, depending upon your POV. and it will confuse programs which use the old narrow character API. It has no effect on the *byte* API. Characters don't come into it. The user (or the administrator) is responsible for matching the locale encoding with the terminal encoding. Which is rather hard to do if you have multiple encodings. Also, it's quite common to use non-standard encodings with terminals (e.g. codepage 437, which has graphic characters beyond the ACS_* set which terminfo understands). curses doesn't support that. Sure it does. You pass the appropriate bytes to waddstr() etc and they get sent to the terminal as-is. Curses doesn't have ACS_* macros for those characters, but that doesn't mean that you can't use them. The locale encoding is the right encoding to use for conversion of the result of strerror, gai_strerror, the msg member of gzip compressor state etc. When an I/O error occurs and the error code is translated to a Haskell exception and then shown to the user, why would the application need to specify the encoding and how? Because the application may be using multiple locales/encodings. But strerror always returns messages in the locale encoding. Sorry, I misread that paragraph. I replied to "why would ..." without thinking about the context. When you know that a string is in the locale's encoding, you need to use it for the conversion. In that case you need to do the conversion (or at least record the actual encoding) immediately, in case the locale gets switched. 
Just like Gtk+2 always accepts texts in UTF-8. Unfortunately. The text probably originated in an encoding other than UTF-8, and will probably end up getting displayed using a font which is indexed using the original encoding (rather than e.g. UCS-2/4). Converting to Unicode then back again just introduces the potential for errors. [Particularly for CJK where, due to Han unification, Chinese characters may mutate into Japanese characters, or vice-versa. Fortunately, that doesn't seem to have started any wars. Yet.] For compatibility the default locale is C, but new programs which are prepared for I18N should do setlocale(LC_CTYPE, "") and setlocale(LC_MESSAGES, ""). In practice, you end up continuously calling setlocale(LC_CTYPE, "") and setlocale(LC_CTYPE, "C"), depending upon whether the text is meant to be human-readable (locale-dependent) or a machine-readable format (locale-independent, i.e. C locale). [The most common example is printf("%f"). You need to use the C locale (decimal point) for machine-readable text but the user's locale (locale-specific decimal separator) for human-readable text. This is a different thing, and it is what IMHO C did wrong. It's a different example of the same problem. I agree that C did it wrong; I'm objecting to the implication that Haskell should make the same mistakes. This isn't directly related to encodings per se, but a good example of why parameters are preferable to state.] The LC_* environment variables are the parameters for the encoding. But they are only really parameters at the exec() level. Once the program starts, the locale settings become global mutable state. I would have thought that, more than anyone else, the readership of this list would understand what's bad about that concept. There is no other convention to pass the encoding to be used for textual output to stdout for example. That's up to the application. 
Environment variables are a convenience; there's no reason why you can't have a command-line switch to select the encoding. For more complex applications, you often have user-selectable options and/or encodings specified in the data which you handle. Another problem with having a single locale: if a program isn't working and you need to communicate with its developers, you will often have to run the program in an English locale just so that you will get error messages which the developers understand. C libraries which use the locale do so as a last resort. No, they do it by default. By default, libc uses the C locale. setlocale() includes a convenience option to use the LC_* variables. Other libraries may or may not use the locale settings, and plenty of code will misbehave if the locale is wrong (e.g. using fprintf("%f") without explicitly setting the C locale first will do the wrong thing if you're trying to generate VRML/DXF/whatever files). Beyond that, libc uses the locale mechanism because it was the
Re: [Haskell-cafe] invalid character encoding
Marcin 'Qrczak' Kowalczyk wrote: E.g. Gtk-2.x uses UTF-8 almost exclusively, although you can force the use of the locale's encoding for filenames (if you have filenames in multiple encodings, you lose; filenames using the wrong encoding simply don't appear in file selectors). Actually they do appear, even though you can't type their names from the keyboard. The name shown in the GUI used to be escaped in different ways by different programs or even different places in one program (question marks, %hex escapes, \oct escapes), but recently they added some functions to glib to make the behavior uniform. In the last version of Gtk-2.x which I tried, invalid filenames are just omitted from the list. Gtk-1.x displayed them (I think with question marks, but it may have been a box). I've just tried with a more recent version (2.6.2); the default behaviour is similar, although you can now get around the issue by using G_FILENAME_ENCODING=ISO-8859-1. Of course, if your locale is a long way from ISO-8859-1, that isn't a particularly good solution. The best test case would be a system used predominantly by Japanese, where (apparently) it's common to have a mixture of both EUC-JP and Shift-JIS filenames (occasionally wrapped in ISO-2022, but usually raw). -- Glynn Clements [EMAIL PROTECTED] ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
RE: [Haskell-cafe] invalid character encoding
On 16 March 2005 03:54, Ian Lynagh wrote: On Tue, Mar 15, 2005 at 10:44:28AM +, Ross Paterson wrote: On Mon, Mar 14, 2005 at 07:38:09PM -0600, John Goerzen wrote: I've got some gzip (and Ian Lynagh's Inflate) code that breaks under the new hugs with: handle: IO.getContents: protocol error (invalid character encoding) What is going on, and how can I fix it? A Haskell 98 Handle is a character stream, and doesn't support binary I/O. This would have bitten you sooner or later on systems that do CRLF conversion, but Hugs is now much stricter, because character streams now use the encoding determined by the current locale (for the C locale, that means ASCII only). Do you have a list of functions which behave differently in the new release to how they did in the previous release? (I'm not interested in changes that will affect only whether something compiles, not how it behaves given it compiles both before and after). Simons, Malcolm, are there any such functions in the new ghc/nhc98? Also, are you all agreed that the hugs interpretation of the report is correct, and thus ghc at least is buggy in this respect? (I'm afraid I haven't been able to test nhc98 yet). GHC (and nhc98) assumes a locale of ISO8859-1 for I/O. You could consider that to be a bug, I suppose. We don't plan to do anything about it in the context of the current IO library, at least. Cheers, Simon ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] invalid character encoding
On Wed, Mar 16, 2005 at 03:54:19AM +, Ian Lynagh wrote: Do you have a list of functions which behave differently in the new release to how they did in the previous release? (I'm not interested in changes that will affect only whether something compiles, not how it behaves given it compiles both before and after). I got lost in the negatives here. It affects all Haskell 98 primitives that do character I/O, or that exchange C strings with the C library. It doesn't affect functions added by the hierarchical libraries, i.e. those functions are safe only with the ASCII subset. (There is a vague plan to make Foreign.C.String conform to the FFI spec, which mandates locale-based encoding, and thus would change all those, but it's still up in the air.) Finally, the hugs behaviour seems a little odd to me. The below shows 4 cases where iconv complains when asked to convert utf8 to utf8, but hugs only gives an error in one of them. In the others it just truncates the input. Is this really correct? It also seems to behave the same for me regardless of whether I export LC_CTYPE to en_GB.UTF-8 or C. It's a bug: an unrecognized encoding at the end of the input was being ignored instead of triggering the exception. Now fixed in CVS (rev. 1.14 of src/char.c if anyone's backporting). It was an accident of this example that the behaviour in all locales was the same. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] invalid character encoding
On Wed, 2005-03-16 at 11:55 +, Ross Paterson wrote: On Wed, Mar 16, 2005 at 03:54:19AM +, Ian Lynagh wrote: Do you have a list of functions which behave differently in the new release to how they did in the previous release? (I'm not interested in changes that will affect only whether something compiles, not how it behaves given it compiles both before and after). I got lost in the negatives here. It affects all Haskell 98 primitives that do character I/O, or that exchange C strings with the C library. It doesn't affect functions added by the hierarchical libraries, i.e. those functions are safe only with the ASCII subset. (There is a vague plan to make Foreign.C.String conform to the FFI spec, which mandates locale-based encoding, and thus would change all those, but it's still up in the air.) Hmm. I'm not convinced that automatically converting to the current locale is the ideal behaviour (it'd certainly break all my programs!). Certainly a function for converting into the encoding of the current locale would be useful for many users but it's important to be able to know the encoding with certainty. For example some libraries (eg Gtk+) take all strings in UTF-8 irrespective of the current locale (it does locale-dependent conversions on IO etc but the internal representation is always UTF8). We do the conversion to UTF8 on the Haskell side and so produce a byte string which we marshal using the FFI CString functions. If the implementations get fixed to conform to the FFI spec, I suppose we could roll our own version of withCString that marshals [Word8] -> char*. Duncan ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] invalid character encoding
On Wed, 2005-03-16 at 13:09 +, Duncan Coutts wrote: On Wed, 2005-03-16 at 11:55 +, Ross Paterson wrote: It doesn't affect functions added by the hierarchical libraries, i.e. those functions are safe only with the ASCII subset. (There is a vague plan to make Foreign.C.String conform to the FFI spec, which mandates locale-based encoding, and thus would change all those, but it's still up in the air.) Hmm. I'm not convinced that automatically converting to the current locale is the ideal behaviour (it'd certainly break all my programs!). Certainly a function for converting into the encoding of the current locale would be useful for many users but it's important to be able to know the encoding with certainty. For example some libraries (eg Gtk+) take all strings in UTF-8 irrespective of the current locale (it does locale-dependent conversions on IO etc but the internal representation is always UTF8). We do the conversion to UTF8 on the Haskell side and so produce a byte string which we marshal using the FFI CString functions. Silly me! There are C marshaling functions that are specified to do just this but I never noticed them before! withCAString and similar functions treat Haskell Strings as byte strings. Duncan ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] invalid character encoding
Duncan Coutts [EMAIL PROTECTED] writes: It doesn't affect functions added by the hierarchical libraries, i.e. those functions are safe only with the ASCII subset. (There is a vague plan to make Foreign.C.String conform to the FFI spec, which mandates locale-based encoding, and thus would change all those, but it's still up in the air.) Hmm. I'm not convinced that automatically converting to the current locale is the ideal behaviour (it'd certainly break all my programs!). Certainly a function for converting into the encoding of the current locale would be useful for many users but it's important to be able to know the encoding with certainty. It should only be the default, not the only option. It should be possible to specify the encoding explicitly. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/ ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] invalid character encoding
Marcin 'Qrczak' Kowalczyk wrote: It doesn't affect functions added by the hierarchical libraries, i.e. those functions are safe only with the ASCII subset. (There is a vague plan to make Foreign.C.String conform to the FFI spec, which mandates locale-based encoding, and thus would change all those, but it's still up in the air.) Hmm. I'm not convinced that automatically converting to the current locale is the ideal behaviour (it'd certainly break all my programs!). Certainly a function for converting into the encoding of the current locale would be useful for many users but it's important to be able to know the encoding with certainty. It should only be the default, not the only option. I'm not sure that it should be available at all. It should be possible to specify the encoding explicitly. Conversely, it shouldn't be possible to avoid specifying the encoding explicitly. Personally, I wouldn't provide an all-in-one "convert String to CString using the locale's encoding" function, just in case anyone was tempted to actually use it. The decision as to the encoding belongs in application code; not in (most) libraries, and definitely not in the language. [Libraries dealing with file formats or communication protocols which mandate a specific encoding are an exception. But they will be using a fixed encoding, not the locale's encoding.] If application code chooses to use the locale's encoding, it can retrieve it and then pass it as the encoding argument to any applicable functions. If application code doesn't want to use the locale's encoding, it shouldn't be shoe-horned into doing so because a library developer decided to duck the encoding issues by grabbing whatever encoding was readily to hand (i.e. the locale's encoding). -- Glynn Clements [EMAIL PROTECTED] ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] invalid character encoding
Glynn Clements [EMAIL PROTECTED] writes: It should be possible to specify the encoding explicitly. Conversely, it shouldn't be possible to avoid specifying the encoding explicitly. What encoding should a binding to readline or curses use? Curses in C comes in two flavors: the traditional byte version and a wide character version. The second version is easy if we can assume that wchar_t is Unicode, but it's not always available and until recently in ncurses it was buggy. Let's assume we are using the byte version. How to encode strings? A terminal uses an ASCII-compatible encoding. Wide character version of curses convert characters to the locale encoding, and byte version passes bytes unchanged. This means that if a Haskell binding to the wide character version does the obvious thing and passes Unicode directly, then an equivalent behavior can be obtained from the byte version (only limited to 256-character encodings) by using the locale encoding. The locale encoding is the right encoding to use for conversion of the result of strerror, gai_strerror, msg member of gzip compressor state etc. When an I/O error occurs and the error code is translated to a Haskell exception and then shown to the user, why would the application need to specify the encoding and how? If application code doesn't want to use the locale's encoding, it shouldn't be shoe-horned into doing so because a library developer decided to duck the encoding issues by grabbing whatever encoding was readily to hand (i.e. the locale's encoding). If a C library is written with the assumption that texts are in the locale encoding, a Haskell binding to such library should respect that assumption. Only some libraries allow to work with different, explicitly specified encodings. Many libraries don't, especially if the texts are not the core of the library functionality but error messages. 
-- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/ ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] invalid character encoding
On Wed, Mar 16, 2005 at 05:13:25PM +, Glynn Clements wrote: Marcin 'Qrczak' Kowalczyk wrote: It doesn't affect functions added by the hierarchical libraries, i.e. those functions are safe only with the ASCII subset. (There is a vague plan to make Foreign.C.String conform to the FFI spec, which mandates locale-based encoding, and thus would change all those, but it's still up in the air.) Hmm. I'm not convinced that automatically converting to the current locale is the ideal behaviour (it'd certainly break all my programs!). Certainly a function for converting into the encoding of the current locale would be useful for many users but it's important to be able to know the encoding with certainty. It should only be the default, not the only option. I'm not sure that it should be available at all. It should be possible to specify the encoding explicitly. Conversely, it shouldn't be possible to avoid specifying the encoding explicitly. Personally, I wouldn't provide an all-in-one "convert String to CString using the locale's encoding" function, just in case anyone was tempted to actually use it. But this is exactly what is needed for most C library bindings. Which is why I had to write my own and proposed it to the FFI. Most C libraries expect char * to be in the standard encoding of the current locale. When a binding explicitly uses another encoding, then great, we can use different marshaling functions. In any case, we need tools to be able to conform to the common cases of ascii-only (withCAString) and current locale (withCString). withUTF8String would be a nice addition, but is much less important to come standard as it can easily be written by end users, unlike locale-specific versions which are necessarily system dependent. John -- John Meacham - repetae.netjohn ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] invalid character encoding
John Meacham [EMAIL PROTECTED] writes: In any case, we need tools to be able to conform to the common cases of ascii-only (withCAString) and current locale (withCString). withUTF8String would be a nice addition, but is much less important to come standard as it can easily be written by end users, unlike locale-specific versions which are necessarily system dependent. IMHO the encoding should be a parameter of an extended variant of withCString (and peekCString etc.). We need a framework for implementing encoders/decoders first. A problem with designing the framework is that it should support both pure Haskell conversions and C functions like iconv which work on arrays. We must also provide a way to signal errors. A bonus is a way to handle errors coming from another recoder without causing it to fail completely. That way one could add a fallback for unrepresentable characters, e.g. HTML entities or approximations with stripped accents. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/ ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] invalid character encoding
On Tue, Mar 15, 2005 at 10:44:28AM +, Ross Paterson wrote: You can select binary I/O using the openBinaryFile and hSetBinaryMode functions from System.IO. After that, the Chars you get from that Handle are actually bytes. What about the ones sent to it? Are all the following results intentional? Am I doing something stupid? [in brief: hugs' (hPutStr h) now behaves differently to (mapM_ (hPutChar h)), and ghc writes the empty string for both when told to write "\128"] Running the following with new ghc 6.4 and hugs 20050308 or 20050317: echo 'import System.IO; import System.Environment; main = do [o] <- getArgs; ho <- openBinaryFile o WriteMode; hPutStr ho "\128"' > run1.hs echo 'import System.IO; import System.Environment; main = do [o] <- getArgs; ho <- openBinaryFile o WriteMode; mapM_ (hPutChar ho) "\128"' > run2.hs runhugs run1.hs hugs1 runhugs run2.hs hugs2 runghc run1.hs ghc1 runghc run2.hs ghc2 ls -l hugs1 hugs2 ghc1 ghc2 for f in hugs1 hugs2 ghc1 ghc2; do echo $f; hexdump -C $f; done gives: -rw-r--r-- 1 igloo igloo 0 Mar 17 06:15 ghc1 -rw-r--r-- 1 igloo igloo 0 Mar 17 06:15 ghc2 -rw-r--r-- 1 igloo igloo 1 Mar 17 06:15 hugs1 -rw-r--r-- 1 igloo igloo 1 Mar 17 06:15 hugs2 hugs1 00000000 3f |?| 00000001 hugs2 00000000 80 |.| 00000001 ghc1 ghc2 With ghc 6.2.2 and hugs November 2003 I get: -rw-r--r-- 1 igloo igloo 1 Mar 17 06:16 ghc1 -rw-r--r-- 1 igloo igloo 1 Mar 17 06:16 ghc2 -rw-r--r-- 1 igloo igloo 1 Mar 17 06:16 hugs1 -rw-r--r-- 1 igloo igloo 1 Mar 17 06:16 hugs2 hugs1 00000000 80 |.| 00000001 hugs2 00000000 80 |.| 00000001 ghc1 00000000 80 |.| 00000001 ghc2 00000000 80 |.| 00000001 Incidentally, make check in CVS hugs said: cd tests sh testScript | egrep -v '^--( |-)' ./../src/hugs +q -w -pHugs: static/mod154.hs /dev/null expected stdout not matched by reality *** static/Loaded.output Fri Jul 19 22:41:51 2002 --- /tmp/runtest11949.3 Thu Mar 17 05:46:05 2005 *** *** 1,2 ! Type :? for help Hugs:[Leaving Hugs] --- 1,3 ! ERROR "static/mod154.hs" - Conflicting exports of entity "sort" ! 
*** Could refer to Data.List.sort or M.sort Hugs:[Leaving Hugs] Thanks Ian ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] invalid character encoding
On Mon, Mar 14, 2005 at 07:38:09PM -0600, John Goerzen wrote: I've got some gzip (and Ian Lynagh's Inflate) code that breaks under the new hugs with: handle: IO.getContents: protocol error (invalid character encoding) What is going on, and how can I fix it? A Haskell 98 Handle is a character stream, and doesn't support binary I/O. This would have bitten you sooner or later on systems that do CRLF conversion, but Hugs is now much stricter, because character streams now use the encoding determined by the current locale (for the C locale, that means ASCII only). You can select binary I/O using the openBinaryFile and hSetBinaryMode functions from System.IO. After that, the Chars you get from that Handle are actually bytes. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] invalid character encoding
On Tue, Mar 15, 2005 at 10:44:28AM +, Ross Paterson wrote: On Mon, Mar 14, 2005 at 07:38:09PM -0600, John Goerzen wrote: I've got some gzip (and Ian Lynagh's Inflate) code that breaks under the new hugs with: handle: IO.getContents: protocol error (invalid character encoding) What is going on, and how can I fix it? A Haskell 98 Handle is a character stream, and doesn't support binary I/O. This would have bitten you sooner or later on systems that do CRLF Yes, probably so.. conversion, but Hugs is now much stricter, because character streams now use the encoding determined by the current locale (for the C locale, that means ASCII only). Hmm, this seems to be completely undocumented. So yes, I'll try using openBinaryFile, but the docs I have seen still talk only about CRLF and ^Z. Anyway, I'm interested in this new feature (I assume GHC 6.4 has it as well?) Would it, for instance, automatically convert from Latin-1 to UTF-16 on read, and the inverse on write? Or to/from UTF-8? Thanks, -- John ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] invalid character encoding
On Tue, Mar 15, 2005 at 08:12:48AM -0600, John Goerzen wrote: [...] but Hugs is now much stricter, because character streams now use the encoding determined by the current locale (for the C locale, that means ASCII only). Hmm, this seems to be completely undocumented. It's mentioned in the release history in the User's Guide, which refers to section 3.3 for (some) more details. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] invalid character encoding
On Tue, Mar 15, 2005 at 10:44:28AM +, Ross Paterson wrote: On Mon, Mar 14, 2005 at 07:38:09PM -0600, John Goerzen wrote: I've got some gzip (and Ian Lynagh's Inflate) code that breaks under the new hugs with: handle: IO.getContents: protocol error (invalid character encoding) What is going on, and how can I fix it? A Haskell 98 Handle is a character stream, and doesn't support binary I/O. This would have bitten you sooner or later on systems that do CRLF conversion, but Hugs is now much stricter, because character streams now use the encoding determined by the current locale (for the C locale, that means ASCII only). Do you have a list of functions which behave differently in the new release to how they did in the previous release? (I'm not interested in changes that will affect only whether something compiles, not how it behaves given it compiles both before and after). Simons, Malcolm, are there any such functions in the new ghc/nhc98? Also, are you all agreed that the hugs interpretation of the report is correct, and thus ghc at least is buggy in this respect? (I'm afraid I haven't been able to test nhc98 yet). Finally, the hugs behaviour seems a little odd to me. The below shows 4 cases where iconv complains when asked to convert utf8 to utf8, but hugs only gives an error in one of them. In the others it just truncates the input. Is this really correct? It also seems to behave the same for me regardless of whether I export LC_CTYPE to en_GB.UTF-8 or C. 
Thanks Ian printf "\x00\x7F" > inp1 printf "\x00\x80" > inp2 printf "\x00\xC4" > inp3 printf "\xFF\xFF" > inp4 printf "\xb1\x41\x00\x03\x65\x6d\x70\x74\x79\x00\x03\x00\x00\x00\x00\x00" > inp5 echo 'main = do xs <- getContents; print xs' > run.hs for i in `seq 1 5`; do runhugs run.hs < inp$i; done for i in `seq 1 5`; do runghc6 run.hs < inp$i; done for i in `seq 1 5`; do echo $i; iconv -f utf8 -t utf8 inp$i; done which gives me the following output: $ for i in `seq 1 5`; do runhugs run.hs < inp$i; done "\NUL\DEL" "\NUL" "\NUL" Program error: stdin: IO.getContents: protocol error (invalid character encoding) $ for i in `seq 1 5`; do runghc6 run.hs < inp$i; done "\NUL\DEL" "\NUL\128" "\NUL\196" "\255\255" "\177A\NUL\ETXempty\NUL\ETX\NUL\NUL\NUL\NUL\NUL" $ for i in `seq 1 5`; do echo $i; iconv -f utf8 -t utf8 inp$i; done 1 2 iconv: illegal input sequence at position 1 3 iconv: incomplete character or shift sequence at end of buffer 4 iconv: illegal input sequence at position 0 5 iconv: illegal input sequence at position 0 $ ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] invalid character encoding
I've got some gzip (and Ian Lynagh's Inflate) code that breaks under the new hugs with: handle: IO.getContents: protocol error (invalid character encoding) What is going on, and how can I fix it? Thanks, John ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe