Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-18 Thread Max Bolingbroke
On 10 November 2011 14:35, Simon Marlow marlo...@gmail.com wrote: Agreed. Committed. I'm wondering if we should also have hSetLocaleEncoding, hSetFileSystemEncoding :: TextEncoding -> IO () and change localeEncoding, fileSystemEncoding :: IO TextEncoding. hSetFileSystemEncoding in
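A minimal sketch of how a settable filesystem encoding can be used from client code. It relies on getFileSystemEncoding and setFileSystemEncoding as exported by GHC.IO.Encoding in later base releases, which is an assumption about the reader's base version; the names proposed in the message above differ. The helpers withFileSystemEncoding and inUtf8 are hypothetical and not part of base.

```haskell
import Control.Exception (bracket)
import GHC.IO.Encoding
  ( TextEncoding, getFileSystemEncoding, setFileSystemEncoding, utf8 )

-- Hypothetical helper: run an action with a temporary filesystem
-- encoding, restoring the previous one afterwards (also on exceptions).
withFileSystemEncoding :: TextEncoding -> IO a -> IO a
withFileSystemEncoding enc action =
  bracket getFileSystemEncoding setFileSystemEncoding $ \_ -> do
    setFileSystemEncoding enc
    action

-- Hypothetical convenience wrapper that forces UTF-8 for one action.
inUtf8 :: IO a -> IO a
inUtf8 = withFileSystemEncoding utf8
```

The encoding is process-global state, so a wrapper like this is only safe when no other thread touches FilePath-based IO at the same time.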

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-10 Thread Max Bolingbroke
On 10 November 2011 00:17, Ian Lynagh ig...@earth.li wrote: On Wed, Nov 09, 2011 at 03:58:47PM +, Max Bolingbroke wrote: (Note that the above outlined problems are problems in the current implementation too Then the proposal seems to me to be strictly better than the current system.

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-10 Thread Max Bolingbroke
On 9 November 2011 16:29, Simon Marlow marlo...@gmail.com wrote: Ok, so since we need something like makePrintable :: FilePath -> String arguably we might as well make that do the locale decoding. That's certainly a good point... You could, but getArgs :: IO [String], not :: IO [FilePath].
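One possible shape for the makePrintable function mentioned above, as a sketch only. It assumes undecodable bytes are represented as lone low surrogates in U+DC80..U+DCFF (the PEP 383 convention discussed elsewhere in the thread) and swaps each one for U+FFFD so the result is safe to display. The thread does not specify this behaviour.

```haskell
import Data.Char (ord)

-- Sketch: replace escape code points with U+FFFD REPLACEMENT CHARACTER.
-- The result is for display only; it can no longer be used to reopen
-- the file, because the original bytes are lost.
makePrintable :: FilePath -> String
makePrintable = map scrub
  where
    scrub c
      | isEscape c = '\xFFFD'
      | otherwise  = c
    -- Assumed escape range: lone low surrogates U+DC80..U+DCFF.
    isEscape c = let n = ord c in n >= 0xDC80 && n <= 0xDCFF
```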

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-10 Thread Simon Marlow
On 09/11/2011 16:42, John Millikin wrote: On Wed, Nov 9, 2011 at 08:04, Simon Marlow marlo...@gmail.com wrote: Ok, I spent most of today adding ByteString alternatives for all of the functions in System.Posix that use FilePath or environment strings. The Haddocks for my augmented unix package

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-10 Thread Simon Marlow
On 10/11/2011 09:28, Max Bolingbroke wrote: Is there any consensus about what to do here? My take is that we should move back to lone surrogates. This: 1. Recovers the roundtrip property, which we appear to believe is essential 2. Removes all the weird problems I outlined earlier that can

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-10 Thread John Millikin
On Thu, Nov 10, 2011 at 03:28, Simon Marlow marlo...@gmail.com wrote: I've done a search/replace and called it RawFilePath.  Ok? Fantastic, thank you very much. ___ Glasgow-haskell-users mailing list Glasgow-haskell-users@haskell.org

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread Max Bolingbroke
On 8 November 2011 11:43, Simon Marlow marlo...@gmail.com wrote: Don't you mean 1 is what we have? Yes, sorry! Failing to roundtrip in some cases, and doing so silently, seems highly suboptimal to me.  I'm sorry I didn't pick up on this at the time (Unicode is a swamp :). I *can* change the

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread Max Bolingbroke
On 7 November 2011 17:32, John Millikin jmilli...@gmail.com wrote: I am also not convinced that it is possible to correctly implement either of these functions if their behavior is dependent on the user's locale. FWIW it's only dependent on the user's locale because whether glibc iconv detects

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread Ian Lynagh
On Wed, Nov 09, 2011 at 11:02:54AM +, Simon Marlow wrote: I would be happy with the surrogate approach I think. Arguably, if you try to treat a string with lone surrogates as Unicode and it fails, then that is a feature: the original string wasn't Unicode. All you can do with an invalid

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread Max Bolingbroke
On 9 November 2011 13:11, Ian Lynagh ig...@earth.li wrote: If we aren't going to guarantee that the encoded string is unicode, then is there any benefit to encoding it in the first place? (I think you mean decoded here - my understanding is that decode :: ByteString -> String, encode :: String ->

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread Max Bolingbroke
On 9 November 2011 11:02, Simon Marlow marlo...@gmail.com wrote: The performance overhead of all this worries me. withCString has taken a huge performance hit, and I think there are people who want to know that there aren't several complex encoding/decoding passes between their Haskell code

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread Simon Marlow
On 08/11/2011 15:42, John Millikin wrote: On Tue, Nov 8, 2011 at 03:04, Simon Marlow marlo...@gmail.com wrote: I really think we should provide the native APIs. The problem is that the System.Posix.Directory API is all in terms of FilePath (=String), and if we gave that a different meaning

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread Simon Marlow
On 09/11/2011 13:11, Ian Lynagh wrote: On Wed, Nov 09, 2011 at 11:02:54AM +, Simon Marlow wrote: I would be happy with the surrogate approach I think. Arguably, if you try to treat a string with lone surrogates as Unicode and it fails, then that is a feature: the original string wasn't

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread John Millikin
On Wed, Nov 9, 2011 at 08:04, Simon Marlow marlo...@gmail.com wrote: Ok, I spent most of today adding ByteString alternatives for all of the functions in System.Posix that use FilePath or environment strings.  The Haddocks for my augmented unix package are here:

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread Simon Marlow
On 09/11/2011 15:58, Max Bolingbroke wrote: (Note that the above outlined problems are problems in the current implementation too -- but the current implementation doesn't even pretend to support U+EFxx characters. Its correctness is entirely dependent on them never showing up, which is why we

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread Ian Lynagh
On Wed, Nov 09, 2011 at 03:58:47PM +, Max Bolingbroke wrote: (Note that the above outlined problems are problems in the current implementation too Then the proposal seems to me to be strictly better than the current system. Under both systems the wrong thing happens when U+EFxx is entered

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread John Lask
My primary concerns are (in order of priority - and I only speak for myself) (a) consistency across platforms (b) minimize (unrequired) performance overhead I would prefer an api which is consistent for both win32, posix or other os which only did as much as what the user (us) wanted for

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-08 Thread Simon Marlow
On 07/11/2011 17:57, Ian Lynagh wrote: On Mon, Nov 07, 2011 at 05:02:32PM +, Simon Marlow wrote: Basically, imagine a reversible transformation: encode :: String -> [Word8] decode :: [Word8] -> String this transformation is applied in the appropriate direction by the IO library to
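The reversible encode/decode pair described above can be illustrated with a deliberately simplified sketch. To stay short it uses ASCII rather than the locale encoding as the underlying codec, and it escapes every non-ASCII byte to a lone low surrogate in U+DC80..U+DCFF, following the PEP 383 convention. The names decodeEscaped and encodeEscaped are hypothetical; this is not GHC's implementation.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Decode bytes as ASCII. Any byte >= 0x80 cannot be ASCII, so it is
-- escaped to the lone surrogate U+DC00 + byte instead of being dropped.
decodeEscaped :: [Word8] -> String
decodeEscaped = map step
  where
    step b
      | b < 0x80  = chr (fromIntegral b)
      | otherwise = chr (0xDC00 + fromIntegral b)

-- Inverse direction: ASCII characters map to themselves, escape
-- characters map back to the byte they stand for, and anything else
-- has no representation in this toy codec.
encodeEscaped :: String -> Maybe [Word8]
encodeEscaped = traverse step
  where
    step c
      | n < 0x80                   = Just (fromIntegral n)
      | n >= 0xDC80 && n <= 0xDCFF = Just (fromIntegral (n - 0xDC00))
      | otherwise                  = Nothing
      where n = ord c
```

With these definitions, encodeEscaped (decodeEscaped bs) gives back Just bs for every byte list bs, which is the roundtrip property the thread is concerned with.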

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-08 Thread Simon Marlow
On 07/11/2011 17:32, John Millikin wrote: On Mon, Nov 7, 2011 at 09:02, Simon Marlow marlo...@gmail.com wrote: I think you might be misunderstanding how the new API works. Basically, imagine a reversible transformation: encode :: String -> [Word8] decode :: [Word8] -> String this

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-08 Thread Simon Marlow
On 02/11/2011 21:40, Max Bolingbroke wrote: On 2 November 2011 20:16, Ian Lynagh ig...@earth.li wrote: Are you saying there's a bug that should be fixed? You can choose between two options: 1. Failing to roundtrip some strings (in our case, those containing the 0xEFNN byte sequences) 2.

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-08 Thread John Millikin
On Tue, Nov 8, 2011 at 03:04, Simon Marlow marlo...@gmail.com wrote: As mentioned earlier in the thread, this behavior is breaking things. Due to an implementation error, programs compiled with GHC 7.2 on POSIX systems cannot open files unless their paths also happen to be valid text according

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-08 Thread wren ng thornton
On 11/8/11 6:04 AM, Simon Marlow wrote: I really think we should provide the native APIs. The problem is that the System.Posix.Directory API is all in terms of FilePath (=String), and if we gave that a different meaning from the System.Directory FilePaths then confusion would ensue. So perhaps

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-07 Thread Simon Marlow
On 06/11/2011 16:56, John Millikin wrote: 2011/11/6 Max Bolingbroke batterseapo...@hotmail.com: On 6 November 2011 04:14, John Millikin jmilli...@gmail.com wrote: For what it's worth, on my Ubuntu system, Nautilus ignores the locale and just treats all paths as either UTF8 or invalid. To me,

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-07 Thread John Millikin
On Mon, Nov 7, 2011 at 09:02, Simon Marlow marlo...@gmail.com wrote: I think you might be misunderstanding how the new API works. Basically, imagine a reversible transformation: encode :: String -> [Word8] decode :: [Word8] -> String this transformation is applied in the appropriate

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-07 Thread Ian Lynagh
On Mon, Nov 07, 2011 at 05:02:32PM +, Simon Marlow wrote: Basically, imagine a reversible transformation: encode :: String -> [Word8] decode :: [Word8] -> String this transformation is applied in the appropriate direction by the IO library to translate filesystem paths into

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-07 Thread Yitzchak Gale
Simon Marlow wrote: It would probably be better to have an abstract FilePath type and to keep the original bytes, decoding on demand. But that is a big change to the API and would break much more code. One day we'll do this properly; for now we have this, which I think is a pretty reasonable

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-07 Thread John Millikin
On Mon, Nov 7, 2011 at 15:39, Yitzchak Gale g...@sefer.org wrote: The problem is that Haskell 98 specifies type FilePath = String. In retrospect, we now know that this is too simplistic. But that's what we have right now. This is *a* problem, but not a particularly major one; the definition of

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-06 Thread Max Bolingbroke
On 6 November 2011 04:14, John Millikin jmilli...@gmail.com wrote: For what it's worth, on my Ubuntu system, Nautilus ignores the locale and just treats all paths as either UTF8 or invalid. To me, this seems like the most reasonable option; the concept of locale encoding is entirely vestigial,

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-06 Thread John Millikin
2011/11/6 Max Bolingbroke batterseapo...@hotmail.com: On 6 November 2011 04:14, John Millikin jmilli...@gmail.com wrote: For what it's worth, on my Ubuntu system, Nautilus ignores the locale and just treats all paths as either UTF8 or invalid. To me, this seems like the most reasonable option;

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-06 Thread Donn Cave
Quoth John Millikin jmilli...@gmail.com, ... One is to give low-level access, using abstractions as close to the real API as possible. In this model, unix would provide functions like [[ rename :: ByteString -> ByteString -> IO () ]], and I would know that it's not going to do anything weird to

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-06 Thread John Lask
For what it is worth, I would like to see both System.IO and Directory export internal functions where the filepath is a raw byte representation. I have utilities that regularly scan 100,000s of files and hash the path, the details of which are irrelevant to this discussion, the point being that

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-06 Thread Daniel Peebles
Can't we just have the usual .Internal module convention, where people who want internals can get at them if they need to, and most people get a simpler interface? It's amazingly frustrating when you have a library that does 99% of what you need it to do, except for one tiny internal detail that

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-05 Thread John Millikin
FYI: I just released new versions of system-filepath and system-fileio, which attempt to work around the changes in GHC 7.2. On Wed, Nov 2, 2011 at 11:55, Max Bolingbroke batterseapo...@hotmail.com wrote: Maybe I'm misunderstanding, but it sounds like you're still trying to treat posix file

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-04 Thread David Brown
On Thu, Nov 03, 2011 at 09:41:32AM +, Max Bolingbroke wrote: On 2 November 2011 21:46, Ganesh Sittampalam gan...@earth.li wrote: The workaround you propose seems a little complex and it might be a bit problematic that 100% roundtripping can't be guaranteed even once your fix is applied. I

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-03 Thread Max Bolingbroke
On 2 November 2011 21:46, Ganesh Sittampalam gan...@earth.li wrote: The workaround you propose seems a little complex and it might be a bit problematic that 100% roundtripping can't be guaranteed even once your fix is applied. I can understand this perspective, although the roundtripping as

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Max Bolingbroke
On 1 November 2011 20:13, John Millikin jmilli...@gmail.com wrote: $ ghci-7.2.1 GHC> import System.Directory GHC> getDirectoryContents "path-test" ["\161\165","\61345\61349","..","."] GHC> readFile "path-test/\161\165" "world\n" GHC> readFile "path-test/\61345\61349" *** Exception: path-test/: openFile:

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Jean-Marie Gaillourdet
Hi, On 01.11.2011, at 19:43, Max Bolingbroke wrote: As I pointed out earlier in the thread you can recover the old behaviour if you really want it by manually reencoding the strings, so I would dispute the claim that it is impossible to fix within the given API. As far as I know, not all

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Max Bolingbroke
On 2 November 2011 10:03, Jean-Marie Gaillourdet j...@gaillourdet.net wrote: As far as I know, not all encodings are reversible. I.e. there are byte sequences which are invalid utf-8. Therefore, decoding and re-encoding might not return the exact same byte sequence. The PEP 383 mechanism

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Max Bolingbroke
On 2 November 2011 09:37, Max Bolingbroke batterseapo...@hotmail.com wrote: On 1 November 2011 20:13, John Millikin jmilli...@gmail.com wrote: $ ghci-7.2.1 GHC> import System.Directory GHC> getDirectoryContents "path-test" ["\161\165","\61345\61349","..","."] GHC> readFile "path-test/\161\165" "world\n"

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Max Bolingbroke
On 2 November 2011 13:53, Max Bolingbroke batterseapo...@hotmail.com wrote: I think the only way to fix this last case in general is to fix iconv itself, so I'm going to see if I can get a patch upstream. Fixing it for people with UTF-8 locales should be enough for 99% of users, though. One

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Ian Lynagh
On Wed, Nov 02, 2011 at 01:29:16PM +, Max Bolingbroke wrote: On 2 November 2011 10:03, Jean-Marie Gaillourdet j...@gaillourdet.net wrote: As far as I know, not all encodings are reversible. I.e. there are byte sequences which are invalid utf-8. Therefore, decoding and re-encoding

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread John Millikin
On Wed, Nov 2, 2011 at 06:53, Max Bolingbroke batterseapo...@hotmail.com wrote: I've got a patch that will work around the issue in most situations by avoiding the iconv code path. With the patch everything will work OK as long as the system locale is one that we have a native-Haskell decoder

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Max Bolingbroke
On 2 November 2011 17:15, John Millikin jmilli...@gmail.com wrote: What package does this patch -- unix, directory, something else? The base package. The problem lay in the implementation of GHC.IO.Encoding.fileSystemEncoding on non-Windows OSes. Maybe I'm misunderstanding, but it sounds like

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Max Bolingbroke
On 2 November 2011 16:29, Ian Lynagh ig...@earth.li wrote: If I understand correctly, you use U+EF00-U+EFFF to encode the characters 0-255 when they are not a valid part of the UTF8 stream. Yes. So why not encode U+EF00 (which in UTF8 is 0xEE 0xBC 0x80) as U+EFEE U+EFBC U+EF80, and so on?
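The arithmetic behind the suggestion quoted above can be checked with a small sketch: encode a code point as UTF-8, then map each resulting byte b to the private-use character U+EF00 + b. The function names are hypothetical, and utf8ThreeBytes only handles the three-byte UTF-8 form (U+0800..U+FFFF) with no range checking, which is enough for U+EF00..U+EFFF.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- UTF-8 bytes of a code point, three-byte form only (U+0800..U+FFFF).
-- Callers must stay inside that range; this sketch does not check.
utf8ThreeBytes :: Char -> [Word8]
utf8ThreeBytes c = map fromIntegral
  [ 0xE0 .|. (n `shiftR` 12)
  , 0x80 .|. ((n `shiftR` 6) .&. 0x3F)
  , 0x80 .|. (n .&. 0x3F)
  ]
  where n = ord c

-- Map a raw byte to its private-use escape character U+EF00 + byte.
escapeByte :: Word8 -> Char
escapeByte b = chr (0xEF00 + fromIntegral b)

-- Escape a private-use character byte by byte, as in the suggestion.
escapePrivateUse :: Char -> String
escapePrivateUse = map escapeByte . utf8ThreeBytes
```

Applied to U+EF00, utf8ThreeBytes yields the bytes 0xEE 0xBC 0x80, so escapePrivateUse produces U+EFEE U+EFBC U+EF80, matching the figures in the message.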

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Ian Lynagh
On Wed, Nov 02, 2011 at 07:02:09PM +, Max Bolingbroke wrote: [snip some stuff I didn't understand. I think I made the mistake of entering a Unicode discussion] This is why the unmodified PEP383 approach is kind of nice - it uses lone surrogate (rather than private use) codepoints to do the

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Max Bolingbroke
On 2 November 2011 19:13, Ian Lynagh ig...@earth.li wrote: [snip some stuff I didn't understand. I think I made the mistake of entering a Unicode discussion] Sorry, perhaps that was too opaque! The problem is that if we commit to support occurrences of the private-use codepoint 0xEF80 then what

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Ian Lynagh
On Wed, Nov 02, 2011 at 07:59:21PM +, Max Bolingbroke wrote: On 2 November 2011 19:13, Ian Lynagh ig...@earth.li wrote: They are allowed to occur in Linux/ext2 filenames, anyway, and I think we ought to be able to handle them correctly if they do. In Python, if a filename is decoded

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Max Bolingbroke
On 2 November 2011 20:16, Ian Lynagh ig...@earth.li wrote: Are you saying there's a bug that should be fixed? You can choose between two options: 1. Failing to roundtrip some strings (in our case, those containing the 0xEFNN byte sequences) 2. Having GHC's decoding functions return strings

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread Ganesh Sittampalam
Hi Max, On 01/11/2011 10:23, Max Bolingbroke wrote: This is my implementation of Python's PEP 383 [1] for Haskell. IMHO this behaviour is much closer to what users expect. For example, getDirectoryContents "." >>= print shows Unicode filenames properly. As a result of this change we were able

behaviour change in getDirectoryContents in GHC 7.2?

2011-11-01 Thread Ganesh Sittampalam
Hi, I'm just investigating what we can do about a problem with darcs' handling of non-ASCII filenames on GHC 7.2. The issue is apparently that as of GHC 7.2, getDirectoryContents now tries to decode filenames in the current locale, rather than converting a stream of bytes into characters:

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-01 Thread Max Bolingbroke
Hi Ganesh, On 1 November 2011 07:16, Ganesh Sittampalam gan...@earth.li wrote: Can anyone point me at the rationale and details of the change and/or suggest workarounds? This is my implementation of Python's PEP 383 [1] for Haskell. IMHO this behaviour is much closer to what users expect. For

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-01 Thread Felipe Almeida Lessa
On Tue, Nov 1, 2011 at 5:16 AM, Ganesh Sittampalam gan...@earth.li wrote: I'm just investigating what we can do about a problem with darcs' handling of non-ASCII filenames on GHC 7.2. The issue is apparently that as of GHC 7.2, getDirectoryContents now tries to decode filenames in the current

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-01 Thread John Millikin
You're right -- many parts of system-fileio (the parts based on directory) are broken due to this. I'll need to update it to call the posix/win32 functions directly. IMO, the GHC behavior in <=7.0 is ugly, but the behavior in 7.2 is fundamentally wrong. Different OSes have different definitions

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-01 Thread Max Bolingbroke
Hi John, On 1 November 2011 17:14, John Millikin jmilli...@gmail.com wrote: GHC 7.2 assumes Linux/BSD paths are text, which 1) silently breaks all existing code and 2) makes it impossible to fix within the given API. Please can you give an example of code that is broken with the new behaviour?

Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-01 Thread John Millikin
On Tue, Nov 1, 2011 at 11:43, Max Bolingbroke batterseapo...@hotmail.com wrote: Hi John, On 1 November 2011 17:14, John Millikin jmilli...@gmail.com wrote: GHC 7.2 assumes Linux/BSD paths are text, which 1) silently breaks all existing code and 2) makes it impossible to fix within the given