[Haskell-cafe] Re: getting crazy with character encoding

2007-09-13 Thread Stephane Bortzmeyer
On Thu, Sep 13, 2007 at 12:23:33AM +,
 Aaron Denney [EMAIL PROTECTED] wrote 
 a message of 76 lines which said:

 the characters read and written should correspond to the native
 environment notions and encodings.  These are, under Unix,
 determined by the locale system.

Locales, while fine for things like the language of the error messages
or the format to use to display the time, are *not* a good solution
for things like file names and file contents.

Even on a single Unix machine (without networking), there are
*several* users. Using the locale to find out the charset used for a
file name won't work if these users use different locales.

Same thing for file contents. The charset used must be marked in the
file (XML...) or in the metadata, somehow. Otherwise, there is no way
to exchange files or even to change the locale (if I switch from
Latin1 to UTF-8, what do my files become?)
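To see the problem concretely: the same bytes decode to different
characters depending on which charset you assume. A small illustration
(the decodeUtf8 below is a toy that only understands two-byte
sequences, not a real decoder):

import Data.Char (chr)
import Data.Word (Word8)

-- Latin-1: every byte is its own code point
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- toy UTF-8 decoder, two-byte sequences only
decodeUtf8 :: [Word8] -> String
decodeUtf8 [] = []
decodeUtf8 (b1:b2:bs)
    | b1 >= 0xC0 && b1 < 0xE0 =
        chr ((fromIntegral b1 - 0xC0) * 64 + (fromIntegral b2 - 0x80))
            : decodeUtf8 bs
decodeUtf8 (b:bs) = chr (fromIntegral b) : decodeUtf8 bs

main :: IO ()
main = do
    let bytes = [0xC3, 0xA8]      -- the UTF-8 encoding of 'è' (U+00E8)
    print (decodeLatin1 bytes)    -- "\195\168", two characters
    print (decodeUtf8 bytes)      -- "\232", one character: 'è'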


Re[2]: [Haskell-cafe] Re: getting crazy with character encoding

2007-09-13 Thread Bulat Ziganshin
Hello Stefan,

Thursday, September 13, 2007, 4:40:17 AM, you wrote:

 I'm pretty sure Hugs does the right thing.  NHC is probably broken. In
 any case, we already have hGetBuf / hPutBuf in the standard base
 libraries for raw binary IO, so code that uses getChar for bytes really
 has no excuse.  We can and should fix the bug.

have you ever heard about backward compatibility? :/

-- 
Best regards,
 Bulat                            mailto:[EMAIL PROTECTED]



Re: [Haskell-cafe] Re: getting crazy with character encoding

2007-09-13 Thread Andrea Rossato
On Thu, Sep 13, 2007 at 11:07:03AM +0200, Stephane Bortzmeyer wrote:
 On Thu, Sep 13, 2007 at 12:23:33AM +,
  Aaron Denney [EMAIL PROTECTED] wrote 
  a message of 76 lines which said:
 
  the characters read and written should correspond to the native
  environment notions and encodings.  These are, under Unix,
  determined by the locale system.
 
 Locales, while fine for things like the language of the error messages
 or the format to use to display the time, are *not* a good solution
 for things like file names and file contents.
 
 Even on a single Unix machine (without networking), there are
 *several* users. Using the locale to find out the charset used for a
 file name won't work if these users use different locales.

Yes indeed. I find it a real mess, and I don't see any way out.

 Same thing for file contents. The charset used must be marked in the
 file (XML...) or in the metadata, somehow. Otherwise, there is no way
 to exchange files or even to change the locale (if I switch from
 Latin1 to UTF-8, what do my files become?)

Ok, you are perfectly right, but we live in an imperfect world and we
must come up with a solution. In my case I'm developing this prompt
for xmonad, and a Chinese user wants directory and file names to be
displayed correctly. What else can I do but use the locale machinery?
I don't know of anything else.

The code below is not perfect, but it works to some extent.
Nonetheless, if the first user has an ISO-8859-1 locale and the second
a UTF-8 one, non-ASCII characters in the first user's file names will
produce invalid character sequences for the second user (a lone byte
like 0xE8 is valid Latin-1 but not valid UTF-8). The reverse direction
works, though, since every byte sequence can be read as Latin-1.

I'm still puzzled and still find the thread title appropriate.

Thanks for your kind attention.

Andrea

The locale-aware version of the previous code (needs hsc2hs):

{-# OPTIONS -fglasgow-exts #-}

-- headers for the hsc2hs #type and #const macros used below
#include <stdint.h>
#include <locale.h>

import Prelude hiding (catch)
import System.Process
import System.IO
import Control.Monad
import System.Directory
import Foreign
import Foreign.C
import Data.Char
import Control.Exception

-- run a command, feed it the given input, and return its entire output
runProcessWithInput cmd args input = do
  (pin, pout, perr, ph) <- runInteractiveProcess cmd args Nothing Nothing
  hPutStr pin input
  hClose pin
  output <- hGetContents pout
  when (output == output) $ return ()  -- force the lazy output before closing
  hClose pout
  hClose perr
  waitForProcess ph
  return output

main = do
  setupLocale
  l   <- fmap lines $ runProcessWithInput "/bin/bash" [] "ls ab*\n"
  l'  <- mapM fromLocale l
  l'' <- mapM toLocale l'
  putStrLn (show l')
  mapM_ putStrLn l''
  mapM_ (putStrLn . show . length) l'


-- This code comes from John Meacham's HsLocale
-- http://repetae.net/john/repos/HsLocale/
toLocale :: String -> IO String
toLocale s = catch (stringToBytes s >>= return . map (chr . fromIntegral))
   (const $ return "invalid character sequence")

fromLocale :: String -> IO String
fromLocale s = bytesToString (map (fromIntegral . ord) s)
  `catch` \_ -> return "invalid character sequence"

stringToBytes :: String -> IO [Word8]
stringToBytes cs = withIConv "" "UTF-32" $ \ic -> convertRaw ic cs

bytesToString :: [Word8] -> IO String
bytesToString xs = (withIConv "UTF-32" "" $ \ic -> convertRaw ic xs) >>=
    return . f where
        f ('\65279':ys) = ys   -- discard the byte order mark iconv prepends
        f ys = ys

newtype IConv = IConv (#type intptr_t)
    deriving (Num, Eq, Show)

foreign import ccall unsafe "iconv.h iconv_open"
  iconv_open :: Ptr CChar -> Ptr CChar -> IO IConv
foreign import ccall unsafe "iconv.h iconv_close"
  iconv_close :: IConv -> IO CInt
foreign import ccall unsafe "iconv.h iconv"
  iconv :: IConv -> Ptr (Ptr CChar) -> Ptr CSize -> Ptr (Ptr CChar)
        -> Ptr CSize -> IO CInt

-- setupLocale was not included in the posted snippet; in HsLocale it
-- amounts to calling setlocale(LC_ALL, "") so that the C runtime picks
-- up the user's locale
foreign import ccall unsafe "locale.h setlocale"
  setlocale :: CInt -> Ptr CChar -> IO (Ptr CChar)

setupLocale :: IO ()
setupLocale = withCAString "" $ \s -> setlocale (#const LC_ALL) s >> return ()

withIConv :: String -> String -> (IConv -> IO a) -> IO a
withIConv to from action = bracket open close action where
    close ic = throwErrnoIfMinus1_ "iconv_close" (iconv_close ic)
    open = throwErrnoIfMinus1 "iconv_open" iopen
    iopen =
        withCAString to $ \to' ->
        withCAString from $ \from' ->
        iconv_open to' from'

convertRaw :: forall a b . (Storable a, Storable b) => IConv -> [a] -> IO [b]
convertRaw ic xs =
    with (fromIntegral $ sizeOf (head xs) * length xs) $ \inptrSz ->
    withArray xs $ \arr ->
    with (castPtr arr) $ \inptr ->
    allocaBytes 1024 $ \(outptr :: Ptr b) ->
    with outptr $ \outptrptr ->
    with 1024 $ \outptrSz -> do
    -- element size of the output type (4 for Char, 1 for Word8)
    let outSz = sizeOf (undefined :: b)
    let go = do
            ret <- iconv ic inptr inptrSz (castPtr outptrptr) outptrSz
            err <- getErrno
            case (ret, err) of
                (-1, _) | err == e2BIG -> do
                    -- output buffer full: drain it, reset the output
                    -- pointers, and keep converting
                    oz <- peek outptrSz
                    x  <- peekArray ((1024 - fromIntegral oz) `div` outSz)
                                    (castPtr outptr)
                    poke outptrptr outptr
                    poke outptrSz 1024
                    y  <- go
                    return $ x ++ y
                (-1, _) -> throwErrno "iconv"
                -- the archived message breaks off here; this final case is
                -- reconstructed by analogy with the E2BIG branch: drain
                -- whatever remains in the output buffer and stop
                (_, _) -> do
                    oz <- peek outptrSz
                    peekArray ((1024 - fromIntegral oz) `div` outSz)
                              (castPtr outptr)
    go

Re: [Haskell-cafe] Re: getting crazy with character encoding

2007-09-13 Thread Ketil Malde
On Wed, 2007-09-12 at 17:40 -0700, Stefan O'Rear wrote:
 On Thu, Sep 13, 2007 at 12:23:33AM +, Aaron Denney wrote:
  Unfortunately, at this point it is a well entrenched bug, and changing
  the behaviour will undoubtedly break programs.
 ...
  There should be another system for getting the exact bytes in and 
  out (as Word8s, say, rather than Chars), 

 I'm pretty sure Hugs does the right thing.

..which makes me wonder what the right thing actually is?

Since IO on Unix (or at least on Linux) consists of bytes, I don't see
how a Unicode-only interface is ever going to do the 'right thing' for
all people.

One possible solution might be to have IO functions deal with [Word8]
instead of [Char]. If string and character constants were polymorphic,
Char and String made aliases for byte-based types, and a new type
introduced for Unicode characters, it might even be possible to fix
without breaking absolutely all legacy code.
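A rough sketch of the idea (the StringLike class and getLineS are
purely hypothetical; nothing like them exists in the base libraries):

{-# OPTIONS -fglasgow-exts #-}
import Data.Word (Word8)
import Data.Char (chr, ord)

-- hypothetical: one overloaded interface over byte strings and
-- character strings
class StringLike s where
    fromBytes :: [Word8] -> s

-- bytes stay bytes
instance StringLike [Word8] where
    fromBytes = id

-- a character view; a real instance would decode via the locale,
-- this stand-in just assumes Latin-1
instance StringLike [Char] where
    fromBytes = map (chr . fromIntegral)

-- an overloaded getLine: the type the caller asks for picks the view
-- (relies on today's one-Char-per-byte getLine to fetch the raw bytes)
getLineS :: StringLike s => IO s
getLineS = fmap (fromBytes . map (fromIntegral . ord)) getLine

main :: IO ()
main = do
    bs <- getLineS :: IO [Word8]   -- raw bytes of the line
    print bs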

But even this would probably only fix the Unix side of things.

-k



[Haskell-cafe] Re: getting crazy with character encoding

2007-09-13 Thread Aaron Denney
On Thu, Sep 13, 2007 at 11:07:03AM +0200, Stephane Bortzmeyer wrote:
 On Thu, Sep 13, 2007 at 12:23:33AM +,
  Aaron Denney [EMAIL PROTECTED] wrote 
  a message of 76 lines which said:
 
  the characters read and written should correspond to the native
  environment notions and encodings.  These are, under Unix,
  determined by the locale system.
 
 Locales, while fine for things like the language of the error messages
 or the format to use to display the time, are *not* a good solution
 for things like file names and file contents.

I never claimed it was a good system, merely that it was the system.
Yes, serious applications should use byte oriented I/O and explicitly
manage character sets when necessary.  STDIO in general and terminal
interaction in particular should use the locale selected by the user.

 Even on a single Unix machine (without networking), there are
 *several* users. Using the locale to find out the charset used for a
 file name won't work if these users use different locales.
 
 Same thing for file contents. The charset used must be marked in the
 file (XML...) or in the metadata, somehow.

For file system and network access, the justification is a bit more
clouded, but the interfaces there _should not_ be character interfaces.
Character interfaces are _lies_; Word8s are what actually get passed,
and trying to treat them as Unicode characters with any fixed mapping
breaks.  At best we get an extremely leaky abstraction.

Filesystems are not uniform across systems, yet Haskell tries to present
a uniform view that manages to capture exactly no existing system.

File contents (almost) everywhere are streams of bytes (ignoring, say,
old record-based OSes, Palm databases, Mac resource forks, etc.)
Almost all file systems use a hierarchical directory system, but with
significant differences.  Under Unixes the names are NUL-terminated
bytestrings that can't contain slashes.  New Macs and Windows have
specific character encodings (UTF-8 and UTF-16, respectively).  DOS,
old Macs, and Windows have multiple roots and various directory
separators and forbidden characters.

Trying to specify some API that is usable for robust programs that work
on any of these is hard.  I'd actually have preferred that the standard
didn't even try, and instead provided system-specific annexes.
Then an external library that was freer to evolve could try to solve
the problem of providing a uniform interface that would not defy
platform expectations.

-- 
Aaron Denney


Re: [Haskell-cafe] Re: getting crazy with character encoding

2007-09-13 Thread Stefan O'Rear
On Thu, Sep 13, 2007 at 12:06:15PM +0200, Ketil Malde wrote:
 On Wed, 2007-09-12 at 17:40 -0700, Stefan O'Rear wrote:
  On Thu, Sep 13, 2007 at 12:23:33AM +, Aaron Denney wrote:
   Unfortunately, at this point it is a well entrenched bug, and changing
   the behaviour will undoubtedly break programs.
  ...
   There should be another system for getting the exact bytes in and 
   out (as Word8s, say, rather than Chars), 
 
  I'm pretty sure Hugs does the right thing.
 
 ..which makes me wonder what the right thing actually is?
 
 Since IO on Unix (or at least on Linux) consists of bytes, I don't see
 how a Unicode-only interface is ever going to do the 'right thing' for
 all people.

I never said it was Unicode-only.

hGetBuf / hPutBuf - Raw Word8 access
getChar etc   - Uses locale info
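
For instance, a byte-exact read is a few lines on top of hGetBuf
(hGetBytes is just an illustrative helper, not a library function):

import System.IO
import Foreign (allocaBytes, peekArray)
import Data.Word (Word8)

-- read up to n raw bytes, with no charset interpretation at all
hGetBytes :: Handle -> Int -> IO [Word8]
hGetBytes h n =
    allocaBytes n $ \buf -> do
        m <- hGetBuf h buf n
        peekArray m buf

main :: IO ()
main = hGetBytes stdin 16 >>= print   -- e.g. [195,168,...] for "è"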

Stefan




Re: [Haskell-cafe] Re: getting crazy with character encoding

2007-09-13 Thread David Roundy
On Thu, Sep 13, 2007 at 06:49:59AM -0700, Stefan O'Rear wrote:
 On Thu, Sep 13, 2007 at 12:06:15PM +0200, Ketil Malde wrote:
  On Wed, 2007-09-12 at 17:40 -0700, Stefan O'Rear wrote:
   On Thu, Sep 13, 2007 at 12:23:33AM +, Aaron Denney wrote:
Unfortunately, at this point it is a well entrenched bug, and changing
the behaviour will undoubtedly break programs.
   ...
There should be another system for getting the exact bytes in and 
out (as Word8s, say, rather than Chars), 
  
   I'm pretty sure Hugs does the right thing.
  
  ..which makes me wonder what the right thing actually is?
  
  Since IO on Unix (or at least on Linux) consists of bytes, I don't see
  how a Unicode-only interface is ever going to do the 'right thing' for
  all people.
 
 I never said it was Unicode-only.
 
 hGetBuf / hPutBuf - Raw Word8 access
 getChar etc   - Uses locale info

The problem is that the types of openFile and getArgs are wrong, so there's
no right way to get a Handle (other than stdin) to read from in the first
place, unless we're willing to allow the current weird behaviour of treating
a [Char] as [Word8].
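
The point is easy to demonstrate: FilePath is just a synonym for
String, so the file name has already been through the Char-as-byte
treatment before openFile ever sees it. A sketch (assuming GHC 6.x
under a UTF-8 locale):

import System.IO
import System.Environment (getArgs)
import Data.Char (ord)

main :: IO ()
main = do
    (path:_) <- getArgs
    print (map ord path)          -- one Char per byte of the name
    h <- openFile path ReadMode   -- works only because the bytes round-trip
    hClose h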
-- 
David Roundy
Department of Physics
Oregon State University




[Haskell-cafe] Re: getting crazy with character encoding

2007-09-13 Thread Aaron Denney
On 2007-09-13, Stefan O'Rear [EMAIL PROTECTED] wrote:
 In any case, we already have hGetBuf / hPutBuf in the standard base
 libraries for raw binary IO, so code that uses getChar for bytes really
 has no excuse.

Except, of course, that hGetBuf and hPutBuf have problems of their own:
(a) allocating the memory for the buffers is a pain (does it require the
FFI?);
(b) they are something of a pain to use, requiring explicit management of
what's valid in these buffers (though a wrapper only need be written once);
(c) while in the standard base libraries, they are not in the Report or
the Library Report, i.e. there's no guarantee that a conforming Haskell
implementation will have them.  It'd be silly for an implementation not
to support them, of course, but...

The ByteString library at least fixes (a) and (b).
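
For example, with Data.ByteString the byte-exact read is one line,
with no buffer bookkeeping at all:

import qualified Data.ByteString as B

main :: IO ()
main = do
    bytes <- B.getContents    -- packed Word8s, read from stdin verbatim
    print (B.length bytes)
    B.putStr bytes            -- and written back verbatim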

-- 
Aaron Denney


[Haskell-cafe] Re: getting crazy with character encoding

2007-09-12 Thread Stephane Bortzmeyer
On Wed, Sep 12, 2007 at 04:18:43PM +0200,
 Andrea Rossato [EMAIL PROTECTED] wrote 
 a message of 60 lines which said:

 Now, I would expect that the output of a shell command such as 
 ls ab*
 would be a string/list of 5 chars. 

I do not think this expectation is reasonable. I do not think that ls
is Unicode-aware. It probably has only byte semantics, not character
semantics.

 I would expect that a file name set in a UTF-8 locale should be
 read by a locale-aware application

"Locale-aware application" is too vague. An application can use the
locale and still be unable to separate bytes from characters. ls may
be locale-aware, but it is probably not Unicode-aware.

   l <- fmap lines $ runProcessWithInput "/bin/bash" [] "ls ab*"

This is not a Haskell issue but an ls issue. Use
System.Directory.getDirectoryContents and we'll see.
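
For instance, something like this, which also shows the byte values
GHC hands back for each name:

import System.Directory (getDirectoryContents)
import Data.Char (ord)

main :: IO ()
main = do
    names <- getDirectoryContents "."
    -- for each name, its Haskell length next to its code points;
    -- with GHC those code points are the raw bytes of the name
    mapM_ (\n -> print (length n, map ord n)) names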



[Haskell-cafe] Re: getting crazy with character encoding

2007-09-12 Thread Andrea Rossato
On Wed, Sep 12, 2007 at 04:35:50PM +0200, Stephane Bortzmeyer wrote:
 This is not a Haskell issue but an ls issue. Use
 System.Directory.getDirectoryContents and we'll see.

I get the very same output.

Thanks for your kind attention.

Andrea





[Haskell-cafe] Re: getting crazy with character encoding

2007-09-12 Thread Aaron Denney
Hi.  I believe that everything I've said has been said by another
responder, but not all together in one place.

On 2007-09-12, Andrea Rossato [EMAIL PROTECTED] wrote:
 Suppose that, on a Linux system, in a UTF-8 locale, you create a file
 with non-ASCII characters. For instance:
 touch abèèè
 

 Now, I would expect that the output of a shell command such as 
 ls ab*
 would be a string/list of 5 chars. Instead I find it to be a list of 8
 chars...;-)

 That is to say, each non-ASCII character is read as 2 characters, as
 if the string were an ISO-8859-1 string - indeed, the string actually
 is treated as an ISO-8859-1 string. But when I print it, it is
 displayed correctly.

The Linux kernel doesn't really have a notion of characters, only bytes
in its interfaces.  (This isn't strictly true: it needs to in some cases
when it's interacting with other systems, but it's 99% true.)  The
UTF-8 representation of these 5 characters is 8 bytes, as indeed each
non-ASCII character takes two bytes.
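
Concretely: 'a' and 'b' are one byte each, while each 'è' (code point
0xE8) becomes the two bytes 0xC3 0xA8, for 2 + 3 * 2 = 8 bytes in
total. A toy encoder covering just the one- and two-byte ranges shows
the arithmetic:

import Data.Char (ord)
import Data.Word (Word8)

encodeUtf8 :: Char -> [Word8]
encodeUtf8 c
    | n < 0x80  = [fromIntegral n]
    | n < 0x800 = [ fromIntegral (0xC0 + n `div` 64)
                  , fromIntegral (0x80 + n `mod` 64) ]
    | otherwise = error "longer sequences omitted for brevity"
  where n = ord c

main :: IO ()
main = print (concatMap encodeUtf8 "ab\232\232\232")
    -- "abèèè": [97,98,195,168,195,168,195,168], i.e. 8 bytes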

The various C runtimes do have some notion of various character sets,
and locales, and so forth, and build on top of the byte interface to
represent characters.  But not all programs use these.  Your example of
ls just takes the bytes from the kernel, and perhaps does some minimal
sanitizing (munging control codes) before sending them to the tty.  If
the terminal understands UTF-8, everything works great.

On the other hand, GHC's runtime always interprets these bytes as
meaning the characters in ISO-8859-1 (this just takes the bytes to the
Unicode code points), and does not pay attention to locale settings
such as LC_CTYPE.  While this has some nice properties (totally
invertible, no code to maintain (as the first 256 code points of Unicode
are ISO-8859-1), etc.), personally, I think this is a bug.  The Haskell
standard talks about characters, not bytes, and the characters read
and written should correspond to the native environment notions and
encodings.  These are, under Unix, determined by the locale system.
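
The behaviour is easy to observe; under a UTF-8 terminal, GHC (as of
6.x) reports two Chars for each 'è' you type, where a locale-honouring
runtime would report one:

import Data.Char (ord)

main :: IO ()
main = do
    line <- getLine
    -- typing "è" prints (2,[195,168]) today, not (1,[232])
    print (length line, map ord line)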

Unfortunately, at this point it is a well entrenched bug, and changing
the behaviour will undoubtedly break programs.

There should be another system for getting the exact bytes in and out
(as Word8s, say, rather than Chars), and there are in fact external
libraries using lower level interfaces, rather than the things like
putStr, getLine, etc. that do this.  An external library works, of
course, but it should be part of the standard so implementors know that
character based routines actually are character based, not byte based.

 After reading about character encoding and the way the Linux kernel
 manages file names, I would expect a file name set in a UTF-8
 locale to be read by a locale-aware application as a UTF-8 string,
 and each character a Unicode code point which can be represented by a
 Haskell Char. What's wrong with that?

That's a reasonable assumption.  The problem is that GHC doesn't support
locales.  But byte-sequences do round-trip, as long as you don't try to
process them, so not as much breaks as one might think.

I don't know what NHC and Hugs do, though I assume they also provide
no translations.  I'm also not sure what JHC does, though I do see
mentions of UTF-8, UTF-16 (for Windows), and UTF-32 (for internal usage
of C libraries), and I do know that John is fairly careful about locale
issues.

-- 
Aaron Denney


Re: [Haskell-cafe] Re: getting crazy with character encoding

2007-09-12 Thread Stefan O'Rear
On Thu, Sep 13, 2007 at 12:23:33AM +, Aaron Denney wrote:
 Unfortunately, at this point it is a well entrenched bug, and changing
 the behaviour will undoubtedly break programs.
...
 There should be another system for getting the exact bytes in and out
 (as Word8s, say, rather than Chars), and there are in fact external
 libraries using lower level interfaces, rather than the things like
 putStr, getLine, etc. that do this.  An external library works, of
 course, but it should be part of the standard so implementors know that
 character based routines actually are character based, not byte based.
...
 I don't know what NHC and Hugs do, though I assume they also provide
 no translations.  I'm also not sure what JHC does, though I do see
 mentions of UTF-8, UTF-16 (for Windows), and UTF-32 (for internal usage
 of C libraries), and I do know that John is fairly careful about locale
 issues.

I'm pretty sure Hugs does the right thing.  NHC is probably broken.  In
any case, we already have hGetBuf / hPutBuf in the standard base
libraries for raw binary IO, so code that uses getChar for bytes really
has no excuse.  We can and should fix the bug.

Stefan

