Re: Last IO discussion

2009-08-19 Thread Troels Liebe Bentsen
Very interesting read; it opens a whole new can of worms: how should we
behave when we actually read file names from the filesystem?

As for the path literal, the newest revision of S32-setting-library should make
most people happy, as the default is OS independent and abstract. More
strictness can be set with use flags or more verbose syntax; this should also
make it easier to write portable programs in Perl 6. So far I'm quite happy
with the current result, way to go people :)

But what we should do when reading paths from the filesystem is still
a problem.

We can go the old Perl 5 way of treating filenames as binary by default and
then trying to convert them based on local encoding settings.

But this just means any sane program will have to do an explicit decoding to a
Unicode path or string.

Like we do in Perl 5:

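use Encode qw(decode);

my $file = readdir $dir;
# Strict UTF-8; LEAVE_SRC keeps decode() from modifying $file in place.
my $decoded_file = eval { decode('UTF-8', $file, Encode::FB_CROAK | Encode::LEAVE_SRC) };
if ($@) {
  # Try something else as this was clearly not utf8.
} else {
  $file = $decoded_file;
}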

But then again, is this reasonable? On both Windows and Mac OS X we know exactly
what we get, as the filesystem will tell us. Even FAT has an encoding attribute
telling us what encoding the filesystem is in. And given that the OS actually
refuses to write file names that are not valid, it would be a safe bet that a
Path can be decoded with that encoding.

So the problem of knowing the encoding really only exists on Unix/Linux. This is
mainly because POSIX does not care about encoding and most filesystems seem
to follow suit. But who knows if future filesystems will still be so lax with
input; the current trend of putting more database features in the filesystem
might also bring some more input validation, and in the future we might not
have to deal with the insanity of multiple encodings.

Apparently JFS today has the option of limiting file name encoding.

http://lwn.net/Articles/71472/

Even without a filesystem restriction, on Linux/Unix we have a default encoding
specified in the locale that most software will respect, so when I name a file
ÆØÅ on my Ubuntu box all my programs will show it as such and not give me a
garbled string. So even if we have no guarantee that file names are encoded in
what the locale is set to, it's the best information we have.

One could always argue that even if the filesystem restricts file name input,
one still has the option of ignoring this, as one encoded string of bytes will
be valid under the rules of another encoding, just with another meaning. But
this file name will be wrong in all other programs, so why should it be correct
or unspecified (as in just a stream of bytes) in Perl 6?

My idea of working with file names would be that we default to locale or
filesystem settings, but give the options of working with paths/file names as
binary or a specific encoding.

my $file = readdir $dir; # Default to locale settings, e.g. utf8

This will return a Path decoded as UTF-8; if decoding fails, no decoding will
be done and we return a binary Path.

my $file = readdir $dir, :utf8; # Decodes as utf8

my $file = readdir $dir, :bin; # No decoding is done
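
For comparison, the proposed default (decode with the locale or filesystem encoding, otherwise hand back the bytes) can be roughly sketched in Perl 5 today. This is only an illustration under stated assumptions: decode_name and its return layout are hypothetical names, and the encoding is passed in by hand rather than looked up from the locale.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper, not part of any Perl 5 or Perl 6 API:
# decode a raw file name with the given encoding, or return the
# bytes untouched when they are not valid in that encoding.
sub decode_name {
    my ($bytes, $encoding) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps it from modifying $bytes in place.
    my $text = eval {
        decode($encoding, $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return defined $text
        ? { name => $text,  decoded => 1 }   # character string
        : { name => $bytes, decoded => 0 };  # binary fallback
}

my $ok  = decode_name("\xC3\x86\xC3\x98\xC3\x85", 'UTF-8');  # UTF-8 bytes for ÆØÅ
my $bad = decode_name("\xC6\xD8\xC5", 'UTF-8');              # Latin-1 bytes for ÆØÅ
printf "%d %d\n", $ok->{decoded}, $bad->{decoded};           # prints "1 0"
```

The caller can check the decoded flag to tell a character string from a binary name, which is the distinction the :utf8 and :bin variants above make explicit.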

The whole reason for this is that paths and filenames should not be special;
they are just another form of user input, where we should have some sane
default so it does what we expect.

More reading on the topic:

Python 3 problems:
http://bugs.python.org/issue4006

Unicode handling in Linux:
http://hektor.umcs.lublin.pl/~mikosmul/computing/articles/linux-unicode.html

Regards Troels.

On Wed, Aug 19, 2009 at 03:17, Timothy S. Nelson wayl...@wayland.id.au wrote:
        See this link.

 http://archive.netbsd.se/?ml=perl6-language&a=2008-11&t=9170058

        In particular, I thought Tom Christiansen's long message had some
 relevant info about filename literals.

        :)






Re: Last IO discussion

2009-08-19 Thread David Green

On 2009-Aug-19, at 5:00 am, Troels Liebe Bentsen wrote:
My idea of working with file names would be that we default to locale or
filesystem settings, but give the options of working with paths/file names as
binary or a specific encoding.


As mentioned in the old thread, encoding is only vaguely related to locale.
The problem (or one of them) is that if I create a file today, and then change
my locale tomorrow, I end up with a garbled filename. Of course, people don't
as a rule change to a different locale every day, but I still think this is a
situation where we need to put the onus on the user.


That is, either Perl can determine the encoding (e.g. because the filesystem
indicates it in some way), or else the user needs to pick one explicitly. If
you get a list of files from reading a dir, and don't need to display the names
or anything, you might get away with treating them as undistinguished bytes;
but as soon as Perl does anything that needs an encoding and one hasn't been
specified, it should complain bitterly.
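
A rough Perl 5 sketch of that behaviour, purely as an illustration: name_as_text and the hash layout are made up here and are not part of any proposed API. Byte-level work goes through without an encoding, while asking for text without one dies.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical: a file name is a hashref { bytes => ..., encoding => ... }.
# Text access refuses to guess when no encoding has been specified.
sub name_as_text {
    my ($name) = @_;
    die "file name has no encoding; specify one explicitly\n"
        unless defined $name->{encoding};
    return Encode::decode($name->{encoding}, $name->{bytes},
                          Encode::FB_CROAK | Encode::LEAVE_SRC);
}

my $raw = { bytes => "\xC3\x86\xC3\x98\xC3\x85" };  # no encoding known yet
my $byte_len = length $raw->{bytes};                # byte-level work is fine

# Asking for text now fails loudly instead of guessing.
my $err = eval { name_as_text($raw); 1 } ? '' : $@;

# Once the user picks an encoding, text access works.
$raw->{encoding} = 'UTF-8';
my $char_len = length name_as_text($raw);

print "bytes=$byte_len chars=$char_len\n";
print "complaint: $err";
```

The point of the design is that the failure happens at the first operation that needs characters, not when the directory is read.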


It's the same reasoning why I think specifying a timezone should be required:
it's not that much work for the user to add "use IO::Encoding $volume = utf-8",
and at least that way naive users will be alerted to the fact that something's
going on. It's up to them how much effort they think is worth devoting to the
issue, but at least they will be warned that there's an issue there to grapple
with.



-David