Re: Last IO discussion

2009-08-19 Thread Troels Liebe Bentsen
Very interesting read; it opens a whole new can of worms: how should we
behave when we actually read file names from the filesystem?

As for the path literal, the newest revision of S32-setting-library should make
most people happy, as the default is OS independent and abstract. More
strictness can be set with use flags or more verbose syntax; this should also
make it easier to write portable programmes in Perl 6. So far I'm quite happy
with the current result, way to go people :)

But what we should do when reading paths from the filesystem is still
a problem.

We can go the old Perl 5 way of treating filenames as binary by default and
then trying to convert them based on local encoding settings.

But this just means any sane program will have to do an explicit decoding to a
Unicode path or string.

Like we do in Perl 5:

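use Encode qw(decode);

my $file = readdir $dir;
# LEAVE_SRC keeps decode() from modifying $file in place while it checks.
my $decoded_file = eval {
  decode("utf8", $file, Encode::FB_CROAK | Encode::LEAVE_SRC);
};
if ($@) {
  # Try something else as this was clearly not utf8.
} else {
  $file = $decoded_file;
}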

But then again, is this reasonable? On both Windows and MacOS X we know exactly
what we get, as the filesystem will tell us. Even FAT has an encoding attribute
telling us what encoding the filesystem is in. And given that the OS actually
refuses to write file names that are not valid, it would be a safe bet that a
Path can be decoded with that encoding.

So the problem of knowing the encoding really only exists on Unix/Linux. This is
mainly because POSIX does not care about encoding and most filesystems seem
to follow suit. But who knows if future filesystems will still be so lax with
input? The current trend of putting more database features in the filesystem
might also bring some more input validation, and in the future we might not
have to deal with the insanity of multiple encodings.

Apparently JFS today has the option of limiting file name encoding.

http://lwn.net/Articles/71472/

Even without a filesystem restriction, on Linux/Unix we have a default encoding
specified in the locale that most software will respect, so when I name a file
"ÆØÅ" on my Ubuntu box all my programs will show it as such and not give me a
garbled string. So even if we have no guarantee that file names are encoded in
what the locale is set to, it's the best information we have.

One could always argue that even if the filesystem restricts file name input,
one still has the option of ignoring this, as a string of bytes encoded one way
will be valid under the rules of another encoding, just with another meaning.
But this file name will be wrong in all other programs, so why should it be
correct or unspecified (as in just a stream of bytes) in Perl 6?

My idea of working with file names would be that we default to locale or
filesystem settings, but give the option of working with paths/file names as
binary or in a specific encoding.

my $file = readdir $dir; # Default to locale settings, e.g. utf8

This will return a Path decoded according to the locale (here UTF8); if the
decoding fails, no decoding is done and we return a binary Path.

my $file = readdir $dir, :utf8; # Decodes as utf8

my $file = readdir $dir, :bin; # No decoding is done

The whole reason for this is that paths and filenames should not be special;
they are just another form of user input, where we should have some sane default
so it does what we expect.
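
For illustration, the default described above (try one encoding strictly, fall
back to the raw bytes) could be sketched in Perl 5 roughly like this.
decode_filename and the hash it returns are made-up names for this example, and
the encoding is passed in by hand rather than read from the locale:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper, not part of any spec: decode strictly with the given
# encoding and fall back to the untouched bytes when that fails.
sub decode_filename {
    my ($bytes, $encoding) = @_;
    $encoding = 'UTF-8' unless defined $encoding;
    my $copy = $bytes;    # decode() may modify its input when a CHECK is given
    my $text = eval { decode($encoding, $copy, Encode::FB_CROAK) };
    return defined $text
        ? { decoded => 1, name => $text }     # character string
        : { decoded => 0, name => $bytes };   # binary, as read from disk
}

my $ok  = decode_filename("\xC3\x86\xC3\x98\xC3\x85");   # UTF-8 bytes for "ÆØÅ"
my $raw = decode_filename("\xC6\xD8\xC5");               # Latin-1 bytes for "ÆØÅ"

printf "utf8 input:   decoded=%d length=%d\n", $ok->{decoded},  length $ok->{name};
printf "latin1 input: decoded=%d length=%d\n", $raw->{decoded}, length $raw->{name};
```

The caller can always tell which case it got, so nothing is silently dropped
when a name does not match the expected encoding.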

More reading on the topic:

Python 3 problems:
http://bugs.python.org/issue4006

Unicode handling in Linux:
http://hektor.umcs.lublin.pl/~mikosmul/computing/articles/linux-unicode.html

Regards Troels.

On Wed, Aug 19, 2009 at 03:17, Timothy S. Nelson wrote:
>        See this link.
>
> http://archive.netbsd.se/?ml=perl6-language&a=2008-11&t=9170058
>
>        In particular, I thought Tom Christiansen's long message had some
> relevant info about filename literals.
>
>        :)


Re: r28017 - in docs/Perl6/Spec: . S32-setting-library

2009-08-18 Thread Troels Liebe Bentsen
I don't think Python is the only one with that problem; try saving a file with
non-utf8 chars in Subversion and see what happens.

We should be liberal in what we accept and strict in what we send, as we really
don't know what the filesystem will return to us. I guess a file name read from
the filesystem could have PathBinary as its default object type. And the
programmer would have the option of converting it.

But you are right we have the same problem with Perl 5 today:

my $file = readdir $dir;

Should $file have the utf8 flag set if my locale is set to utf8?

Or should I have to do a:

$file = eval { decode("utf8", $file, Encode::FB_CROAK); };

every time I get a filename?

Regards Troels.

On Tue, Aug 18, 2009 at 16:37, Nicholas Clark wrote:
> On Tue, Aug 18, 2009 at 01:10:58PM +0200, Jan Ingvoldstad wrote:
>> On Tue, Aug 18, 2009 at 12:54 PM, Troels Liebe Bentsen wrote:
>
>> > Besides that, a simple check on Unix for what the locale is set to might
>> > also be nice, so we don't write UTF8 files on a filesystem where the rest
>> > for the files are in Latin1.
>>
>> The locale doesn't say what format the filenames are on the
>> filesystem, though, merely the current user's language preferences may
>> be.
>
> We don't want to make the same mistakes as Python 3:
>
> http://mail.python.org/pipermail/python-dev/2008-December/083856.html
>
> The summary is that different file names in the same directory might be
> in different encodings, and your programming language runtime sucks big time
> if it doesn't offer you a way to iterate over all of them somehow, even if
> you can't render their names.
>
> [Consider a security critical program scanning using glob('*'), which gives
> a clean bill of health because it opened "all" files and found no problems.]
>
> I don't know how Python 3 resolved this.
>
> Nicholas Clark
>


Re: Filename literals

2009-08-18 Thread Troels Liebe Bentsen
On Tue, Aug 18, 2009 at 15:20, Carl Mäsak wrote:
> Leon (>):
>> Reading this discussion, I'm getting the feeling that filename
>> literals are increasingly getting magical, something that I don't
>> think is a good development. The only sane way to deal with filenames
>> is treating them as opaque binary strings, making any more assumptions
>> is bound to get you into trouble. I don't want to deal with Windows'
>> strange restrictions on characters when I'm working on Linux. I don't
>> want to deal with any other platform's particularities either.
>> Portability should be positive, not negative IMNSHO.

The whole reason filenames/paths are a mess to code is that they are treated
as binary strings in most cases. This is also why we have modules like
File::Spec and a bunch more on CPAN all trying to do the same thing. And today,
if I want to code something that works on all platforms, I have to use those
instead. How can this be positive?

For me a Path literal is a way to get rid of all this bandaging so we don't have
to bother with the strange restrictions later when we get a bug report from a
CPAN user. And there is nothing magical about it, no more so than when I ask for
the length of a UTF8 string and expect to get back the number of characters, not
the number of bytes.
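
To make the characters-versus-bytes point concrete, here is a small Perl 5
sketch; the file name is just an example value, not something from the spec:

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for the name "ÆØÅ.txt", as they would come back from readdir.
my $bytes = "\xC3\x86\xC3\x98\xC3\x85.txt";

# After decoding, length() counts characters instead of bytes.
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
# prints "bytes: 10, characters: 7"
```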

A path is a well-defined entity on all platforms and should be treated as such.
The main problem is that POSIX never really covered this part too well. But
today we have Unicode and UTF8, and as such this is the de facto default on most
modern Unixes, as most libraries and tools will write filenames in this format
if so defined in the locale.

Just writing binary data to a filename is bound to get you into trouble, and you
will quickly find that many of the common C libraries will fail if locale and
filename do not match.

So even on Linux/Unix a path is really not just any number of bytes with / as
delimiter. It depends on the locale and the encoding set for the file system,
and not caring about that will get you into trouble.

But then again you always have the option of using p:unix{}; it's also a clear
way to signal that you really don't care about portability and that this will
only work on Unix. Or you could even use Q{}, as this will pretty much allow you
to do anything.

>>
>> As for comparing paths: reimplementing logic that belongs to the
>> filesystem sounds like really Bad Idea™ to me. Two paths can't be
>> reliably compared without choosing to make some explicit assumptions,
>> and I don't think Perl should make such choices for the programmer.

Getting any kind of paths from user input will require you to reimplement that
logic if you care about validating data before throwing it at the file system.

If you buy that paths are well-defined types, then comparing paths should not
require making any assumptions. We can compare Unicode strings without making
assumptions.

>
> Very nicely put. We can't predict the future, but in creating
> something that'll at least persist through the next decade, let's not
> do elaborate things with lots of moving parts.
>
> Let's make a solid ground to stand on; something so stable that it
> works uphill and underwater. People with expertise and tuits will
> write the facilitating modules.
>
>  To quote Kernighan and Pike:  Simplicity. Clarity. Generality.
>  I agree.
>  magic can always be added with module goodness
>

I completely agree we can't predict the future, but we do have to make some sane
choices about how the default should work. Who knows if UTF8 will still be the
hot new thing in 10 years, but that's still the default assumption for much of
Perl 6 if nothing else is known about the input we get.

And I totally agree path literals should not be magical; they should be well
defined, and you should not suffer when using them because platform X or Y has
strange restrictions. But when finding the sane default we have to make
restrictions, and POSIX's "a path is binary data" is simply too lax.

My idea about using the lowest common denominator for modern Unix and Windows
was that we could get as much of Unicode in path names as possible without
breaking on modern platforms, and as a way to get Simplicity, Clarity and
Generality into paths.

Because this will never be simple, clear or general:

  File::Spec->catfile(qw(.. ext Sys Syslog macros.all));

or any of the other examples that we can find:

http://www.google.com/codesearch?hl=en&start=10&sa=N&q=FIle::Spec-%3Ecatfile

Regards Troels


Re: r28017 - in docs/Perl6/Spec: . S32-setting-library

2009-08-18 Thread Troels Liebe Bentsen
On Tue, Aug 18, 2009 at 13:10, Jan Ingvoldstad wrote:
> On Tue, Aug 18, 2009 at 12:54 PM, Troels Liebe Bentsen wrote:
>
>> My idea with portable by default was only portability for modern Unix and
>> modern Windows. So DOS and VMS limitations would not apply. The problem of
>> enforcing truly "portable" filenames is that the files names get too
>> restrictive and for most applications targeting 98% of systems out there
>> would be enough.
>
> That's a decent enough point, but it may be unwise to ignore legacy
> systems that where Perl 5 is in common use, unless we want to shed
> that userbase. (Mark this down as a "I don't know, and I don't have a
> stake in it, but…)

I completely agree and we might make a p:posix{} or p:strict{} to handle that.

But one thing to remember is that having the default p{} allow characters and
formats that are not supported on VMS or DOS only means the programmer won't
get a compiler error. In other words he is only a little better off than using
a normal string or Q{}. Also, in cases where a system has special requirements
the local version should be used, so e.g. VMS would have p:vms{} and DOS
p:dos{}.

The defaults should handle most cases, and limiting paths to only ASCII might
be a bit much. For the default path literal to be useful it should work in most
normal cases, and I would say international characters are a normal case.

What I would like to go for is the lowest reasonable common denominator for the
default p{}, for me at least this is MacOSX, Windows XP, Linux, *BSD, etc.

>
>> With modern Unix/Windows I'm thinking about systems that support and use UCS2
>> or UTF8 and where "." or other common characters does not have special
>> meaning for the filesystem.
>
> We also need to keep in mind the Unicode problems between certain
> unixy platforms (i.e. MacOS X vs. most if not all the rest).
>
> If I recall correctly, the internal Unicode format chosen for Perl 6
> is incompatible with MacOS X, because MacOS X implemented Unicode
> support at a time when the standard as we know it today wasn't
> finalized.
>
> This has bearing on filenames, and MacOS X isn't a small enough
> platform that it can simply be ignored, either.

I guess if this puts limits on what characters can be in file names it should
also go in the default limits. But how Perl 6 stores the Path internally is not
really important, so long as it can be automatically converted without
changing meaning before it is passed to the OS. So what we should limit is what
cannot be automatically converted.

Regards Troels.


Re: r28017 - in docs/Perl6/Spec: . S32-setting-library

2009-08-18 Thread Troels Liebe Bentsen
>> Perl 5 runs on (at least) VMS and VOS too. So, if Perl 6 is to adopt a policy
>> of enforced portable filenames by default, it should (at least) also exclude
>> - as the first character, and forbid more than one . in a filename.
>
> And, as I mentioned in an earlier post during the discussion, the
> restrictions for Windows are numerous:
>
> http://msdn.microsoft.com/en-us/library/aa365247(VS.85).aspx
>
> Enforcing truly "portable" filenames is unrealistic, I think, but
> having a POSIX-checking default is a good thing.

My idea with portable by default was only portability for modern Unix and
modern Windows. So DOS and VMS limitations would not apply. The problem of
enforcing truly "portable" filenames is that the file names get too
restrictive, and for most applications targeting 98% of systems out there would
be enough.

With modern Unix/Windows I'm thinking about systems that support and use UCS2
or UTF8 and where "." or other common characters do not have special meaning
for the filesystem.

Using POSIX as the basis would probably not be enough, as it's either too lax or
too strict.

I think Windows NTFS/VFAT32 + Kernel limits on characters might be a good basis:

http://en.wikipedia.org/wiki/Filename

 * Paths can be up to 259 characters long
 * No characters in the range 0x01-0x1F
 * No control characters
 * None of the following characters: " * : < > ? \ / |
 * Directory or filename up to 255 characters long
 * Only Unicode that fits in UCS-2 (the 16-bit subset)

System specific restrictions such as AUX, CLOCK$, COM, etc. would not apply,
and the programmer would be allowed to shoot him/herself in the foot.
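
As a rough Perl 5 sketch of what checking those rules could look like:
portable_path_problems is a made-up name, the limits are simply the ones listed
above, and "/" is assumed to be the only separator. It expects an already
decoded character string:

```perl
use strict;
use warnings;

# Hypothetical checker for the rules listed above; returns a list of problems,
# which is empty when the path passes.
sub portable_path_problems {
    my ($path) = @_;
    my @problems;
    push @problems, 'path longer than 259 characters' if length($path) > 259;

    # "/" separates components, so it can never appear inside one.
    for my $part (grep { length($_) } split m{/}, $path) {
        push @problems, "'$part': longer than 255 characters" if length($part) > 255;
        push @problems, "'$part': control character"          if $part =~ /[\x00-\x1F\x7F]/;
        push @problems, "'$part': reserved character"         if $part =~ /["*:<>?\\|]/;
        push @problems, "'$part': outside UCS-2"              if $part =~ /[^\x{0000}-\x{FFFF}]/;
    }
    return @problems;
}

my @fine = portable_path_problems('../ext/dictionary.txt');
my @bad  = portable_path_problems('logs/what?.txt');

print scalar(@fine), " problem(s)\n";
print "$_\n" for @bad;
```

Device names such as AUX or COM are deliberately not checked here, in line with
the point that system specific restrictions would not apply to the default.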

Besides that, a simple check on Unix for what the locale is set to might also be
nice, so we don't write UTF8 files on a filesystem where the rest of the files
are in Latin1.

Regards Troels.


Re: Filename literals

2009-08-18 Thread Troels Liebe Bentsen
On Mon, Aug 17, 2009 at 23:11, Jon Lang wrote:
>> The default p{} should only allow "/" as separator and should not allow
>> characters that won't work on modern Windows and Unix like \ / ? % * : | " >
>> <, etc. The reason for this is that portable Path's should be the default
>> and if you really need platform specific behavior it should be shown in the
>> code.
>
> I note that you explicitly included * and ? in the list of forbidden
> characters; I take it, then, that you're not in favor of Path as a
> glob-based pattern-matching utility?  E.g.:
>
>my Path $path;
> ...
>unless $path ~~ p { say "the file doesn't begin with 'astro'". }
>
> Admittedly, this particular example _could_ be accomplished through
> the use of a regex; but there _are_ cases where the use of wildcard
> characters would be easier than the series of equivalent tests that
> Perl would otherwise have to perform in order to achieve the same
> result.  Hmm... maybe we need something analogous to q vs. qq; that
> is:
>
>p #`{ syntax error: '*' is not a valid filename character. }
>pp #`{ returns an object that is used for Path
> pattern-matching; perhaps Pathglob or somesuch? }

Globs are special and should probably have their own sub-format. The problem
with including * and ? in Paths is that on Unix these are allowed in file names.

Q:glob{} and g{} might be nice.

But then again, as * and ? are disallowed for the default portable p{}, using any
of these might make it return a PathGlob instead. But I think a glob
interpolating pp{} might be the best solution, so the default p{} is strict and
returns an error.

my Path $path = p{file.txt};
my PathGlob $pathglob = pp{*.txt};

Allowing code like this:

for pp{*.txt}.open -> $file {
 my @lines = $file.lines;
}

>> Urls could also be support with:
>>
>> my Path $path = p:url{file:///home/test.file}
>
> I would be very careful here, in that I wouldn't want to open the can
> of worms inherent in non-file protocols (e.g., ftp, http, gopher,
> mail), or even in file protocols with hosts other than localhost.

You are probably right, but then again it's not up to Path to actually know
how to open or work with the file/url. It would only have to know the rules for
how urls work, not how to open them. In its basic form Path should not be able
to do any IO.

But I'm not too sure URL's belong in Path.

>> ** Utility functions **
>>
>> Path in itself knows nothing about the filesystem and files but might have a
>> peek in $*CWD to do some path logic. Except for that a number of File related
>> functions might be available to make it easy to open and slurp a file a Path
>> points to.
>>
>> my File $file = p{/etc/passwd}.open;
>> if($file.type ~~ 'text/plain') {
>>  say "looks like a password file";
>> }
>>
>> my @passwd = p{/etc/passwd}.lines;
>>
>>
>> if(p{/etc/passwd}.exists) {
>>  say "passwd file exists";
>> }
>
> As soon as you allow methods such as .exists, it undermines your claim
> that Path knows nothing about the file system or files.  IMHO, you
> should still include such methods.

My idea with the utility functions is that they are merely wrappers or pretty
syntax for using the built-in IO calls. So if e.g. url paths were included in
the spec, most of these functions would be provided by libraries
handling the IO.

How this can be done in a nice OO way, I don't know, but I'm sure people a lot
smarter than me can figure that out.

Regards Troels


Re: Filename literals

2009-08-17 Thread Troels Liebe Bentsen
Hey,

Just joined the list, and I too have been thinking about a good path literal
for Perl 6. Nice to see so many other people are thinking the same :).

Not knowing where to start in this long thread, I will instead try to show how
I would like a path literal to work. For me a path literal is a way to make the
code pretty and clean. And for multi platform coding this is mostly where it
gets hard to do. So I think a path literal should make it possible to use both
a native style and a more modern portable one, without having to give up using
spaces like in Path::Spec from Perl 5 or have to do verbose object creation.

First I think extending Q with a Q:path{} and making the aliases Q:p{} and p{}
would be the most consistent with the current string literal API. Also it
should be possible to subtype the literals to further limit format and
content. This should be done so we can get a compile time error when paths are
known to be incorrect, or so that we throw an exception or return an undef with
an error type (or whatever Larry called it) when we interpolate and return
something that is known to be incorrect.

The default p{} should only allow "/" as separator and should not allow
characters that won't work on modern Windows and Unix like \ / ? % * : | " > <,
etc. The reason for this is that portable Paths should be the default, and if
you really need platform specific behavior it should be shown in the code.

my Path $path = p{../ext/dictonary.txt};

or

my Path $path = p{c:/ext/dictonary.txt};

We should allow windows style paths so converting and maintaining code on this
platform is not a pain.

my Path $path = p:win{C:\Program Files\MS Access\file.file};

For Unix specific behavior we should have a p:unix{} literal; here the only
limits are what is defined by the locale. So we won't be able to write full
Unicode if the locale is set to Latin1. Writing filenames to the filesystem that
other programs won't be able to read should be hard.

my Path $path = p:unix{/usr/src/bla/myfile?:%.file};

And for people where this is a problem p:bin{} can be used as no checking is
done here.

my $path = p:bin{/usr/src/bla/??/adasd/myfile};

Old style Mac paths could also be supported where the : is used as separator.

my Path $path = p:mac{usr:src:bla};

Or old DOS paths, where the 8 character limit and all the other old DOS
restrictions apply.

my Path $path = p:dos{c:\windows\test.fil};

URLs could also be supported with:

my Path $path = p:url{file:///home/test.file};

** Path Object like File::Spec, etc. just nicer **

All the different variants for p{} return a Path object that offers much of
what is found in File::Spec, Cwd and Path::Class in Perl 5 today in a more
Perl 6 way.

my Path $real_path = $path.realpath; # Like Cwd's realpath

my Path $volume = $path.volume; # Returns the volume part if relevant
my Path $dir = $path.dir; # Returns the directory part
my Path $file = $path.file; # Returns the file part

$path.shift(); # Get rid of first part of path
$path.pop(); # Get rid of last part of path

my @paths = $path.dirs; # Returns the directory parts of the path

etc.

** Comparing Paths should do the right thing **

As we have the option of specifying what type a Path object is, this should
also count when comparing them. So e.g. p:win{} paths are case insensitive.

my $file = p:win{c:\My File.txt};

my $path = p:win{C:\Program Files\..};

if($path.is_in($file)) { # Check if the path is contained in another path
  say "$file is in $path\n"; # C:\My File.txt is in C:
}

if(p{../test} ~~ p{../dir/../test}) {
  say "Comparing two Path works as it should";
}
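
A purely lexical version of that comparison could be sketched in Perl 5 as
below. collapse_path is a made-up helper for this example; it only rewrites the
string, never looks at the filesystem (so symlinks are ignored), and assumes
"/" as the separator:

```perl
use strict;
use warnings;

# Hypothetical lexical cleanup: drop "." parts and fold "dir/.." pairs.
sub collapse_path {
    my ($path) = @_;
    my $absolute = $path =~ m{^/};
    my @out;
    for my $part (grep { length($_) && $_ ne '.' } split m{/}, $path) {
        if ($part eq '..' && @out && $out[-1] ne '..') {
            pop @out;                 # "dir/.." cancels out
        }
        elsif ($part eq '..' && $absolute && !@out) {
            next;                     # "/.." is still "/"
        }
        else {
            push @out, $part;         # normal part, or a leading ".." we must keep
        }
    }
    my $joined = join '/', @out;
    return $absolute ? "/$joined" : (length($joined) ? $joined : '.');
}

print collapse_path('../dir/../test'), "\n";   # ../test
print "same path\n" if collapse_path('../test') eq collapse_path('../dir/../test');
```

A case insensitive type such as p:win{} would additionally need to fold case
before the string comparison.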

Also Path handles Unicode normalization so this won't be a problem:

http://lists.zerezo.com/git/msg643117.html

Meaning that both "Märchen" (precomposed ä) and "Märchen" (a followed by a
combining diaeresis) are the same path, but without normalizing the path behind
the user's back.
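
As a sketch of how such a comparison could work in Perl 5, using the core
Unicode::Normalize module: same_path_name is a made-up helper, and the two
strings are the precomposed and decomposed spellings of the German word for
"fairy tale". Only temporary copies are normalized, the stored names stay as
they were:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $composed   = "M\x{E4}rchen";     # U+00E4, precomposed a-diaeresis
my $decomposed = "Ma\x{308}rchen";   # "a" followed by U+0308 combining diaeresis

# Hypothetical helper: compare normalized copies, leave the originals untouched.
sub same_path_name {
    my ($left, $right) = @_;
    return NFC($left) eq NFC($right);
}

print $composed eq $decomposed
    ? "plain eq: same\n" : "plain eq: different\n";
print same_path_name($composed, $decomposed)
    ? "normalized: same\n" : "normalized: different\n";
```

The first line prints "different" and the second "same", which is the behaviour
described above.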

** Utility functions **

Path in itself knows nothing about the filesystem and files but might have a
peek in $*CWD to do some path logic. Except for that a number of File related
functions might be available to make it easy to open and slurp a file a Path
points to.

my File $file = p{/etc/passwd}.open;
if($file.type ~~ 'text/plain') {
  say "looks like a password file";
}

my @passwd = p{/etc/passwd}.lines;


if(p{/etc/passwd}.exists) {
  say "passwd file exists";
}

These are my thoughts so far, hope they help the discussion.

Regards Troels