[racket-dev] Symlink trouble

2013-04-17 Thread Tobias Hammer

Hi,

I am currently implementing an application that relies heavily on Racket's
great serialize functionality to exchange data between Racket processes on
different computers. That worked well until I stumbled over some very
confusing behavior in Racket's filesystem and module path resolution.


I will first explain what I observed and then why this causes trouble:
* Relative (module) paths are resolved with something like
  (or (current-load-relative-directory) (current-directory)).
* Collection paths are resolved with
  (find-executable-path (find-system-path 'exec-file) (find-system-path 'collects-dir))
  for the system collection and with the given path for the others.
* You can require a module both relatively and via collection; if both
  resolve to the same name, there is no error.


serialize stores the module path and symbol where the deserialize function
can be found. What is interesting is how this module path is determined:
* If the file containing the deserialize identifier (either implemented by
  hand, or the file where e.g. serializable-struct is used) is loaded via
  collection, then the serialized stream contains a collection path
  (determined via identifier binding and mpi magic).
* If this file is loaded relatively, the fallback method with
  current-load-relative-directory / current-directory is used.


Nothing special so far, but the fun starts with how current-directory is
initialized. On *nix systems it uses getcwd(), but this function returns
the path with all symbolic links resolved (getcwd is only a thin
OS wrapper, and the OS provides nothing else).
This little detail can easily break the serialization framework (and maybe
other things too).
The scenario is a file that lives in a path containing a symlink and that is
also in the current collections, e.g.

/abc/symlink/more/def/file.rkt
and PLTCOLLECTS=/abc/symlink/more:
and file.rkt contains a serializable-struct definition.

Now one Racket process loads file.rkt relatively, serializes a struct
instance and sends it to another Racket process. The other process loads
def/file via collection and deserializes the struct. The receiver now has a
struct that is of a different type and that it can't access.
This fails because the serialized data contains the absolute symlink-free
path, which differs from the path the receiver used to load file.rkt
(because symlinks are not resolved for collection dirs).


The same of course happens when the data is sent to another computer that
has a symlink in the path to file.rkt, even if both load the file the same way.


The confusing thing is that from the user's point of view everything is
consistent: the working directory and the collections all point to the same
location.


Clearly this behavior is not limited to Racket, as nearly all programming
languages use getcwd internally. A quick Google search for getcwd and
symlinks gives a lot of results...


I came up with a few solutions, but I would like to get some feedback on
them. They all more or less rely on the shell keeping track of the 'real'
(better: visible) working directory. Most *nix shells set 'PWD' in the
environment, but this is not guaranteed and can of course be altered by the
user.


- The quick and very dirty hack is to set current-directory before any
user code is executed:
racket -e '(current-directory (or (getenv "PWD") (current-directory)))' program.rkt

Too ugly to really use...

- A better fix would be to change how the current-directory parameter is
initialized during startup. It could use some heuristic that tries the
environment variable if it is a complete and existing path and falls back
to getcwd otherwise. As far as I can tell this won't break anything,
because after this one time at startup the C side's cwd and Racket's
parameter are completely decoupled.
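
The heuristic described above could be sketched roughly as follows. This is only an illustration at the user level, not how Racket's startup code is written: the name choose-initial-directory is hypothetical, and instead of expanding links and comparing path strings it compares filesystem identities with file-or-directory-identity, which is a swapped-in but similar check.

```racket
#lang racket/base

;; Hypothetical helper, not part of Racket: pick the initial working
;; directory from a PWD-style string and the getcwd()-style path.
(define (choose-initial-directory pwd cwd)
  (cond
    [(and (string? pwd)
          (path-string? pwd)          ; rejects "" and strings containing NUL
          (complete-path? pwd)        ; must not be a relative path
          (directory-exists? pwd)
          ;; Does PWD name the same directory on disk as cwd?
          ;; The identity is computed after following symlinks.
          (= (file-or-directory-identity pwd)
             (file-or-directory-identity cwd)))
     (path->directory-path pwd)]
    [else cwd]))                      ; anything suspicious: keep getcwd()'s answer

;; Value one might install at startup, before any user code runs.
(define initial-dir
  (choose-initial-directory (getenv "PWD") (current-directory)))
```

Passing the result to (current-directory initial-dir) before loading the program would have the same effect as the -e hack above, but only when PWD really points at the directory the process was started in.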


- A more conservative solution would be a command-line argument to racket
that sets the initial value of current-directory. One could then populate it
from the environment's PWD, from `pwd`, or whatever suits.


I would appreciate any feedback on how I can work around this behavior
(except "don't use symlinks" ...) or whether I missed something obvious. If
not, would either of the two real solutions be viable? They shouldn't be too
hard to implement; I could create a patch if one of them seems OK.


Tobias



--
-
Tobias Hammer
DLR / Robotics and Mechatronics Center (RMC)
Muenchner Str. 20, D-82234 Wessling
Tel.: 08153/28-1487
Mail: tobias.ham...@dlr.de
_
 Racket Developers list:
 http://lists.racket-lang.org/dev


Re: [racket-dev] Symlink trouble

2013-04-17 Thread Matthew Flatt
Yes, I think Racket should use PWD --- if the expansion of soft links
produces the same path as getcwd(), which seems to be what /bin/pwd
does.

Should Racket also set PWD (optionally, but by default) when it creates
a subprocess? I think probably so.


To make sure we're all on the same page:

The general problem is that there can be more than one filesystem path
that reaches a file. It would be great if we could normalize every path
to a canonical form, but path normalization in general seems to be
intractable due to the possibilities of soft links, hard links,
multiple mount points, case-sensitivity choices, and probably other
twists that I'm forgetting. We have therefore settled on different
definitions of "same file", depending on the context.

For module paths, "same file" involves only syntactic normalizations of
the pathname (e.g., no checking for soft links). Various pieces of the
system are carefully implemented to be consistent with syntactic
normalization. For example, suppose that PLTCOLLECTS is set to
/home/mflatt/plt, but /home/mflatt is a symlink to /Users/mflatt;
pathnames associated to modules that are accessed via collection will
consistently use /home/mflatt, and not somehow hop over to
/Users/mflatt. As long as a user is similarly consistent when
supplying paths, it all works out.
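
As a rough illustration of what a purely syntactic normalization means, the sketch below uses simplify-path with #f as its second argument, so the filesystem is never consulted and no soft link can be expanded. The helper name syntactic-name and the collects/racket/base.rkt tail of the path are made up for the example, and Unix-style paths are assumed.

```racket
#lang racket/base

;; Hypothetical helper: clean up a path without touching the filesystem.
;; With #f as the second argument, simplify-path only drops "." elements
;; and cancels ".." against the preceding element.
(define (syntactic-name p)
  (path->string (simplify-path p #f)))

;; /home/mflatt stays /home/mflatt, whether or not it is a symlink.
(define example
  (syntactic-name "/home/mflatt/plt/collects/../collects/racket/./base.rkt"))
;; example is "/home/mflatt/plt/collects/racket/base.rkt"
```

By contrast, functions such as resolve-path or normalize-path do look at the filesystem, so their results can change depending on which links exist.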

Unfortunately, `current-directory' is a place where you don't get to
choose the path. You might say /home/mflatt/plt to get to a Racket
installation, but to initialize `current-directory', the path gets
turned into an inode and back to a path via getcwd() --- exactly the
sort of thing that breaks a syntactic view of "same".

The PWD environment variable addresses the problem with getcwd(): nice
shells set PWD based on a syntactic derivation of the current
directory, instead of an inode-based derivation.

So, Racket should take advantage of the information that nice shells
provide. Probably it should also act as a nice shell by default.
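
Until something like this is built in, a program can approximate "acting as a nice shell" for its own subprocesses. The following is a minimal sketch, assuming a Racket version that provides environment-variable sets (current-environment-variables); call-with-pwd-exported is a hypothetical name, not an existing Racket function.

```racket
#lang racket/base

;; Hypothetical helper, not part of Racket: run `thunk` with PWD set to the
;; value of the current-directory parameter. The change is made in a copy of
;; the environment, so the caller's own variables stay untouched.
(define (call-with-pwd-exported thunk)
  (define env (environment-variables-copy (current-environment-variables)))
  (parameterize ([current-environment-variables env])
    ;; Subprocesses created inside `thunk` inherit this environment.
    ;; Note: the string may end in a path separator, unlike a shell's PWD.
    (putenv "PWD" (path->string (current-directory)))
    (thunk)))
```

A caller would wrap its subprocess or system call in the thunk, so that the child sees the same syntactic directory name that the parent uses.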


(As it happens, I use csh on Mac OS X, and it's not nice in the above
sense. That helps explain why I never got PWD vs. cwd() before.)



Re: [racket-dev] Symlink trouble

2013-04-17 Thread Neil Van Dyke

Matthew Flatt wrote at 04/17/2013 10:39 AM:

> It would be great if we could normalize every path
> to a canonical form, but path normalization in general seems to
> intractable due to the possibilities of soft links, hard links,
> multiple mount points, case-sensitivity choices, and probably other
> twists that I'm forgetting.


Agreed intractable, and agreed with your approach.

Just FYI for anyone who Googles this in the future and wants a limited
but tractable pretty-good path canonicalization for their application
(at least on Unix-y filesystems)... canonicalize-path from the
following library might be good enough:
http://www.neilvandyke.org/racket-path-misc/ (I wrote this for
scan-mediafiles in http://www.neilvandyke.org/racket-mediafile/, and
put it in a separate PLaneT package because I've needed such a procedure
from time to time.)


Neil V.



Re: [racket-dev] Symlink trouble

2013-04-17 Thread Tobias Hammer
On Wed, 17 Apr 2013 16:39:22 +0200, Matthew Flatt mfl...@cs.utah.edu wrote:

> Yes, I think Racket should use PWD --- if the expansion of soft links
> produces the same path as getcwd(), which seems to be what /bin/pwd
> does.

That check is even better than the one I had in mind. That should prevent
any possible ambiguities.


> Should Racket also set PWD (optionally, but by default) when it creates
> a subprocess? I think probably so.

Yes, I think that is a sensible default. Setting only the OS cwd, as
Racket currently does, would cause the same problems for any subprocess
that uses this information (at least most shells and, in the future,
Racket).




> To make sure we're all on the same page:
>
> The general problem is that there can be more than one filesystem path
> that reaches a file. It would be great if we could normalize every path
> to a canonical form, but path normalization in general seems to
> intractable due to the possibilities of soft links, hard links,
> multiple mount points, case-sensitivity choices, and probably other
> twists that I'm forgetting. We have therefore settled on different
> definitions of same file, depending on the context.

Right.


> For module paths, same file involves only syntactic normalizations of
> the pathname (e.g., no checking for soft links). Various pieces of the
> system are carefully implemented to be consistent with syntactic
> normalization. For example, suppose that PLTCOLLECTS is set to
> /home/mflatt/plt, but /home/mflatt is a symlink to /Users/mflatt;
> pathnames associated to modules that are accessed via collection will
> consistently use /home/mflatt, and not somehow hop over to
> /Users/mflatt. As long as a user is similarly consistent when
> supplying paths, it all works out.

That matches my observations. Files accessed via collection always keep
their paths 'as is'. But it is enough to start a program via `racket file'
instead of `racket -l what/ever' to break this. Failure therefore seems to
be the normal case (at least for console racket).



> Unfortunately, `current-directory' is a place where you don't get to
> choose the path. You might say /home/mflatt/plt to get to a Racket
> installation, but to initialize `current-directory', the path gets
> turned into an inode and back to a path via getcwd() --- exactly the
> sort of thing that breaks a syntactic view of same.

Correct.


> The PWD environment variable addresses the problem with getcwd(): nice
> shells set PWD based on a syntactic derivation of the current
> directory, instead of an inode-based derivation.
>
> So, Racket should take advantage of the information that nice shells
> provide. Probably it should also act as a nice shell by default.

What exactly do you mean by acting as a nice shell? Setting PWD for
subprocesses? In that sense it should definitely be nice (by default).


> (As it happens, I use csh on Mac OS X, and it's not nice in the above
> sense. That helps explain why I never got PWD vs. cwd() before.)

Just tried bash, csh and ksh on Linux and they all seem to set PWD. But I
can't tell if that's the default or specific to my installation.





Re: [racket-dev] Symlink trouble

2013-04-17 Thread Matthew Flatt
At Wed, 17 Apr 2013 17:19:58 +0200, Tobias Hammer wrote:
> On Wed, 17 Apr 2013 16:39:22 +0200, Matthew Flatt mfl...@cs.utah.edu wrote:
>> For module paths, same file involves only syntactic normalizations of
>> the pathname (e.g., no checking for soft links). Various pieces of the
>> system are carefully implemented to be consistent with syntactic
>> normalization. For example, suppose that PLTCOLLECTS is set to
>> /home/mflatt/plt, but /home/mflatt is a symlink to /Users/mflatt;
>> pathnames associated to modules that are accessed via collection will
>> consistently use /home/mflatt, and not somehow hop over to
>> /Users/mflatt. As long as a user is similarly consistent when
>> supplying paths, it all works out.
>
> That matches my observations. Files accessed via collection always keep
> their paths 'as is'. But it is enough to start a program via racket file
> instead of racket -l what/ever to break this.

I should have mentioned that you could use `racket full-path-to-file'
to avoid the problem, which is a workaround that I have used often.

>> So, Racket should take advantage of the information that nice shells
>> provide. Probably it should also act as a nice shell by default.
>
> What exactly do you mean by acting as a nice shell? Setting PWD for
> subprocesses?
> In that sense it should definitely be nice (by default).

Yes.

>> (As it happens, I use csh on Mac OS X, and it's not nice in the above
>> sense. That helps explain why I never got PWD vs. cwd() before.)
>
> Just tried bash, csh and ksh on linux and they all seem to set PWD. But i
> can't tell if thats the default or specific to my installation.

I think it may be part of the BSD legacy for Mac OS X. On my Linux
installations, csh works as you describe.



Re: [racket-dev] Symlink trouble

2013-04-17 Thread Tobias Hammer
On Wed, 17 Apr 2013 17:25:02 +0200, Matthew Flatt mfl...@cs.utah.edu wrote:

>> That matches my observations. Files accessed via collection always keep
>> their paths 'as is'. But it is enough to start a program via racket file
>> instead of racket -l what/ever to break this.
>
> I should have mentioned that you could use `racket full-path-to-file'
> to avoid the problem, which is a workaround that I have used often.

Thanks, I didn't know that. That seems to solve it for everything that
relies on the module path of the initial file.
Unfortunately, but understandably, current-directory is not affected.