Proposed function path_in_directory() [was: Re: [PATCH v2 8/9] longest_ancestor_length(): resolve symlinks before comparing paths]

2012-10-06 Thread Michael Haggerty
On 10/01/2012 06:51 AM, Michael Haggerty wrote:
> I think I would advocate that the prefix has to match the front of the
> path exactly (including any trailing slashes) and either
> 
> strlen(prefix) == 0
> or the prefix ended with a '/'
> or the prefix and path are identical
> or the character in path following the matching part is a '/'
> 
> This would allow the "is path its own prefix" policy to be decided by
> the caller by either including or omitting a trailing slash on the
> prefix argument.

Thinking about this more, I don't think it will work.  As usual, the
special cases around "/" and "//" make things awkward.  I think it is
necessary to have a separate argument to specify whether "path is its
own prefix".

So I am trying to decide how a function path_in_directory() should work,
and would like to get some feedback, especially on the following two points:

1. How should "//" be handled?  I don't really have experience with the
peculiar paths that start with "//", so I'm not sure how they should be
handled (or even if the handling needs to be platform-dependent).  My
working hypothesis is that the inputs should be normalized by the
caller, so if the caller wants "//" to be treated as equivalent to "/"
then the caller should normalize them *before* calling this function.
Conversely, if the caller passes "//" to the function, that implies that
"//" is distinct from "/" and "//" is considered a proper subdirectory
of "/".  See the cases marked with "??" below.

2. Does there need to be any special related to DOS paths?

> /*
>  * Return true iff path is within dir.  The comparison is textual,
>  * meaning that path and dir should be normalized and either both be
>  * absolute or both be relative to the same directory.  If path and
>  * dir represent the *same* path, then return true iff allow_equal is
>  * true.  Single trailing slashes on either path or dir are ignored,
>  * (except for the special case "//"); i.e., "a/b" and "a/b/" are
>  * treated equivalently, as are "" and "/".  Examples (* means "don't
>  * care"):
>  *
>  * - path_in_directory("a/b", "a", *) -> true
>  * - path_in_directory("a", "a/b", *) -> false
>  * - path_in_directory("ab", "a", *) -> false
>  * - path_in_directory("a/b", "a/b", 0) -> false
>  *   (same if either argument is replaced with "a/b/")
>  * - path_in_directory("a/b", "a/b", 1) -> true
>  *   (same if either argument is replaced with "a/b/")
>  * - path_in_directory(*, "/", 1) -> true
>  * - path_in_directory("/", "/", 0) -> false
>  * - path_in_directory("//", "/", 0) -> true??
>  * - path_in_directory("//", "/", 1) -> true
>  * - path_in_directory("/", "//", 0) -> false
>  * - path_in_directory("/", "//", 1) -> false   ??
>  * - path_in_directory("/a/b", "//", *) -> false
>  */
> int path_in_directory(const char *path, const char *dir, int allow_equal);

Michael

-- 
Michael Haggerty
mhag...@alum.mit.edu
http://softwareswirl.blogspot.com/
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 8/9] longest_ancestor_length(): resolve symlinks before comparing paths

2012-09-30 Thread Junio C Hamano
Michael Haggerty  writes:

> I think I would advocate that the prefix has to match the front of
> the path exactly (including any trailing slashes) and either
>
> strlen(prefix) == 0
> or the prefix ended with a '/'
> or the prefix and path are identical
> or the character in path following the matching part is a '/'
>
> This would allow the "is path its own prefix" policy to be decided by
> the caller by either including or omitting a trailing slash on the
> prefix argument.

I think that is sensible thing to do.

The primary thing I found questionable was that the function, given
"/net/wink/project/frotz" to check against "/pub:/s" (or "/pub/:/s/"
if you like), will report that "/net/wink/project" directory is the
longest ancestor, when "/s" is a symlink that happens to point at
"/net/wink/project".  It is very counter-intuitive when you view its
two input strings as strings.  By making its sole caller expand the
symbolic links, it would be a lot clearer what is going on to
anybody who follow the codepath.  We have one path obtained from
getcwd() and a set of paths all of which are real paths without
symbolic aliasing, and we check if one among the latter cover
an earlier part of the former.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 8/9] longest_ancestor_length(): resolve symlinks before comparing paths

2012-09-30 Thread Michael Haggerty
On 09/30/2012 10:00 AM, Junio C Hamano wrote:
> Michael Haggerty  writes:
> 
>> longest_ancestor_length() relies on a textual comparison of directory
>> parts to find the part of path that overlaps with one of the paths in
>> prefix_list.  But this doesn't work if any of the prefixes involves a
>> symbolic link, because the directories will look different even though
>> they might logically refer to the same directory.  So canonicalize the
>> paths listed in prefix_list using real_path_if_valid() before trying
>> to find matches.
> 
> I somehow feel that this is making the logic unnecessarily
> convoluted by solving the problem at a wrong level.
> 
> If longest_ancestor_length() takes a single string and a list of
> candidate string prefixes, conceptually it should be usable for any
> hierarchy-looking string that uses slashes as hierarchy separator
> (e.g. refs that may be stored in packed-refs that you cannot expect
> lstat(2) or readlink(2) to give you any sensible results).
> 
> The real problem is that the list given from the environment may
> contain a path that violates that "it suffices to take the longest
> string-prefix taking slashes into account" assumption in such a
> generic l-a-l implementation, no?  And this patch solves it by
> making l-a-l _less_ generic and forcing it to be aware of the glitch
> of its caller (you can either blame clueless user who lies when
> setting the GIT_CEILING_DIRECTORIES by including paths contaminated
> with symlinks, or blame the calling setup_git_directory_gently_1()
> function that does not resolve the symbolic links before calling
> this function).
> 
> As l-a-l is only used by the "stop at the ceiling" logic, isn't it a
> far simpler solution to keep the function work at the string level,
> and make the caller fix its input, i.e. the value taken from the
> environment variable, before calling it?  That is, grab the value of
> GIT_CEILING_DIRECTORIES, split it into a list at PATH_SEP (while
> rejecting any non-absolute paths), normalize the elements of that
> list by resolving symbolic links, and then pass the cwd[] and the
> normalized string list to l-a-l?
> 
> The resulting callsite in setup_git_directory_gently_1() would pass
> cwd[] and the ceiling_dirs (which now is a string list), all of
> whose elements would happen to begin with "/" (or dos drive prefix
> followed by the "root" symbol), but l-a-l can be written in such a
> way that it does not even require that all the input has to begin at
> root, which would later make it usable with things that are not
> paths (references, for example, all of which begin with "refs/" and
> not "/refs/").

I agree that longest_ancestor_length() is not very generic or
interesting anymore.  Nearly all of its "added value" comes from the
normalize_path_callback() helper function.  One possibility would be to
inline it at the one place it is called.

The function string_list_longest_prefix() was my attempt to isolate the
kernel of generic functionality from longest_ancestor_path().  It is
*almost* the function that you are proposing, with the exception that it
does not ensure that the prefix match ends at a "/" boundary.  So
another alternative could be to change this function to respect "/"
boundaries.  (After the change, the function might not belong in the
string-list API anymore.)

However, the semantics of a function that matches prefixes at "/"
boundaries is not entirely obvious.  Suppose we would implement the test
via a function like path_prefixcmp(path, prefix).  I can think of a few
policy questions that would have to be answered:

* Is a trailing slash on the prefix argument required, optional, or
prohibited?  What if the prefix is "/" or "//" or "c:/"?

* Is a trailing slash on the path argument optional/prohibited?  Are "/"
or "//" allowed as path arguments?

* Is a path its own prefix?

path_prefixcmp("a/b", "a/b") -> true or false?

(For the implementation of longest_ancestor_path(), we would prefer this
to return "false".)  Or does the answer depend on whether the prefix has
a trailing slash?

path_prefixcmp("a/b", "a/b") -> true?
path_prefixcmp("a/b", "a/b/") -> false?

Part of the reason that I implemented string_list_longest_prefix()
instead of the function that you suggest is that the behavior of the
former is far more obvious.

I think I would advocate that the prefix has to match the front of the
path exactly (including any trailing slashes) and either

strlen(prefix) == 0
or the prefix ended with a '/'
or the prefix and path are identical
or the character in path following the matching part is a '/'

This would allow the "is path its own prefix" policy to be decided by
the caller by either including or omitting a trailing slash on the
prefix argument.

Michael

-- 
Michael Haggerty
mhag...@alum.mit.edu
http://softwareswirl.blogspot.com/
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordom

Re: [PATCH v2 8/9] longest_ancestor_length(): resolve symlinks before comparing paths

2012-09-30 Thread Junio C Hamano
Michael Haggerty  writes:

> longest_ancestor_length() relies on a textual comparison of directory
> parts to find the part of path that overlaps with one of the paths in
> prefix_list.  But this doesn't work if any of the prefixes involves a
> symbolic link, because the directories will look different even though
> they might logically refer to the same directory.  So canonicalize the
> paths listed in prefix_list using real_path_if_valid() before trying
> to find matches.

I somehow feel that this is making the logic unnecessarily
convoluted by solving the problem at a wrong level.

If longest_ancestor_length() takes a single string and a list of
candidate string prefixes, conceptually it should be usable for any
hierarchy-looking string that uses slashes as hierarchy separator
(e.g. refs that may be stored in packed-refs that you cannot expect
lstat(2) or readlink(2) to give you any sensible results).

The real problem is that the list given from the environment may
contain a path that violates that "it suffices to take the longest
string-prefix taking slashes into account" assumption in such a
generic l-a-l implementation, no?  And this patch solves it by
making l-a-l _less_ generic and forcing it to be aware of the glitch
of its caller (you can either blame clueless user who lies when
setting the GIT_CEILING_DIRECTORIES by including paths contaminated
with symlinks, or blame the calling setup_git_directory_gently_1()
function that does not resolve the symbolic links before calling
this function).

As l-a-l is only used by the "stop at the ceiling" logic, isn't it a
far simpler solution to keep the function work at the string level,
and make the caller fix its input, i.e. the value taken from the
environment variable, before calling it?  That is, grab the value of
GIT_CEILING_DIRECTORIES, split it into a list at PATH_SEP (while
rejecting any non-absolute paths), normalize the elements of that
list by resolving symbolic links, and then pass the cwd[] and the
normalized string list to l-a-l?

The resulting callsite in setup_git_directory_gently_1() would pass
cwd[] and the ceiling_dirs (which now is a string list), all of
whose elements would happen to begin with "/" (or dos drive prefix
followed by the "root" symbol), but l-a-l can be written in such a
way that it does not even require that all the input has to begin at
root, which would later make it usable with things that are not
paths (references, for example, all of which begin with "refs/" and
not "/refs/").

Hrm?
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 8/9] longest_ancestor_length(): resolve symlinks before comparing paths

2012-09-28 Thread Michael Haggerty
longest_ancestor_length() relies on a textual comparison of directory
parts to find the part of path that overlaps with one of the paths in
prefix_list.  But this doesn't work if any of the prefixes involves a
symbolic link, because the directories will look different even though
they might logically refer to the same directory.  So canonicalize the
paths listed in prefix_list using real_path_if_valid() before trying
to find matches.

path is already in canonical form, so doesn't need to be canonicalized
again.

This fixes some problems with using GIT_CEILING_DIRECTORIES that
contains paths involving symlinks, including t4035 if run with --root
set to a path involving symlinks.

Remove a number of tests of longest_ancestor_length().  It is awkward
to test longest_ancestor_length() now, because its new path
normalization behavior depends on the contents of the whole
filesystem.  On the other hand:

* longest_ancestor_length() is now built of reusable components that
  are themselves tested separately (string_list_split(),
  string_list_longest_prefix(), and real_path_if_valid()), so it
  contains less code that can go wrong.

* longest_ancestor_length() gets some testing (albeit not systematic)
  via the GIT_CEILING_DIRECTORIES tests.

Therefore the work of updating these tests exceeds any expected
benefits.

Signed-off-by: Michael Haggerty 
---
 path.c| 18 +--
 t/t0060-path-utils.sh | 64 ---
 2 files changed, 11 insertions(+), 71 deletions(-)

diff --git a/path.c b/path.c
index b20f2fb..40d7360 100644
--- a/path.c
+++ b/path.c
@@ -570,21 +570,25 @@ int normalize_path_copy(char *dst, const char *src)
 
 static int normalize_path_callback(struct string_list_item *item, void 
*cb_data)
 {
-   char buf[PATH_MAX+2];
+   char *buf;
const char *ceil = item->string;
-   int len = strlen(ceil);
+   const char *realpath;
+   int len;
 
-   if (len == 0 || len > PATH_MAX || !is_absolute_path(ceil))
+   if (!*ceil || !is_absolute_path(ceil))
return 0;
-   if (normalize_path_copy(buf, ceil) < 0)
+   realpath = real_path_if_valid(ceil);
+   if (!realpath)
return 0;
-   len = strlen(buf);
+   len = strlen(realpath);
+   buf = xmalloc(len + 2); /* Leave space for possible trailing slash */
+   strcpy(buf, realpath);
if (len == 0 || buf[len-1] != '/') {
buf[len++] = '/';
-   buf[len++] = '\0';
+   buf[len] = '\0';
}
free(item->string);
-   item->string = xstrdup(buf);
+   item->string = buf;
return 1;
 }
 
diff --git a/t/t0060-path-utils.sh b/t/t0060-path-utils.sh
index 4ef2345..c97bbf2 100755
--- a/t/t0060-path-utils.sh
+++ b/t/t0060-path-utils.sh
@@ -12,28 +12,6 @@ norm_path() {
"test \"\$(test-path-utils normalize_path_copy '$1')\" = '$2'"
 }
 
-# On Windows, we are using MSYS's bash, which mangles the paths.
-# Absolute paths are anchored at the MSYS installation directory,
-# which means that the path / accounts for this many characters:
-rootoff=$(test-path-utils normalize_path_copy / | wc -c)
-# Account for the trailing LF:
-if test $rootoff = 2; then
-   rootoff=# we are on Unix
-else
-   rootoff=$(($rootoff-1))
-fi
-
-ancestor() {
-   # We do some math with the expected ancestor length.
-   expected=$3
-   if test -n "$rootoff" && test "x$expected" != x-1; then
-   expected=$(($expected+$rootoff))
-   fi
-   test_expect_success "longest ancestor: $1 $2 => $expected" \
-   "actual=\$(test-path-utils longest_ancestor_length '$1' '$2') &&
-test \"\$actual\" = '$expected'"
-}
-
 # Absolute path tests must be skipped on Windows because due to path mangling
 # the test program never sees a POSIX-style absolute path
 case $(uname -s) in
@@ -93,48 +71,6 @@ norm_path /d1/s1//../s2/../../d2 /d2 POSIX
 norm_path /d1/.../d2 /d1/.../d2 POSIX
 norm_path /d1/..././../d2 /d1/d2 POSIX
 
-ancestor / "" -1
-ancestor / / -1
-ancestor /foo "" -1
-ancestor /foo : -1
-ancestor /foo ::. -1
-ancestor /foo ::..:: -1
-ancestor /foo / 0
-ancestor /foo /fo -1
-ancestor /foo /foo -1
-ancestor /foo /foo/ -1
-ancestor /foo /bar -1
-ancestor /foo /bar/ -1
-ancestor /foo /foo/bar -1
-ancestor /foo /foo:/bar/ -1
-ancestor /foo /foo/:/bar/ -1
-ancestor /foo /foo::/bar/ -1
-ancestor /foo /:/foo:/bar/ 0
-ancestor /foo /foo:/:/bar/ 0
-ancestor /foo /:/bar/:/foo 0
-ancestor /foo/bar "" -1
-ancestor /foo/bar / 0
-ancestor /foo/bar /fo -1
-ancestor /foo/bar foo -1
-ancestor /foo/bar /foo 4
-ancestor /foo/bar /foo/ 4
-ancestor /foo/bar /foo/ba -1
-ancestor /foo/bar /:/fo 0
-ancestor /foo/bar /foo:/foo/ba 4
-ancestor /foo/bar /bar -1
-ancestor /foo/bar /bar/ -1
-ancestor /foo/bar /fo: -1
-ancestor /foo/bar :/fo -1
-ancestor /foo/bar /foo:/bar/ 4
-ancestor /foo/bar /:/foo:/bar/ 4
-ancestor /foo/bar /foo:/:/bar/ 4
-ancestor /foo/bar /:/bar/:/fo