Re: [PATCH] checkout: most of the time we have good leading directories

2013-11-09 Thread Thomas Rast
Junio C Hamano gits...@pobox.com writes:

 When git checkout wants to create a path, e.g. a/b/c/d/e, after
 seeing if the entire thing already exists (in which case we check if
 that is up-to-date and do not bother to check it out, or we unlink
 and recreate it), we validate that the leading directory path is
 without funny symlinks by seeing a/, a/b/, a/b/c/ and then a/b/c/d/
 are all without funny symlinks, by calling has_dirs_only_path() in
 this order.

 When we are checking out many files (imagine: initial checkout),
 however, it is likely that an earlier checkout would have already
 made sure that the leading directory a/b/c/d/ is in good order; by
 checking the whole path a/b/c/d/ first, we can often bypass
 calls to has_dirs_only_path() for the leading parts.

Naively one would think that this is just as much work -- to correctly
verify that the path consists only of actual directories (not symlinks)
we have to lstat() every component regardless.  It seems the reason this
is an optimization is that has_dirs_only_path() caches its results, so
that we can get 'a/b/c/d/ is okay in every component' from the cache.

Is this analysis correct?  If so, can you spell that out in the commit
message?

-- 
Thomas Rast
t...@thomasrast.ch
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] checkout: most of the time we have good leading directories

2013-11-09 Thread Junio C Hamano
Thomas Rast t...@thomasrast.ch writes:

 Junio C Hamano gits...@pobox.com writes:

 When git checkout wants to create a path, e.g. a/b/c/d/e, after
 seeing if the entire thing already exists (in which case we check if
 that is up-to-date and do not bother to check it out, or we unlink
 and recreate it), we validate that the leading directory path is
 without funny symlinks by seeing a/, a/b/, a/b/c/ and then a/b/c/d/
 are all without funny symlinks, by calling has_dirs_only_path() in
 this order.

 When we are checking out many files (imagine: initial checkout),
 however, it is likely that an earlier checkout would have already
 made sure that the leading directory a/b/c/d/ is in good order; by
 checking the whole path a/b/c/d/ first, we can often bypass
 calls to has_dirs_only_path() for the leading parts.

 Naively one would think that this is just as much work -- to correctly
 verify that the path consists only of actual directories (not symlinks)
 we have to lstat() every component regardless.  It seems the reason this
 is an optimization is that has_dirs_only_path() caches its results, so
 that we can get 'a/b/c/d/ is okay in every component' from the cache.

 Is this analysis correct?  If so, can you spell that out in the commit
 message?

It was done without analysis ;-) but I think you are correct.

If you are checking out a/b/c/d/{m,a,n,y}, after you checked out
a/b/c/d/m, the has_dirs_only_path cache knows a/b/c/d/ is in good
order so when you check out a/b/c/d/{a,n,y}, we can just ask for
a/b/c/d/ and get an OK immediately.  There is no point in asking
about a/, a/b/, a/b/c/ and then a/b/c/d/ in the original pessimistic
order.  A change done _right_ to properly optimize this might even
want to change the main loop that the patch bypassed.
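
To make the argument concrete, here is a rough toy model of the two
orders.  It is only a sketch and not the code in git: every name in it
(toy_has_dirs_only_path, check_pessimistic, check_optimistic and the
counters) is made up for illustration, lstat() is replaced by a
stand-in that only counts calls, and the cache is assumed to remember
a single "known good" leading directory.

```c
#include <string.h>

/* Toy model only; NOT git's implementation.  All names are hypothetical. */
static char cached[256];      /* last leading directory known to be good */
static size_t cached_len;
static int cache_queries;     /* calls to toy_has_dirs_only_path() */
static int fake_lstat_calls;  /* simulated lstat() calls */

/* Stand-in for lstat(): pretends every component is a real directory. */
static int component_is_dir(const char *path, size_t len)
{
	(void)path;
	(void)len;
	fake_lstat_calls++;
	return 1;
}

/* path[0..len) must end in '/'.  Returns 1 if it is directories only. */
static int toy_has_dirs_only_path(const char *path, size_t len)
{
	size_t i = 0;

	cache_queries++;

	/* The path is a leading part of the cached one: already verified. */
	if (len <= cached_len && !memcmp(cached, path, len))
		return 1;

	/* The cached path is a leading part of this one: skip that part. */
	if (cached_len && cached_len < len && !memcmp(cached, path, cached_len))
		i = cached_len;

	for (; i < len; i++) {
		if (path[i] != '/')
			continue;
		if (!component_is_dir(path, i + 1))
			return 0;
	}
	if (len < sizeof(cached)) {
		memcpy(cached, path, len);
		cached_len = len;
	}
	return 1;
}

/* Original order: ask about a/, then a/b/, then a/b/c/, and so on. */
static int check_pessimistic(const char *dir, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (dir[i] == '/' && !toy_has_dirs_only_path(dir, i + 1))
			return 0;
	return 1;
}

/* Patched order: ask about the whole leading directory first. */
static int check_optimistic(const char *dir, size_t len)
{
	if (toy_has_dirs_only_path(dir, len))
		return 1;
	return check_pessimistic(dir, len);
}
```

In this model neither order issues extra simulated lstat() calls once
a/b/c/d/ is cached; the optimistic order only saves the per-component
queries to the cache, one query instead of four for a/b/c/d/.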

I do not think the patch (or the change done right for that
matter) will make much difference on a platform with good filesystem
metadata caching. It may be very interesting to see if that simple
patch makes any difference on Windows, though. If it does, then we
may want to look into cleaning up the code further.

Thanks for the comment.
