Re: inotify to minimize stat() calls

2013-04-29 Thread Duy Nguyen
On Tue, Apr 30, 2013 at 1:05 AM, Robert Zeh robert.allan@gmail.com wrote:
 The call to lstat is only there for testing and should not be in there for
 the final version. Is there an easy way to only enable it for tests?

The usual trick is invent a new GIT_ environment variable. Then check
it and do something different. Then you can set the env in tests only.
--
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-03-10 Thread Ramkumar Ramachandra
Duy Nguyen wrote:
 On Fri, Mar 8, 2013 at 3:15 PM, Junio C Hamano gits...@pobox.com wrote:
  The possible options are:
  +
 -   - 'no' - Show no untracked files
 +   - 'no' - Show no untracked files (this is fastest)

 There is a trade-off around the use of -uno between safety and
 performance.  The default is not to use -uno so that you will not
 forget to add a file you newly created (i.e safety).  You would pay
 for the safety with the cost to find such untracked files (i.e.
 performance).

 I suspect that the documentation was written with the assumption
 that at least for the people who are reading this part of the
 documentation, the trade-off is obvious.  In order to find more
 information, you naturally need to spend more cycles.

 If the trade-off is not so obvious, however, I do not object at all
 to describing it. But if we are to do so, I do object to mentioning
 only one side of the trade-off.  People who choose fastest needs
 to be made very aware that they are disabling safety.

 On the topic of trading off, I was thinking about new -uauto as
 default that is like -uall if it takes less than a certan amount of
 time (e.g. 0.5 seconds), if it exceeds that limit, the operation is
 aborted (i.e. it turns to -uno). The safety net is still there, git
 status advices to use -u to show full information.

Ugh, this is too opaque; the user has no idea whether untracked files
are being counted or not.

 Or a less intrusive approach: measure the time and advice the user to
 (read doc and) use -uno.

I just learnt about -uno myself, from this thread.  At best, it's a
stopgap until we get inotify support.

 But it's probably worth waiting for the first cut of inotify support
 from Ram. It's better with inotify anyway.

This is quite urgent in my opinion.  One of git's primary tasks is to
quickly tell me what changed in the repository, and inotify is the
perfect way to do this.
I'll try to get the first cut out quickly, so we can immediately
correct any fundamental design flaws.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-03-08 Thread Junio C Hamano
Torsten Bögershausen tbo...@web.de writes:

 Doesn't this make one wonder why a separate bit and implementation
 is necessary to say I am not interested in untracked files when
 -uno option is already there?
 ...
 I need to admit that I wasn't aware about git status -uno.

Not so fast.  I did not ask you Why do you need a new one to solve
the same problem -uno already solves?

 Thinking about it, how many git users are aware of the speed penalty
 when running git status to find out which (tracked) files they had changed?

 Or to put it the other way, when a developer wants a quick overview
 about the files she changed, then git status -uno may be a good and fast 
 friend.

 Does it make sence to stress put that someway in the documentation?

 diff --git a/Documentation/git-status.txt b/Documentation/git-status.txt
 index 9f1ef9a..360d813 100644
 --- a/Documentation/git-status.txt
 +++ b/Documentation/git-status.txt
 @@ -51,13 +51,18 @@ default is 'normal', i.e. show untracked files and 
 directori
  +
  The possible options are:
  +
 -   - 'no' - Show no untracked files
 +   - 'no' - Show no untracked files (this is fastest)

There is a trade-off around the use of -uno between safety and
performance.  The default is not to use -uno so that you will not
forget to add a file you newly created (i.e safety).  You would pay
for the safety with the cost to find such untracked files (i.e.
performance).

I suspect that the documentation was written with the assumption
that at least for the people who are reading this part of the
documentation, the trade-off is obvious.  In order to find more
information, you naturally need to spend more cycles.

If the trade-off is not so obvious, however, I do not object at all
to describing it. But if we are to do so, I do object to mentioning
only one side of the trade-off.  People who choose fastest needs
to be made very aware that they are disabling safety.

That brings us back to the Why a separate implementation when -uno
is there? question.

Your patch adds new code; it does not just enables the same logic as
what the existing -uno does with a new flag.  Does the new compute
different things?  Does it find more stuff by spending extra cycles?
Does it find less stuff by being extra faster?

These questions are important.

If the new option strikes the trade-off between safety and
performance at a point different from the point where the existing
-uno option does, it _might_ still be worth adding as a separate
option.  I didn't get that impression when I saw the patch, but I
admit that I did not follow the code carefully myself.

That is the reason why I was wondering why a separate bit and
implementation had to be added by the patch.

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-03-08 Thread Torsten Bögershausen
On 08.03.13 09:15, Junio C Hamano wrote:
 Torsten Bögershausen tbo...@web.de writes:
 
 Doesn't this make one wonder why a separate bit and implementation
 is necessary to say I am not interested in untracked files when
 -uno option is already there?
 ...
 I need to admit that I wasn't aware about git status -uno.
 
 Not so fast.  I did not ask you Why do you need a new one to solve
 the same problem -uno already solves?
 
 Thinking about it, how many git users are aware of the speed penalty
 when running git status to find out which (tracked) files they had changed?

 Or to put it the other way, when a developer wants a quick overview
 about the files she changed, then git status -uno may be a good and fast 
 friend.

 Does it make sence to stress put that someway in the documentation?

 diff --git a/Documentation/git-status.txt b/Documentation/git-status.txt
 index 9f1ef9a..360d813 100644
 --- a/Documentation/git-status.txt
 +++ b/Documentation/git-status.txt
 @@ -51,13 +51,18 @@ default is 'normal', i.e. show untracked files and 
 directori
  +
  The possible options are:
  +
 -   - 'no' - Show no untracked files
 +   - 'no' - Show no untracked files (this is fastest)
 
 There is a trade-off around the use of -uno between safety and
 performance.  The default is not to use -uno so that you will not
 forget to add a file you newly created (i.e safety).  You would pay
 for the safety with the cost to find such untracked files (i.e.
 performance).
 
 I suspect that the documentation was written with the assumption
 that at least for the people who are reading this part of the
 documentation, the trade-off is obvious.  In order to find more
 information, you naturally need to spend more cycles.
 
 If the trade-off is not so obvious, however, I do not object at all
 to describing it. But if we are to do so, I do object to mentioning
 only one side of the trade-off.  People who choose fastest needs
 to be made very aware that they are disabling safety.
 
 That brings us back to the Why a separate implementation when -uno
 is there? question.
[...]
The short version:
The -uno option does exactly what the -c option intended to do ;-)
(The code path to disable the expensive call to read_directory_recursive()
in dir.c is slightly different).
Making benchmarks (again, sorry for the noise) shows that -uno and -c are 
equally fast,
making 5 git status on a linux tree, take the best of 5:

git status
real0m0.697s

git status -uno
real0m0.291s

(with the patch) git status -c
real0m0.289s


These are not really scientific numbers, but all in all we have motivation 
enough to drop
the git status -c patch completely.

My feeling is still that the suggested documentation this is fastest is not a 
good choice either.
Let me try to come up with a better suggestion.
/Torsten
 







--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-03-08 Thread Duy Nguyen
On Fri, Mar 8, 2013 at 3:15 PM, Junio C Hamano gits...@pobox.com wrote:
  The possible options are:
  +
 -   - 'no' - Show no untracked files
 +   - 'no' - Show no untracked files (this is fastest)

 There is a trade-off around the use of -uno between safety and
 performance.  The default is not to use -uno so that you will not
 forget to add a file you newly created (i.e safety).  You would pay
 for the safety with the cost to find such untracked files (i.e.
 performance).

 I suspect that the documentation was written with the assumption
 that at least for the people who are reading this part of the
 documentation, the trade-off is obvious.  In order to find more
 information, you naturally need to spend more cycles.

 If the trade-off is not so obvious, however, I do not object at all
 to describing it. But if we are to do so, I do object to mentioning
 only one side of the trade-off.  People who choose fastest needs
 to be made very aware that they are disabling safety.

On the topic of trading off, I was thinking about new -uauto as
default that is like -uall if it takes less than a certan amount of
time (e.g. 0.5 seconds), if it exceeds that limit, the operation is
aborted (i.e. it turns to -uno). The safety net is still there, git
status advices to use -u to show full information.

Or a less intrusive approach: measure the time and advice the user to
(read doc and) use -uno.

But it's probably worth waiting for the first cut of inotify support
from Ram. It's better with inotify anyway.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-03-07 Thread Torsten Bögershausen
On 11.02.13 03:56, Duy Nguyen wrote:
 On Mon, Feb 11, 2013 at 3:16 AM, Junio C Hamano gits...@pobox.com wrote:
 The other lstat() experiment was a very interesting one, but this
 is not yet an interesting experiment to see where in the ignore
 codepath we are spending times.

 We know that we can tell wt_status_collect_untracked() not to bother
 with the untracked or ignored files with !s-show_untracked_files
 already, but I think the more interesting question is if we can show
 the untracked files with less overhead.

 If we want to show untrackedd files, it is a given that we need to
 read directories to see what paths there are on the filesystem. Is
 the opendir/readdir cost dominating in the process? Are we spending
 a lot of time sifting the result of opendir/readdir via the ignore
 mechanism? Is reading the ignore files costing us much to prime
 the ignore mechanism?

 If readdir cost is dominant, then that makes cache gitignore a
 nonsense proposition, I think.  If you really want to cache
 something, you need to have somebody (i.e. a daemon) who constantly
 keeps an eye on the filesystem changes and can respond with the up
 to date result directly to fill_directory().  I somehow doubt that
 it is a direction we would want to go in, though.
 
 Yeah, it did not cut out syscall cost, I also cut a lot of user-space
 processing (plus .gitignore content access). From the timings I posted
 earlier,
 
 unmodified  dir.c
 real0m0.550s0m0.287s
 user0m0.305s0m0.201s
 sys 0m0.240s0m0.084s
 
 sys time is reduced from 0.24s to 0.08s, so readdir+opendir definitely
 has something to do with it (and perhaps reading .gitignore). But it
 also reduces user time from 0.305 to 0.201s. I don't think avoiding
 readdir+openddir will bring us this gain. It's probably the cost of
 matching .gitignore. I'll try to replace opendir+readdir with a
 no-syscall version. At this point untracked caching sounds more
 feasible (and less complex) than .gitignore cachine.
 
Thanks for Duy for the measurements, and patches.
I took the freedom to convert the patched dir.c into a 
runtime configurable git status option.
I'm not sure if the following copy-and-paste work applies,
(it is based on Git 1.8.1.3), but the time spend for 
git status --changed-only is basically half the time of
git status, similar to what Duy has measured.
I did a test both on a Linux box and Mac OS.

And the speedup is so impressive, that I am tempted to submit a patch simlar
to the following, what do we think about it?
/Torsten




-- 8 --

[PATCH] git status: add option changed-only
git status may be run faster if
- we only check if files are changed which are already known to git.
- we don't check if there are untracked files.

git status --changed-only (or the short form git status -c)

will only check for changed files which are already known to git,
and which are in the index.

The call to read_directory_recursive() is skipped and untracked files
in the working tree are not reported.

Inspired-by: Duy Nguyen pclo...@gmail.com
Signed-off-by: Torsten Bögershausen tbo...@web.de
---
 builtin/commit.c | 2 ++
 dir.c| 5 +++--
 dir.h| 3 ++-
 wt-status.c  | 3 +++
 wt-status.h  | 1 +
 5 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/builtin/commit.c b/builtin/commit.c
index d6dd3df..6a5ba11 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -1158,6 +1158,8 @@ int cmd_status(int argc, const char **argv, const char 
*prefix)
unsigned char sha1[20];
static struct option builtin_status_options[] = {
OPT__VERBOSE(verbose, N_(be verbose)),
+   OPT_BOOLEAN('c', changed-only, s.check_changed_only,
+   N_(Ignore untracked files. Check if files known to 
git are modified)),
OPT_SET_INT('s', short, status_format,
N_(show status concisely), STATUS_FORMAT_SHORT),
OPT_BOOLEAN('b', branch, s.show_branch,
diff --git a/dir.c b/dir.c
index a473ca2..555b652 100644
--- a/dir.c
+++ b/dir.c
@@ -1274,8 +1274,9 @@ int read_directory(struct dir_struct *dir, const char 
*path, int len, const char
return dir-nr;
 
simplify = create_simplify(pathspec);
-   if (!len || treat_leading_path(dir, path, len, simplify))
-   read_directory_recursive(dir, path, len, 0, simplify);
+   if ((!(dir-flags  DIR_CHECK_CHANGED_ONLY)) 
+   (!len || treat_leading_path(dir, path, len, simplify))) 
o
+   read_directory_recursive(dir, path, len, 0, simplify);
free_simplify(simplify);
qsort(dir-entries, dir-nr, sizeof(struct dir_entry *), cmp_name);
qsort(dir-ignored, dir-ignored_nr, sizeof(struct dir_entry *), 
cmp_name);
diff --git a/dir.h b/dir.h
index f5c89e3..1a915a7 100644
--- a/dir.h
+++ b/dir.h
@@ -41,7 +41,8 @@ struct dir_struct {
DIR_SHOW_OTHER_DIRECTORIES = 11,

Re: inotify to minimize stat() calls

2013-03-07 Thread Junio C Hamano
Torsten Bögershausen tbo...@web.de writes:

 diff --git a/builtin/commit.c b/builtin/commit.c
 index d6dd3df..6a5ba11 100644
 --- a/builtin/commit.c
 +++ b/builtin/commit.c
 @@ -1158,6 +1158,8 @@ int cmd_status(int argc, const char **argv, const char 
 *prefix)
   unsigned char sha1[20];
   static struct option builtin_status_options[] = {
   OPT__VERBOSE(verbose, N_(be verbose)),
 + OPT_BOOLEAN('c', changed-only, s.check_changed_only,
 + N_(Ignore untracked files. Check if files known to 
 git are modified)),

Doesn't this make one wonder why a separate bit and implementation
is necessary to say I am not interested in untracked files when
-uno option is already there?


--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-03-07 Thread Torsten Bögershausen
On 08.03.13 01:04, Junio C Hamano wrote:
 Torsten Bögershausen tbo...@web.de writes:
 
 diff --git a/builtin/commit.c b/builtin/commit.c
 index d6dd3df..6a5ba11 100644
 --- a/builtin/commit.c
 +++ b/builtin/commit.c
 @@ -1158,6 +1158,8 @@ int cmd_status(int argc, const char **argv, const char 
 *prefix)
  unsigned char sha1[20];
  static struct option builtin_status_options[] = {
  OPT__VERBOSE(verbose, N_(be verbose)),
 +OPT_BOOLEAN('c', changed-only, s.check_changed_only,
 +N_(Ignore untracked files. Check if files known to 
 git are modified)),
 
 Doesn't this make one wonder why a separate bit and implementation
 is necessary to say I am not interested in untracked files when
 -uno option is already there?
Thanks Junio,
this is good news.
I need to admit that I wasn't aware about git status -uno.

Thinking about it, how many git users are aware of the speed penalty
when running git status to find out which (tracked) files they had changed?

Or to put it the other way, when a developer wants a quick overview
about the files she changed, then git status -uno may be a good and fast friend.

Does it make sence to stress put that someway in the documentation?

diff --git a/Documentation/git-status.txt b/Documentation/git-status.txt
index 9f1ef9a..360d813 100644
--- a/Documentation/git-status.txt
+++ b/Documentation/git-status.txt
@@ -51,13 +51,18 @@ default is 'normal', i.e. show untracked files and directori
 +
 The possible options are:
 +
-   - 'no' - Show no untracked files
+   - 'no' - Show no untracked files (this is fastest)
- 'normal' - Shows untracked files and directories
- 'all'- Also shows individual files in untracked directories.
 +
 The default can be changed using the status.showUntrackedFiles
 configuration variable documented in linkgit:git-config[1].
 
++
+Note: Searching for untracked files or directories may take some time.
+A fast way to get a status of files tracked by git is to use
+'git status -uno'
+












 
 
 --
 To unsubscribe from this list: send the line unsubscribe git in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-19 Thread Ramkumar Ramachandra
Karsten Blees wrote:
 Am 11.02.2013 04:53, schrieb Duy Nguyen:
 On Sun, Feb 10, 2013 at 11:58 PM, Erik Faye-Lund kusmab...@gmail.com wrote:
 Karsten Blees has done something similar-ish on Windows, and he posted
 the results here:

 https://groups.google.com/forum/#!topic/msysgit/fL_jykUmUNE/discussion


 The new hashtable implementation in fscache [1] supports O(1) removal and has 
 no mingw dependencies - might come in handy for anyone trying to implement an 
 inotify daemon.

 [1] https://github.com/kblees/git/commit/f7eb85c2

Thanks!  I'm cherry-picking this.  Why didn't you propose replacing
hash.{c,h} with this in git.git though?

 I also seem to remember he doing a ReadDirectoryChangesW version, but
 I don't remember what happened with that.

 Thanks. I came across that but did not remember. For one thing, we
 know the inotify alternative for Windows: ReadDirectoryChangesW.


 I dropped ReadDirectoryChangesW because maintaining a 'live' file system 
 cache became more and more complicated. For example, according to MSDN docs, 
 ReadDirectoryChangesW *may* report short DOS 8.3 names (i.e. PROGRA~1 
 instead of Program Files), so a correct and fast cache implementation would 
 have to be indexed by long *and* short names...

 Another problem was that the 'live' cache had quite negative performance 
 impact on mutating git commands (checkout, reset...). An inotify daemon 
 running as a background process (not in-process as fscache) will probably 
 affect everyone that modifies the working copy, e.g. running 'make' or the 
 test-suite. This should be considered in the design.

Yes, an external daemon would report creation of *.o files, from the
compile, for instance.  We need a way for it to be filtered at the
daemon itself, so git isn't burdened with the filtering.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-19 Thread Ramkumar Ramachandra
Robert Zeh wrote:
 On Sun, Feb 10, 2013 at 9:21 PM, Duy Nguyen pclo...@gmail.com wrote:
 On Mon, Feb 11, 2013 at 2:03 AM, Robert Zeh robert.allan@gmail.com 
 wrote:
 On Sat, Feb 9, 2013 at 1:35 PM, Junio C Hamano gits...@pobox.com wrote:
 Ramkumar Ramachandra artag...@gmail.com writes:

 This is much better than Junio's suggestion to study possible
 implementations on all platforms and designing a generic daemon/
 communication channel.  That's no weekend project.

 It appears that you misunderstood what I wrote.  That was not here
 is a design; I want it in my system.  Go implemment it.

 It was If somebody wants to discuss it but does not know where to
 begin, doing a small experiment like this and reporting how well it
 worked here may be one way to do so., nothing more.

 What if instead of communicating over a socket, the daemon
 dumped a file containing all of the lstat information after git
 wrote a file? By definition the daemon should know about file writes.

 There would be no network communication, which I think would make
 things more secure. It would simplify the rendezvous by insisting on
 well known locations in $GIT_DIR.

 We need some sort of interactive communication to the daemon anyway,
 to validate that the information is uptodate. Assume that a user makes
 some changes to his worktree before starting the daemon, git needs to
 know that what the daemon provides does not represent a complete
 file-change picture and it better refreshes the index the old way
 once, then trust the daemon.

 I think we could solve that by storing a session id, provided by the
 daemon, in .git/index. If the session id is not present (or does not
 match what the current daemon gives), refresh the old way. After
 refreshing, it may ask the daemon for new session id and store it.
 Next time if the session id is still valid, trust the daemon's data.
 This session id should be different every time the daemon restarts for
 this to work.

 I think we could do this without interactive communication,
 if we did the following:
1) The Daemon waits to see $GIT_DIR/lstat_request, and atomically
writes out $GIT_DIR/lstat_cache.  By atomically I mean that it writes
things out to a temporary file, and then does a rename.

2) The client erases $GIT_DIR/lstat_cache, and writes
   $GIT_DIR/lstat_request

 I think this is better than socket based communication because there
 are fewer places to check
 for failures.

My main problem with file-based solutions is this: how will the daemon
accumulate inotify change events over time, and report it in a batch
to a git application that is spawned?  Will it append to the
.git/inotify_changes file everytime there's a change?  Wouldn't you
prefer to accumulate the events in-memory and report it over a socket
upon explicit request, to minimize IO?
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-19 Thread Drew Northup
On Sun, Feb 10, 2013 at 12:24 AM, Duy Nguyen pclo...@gmail.com wrote:
 On Sun, Feb 10, 2013 at 12:10 AM, Ramkumar Ramachandra
 artag...@gmail.com wrote:
 Finn notes in the commit message that it offers no speedup, because
 .gitignore files in every directory still have to be read.  I think
 this is silly: we really should be caching .gitignore, and touching it
 only when lstat() reports that the file has changed.
 ...
 Really, the elephant in the room right now seems to be .gitignore.
 Until that is fixed, there is really no use of writing this inotify
 daemon, no?  Can someone enlighten me on how exactly .gitignore files
 are processed?

 .gitignore is a different issue. I think it's mainly used with
 read_directory/fill_directory to collect ignored files (or not-ignored
 files). And it's not always used (well, status and add does, but diff
 should not). I think wee need to measure how much mass lstat
 elimination gains us (especially on big repos) and how much
 .gitignore/.gitattributes caching does. I don't think .gitignore has
 such a big impact though. strace on git.git tells me git status
 issues about 2500 lstat calls, and just 740 open+getdents calls (on
 total 3800 syscalls). I will think if we can do something about
 .gitignore/.gitattributes.
 --
 Duy

Duy,
Did your testing turn up anything about the amount of time spent
parsing the .gitignore/.gitattributes files? Not the syscall count,
but the actual time spent running the parser (which I presume is
largely CPU-bound). The other notable bit of information to know would
be how much time is spent applying what has been parsed out of those
files to the content of the tree. Both will give a clear signal of the
prominence of those segments of code versus others elsewhere in the
git stat flow path. That information will tell us more clearly what,
if anything, it is worth keeping a cache of and what form that cache
should take.

-- 
-Drew Northup
--
As opposed to vegetable or mineral error?
-John Pescatore, SANS NewsBites Vol. 12 Num. 59
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-19 Thread Duy Nguyen
On Tue, Feb 19, 2013 at 8:16 PM, Drew Northup n1xim.em...@gmail.com wrote:
 Did your testing turn up anything about the amount of time spent
 parsing the .gitignore/.gitattributes files? Not the syscall count,
 but the actual time spent running the parser (which I presume is
 largely CPU-bound). The other notable bit of information to know would
 be how much time is spent applying what has been parsed out of those
 files to the content of the tree. Both will give a clear signal of the
 prominence of those segments of code versus others elsewhere in the
 git stat flow path. That information will tell us more clearly what,
 if anything, it is worth keeping a cache of and what form that cache
 should take.

Not specifically parsing, but we do waste CPU on
.gitignore/.gitattributes stuff. See

http://thread.gmane.org/gmane.comp.version-control.git/216347/focus=216381

Other measurements (which led to the above patch):

http://thread.gmane.org/gmane.comp.version-control.git/215820/focus=215900
http://thread.gmane.org/gmane.comp.version-control.git/215820/focus=216029
http://thread.gmane.org/gmane.comp.version-control.git/215820/focus=216195

So far we could reduce lstat, {open,read,close}dir syscalls with the
help of inotify, which saves time. I'm not sure if we should cache the
list of untracked-but-not-ignored files. It cuts down cpu time on
.gitignore but invalidation could be complicated.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-14 Thread Magnus Bäck
On Sunday, February 10, 2013 at 08:26 EST,
 demerphq demer...@gmail.com wrote:

 Is windows stat really so slow?

Well, the problem is that there is no such thing as Windows stat :-)

 I encountered this perception in windows Perl in the past, and I know
 that on windows Perl stat *appears* slow compared to *nix, because in
 order to satisfy the full *nix stat interface, specifically the nlink
 field, it must open and close the file*. As of 5.10 this can be
 disabled by setting a magic var ${^WIN32_SLOPPY_STAT} to a true value,
 which makes a significant improvement to the performance of the Perl
 level stat implementation.  I would not be surprised if the cygwin
 implementation of stat() has the same issue as Perl did, and that stat
 appears much slower than it actually need be if you don't care about
 the nlink field.

I suggested a few years ago that FindFirstFile() be used to implement
stat() since it's way faster than opening and closing the file, but
FindFirstFile() apparently produces unreliable mtime results when DST
shifts are involved.

http://thread.gmane.org/gmane.comp.version-control.git/114041
(The reference link in Johannes Sixt's first email is broken, but I'm
sure the information can be dug up.)

Based on a quick look it seems GetFileAttributesEx() is still used for
mingw and cygwin Git.

-- 
Magnus Bäck
ba...@google.com
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-14 Thread Ævar Arnfjörð Bjarmason
On Fri, Feb 8, 2013 at 10:10 PM, Ramkumar Ramachandra
artag...@gmail.com wrote:
 For large repositories, many simple git commands like `git status`
 take a while to respond.  I understand that this is because of large
 number of stat() calls to figure out which files were changed.  I
 overheard that Mercurial wants to solve this problem using itnotify,
 but the idea bothers me because it's not portable.  Will Git ever
 consider using inotify on Linux?  What is the downside?

There's one relatively easy sub-task of this that I haven't seen
mentioned: Improving the speed of interactive rebase on large (as in
lots of checked out files) repositories.

That's the single biggest thing that bothers me when I use Git with
large repos, not the speed of git status. When you git rebase -i
HEAD~100 re-arrange some patches and save the TODO list it takes say
0.5-1s for each patch to be applied, but at least 10x less than that
on a small repository. E.g. try this on linux-2.6.git v.s. some small
project with a few dozen files.

I looked into this a long while ago and remembered that rebase was
doing something like a git status for every commit that it made to
check the dirtyness.

This could be vastly improved by having an unsafe option to git-rebase
where it just assumes that the starting state + whatever it wrote out
is the current state, i.e. it would break if someone stuck up on your
checkout during an interactive rebase and changed a file, but the
common case of the user having exclusive access to the repo and
waiting for the rebase would be much faster.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-14 Thread Junio C Hamano
Ævar Arnfjörð Bjarmason ava...@gmail.com writes:

 I looked into this a long while ago and remembered that rebase was
 doing something like a git status for every commit that it made to
 check the dirtyness.

 This could be vastly improved by having an unsafe option to git-rebase
 where it just assumes that the starting state + whatever it wrote out
 is the current state, i.e. it would break if someone stuck up on your
 checkout during an interactive rebase and changed a file,...

You could make it a lot safer than just assumes, and the result
may become generally usable, I think.  For example, you can set a
magic bit somewhere in $GIT_DIR/rebase-i while you are in I am
doing pick/pick/pick and the user will not interfere me mode, and
clear that bit upon rebase --continue.  And you cheat only while
that magic bit is set.

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-13 Thread Duy Nguyen
On Tue, Feb 12, 2013 at 09:48:18PM +0100, Karsten Blees wrote:

 However, the difference between git status -uall and -uno was always
 about 1.3 s in all fscache versions, even though
 opendir/readdir/closedir was served entirely from the cache. I added
 a bit of performance tracing to find the cause, and I think most of
 the time spent in wt_status_collect_untracked can be eliminated:
 
 1.) 0.939 s is spent in dir.c/excluded (i.e. checking
 .gitignore). This check is done for *every* file in the working
 copy, including files in the index. Checking the index first could
 eliminate most of that, i.e.:
 
 (Note: patches are for discussion only, I'm aware that they may have
 unintended side effects...)

 @@ -1097,6 +1097,8 @@ static enum path_treatment treat_path(struct dir_struct 
 *dir,
 return path_ignored;
 strbuf_setlen(path, baselen);
 strbuf_addstr(path, de-d_name);
 +   if (cache_name_exists(path-buf, path-len, ignore_case))
 +   return path_ignored;
 if (simplify_away(path-buf, path-len, simplify))
 return path_ignored;

The below patch passes the test suite for me and still does the same
thing. On my Linux box, running git status on gentoo-x86.git with
this patch saves 0.05s (0.548s without the patch, 0.505s with the
patch, best number of 20 runs).

And I just realized gentoo-x86.git does not have any .gitignore. On
webkit.git, it cuts git status time from 1.121s down to
0.762s. Unless I'm mistaken, git add should have the same benefit on
normal case too. Good finding!

-- 8 --
diff --git a/dir.c b/dir.c
index 57394e4..4b4cf60 100644
--- a/dir.c
+++ b/dir.c
@@ -1244,7 +1244,19 @@ static enum path_treatment treat_one_path(struct 
dir_struct *dir,
  const struct path_simplify *simplify,
  int dtype, struct dirent *de)
 {
-   int exclude = is_excluded(dir, path-buf, dtype);
+   int exclude;
+
+   if (dtype == DT_UNKNOWN)
+   dtype = get_dtype(de, path-buf, path-len);
+
+   if (!(dir-flags  DIR_SHOW_IGNORED) 
+   !(dir-flags  DIR_COLLECT_IGNORED) 
+   dtype == DT_REG 
+   cache_name_exists(path-buf, path-len, ignore_case))
+   return path_ignored;
+
+   exclude = is_excluded(dir, path-buf, dtype);
+
if (exclude  (dir-flags  DIR_COLLECT_IGNORED)
 exclude_matches_pathspec(path-buf, path-len, simplify))
dir_add_ignored(dir, path-buf, path-len);
@@ -1256,9 +1268,6 @@ static enum path_treatment treat_one_path(struct 
dir_struct *dir,
if (exclude  !(dir-flags  DIR_SHOW_IGNORED))
return path_ignored;
 
-   if (dtype == DT_UNKNOWN)
-   dtype = get_dtype(de, path-buf, path-len);
-
switch (dtype) {
default:
return path_ignored;
-- 8 --
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-13 Thread Jeff King
On Wed, Feb 13, 2013 at 07:15:47PM +0700, Nguyen Thai Ngoc Duy wrote:

 On Wed, Feb 13, 2013 at 3:48 AM, Karsten Blees karsten.bl...@gmail.com 
 wrote:
  2.) 0.135 s is spent in name-hash.c/hash_index_entry_directories, 
  reindexing the same directories over and over again. In the end, the 
  hashtable contains 939k directory entries, even though the WebKit test repo 
  only has 7k directories. Checking if a directory entry already exists could 
  reduce that, i.e.:
 
 This function is only used when core.ignorecase = true. I probably
 won't be able to test this, so I'll leave this to other people who
 care about ignorecase.
 
 This function used to have lookup_hash, but it was removed by Jeff in
 2548183 (fix phantom untracked files when core.ignorecase is set -
 2011-10-06). There's a looong commit message which I'm too lazy to
 read. Anybody who works on this should though.

Yeah, the problem that commit tried to solve is that linking to a single
cache entry through the hash is not enough, because we may remove cache
items. Imagine you have dir/one and dir/two, and you add them to the
in-memory index in that order. The original code hashed dir/ and
inserted a link to the dir/one cache entry. When it came time to put
in the dir/two entry, we noticed that there was already a dir/ entry
and did nothing. Then later, if we remove dir/one, we do so by marking
it with CE_UNHASHED. So a later query for dir/ will see nope, nothing
here that wasn't CE_UNHASHED, which is wrong. We never recorded that
dir/two existed under the hash for dir/, so we can't know about it.

My patch just stores the cache_entry for both under the dir/ hash.
As Karsten noticed, that can lead to a large number of hash entries,
because adding some/deep/hierarchy/with/files will add 4 directory
entries for just that single file. Moreover, looking at it again, I
don't think my patch produces the right behavior: we have a single
dir_next pointer, even though the same ce_entry may appear under many
directory hashes. So the cache_entries that has to dir/foo/ and those
that hash to dir/bar/ may get confused, because they will also both be
found under dir/, and both try to create a linked list from the
dir_next pointer.

Looking at Karsten's patch, it seems like it will not add a cache entry
if there is one of the same name. But I'm not sure if that is right, as
the old one might be CE_UNHASHED (or it might get removed later). You
actually want to be able to find each cache_entry that has a file under
the directory at the hash of that directory, so you can make sure it is
still valid.

And of course that still leaves the existing correctness problem I
mentioned above.

I think the best way forward is to actually create a separate hash table
for the directory lookups. I note that we only care about these entries
in directory_exists_in_index_icase, which is really about whether
something is there, versus what exactly is there. So could we maybe get
by with a separate hash table that stores a count of entries at each
directory, and increment/decrement the count when we add/remove entries?

The biggest problem I see with that is that we do indeed care a little
bit what is at the directory: we check the mode to see if it is a gitdir
or not. But I think we can maybe sneak around that: gitdirs have actual
entries in the index, whereas the directories do not. So we would find
them via index_name_exists; anything that is not there, but _is_ in the
special directory hash would therefore be a directory.

I realize it got pretty esoteric there in the middle. I'll see if I can
work up a patch that expresses what I'm thinking.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-13 Thread Jeff King
On Wed, Feb 13, 2013 at 01:18:51PM -0500, Jeff King wrote:

 I think the best way forward is to actually create a separate hash table
 for the directory lookups. I note that we only care about these entries
 in directory_exists_in_index_icase, which is really about whether
 something is there, versus what exactly is there. So could we maybe get
 by with a separate hash table that stores a count of entries at each
 directory, and increment/decrement the count when we add/remove entries?
 
 The biggest problem I see with that is that we do indeed care a little
 bit what is at the directory: we check the mode to see if it is a gitdir
 or not. But I think we can maybe sneak around that: gitdirs have actual
 entries in the index, whereas the directories do not. So we would find
 them via index_name_exists; anything that is not there, but _is_ in the
 special directory hash would therefore be a directory.
 
 I realize it got pretty esoteric there in the middle. I'll see if I can
 work up a patch that expresses what I'm thinking.

So here's a patch. It's mostly meant to illustrate what I'm thinking,
and I have no clue if it introduces regressions. It does pass the test
suite, but we have virtually no ignorecase tests.  It seems to behave
sanely when I set core.ignorecase on my Linux box, but I have no idea
what it will do on a real case-insensitive system (nor even, to be
honest, what kinds of scenarios should be tested for the dir-hashing
stuff).

---
diff --git a/cache.h b/cache.h
index e493563..6630a35 100644
--- a/cache.h
+++ b/cache.h
@@ -131,7 +131,6 @@ struct cache_entry {
unsigned int ce_namelen;
unsigned char sha1[20];
struct cache_entry *next;
-   struct cache_entry *dir_next;
char name[FLEX_ARRAY]; /* more */
 };
 
@@ -267,26 +266,14 @@ extern void add_name_hash(struct index_state *istate, 
struct cache_entry *ce);
unsigned name_hash_initialized : 1,
 initialized : 1;
struct hash_table name_hash;
+   struct hash_table dir_hash;
 };
 
 extern struct index_state the_index;
 
 /* Name hashing */
 extern void add_name_hash(struct index_state *istate, struct cache_entry *ce);
-/*
- * We don't actually *remove* it, we can just mark it invalid so that
- * we won't find it in lookups.
- *
- * Not only would we have to search the lists (simple enough), but
- * we'd also have to rehash other hash buckets in case this makes the
- * hash bucket empty (common). So it's much better to just mark
- * it.
- */
-static inline void remove_name_hash(struct cache_entry *ce)
-{
-   ce-ce_flags |= CE_UNHASHED;
-}
-
+extern void remove_name_hash(struct index_state *istate, struct cache_entry 
*ce);
 
 #ifndef NO_THE_INDEX_COMPATIBILITY_MACROS
 #define active_cache (the_index.cache)
@@ -443,6 +430,7 @@ extern struct cache_entry *index_name_exists(struct 
index_state *istate, const c
 extern int unmerged_index(const struct index_state *);
 extern int verify_path(const char *path);
 extern struct cache_entry *index_name_exists(struct index_state *istate, const 
char *name, int namelen, int igncase);
+extern int index_icase_dir_exists(struct index_state *istate, const char 
*name, int namelen);
 extern int index_name_pos(const struct index_state *, const char *name, int 
namelen);
 #define ADD_CACHE_OK_TO_ADD 1  /* Ok to add */
 #define ADD_CACHE_OK_TO_REPLACE 2  /* Ok to replace file/directory */
diff --git a/dir.c b/dir.c
index 57394e4..f73ac34 100644
--- a/dir.c
+++ b/dir.c
@@ -927,29 +927,27 @@ static enum exist_status 
directory_exists_in_index_icase(const char *dirname, in
  */
 static enum exist_status directory_exists_in_index_icase(const char *dirname, 
int len)
 {
-   struct cache_entry *ce = index_name_exists(the_index, dirname, len + 
1, ignore_case);
-   unsigned char endchar;
-
-   if (!ce)
-   return index_nonexistent;
-   endchar = ce-name[len];
+   struct cache_entry *ce = index_name_exists(the_index, dirname, len, 
ignore_case);
 
/*
-* The cache_entry structure returned will contain this dirname
-* and possibly additional path components.
+* We found something in the index, which means it is either an actual
+* file, or a gitdir.
 */
-   if (endchar == '/')
-   return index_directory;
+   if (ce) {
+   if (S_ISGITLINK(ce-ce_mode))
+   return index_gitdir;
+   /* We call a file index_nonexistent here, because the caller is
+* asking about a directory.  */
+   return index_nonexistent;
+   }
 
/*
-* If there are no additional path components, then this cache_entry
-* represents a submodule.  Submodules, despite being directories,
-* are stored in the cache without a closing slash.
+* Otherwise, it might be a leading path of something that is in the
+* index. We can look it up in the special dir hash.
 */
-   if 

Re: inotify to minimize stat() calls

2013-02-13 Thread Karsten Blees
Am 13.02.2013 19:18, schrieb Jeff King:
 Moreover, looking at it again, I
 don't think my patch produces the right behavior: we have a single
 dir_next pointer, even though the same ce_entry may appear under many
 directory hashes. So the cache_entries that has to dir/foo/ and those
 that hash to dir/bar/ may get confused, because they will also both be
 found under dir/, and both try to create a linked list from the
 dir_next pointer.
 

Indeed. In the worst case, this causes an endless loop if ce-dir_next == ce
---8---
mkdir V
mkdir V/XQANY
mkdir WURZAUP
touch V/XQANY/test
git init
git config core.ignorecase true
git add .
git status
---8---
Note: V/, V/XQANY/ and WURZAUP/ all have the same hash_name. Although I 
found those strange values by brute force, hash collisions in 32 bit values are 
not that uncommon in real life :-)

 Looking at Karsten's patch, it seems like it will not add a cache entry
 if there is one of the same name. But I'm not sure if that is right, as
 the old one might be CE_UNHASHED (or it might get removed later). You
 actually want to be able to find each cache_entry that has a file under
 the directory at the hash of that directory, so you can make sure it is
 still valid.
 

Yes, the patch was just to show potential performance savings, I didn't 
consider CE_UNHASHED at all.

 I think the best way forward is to actually create a separate hash table
 for the directory lookups. I note that we only care about these entries
 in directory_exists_in_index_icase, which is really about whether
 something is there, versus what exactly is there. So could we maybe get
 by with a separate hash table that stores a count of entries at each
 directory, and increment/decrement the count when we add/remove entries?
 

Alternatively, we could simply create normal cache_entries for the directories 
that are linked via ce-next, but have a trailing '/' in their name?

Reference counting sounds good to me, at least better than allocating directory 
entries per cache entry * parent dirs.

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-13 Thread Jeff King
On Wed, Feb 13, 2013 at 09:25:59PM +0100, Karsten Blees wrote:

 Am 13.02.2013 19:18, schrieb Jeff King:
  Moreover, looking at it again, I
  don't think my patch produces the right behavior: we have a single
  dir_next pointer, even though the same ce_entry may appear under many
  directory hashes. So the cache_entries that has to dir/foo/ and those
  that hash to dir/bar/ may get confused, because they will also both be
  found under dir/, and both try to create a linked list from the
  dir_next pointer.
  
 
 Indeed. In the worst case, this causes an endless loop if ce-dir_next == ce
 ---8---
 mkdir V
 mkdir V/XQANY
 mkdir WURZAUP
 touch V/XQANY/test
 git init
 git config core.ignorecase true
 git add .
 git status
 ---8---

Great, thanks for the test case. I can trivially replicate the endless
loop. The patch I sent earlier fixes that. So it's at least a step in
the (possible) right direction. I'm slightly concerned that there is
some other case that is expecting the directories in the main hash, but
I think I got them all.

 Note: V/, V/XQANY/ and WURZAUP/ all have the same hash_name.
 Although I found those strange values by brute force, hash collisions
 in 32 bit values are not that uncommon in real life :-)

Cute. :)

 Alternatively, we could simply create normal cache_entries for the
 directories that are linked via ce-next, but have a trailing '/' in
 their name?

 Reference counting sounds good to me, at least better than allocating
 directory entries per cache entry * parent dirs.

I think that is more or less what my patch does, but it splits the
ref-counted fake cache_entries out into a separate hash of struct
dir_hash_entry rather than storing it in the regular hash. Which IMHO
is a bit cleaner for two reasons:

  1. You do not have to pay the memory price of storing fake
 cache_entries (the name+refcount struct for each directory is much
 smaller than a real cache_entry).

  2. It makes the code a bit simpler, as you do not have to do any
 check for trailing / magic on the result of index_name_exists to
 determine if it is a real name or just a fake dir entry.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-13 Thread Karsten Blees
Am 13.02.2013 23:55, schrieb Jeff King:
 On Wed, Feb 13, 2013 at 09:25:59PM +0100, Karsten Blees wrote:
 
 Alternatively, we could simply create normal cache_entries for the
 directories that are linked via ce-next, but have a trailing '/' in
 their name?

 Reference counting sounds good to me, at least better than allocating
 directory entries per cache entry * parent dirs.
 
 I think that is more or less what my patch does, but it splits the
 ref-counted fake cache_entries out into a separate hash of struct
 dir_hash_entry rather than storing it in the regular hash. Which IMHO
 is a bit cleaner for two reasons:
 
   1. You do not have to pay the memory price of storing fake
  cache_entries (the name+refcount struct for each directory is much
  smaller than a real cache_entry).
 

Yes, but considering the small number of directories compared to files, I think 
this is a relatively small price to pay.

   2. It makes the code a bit simpler, as you do not have to do any
  check for trailing / magic on the result of index_name_exists to
  determine if it is a real name or just a fake dir entry.
 

True for dir.c. On the other hand, you need a lot of new find / find_or_create 
logic in name-hash.c.

Just to illustrate what I mean, here's a quick sketch (there's still a segfault 
somewhere, but I don't have time to debug right now...).

Note that hash_index_entry_directories works from right to left - if the 
immediate parent directory is there, there's no need to check the parent's 
parent.

cache_entry.dir points to the parent directory so that we don't need to lookup 
all path components for reference counting when adding / removing entries.

As directory entries are 'real' cache_entries, we can reuse the existing 
index_name_exists and hash_index_entry code.

I feel slightly guilty for abusing ce_size as reference counter...well :-)

---
 cache.h |  4 +++-
 name-hash.c | 80 -
 2 files changed, 39 insertions(+), 45 deletions(-)

diff --git a/cache.h b/cache.h
index 665b512..2bc1372 100644
--- a/cache.h
+++ b/cache.h
@@ -131,7 +131,7 @@ struct cache_entry {
unsigned int ce_namelen;
unsigned char sha1[20];
struct cache_entry *next;
-   struct cache_entry *dir_next;
+   struct cache_entry *dir;
char name[FLEX_ARRAY]; /* more */
 };
 
@@ -285,6 +285,8 @@ extern void add_name_hash(struct index_state *istate, 
struct cache_entry *ce);
 static inline void remove_name_hash(struct cache_entry *ce)
 {
ce-ce_flags |= CE_UNHASHED;
+   if (ce-dir  !(--ce-dir-ce_size))
+   remove_name_hash(ce-dir);
 }
 
 
diff --git a/name-hash.c b/name-hash.c
index d8d25c2..01e8320 100644
--- a/name-hash.c
+++ b/name-hash.c
@@ -32,6 +32,9 @@ static unsigned int hash_name(const char *name, int namelen)
return hash;
 }
 
+static struct cache_entry *lookup_index_entry(struct index_state *istate, 
const char *name, int namelen, int icase);
+static void hash_index_entry(struct index_state *istate, struct cache_entry 
*ce);
+
 static void hash_index_entry_directories(struct index_state *istate, struct 
cache_entry *ce)
 {
/*
@@ -40,30 +43,25 @@ static void hash_index_entry_directories(struct index_state 
*istate, struct cach
 * closing slash.  Despite submodules being a directory, they never
 * reach this point, because they are stored without a closing slash
 * in the cache.
-*
-* Note that the cache_entry stored with the directory does not
-* represent the directory itself.  It is a pointer to an existing
-* filename, and its only purpose is to represent existence of the
-* directory in the cache.  It is very possible multiple directory
-* hash entries may point to the same cache_entry.
 */
-   unsigned int hash;
-   void **pos;
+   int len = ce_namelen(ce);
+   if (len  ce-name[len - 1] == '/')
+   len--;
+   while (len  ce-name[len - 1] != '/')
+   len--;
+   if (!len)
+   return;
 
-   const char *ptr = ce-name;
-   while (*ptr) {
-   while (*ptr  *ptr != '/')
-   ++ptr;
-   if (*ptr == '/') {
-   ++ptr;
-   hash = hash_name(ce-name, ptr - ce-name);
-   pos = insert_hash(hash, ce, istate-name_hash);
-   if (pos) {
-   ce-dir_next = *pos;
-   *pos = ce;
-   }
-   }
+   ce-dir = lookup_index_entry(istate, ce-name, len, ignore_case);
+   if (!ce-dir) {
+   ce-dir = xcalloc(1, cache_entry_size(len));
+   memcpy(ce-dir-name, ce-name, len);
+   ce-dir-ce_namelen = len;
+   ce-dir-name[len] = 0;
+   hash_index_entry(istate, ce-dir);
}
+   ce-dir-ce_flags 

Re: inotify to minimize stat() calls

2013-02-12 Thread Karsten Blees
Am 11.02.2013 04:53, schrieb Duy Nguyen:
 On Sun, Feb 10, 2013 at 11:58 PM, Erik Faye-Lund kusmab...@gmail.com wrote:
 Karsten Blees has done something similar-ish on Windows, and he posted
 the results here:

 https://groups.google.com/forum/#!topic/msysgit/fL_jykUmUNE/discussion


The new hashtable implementation in fscache [1] supports O(1) removal and has 
no mingw dependencies - might come in handy for anyone trying to implement an 
inotify daemon.

[1] https://github.com/kblees/git/commit/f7eb85c2

 I also seem to remember he doing a ReadDirectoryChangesW version, but
 I don't remember what happened with that.
 
 Thanks. I came across that but did not remember. For one thing, we
 know the inotify alternative for Windows: ReadDirectoryChangesW.
 

I dropped ReadDirectoryChangesW because maintaining a 'live' file system cache 
became more and more complicated. For example, according to MSDN docs, 
ReadDirectoryChangesW *may* report short DOS 8.3 names (i.e. PROGRA~1 instead 
of Program Files), so a correct and fast cache implementation would have to 
be indexed by long *and* short names...

Another problem was that the 'live' cache had quite negative performance impact 
on mutating git commands (checkout, reset...). An inotify daemon running as a 
background process (not in-process as fscache) will probably affect everyone 
that modifies the working copy, e.g. running 'make' or the test-suite. This 
should be considered in the design.

 I copy git status's (impressive) numbers from fscache-v0 for those
 who are interested in:
 
 preload | -u  | normal | cached | gain
 +-+++--
 false   | all | 25.144 | 3.055  |  8.2
 false   | no  | 22.822 | 1.748  | 12.8
 true| all |  9.234 | 2.179  |  4.2
 true| no  |  6.833 | 0.955  |  7.2
 

Note that I wasn't able to reproduce such bad 'normal' values in later tests, I 
guess disk fragmentation and/or virus scanner must have tricked me on that 
day...gain factors of 2.5 - 5 are more appropriate.


However, the difference between git status -uall and -uno was always about 1.3 
s in all fscache versions, even though opendir/readdir/closedir was served 
entirely from the cache. I added a bit of performance tracing to find the 
cause, and I think most of the time spent in wt_status_collect_untracked can be 
eliminated:

1.) 0.939 s is spent in dir.c/excluded (i.e. checking .gitignore). This check 
is done for *every* file in the working copy, including files in the index. 
Checking the index first could eliminate most of that, i.e.:

(Note: patches are for discussion only, I'm aware that they may have unintended 
side effects...)

@@ -1097,6 +1097,8 @@ static enum path_treatment treat_path(struct dir_struct 
*dir,
return path_ignored;
strbuf_setlen(path, baselen);
strbuf_addstr(path, de-d_name);
+   if (cache_name_exists(path-buf, path-len, ignore_case))
+   return path_ignored;
if (simplify_away(path-buf, path-len, simplify))
return path_ignored;
---


2.) 0.135 s is spent in name-hash.c/hash_index_entry_directories, reindexing 
the same directories over and over again. In the end, the hashtable contains 
939k directory entries, even though the WebKit test repo only has 7k 
directories. Checking if a directory entry already exists could reduce that, 
i.e.:

@@ -53,14 +55,23 @@ static void hash_index_entry_directories(struct index_state 
*istate, struct cach
unsigned int hash;
void **pos;
double t = ticks();
+   struct cache_entry *ce2;
+   int len = ce_namelen(ce);
 
-   const char *ptr = ce-name;
-   while (*ptr) {
-   while (*ptr  *ptr != '/')
-   ++ptr;
-   if (*ptr == '/') {
-   ++ptr;
-   hash = hash_name(ce-name, ptr - ce-name);
+   while (len  0) {
+   while (len  0  ce-name[len - 1] != '/')
+   len--;
+   if (len  0) {
+   hash = hash_name(ce-name, len);
+   ce2 = lookup_hash(hash, istate-name_hash);
+   while (ce2) {
+   if (same_name(ce2, ce-name, len, ignore_case)) 
{
+   add_since(t, hash_dirs);
+   return;
+   }
+   ce2 = ce2-dir_next;
+   }
+   len--;
pos = insert_hash(hash, ce, istate-name_hash);
if (pos) {
ce-dir_next = *pos;
---


Tests were done with the WebKit repo (~200k files, ~7k directories, 15 
.gitignore files, ~100 entries in root .gitignore). Instrumented code can be 
found here: https://github.com/kblees/git/tree/kb/git-status-performance-tracing

Here's the performance traces of 'git status -s -uall'

Before patches:

trace: at 

Re: inotify to minimize stat() calls

2013-02-11 Thread Duy Nguyen
On Mon, Feb 11, 2013 at 9:56 AM, Duy Nguyen pclo...@gmail.com wrote:
 Yeah, it did not cut out syscall cost, I also cut a lot of user-space
 processing (plus .gitignore content access). From the timings I posted
 earlier,

 unmodified  dir.c
 real0m0.550s0m0.287s
 user0m0.305s0m0.201s
 sys 0m0.240s0m0.084s

 sys time is reduced from 0.24s to 0.08s, so readdir+opendir definitely
 has something to do with it (and perhaps reading .gitignore). But it
 also reduces user time from 0.305 to 0.201s. I don't think avoiding
 readdir+openddir will bring us this gain. It's probably the cost of
 matching .gitignore. I'll try to replace opendir+readdir with a
 no-syscall version. At this point untracked caching sounds more
 feasible (and less complex) than .gitignore cachine.

And this is read_directory's timing breakdown (again, git status on
gentoo-x86,git, built with -O2 on x86-64 if I did not mention before)

opendir   = 0.030s
readdir   = 0.083s
closedir  = 0.020s
{open,read,close}dir = 0.132s
treat_path   = 0.094s (172534 times)
dir_add_name = 0.050s (101917 times)
read_directory   = 0.292s
# On branch master
nothing to commit, working directory clean

real0m0.511s
user0m0.347s
sys 0m0.157s

Instrumentation is done with gettimeofday. Without gettimeofday calls
inside read_directory_recursive, read_directory takes 0.267s (iow,
gettimeofday cost is about 0.30s). {open,read,close}dir + treat_path +
dir_add_name + gettimeofday add up quite close to 0.292s (strbuf_*
takes just about 0.005s)

Eliminating xxxdir syscalls may save us 0.132s (or less, we need to
pay to get the information elsewhere).

Because my worktree is clean, dir_add_name spends all 0.05s in
cache_name_exists(). If we somehow know the input path is not a
tracked entry, we could avoid cache_name_exists() and save 0.05s.

If we do the untracked cache, the number of treat_path calls should
be much lower. In this particular case of gentoo-x86, I'd expect no
more than a dozen of untracked files, which cuts down treat_path and
dir_add_name's time to near zero. On a normal repository like git.git,
untracked files are about 1075 files with 2552 tracked files, we
should be able to save 2/3 to 1/2 of treat_path calls.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-11 Thread Robert Zeh
On Sun, Feb 10, 2013 at 9:21 PM, Duy Nguyen pclo...@gmail.com wrote:
 On Mon, Feb 11, 2013 at 2:03 AM, Robert Zeh robert.allan@gmail.com 
 wrote:
 On Sat, Feb 9, 2013 at 1:35 PM, Junio C Hamano gits...@pobox.com wrote:
 Ramkumar Ramachandra artag...@gmail.com writes:

 This is much better than Junio's suggestion to study possible
 implementations on all platforms and designing a generic daemon/
 communication channel.  That's no weekend project.

 It appears that you misunderstood what I wrote.  That was not here
 is a design; I want it in my system.  Go implemment it.

 It was If somebody wants to discuss it but does not know where to
 begin, doing a small experiment like this and reporting how well it
 worked here may be one way to do so., nothing more.

 What if instead of communicating over a socket, the daemon
 dumped a file containing all of the lstat information after git
 wrote a file? By definition the daemon should know about file writes.

 There would be no network communication, which I think would make
 things more secure. It would simplify the rendezvous by insisting on
 well known locations in $GIT_DIR.

 We need some sort of interactive communication to the daemon anyway,
 to validate that the information is uptodate. Assume that a user makes
 some changes to his worktree before starting the daemon, git needs to
 know that what the daemon provides does not represent a complete
 file-change picture and it better refreshes the index the old way
 once, then trust the daemon.

 I think we could solve that by storing a session id, provided by the
 daemon, in .git/index. If the session id is not present (or does not
 match what the current daemon gives), refresh the old way. After
 refreshing, it may ask the daemon for new session id and store it.
 Next time if the session id is still valid, trust the daemon's data.
 This session id should be different every time the daemon restarts for
 this to work.

I think we could do this without interactive communication,
if we did the following:
   1) The Daemon waits to see $GIT_DIR/lstat_request, and atomically
   writes out $GIT_DIR/lstat_cache.  By atomically I mean that it writes
   things out to a temporary file, and then does a rename.

   2) The client erases $GIT_DIR/lstat_cache, and writes
  $GIT_DIR/lstat_request

I think this is better than socket based communication because there
are fewer places to check
for failures.

Robert
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-10 Thread Duy Nguyen
On Sun, Feb 10, 2013 at 12:24:58PM +0700, Duy Nguyen wrote:
 On Sun, Feb 10, 2013 at 12:10 AM, Ramkumar Ramachandra
 artag...@gmail.com wrote:
  Finn notes in the commit message that it offers no speedup, because
  .gitignore files in every directory still have to be read.  I think
  this is silly: we really should be caching .gitignore, and touching it
  only when lstat() reports that the file has changed.
 
  ...
 
  Really, the elephant in the room right now seems to be .gitignore.
  Until that is fixed, there is really no use of writing this inotify
  daemon, no?  Can someone enlighten me on how exactly .gitignore files
  are processed?

 .gitignore is a different issue. I think it's mainly used with
 read_directory/fill_directory to collect ignored files (or not-ignored
 files). And it's not always used (well, status and add does, but diff
 should not). I think wee need to measure how much mass lstat
 elimination gains us (especially on big repos) and how much
 .gitignore/.gitattributes caching does.

OK let's count. I start with a standard repository, linux-2.6. This
is the number from strace -T on git status (*). The first column is
accumulated time, the second the number of syscalls.

top syscalls sorted top syscalls sorted
by acc. timeby number
--
0.401906 40950 lstat0.401906 40950 lstat
0.190484 5343 getdents  0.150055 5374 open
0.150055 5374 open  0.190484 5343 getdents
0.074843 2806 close 0.074843 2806 close
0.003216 157 read   0.003216 157 read

The following patch pretends every entry is uptodate without
lstat. With the patch, we can see refresh code is the cause of mass
lstat, as lstat disappears:

0.185347 5343 getdents  0.144173 5374 open
0.144173 5374 open  0.185347 5343 getdents
0.071844 2806 close 0.071844 2806 close
0.004918 135 brk0.003378 157 read
0.003378 157 read   0.004918 135 brk

-- 8 --
diff --git a/read-cache.c b/read-cache.c
index 827ae55..94d8ed8 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1018,6 +1018,10 @@ static struct cache_entry *refresh_cache_ent(struct 
index_state *istate,
if (ce_uptodate(ce))
return ce;

+#if 1
+   ce_mark_uptodate(ce);
+   return ce;
+#endif
/*
 * CE_VALID or CE_SKIP_WORKTREE means the user promised us
 * that the change to the work tree does not matter and told
-- 8 --

The following patch eliminates untracked search code. As we can see,
open+getdents also disappears with this patch:

0.462909 40950 lstat   0.462909 40950 lstat
0.003417 129 brk   0.003417 129 brk
0.000762 53 read   0.000762 53 read
0.000720 36 open   0.000720 36 open
0.000544 12 munmap 0.000454 33 close

So from syscalls point of view, we know what code issues most of
them. Let's see how much time we gain be these patches, which is an
approximate of the gain by inotify support. This time I measure on
gentoo-x86.git [1] because this one has really big worktree (100k
files)

unmodified  read-cache.c  dir.c both
real0m0.550s0m0.479s  0m0.287s  0m0.213s
user0m0.305s0m0.315s  0m0.201s  0m0.182s
sys 0m0.240s0m0.157s  0m0.084s  0m0.030s

and the syscall picture on gentoo-x86.git:

1.106615 101942 lstat1.106615 101942 lstat
0.667235 47083 getdents  0.641604 47114 open
0.641604 47114 open  0.667235 47083 getdents
0.286711 23573 close 0.286711 23573 close
0.005842 350 brk 0.005842 350 brk

We can see that shortcuting untracked code gives bigger gain than
index refresh code. So I have to agree that .gitignore may be the big
elephant in this particular case.

Bear in mind though this is Linux, where lstat is fast. On systems
with slow lstat, these timings could look very different due to the
large number of lstat calls compared to open+getdents. I really like
to see similar numbers on Windows.

read_directory/fill_directory code is mostly used by git add (not
with -u) and git status, while refresh code is executed in add,
checkout, commit/status, diff, merge. So while smaller gain, reducing
lstat calls could benefit in more cases.

A relatively slow git add is acceptable. git status should be
fast. Although in my workflow, I do git diff [--stat] [--cached]
much more often than git status so relatively slow git status does
not hurt me much. But people may do it differently.

On speeding up read_directory with inotify support. I haven't thought
it through, but I think we could save (or get it via socket) a list of
untracked files in .git, regardless ignore status, with the help from
inotify. When this list is verified valid, read_directory could be
modified to traverse the tree using this list (plus the index) instead
of opendir+readdir. Not sure how the change might look though.


[1] http://git-exp.overlays.gentoo.org/gitweb/?p=exp/gentoo-x86.git;a=summary

(*) the script to produce those numbers is

-- 8 --
#!/bin/sh

export LANG=C
strace -T $@ 21 

Re: inotify to minimize stat() calls

2013-02-10 Thread Duy Nguyen
On Sun, Feb 10, 2013 at 06:17:32PM +0700, Duy Nguyen wrote:
 The following patch eliminates untracked search code. As we can see,
 open+getdents also disappears with this patch:
 
 0.462909 40950 lstat   0.462909 40950 lstat
 0.003417 129 brk   0.003417 129 brk
 0.000762 53 read   0.000762 53 read
 0.000720 36 open   0.000720 36 open
 0.000544 12 munmap 0.000454 33 close

.. and the patch is missing:

-- 8 --
diff --git a/dir.c b/dir.c
index 57394e4..1963c6f 100644
--- a/dir.c
+++ b/dir.c
@@ -1439,8 +1439,10 @@ int read_directory(struct dir_struct *dir, const char 
*path, int len, const char
return dir-nr;
 
simplify = create_simplify(pathspec);
+#if 0
if (!len || treat_leading_path(dir, path, len, simplify))
read_directory_recursive(dir, path, len, 0, simplify);
+#endif
free_simplify(simplify);
qsort(dir-entries, dir-nr, sizeof(struct dir_entry *), cmp_name);
qsort(dir-ignored, dir-ignored_nr, sizeof(struct dir_entry *), 
cmp_name);
-- 8 --
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-10 Thread demerphq
On 10 February 2013 12:17, Duy Nguyen pclo...@gmail.com wrote:
 Bear in mind though this is Linux, where lstat is fast. On systems
 with slow lstat, these timings could look very different due to the
 large number of lstat calls compared to open+getdents. I really like
 to see similar numbers on Windows.

Is windows stat really so slow? I encountered this perception in
windows Perl in the past, and I know that on windows Perl stat
*appears* slow compared to *nix, because in order to satisfy the full
*nix stat interface, specifically the nlink field, it must open and
close the file*. As of 5.10 this can be disabled by setting a magic
var ${^WIN32_SLOPPY_STAT} to a true value, which makes a significant
improvement to the performance of the Perl level stat implementation.
I would not be surprised if the cygwin implementation of stat() has
the same issue as Perl did, and that stat appears much slower than it
actually need be if you don't care about the nlink field.

Yves
* http://perl5.git.perl.org/perl.git/blob/HEAD:/win32/win32.c#l1492

-- 
perl -Mre=debug -e /just|another|perl|hacker/
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-10 Thread Duy Nguyen
On Sun, Feb 10, 2013 at 8:26 PM, demerphq demer...@gmail.com wrote:
 On 10 February 2013 12:17, Duy Nguyen pclo...@gmail.com wrote:
 Bear in mind though this is Linux, where lstat is fast. On systems
 with slow lstat, these timings could look very different due to the
 large number of lstat calls compared to open+getdents. I really like
 to see similar numbers on Windows.

 Is windows stat really so slow?

I can't say. I haven't used Windows for months (and git on Windows for years)..

 I encountered this perception in
 windows Perl in the past, and I know that on windows Perl stat
 *appears* slow compared to *nix, because in order to satisfy the full
 *nix stat interface, specifically the nlink field, it must open and
 close the file*. As of 5.10 this can be disabled by setting a magic
 var ${^WIN32_SLOPPY_STAT} to a true value, which makes a significant
 improvement to the performance of the Perl level stat implementation.
 I would not be surprised if the cygwin implementation of stat() has
 the same issue as Perl did, and that stat appears much slower than it
 actually need be if you don't care about the nlink field.

The native port of git uses get_file_attr (in
compat/mingw.c:do_lstat()) to simulate lstat and always sets nlink to
1. I assume this means git does not care about nlink field. I don't
know about cygwin though.

 Yves
 * http://perl5.git.perl.org/perl.git/blob/HEAD:/win32/win32.c#l1492
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-10 Thread Ramkumar Ramachandra
On Sun, Feb 10, 2013 at 4:47 PM, Duy Nguyen pclo...@gmail.com wrote:
 On Sun, Feb 10, 2013 at 12:24:58PM +0700, Duy Nguyen wrote:
 On Sun, Feb 10, 2013 at 12:10 AM, Ramkumar Ramachandra
 artag...@gmail.com wrote:
  Finn notes in the commit message that it offers no speedup, because
  .gitignore files in every directory still have to be read.  I think
  this is silly: we really should be caching .gitignore, and touching it
  only when lstat() reports that the file has changed.
 
  ...
 
  Really, the elephant in the room right now seems to be .gitignore.
  Until that is fixed, there is really no use of writing this inotify
  daemon, no?  Can someone enlighten me on how exactly .gitignore files
  are processed?

 .gitignore is a different issue. I think it's mainly used with
 read_directory/fill_directory to collect ignored files (or not-ignored
 files). And it's not always used (well, status and add does, but diff
 should not). I think wee need to measure how much mass lstat
 elimination gains us (especially on big repos) and how much
 .gitignore/.gitattributes caching does.

 OK let's count. I start with a standard repository, linux-2.6. This
 is the number from strace -T on git status (*). The first column is
 accumulated time, the second the number of syscalls.

 top syscalls sorted top syscalls sorted
 by acc. timeby number
 --
 0.401906 40950 lstat0.401906 40950 lstat
 0.190484 5343 getdents  0.150055 5374 open
 0.150055 5374 open  0.190484 5343 getdents
 0.074843 2806 close 0.074843 2806 close
 0.003216 157 read   0.003216 157 read

 The following patch pretends every entry is uptodate without
 lstat. With the patch, we can see refresh code is the cause of mass
 lstat, as lstat disappears:

 0.185347 5343 getdents  0.144173 5374 open
 0.144173 5374 open  0.185347 5343 getdents
 0.071844 2806 close 0.071844 2806 close
 0.004918 135 brk0.003378 157 read
 0.003378 157 read   0.004918 135 brk

Okay, we're saving 40k lstat() calls.

 -- 8 --
 diff --git a/read-cache.c b/read-cache.c
 index 827ae55..94d8ed8 100644
 --- a/read-cache.c
 +++ b/read-cache.c
 @@ -1018,6 +1018,10 @@ static struct cache_entry *refresh_cache_ent(struct 
 index_state *istate,
 if (ce_uptodate(ce))
 return ce;

 +#if 1
 +   ce_mark_uptodate(ce);
 +   return ce;
 +#endif
 /*
  * CE_VALID or CE_SKIP_WORKTREE means the user promised us
  * that the change to the work tree does not matter and told
 -- 8 --

So you're skipping the rest of refresh_cache_ent(), which contains our
lstat() and returning immediately.  Instead of marking paths with the
assume unchanged bit, as core.ignoreStat does, you're directly
attacking the function that refreshes the index and bypassing the
lstat() call.  How are they different?  read-cache.c:1030 checks
ce-flags  CE_VALID (which is set in read-cache.c:88 if
assume_unchanged) and bypasses the lstat() call anyway.  So why didn't
you just set core.ignoreStat for your test?

 The following patch eliminates untracked search code. As we can see,
 open+getdents also disappears with this patch:

 0.462909 40950 lstat   0.462909 40950 lstat
 0.003417 129 brk   0.003417 129 brk
 0.000762 53 read   0.000762 53 read
 0.000720 36 open   0.000720 36 open
 0.000544 12 munmap 0.000454 33 close

Okay, 5k open and 5k getdents calls are gone, but what does this mean?

Pulling the patch from your next email to figure this out:
 -- 8 --
 diff --git a/dir.c b/dir.c
 index 57394e4..1963c6f 100644
 --- a/dir.c
 +++ b/dir.c
 @@ -1439,8 +1439,10 @@ int read_directory(struct dir_struct *dir, const char 
 *path, int len, const char
 return dir-nr;

 simplify = create_simplify(pathspec);
 +#if 0
 if (!len || treat_leading_path(dir, path, len, simplify))
 read_directory_recursive(dir, path, len, 0, simplify);
 +#endif
 free_simplify(simplify);
 qsort(dir-entries, dir-nr, sizeof(struct dir_entry *), cmp_name);
 qsort(dir-ignored, dir-ignored_nr, sizeof(struct dir_entry *), 
 cmp_name);
 -- 8 --

Ah, read_directory(), from the .gitignore/ exclude angle.  Yes,
read_directory() seems to be the main culprit there, from my reading
of Documentation/technical/api-directory-listing.txt.

So, what did you do?  You short-circuited the function into never
executing read_directory_recursive(), so the opendir() and readdir()
are gone.  I'm confused about what this means: will new directories
fail to appear as untracked now?  Either way, I understand that
you've factored out the .gitignore/ excludes.  Let's look at the
timings now.

 So from syscalls point of view, we know what code issues most of
 them. Let's see how much time we gain be these patches, which is an
 approximate of the gain by inotify support. This time I measure on
 gentoo-x86.git [1] because this one has really big worktree (100k
 files)

   

Re: inotify to minimize stat() calls

2013-02-10 Thread Erik Faye-Lund
On Sun, Feb 10, 2013 at 12:17 PM, Duy Nguyen pclo...@gmail.com wrote:
 On Sun, Feb 10, 2013 at 12:24:58PM +0700, Duy Nguyen wrote:
 On Sun, Feb 10, 2013 at 12:10 AM, Ramkumar Ramachandra
 artag...@gmail.com wrote:
  Finn notes in the commit message that it offers no speedup, because
  .gitignore files in every directory still have to be read.  I think
  this is silly: we really should be caching .gitignore, and touching it
  only when lstat() reports that the file has changed.
 
  ...
 
  Really, the elephant in the room right now seems to be .gitignore.
  Until that is fixed, there is really no use of writing this inotify
  daemon, no?  Can someone enlighten me on how exactly .gitignore files
  are processed?

 .gitignore is a different issue. I think it's mainly used with
 read_directory/fill_directory to collect ignored files (or not-ignored
 files). And it's not always used (well, status and add does, but diff
 should not). I think wee need to measure how much mass lstat
 elimination gains us (especially on big repos) and how much
 .gitignore/.gitattributes caching does.

 OK let's count. I start with a standard repository, linux-2.6. This
 is the number from strace -T on git status (*). The first column is
 accumulated time, the second the number of syscalls.

 top syscalls sorted top syscalls sorted
 by acc. timeby number
 --
 0.401906 40950 lstat0.401906 40950 lstat
 0.190484 5343 getdents  0.150055 5374 open
 0.150055 5374 open  0.190484 5343 getdents
 0.074843 2806 close 0.074843 2806 close
 0.003216 157 read   0.003216 157 read

 The following patch pretends every entry is uptodate without
 lstat. With the patch, we can see refresh code is the cause of mass
 lstat, as lstat disappears:

 0.185347 5343 getdents  0.144173 5374 open
 0.144173 5374 open  0.185347 5343 getdents
 0.071844 2806 close 0.071844 2806 close
 0.004918 135 brk0.003378 157 read
 0.003378 157 read   0.004918 135 brk

 -- 8 --
 diff --git a/read-cache.c b/read-cache.c
 index 827ae55..94d8ed8 100644
 --- a/read-cache.c
 +++ b/read-cache.c
 @@ -1018,6 +1018,10 @@ static struct cache_entry *refresh_cache_ent(struct 
 index_state *istate,
 if (ce_uptodate(ce))
 return ce;

 +#if 1
 +   ce_mark_uptodate(ce);
 +   return ce;
 +#endif
 /*
  * CE_VALID or CE_SKIP_WORKTREE means the user promised us
  * that the change to the work tree does not matter and told
 -- 8 --

 The following patch eliminates untracked search code. As we can see,
 open+getdents also disappears with this patch:

 0.462909 40950 lstat   0.462909 40950 lstat
 0.003417 129 brk   0.003417 129 brk
 0.000762 53 read   0.000762 53 read
 0.000720 36 open   0.000720 36 open
 0.000544 12 munmap 0.000454 33 close

 So from syscalls point of view, we know what code issues most of
 them. Let's see how much time we gain be these patches, which is an
 approximate of the gain by inotify support. This time I measure on
 gentoo-x86.git [1] because this one has really big worktree (100k
 files)

 unmodified  read-cache.c  dir.c both
 real0m0.550s0m0.479s  0m0.287s  0m0.213s
 user0m0.305s0m0.315s  0m0.201s  0m0.182s
 sys 0m0.240s0m0.157s  0m0.084s  0m0.030s

 and the syscall picture on gentoo-x86.git:

 1.106615 101942 lstat1.106615 101942 lstat
 0.667235 47083 getdents  0.641604 47114 open
 0.641604 47114 open  0.667235 47083 getdents
 0.286711 23573 close 0.286711 23573 close
 0.005842 350 brk 0.005842 350 brk

 We can see that shortcuting untracked code gives bigger gain than
 index refresh code. So I have to agree that .gitignore may be the big
 elephant in this particular case.

 Bear in mind though this is Linux, where lstat is fast. On systems
 with slow lstat, these timings could look very different due to the
 large number of lstat calls compared to open+getdents. I really like
 to see similar numbers on Windows.

Karsten Blees has done something similar-ish on Windows, and he posted
the results here:

https://groups.google.com/forum/#!topic/msysgit/fL_jykUmUNE/discussion

I also seem to remember he doing a ReadDirectoryChangesW version, but
I don't remember what happened with that.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-10 Thread Robert Zeh
On Sat, Feb 9, 2013 at 1:35 PM, Junio C Hamano gits...@pobox.com wrote:
 Ramkumar Ramachandra artag...@gmail.com writes:

 This is much better than Junio's suggestion to study possible
 implementations on all platforms and designing a generic daemon/
 communication channel.  That's no weekend project.

 It appears that you misunderstood what I wrote.  That was not here
 is a design; I want it in my system.  Go implemment it.

 It was If somebody wants to discuss it but does not know where to
 begin, doing a small experiment like this and reporting how well it
 worked here may be one way to do so., nothing more.

What if instead of communicating over a socket, the daemon
dumped a file containing all of the lstat information after git
wrote a file? By definition the daemon should know about file writes.

There would be no network communication, which I think would make
things more secure. It would simplify the rendezvous by insisting on
well known locations in $GIT_DIR.

Robert Zeh
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-10 Thread Martin Fick
On Sunday, February 10, 2013 12:03:00 pm Robert Zeh wrote:
 On Sat, Feb 9, 2013 at 1:35 PM, Junio C Hamano 
gits...@pobox.com wrote:
  Ramkumar Ramachandra artag...@gmail.com writes:
  This is much better than Junio's suggestion to study
  possible implementations on all platforms and
  designing a generic daemon/ communication channel. 
  That's no weekend project.
  
  It appears that you misunderstood what I wrote.  That
  was not here is a design; I want it in my system.  Go
  implemment it.
  
  It was If somebody wants to discuss it but does not
  know where to begin, doing a small experiment like
  this and reporting how well it worked here may be one
  way to do so., nothing more.
 
 What if instead of communicating over a socket, the
 daemon dumped a file containing all of the lstat
 information after git wrote a file? By definition the
 daemon should know about file writes.

But git doesn't, how will it know when the file is written?
Will it use inotify, or poll (kind of defeats the point)?

-Martin
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-10 Thread Junio C Hamano
Duy Nguyen pclo...@gmail.com writes:

 On Sun, Feb 10, 2013 at 06:17:32PM +0700, Duy Nguyen wrote:
 The following patch eliminates untracked search code. As we can see,
 open+getdents also disappears with this patch:
 
 0.462909 40950 lstat   0.462909 40950 lstat
 0.003417 129 brk   0.003417 129 brk
 0.000762 53 read   0.000762 53 read
 0.000720 36 open   0.000720 36 open
 0.000544 12 munmap 0.000454 33 close

 .. and the patch is missing:

 -- 8 --
 diff --git a/dir.c b/dir.c
 index 57394e4..1963c6f 100644
 --- a/dir.c
 +++ b/dir.c
 @@ -1439,8 +1439,10 @@ int read_directory(struct dir_struct *dir, const char 
 *path, int len, const char
   return dir-nr;
  
   simplify = create_simplify(pathspec);
 +#if 0
   if (!len || treat_leading_path(dir, path, len, simplify))
   read_directory_recursive(dir, path, len, 0, simplify);
 +#endif

The other lstat() experiment was a very interesting one, but this
is not yet an interesting experiment to see where in the ignore
codepath we are spending times.

We know that we can tell wt_status_collect_untracked() not to bother
with the untracked or ignored files with !s-show_untracked_files
already, but I think the more interesting question is if we can show
the untracked files with less overhead.

If we want to show untrackedd files, it is a given that we need to
read directories to see what paths there are on the filesystem. Is
the opendir/readdir cost dominating in the process? Are we spending
a lot of time sifting the result of opendir/readdir via the ignore
mechanism? Is reading the ignore files costing us much to prime
the ignore mechanism?

If readdir cost is dominant, then that makes cache gitignore a
nonsense proposition, I think.  If you really want to cache
something, you need to have somebody (i.e. a daemon) who constantly
keeps an eye on the filesystem changes and can respond with the up
to date result directly to fill_directory().  I somehow doubt that
it is a direction we would want to go in, though.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-10 Thread Robert Zeh


On Feb 10, 2013, at 1:26 PM, Martin Fick mf...@codeaurora.org wrote:

 On Sunday, February 10, 2013 12:03:00 pm Robert Zeh wrote:
 On Sat, Feb 9, 2013 at 1:35 PM, Junio C Hamano
 gits...@pobox.com wrote:
 Ramkumar Ramachandra artag...@gmail.com writes:
 This is much better than Junio's suggestion to study
 possible implementations on all platforms and
 designing a generic daemon/ communication channel. 
 That's no weekend project.
 
 It appears that you misunderstood what I wrote.  That
 was not here is a design; I want it in my system.  Go
 implemment it.
 
 It was If somebody wants to discuss it but does not
 know where to begin, doing a small experiment like
 this and reporting how well it worked here may be one
 way to do so., nothing more.
 
 What if instead of communicating over a socket, the
 daemon dumped a file containing all of the lstat
 information after git wrote a file? By definition the
 daemon should know about file writes.
 
 But git doesn't, how will it know when the file is written?
 Will it use inotify, or poll (kind of defeats the point)?
 
 -Martin

I was thinking it would loop on calls to stat for the file with a timeout; this 
is no different than what we would want to do over a socket in that we would 
need timeouts for network reads.  But we would only be calling stat on one 
file, instead of the entire repo. 

I think we can set things up so the file read is atomic, which means we can 
ignore the case of a daemon crashing midway through a conversation. 

Robert

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-10 Thread Duy Nguyen
On Mon, Feb 11, 2013 at 3:16 AM, Junio C Hamano gits...@pobox.com wrote:
 The other lstat() experiment was a very interesting one, but this
 is not yet an interesting experiment to see where in the ignore
 codepath we are spending times.

 We know that we can tell wt_status_collect_untracked() not to bother
 with the untracked or ignored files with !s-show_untracked_files
 already, but I think the more interesting question is if we can show
 the untracked files with less overhead.

 If we want to show untrackedd files, it is a given that we need to
 read directories to see what paths there are on the filesystem. Is
 the opendir/readdir cost dominating in the process? Are we spending
 a lot of time sifting the result of opendir/readdir via the ignore
 mechanism? Is reading the ignore files costing us much to prime
 the ignore mechanism?

 If readdir cost is dominant, then that makes cache gitignore a
 nonsense proposition, I think.  If you really want to cache
 something, you need to have somebody (i.e. a daemon) who constantly
 keeps an eye on the filesystem changes and can respond with the up
 to date result directly to fill_directory().  I somehow doubt that
 it is a direction we would want to go in, though.

Yeah, it did not cut out syscall cost, I also cut a lot of user-space
processing (plus .gitignore content access). From the timings I posted
earlier,

 unmodified  dir.c
 real0m0.550s0m0.287s
 user0m0.305s0m0.201s
 sys 0m0.240s0m0.084s

sys time is reduced from 0.24s to 0.08s, so readdir+opendir definitely
has something to do with it (and perhaps reading .gitignore). But it
also reduces user time from 0.305 to 0.201s. I don't think avoiding
readdir+openddir will bring us this gain. It's probably the cost of
matching .gitignore. I'll try to replace opendir+readdir with a
no-syscall version. At this point untracked caching sounds more
feasible (and less complex) than .gitignore cachine.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-10 Thread Duy Nguyen
On Sun, Feb 10, 2013 at 11:45 PM, Ramkumar Ramachandra
artag...@gmail.com wrote:
 So you're skipping the rest of refresh_cache_ent(), which contains our
 lstat() and returning immediately.  Instead of marking paths with the
 assume unchanged bit, as core.ignoreStat does, you're directly
 attacking the function that refreshes the index and bypassing the
 lstat() call.  How are they different?  read-cache.c:1030 checks
 ce-flags  CE_VALID (which is set in read-cache.c:88 if
 assume_unchanged) and bypasses the lstat() call anyway.  So why didn't
 you just set core.ignoreStat for your test?

It just did not occur to me that core.ignoreStat does the same.

 Ah, read_directory(), from the .gitignore/ exclude angle.  Yes,
 read_directory() seems to be the main culprit there, from my reading
 of Documentation/technical/api-directory-listing.txt.

 So, what did you do?  You short-circuited the function into never
 executing read_directory_recursive(), so the opendir() and readdir()
 are gone.  I'm confused about what this means: will new directories
 fail to appear as untracked now?

No, read_directory returns the list of untracked/ignored files.
Returning empty lists means no untracked nor ignored files.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-10 Thread Duy Nguyen
On Mon, Feb 11, 2013 at 2:03 AM, Robert Zeh robert.allan@gmail.com wrote:
 On Sat, Feb 9, 2013 at 1:35 PM, Junio C Hamano gits...@pobox.com wrote:
 Ramkumar Ramachandra artag...@gmail.com writes:

 This is much better than Junio's suggestion to study possible
 implementations on all platforms and designing a generic daemon/
 communication channel.  That's no weekend project.

 It appears that you misunderstood what I wrote.  That was not here
 is a design; I want it in my system.  Go implemment it.

 It was If somebody wants to discuss it but does not know where to
 begin, doing a small experiment like this and reporting how well it
 worked here may be one way to do so., nothing more.

 What if instead of communicating over a socket, the daemon
 dumped a file containing all of the lstat information after git
 wrote a file? By definition the daemon should know about file writes.

 There would be no network communication, which I think would make
 things more secure. It would simplify the rendezvous by insisting on
 well known locations in $GIT_DIR.

We need some sort of interactive communication to the daemon anyway,
to validate that the information is uptodate. Assume that a user makes
some changes to his worktree before starting the daemon, git needs to
know that what the daemon provides does not represent a complete
file-change picture and it better refreshes the index the old way
once, then trust the daemon.

I think we could solve that by storing a session id, provided by the
daemon, in .git/index. If the session id is not present (or does not
match what the current daemon gives), refresh the old way. After
refreshing, it may ask the daemon for new session id and store it.
Next time if the session id is still valid, trust the daemon's data.
This session id should be different every time the daemon restarts for
this to work.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-10 Thread Duy Nguyen
On Sun, Feb 10, 2013 at 11:58 PM, Erik Faye-Lund kusmab...@gmail.com wrote:
 Karsten Blees has done something similar-ish on Windows, and he posted
 the results here:

 https://groups.google.com/forum/#!topic/msysgit/fL_jykUmUNE/discussion

 I also seem to remember he doing a ReadDirectoryChangesW version, but
 I don't remember what happened with that.

Thanks. I came across that but did not remember. For one thing, we
know the inotify alternative for Windows: ReadDirectoryChangesW.

But the meat of the patch is not about that function. In fact it's
dropped in fscache-v3 [1]. It seems that doing
FindFirstFile/FindNextFile for an entire directory, cache the results
and use it to simulate lstat() is faster on Windows. Sounds similar to
preload-index. And because directory listing is cached anyway,
opendir/readdir is replaced to read from cache instead of opening the
directory again.

So it is orthogonal with using ReadDirectoryChangesW/inotify to
further reduce the system calls.

I copy git status's (impressive) numbers from fscache-v0 for those
who are interested in:

preload | -u  | normal | cached | gain
+-+++--
false   | all | 25.144 | 3.055  |  8.2
false   | no  | 22.822 | 1.748  | 12.8
true| all |  9.234 | 2.179  |  4.2
true| no  |  6.833 | 0.955  |  7.2

[1] 
https://github.com/kblees/git/commit/35f319609aa046d2350db32d3afa1fa44920e880
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-09 Thread Duy Nguyen
On Sat, Feb 9, 2013 at 7:53 PM, Ramkumar Ramachandra artag...@gmail.com wrote:
 Ramkumar Ramachandra wrote:
 What about getting systemd to watch everything for us?

 systemd is the perfect candidate!

How about this as a start? I did not really check what it does, but it
does not look complicate enough to pull systemd in.

http://article.gmane.org/gmane.comp.version-control.git/151934

Youo may want to search the mail archive. This topic has come up a few
times before, there may be other similar patches.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-09 Thread Ramkumar Ramachandra
Duy Nguyen wrote:
 How about this as a start? I did not really check what it does, but it
 does not look complicate enough to pull systemd in.

 http://article.gmane.org/gmane.comp.version-control.git/151934

Clever hack.  I didn't know that there was a switch called
core.ignoreStat which will disable automatic lstat() calls altogether.
 So, Finn advises that we set this switch and run igit instead of git.
 There's a git-inotify-daemon which runs inotifywait with -m forever,
updating a modified_files hash.  When it is sent a TERM from igit
(which is what happens immediately upon execution), it writes all this
collected information about modified files to a named pipe that igit
passes to it.  igit then does a git update-index --assume-unchained
--stdin to read the data from the pipe.  Towards the end of its life,
igit starts up a fresh git-inotify-daemon for future invocations.

Finn notes in the commit message that it offers no speedup, because
.gitignore files in every directory still have to be read.  I think
this is silly: we really should be caching .gitignore, and touching it
only when lstat() reports that the file has changed.

As far as a real implementation that we'd want to merge into git.git
is concerned, I have a few comments:
Running multiple daemons on-the-fly for monitoring filesystem changes
is not elegant at all.  Keeping track of the state of so many loose
daemons is a hard problem: how do we ensure any semblance of
reliability without that?  Systemd is a very big improvement over the
legacy of a hundred loose shell scripts that SysVInit demanded.  It
monitors and babysits daemons; it uses cgroups to even kill
misbehaving daemons.  I can inspect running daemons at any time, and
have a uniform way to start/ stop/ restart them.

Okay, now you're asking me to consider a system-wide daemon
independent of systemd.  It has to run with root privileges so it has
access to everyone's repositories, which means that people have to
trust it beyond doubt.  What does it do?  It has a generic API to
watch filesystem paths and report events over an IP socket.  Do you
think that this will only be useful to git?  Every other version
control system (and presumably many other pieces of software) will
want to use it.  One huge downside I see of making this part of
systemd is Ubuntu.  They've decided not to use systemd for some
unfathomable reason.

Really, the elephant in the room right now seems to be .gitignore.
Until that is fixed, there is really no use of writing this inotify
daemon, no?  Can someone enlighten me on how exactly .gitignore files
are processed?

 Youo may want to search the mail archive. This topic has come up a few
 times before, there may be other similar patches.

The thread you linked me to is a 2010 email, and now it's 2013.  We've
been silent about inotify for three years?

Thanks for your inputs, Duy.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-09 Thread Junio C Hamano
Ramkumar Ramachandra artag...@gmail.com writes:

 This is much better than Junio's suggestion to study possible
 implementations on all platforms and designing a generic daemon/
 communication channel.  That's no weekend project.

It appears that you misunderstood what I wrote.  That was not here
is a design; I want it in my system.  Go implemment it.

It was If somebody wants to discuss it but does not know where to
begin, doing a small experiment like this and reporting how well it
worked here may be one way to do so., nothing more.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-09 Thread Duy Nguyen
On Sun, Feb 10, 2013 at 12:10 AM, Ramkumar Ramachandra
artag...@gmail.com wrote:
 Finn notes in the commit message that it offers no speedup, because
 .gitignore files in every directory still have to be read.  I think
 this is silly: we really should be caching .gitignore, and touching it
 only when lstat() reports that the file has changed.

 ...

 Really, the elephant in the room right now seems to be .gitignore.
 Until that is fixed, there is really no use of writing this inotify
 daemon, no?  Can someone enlighten me on how exactly .gitignore files
 are processed?

.gitignore is a different issue. I think it's mainly used with
read_directory/fill_directory to collect ignored files (or not-ignored
files). And it's not always used (well, status and add does, but diff
should not). I think wee need to measure how much mass lstat
elimination gains us (especially on big repos) and how much
.gitignore/.gitattributes caching does. I don't think .gitignore has
such a big impact though. strace on git.git tells me git status
issues about 2500 lstat calls, and just 740 open+getdents calls (on
total 3800 syscalls). I will think if we can do something about
.gitignore/.gitattributes.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-08 Thread Junio C Hamano
Ramkumar Ramachandra artag...@gmail.com writes:

 ...  Will Git ever
 consider using inotify on Linux?  What is the downside?

I think this has come up from time to time, but my understanding is
that nobody thought things through to find a good layer in the
codebase to interface to an external daemon that listens to inotify
events yet.  It is not something like somebody decreed that we
would never consider because of such and such downsides.  We are
not there yet.




--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-08 Thread Junio C Hamano
Junio C Hamano gits...@pobox.com writes:

 Ramkumar Ramachandra artag...@gmail.com writes:

 ...  Will Git ever
 consider using inotify on Linux?  What is the downside?

 I think this has come up from time to time, but my understanding is
 that nobody thought things through to find a good layer in the
 codebase to interface to an external daemon that listens to inotify
 events yet.  It is not something like somebody decreed that we
 would never consider because of such and such downsides.  We are
 not there yet.

I checked read-cache.c and preload-index.c code.  To get the
discussion rolling, I think something like the outline below may be
a good starting point and a feasible weekend hack for somebody
competent:

 * At the beginning of preload_index(), instead of spawning the
   worker thread and doing the lstat() check ourselves, we open a
   socket to our daemon (see below) that watches this repository and
   make a request for lstat update.  The request will contain:

- The SHA1 checksum of the index file we just read (to ensure
  that we and our daemon share the same baseline to
  communicate); and

- the pathspec data.

   Our daemon, if it already has a fresh data available, will give
   us a list of path, lstat result.  Our main process runs a loop
   that is equivalent to what preload_thread() runs but uses the
   lstat() data we obtained from the daemon.  If our daemon says it
   does not have a fresh data (or somehow our daemon is dead), we do
   the work ourselves.

 * Our daemon watches the index file and the working tree, and
   waits for the above consumer.  First it reads the index (and
   remembers what it read), and whenever an inotify event comes,
   does the lstat() and remembers the result.  It never writes
   to the index, and does not hold the index lock.  Whenever the
   index file changes, it needs to reload the index, and discard
   lstat() data it already has for paths that are lost from the
   updated index.

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-08 Thread Duy Nguyen
On Sat, Feb 9, 2013 at 5:45 AM, Junio C Hamano gits...@pobox.com wrote:
 Junio C Hamano gits...@pobox.com writes:

 Ramkumar Ramachandra artag...@gmail.com writes:

 ...  Will Git ever
 consider using inotify on Linux?  What is the downside?

 I think this has come up from time to time, but my understanding is
 that nobody thought things through to find a good layer in the
 codebase to interface to an external daemon that listens to inotify
 events yet.  It is not something like somebody decreed that we
 would never consider because of such and such downsides.  We are
 not there yet.

 I checked read-cache.c and preload-index.c code.  To get the
 discussion rolling, I think something like the outline below may be
 a good starting point and a feasible weekend hack for somebody
 competent:

  * At the beginning of preload_index(), instead of spawning the
worker thread and doing the lstat() check ourselves, we open a
socket to our daemon (see below) that watches this repository and

Can we replace open a socket to our daemon with open a special file
in .git to get stat data written by our daemon? TCP/IP socket means
system-wide daemon, not attractive. UNIX socket is not available on
Windows (although there may be named pipe, I don't know).

make a request for lstat update.  The request will contain:

 - The SHA1 checksum of the index file we just read (to ensure
   that we and our daemon share the same baseline to
   communicate); and

 - the pathspec data.

Our daemon, if it already has a fresh data available, will give
us a list of path, lstat result.  Our main process runs a loop
that is equivalent to what preload_thread() runs but uses the
lstat() data we obtained from the daemon.  If our daemon says it
does not have a fresh data (or somehow our daemon is dead), we do
the work ourselves.

  * Our daemon watches the index file and the working tree, and
waits for the above consumer.  First it reads the index (and
remembers what it read), and whenever an inotify event comes,
does the lstat() and remembers the result.  It never writes
to the index, and does not hold the index lock.  Whenever the
index file changes, it needs to reload the index, and discard
lstat() data it already has for paths that are lost from the
updated index.


-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-08 Thread Junio C Hamano
Duy Nguyen pclo...@gmail.com writes:

 Can we replace open a socket to our daemon with open a special file
 in .git to get stat data written by our daemon? TCP/IP socket means
 system-wide daemon, not attractive. UNIX socket is not available on
 Windows (although there may be named pipe, I don't know).

I do not think TCP/IP socket is too bad (you have to be able to read
the index file to be able to ask questions to the daemon to begin
with, so you must have list of paths already; the answer from the
daemon would not leak anything more sensitive than you can already
know), and UNIX domain socket is not too bad either.

Just like the implementation detail of the daemon itself may differ
on platforms (does Windows have the identical inotify interface?  I
doubt it), I expect the RPC mechanism between the daemon and the
client would be platform dependent.  So take that open a socket as
a generic way to say have these two communicate with some magic,
nothing more.

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-08 Thread Junio C Hamano
Junio C Hamano gits...@pobox.com writes:

 I checked read-cache.c and preload-index.c code.  To get the
 discussion rolling, I think something like the outline below may be
 a good starting point and a feasible weekend hack for somebody
 competent:

  * At the beginning of preload_index(), instead of spawning the
worker thread and doing the lstat() check ourselves, we open a
socket to our daemon (see below) that watches this repository and
make a request for lstat update.  The request will contain:

 - The SHA1 checksum of the index file we just read (to ensure
   that we and our daemon share the same baseline to
   communicate); and

 - the pathspec data.

Our daemon, if it already has a fresh data available, will give
us a list of path, lstat result.  Our main process runs a loop
that is equivalent to what preload_thread() runs but uses the
lstat() data we obtained from the daemon.  If our daemon says it
does not have a fresh data (or somehow our daemon is dead), we do
the work ourselves.

  * Our daemon watches the index file and the working tree, and
waits for the above consumer.  First it reads the index (and
remembers what it read), and whenever an inotify event comes,
does the lstat() and remembers the result.  It never writes
to the index, and does not hold the index lock.  Whenever the
index file changes, it needs to reload the index, and discard
lstat() data it already has for paths that are lost from the
updated index.

I left the details unsaid in thee above because I thought it was
fairly obvious from the nature of the outline, but let me spend a
few more lines to avoid confusion.

 - The way the daemon watches the changes to the working tree and
   the index may well be very platform dependent.  I said inotify
   above, but the mechanism does not have to be inotify.

 - The channel the daemon and the client communicates would also be
   system dependent.  UNIX domain socket in $GIT_DIR/ with a
   well-known name would be one possibility but it does not have to
   be the only option.

 - The data given from the daemon to the client does not have to
   include full lstat() information.  They start from the same index
   info, and the only thing preload_index() wants to know is for
   which paths it should call ce_mark_uptodate(ce), so the answer
   given by our daemon can be a list of paths.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: inotify to minimize stat() calls

2013-02-08 Thread Robert Zeh
The delay for commands like git status is much worse on Windows than Linux; for 
my workflow I would be happy with a Windows only implementation. 

From the description so far, I have some question: how does the daemon get 
started and stopped?  Is there one per repository --- this seems to be implied 
by putting the unix domain socket in $GIT_DIR. Could we automatically reject 
connections from anything other than localhost when using TCP?

Robert Zeh

On Feb 8, 2013, at 8:56 PM, Junio C Hamano gits...@pobox.com wrote:

 Junio C Hamano gits...@pobox.com writes:
 
 I checked read-cache.c and preload-index.c code.  To get the
 discussion rolling, I think something like the outline below may be
 a good starting point and a feasible weekend hack for somebody
 competent:
 
 * At the beginning of preload_index(), instead of spawning the
   worker thread and doing the lstat() check ourselves, we open a
   socket to our daemon (see below) that watches this repository and
   make a request for lstat update.  The request will contain:
 
- The SHA1 checksum of the index file we just read (to ensure
  that we and our daemon share the same baseline to
  communicate); and
 
- the pathspec data.
 
   Our daemon, if it already has a fresh data available, will give
   us a list of path, lstat result.  Our main process runs a loop
   that is equivalent to what preload_thread() runs but uses the
   lstat() data we obtained from the daemon.  If our daemon says it
   does not have a fresh data (or somehow our daemon is dead), we do
   the work ourselves.
 
 * Our daemon watches the index file and the working tree, and
   waits for the above consumer.  First it reads the index (and
   remembers what it read), and whenever an inotify event comes,
   does the lstat() and remembers the result.  It never writes
   to the index, and does not hold the index lock.  Whenever the
   index file changes, it needs to reload the index, and discard
   lstat() data it already has for paths that are lost from the
   updated index.
 
 I left the details unsaid in thee above because I thought it was
 fairly obvious from the nature of the outline, but let me spend a
 few more lines to avoid confusion.
 
 - The way the daemon watches the changes to the working tree and
   the index may well be very platform dependent.  I said inotify
   above, but the mechanism does not have to be inotify.
 
 - The channel the daemon and the client communicates would also be
   system dependent.  UNIX domain socket in $GIT_DIR/ with a
   well-known name would be one possibility but it does not have to
   be the only option.
 
 - The data given from the daemon to the client does not have to
   include full lstat() information.  They start from the same index
   info, and the only thing preload_index() wants to know is for
   which paths it should call ce_mark_uptodate(ce), so the answer
   given by our daemon can be a list of paths.
 --
 To unsubscribe from this list: send the line unsubscribe git in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html