Re: Watchman support for git

2014-05-19 Thread Duy Nguyen
On Fri, May 16, 2014 at 2:42 AM, David Turner  wrote:
>> I assume you won't change your mind about this. Which is fine to me. I
>> will still try out my approach with your libwatchman though. Just
>> curious about its performance and complexity, compared to your
>> approach.
>
> I am open-minded here. This code is really the first time I have looked
> at git's internals, and I accept that your way might be better.  If
> you're going to try the watchman version of your approach, then we do a
> direct comparison.  Let me know if there is something I can do to help
> out on that.

You already helped by publishing your patches (and letting me know
about libwatchman) so I can steal bits here and there ;-)

>> A bit off topic, but msys fork has another "fscache" in compat/win32.
>> If you could come up with a different name, maybe it'll be less
>> confusing for them after merging. But this is not a big deal, as this
>> fscache won't work on windows anyway.
>
> Does wtcache sounds like a better name?
>

Heh i'm bad at naming. But that sounds a bit cryptic. Maybe
watchman-cache.c (with funtions starting with prefix wmc_)? Should not
worry too much about this though, unless some msysgit guys yell up.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-15 Thread David Turner
On Wed, 2014-05-14 at 17:36 +0700, Duy Nguyen wrote:
> >>  With that in
> >> mind, I think you don't need to keep a fs cache on disk at all. All
> >> you need to store (in the index) is the clock value from watchman.
> >> After we parse the index, we perform a "since" query to get path names
> >> (and perhaps "exists" and "mode" for later). Then we set CE_VALID bit
> >> on entries that are _not_ in the query result. The remaining items
> >> will be lstat'd by git (see [1] and read-cache.c changes on the next
> >> few patches). Assuming the number of updated files is reasonably
> >> small, we won't  be punished by lookup time.
> >
> > I considered this, but rejected it for a few reasons:
> > 1. We still need to know about the untracked files.  I guess we could
> > do an index extension for just that, like your untracked cache.
> 
> Yes. But consider that the number of untracked files is usually small
> (otherwise 'git status' would look very messy). And your fscache would
> need to store excluded file list too, which could be a lot bigger (one
> pair of  .[ch] -> one .o). _But_ yours would make 'git status
> --ignored' work. I don't consider that a major use case for
> optimization, but people may have different opinions.

I don't think --ignored is a major use case.  But I do think it's nice
to get as much mileage as possible out of a change like this.  

> > 2. That doesn't eliminate opendir/readdir, I think.  Those are a major
> > cost. I did not thoroughly review your patches from January/February,
> > but I didn't notice anything in there about this part of dir.c.
> > Also the cost of open(nonexistent .gitignore) to do ignore processing.
> 
> Assuming untracked cache is in use, opendir/readdir is replaced with
> lstat. And cheap lstat can be solved with watchman without storing
> anything extra. I solve open(.gitignore) too by checking its stat data
> with the one in index. If matches, I reuse the version in tree. This
> does not necessarily make it cheaper, but it increases locality so it
> might be. _Modified_ .gitignore files have to go through
> open(.gitignore), but people don't update .gitignore often.

Interesting -- if all that actually works, then it's probably the right
approach.  

> > 3. Changing a file in the index (e.g. git add) requires clearing
> > the CE_VALID bit; this means that third-party libraries (e.g. jgit)
> > that change the git repo need to understand this extension for correct
> > operation.  Maybe that's just the nature of extensions, but it's not
> > something that my present patch set requires.
> 
> I don't store CE_VALID on disk. Instead I store a new bit CE_WATCHED.
> Only after loading and validating against watchman that I turn
> CE_WATCHED to CE_VALID in memory. So JGit, libgit2 are not confused.
> 
> I assume you won't change your mind about this. Which is fine to me. I
> will still try out my approach with your libwatchman though. Just
> curious about its performance and complexity, compared to your
> approach.

I am open-minded here. This code is really the first time I have looked
at git's internals, and I accept that your way might be better.  If
you're going to try the watchman version of your approach, then we do a
direct comparison.  Let me know if there is something I can do to help
out on that.

> A bit off topic, but msys fork has another "fscache" in compat/win32.
> If you could come up with a different name, maybe it'll be less
> confusing for them after merging. But this is not a big deal, as this
> fscache won't work on windows anyway.

Does wtcache sounds like a better name?

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-14 Thread Duy Nguyen
One more thing I think we can solve with watchman is the racy
timestamp issue (see 29e4d36 and 407c8eb). Basically if a file is
updated within a second (and the system only supports timestamp
granuarity down to a second) then git can't know if a file is changed
by comparing stat data, so it compares content anyway. With watchman,
I think we know that a file is updated or not and can disable racy
test. How often this happens and whether it is worth supporting is
something to think about.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-14 Thread Duy Nguyen
On Wed, May 14, 2014 at 6:44 AM, David Turner  wrote:
>> I'm not so happy that git now needs to link to libjansson.so and
>> libwatchman.so. I would load libwatchman.so at runtime only when
>> core.usewatchman is on, but this is more of personal taste.
>
> I assume you mean something with dlopen? I don't really like that because
> (a) nothing else in git works that way and
> (b) you would still need the libwatchman headers to build git (or the
> structs could be copied, but that is ugly).

And because we don't link to libdl.so anyway, using dlopen means
trading libjansson.so and libwatchman.so for libdl.so. And it's
uglier. So forget what I wrote.

>>  With that in
>> mind, I think you don't need to keep a fs cache on disk at all. All
>> you need to store (in the index) is the clock value from watchman.
>> After we parse the index, we perform a "since" query to get path names
>> (and perhaps "exists" and "mode" for later). Then we set CE_VALID bit
>> on entries that are _not_ in the query result. The remaining items
>> will be lstat'd by git (see [1] and read-cache.c changes on the next
>> few patches). Assuming the number of updated files is reasonably
>> small, we won't  be punished by lookup time.
>
> I considered this, but rejected it for a few reasons:
> 1. We still need to know about the untracked files.  I guess we could
> do an index extension for just that, like your untracked cache.

Yes. But consider that the number of untracked files is usually small
(otherwise 'git status' would look very messy). And your fscache would
need to store excluded file list too, which could be a lot bigger (one
pair of  .[ch] -> one .o). _But_ yours would make 'git status
--ignored' work. I don't consider that a major use case for
optimization, but people may have different opinions.

> 2. That doesn't eliminate opendir/readdir, I think.  Those are a major
> cost. I did not thoroughly review your patches from January/February,
> but I didn't notice anything in there about this part of dir.c.
> Also the cost of open(nonexistent .gitignore) to do ignore processing.

Assuming untracked cache is in use, opendir/readdir is replaced with
lstat. And cheap lstat can be solved with watchman without storing
anything extra. I solve open(.gitignore) too by checking its stat data
with the one in index. If matches, I reuse the version in tree. This
does not necessarily make it cheaper, but it increases locality so it
might be. _Modified_ .gitignore files have to go through
open(.gitignore), but people don't update .gitignore often.

> 3. Changing a file in the index (e.g. git add) requires clearing
> the CE_VALID bit; this means that third-party libraries (e.g. jgit)
> that change the git repo need to understand this extension for correct
> operation.  Maybe that's just the nature of extensions, but it's not
> something that my present patch set requires.

I don't store CE_VALID on disk. Instead I store a new bit CE_WATCHED.
Only after loading and validating against watchman that I turn
CE_WATCHED to CE_VALID in memory. So JGit, libgit2 are not confused.

I assume you won't change your mind about this. Which is fine to me. I
will still try out my approach with your libwatchman though. Just
curious about its performance and complexity, compared to your
approach.

A bit off topic, but msys fork has another "fscache" in compat/win32.
If you could come up with a different name, maybe it'll be less
confusing for them after merging. But this is not a big deal, as this
fscache won't work on windows anyway.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-13 Thread David Turner
On Sat, 2014-05-10 at 15:16 +0700, Duy Nguyen wrote:
> On Sat, May 3, 2014 at 6:14 AM,   wrote:
> > The most sigificant patch uses Facebook's watchman daemon[1] to monitor
> > the repository work tree for changes.  This makes allows git status
> > to avoid traversing the entire work tree to find changes.
> 
> Some comments on this series. I still haven't been able to run it with
> watchman so there are many guesses from my side.
> 
> First of all, when I set USE_WATCHMAN=Yes in config.mak I expect it to
> work out of the box, provided that all dependencies are installed. I
> still need to set WATCHMAN_LIBS for it to build because you only set
> it with configure script. Would be nice to have a default value for
> non-configure people too.

I'll fix that, thanks.

> I'm not so happy that git now needs to link to libjansson.so and
> libwatchman.so. I would load libwatchman.so at runtime only when
> core.usewatchman is on, but this is more of personal taste.

I assume you mean something with dlopen? I don't really like that because
(a) nothing else in git works that way and
(b) you would still need the libwatchman headers to build git (or the
structs could be copied, but that is ugly).

> I still prefer my old tracking model, where the majority of lstat() is
> done by refresh operation and we only need to optimize those lstat
> calls, not every single lstat statement in the code base.

The lstat calls are only one of the problems.  The others are
opendir/readdir and open(.gitignore). 

>  With that in
> mind, I think you don't need to keep a fs cache on disk at all. All
> you need to store (in the index) is the clock value from watchman.
> After we parse the index, we perform a "since" query to get path names
> (and perhaps "exists" and "mode" for later). Then we set CE_VALID bit
> on entries that are _not_ in the query result. The remaining items
> will be lstat'd by git (see [1] and read-cache.c changes on the next
> few patches). Assuming the number of updated files is reasonably
> small, we won't  be punished by lookup time. 

I considered this, but rejected it for a few reasons:
1. We still need to know about the untracked files.  I guess we could 
do an index extension for just that, like your untracked cache.

2. That doesn't eliminate opendir/readdir, I think.  Those are a major 
cost. I did not thoroughly review your patches from January/February, 
but I didn't notice anything in there about this part of dir.c.  
Also the cost of open(nonexistent .gitignore) to do ignore processing.

3. Changing a file in the index (e.g. git add) requires clearing
the CE_VALID bit; this means that third-party libraries (e.g. jgit) 
that change the git repo need to understand this extension for correct
operation.  Maybe that's just the nature of extensions, but it's not
something that my present patch set requires.


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-13 Thread David Turner
On Wed, 2014-05-14 at 05:54 +0700, Duy Nguyen wrote:
> On Wed, May 14, 2014 at 5:38 AM, David Turner  
> wrote:
> > On Mon, 2014-05-12 at 17:45 +0700, Duy Nguyen wrote:
> >> This is your quote from above, moved down a bit:
> >>
> >> > update_fs_cache should only have to update based on what it has learned
> >> > from watchman.  So if no .gitignore has been changed, it should not have
> >> > to do very much work.
> >> >
> >> > I could take the fe_excluded check and move it above the
> >> > last_exclude_matching check in fs_cache_is_excluded; it causes t7300 to
> >> > fail when run under watchman but presumably that's fixable
> >>
> >> So you exclude files early and make the real read_directory() pass do
> >> pretty much nothing. This is probably not a good idea. Assume that I
> >> touch $TOP/.gitignore then do something other than "git status" (or
> >> "git add") then I have to pay read_directory() cost.
> >
> > I'm not sure I understand this. read_directory does something: it checks
> > the fs_cache (instead of the filesystem) for untracked files.
> 
> A lot of commands do read_cache() that that eventually calls
> update_fs_cache, which does part of read_directory's work (the
> fe_excluded thing). But not many of those commands actually call
> read_directory(). It'd better if there's a way to mark that "this
> .gitignore is changed", but delay the actual exclude processsing until
> we are sure read_directory() will be used.

OK, that would be straighforward, I think.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-13 Thread Duy Nguyen
On Wed, May 14, 2014 at 5:38 AM, David Turner  wrote:
> On Mon, 2014-05-12 at 17:45 +0700, Duy Nguyen wrote:
>> This is your quote from above, moved down a bit:
>>
>> > update_fs_cache should only have to update based on what it has learned
>> > from watchman.  So if no .gitignore has been changed, it should not have
>> > to do very much work.
>> >
>> > I could take the fe_excluded check and move it above the
>> > last_exclude_matching check in fs_cache_is_excluded; it causes t7300 to
>> > fail when run under watchman but presumably that's fixable
>>
>> So you exclude files early and make the real read_directory() pass do
>> pretty much nothing. This is probably not a good idea. Assume that I
>> touch $TOP/.gitignore then do something other than "git status" (or
>> "git add") then I have to pay read_directory() cost.
>
> I'm not sure I understand this. read_directory does something: it checks
> the fs_cache (instead of the filesystem) for untracked files.

A lot of commands do read_cache() that that eventually calls
update_fs_cache, which does part of read_directory's work (the
fe_excluded thing). But not many of those commands actually call
read_directory(). It'd better if there's a way to mark that "this
.gitignore is changed", but delay the actual exclude processsing until
we are sure read_directory() will be used.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-13 Thread David Turner
On Mon, 2014-05-12 at 17:45 +0700, Duy Nguyen wrote:
> This is your quote from above, moved down a bit:
> 
> > update_fs_cache should only have to update based on what it has learned
> > from watchman.  So if no .gitignore has been changed, it should not have
> > to do very much work.
> >
> > I could take the fe_excluded check and move it above the
> > last_exclude_matching check in fs_cache_is_excluded; it causes t7300 to
> > fail when run under watchman but presumably that's fixable
> 
> So you exclude files early and make the real read_directory() pass do
> pretty much nothing. This is probably not a good idea. Assume that I
> touch $TOP/.gitignore then do something other than "git status" (or
> "git add") then I have to pay read_directory() cost.

I'm not sure I understand this. read_directory does something: it checks
the fs_cache (instead of the filesystem) for untracked files.

> Back to the open vs fs_cache_open and the number of .gitignore files
> above. I touch $TOP/.gitignore then do "git status" to make it read
> all .gitignore files (6k of them) and change between open and
> fs_cache_open. I think the numbers still  do not make any visible
> difference (~1620-1630ms).

Yes, I would expect no win in that case.  fs_cache_open will only save
time in the common case where there is no .gitignore file, because it
saves an open() call.  If every possible .gitignore file exists, of
course it makes no difference.  But also, your processor may be
sufficiently slow that the context-switch penalty for open() is less
than the hash table lookup.  

For me, the win from fs_cache_open is about 7%.  



--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-12 Thread Duy Nguyen
On Mon, May 12, 2014 at 5:56 AM, David Turner  wrote:
>> So without watchman I got
>>
>>299.871ms read_index_from:1538 if (verify_hdr(hdr, mmap_size) < 0) go
>>498.205ms cmd_status:1300 refresh_index(&the_index, REFRESH_QUIE
>>796.050ms wt_status_collect:622 wt_status_collect_untracked(s)
>>
>> and with watchman ("git status" ran several times to make sure it's cached)
>>
>>301.950ms read_index_from:1538 if (verify_hdr(hdr, mmap_size) < 0) go
>> 34.918ms  read_fs_cache:347 if (verify_hdr(hdr, mmap_size) < 0) go
>>   1564.096ms  watchman_load_fs_cache:628 update_fs_cache(istate, result);
>>161.930ms cmd_status:1300 refresh_index(&the_index, REFRESH_QUIE
>>251.614ms wt_status_collect:622 wt_status_collect_untracked(s)
>>
>> Given the total time of "git status" without watchman is 1.9s,,
>> update_fs_cache() nearly matches that number alone. All that is spent
>> in the exclude update code in the function, but if you do
>> last_excluding_matching() anyway, why cache it?
>
> My numbers are different (for my test repository):
>
> ---
> 30.031ms read_index:1386 r = read_index_from(istate, get_index_
> 71.625ms cmd_status:1302 refresh_index(&the_index, REFRESH_QUIE
>259.712ms wt_status_collect:622 wt_status_collect_untracked(s)
> 
> 41.110ms read_index:1386 r = read_index_from(istate, get_index_
>  9.294ms read_fs_cache:347 if (verify_hdr(hdr, mmap_size) < 0) go
>  0.173ms watchman_load_fs_cache:628 update_fs_cache(istate, result)
> 41.901ms read_index:1386 r = read_index_from(istate, get_index_
> 18.355ms cmd_status:1302 refresh_index(&the_index, REFRESH_QUIE
> 50.911ms wt_status_collect:622 wt_status_collect_untracked(s)
> ---
>
> I think something must be going wrong with update_fs_cache on your
> machine.  I have a few hypotheses:
>
> 1. Maybe watchman isn't fully started up when you run your tests.
> 2. Maybe there is a bug.

It's probably me doing something wrong (I ran it more than a couple
times so watchman must have loaded the whole thing). I got small
numbers in update_fs_cache() now.

>> A bit surprised about wt_status_collect_untracked number. I verified
>> that last_excluding_matching() still runs (on the same number of
>> entries like in no-watchman case). Replacing fs_cache_open() in
>> add_excludes_from_file_to_list() to plain open() does not change the
>> number, so we probably won't need that (unless your worktree is filled
>> with .gitignore, which I doubt it's a norm).
>
> My test repo has a couple hundred of them.  Maybe that's unusual?  A
> repo with a lot of projects will tend to have lots of gitignore files,
> because each project will want to maintain them independently.

I tried the worst case, every directory had an empty .gitignore. The
numbers did not change much. And I found out that because
add_excludes.. were called twice, not on every .gitignore because of
the condition "!(dir->flags & DIR_EXCLUDE_CMDL_ONLY)". So the number
of .gitignore does not matter (yet).

This is your quote from above, moved down a bit:

> update_fs_cache should only have to update based on what it has learned
> from watchman.  So if no .gitignore has been changed, it should not have
> to do very much work.
>
> I could take the fe_excluded check and move it above the
> last_exclude_matching check in fs_cache_is_excluded; it causes t7300 to
> fail when run under watchman but presumably that's fixable

So you exclude files early and make the real read_directory() pass do
pretty much nothing. This is probably not a good idea. Assume that I
touch $TOP/.gitignore then do something other than "git status" (or
"git add") then I have to pay read_directory() cost.

Back to the open vs fs_cache_open and the number of .gitignore files
above. I touch $TOP/.gitignore then do "git status" to make it read
all .gitignore files (6k of them) and change between open and
fs_cache_open. I think the numbers still  do not make any visible
difference (~1620-1630ms).
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-11 Thread David Turner
On Sun, 2014-05-11 at 07:21 +0700, Duy Nguyen wrote:
> On Sun, May 11, 2014 at 1:38 AM, David Turner  
> wrote:
> >> I got "warning: Watchman watch error: Got bad JSON from watchman
> >> get-sockname: '[' or '{' expected near end of file". Any ideas what I
> >> did wrong? I'm using watchman.git and libwatchman.git. check-0.9.11
> >> and jansson-2.4 were installed by system (gentoo).
> >
> > What do you get from watchman get-sockname on the command-line?  Do the
> > watchman tests pass?
> 
> Found the problem. "watchman" binary is not in $PATH but popen() did
> not complain (or it did but your "2>/dev/null" in watchman_connect
> suppressed it). 

I should probably not be using popen, since it doesn't offer good error
reporting.  I'll try to fix that in the next few days.

> BTW you need to update the array size of "expressions"
> in test_watchman_misc().

Thanks, fixed.

> So without watchman I got
> 
>299.871ms read_index_from:1538 if (verify_hdr(hdr, mmap_size) < 0) go
>498.205ms cmd_status:1300 refresh_index(&the_index, REFRESH_QUIE
>796.050ms wt_status_collect:622 wt_status_collect_untracked(s)
> 
> and with watchman ("git status" ran several times to make sure it's cached)
> 
>301.950ms read_index_from:1538 if (verify_hdr(hdr, mmap_size) < 0) go
> 34.918ms  read_fs_cache:347 if (verify_hdr(hdr, mmap_size) < 0) go
>   1564.096ms  watchman_load_fs_cache:628 update_fs_cache(istate, result);
>161.930ms cmd_status:1300 refresh_index(&the_index, REFRESH_QUIE
>251.614ms wt_status_collect:622 wt_status_collect_untracked(s)
> 
> Given the total time of "git status" without watchman is 1.9s,,
> update_fs_cache() nearly matches that number alone. All that is spent
> in the exclude update code in the function, but if you do
> last_excluding_matching() anyway, why cache it?

My numbers are different (for my test repository):

---
30.031ms read_index:1386 r = read_index_from(istate, get_index_
71.625ms cmd_status:1302 refresh_index(&the_index, REFRESH_QUIE
   259.712ms wt_status_collect:622 wt_status_collect_untracked(s)

41.110ms read_index:1386 r = read_index_from(istate, get_index_
 9.294ms read_fs_cache:347 if (verify_hdr(hdr, mmap_size) < 0) go
 0.173ms watchman_load_fs_cache:628 update_fs_cache(istate, result)
41.901ms read_index:1386 r = read_index_from(istate, get_index_
18.355ms cmd_status:1302 refresh_index(&the_index, REFRESH_QUIE
50.911ms wt_status_collect:622 wt_status_collect_untracked(s)
---

I think something must be going wrong with update_fs_cache on your
machine.  I have a few hypotheses:

1. Maybe watchman isn't fully started up when you run your tests.
2. Maybe there is a bug.

update_fs_cache should only have to update based on what it has learned
from watchman.  So if no .gitignore has been changed, it should not have
to do very much work.

I could take the fe_excluded check and move it above the
last_exclude_matching check in fs_cache_is_excluded; it causes t7300 to
fail when run under watchman but presumably that's fixable

> I think we're paying lookup cost in refresh_index(). I forced CE_VALID
> bit on as an experiment and refresh_index() went down to 33ms.

Yes, it is possible to use Watchman to set CE_VALID as well, and I
should probably do that.  It's a bit more complicated (at least, I
recall running into problems), but probably doable.

> A bit surprised about wt_status_collect_untracked number. I verified
> that last_excluding_matching() still runs (on the same number of
> entries like in no-watchman case). Replacing fs_cache_open() in
> add_excludes_from_file_to_list() to plain open() does not change the
> number, so we probably won't need that (unless your worktree is filled
> with .gitignore, which I doubt it's a norm). 

My test repo has a couple hundred of them.  Maybe that's unusual?  A
repo with a lot of projects will tend to have lots of gitignore files,
because each project will want to maintain them independently.


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-10 Thread Duy Nguyen
On Sun, May 11, 2014 at 1:38 AM, David Turner  wrote:
>> I got "warning: Watchman watch error: Got bad JSON from watchman
>> get-sockname: '[' or '{' expected near end of file". Any ideas what I
>> did wrong? I'm using watchman.git and libwatchman.git. check-0.9.11
>> and jansson-2.4 were installed by system (gentoo).
>
> What do you get from watchman get-sockname on the command-line?  Do the
> watchman tests pass?

Found the problem. "watchman" binary is not in $PATH but popen() did
not complain (or it did but your "2>/dev/null" in watchman_connect
suppressed it). BTW you need to update the array size of "expressions"
in test_watchman_misc().

So without watchman I got

   299.871ms read_index_from:1538 if (verify_hdr(hdr, mmap_size) < 0) go
   498.205ms cmd_status:1300 refresh_index(&the_index, REFRESH_QUIE
   796.050ms wt_status_collect:622 wt_status_collect_untracked(s)

and with watchman ("git status" ran several times to make sure it's cached)

   301.950ms read_index_from:1538 if (verify_hdr(hdr, mmap_size) < 0) go
34.918ms  read_fs_cache:347 if (verify_hdr(hdr, mmap_size) < 0) go
  1564.096ms  watchman_load_fs_cache:628 update_fs_cache(istate, result);
   161.930ms cmd_status:1300 refresh_index(&the_index, REFRESH_QUIE
   251.614ms wt_status_collect:622 wt_status_collect_untracked(s)

Given the total time of "git status" without watchman is 1.9s,,
update_fs_cache() nearly matches that number alone. All that is spent
in the exclude update code in the function, but if you do
last_excluding_matching() anyway, why cache it?

I think we're paying lookup cost in refresh_index(). I forced CE_VALID
bit on as an experiment and refresh_index() went down to 33ms.

A bit surprised about wt_status_collect_untracked number. I verified
that last_excluding_matching() still runs (on the same number of
entries like in no-watchman case). Replacing fs_cache_open() in
add_excludes_from_file_to_list() to plain open() does not change the
number, so we probably won't need that (unless your worktree is filled
with .gitignore, which I doubt it's a norm). So about 500ms saved is
in opendir/readdir.. A simple/separate opendir/readdir/closedir
program confirms that (ugh!).
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-10 Thread David Turner
On Sat, 2014-05-10 at 12:26 +0700, Duy Nguyen wrote:
> On Sat, May 3, 2014 at 6:14 AM,   wrote:
> > The most sigificant patch uses Facebook's watchman daemon[1] to monitor
> > the repository work tree for changes.  This makes allows git status
> > to avoid traversing the entire work tree to find changes.
> 
> I got "warning: Watchman watch error: Got bad JSON from watchman
> get-sockname: '[' or '{' expected near end of file". Any ideas what I
> did wrong? I'm using watchman.git and libwatchman.git. check-0.9.11
> and jansson-2.4 were installed by system (gentoo).

What do you get from watchman get-sockname on the command-line?  Do the
watchman tests pass?

I don't know that I have tested with Jansson 2.4, but I've tried 2.2.1,
2.5, and 2.6.  

I also haven't tried on gentoo (I use Ubuntu).  But if you still have
problems, I can try installing gentoo in a vm and seeing if it's
something gentoo-specific.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-10 Thread Duy Nguyen
On Sat, May 3, 2014 at 6:14 AM,   wrote:
> The most sigificant patch uses Facebook's watchman daemon[1] to monitor
> the repository work tree for changes.  This makes allows git status
> to avoid traversing the entire work tree to find changes.

Some comments on this series. I still haven't been able to run it with
watchman so there are many guesses from my side.

First of all, when I set USE_WATCHMAN=Yes in config.mak I expect it to
work out of the box, provided that all dependencies are installed. I
still need to set WATCHMAN_LIBS for it to build because you only set
it with configure script. Would be nice to have a default value for
non-configure people too.

I'm not so happy that git now needs to link to libjansson.so and
libwatchman.so. I would load libwatchman.so at runtime only when
core.usewatchman is on, but this is more of personal taste.

I still prefer my old tracking model, where the majority of lstat() is
done by refresh operation and we only need to optimize those lstat
calls, not every single lstat statement in the code base. With that in
mind, I think you don't need to keep a fs cache on disk at all. All
you need to store (in the index) is the clock value from watchman.
After we parse the index, we perform a "since" query to get path names
(and perhaps "exists" and "mode" for later). Then we set CE_VALID bit
on entries that are _not_ in the query result. The remaining items
will be lstat'd by git (see [1] and read-cache.c changes onthe next
few patches). Assuming the number of updated files is reasonably
small, we won't  be punished by lookup time. To avoid write time cost
due to possible mass CE_VALID change, assuming split-index is already
used, we could store this bit separately as an extension and apply it
back at read time.

Your changes in dir.c. If I didn't misread it, you still do
last_exclude_matching when using fs-cache. That call is where all your
CPU is spent and probably explains why you didn't see big perf gain
with watchman. I think my untracked cache has dealt correctly with
caching in this area. So when you do the watchman query earlier, you
also check if any of the directories are updated and force using
untracked cache for the rest of the directories.

Hope it helps,
Duy

[1] http://thread.gmane.org/gmane.comp.version-control.git/240339
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-09 Thread Duy Nguyen
On Sat, May 3, 2014 at 6:14 AM,   wrote:
> The most sigificant patch uses Facebook's watchman daemon[1] to monitor
> the repository work tree for changes.  This makes allows git status
> to avoid traversing the entire work tree to find changes.

I got "warning: Watchman watch error: Got bad JSON from watchman
get-sockname: '[' or '{' expected near end of file". Any ideas what I
did wrong? I'm using watchman.git and libwatchman.git. check-0.9.11
and jansson-2.4 were installed by system (gentoo).
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-09 Thread David Turner
On Fri, 2014-05-09 at 11:27 -0700, David Lang wrote:
> >
> > That's not my understanding from Durham Goode's talk in January.  Yes,
> > operations involving history go to the server.  But the client also
> > maintains a copy of the working tree, and it is for this that watchman
> > is used.  Otherwise, why bother with watchman at all?  The server knows
> > when it changes files and could simply maintain its own index of what's
> > changed.  Watchman is built on inotify/fsevents -- it doesn't have
> > anything to do with any sort of storage device beyond a vanilla hard
> > drive.
> 
> When you have such a massive repo, your clients aren't storing the data on 
> their 
> local drives, they are accessing the data on a network attached storage (via 
> NFS 
> or through a fuse mount). So they can have their watchman send queries to the 
> storage server to find out what has changed in this massive repo rather than 
> having to walk the directory tree (or try to monitor it for changes on the 
> client machine)

Yes, you could do that.  But I repeat: that is not what Facebook is
actually doing.  If it were, they would have no need for inotify or
FSEvents, since neither even works with NFS (and a FUSE-based solution
would need special support at which point it might as well just
implement watchman itself). 

Please see this section of Durham Goode's talk for evidence that
Facebook is not doing that:
http://www.youtube.com/watch?v=Dlguc63cRXg&t=10m40s

(it's at 10:40 in case that link doesn't work)

Maybe things have changed since at Facebook since Durham gave that talk
in January and you have knowledge of that. If so, say so.  But from what
Facebook has said publicly, this is simply not what's going on.


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-09 Thread David Lang

On Fri, 9 May 2014, David Turner wrote:


On Fri, 2014-05-09 at 11:08 -0700, David Lang wrote:

On Fri, 9 May 2014, David Turner wrote:


On Fri, 2014-05-09 at 00:08 -0700, David Lang wrote:

On Thu, 8 May 2014, Sebastian Schuberth wrote:


On 03.05.2014 05:40, Felipe Contreras wrote:


That's very interesting. Do you get similar improvements when doing
something similar in Merurial (watchman vs . no watchman).


I have not tried it.  My understanding is that this is why Facebook
wrote Watchman and added support for it to Mercurial, so I would assume
that the improvements are at least this good.


Yeah, my bet is that they are actually much better (because Mercurial
can't be so optimized as Git).

I'm interested in this number because if watchman in Git is improving it
by 30%, but in Mercurial it's improving it by 100% (made up number),
therefore it makes sens that you might want it more if you are using hg,
but not so much if you are using git.

Also, if similar repositories with Mercurial+watchman are actually
faster than Git+watchman, that means that there's room for improvement
in your implementation. This is not a big issue at this point of the
process, just something nice to know.


The article at [1] has some details, they claim "For our repository, enabling 
Watchman integration has made Mercurial's status command more than 5x faster than Git's 
status command".

[1] 
https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/


a lot of that speed comparison is going to depend on your storage system and the
size of your repository.

if you have a high-end enterprise storage system that tracks metadata very
differently from the file contents (I've seen some that have rackes worth of
SATA drives for contents and then 'small' arrays of a few dozen flash drives for
the metadata), and then you have very large repositories (Facebook has
everything in a single repo), then you have a perfect storm where something like
watchman that talks the proprietary protocol of the storage array can be FAR
faster than anything that needs to operate with the standard POSIX calls.

That can easily account for the difference between the facebook announcement and
the results presented for normal disks that show an improvement, but with even
stock git being faster than improved mercurial.


As I recall from Facebook's presentation[1] on this (as well as from the
discussion on the git mailing list[2]), Facebook's test respository is
much larger than any known git repository.  In particular, it is larger
than WebKit.


agreed, it's huge, it's the entire codebase history of every tool that they use
crammed together in one rep


These performance improvements are not for server-side
tasks, but for client-side (e.g. git/hg status).  Facebook also made
other improvements for the client-server communication, and for
log/blame, but these are not relevant to watchman.


well, in their situation they have shared storage that clients use for this huge
repo, so I don't think they have a clear client/server boundry the way you are
thinking. Even clients have this huge repo to deal with, and they can do so
efficiently by querying the storage device rather than trying to walk the tree
or monitor access directly.


That's not my understanding from Durham Goode's talk in January.  Yes,
operations involving history go to the server.  But the client also
maintains a copy of the working tree, and it is for this that watchman
is used.  Otherwise, why bother with watchman at all?  The server knows
when it changes files and could simply maintain its own index of what's
changed.  Watchman is built on inotify/fsevents -- it doesn't have
anything to do with any sort of storage device beyond a vanilla hard
drive.


When you have such a massive repo, your clients aren't storing the data on their 
local drives, they are accessing the data on a network attached storage (via NFS 
or through a fuse mount). So they can have their watchman send queries to the 
storage server to find out what has changed in this massive repo rather than 
having to walk the directory tree (or try to monitor it for changes on the 
client machine)


David Lang
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-09 Thread David Turner
On Fri, 2014-05-09 at 11:08 -0700, David Lang wrote:
> On Fri, 9 May 2014, David Turner wrote:
> 
> > On Fri, 2014-05-09 at 00:08 -0700, David Lang wrote:
> >> On Thu, 8 May 2014, Sebastian Schuberth wrote:
> >>
> >>> On 03.05.2014 05:40, Felipe Contreras wrote:
> >>>
> >> That's very interesting. Do you get similar improvements when doing
> >> something similar in Merurial (watchman vs . no watchman).
> >
> > I have not tried it.  My understanding is that this is why Facebook
> > wrote Watchman and added support for it to Mercurial, so I would assume
> > that the improvements are at least this good.
> 
>  Yeah, my bet is that they are actually much better (because Mercurial
>  can't be so optimized as Git).
> 
>  I'm interested in this number because if watchman in Git is improving it
>  by 30%, but in Mercurial it's improving it by 100% (made up number),
>  therefore it makes sens that you might want it more if you are using hg,
>  but not so much if you are using git.
> 
>  Also, if similar repositories with Mercurial+watchman are actually
>  faster than Git+watchman, that means that there's room for improvement
>  in your implementation. This is not a big issue at this point of the
>  process, just something nice to know.
> >>>
> >>> The article at [1] has some details, they claim "For our repository, 
> >>> enabling Watchman integration has made Mercurial's status command more 
> >>> than 5x faster than Git's status command".
> >>>
> >>> [1] 
> >>> https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/
> >>
> >> a lot of that speed comparison is going to depend on your storage system 
> >> and the
> >> size of your repository.
> >>
> >> if you have a high-end enterprise storage system that tracks metadata very
> >> differently from the file contents (I've seen some that have rackes worth 
> >> of
> >> SATA drives for contents and then 'small' arrays of a few dozen flash 
> >> drives for
> >> the metadata), and then you have very large repositories (Facebook has
> >> everything in a single repo), then you have a perfect storm where 
> >> something like
> >> watchman that talks the proprietary protocol of the storage array can be 
> >> FAR
> >> faster than anything that needs to operate with the standard POSIX calls.
> >>
> >> That can easily account for the difference between the facebook 
> >> announcement and
> >> the results presented for normal disks that show an improvement, but with 
> >> even
> >> stock git being faster than improved mercurial.
> >
> > As I recall from Facebook's presentation[1] on this (as well as from the
> > discussion on the git mailing list[2]), Facebook's test respository is
> > much larger than any known git repository.  In particular, it is larger
> > than WebKit.
> 
> agreed, it's huge, it's the entire codebase history of every tool that they 
> use 
> crammed together in one rep
> 
> > These performance improvements are not for server-side
> > tasks, but for client-side (e.g. git/hg status).  Facebook also made
> > other improvements for the client-server communication, and for
> > log/blame, but these are not relevant to watchman.
> 
> well, in their situation they have shared storage that clients use for this 
> huge 
> repo, so I don't think they have a clear client/server boundry the way you 
> are 
> thinking. Even clients have this huge repo to deal with, and they can do so 
> efficiently by querying the storage device rather than trying to walk the 
> tree 
> or monitor access directly.

That's not my understanding from Durham Goode's talk in January.  Yes,
operations involving history go to the server.  But the client also
maintains a copy of the working tree, and it is for this that watchman
is used.  Otherwise, why bother with watchman at all?  The server knows
when it changes files and could simply maintain its own index of what's
changed.  Watchman is built on inotify/fsevents -- it doesn't have
anything to do with any sort of storage device beyond a vanilla hard
drive.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-09 Thread David Lang

On Fri, 9 May 2014, David Turner wrote:


On Fri, 2014-05-09 at 00:08 -0700, David Lang wrote:

On Thu, 8 May 2014, Sebastian Schuberth wrote:


On 03.05.2014 05:40, Felipe Contreras wrote:


That's very interesting. Do you get similar improvements when doing
something similar in Merurial (watchman vs . no watchman).


I have not tried it.  My understanding is that this is why Facebook
wrote Watchman and added support for it to Mercurial, so I would assume
that the improvements are at least this good.


Yeah, my bet is that they are actually much better (because Mercurial
can't be so optimized as Git).

I'm interested in this number because if watchman in Git is improving it
by 30%, but in Mercurial it's improving it by 100% (made up number),
therefore it makes sens that you might want it more if you are using hg,
but not so much if you are using git.

Also, if similar repositories with Mercurial+watchman are actually
faster than Git+watchman, that means that there's room for improvement
in your implementation. This is not a big issue at this point of the
process, just something nice to know.


The article at [1] has some details, they claim "For our repository, enabling 
Watchman integration has made Mercurial's status command more than 5x faster than Git's 
status command".

[1] 
https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/


a lot of that speed comparison is going to depend on your storage system and the
size of your repository.

if you have a high-end enterprise storage system that tracks metadata very
differently from the file contents (I've seen some that have rackes worth of
SATA drives for contents and then 'small' arrays of a few dozen flash drives for
the metadata), and then you have very large repositories (Facebook has
everything in a single repo), then you have a perfect storm where something like
watchman that talks the proprietary protocol of the storage array can be FAR
faster than anything that needs to operate with the standard POSIX calls.

That can easily account for the difference between the facebook announcement and
the results presented for normal disks that show an improvement, but with even
stock git being faster than improved mercurial.


As I recall from Facebook's presentation[1] on this (as well as from the
discussion on the git mailing list[2]), Facebook's test respository is
much larger than any known git repository.  In particular, it is larger
than WebKit.


agreed, it's huge, it's the entire codebase history of every tool that they use 
crammed together in one rep



These performance improvements are not for server-side
tasks, but for client-side (e.g. git/hg status).  Facebook also made
other improvements for the client-server communication, and for
log/blame, but these are not relevant to watchman.


well, in their situation they have shared storage that clients use for this huge 
repo, so I don't think they have a clear client/server boundry the way you are 
thinking. Even clients have this huge repo to deal with, and they can do so 
efficiently by querying the storage device rather than trying to walk the tree 
or monitor access directly.



It is entirely possible that, as repo size grows, Mercurial with
watchman is faster than git without.

With my patches, git status isn't constant-time; it's merely a roughly
constant factor faster. My initial design was to make git status
constant-time by caching the results of the wt_status_collect calls.


This is what you would have to do with traditional storage. My understanding is 
that the real benefits of watchman show up when you have non-traditional storage 
and can take advantage of the knowledge that the storage system gathers for it's 
own use.


David Lang


But there were so many cases with the various options that I got a bit
lost in the wilderness and made a big mess. Maybe I would do better if I
tried it again today.  And maybe if I just build on top of the
untracked-cache code, I would be able to get to constant-time; I'll have
to try that at some point.

[1] http://www.youtube.com/watch?v=Dlguc63cRXg
[2] http://comments.gmane.org/gmane.comp.version-control.git/189776

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-09 Thread David Turner
On Fri, 2014-05-09 at 00:08 -0700, David Lang wrote:
> On Thu, 8 May 2014, Sebastian Schuberth wrote:
> 
> > On 03.05.2014 05:40, Felipe Contreras wrote:
> >
>  That's very interesting. Do you get similar improvements when doing
>  something similar in Merurial (watchman vs . no watchman).
> >>>
> >>> I have not tried it.  My understanding is that this is why Facebook
> >>> wrote Watchman and added support for it to Mercurial, so I would assume
> >>> that the improvements are at least this good.
> >>
> >> Yeah, my bet is that they are actually much better (because Mercurial
> >> can't be so optimized as Git).
> >>
> >> I'm interested in this number because if watchman in Git is improving it
> >> by 30%, but in Mercurial it's improving it by 100% (made up number),
> >> therefore it makes sens that you might want it more if you are using hg,
> >> but not so much if you are using git.
> >>
> >> Also, if similar repositories with Mercurial+watchman are actually
> >> faster than Git+watchman, that means that there's room for improvement
> >> in your implementation. This is not a big issue at this point of the
> >> process, just something nice to know.
> >
> > The article at [1] has some details, they claim "For our repository, 
> > enabling Watchman integration has made Mercurial's status command more than 
> > 5x faster than Git's status command".
> >
> > [1] 
> > https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/
> 
> a lot of that speed comparison is going to depend on your storage system and 
> the 
> size of your repository.
> 
> if you have a high-end enterprise storage system that tracks metadata very 
> differently from the file contents (I've seen some that have rackes worth of 
> SATA drives for contents and then 'small' arrays of a few dozen flash drives 
> for 
> the metadata), and then you have very large repositories (Facebook has 
> everything in a single repo), then you have a perfect storm where something 
> like 
> watchman that talks the proprietary protocol of the storage array can be FAR 
> faster than anything that needs to operate with the standard POSIX calls.
> 
> That can easily account for the difference between the facebook announcement 
> and 
> the results presented for normal disks that show an improvement, but with 
> even 
> stock git being faster than improved mercurial.

As I recall from Facebook's presentation[1] on this (as well as from the
discussion on the git mailing list[2]), Facebook's test respository is
much larger than any known git repository.  In particular, it is larger
than WebKit. These performance improvements are not for server-side
tasks, but for client-side (e.g. git/hg status).  Facebook also made
other improvements for the client-server communication, and for
log/blame, but these are not relevant to watchman.

It is entirely possible that, as repo size grows, Mercurial with
watchman is faster than git without.

With my patches, git status isn't constant-time; it's merely a roughly
constant factor faster. My initial design was to make git status
constant-time by caching the results of the wt_status_collect calls.
But there were so many cases with the various options that I got a bit
lost in the wilderness and made a big mess. Maybe I would do better if I
tried it again today.  And maybe if I just build on top of the
untracked-cache code, I would be able to get to constant-time; I'll have
to try that at some point.  

[1] http://www.youtube.com/watch?v=Dlguc63cRXg
[2] http://comments.gmane.org/gmane.comp.version-control.git/189776

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-09 Thread David Lang

On Thu, 8 May 2014, Sebastian Schuberth wrote:


On 03.05.2014 05:40, Felipe Contreras wrote:


That's very interesting. Do you get similar improvements when doing
something similar in Merurial (watchman vs . no watchman).


I have not tried it.  My understanding is that this is why Facebook
wrote Watchman and added support for it to Mercurial, so I would assume
that the improvements are at least this good.


Yeah, my bet is that they are actually much better (because Mercurial
can't be so optimized as Git).

I'm interested in this number because if watchman in Git is improving it
by 30%, but in Mercurial it's improving it by 100% (made up number),
therefore it makes sens that you might want it more if you are using hg,
but not so much if you are using git.

Also, if similar repositories with Mercurial+watchman are actually
faster than Git+watchman, that means that there's room for improvement
in your implementation. This is not a big issue at this point of the
process, just something nice to know.


The article at [1] has some details, they claim "For our repository, enabling 
Watchman integration has made Mercurial's status command more than 5x faster than Git's 
status command".

[1] 
https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/


a lot of that speed comparison is going to depend on your storage system and the 
size of your repository.


if you have a high-end enterprise storage system that tracks metadata very 
differently from the file contents (I've seen some that have rackes worth of 
SATA drives for contents and then 'small' arrays of a few dozen flash drives for 
the metadata), and then you have very large repositories (Facebook has 
everything in a single repo), then you have a perfect storm where something like 
watchman that talks the proprietary protocol of the storage array can be FAR 
faster than anything that needs to operate with the standard POSIX calls.


That can easily account for the difference between the facebook announcement and 
the results presented for normal disks that show an improvement, but with even 
stock git being faster than improved mercurial.


David Lang
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-08 Thread Sebastian Schuberth
On 03.05.2014 05:40, Felipe Contreras wrote:

>>> That's very interesting. Do you get similar improvements when doing
>>> something similar in Merurial (watchman vs . no watchman).
>>
>> I have not tried it.  My understanding is that this is why Facebook
>> wrote Watchman and added support for it to Mercurial, so I would assume
>> that the improvements are at least this good.
> 
> Yeah, my bet is that they are actually much better (because Mercurial
> can't be so optimized as Git).
> 
> I'm interested in this number because if watchman in Git is improving it
> by 30%, but in Mercurial it's improving it by 100% (made up number),
> therefore it makes sens that you might want it more if you are using hg,
> but not so much if you are using git.
> 
> Also, if similar repositories with Mercurial+watchman are actually
> faster than Git+watchman, that means that there's room for improvement
> in your implementation. This is not a big issue at this point of the
> process, just something nice to know.

The article at [1] has some details, they claim "For our repository, enabling 
Watchman integration has made Mercurial's status command more than 5x faster 
than Git's status command".

[1] 
https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/

-- 
Sebastian Schuberth
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-05 Thread David Turner
On Sun, 2014-05-04 at 07:15 +0700, Duy Nguyen wrote:
> > I would like to merge the feature into master.  It works well for me,
> > and some of my colleagues who have tried it out.
> 
> Have you tried to turn watchman on by default, then run it with git
> test suite? That usually helps.

I have.  The tests work run fine under make, but prove sometimes freezes
due to an issue in libwatchman which I just fixed (and which I plan to
merge as soon as I can get a colleague to look the changes over).

> > I can split the vmac patch into two, but one of them will remain quite
> > large because it contains the code for VMAC and AES, which total a bit
> > over 100k.  Since the list will probably reject that, I'll post a link
> > to a repository containing the patches.
> 
> With the read-cache deamon, I think hashing cost is less of an issue,
> so new hashing algorithm becomes less important. If you store the file
> cache in the deamon's memory only, there's no need to hash anything.
> But I guess you already tried this.

I agree that with the daemon, the cost is less of an issue, but I am not
100% sure it is a non-issue; consecutive commands that need to
read/write the index can still be slowed down.

> > I'm not 100% sure how to split the watchman patch up.  I could add the
> > fs_cache code and then separately add the watchman code that populates
> > the cache.  Do you think there is a need to divide it up beyond this?
> 
> I'll need to have closer look at your patches to give any suggestions.

I have uploaded a new version (which is about 5-10% faster and which
corrects some minor changes) to https://github.com/dturner-tw/git.git on
the watchman branch.  

> Although if you don't mind waiting a bit, I can try to put my
> untracked cache patches in good shape (hopefully in 2 weeks), then you
> can mostly avoid touching dir.c and reuse my work.

If the untracked cache patches are going to make it into master, then I
would of course be willing to rewrite on top of them.  But I would also
like to have a sense of whether there is any interest in watchman
support (outside of Twitter).

For what it's worth, the numbers today for index version 4 are for my
superscience repo are:
~380 (no watchman), ~260 (untracked-cache), ~175 (watchman).

That's because untracked-cache still has to stat every directory.

> I backed away from watchman support because I was worried about its
> overhead (of watchman itself, and git/watchman IPC because it's not
> designed specifically for git), which led me to try optimizing git as
> much as possible without watchman first, then see how/if watchman can
> help on top of that. I still think it's a good approach (maybe because
> it started to make me doubt if watchman could pull a big performance
> win on top to justify the changes to support it)

I think on large repositories (especially deeply-nested ones), with the
common case of a small number of changes, watchman will end up being a
big win.  Java tends towards deep nesting
(src/main/java/com/twitter/common/...), which is probably why my test
repo had the largest speedup (>50%).  The IPC overhead might become bad
if there were a large number of changes, but so far this has not been an
issue for me in testing.  

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-05 Thread Duy Nguyen
On Tue, May 6, 2014 at 7:26 AM, Duy Nguyen  wrote:
> This discard_index() code was added
> recently in e28f764 (unpack-trees: plug a memory leak - 2013-08-13).
> As a workaround, perhaps we only do so when the sequencer is running.

Hmm.. as o->result does not carry cache-tree anyway, the next
assignment after discard_index() will "destroy" cache-tree anyway. I
wrote and deleted the following sentence, but it looks like I should
rewrite again. So if cache-tree has always been destroyed before, be
careful when you try to keep cache-tree as cache-tree invalidation
code in unpack-trees.c may not be well tested.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-05 Thread Duy Nguyen
On Sat, May 3, 2014 at 7:52 AM, Duy Nguyen  wrote:
> wt_status_collect_changes_index() depends on how damaged cache-tree is
> (in this case, totally scraped). watchman does not help this either.
> We need to try to "heal" cache-tree as much as possible to reduce the
> number.

On the topic of cache-tree, I notice that unpack_trees() will call
discard_index() in the end (or even discard_index() on a merge
failure). That destroys cache-tree. unpack_trees() is the core of
branch switching, or reset/merge, which I consider frequent
operations. Cache-tree destruction is bad for "git diff --cached",
"git commit" and maybe more. This discard_index() code was added
recently in e28f764 (unpack-trees: plug a memory leak - 2013-08-13).
As a workaround, perhaps we only do so when the sequencer is running.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Watchman support for git

2014-05-05 Thread Felipe Contreras
David Turner wrote:
> On Fri, 2014-05-02 at 22:40 -0500, Felipe Contreras wrote:
> > David Turner wrote:
> > > On Fri, 2014-05-02 at 18:20 -0500, Felipe Contreras wrote:
> > > > dturner@ wrote:
> > > > > Test repository 1: Linux
> > > > > 
> > > > > Linux is about 45k files in 3k directories.  The average length of a
> > > > > filename is about 32 bytes.
> > > > > 
> > > > > Git status timing:
> > > > > no watchman: 125ms
> > > > > watchman: 90ms
> > > > 
> > > > That's very interesting. Do you get similar improvements when doing
> > > > something similar in Merurial (watchman vs . no watchman).
> > > 
> > > I have not tried it.  My understanding is that this is why Facebook
> > > wrote Watchman and added support for it to Mercurial, so I would assume
> > > that the improvements are at least this good.
> > 
> > Yeah, my bet is that they are actually much better (because Mercurial
> > can't be so optimized as Git).
> >
> > I'm interested in this number because if watchman in Git is improving it
> > by 30%, but in Mercurial it's improving it by 100% (made up number),
> > therefore it makes sens that you might want it more if you are using hg,
> > but not so much if you are using git.
> > 
> > Also, if similar repositories with Mercurial+watchman are actually
> > faster than Git+watchman, that means that there's room for improvement
> > in your implementation. This is not a big issue at this point of the
> > process, just something nice to know.
> 
> Converting git repos to hg seems to be incredibly slow.  (I have not yet
> tried doing it with git-remote-hg).

I haven't profiled the git->hg conversion, only the other way around,
but I don't see why it would be slow in git-remote-hg.

The only restriction is that octopus merges are not supported, so you
would probably need to find another repository.

> But I did find a hg repository for
> linux: https://bitbucket.org/orzel/linux-kernel-stable
> 
> On this repo, I get:
> hg without watchman: 620ms
> hg with watchman: 264ms
> (compared to 125ms/90ms for git).
> 
> The number of syscalls is, perhaps also interesting:
> no watchman / with watchman
> hg  3   /  5421
> git 59180   /  599 
> 
> (About 1/3 of hg's syscalls with watchman seem to be loading Python
> stuff)

Exactly what I thought :) 28% improvement for Git, 76% improvement for
Mercurial. Still, the improvement would be welcome by big companies I
bet.

Cheers.

-- 
Felipe Contreras
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Watchman support for git

2014-05-05 Thread David Turner
On Fri, 2014-05-02 at 22:40 -0500, Felipe Contreras wrote:
> David Turner wrote:
> > On Fri, 2014-05-02 at 18:20 -0500, Felipe Contreras wrote:
> > > dturner@ wrote:
> > > > Test repository 1: Linux
> > > > 
> > > > Linux is about 45k files in 3k directories.  The average length of a
> > > > filename is about 32 bytes.
> > > > 
> > > > Git status timing:
> > > > no watchman: 125ms
> > > > watchman: 90ms
> > > 
> > > That's very interesting. Do you get similar improvements when doing
> > > something similar in Merurial (watchman vs . no watchman).
> > 
> > I have not tried it.  My understanding is that this is why Facebook
> > wrote Watchman and added support for it to Mercurial, so I would assume
> > that the improvements are at least this good.
> 
> Yeah, my bet is that they are actually much better (because Mercurial
> can't be so optimized as Git).
>
> I'm interested in this number because if watchman in Git is improving it
> by 30%, but in Mercurial it's improving it by 100% (made up number),
> therefore it makes sens that you might want it more if you are using hg,
> but not so much if you are using git.
> 
> Also, if similar repositories with Mercurial+watchman are actually
> faster than Git+watchman, that means that there's room for improvement
> in your implementation. This is not a big issue at this point of the
> process, just something nice to know.

Converting git repos to hg seems to be incredibly slow.  (I have not yet
tried doing it with git-remote-hg).  But I did find a hg repository for
linux: https://bitbucket.org/orzel/linux-kernel-stable

On this repo, I get:
hg without watchman: 620ms
hg with watchman: 264ms
(compared to 125ms/90ms for git).

The number of syscalls is, perhaps also interesting:
no watchman / with watchman
hg  3   /  5421
git 59180   /  599 

(About 1/3 of hg's syscalls with watchman seem to be loading Python
stuff)

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-03 Thread Duy Nguyen
On Sun, May 4, 2014 at 3:49 AM, David Turner  wrote:
> On Sat, 2014-05-03 at 15:49 +0700, Duy Nguyen wrote:
>> On Sat, May 3, 2014 at 11:39 AM, David Turner  
>> wrote:
>> >> Index v4 and split index (and the following read-cache daemon,
>> >> hopefully)
>> >
>> > Looking at some of the archives for read-cache daemon, it seems to be
>> > somewhat similar to watchman, right?  But I only saw inotify code; what
>> > about Mac OS?  Or am I misunderstanding what it is?
>>
>> It's mentioned in [1], the second paragraph, mostly to hide index I/O
>> read cost and the SHA-1 hashing cost in the background. In theory it
>> should work on all platforms that support multiple processes and
>> efficient IPC. It can help load watchman file cache faster too.
>
> Yes, that seems like a good idea.
>
> I actually wrote some of a more-complicated, weirder version of this
> idea.  In my version, there was a long-running git daemon process that
> held the index, the watchman file cache, and also objects loaded from
> the object database.  Other git commands would then send their
> command-line and arguments over to the daemon, which would run the
> commands and send stdin/out/err back.  Of course, this is complicated
> because git commands are designed to run then exit, so they often rely
> on variables being initialized to zero, or fail to free memory.  I used
> the Boehm GC to handle the memory freeing problem.  To handle variables
> that needed to be reinitialized, I used __attribute__(section...) to put
> them all into one section, which I could save on daemon startup and
> restore after each command.  I also replaced calls to exit() with a
> function that called longjmp() so the daemon could survive commands
> failing.  Had I continued, I would also have had to track open file
> descriptors to avoid leaking those.
>
> This was a giant mess that only sort-of worked: it was difficult to
> track down all of the variables that needed to be reinitialized.
>
> The advantage of my method is that there was somewhat less data to
> marshall over IPC, and that objects could be easily cached; the
> disadvantage is complexity, massive code changes, and the fact that it
> didn't actually totally work at the time I ran out of time.
>
> So I'm really looking forward to trying your version!

Hm.. I may face the same problem if I'm not careful. So far I think
the daemon only holds index data (with on-disk format, not in-memory),
mainly to cut out SHA-1 hashing cost. This is still at the idea phase
for me though, nothing is materialized yet.

> I would like to merge the feature into master.  It works well for me,
> and some of my colleagues who have tried it out.

Have you tried to turn watchman on by default, then run it with git
test suite? That usually helps.

> I can split the vmac patch into two, but one of them will remain quite
> large because it contains the code for VMAC and AES, which total a bit
> over 100k.  Since the list will probably reject that, I'll post a link
> to a repository containing the patches.

With the read-cache deamon, I think hashing cost is less of an issue,
so new hashing algorithm becomes less important. If you store the file
cache in the deamon's memory only, there's no need to hash anything.
But I guess you already tried this.

> I'm not 100% sure how to split the watchman patch up.  I could add the
> fs_cache code and then separately add the watchman code that populates
> the cache.  Do you think there is a need to divide it up beyond this?

I'll need to have closer look at your patches to give any suggestions.
Although if you don't mind waiting a bit, I can try to put my
untracked cache patches in good shape (hopefully in 2 weeks), then you
can mostly avoid touching dir.c and reuse my work.

I backed away from watchman support because I was worried about its
overhead (of watchman itself, and git/watchman IPC because it's not
designed specifically for git), which led me to try optimizing git as
much as possible without watchman first, then see how/if watchman can
help on top of that. I still think it's a good approach (maybe because
it started to make me doubt if watchman could pull a big performance
win on top to justify the changes to support it)
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-03 Thread David Turner
On Sat, 2014-05-03 at 15:49 +0700, Duy Nguyen wrote:
> On Sat, May 3, 2014 at 11:39 AM, David Turner  
> wrote:
> >> Index v4 and split index (and the following read-cache daemon,
> >> hopefully)
> >
> > Looking at some of the archives for read-cache daemon, it seems to be
> > somewhat similar to watchman, right?  But I only saw inotify code; what
> > about Mac OS?  Or am I misunderstanding what it is?
> 
> It's mentioned in [1], the second paragraph, mostly to hide index I/O
> read cost and the SHA-1 hashing cost in the background. In theory it
> should work on all platforms that support multiple processes and
> efficient IPC. It can help load watchman file cache faster too.

Yes, that seems like a good idea.

I actually wrote some of a more-complicated, weirder version of this
idea.  In my version, there was a long-running git daemon process that
held the index, the watchman file cache, and also objects loaded from
the object database.  Other git commands would then send their
command-line and arguments over to the daemon, which would run the
commands and send stdin/out/err back.  Of course, this is complicated
because git commands are designed to run then exit, so they often rely
on variables being initialized to zero, or fail to free memory.  I used
the Boehm GC to handle the memory freeing problem.  To handle variables
that needed to be reinitialized, I used __attribute__(section...) to put
them all into one section, which I could save on daemon startup and
restore after each command.  I also replaced calls to exit() with a
function that called longjmp() so the daemon could survive commands
failing.  Had I continued, I would also have had to track open file
descriptors to avoid leaking those.

This was a giant mess that only sort-of worked: it was difficult to
track down all of the variables that needed to be reinitialized. 

The advantage of my method is that there was somewhat less data to
marshall over IPC, and that objects could be easily cached; the
disadvantage is complexity, massive code changes, and the fact that it
didn't actually totally work at the time I ran out of time. 

So I'm really looking forward to trying your version!

> >> The last line could be a competition between watchman and my coming
> >> "untracked cache" series. I expect to cut the number in that line at
> >> least in half without external dependency.
> >
> > I hadn't seen the "untracked cached" work (I actually finished these
> > patches a month or so ago but have been waiting for some internal
> > reviews before sending them out).  Looks interesting.  It seems we use a
> > similar strategy for handling ignores.
> 
> Yep, mostly the same at the core, except that I exploit directory
> mtime while you use inotify. Each approach has its own pros and cons,
> I think. Both should face the same traps in caching (e.g. if you "git
> rm --cached" a file, that file could be come either untracked, or
> ignored).
> 
> >> Patch 2/3 did not seem to make it to the list by the way..
> >
> > Thanks for your comments.  I just tried again to send patch 2/3.  I do
> > actually see the CC of it in my @twitter.com mailbox, but I don't see it
> > in the archives on the web.  Do you know if there is a reason the
> > mailing list would reject it?
> 
> Probably its size, 131K, which is also an indicator to split it (and
> the third patch) into smaller patches if you want to merge this
> feature in master eventually.

I would like to merge the feature into master.  It works well for me,
and some of my colleagues who have tried it out.

I can split the vmac patch into two, but one of them will remain quite
large because it contains the code for VMAC and AES, which total a bit
over 100k.  Since the list will probably reject that, I'll post a link
to a repository containing the patches.

I'm not 100% sure how to split the watchman patch up.  I could add the
fs_cache code and then separately add the watchman code that populates
the cache.  Do you think there is a need to divide it up beyond this?

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-03 Thread Duy Nguyen
On Sat, May 3, 2014 at 11:39 AM, David Turner  wrote:
>> Index v4 and split index (and the following read-cache daemon,
>> hopefully)
>
> Looking at some of the archives for read-cache daemon, it seems to be
> somewhat similar to watchman, right?  But I only saw inotify code; what
> about Mac OS?  Or am I misunderstanding what it is?

It's mentioned in [1], the second paragraph, mostly to hide index I/O
read cost and the SHA-1 hashing cost in the background. In theory it
should work on all platforms that support multiple processes and
efficient IPC. It can help load watchman file cache faster too.

>> The last line could be a competition between watchman and my coming
>> "untracked cache" series. I expect to cut the number in that line at
>> least in half without external dependency.
>
> I hadn't seen the "untracked cached" work (I actually finished these
> patches a month or so ago but have been waiting for some internal
> reviews before sending them out).  Looks interesting.  It seems we use a
> similar strategy for handling ignores.

Yep, mostly the same at the core, except that I exploit directory
mtime while you use inotify. Each approach has its own pros and cons,
I think. Both should face the same traps in caching (e.g. if you "git
rm --cached" a file, that file could be come either untracked, or
ignored).

>> Patch 2/3 did not seem to make it to the list by the way..
>
> Thanks for your comments.  I just tried again to send patch 2/3.  I do
> actually see the CC of it in my @twitter.com mailbox, but I don't see it
> in the archives on the web.  Do you know if there is a reason the
> mailing list would reject it?

Probably its size, 131K, which is also an indicator to split it (and
the third patch) into smaller patches if you want to merge this
feature in master eventually.

>   At any rate, the contents may be found
> at
> https://github.com/dturner-tw/git/commit/cf587d54fc72d82a23267348afa2c4b60f14ce51.diff

Good enough for me :)

>
>> initial
>> reaction is storing the list of all paths seems too much, but I'll
>> need to play with it a bit to understand it.
>
> I wonder if it would make sense to use the untracked cache as the
> storage strategy, but use watchman as the update strategy.

I'm afraid not. If a directory mtime is changed, which means
files/dirs have been added or deleted, the untracked code would fall
back to the opendir/readdir/is_excluded dance again on that directory.
If we naively do the same using watchman, we lose its advantage that
it knows exactly what files/dirs are added/removed. That kind of
knowledge can help speed up the dance, which is not stored anywhere in
the untracked cache.

We could extend the "read-cache daemon" mentioned above though, to
hide all the hard work in the background and present a good view to
git: when a file/dir is added, read-cache daemon classifies the new
files/dirs as tracked/untracked/ignore and update its untracked cache
in memory. When "git status" asks about the index and untracked cache,
it will receive the _updated_ cache (not the on disk version any more)
with latest dir mtime so git can verify the cache is perfect and skip
opendir/ All git does is to write the index down in the end to
make the updated data permanent. It sounds interesting. But I'm not so
sure if it's worth the complexity.

[1] http://article.gmane.org/gmane.comp.version-control.git/247268
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-02 Thread David Turner
On Sat, 2014-05-03 at 07:52 +0700, Duy Nguyen wrote:
> On Sat, May 3, 2014 at 6:14 AM,   wrote:
> > The index format change might be less important with the split index;
> > I haven't investigated that since at the time I wrote these patches,
> > it didn't exist.
> 
> This is the worst case scenario of "git status" on webkit.git (182k
> files, path name 74 byte long on average), hot cache, no SSD
> 
>366.379ms gitmodules_config:199 if (read_cache() < 0) die("index file
>  0.004ms cmd_status:1294 read_cache_preload(&s.pathspec);
>488.433ms cmd_status:1295 refresh_index(&the_index, REFRESH_QUIE
>456.495ms cmd_status:1299 update_index_if_able(&the_index, &inde
> 13.088ms wt_status_collect:616 wt_status_collect_changes_worktree(s)
>706.926ms wt_status_collect:621 wt_status_collect_changes_index(s)
>100.495ms lazy_init_name_hash:136 { int nr; if (istate->name_hash_initia
>921.185ms wt_status_collect:622 wt_status_collect_untracked(s)
> 
> real0m2.969s
> user0m1.943s
> sys 0m1.021s

For me, those times are:
0m0.581s (no watchman, index v4)
0m0.465s (watchman, index v4)
0m0.445s (watchman, index v5)

That's not huge win on its own, but (a) it's better than nothing and (b)
it lays the groundwork for other improvements.

A fair amount (~12%) of the time seems to be spent in zlib; this varies
based on how the data is packed IIRC. 

> Index v4 and split index (and the following read-cache daemon,
> hopefully) 

Looking at some of the archives for read-cache daemon, it seems to be
somewhat similar to watchman, right?  But I only saw inotify code; what
about Mac OS?  Or am I misunderstanding what it is?

> should help reduce numbers of the 1st and 4th lines, I
> expect to less than 50ms each line. lazy_init_name_hash could be taken
> away with read-cache daemon also.
> 
> core.preloadindex can cut the total number of 2nd and 3rd lines by
> half. Watchman should help in these two lines, but it should do better
> than core.preloadindex.
> 
> wt_status_collect_changes_index() depends on how damaged cache-tree is
> (in this case, totally scraped). watchman does not help this either.
> We need to try to "heal" cache-tree as much as possible to reduce the
> number.
> 
> The last line could be a competition between watchman and my coming
> "untracked cache" series. I expect to cut the number in that line at
> least in half without external dependency.

I hadn't seen the "untracked cached" work (I actually finished these
patches a month or so ago but have been waiting for some internal
reviews before sending them out).  Looks interesting.  It seems we use a
similar strategy for handling ignores.

> Patch 2/3 did not seem to make it to the list by the way.. 

Thanks for your comments.  I just tried again to send patch 2/3.  I do
actually see the CC of it in my @twitter.com mailbox, but I don't see it
in the archives on the web.  Do you know if there is a reason the
mailing list would reject it?  At any rate, the contents may be found
at 
https://github.com/dturner-tw/git/commit/cf587d54fc72d82a23267348afa2c4b60f14ce51.diff

> initial
> reaction is storing the list of all paths seems too much, but I'll
> need to play with it a bit to understand it.

I wonder if it would make sense to use the untracked cache as the
storage strategy, but use watchman as the update strategy.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Watchman support for git

2014-05-02 Thread Felipe Contreras
David Turner wrote:
> On Fri, 2014-05-02 at 18:20 -0500, Felipe Contreras wrote:
> > dturner@ wrote:
> > > Test repository 1: Linux
> > > 
> > > Linux is about 45k files in 3k directories.  The average length of a
> > > filename is about 32 bytes.
> > > 
> > > Git status timing:
> > > no watchman: 125ms
> > > watchman: 90ms
> > 
> > That's very interesting. Do you get similar improvements when doing
> > something similar in Merurial (watchman vs . no watchman).
> 
> I have not tried it.  My understanding is that this is why Facebook
> wrote Watchman and added support for it to Mercurial, so I would assume
> that the improvements are at least this good.

Yeah, my bet is that they are actually much better (because Mercurial
can't be so optimized as Git).

I'm interested in this number because if watchman in Git is improving it
by 30%, but in Mercurial it's improving it by 100% (made up number),
therefore it makes sens that you might want it more if you are using hg,
but not so much if you are using git.

Also, if similar repositories with Mercurial+watchman are actually
faster than Git+watchman, that means that there's room for improvement
in your implementation. This is not a big issue at this point of the
process, just something nice to know.

-- 
Felipe Contreras
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Watchman support for git

2014-05-02 Thread David Turner
On Fri, 2014-05-02 at 18:20 -0500, Felipe Contreras wrote:
> dturner@ wrote:
> > Test repository 1: Linux
> > 
> > Linux is about 45k files in 3k directories.  The average length of a
> > filename is about 32 bytes.
> > 
> > Git status timing:
> > no watchman: 125ms
> > watchman: 90ms
> 
> That's very interesting. Do you get similar improvements when doing
> something similar in Merurial (watchman vs . no watchman).

I have not tried it.  My understanding is that this is why Facebook
wrote Watchman and added support for it to Mercurial, so I would assume
that the improvements are at least this good.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Watchman support for git

2014-05-02 Thread Duy Nguyen
On Sat, May 3, 2014 at 6:14 AM,   wrote:
> The index format change might be less important with the split index;
> I haven't investigated that since at the time I wrote these patches,
> it didn't exist.

This is the worst case scenario of "git status" on webkit.git (182k
files, path name 74 byte long on average), hot cache, no SSD

   366.379ms gitmodules_config:199 if (read_cache() < 0) die("index file
 0.004ms cmd_status:1294 read_cache_preload(&s.pathspec);
   488.433ms cmd_status:1295 refresh_index(&the_index, REFRESH_QUIE
   456.495ms cmd_status:1299 update_index_if_able(&the_index, &inde
13.088ms wt_status_collect:616 wt_status_collect_changes_worktree(s)
   706.926ms wt_status_collect:621 wt_status_collect_changes_index(s)
   100.495ms lazy_init_name_hash:136 { int nr; if (istate->name_hash_initia
   921.185ms wt_status_collect:622 wt_status_collect_untracked(s)

real0m2.969s
user0m1.943s
sys 0m1.021s

Index v4 and split index (and the following read-cache daemon,
hopefully) should help reduce numbers of the 1st and 4th lines, I
expect to less than 50ms each line. lazy_init_name_hash could be taken
away with read-cache daemon also.

core.preloadindex can cut the total number of 2nd and 3rd lines by
half. Watchman should help in these two lines, but it should do better
than core.preloadindex.

wt_status_collect_changes_index() depends on how damaged cache-tree is
(in this case, totally scraped). watchman does not help this either.
We need to try to "heal" cache-tree as much as possible to reduce the
number.

The last line could be a competition between watchman and my coming
"untracked cache" series. I expect to cut the number in that line at
least in half without external dependency.

Patch 2/3 did not seem to make it to the list by the way.. initial
reaction is storing the list of all paths seems too much, but I'll
need to play with it a bit to understand it.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Watchman support for git

2014-05-02 Thread Felipe Contreras
dturner@ wrote:
> Test repository 1: Linux
> 
> Linux is about 45k files in 3k directories.  The average length of a
> filename is about 32 bytes.
> 
> Git status timing:
> no watchman: 125ms
> watchman: 90ms

That's very interesting. Do you get similar improvements when doing
something similar in Merurial (watchman vs . no watchman).

-- 
Felipe Contreras
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html