Re: [PATCH v7 17/31] merge-recursive: add a new hashmap for storing directory renames

2018-02-05 Thread Elijah Newren
On Mon, Feb 5, 2018 at 11:44 AM, Stefan Beller  wrote:
>>> Having a stringlist of potentially new dirs sounds like the algorithm is
>>> at least n^2, but how do I know? I'll read on.
>>
>> Yes, I suppose it's technically n^2, but n is expected to be O(1).
>> While one can trivially construct a case making n arbitrarily large,
>> statistically for real world repositories, I expected the mode of n to
>> be 1 and the mean to be less than 2.  My original idea was to use a
>> hash for possible_new_dirs, but since hashes are so painful in C and n
>> should be very small anyway, I didn't bother.  If anyone can find an
>> example of a real world open source repository (linux, webkit, git,
>> etc.) with a merge where n is greater than about 10, I'll be
>> surprised.
>>
>> Does that address your concern, or does it sound like I'm kicking the
>> can down the road?  If it's the latter, we can switch it out.
>
> I think that is fine for now; the real world usage matters more
> than the big O notation. But maybe you want to hint at the possibility of
> speedup (in the commit message or in code?), once someone produces
> a slow case and digs up the code.

Sounds like a good idea; I'll add a comment about this issue to the
commit message.


Re: [PATCH v7 17/31] merge-recursive: add a new hashmap for storing directory renames

2018-02-05 Thread Junio C Hamano
Johannes Sixt  writes:

> Am 03.02.2018 um 22:34 schrieb Elijah Newren:
>>  If anyone can find an
>> example of a real world open source repository (linux, webkit, git,
>> etc.) with a merge where n is greater than about 10, I'll be
>> surprised.
>
> git rev-list --parents --merges master |
>  grep " .* .* .* .* .* .* .* .* .* .* "

Twinkle twinkle little stars? ;-)

rev-list can take --min-parents=10 option for a thing like this.  In
fact, --merges is implemented in terms of that option.



Re: [PATCH v7 17/31] merge-recursive: add a new hashmap for storing directory renames

2018-02-05 Thread Stefan Beller
>> Having a stringlist of potentially new dirs sounds like the algorithm is
>> at least n^2, but how do I know? I'll read on.
>
> Yes, I suppose it's technically n^2, but n is expected to be O(1).
> While one can trivially construct a case making n arbitrarily large,
> statistically for real world repositories, I expected the mode of n to
> be 1 and the mean to be less than 2.  My original idea was to use a
> hash for possible_new_dirs, but since hashes are so painful in C and n
> should be very small anyway, I didn't bother.  If anyone can find an
> example of a real world open source repository (linux, webkit, git,
> etc.) with a merge where n is greater than about 10, I'll be
> surprised.
>
> Does that address your concern, or does it sound like I'm kicking the
> can down the road?  If it's the latter, we can switch it out.

I think that is fine for now; the real world usage matters more
than the big O notation. But maybe you want to hint at the possibility of
speedup (in the commit message or in code?), once someone produces
a slow case and digs up the code.


Re: [PATCH v7 17/31] merge-recursive: add a new hashmap for storing directory renames

2018-02-05 Thread Elijah Newren
// Re-sending because of bounce

On Sun, Feb 4, 2018 at 12:54 AM, Johannes Sixt  wrote:
> Am 03.02.2018 um 22:34 schrieb Elijah Newren:
>>  If anyone can find an
>> example of a real world open source repository (linux, webkit, git,
>> etc.) with a merge where n is greater than about 10, I'll be
>> surprised.
>
> git rev-list --parents --merges master |
>  grep " .* .* .* .* .* .* .* .* .* .* "
>
> returns quite a few hits in the Linux repository. Most notable is
> fa623d1b0222adbe8f822e53c08003b9679a410c; spoiler: it has 30 parents.
>
> -- Hannes

Sorry, I didn't make it very clear what n represented. This is in the
context of detecting directory renames, and in this discussion n
represents the number of distinct directories that files were renamed
into from a single original directory on a given side of the merge. So
your example of number of parents of a commit isn't directly relevant
to this case (also, even along your tangent, merge-recursive is only
invoked when the number of parents is precisely two). However, I find
your nugget rather interesting, even if unrelated. I had known of
merges with more than 10 parents, but somehow was unaware that the
limit of 16 parents had been lifted. And digging through the history,
it was apparently removed quite a while ago. I love the commit message
that lifted the limit too:

"There is no really good reason to have a merge with more than 16
parents, but we have a history of giving our users rope."

:-)


Re: [PATCH v7 17/31] merge-recursive: add a new hashmap for storing directory renames

2018-02-04 Thread Johannes Sixt
Am 03.02.2018 um 22:34 schrieb Elijah Newren:
>  If anyone can find an
> example of a real world open source repository (linux, webkit, git,
> etc.) with a merge where n is greater than about 10, I'll be
> surprised.

git rev-list --parents --merges master |
 grep " .* .* .* .* .* .* .* .* .* .* "

returns quite a few hits in the Linux repository. Most notable is
fa623d1b0222adbe8f822e53c08003b9679a410c; spoiler: it has 30 parents.

-- Hannes


Re: [PATCH v7 17/31] merge-recursive: add a new hashmap for storing directory renames

2018-02-03 Thread Elijah Newren
On Fri, Feb 2, 2018 at 4:26 PM, Stefan Beller  wrote:
> On Tue, Jan 30, 2018 at 3:25 PM, Elijah Newren  wrote:

>> +static void dir_rename_init(struct hashmap *map)
>> +{
>> +   hashmap_init(map, (hashmap_cmp_fn) dir_rename_cmp, NULL, 0);
>
> See 55c965f3a2 (Merge branch 'sb/hashmap-cleanup', 2017-08-11) or rather
> 201c14e375 (attr.c: drop hashmap_cmp_fn cast, 2017-06-30) for an attempt to
> keep out the casting in the init, but cast the void->type inside the function.

Fixed (locally).

>> +struct dir_rename_entry {
>> +   struct hashmap_entry ent; /* must be the first member! */
>> +   char *dir;
>> +   unsigned non_unique_new_dir:1;
>> +   struct strbuf new_dir;
>> +   struct string_list possible_new_dirs;
>> +};
>
> Could you add comments what these are and if they have any constraints?
> e.g. is 'dir' the full path from the root of the repo and does it end
> with a '/' ?

Good idea, will do.  To help you out with the rest of your review:

yes, dir is full path from the root of the repo, but no it does not
end with a '/'.  The same two constraints hold for how I store
directories in other variables as well.

If someone renames multiple files in a single directory to several
different other directories, then possible_new_dirs is used
(temporarily) to track what those other directories are and how many
files were moved from `dir` to that directory.  The 'winner' (the
target directory with the most renames from `dir` to it) will be
recorded in new_dir.  If there is no 'winner' (a tie for target
directory with the most renames), then non_unique_new_dir is set.
Once we can set either non_unique_new_dir or possible_new_dirs, then
possible_new_dirs is unnecessary and can be freed.

> Having a stringlist of potentially new dirs sounds like the algorithm is
> at least n^2, but how do I know? I'll read on.

Yes, I suppose it's technically n^2, but n is expected to be O(1).
While one can trivially construct a case making n arbitrarily large,
statistically for real world repositories, I expected the mode of n to
be 1 and the mean to be less than 2.  My original idea was to use a
hash for possible_new_dirs, but since hashes are so painful in C and n
should be very small anyway, I didn't bother.  If anyone can find an
example of a real world open source repository (linux, webkit, git,
etc.) with a merge where n is greater than about 10, I'll be
surprised.

Does that address your concern, or does it sound like I'm kicking the
can down the road?  If it's the latter, we can switch it out.

> This patch only adds static functions, so some compilers may even refuse
> to compile after this patch (-Werror -Wunused). I am not sure how strict we
> are there, but as git-bisect still hasn't learned about going only
> into the first
> parent to have bisect working on a "per series" level rather than a "per 
> commit"
> level, it is possible that someone will end up on this commit in the future 
> and
> it won't compile well. :/
>
> Not sure what to recommend, maybe squash this with the patch that makes
> use of these functions?

I'm unsure as well; I can combine it with 19/31 but I thought that
keeping them separate made it slightly easier to review both.


Re: [PATCH v7 17/31] merge-recursive: add a new hashmap for storing directory renames

2018-02-02 Thread Stefan Beller
On Tue, Jan 30, 2018 at 3:25 PM, Elijah Newren  wrote:
> This just adds dir_rename_entry and the associated functions; code using
> these will be added in subsequent commits.
>
> Signed-off-by: Elijah Newren 
> ---
>  merge-recursive.c | 35 +++
>  merge-recursive.h |  8 
>  2 files changed, 43 insertions(+)
>
> diff --git a/merge-recursive.c b/merge-recursive.c
> index 8ac69e1cbb..3b6d0e3f70 100644
> --- a/merge-recursive.c
> +++ b/merge-recursive.c
> @@ -49,6 +49,41 @@ static unsigned int path_hash(const char *path)
> return ignore_case ? strihash(path) : strhash(path);
>  }
>
> +static struct dir_rename_entry *dir_rename_find_entry(struct hashmap 
> *hashmap,
> + char *dir)
> +{
> +   struct dir_rename_entry key;
> +
> +   if (dir == NULL)
> +   return NULL;
> +   hashmap_entry_init(, strhash(dir));
> +   key.dir = dir;
> +   return hashmap_get(hashmap, , NULL);
> +}
> +
> +static int dir_rename_cmp(void *unused_cmp_data,
> + const struct dir_rename_entry *e1,
> + const struct dir_rename_entry *e2,
> + const void *unused_keydata)
> +{
> +   return strcmp(e1->dir, e2->dir);
> +}
> +
> +static void dir_rename_init(struct hashmap *map)
> +{
> +   hashmap_init(map, (hashmap_cmp_fn) dir_rename_cmp, NULL, 0);

See 55c965f3a2 (Merge branch 'sb/hashmap-cleanup', 2017-08-11) or rather
201c14e375 (attr.c: drop hashmap_cmp_fn cast, 2017-06-30) for an attempt to
keep out the casting in the init, but cast the void->type inside the function.

> +static void dir_rename_entry_init(struct dir_rename_entry *entry,
> + char *directory)
> +{
> +   hashmap_entry_init(entry, strhash(directory));
> +   entry->dir = directory;
> +   entry->non_unique_new_dir = 0;
> +   strbuf_init(>new_dir, 0);
> +   string_list_init(>possible_new_dirs, 0);
> +}
> +
>  static void flush_output(struct merge_options *o)
>  {
> if (o->buffer_output < 2 && o->obuf.len) {
> diff --git a/merge-recursive.h b/merge-recursive.h
> index 80d69d1401..d7f4cc80c1 100644
> --- a/merge-recursive.h
> +++ b/merge-recursive.h
> @@ -29,6 +29,14 @@ struct merge_options {
> struct string_list df_conflict_file_set;
>  };
>
> +struct dir_rename_entry {
> +   struct hashmap_entry ent; /* must be the first member! */
> +   char *dir;
> +   unsigned non_unique_new_dir:1;
> +   struct strbuf new_dir;
> +   struct string_list possible_new_dirs;
> +};

Could you add comments what these are and if they have any constraints?
e.g. is 'dir' the full path from the root of the repo and does it end
with a '/' ?

Having a stringlist of potentially new dirs sounds like the algorithm is
at least n^2, but how do I know? I'll read on.

This patch only adds static functions, so some compilers may even refuse
to compile after this patch (-Werror -Wunused). I am not sure how strict we
are there, but as git-bisect still hasn't learned about going only
into the first
parent to have bisect working on a "per series" level rather than a "per commit"
level, it is possible that someone will end up on this commit in the future and
it won't compile well. :/

Not sure what to recommend, maybe squash this with the patch that makes
use of these functions?

Thanks,
Stefan