Re: [PATCH v2 0/5] New hash table implementation

2013-09-26 Thread Karsten Blees
Am 24.09.2013 13:16, schrieb Tay Ray Chuan:
> Hi Karsten,
> 
> On Tue, Sep 24, 2013 at 5:50 PM, Karsten Blees  
> wrote:
>>
>> |   add|  get 100% hits  |get 10% hits
>> |  hash  | hashmap | hash  | hashmap |  hash   | hashmap
>> ++-+---+-+-+
>> FNV | 14.815 |   2.345 | 3.059 |   1.642 |   4.085 |   0.976
>> FNV  x2 | 14.409 |   2.706 | 2.888 |   1.959 |   3.905 |   1.393
>> i   |  7.432 |   1.593 | 1.364 |   1.142 | 413.023 |   0.589
>> ix2 |  9.169 |   1.866 | 1.427 |   1.163 |   0.757 |   0.670
>> i/10|  1.800 |   1.555 | 5.365 |   6.465 |  32.918 |   1.052
>> i/10 x2 |  1.892 |   1.555 | 5.386 |   6.474 |   1.123 |   1.206
>>
>> Tests can be reproduced with 'time echo "perfhash[map]  1000" | 
>> ./test-hashmap', see test-hashmap.c for definition of method flags.
> 
> I'm not sure if I'm reading the numbers right, but they look impressive!
> 

The numbers are for 100 million additions / lookups (1,000 rounds á 100,000 
entries). Considering everything else that happens in git, the hash table 
performance should be insignificant, though.

> If it's not too much trouble, could you put together an API document,
> along the lines of Documentation/technical/api-hash.txt?

Yes, I had already planned to port the documentation to asciidoc. Although in 
my experience, API documentation in the header file tends to better stay in 
sync with code changes (but this only makes real sense with extraction tools 
such as doxygen).

> I could give
> a stab at replacing patience and histogram diff's hash implementation
> with yours.
> 

Open addressing (i.e. distributing conflicting entries to other buckes) *may* 
be faster *if* all data fits into the table (i.e. no pointers to the data are 
used). Scanning such a table (without following pointers) has very high 
locality and thus may benefit from accessing fewer CPU cache lines. The 
patience implementation seems to fall into this category (although the entry 
struct is fairly large, and it also uses the *2 trick to defeat bad hash codes 
(which wouldn't be necessary with chaining)).

Both patience and histogram use preallocated, fixed-size hash tables, and thus 
won't benefit from faster inserts (the 'add' performance numbers are for 
dynamically resized hash tables).

So, converting patience/histogram is probably not worth the trouble for 
performance reasons alone. If it also simplifies the algorithms and/or reduces 
memory usage - fine.

Ciao,
Karsten
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/5] New hash table implementation

2013-09-26 Thread Karsten Blees
Am 26.09.2013 12:16, schrieb Fredrik Gustafsson:
> On Tue, Sep 24, 2013 at 11:50:16AM +0200, Karsten Blees wrote:
>> Tests can be reproduced with 'time echo "perfhash[map]  1000" | 
>> ./test-hashmap', see test-hashmap.c for definition of method flags.
> 
> So I'm still curious about the actual performance improvements for git.
> I runned git describe on the linux kernel with both the old hashmap and
> this new one:
> 

Performance was never the primary issue, the intention of the performance tests 
was to ensure that the new implementation doesn't *slow down* git.

>From the original PATCH/RFC:
- O(1) remove
- builtin entry chaining
- ready-to-use FNV-1 hash functions
- unit test
- additions are ~twice as fast
- uses less memory

So, the new implementation allows us to get rid of workarounds such as the 
CE_UNHASHED flag, duplicate entry chaining code and hash_name() 
implementations. It also addresses the memory usage FIXME in hash.h.

The simplified API may help prevent bugs such as the broken entry chaining in 
name-hash.c (see commits 2548183, 395c735, 2092678).

Maybe we can also replace some of the custom hash table implementations in 
attr.c, decorate.c, fast-import.c and object.c (to name just a few...).

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/5] New hash table implementation

2013-09-26 Thread Duy Nguyen
On Thu, Sep 26, 2013 at 6:08 PM, Fredrik Gustafsson  wrote:
> On Thu, Sep 26, 2013 at 05:26:27PM +0700, Duy Nguyen wrote:
>> On Thu, Sep 26, 2013 at 5:16 PM, Fredrik Gustafsson  wrote:
>> > On Tue, Sep 24, 2013 at 11:50:16AM +0200, Karsten Blees wrote:
>> >> Tests can be reproduced with 'time echo "perfhash[map]  1000" | 
>> >> ./test-hashmap', see test-hashmap.c for definition of method flags.
>> >
>> > So I'm still curious about the actual performance improvements for git.
>> > I runned git describe on the linux kernel with both the old hashmap and
>> > this new one:
>> >
>> > ...
>> >
>> > I can't see any improvements at all here. What do I miss? Am I running
>> > git describe in the wrong way? Does linux.git have too few tags to be
>> > important?
>>
>> I wonder if it makes any difference if there are a lot more refs. I
>> hear gerrit creates a lot but don't know how many. linux-2.6 has ~350
>> refs. How about increasing the number of refs to 3500 refs?
>
> So I runned:
> for i in $(git rev-list HEAD ); do git tag "tag$i" $i ; done
>
> in my linux repo and aborted it after a while:
> iveqy@minilla:/srv/slask/linux$ git tag | wc -l
> 9323
>
> So it's a few at least. Not sure how those artificial tagnames would
> hurt or improve the performance.
>
> Old hashtable
> =
> iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD
> v3.12-rc2-83-g4b97280
>
> real0m0.384s
> user0m0.288s
> sys 0m0.092s
> iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD
> v3.12-rc2-83-g4b97280
>
> real0m0.383s
> user0m0.284s
> sys 0m0.100s
> iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD
> v3.12-rc2-83-g4b97280
>
> real0m0.386s
> user0m0.312s
> sys 0m0.072s
>
>
> New hashtable
> =
> iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD
> v3.12-rc2-83-g4b97280
>
> real0m0.382s
> user0m0.300s
> sys 0m0.084s
> iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD
> v3.12-rc2-83-g4b97280
>
> real0m0.382s
> user0m0.288s
> sys 0m0.092s
> iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD
> v3.12-rc2-83-g4b97280
>
> real0m0.384s
> user0m0.296s
> sys 0m0.088s

OK I have to say I don't see any justification for more code then.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/5] New hash table implementation

2013-09-26 Thread Fredrik Gustafsson
On Thu, Sep 26, 2013 at 05:26:27PM +0700, Duy Nguyen wrote:
> On Thu, Sep 26, 2013 at 5:16 PM, Fredrik Gustafsson  wrote:
> > On Tue, Sep 24, 2013 at 11:50:16AM +0200, Karsten Blees wrote:
> >> Tests can be reproduced with 'time echo "perfhash[map]  1000" | 
> >> ./test-hashmap', see test-hashmap.c for definition of method flags.
> >
> > So I'm still curious about the actual performance improvements for git.
> > I runned git describe on the linux kernel with both the old hashmap and
> > this new one:
> >
> > ...
> >
> > I can't see any improvements at all here. What do I miss? Am I running
> > git describe in the wrong way? Does linux.git have too few tags to be
> > important?
> 
> I wonder if it makes any difference if there are a lot more refs. I
> hear gerrit creates a lot but don't know how many. linux-2.6 has ~350
> refs. How about increasing the number of refs to 3500 refs?

So I runned:
for i in $(git rev-list HEAD ); do git tag "tag$i" $i ; done

in my linux repo and aborted it after a while:
iveqy@minilla:/srv/slask/linux$ git tag | wc -l
9323

So it's a few at least. Not sure how those artificial tagnames would
hurt or improve the performance.

Old hashtable
=
iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD 
v3.12-rc2-83-g4b97280

real0m0.384s
user0m0.288s
sys 0m0.092s
iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD 
v3.12-rc2-83-g4b97280

real0m0.383s
user0m0.284s
sys 0m0.100s
iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD 
v3.12-rc2-83-g4b97280

real0m0.386s
user0m0.312s
sys 0m0.072s


New hashtable
=
iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD 
v3.12-rc2-83-g4b97280

real0m0.382s
user0m0.300s
sys 0m0.084s
iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD 
v3.12-rc2-83-g4b97280

real0m0.382s
user0m0.288s
sys 0m0.092s
iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD 
v3.12-rc2-83-g4b97280

real0m0.384s
user0m0.296s
sys 0m0.088s


-- 
Med vänliga hälsningar
Fredrik Gustafsson

tel: 0733-608274
e-post: iv...@iveqy.com
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/5] New hash table implementation

2013-09-26 Thread Duy Nguyen
On Thu, Sep 26, 2013 at 5:16 PM, Fredrik Gustafsson  wrote:
> On Tue, Sep 24, 2013 at 11:50:16AM +0200, Karsten Blees wrote:
>> Tests can be reproduced with 'time echo "perfhash[map]  1000" | 
>> ./test-hashmap', see test-hashmap.c for definition of method flags.
>
> So I'm still curious about the actual performance improvements for git.
> I runned git describe on the linux kernel with both the old hashmap and
> this new one:
>
> ...
>
> I can't see any improvements at all here. What do I miss? Am I running
> git describe in the wrong way? Does linux.git have too few tags to be
> important?

I wonder if it makes any difference if there are a lot more refs. I
hear gerrit creates a lot but don't know how many. linux-2.6 has ~350
refs. How about increasing the number of refs to 3500 refs?
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/5] New hash table implementation

2013-09-26 Thread Fredrik Gustafsson
On Tue, Sep 24, 2013 at 11:50:16AM +0200, Karsten Blees wrote:
> Tests can be reproduced with 'time echo "perfhash[map]  1000" | 
> ./test-hashmap', see test-hashmap.c for definition of method flags.

So I'm still curious about the actual performance improvements for git.
I runned git describe on the linux kernel with both the old hashmap and
this new one:

With old hashmap

iveqy@minilla:/srv/slask/linux$ time ../git/git describe
v3.12-rc2-83-g4b97280

real0m0.236s
user0m0.216s
sys 0m0.020s
iveqy@minilla:/srv/slask/linux$ time ../git/git describe
v3.12-rc2-83-g4b97280

real0m0.236s
user0m0.220s
sys 0m0.016s
iveqy@minilla:/srv/slask/linux$ time ../git/git describe
v3.12-rc2-83-g4b97280

real0m0.236s
user0m0.212s
sys 0m0.024s

With new hashmap

iveqy@minilla:/srv/slask/linux$ time ../git/git describe
v3.12-rc2-83-g4b97280

real0m0.236s
user0m0.216s
sys 0m0.020s
iveqy@minilla:/srv/slask/linux$ time ../git/git describe
v3.12-rc2-83-g4b97280

real0m0.235s
user0m0.216s
sys 0m0.020s
iveqy@minilla:/srv/slask/linux$ time ../git/git describe
v3.12-rc2-83-g4b97280

real0m0.235s
user0m0.208s
sys 0m0.028s


I can't see any improvements at all here. What do I miss? Am I running
git describe in the wrong way? Does linux.git have too few tags to be
important?

-- 
Med vänliga hälsningar
Fredrik Gustafsson

tel: 0733-608274
e-post: iv...@iveqy.com
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/5] New hash table implementation

2013-09-24 Thread Tay Ray Chuan
Hi Karsten,

On Tue, Sep 24, 2013 at 5:50 PM, Karsten Blees  wrote:
>
> |   add|  get 100% hits  |get 10% hits
> |  hash  | hashmap | hash  | hashmap |  hash   | hashmap
> ++-+---+-+-+
> FNV | 14.815 |   2.345 | 3.059 |   1.642 |   4.085 |   0.976
> FNV  x2 | 14.409 |   2.706 | 2.888 |   1.959 |   3.905 |   1.393
> i   |  7.432 |   1.593 | 1.364 |   1.142 | 413.023 |   0.589
> ix2 |  9.169 |   1.866 | 1.427 |   1.163 |   0.757 |   0.670
> i/10|  1.800 |   1.555 | 5.365 |   6.465 |  32.918 |   1.052
> i/10 x2 |  1.892 |   1.555 | 5.386 |   6.474 |   1.123 |   1.206
>
> Tests can be reproduced with 'time echo "perfhash[map]  1000" | 
> ./test-hashmap', see test-hashmap.c for definition of method flags.

I'm not sure if I'm reading the numbers right, but they look impressive!

If it's not too much trouble, could you put together an API document,
along the lines of Documentation/technical/api-hash.txt? I could give
a stab at replacing patience and histogram diff's hash implementation
with yours.

-- 
Cheers,
Ray Chuan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/5] New hash table implementation

2013-09-24 Thread Fredrik Gustafsson
On Tue, Sep 24, 2013 at 11:50:16AM +0200, Karsten Blees wrote:
> Regarding performance, I have to admit that the difference between the two 
> implementations is far greater than I had anticipated. The following times 
> (in seconds) are from Linux x64 (Debian Sarge) on a Core i7 860 @2.8GHz. All 
> tests have been run with 1,000 rounds of 100,000 entries each.
> 
> The 'get 10% hits' test does 100,000 lookups on a table with 10,000 entries 
> (i.e. 90% unsuccessful lookups).
> 
> The rows denote different hash functions with different qualities:
> - FNV: FNV-1 hash on stringified loop counter (i.e. fnv1(itoa(i))), as
>   an example of a high quality / low collision hash
> - i: just the loop counter (i.e. 0, 1, 2,...), guaranteed collision free
> - i/10: every 10 entries share the same hash code, lots of collisions
> 
> The i and i/10 tests show that open addressing suffers badly from clustering, 
> i.e. with adjacent hash codes, it degrades to linear search. The *2 versions 
> provide for some space between used buckets to better compare it to the 
> chaining version.
> 
> 
> |   add|  get 100% hits  |get 10% hits
> |  hash  | hashmap | hash  | hashmap |  hash   | hashmap
> ++-+---+-+-+
> FNV | 14.815 |   2.345 | 3.059 |   1.642 |   4.085 |   0.976
> FNV  x2 | 14.409 |   2.706 | 2.888 |   1.959 |   3.905 |   1.393
> i   |  7.432 |   1.593 | 1.364 |   1.142 | 413.023 |   0.589
> ix2 |  9.169 |   1.866 | 1.427 |   1.163 |   0.757 |   0.670
> i/10|  1.800 |   1.555 | 5.365 |   6.465 |  32.918 |   1.052
> i/10 x2 |  1.892 |   1.555 | 5.386 |   6.474 |   1.123 |   1.206
> 
> Tests can be reproduced with 'time echo "perfhash[map]  1000" | 
> ./test-hashmap', see test-hashmap.c for definition of method flags.
> 

So I did this improved hash implementation a few months back. Although I
could do a test like this and see an improvement, I failed to see an
improvement in actual git usage.

Hopefully it was just me doing something wrong, but I abandonned the
idea of a better hashmap since I couldn't see any major performance
boost using git and the current implementation is really simple and easy
to maintain.

So my question to you is, does your hashmap speed up git? And does it
speed it up enough to justify that your implementation is the double
amount of code than the current?
-- 
Med vänliga hälsningar
Fredrik Gustafsson

tel: 0733-608274
e-post: iv...@iveqy.com
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 0/5] New hash table implementation

2013-09-24 Thread Karsten Blees
Also here: https://github.com/kblees/git/tree/kb/hashmap-v2

Changes since initial version:
- removed struct typedefs
- clarified documentation of hashmap_entry
- renamed find_entry -> find_entry_ptr
- added performance tests for lookup


I've also tried to resize the table based on the number of used buckets 
(instead of total entries). However, this doesn't work with hash functions that 
produce 'gaps'. E.g. a hash function that returns only even numbers would fill 
only every second bucket (i.e. max. 50% buckets used), and the 80% resize 
threshold is never reached.


Regarding performance, I have to admit that the difference between the two 
implementations is far greater than I had anticipated. The following times (in 
seconds) are from Linux x64 (Debian Sarge) on a Core i7 860 @2.8GHz. All tests 
have been run with 1,000 rounds of 100,000 entries each.

The 'get 10% hits' test does 100,000 lookups on a table with 10,000 entries 
(i.e. 90% unsuccessful lookups).

The rows denote different hash functions with different qualities:
- FNV: FNV-1 hash on stringified loop counter (i.e. fnv1(itoa(i))), as
  an example of a high quality / low collision hash
- i: just the loop counter (i.e. 0, 1, 2,...), guaranteed collision free
- i/10: every 10 entries share the same hash code, lots of collisions

The i and i/10 tests show that open addressing suffers badly from clustering, 
i.e. with adjacent hash codes, it degrades to linear search. The *2 versions 
provide for some space between used buckets to better compare it to the 
chaining version.


|   add|  get 100% hits  |get 10% hits
|  hash  | hashmap | hash  | hashmap |  hash   | hashmap
++-+---+-+-+
FNV | 14.815 |   2.345 | 3.059 |   1.642 |   4.085 |   0.976
FNV  x2 | 14.409 |   2.706 | 2.888 |   1.959 |   3.905 |   1.393
i   |  7.432 |   1.593 | 1.364 |   1.142 | 413.023 |   0.589
ix2 |  9.169 |   1.866 | 1.427 |   1.163 |   0.757 |   0.670
i/10|  1.800 |   1.555 | 5.365 |   6.465 |  32.918 |   1.052
i/10 x2 |  1.892 |   1.555 | 5.386 |   6.474 |   1.123 |   1.206

Tests can be reproduced with 'time echo "perfhash[map]  1000" | 
./test-hashmap', see test-hashmap.c for definition of method flags.


Karsten Blees (5):
  add a hashtable implementation that supports O(1) removal
  buitin/describe.c: use new hash map implementation
  diffcore-rename.c: move code around to prepare for the next patch
  diffcore-rename.c: simplify finding exact renames
  diffcore-rename.c: use new hash map implementation

 Makefile   |   3 +
 builtin/describe.c |  53 
 diffcore-rename.c  | 185 ++--
 hashmap.c  | 211 +++
 hashmap.h  | 200 ++
 t/t0011-hashmap.sh | 236 +++
 test-hashmap.c | 354 +
 7 files changed, 1092 insertions(+), 150 deletions(-)
 create mode 100644 hashmap.c
 create mode 100644 hashmap.h
 create mode 100755 t/t0011-hashmap.sh
 create mode 100644 test-hashmap.c


---
git diff kb/hashmap..kb/hashmap-v2:

diff --git a/builtin/describe.c b/builtin/describe.c
index bbc7159..5db5d89 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -25,7 +25,7 @@ static int longformat;
 static int first_parent;
 static int abbrev = -1; /* unspecified */
 static int max_candidates = 10;
-static hashmap names;
+static struct hashmap names;
 static int have_util;
 static const char *pattern;
 static int always;
@@ -38,7 +38,7 @@ static const char *diff_index_args[] = {
 
 
 struct commit_name {
-   hashmap_entry entry;
+   struct hashmap_entry entry;
unsigned char peeled[20];
struct tag *tag;
unsigned prio:2; /* annotated tag = 2, tag = 1, head = 0 */
@@ -280,7 +280,7 @@ static void describe(const char *arg, int last_one)
fprintf(stderr, _("searching to describe %s\n"), arg);
 
if (!have_util) {
-   hashmap_iter iter;
+   struct hashmap_iter iter;
struct commit *c;
struct commit_name *n = hashmap_iter_first(&names, &iter);
for (; n; n = hashmap_iter_next(&iter)) {
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 6271af9..2e70d31 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -243,7 +243,7 @@ static int score_compare(const void *a_, const void *b_)
 }
 
 struct file_similarity {
-   hashmap_entry entry;
+   struct hashmap_entry entry;
int index;
struct diff_filespec *filespec;
 };
@@ -260,7 +260,7 @@ static unsigned int hash_filespec(struct diff_filespec 
*filespec)
return hash;
 }
 
-static int find_identical_files(hashmap *srcs,
+static int find_identical_files(struct hashmap *srcs,
int dst_index,
struct diff_options *options)
 {
@@ -310,7 +310