Re: What's cooking in git.git (Oct 2013, #06; Fri, 25)

2013-11-14 Thread Karsten Blees
Am 29.10.2013 10:09, schrieb Karsten Blees:
> On Mon, Oct 28, 2013 at 10:04 PM, Vicent Martí  wrote:
>>
>> On Mon, Oct 28, 2013 at 8:45 PM, Karsten Blees  
>> wrote:
>>
>>> Regarding performance, khash uses open addressing, which requires more key 
>>> comparisons (O(1/(1-load_factor))) than chaining (O(1+load_factor)). 
>>> However, any measurable differences will most likely be dwarfed by IO costs 
>>> in this particular use case.
>>
>> I don't think this is true. If you actually run a couple insertion and
>> lookup benchmarks comparing the two implementations, you'll find khash
>> to be around ~30% faster for most workloads (venturing a guess from
>> past experience). I am obviously not the author of khash, but I've
...

Just out of curiosity, I added performance test code for khash to the test in 
my current hashmap patch series [1]. It turns out that khash is by far the 
slowest of the bunch, especially with many collisions.

Again, I don't think that performance matters all that much (or in other words: 
_any_ hash table implementation will probably be fast enough compared to the 
rest that's going on). Its more a question of whether we really need two 
different hash table implementations (and a queasy feeling about the macro 
kludge in khash.h...).


Khash doesn't store the hash codes along with the entries (as both hash.[ch] 
and hashmap.[ch] do), so it needs to re-calculate hash codes on every resize. 
For a fair comparison, the "khash" test uses keys with pre-calculated hash 
codes in the key structure. This should be similar to a hash function that just 
copies 4 bytes from a sha1 key. Khash maps with more complex hash functions 
will be slower (see khstr).

The "khstr" test uses khash's predefined string map and khash's X31 hash 
function for strings (therefore no separate values for different hash functions 
here).

The table is similar to what I posted for hashmap-v2 [2] (i.e. real time in 
seconds for 1,000 rounds á 100,000 entries). I just turned it around a bit to 
make room for khash columns.

test | hash_fn | hashmap |  hash   |  khash  | khstr  |
-+-+-+-+-++
 | FNV |   2.429 |  14.366 |  11.780 | 18.677 |
 | FNV  x2 |   2.946 |  14.558 |  10.922 ||
 add | i   |   1.708 |   7.419 |   4.132 ||
 | ix2 |   1.791 |   8.565 |   4.502 ||
 | i/10|   1.555 |   1.805 | 344.691 ||
 | i/10 x2 |   1.543 |   1.808 | 319.559 ||
-+-+-+-+-++
 | FNV |   1.822 |   3.452 |   4.922 |  8.309 |
get  | FNV  x2 |   2.298 |   3.194 |   4.473 ||
100% | i   |   1.252 |   1.344 |   0.944 ||
hits | ix2 |   1.286 |   1.434 |   1.220 ||
 | i/10|   6.720 |   5.138 | 281.815 ||
 | i/10 x2 |   6.297 |   5.188 | 257.021 ||
-+-+-+-+-++
 | FNV |   1.023 |   3.949 |   4.115 |  4.878 |
get  | FNV  x2 |   1.538 |   3.915 |   4.571 ||
10%  | i   |   0.654 | 397.457 |  38.125 ||
hits | ix2 |   0.718 |   0.722 |   9.111 ||
 | i/10|   1.128 |  30.235 |  60.376 ||
 | i/10 x2 |   1.260 |   1.082 |  43.354 ||
-+-+-+-+-++

[1] https://github.com/kblees/git/commits/kb/hashmap-v5-khash
[2] http://article.gmane.org/gmane.comp.version-control.git/235290


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What's cooking in git.git (Oct 2013, #06; Fri, 25)

2013-10-29 Thread Karsten Blees
On Mon, Oct 28, 2013 at 10:04 PM, Vicent Martí  wrote:
>
> On Mon, Oct 28, 2013 at 8:45 PM, Karsten Blees  
> wrote:
>
> > Regarding performance, khash uses open addressing, which requires more key 
> > comparisons (O(1/(1-load_factor))) than chaining (O(1+load_factor)). 
> > However, any measurable differences will most likely be dwarfed by IO costs 
> > in this particular use case.
>
> I don't think this is true. If you actually run a couple insertion and
> lookup benchmarks comparing the two implementations, you'll find khash
> to be around ~30% faster for most workloads (venturing a guess from
> past experience). I am obviously not the author of khash, but I've
> found that the theoretical increase in key comparisons is completely
> dwarfed by the benefit of increased cache locality during the probing
> fase. I still haven't found a faster C hash table implementation than
> khash for the general case, that's why I normally use it despite the
> worrisome preprocessor crash-party going on in that header file.

Yes, cache locality is where open addressing shines, however, you
loose that benefit when the keys are pointers (e.g. sha1's).

>
>
> > Btw., pack-objects.c::rehash_objects() in patch 03 unnecessarily checks for 
> > duplicates. That's probably the reason for the high hashcmp times you found 
> > in the first round of the patch series.
>
> Patch 03 is a refactoring -- the duplicate checking code has been in
> pack-objects.c for years. I am not sure duplicate checking is
> superfluous at all, there are many cases where you could be
> double-inserting objects in a pack-objects run, and you really don't
> want to generate packfiles with dupe objects.

The point is that its in _rehash_. Duplicate checking should be in
add/put. Rehash only rearranges entries that are alread _in_ the map,
and it usually only needs the hash code for that. So pack-objects
implements rehash in terms of a full clear + add-all instead, which is
of course slower than what khash, hashmap etc. would do.

Ciao,
Karsten
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What's cooking in git.git (Oct 2013, #06; Fri, 25)

2013-10-28 Thread Vicent Martí
On Mon, Oct 28, 2013 at 8:45 PM, Karsten Blees  wrote:
>> The new `hashmap.c` covers the first case quite well (albeit slightly
>> more verbosely than I'd like), but in the second case it doesn't quite
>> work. Since the new hash needs to embed the "struct hashmap_entry" on
>> all its values (to allow for separate chaining), having it map to
>> `int` keys requires a struct like this:
>>
>> struct sha1_position {
>> struct hashmap_entry {
>> struct hashmap_entry *next;
>> unsigned int hash;
>> };
>> int position;
>> }
>>
>
> Hmm...isn't that position rather an index into two separately maintained 
> arrays? So you'd rather have:
>
> struct sha1_position {
> struct hashmap_entry {
> struct hashmap_entry *next;
> unsigned int hash;
> };
> uint32_t pack_name_hash;
> struct object *object;
>  }

No, this is not quite right. We use the separate arrays because the
normal operation mode is to index by position (e.g. we need the nth
object in the extended index); the hash table is an auxiliary
structure to reverse that indexing (e.g. what position does this SHA1
have on the extended index). The information which is always required
to construct bitmaps is position, so we need to store the indexes in a
map.

>> khash on the other hand is capable of storing the position values as
>> part of the hash table itself (i.e. `int **buckets`), and saves us
>> from thousands of bytes of allocations + indirection.
>>
>
> However, khash keeps separate arrays for flags, keys and values, all of them 
> overallocated by 1 / load factor (so there's lots of unused space). 
> ext_index.objects and ext_index.hashes are also overallocated by the usual 
> alloc_nr() factor 1.5.

FWIW The flags for khash are compacted, so they are stored much more
tightly than pointers, even when overallocated.

>
> Regarding memory consumption, I think both implementations will be pretty 
> similar. Hashmap allocates many small regions while khash re-allocates a few 
> big ones...I really don't know which is better, ideally entries would be 
> allocated in chunks to minimize both heap overhead and memcpy disadvantes.

I agree, both implementations probably have very similar memory
characteristics, probably enough not to matter.

> Regarding performance, khash uses open addressing, which requires more key 
> comparisons (O(1/(1-load_factor))) than chaining (O(1+load_factor)). However, 
> any measurable differences will most likely be dwarfed by IO costs in this 
> particular use case.

I don't think this is true. If you actually run a couple insertion and
lookup benchmarks comparing the two implementations, you'll find khash
to be around ~30% faster for most workloads (venturing a guess from
past experience). I am obviously not the author of khash, but I've
found that the theoretical increase in key comparisons is completely
dwarfed by the benefit of increased cache locality during the probing
fase. I still haven't found a faster C hash table implementation than
khash for the general case, that's why I normally use it despite the
worrisome preprocessor crash-party going on in that header file.

> Btw., pack-objects.c::rehash_objects() in patch 03 unnecessarily checks for 
> duplicates. That's probably the reason for the high hashcmp times you found 
> in the first round of the patch series.

Patch 03 is a refactoring -- the duplicate checking code has been in
pack-objects.c for years. I am not sure duplicate checking is
superfluous at all, there are many cases where you could be
double-inserting objects in a pack-objects run, and you really don't
want to generate packfiles with dupe objects.

Thanks for the feedback!

luv,
vmg
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What's cooking in git.git (Oct 2013, #06; Fri, 25)

2013-10-28 Thread Karsten Blees
Am 28.10.2013 17:16, schrieb Vicent Martí:
> On Mon, Oct 28, 2013 at 4:48 PM, Junio C Hamano  wrote:
>>> jk/pack-bitmap adds khash.h, which from a first glance looks like yet
>>> another hash table implementation. I was just wondering if kb's new
>>> hash tables can cover the need of pack-bitmap.c too so we can remove
>>> khash.h later..
>>
>> Good thinking ;-).
> 
> We use the khash tables to map:
> 
> - sha1 (const char *) to (void *)
> - sha1 (const char *) to int
> 
> The new `hashmap.c` covers the first case quite well (albeit slightly
> more verbosely than I'd like), but in the second case it doesn't quite
> work. Since the new hash needs to embed the "struct hashmap_entry" on
> all its values (to allow for separate chaining), having it map to
> `int` keys requires a struct like this:
> 
> struct sha1_position {
> struct hashmap_entry {
> struct hashmap_entry *next;
> unsigned int hash;
> };
> int position;
> }
> 

Hmm...isn't that position rather an index into two separately maintained 
arrays? So you'd rather have:

struct sha1_position {
struct hashmap_entry {
struct hashmap_entry *next;
unsigned int hash;
};
uint32_t pack_name_hash;
struct object *object;
 }

> khash on the other hand is capable of storing the position values as
> part of the hash table itself (i.e. `int **buckets`), and saves us
> from thousands of bytes of allocations + indirection.
> 

However, khash keeps separate arrays for flags, keys and values, all of them 
overallocated by 1 / load factor (so there's lots of unused space). 
ext_index.objects and ext_index.hashes are also overallocated by the usual 
alloc_nr() factor 1.5.

Regarding memory consumption, I think both implementations will be pretty 
similar. Hashmap allocates many small regions while khash re-allocates a few 
big ones...I really don't know which is better, ideally entries would be 
allocated in chunks to minimize both heap overhead and memcpy disadvantes.

Regarding performance, khash uses open addressing, which requires more key 
comparisons (O(1/(1-load_factor))) than chaining (O(1+load_factor)). However, 
any measurable differences will most likely be dwarfed by IO costs in this 
particular use case.


Btw., pack-objects.c::rehash_objects() in patch 03 unnecessarily checks for 
duplicates. That's probably the reason for the high hashcmp times you found in 
the first round of the patch series.

Cheers,
Karsten

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What's cooking in git.git (Oct 2013, #06; Fri, 25)

2013-10-28 Thread Junio C Hamano
Vicent Martí  writes:

> On Mon, Oct 28, 2013 at 4:48 PM, Junio C Hamano  wrote:
>>> jk/pack-bitmap adds khash.h, which from a first glance looks like yet
>>> another hash table implementation. I was just wondering if kb's new
>>> hash tables can cover the need of pack-bitmap.c too so we can remove
>>> khash.h later..
>> ...
> khash on the other hand is capable of storing the position values as
> part of the hash table itself (i.e. `int **buckets`), and saves us
> from thousands of bytes of allocations + indirection.

My "Good thinking ;-)" comment was primarily meant as "somebody
needs to at least think about the possibility and consider pros and
cons", and you thought about it already ;-).

In short, kb's hash table does not cover the need for pack-bitmap,
so we should keep two at least for now, until (and/or unless) either
side can be shown (and/or extended) to cover the need for the other
one as well.

Thanks. 
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What's cooking in git.git (Oct 2013, #06; Fri, 25)

2013-10-28 Thread Vicent Martí
On Mon, Oct 28, 2013 at 4:48 PM, Junio C Hamano  wrote:
>> jk/pack-bitmap adds khash.h, which from a first glance looks like yet
>> another hash table implementation. I was just wondering if kb's new
>> hash tables can cover the need of pack-bitmap.c too so we can remove
>> khash.h later..
>
> Good thinking ;-).

We use the khash tables to map:

- sha1 (const char *) to (void *)
- sha1 (const char *) to int

The new `hashmap.c` covers the first case quite well (albeit slightly
more verbosely than I'd like), but in the second case it doesn't quite
work. Since the new hash needs to embed the "struct hashmap_entry" on
all its values (to allow for separate chaining), having it map to
`int` keys requires a struct like this:

struct sha1_position {
struct hashmap_entry {
struct hashmap_entry *next;
unsigned int hash;
};
int position;
}

khash on the other hand is capable of storing the position values as
part of the hash table itself (i.e. `int **buckets`), and saves us
from thousands of bytes of allocations + indirection.

I am not sure whether the consistency of having a single hash map
warrants the performance and memory hits when operating on the
extended index.

Please advice.

luv,
vmg
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What's cooking in git.git (Oct 2013, #06; Fri, 25)

2013-10-28 Thread Junio C Hamano
Duy Nguyen  writes:

> On Sat, Oct 26, 2013 at 6:23 AM, Junio C Hamano  wrote:
>> * kb/fast-hashmap (2013-10-22) 12 commits
>>  - remove old hash.[ch] implementation
>>  - read-cache.c: fix memory leaks caused by removed cache entries
>>  - name-hash.c: remove cache entries instead of marking them CE_UNHASHED
>>  - name-hash.c: use new hash map implementation for cache entries
>>  - name-hash.c: remove unreferenced directory entries
>>  - name-hash.c: use new hash map implementation for directories
>>  - diffcore-rename.c: use new hash map implementation
>>  - diffcore-rename.c: simplify finding exact renames
>>  - diffcore-rename.c: move code around to prepare for the next patch
>>  - buitin/describe.c: use new hash map implementation
>>  - add a hashtable implementation that supports O(1) removal
>>  - submodule: don't access the .gitmodules cache entry after removing it
>>
>>  Improvements to our hash table to get it to meet the needs of the
>>  msysgit fscache project, with some nice performance improvements.
>>
>>  The preparatory clean-up to submodule from Jens is at the bottom. I
>>  also squashed in a fix-up by Karsten found at $gmane/236468 (please
>>  double-check the result).
>
> jk/pack-bitmap adds khash.h, which from a first glance looks like yet
> another hash table implementation. I was just wondering if kb's new
> hash tables can cover the need of pack-bitmap.c too so we can remove
> khash.h later..

Good thinking ;-).
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What's cooking in git.git (Oct 2013, #06; Fri, 25)

2013-10-26 Thread Duy Nguyen
On Sat, Oct 26, 2013 at 6:23 AM, Junio C Hamano  wrote:
> * kb/fast-hashmap (2013-10-22) 12 commits
>  - remove old hash.[ch] implementation
>  - read-cache.c: fix memory leaks caused by removed cache entries
>  - name-hash.c: remove cache entries instead of marking them CE_UNHASHED
>  - name-hash.c: use new hash map implementation for cache entries
>  - name-hash.c: remove unreferenced directory entries
>  - name-hash.c: use new hash map implementation for directories
>  - diffcore-rename.c: use new hash map implementation
>  - diffcore-rename.c: simplify finding exact renames
>  - diffcore-rename.c: move code around to prepare for the next patch
>  - buitin/describe.c: use new hash map implementation
>  - add a hashtable implementation that supports O(1) removal
>  - submodule: don't access the .gitmodules cache entry after removing it
>
>  Improvements to our hash table to get it to meet the needs of the
>  msysgit fscache project, with some nice performance improvements.
>
>  The preparatory clean-up to submodule from Jens is at the bottom. I
>  also squashed in a fix-up by Karsten found at $gmane/236468 (please
>  double-check the result).

jk/pack-bitmap adds khash.h, which from a first glance looks like yet
another hash table implementation. I was just wondering if kb's new
hash tables can cover the need of pack-bitmap.c too so we can remove
khash.h later..
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


What's cooking in git.git (Oct 2013, #06; Fri, 25)

2013-10-25 Thread Junio C Hamano
Here are the topics that have been cooking.  Commits prefixed with
'-' are only in 'pu' (proposed updates) while commits prefixed with
'+' are in 'next'.

You can find the changes described here in the integration branches
of the repositories listed at

http://git-blame.blogspot.com/p/git-public-repositories.html

--
[New Topics]

* bc/http-100-continue (2013-10-23) 1 commit
 - http: add option to enable 100 Continue responses

 Conditionally allow "100 Continue" responses to help use of
 GSS-Negotiate authentication scheme over HTTP transport.

 Seems to be still under discussion.


* jc/merge-base-reflog (2013-10-25) 2 commits
 - merge-base: teach "--fork-point" mode
 - merge-base: use OPT_CMDMODE and clarify the command line parsing

 Code the logic in "pull --rebase" that figures out a fork point
 from reflog entries in C.


* jk/date-c-double-semicolon (2013-10-24) 1 commit
 - drop redundant semicolon in empty while

 Will merge to 'next'.


* jk/for-each-ref-skip-parsing (2013-10-24) 1 commit
 - for-each-ref: avoid loading objects to print %(objectname)

 Will merge to 'next'.


* jk/pack-bitmap (2013-10-25) 19 commits
 - pack-bitmap: implement optional name_hash cache
 - t: add basic bitmap functionality tests
 - repack: consider bitmaps when performing repacks
 - repack: handle optional files created by pack-objects
 - repack: turn exts array into array-of-struct
 - repack: stop using magic number for ARRAY_SIZE(exts)
 - pack-objects: implement bitmap writing
 - rev-list: add bitmap mode to speed up object lists
 - pack-objects: use bitmaps when packing objects
 - pack-bitmap: add support for bitmap indexes
 - documentation: add documentation for the bitmap format
 - ewah: compressed bitmap implementation
 - compat: add endianness helpers
 - sha1_file: export `git_open_noatime`
 - revision: allow setting custom limiter function
 - pack-objects: factor out name_hash
 - pack-objects: refactor the packing list
 - revindex: export new APIs
 - sha1write: make buffer const-correct

 Borrows the bitmap index into packfiles from JGit to speed up
 enumeration of objects involved in a commit range without having to
 fully traverse the history.


* jk/refs-c-squelch-gcc (2013-10-24) 1 commit
 - silence gcc array-bounds warning

 Will merge to 'next'.


* jk/robustify-parse-commit (2013-10-24) 6 commits
 - checkout: do not die when leaving broken detached HEAD
 - use parse_commit_or_die instead of custom message
 - use parse_commit_or_die instead of segfaulting
 - assume parse_commit checks for NULL commit
 - assume parse_commit checks commit->object.parsed
 - log_tree_diff: die when we fail to parse a commit

 Will merge to 'next' after taking another look.


* mh/fetch-tags-in-addition-to-normal-refs (2013-10-24) 16 commits
 - fetch, remote: properly convey --no-prune options to subprocesses
 - builtin/remote.c:update(): use struct argv_array
 - builtin/remote.c: reorder function definitions
 - query_refspecs(): move some constants out of the loop
 - fetch --prune: prune only based on explicit refspecs
 - SQUASH??? --tags is no longer a short-hand
 - fetch --tags: fetch tags *in addition to* other stuff
 - builtin/fetch.c: reorder function definitions
 - ref_remove_duplicates(): improve documentation comment
 - ref_remove_duplicates(): simplify function
 - ref_remove_duplicates(): avoid redundant bisection
 - get_ref_map(): rename local variables
 - api-remote.txt: correct section "struct refspec"
 - t5510: check that "git fetch --prune --tags" does not prune branches
 - t5510: prepare test refs more straightforwardly
 - t5510: use the correct tag name in test

 Some questionable paragraphs in the doc updates, but other than
 that looks reasonably solid.


* nd/lift-path-max (2013-10-24) 2 commits
 - checkout_entry(): clarify the use of topath[] parameter
 - entry.c: convert checkout_entry to use strbuf

 Will merge to 'next'.


* jk/pack-corruption-post-mortem (2013-10-25) 1 commit
 - howto: add article on recovering a corrupted object

 Will merge to 'next'.


* jk/reset-p-current-head-fix (2013-10-25) 2 commits
 - reset: pass real rev name to add--interactive
 - add-interactive: handle unborn branch in patch mode

 "git reset -p HEAD" has codepath to special case it from resetting
 to contents of other commits, but recent change broke it.

 Will merge to 'next'.


* mf/graph-show-root (2013-10-25) 1 commit
 - graph.c: mark root commit differently

 Needs adjustments to some tests.


* nv/parseopt-opt-arg (2013-10-25) 1 commit
 - rev-parse --parseopt: add the --sticked-long mode

 Enhance "rev-parse --parseopt" mode to help parsing options with
 an optional parameter.

--
[Stalled]

* np/pack-v4 (2013-09-18) 90 commits
 . packv4-parse.c: add tree offset caching
 . t1050: replace one instance of show-index with verify-pack
 . index-pack, pack-objects: allow creating .idx v2 with .pack v4
 . unpack-objects: de