Re: [PATCH v3 10/21] pack-bitmap: add support for bitmap indexes

2013-12-07 Thread Karsten Blees
Am 03.12.2013 19:21, schrieb Jeff King:
 On Tue, Dec 03, 2013 at 03:40:41PM +0100, Karsten Blees wrote:
 
 IMO, trying to improve khash isn't worth the trouble.

 With smaller-than-pointer types, khash _may_ actually save a few bytes
 compared to hash.[ch] or hashmap.[ch]. E.g. a set of 'int's would be a
 perfect use case for khash. However, you're using bitmaps for that,
 and khash's predefined macros for int sets and maps have been removed
 from the git version.
 
 True, we are not using it for smaller-than-pointer sizes here. So it may
 not be worth thinking about (I was considering more for the general
 case). In most instances, though, we can shove the int bits into a
 pointer with the right casting. So it probably isn't worth worrying
 about (you may waste a few bytes, but probably not more than one word
 per entry).
 
 Using khash with pointers and larger-than-pointer types is just a
 waste of memory and performance, though. Hash tables are sparsely
 filled by design, and using large types means lots of large empty
 buckets. E.g. kh_resize requires almost four times the size or your
 data, and copies everything at least three times.
 
 I think the analysis is more complicated than that, and depends on what
 you are storing. If you are storing something that is 1.5 times the size
 of a pointer, it is more space efficient to just stick it in the hash
 table than it is to have a separate pointer in the hash table (you pay
 the load factor penalty only on the single pointer, but you've almost
 doubled your total storage).
 
 But again, that's a more general argument. We're not storing anything of
 that size here, and in fact I think we are just storing pointers
 everywhere.
 

Yes. With pointers, its a very close call regarding space. Let's take 
bitmap_index.bitmaps / struct stored_bitmap as an example. I'm assuming 8 byte 
pointers and an average load factor of 0.5.

struct stored_bitmap is already kind of an entry structure (key + value) 
allocated on the heap. This is a very common szenario, all hash.[ch] - 
hashmap.[ch] conversions were that way.

The average khash memory requirement per entry is (on top of the data on the 
heap):

  (key + value + flags) / load-factor = (8 + 8 + .25) / .5 = 32.5 bytes

For the hashmap conversion, I simply injected struct hashmap_entry into the 
existing structure, adding 12 bytes per entry. So the total hashmap memory per 
entry is:

  hashmap_entry + (entry-pointer / load-factor) = 12 + (8 / .5) = 28 bytes

 Additionally, there are the obvious problems with khash's macro design
 (hard to read, impossible to debug, and each named instance increases
 executable size by ~4k).
 
 Yes, those are all downsides to macros. Type safety is one of the
 upsides, though.


Well, khash_sha1 maps to void *, so I guess its not that important :-) But if 
you care about type safety, a common core implementation with a few small 
wrapper macros would be hugely preferable, don't you think?
 
 Besides macros, I think the major difference between the two
 implementations is open-addressing versus chaining. Especially for sets,
 we've had good experiences with open-addressing by keeping the load
 factor low (e.g., the one in object.c). As you note, the resizing
 operation pays some penalty, but in most of our workloads it's largely
 irrelevant compared to lookup times.


Resizing is reasonably fast and straightforward for both open addressing and 
chaining. My 'copy three times' remark above was specific to the 
rehash-in-place stunt in kh_resize (if you haven't wondered what these 64 
multi-statement-lines do, you probably don't wanna know... ;-)

But its still best if you know the target size and can prevent resizing 
altogether.

 Chaining typically adds an extra layer of pointer-following to the
 lookup, and I'd be worried about that loss of locality. It's true that
 when you are storing a pointer you are already hurting locality to
 follow the pointer to the key, but I don't know whether that means an
 _extra_ layer doesn't still hurt more (and I really mean I don't know --
 I'd be interested to see measurements).


The locality advantage of open addressing _only_ applies when storing the data 
directly in the table, but AFAIK we're not doing that anywhere in git. Compared 
to open addressing with pointers, chaining in fact has _better_ locality.

Iterating a linked list (i.e. chaining) accesses exactly one (1.0) memory 
location per element.

Iterating an array of pointers (i.e. open addressing with linear probing) 
accesses ~1.125 memory locations per element (assuming 8 byte pointers and 64 
byte cache lines, array[8] will be in a different cache line than array[0], 
thus the +.125).

Of course, any implementation can add more pointer-following to that minimum. 
E.g. khash uses separate flags and keys arrays and triangular numbers as probe 
sequence, so its up to 3 memory locations per element.

Hashmap, on the other hand, allows you to inject it's per-entry data into the 

Re: [PATCH v3 10/21] pack-bitmap: add support for bitmap indexes

2013-12-03 Thread Karsten Blees
Am 28.11.2013 11:38, schrieb Jeff King:
 On Wed, Nov 27, 2013 at 10:08:56AM +0100, Karsten Blees wrote:
 
 Khash is OK for sha1 keys, but I don't think it should be advertised
 as a second general purpose hash table implementation. Its far too
 easy to shoot yourself in the foot by using 'straightforward' hash-
 and comparison functions. Khash doesn't store the hash codes of the
 keys, so you have to take care of that yourself or live with the
 performance penalties (see [1]).

 [1] http://article.gmane.org/gmane.comp.version-control.git/237876
 
 Yes. I wonder if we should improve it in that respect. I haven't looked
 carefully at the hash code you posted elsewhere, but I feel like many
 uses will want a macro implementation to let them store arbitrary types
 smaller or larger than a pointer.
 
 -Peff
 

IMO, trying to improve khash isn't worth the trouble.

With smaller-than-pointer types, khash _may_ actually save a few bytes compared 
to hash.[ch] or hashmap.[ch]. E.g. a set of 'int's would be a perfect use case 
for khash. However, you're using bitmaps for that, and khash's predefined 
macros for int sets and maps have been removed from the git version.

Using khash with pointers and larger-than-pointer types is just a waste of 
memory and performance, though. Hash tables are sparsely filled by design, and 
using large types means lots of large empty buckets. E.g. kh_resize requires 
almost four times the size or your data, and copies everything at least three 
times.

Additionally, there are the obvious problems with khash's macro design (hard to 
read, impossible to debug, and each named instance increases executable size by 
~4k).


Below is a patch that converts pack-bitmap.c to hashmap. Its not even longer 
than the khash version, and the hashmap API forces you to think about 
no-brainer improvements such as specifying an expected size or skipping 
duplicates checks where they aren't needed. I could do the same for 
pack-bitmap-write.c if you like.

Karsten

---8--
Subject: [PATCH] pack-bitmap.c: convert to new hashmap implementation

Preallocates bitmap_index.bitmaps with the expected size to prevent
rehashing.

Combines the five arrays of ext_index ('objects', 'hashes' and khash's
'keys', 'flags' and 'vals') into a single structure for better locality.

Removes two unnecessary duplicates checks:
 - we don't expect a pack index file to contain duplicate sha1's
 - ext_index_add_object is only called after bitmap_position returned -1,
   so it is impossible that the entry is already there

Signed-off-by: Karsten Blees bl...@dcon.de
---
 pack-bitmap.c | 166 --
 1 file changed, 81 insertions(+), 85 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 078f7c6..115caed 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -15,6 +15,7 @@
  * commit.
  */
 struct stored_bitmap {
+   struct hashmap_entry ent;
unsigned char sha1[20];
struct ewah_bitmap *root;
struct stored_bitmap *xor;
@@ -22,6 +23,16 @@ struct stored_bitmap {
 };
 
 /*
+ * An entry in the extended index.
+ */
+struct ext_entry {
+   struct hashmap_entry ent;
+   uint32_t name_hash;
+   struct object *object;
+   unsigned int nr;
+};
+
+/*
  * The currently active bitmap index. By design, repositories only have
  * a single bitmap index available (the index for the biggest packfile in
  * the repository), since bitmap indexes need full closure.
@@ -61,7 +72,7 @@ static struct bitmap_index {
struct ewah_bitmap *tags;
 
/* Map from SHA1 - `stored_bitmap` for all the bitmapped comits */
-   khash_sha1 *bitmaps;
+   struct hashmap bitmaps;
 
/* Number of bitmapped commits */
uint32_t entry_count;
@@ -76,12 +87,7 @@ static struct bitmap_index {
 * packed in `pack`, these objects are added to this fake index and
 * are assumed to appear at the end of the packfile for all operations
 */
-   struct eindex {
-   struct object **objects;
-   uint32_t *hashes;
-   uint32_t count, alloc;
-   khash_sha1_pos *positions;
-   } ext_index;
+   struct hashmap ext_index;
 
/* Bitmap result of the last performed walk */
struct bitmap *result;
@@ -93,6 +99,21 @@ static struct bitmap_index {
 
 } bitmap_git;
 
+static int stored_bitmap_cmp(const struct stored_bitmap *entry,
+   const struct stored_bitmap *entry_or_key,
+   const unsigned char *sha1)
+{
+   return hashcmp(entry-sha1, sha1 ? sha1 : entry_or_key-sha1);
+}
+
+static int ext_entry_cmp(const struct ext_entry *entry,
+   const struct ext_entry *entry_or_key,
+   const unsigned char *sha1)
+{
+   return hashcmp(entry-object-sha1,
+  sha1 ? sha1 : entry_or_key-object-sha1);
+}
+
 static struct ewah_bitmap *lookup_stored_bitmap(struct stored_bitmap *st)
 {
struct 

Re: [PATCH v3 10/21] pack-bitmap: add support for bitmap indexes

2013-12-03 Thread Jeff King
On Tue, Dec 03, 2013 at 03:40:41PM +0100, Karsten Blees wrote:

 IMO, trying to improve khash isn't worth the trouble.
 
 With smaller-than-pointer types, khash _may_ actually save a few bytes
 compared to hash.[ch] or hashmap.[ch]. E.g. a set of 'int's would be a
 perfect use case for khash. However, you're using bitmaps for that,
 and khash's predefined macros for int sets and maps have been removed
 from the git version.

True, we are not using it for smaller-than-pointer sizes here. So it may
not be worth thinking about (I was considering more for the general
case). In most instances, though, we can shove the int bits into a
pointer with the right casting. So it probably isn't worth worrying
about (you may waste a few bytes, but probably not more than one word
per entry).

 Using khash with pointers and larger-than-pointer types is just a
 waste of memory and performance, though. Hash tables are sparsely
 filled by design, and using large types means lots of large empty
 buckets. E.g. kh_resize requires almost four times the size or your
 data, and copies everything at least three times.

I think the analysis is more complicated than that, and depends on what
you are storing. If you are storing something that is 1.5 times the size
of a pointer, it is more space efficient to just stick it in the hash
table than it is to have a separate pointer in the hash table (you pay
the load factor penalty only on the single pointer, but you've almost
doubled your total storage).

But again, that's a more general argument. We're not storing anything of
that size here, and in fact I think we are just storing pointers
everywhere.

 Additionally, there are the obvious problems with khash's macro design
 (hard to read, impossible to debug, and each named instance increases
 executable size by ~4k).

Yes, those are all downsides to macros. Type safety is one of the
upsides, though.

Besides macros, I think the major difference between the two
implementations is open-addressing versus chaining. Especially for sets,
we've had good experiences with open-addressing by keeping the load
factor low (e.g., the one in object.c). As you note, the resizing
operation pays some penalty, but in most of our workloads it's largely
irrelevant compared to lookup times.

Chaining typically adds an extra layer of pointer-following to the
lookup, and I'd be worried about that loss of locality. It's true that
when you are storing a pointer you are already hurting locality to
follow the pointer to the key, but I don't know whether that means an
_extra_ layer doesn't still hurt more (and I really mean I don't know --
I'd be interested to see measurements).

In your implementation, it looks like you break even there because you
store the hash directly in the entry, and do a single-word compare (so
you avoid having to follow a pointer to the key in the common case
during lookup). But that also means you're paying extra to store the
hash. That probably makes sense for things like strings, where it takes
some effort to calculate the hash. But not necessarily for sha1s, where
looking at the hash is the same thing as looking at the key bytes (so
you are storing extra bytes, and when you do have a hit on the stored
hash, which is just the first bytes of the sha1, you end up comparing
them again as part of the hashcmp. The tradeoff is that in the
non-matching cases, you avoid an extra level of indirection).

All of these are things I could see helping or hurting, depending on the
case. I'd really love to see more numbers. I tried timing your patch
below, but there was no interesting change. Mostly because the code path
you changed is not heavily exercised (the ext_index is for objects
that are not in the pack, and frequent packing keeps that number low).

I'd be really interested to see if you could make the hash in object.c
faster. That one is a prominent lookup bottle-neck (e.g., for git
rev-list --objects --all); if you can make that faster, I would be very
convinced that your implementation is fast (note that I am not implying
the converse; if you cannot make it faster, that does not necessarily
mean your implementation sucks, but perhaps only that the existing one
has lots of type-specific optimizations which add up).

Khash also has a lot of bloat (e.g., flags) that the one in object.c
does not have. If you do not care about deletion and are storing
something with a sentinel value (e.g., NULL for pointers), you can trim
quite a bit of fat.

 Below is a patch that converts pack-bitmap.c to hashmap. Its not even
 longer than the khash version, and the hashmap API forces you to think
 about no-brainer improvements such as specifying an expected size or
 skipping duplicates checks where they aren't needed. I could do the
 same for pack-bitmap-write.c if you like.

If it's not too much trouble, I'd be curious to measure the performance
impact on pack-bitmap-write.

 Removes two unnecessary duplicates checks:
  - we don't expect a pack index file to 

Re: [PATCH v3 10/21] pack-bitmap: add support for bitmap indexes

2013-12-02 Thread Jeff King
On Fri, Nov 29, 2013 at 10:21:04PM +0100, Thomas Rast wrote:

 I do think it's worth fixing the syntax pedantry at the end so that we
 can keep supporting arcane compilers, but otherwise, meh.

Agreed. I've picked up those changes in my tree.

  +static int open_pack_bitmap_1(struct packed_git *packfile)
 
 This goes somewhat against the naming convention (if you can call it
 that) used elsewhere in git.  Usually foo_1() is an implementation
 detail of foo(), used because it is convenient to wrap the main part in
 another function, e.g. so that it can consistently free resources or
 some such.  But this one operates on one pack file, so in the terms of
 the rest of git, it should probably be called open_pack_bitmap_one().

Hmm. I see your point, but I think that my (and Vicent's) mental model
was that is _was_ a helper for open_pack_bitmap. It just happens to also
fill the role of open_pack_bitmap_one(), but you would not want the
latter. We only support a single bitmap at a time; by calling the
helper, you would miss out on the assert which would catch the error.

So I don't care much, but I have a slight preference to leave it, as it
signals you should not be calling this directly more clearly.

 A bit unfortunate that you inherit the strange show_* naming from
 builtin/pack-objects.c, which seems to have stolen some code from
 builtin/rev-list.c at some point without worrying about better naming...

Yes, I agree they're not very descriptive. Let's leave it for now to
stay consistent with pack-objects, and I'd be happy to see a patch
giving all of them better names come later.

  +   while (i  objects-word_alloc  ewah_iterator_next(filter, it)) {
  +   eword_t word = objects-words[i]  filter;
  +
  +   for (offset = 0; offset  BITS_IN_WORD; ++offset) {
  +   const unsigned char *sha1;
  +   struct revindex_entry *entry;
  +   uint32_t hash = 0;
  +
  +   if ((word  offset) == 0)
  +   break;
  +
  +   offset += ewah_bit_ctz64(word  offset);
  +
  +   if (pos + offset  bitmap_git.reuse_objects)
  +   continue;
  +
  +   entry = bitmap_git.reverse_index-revindex[pos + 
  offset];
  +   sha1 = nth_packed_object_sha1(bitmap_git.pack, 
  entry-nr);
  +
  +   show_reach(sha1, object_type, 0, hash, bitmap_git.pack, 
  entry-offset);
  +   }
 
 You have a very nice bitmap_each_bit() function in ewah/bitmap.c, why
 not use it here?

We are bitwise-ANDing against an on-disk ewah bitmap to filter out
objects which do not match the desired type. bitmap_each_bit would make
this more complicated, because we wouldn't be able to move the
ewah_iterator in single-word lockstep. And it would probably be slower
(if you did it naively), because we'd end up checking each bit in the
ewah, rather than AND-ing whole words.

The right, reusable way to do it would probably be to bitmap_and_ewah
the original and the filter together, and then bitmap_each_bit the
result. But you would have to write bitmap_and_ewah first. :)

  +   /*
  +* Reuse the packfile content if we need more than
  +* 90% of its objects
  +*/
  +   static const double REUSE_PERCENT = 0.9;
 
 Curious: is this based on some measurements or just a guess?

I think it's mostly a guess.

  +enum pack_bitmap_opts {
  +   BITMAP_OPT_FULL_DAG = 1,
 
 And I think this trailing comma on the last enum item is also strictly
 speaking not allowed, even though it is very nice to have:
 
 pack-bitmap.h:28:27: warning: comma at end of enumerator list [-Wpedantic]

It's allowed in C99, but was not in C89.  I've fixed this site for
consistency with the rest of git. But I wonder how relevant it still is.
The only data points I know of are:

  http://article.gmane.org/gmane.comp.version-control.git/145739

and

  http://article.gmane.org/gmane.comp.version-control.git/145739

It sounds like an ancient IBM VisualAge is the only reported problem.
And according to IBM, they stopped supporting it 10 years ago (well,
technically we have a few more weeks to hit the 10-year mark):

  
http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=ansubtype=casupplier=897appname=IBMLinkRedirectletternum=ENUS903-227

I do wonder if at some point we should revisit our do not use any
C99-isms philosophy. It was very good advice in 2005. I don't know how
good it is over 8 years later (it seems like even ancient systems should
be able to get gcc compiled as a last resort, but maybe there really are
people for whom that is a burden).

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 10/21] pack-bitmap: add support for bitmap indexes

2013-12-02 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 I do wonder if at some point we should revisit our do not use any
 C99-isms philosophy. It was very good advice in 2005. I don't know how
 good it is over 8 years later (it seems like even ancient systems should
 be able to get gcc compiled as a last resort, but maybe there really are
 people for whom that is a burden).

Well, we are not kernel where being able to precisely control
generated machine code matters and enforcement of acceptable
compiler versions to achieve that goal is warranted, so I'd prefer
to avoid anything that tells the users go get a newer gcc.

There are certain things outside C89 that would make our code easier
to read and maintain (e.g. named member initialization of
struct/union, cf. ANSI C99 s6.7.9, just to name one) that I would
love to be able to use in our codebase, but being able to leave an
extra comma at the list of enums is very low on that list.  Other
than making a patch unnecessarily verbose when you introduce a new
enum and append it at the end, I do not think it hurts us much (I
may be missing other reasons why you may want to leave an extra
comma at the end, though).

Even that problem can easily be solved by having the a value of
this enum has to be lower than this at the end, e.g.

enum object_type {
OBJ_BAD = -1,
OBJ_NONE = 0,
OBJ_COMMIT = 1,
OBJ_TREE = 2,
OBJ_BLOB = 3,
OBJ_TAG = 4,
/* 5 for future expansion */
OBJ_OFS_DELTA = 6,
OBJ_REF_DELTA = 7,
OBJ_ANY,
OBJ_MAX
};


which will never tempt anybody to append at the end, causing the
patch bloat.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 10/21] pack-bitmap: add support for bitmap indexes

2013-12-02 Thread Jeff King
On Mon, Dec 02, 2013 at 12:36:34PM -0800, Junio C Hamano wrote:

 Jeff King p...@peff.net writes:
 
  I do wonder if at some point we should revisit our do not use any
  C99-isms philosophy. It was very good advice in 2005. I don't know how
  good it is over 8 years later (it seems like even ancient systems should
  be able to get gcc compiled as a last resort, but maybe there really are
  people for whom that is a burden).
 
 Well, we are not kernel where being able to precisely control
 generated machine code matters and enforcement of acceptable
 compiler versions to achieve that goal is warranted, so I'd prefer
 to avoid anything that tells the users go get a newer gcc.

Sorry, I was not very clear about what I said. I do not think go get a
newer gcc is a good thing to be telling people. But I wonder:

  a. if there are actually people on systems that have pre-c99 compilers
 in 2013

  b. if there are, do they actually _use_ the ancient system compiler,
 and not just install gcc as the first step anyway?

In other words, I am questioning whether we would have to tell anybody
go install gcc these days. I'm not sure of the best way to answer that
question, though.

 There are certain things outside C89 that would make our code easier
 to read and maintain (e.g. named member initialization of
 struct/union, cf. ANSI C99 s6.7.9, just to name one) that I would
 love to be able to use in our codebase, but being able to leave an
 extra comma at the list of enums is very low on that list.

Yes, I can live without trailing commas. I was musing more on the
general issue (of course, we don't _have_ to take C99 as a whole, and
can pick and choose features that even pre-C99 compilers got right, but
I was wondering mainly when it would be time to say C99 is old enough
that everybody supports it).

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 10/21] pack-bitmap: add support for bitmap indexes

2013-12-02 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 Sorry, I was not very clear about what I said. I do not think go get a
 newer gcc is a good thing to be telling people. But I wonder:

   a. if there are actually people on systems that have pre-c99 compilers
  in 2013

   b. if there are, do they actually _use_ the ancient system compiler,
  and not just install gcc as the first step anyway?

 In other words, I am questioning whether we would have to tell anybody
 go install gcc these days. I'm not sure of the best way to answer that
 question, though.

Thanks. We are at the same wavelength.  I do not know what the best
way to answer that question, either.  Unleashing such a change to
the wild is one obvious way to do so, and it will give us the most
accurate answer, so it may be the best way if we primarily care
about the quality of the answer, but we value the experience of our
direct consumers even more when judging what is best, so...

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 10/21] pack-bitmap: add support for bitmap indexes

2013-11-29 Thread Thomas Rast
TLDR: nitpicks.  Thanks for a very nice read.

I do think it's worth fixing the syntax pedantry at the end so that we
can keep supporting arcane compilers, but otherwise, meh.

 +static int open_pack_bitmap_1(struct packed_git *packfile)

This goes somewhat against the naming convention (if you can call it
that) used elsewhere in git.  Usually foo_1() is an implementation
detail of foo(), used because it is convenient to wrap the main part in
another function, e.g. so that it can consistently free resources or
some such.  But this one operates on one pack file, so in the terms of
the rest of git, it should probably be called open_pack_bitmap_one().

 +static void show_object(struct object *object, const struct name_path *path,
 + const char *last, void *data)
 +{
 + struct bitmap *base = data;
 + int bitmap_pos;
 +
 + bitmap_pos = bitmap_position(object-sha1);
 +
 + if (bitmap_pos  0) {
 + char *name = path_name(path, last);
 + bitmap_pos = ext_index_add_object(object, name);
 + free(name);
 + }
 +
 + bitmap_set(base, bitmap_pos);
 +}
 +
 +static void show_commit(struct commit *commit, void *data)
 +{
 +}

A bit unfortunate that you inherit the strange show_* naming from
builtin/pack-objects.c, which seems to have stolen some code from
builtin/rev-list.c at some point without worrying about better naming...

 +static void show_objects_for_type(
 + struct bitmap *objects,
 + struct ewah_bitmap *type_filter,
 + enum object_type object_type,
 + show_reachable_fn show_reach)
 +{
[...]
 + while (i  objects-word_alloc  ewah_iterator_next(filter, it)) {
 + eword_t word = objects-words[i]  filter;
 +
 + for (offset = 0; offset  BITS_IN_WORD; ++offset) {
 + const unsigned char *sha1;
 + struct revindex_entry *entry;
 + uint32_t hash = 0;
 +
 + if ((word  offset) == 0)
 + break;
 +
 + offset += ewah_bit_ctz64(word  offset);
 +
 + if (pos + offset  bitmap_git.reuse_objects)
 + continue;
 +
 + entry = bitmap_git.reverse_index-revindex[pos + 
 offset];
 + sha1 = nth_packed_object_sha1(bitmap_git.pack, 
 entry-nr);
 +
 + show_reach(sha1, object_type, 0, hash, bitmap_git.pack, 
 entry-offset);
 + }

You have a very nice bitmap_each_bit() function in ewah/bitmap.c, why
not use it here?

 +int reuse_partial_packfile_from_bitmap(struct packed_git **packfile,
 +uint32_t *entries,
 +off_t *up_to)
 +{
 + /*
 +  * Reuse the packfile content if we need more than
 +  * 90% of its objects
 +  */
 + static const double REUSE_PERCENT = 0.9;

Curious: is this based on some measurements or just a guess?


 diff --git a/pack-bitmap.h b/pack-bitmap.h
[...]
 +static const char BITMAP_IDX_SIGNATURE[] = {'B', 'I', 'T', 'M'};;

There's a stray ; at the end of the line that is technically not
permitted:

pack-bitmap.h:22:65: warning: ISO C does not allow extra ‘;’ outside of a 
function [-Wpedantic]

 +enum pack_bitmap_opts {
 + BITMAP_OPT_FULL_DAG = 1,

And I think this trailing comma on the last enum item is also strictly
speaking not allowed, even though it is very nice to have:

pack-bitmap.h:28:27: warning: comma at end of enumerator list [-Wpedantic]

-- 
Thomas Rast
t...@thomasrast.ch
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 10/21] pack-bitmap: add support for bitmap indexes

2013-11-28 Thread Jeff King
On Wed, Nov 27, 2013 at 10:08:56AM +0100, Karsten Blees wrote:

 Khash is OK for sha1 keys, but I don't think it should be advertised
 as a second general purpose hash table implementation. Its far too
 easy to shoot yourself in the foot by using 'straightforward' hash-
 and comparison functions. Khash doesn't store the hash codes of the
 keys, so you have to take care of that yourself or live with the
 performance penalties (see [1]).
 
 [1] http://article.gmane.org/gmane.comp.version-control.git/237876

Yes. I wonder if we should improve it in that respect. I haven't looked
carefully at the hash code you posted elsewhere, but I feel like many
uses will want a macro implementation to let them store arbitrary types
smaller or larger than a pointer.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 10/21] pack-bitmap: add support for bitmap indexes

2013-11-27 Thread Karsten Blees
Am 24.11.2013 22:36, schrieb Thomas Rast:
 Jeff King p...@peff.net writes:
[...]
 
 I think I'll also lend you a hand writing 
 Documentation/technical/api-khash.txt
 (expect it tomorrow) so that we also have documentation in the git
 style, where gitters can be expected to find it on their own.
 

Khash is OK for sha1 keys, but I don't think it should be advertised as a 
second general purpose hash table implementation. Its far too easy to shoot 
yourself in the foot by using 'straightforward' hash- and comparison functions. 
Khash doesn't store the hash codes of the keys, so you have to take care of 
that yourself or live with the performance penalties (see [1]).

[1] http://article.gmane.org/gmane.comp.version-control.git/237876

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 10/21] pack-bitmap: add support for bitmap indexes

2013-11-24 Thread Thomas Rast
Jeff King p...@peff.net writes:

  khash.h   | 338 
[...]
 diff --git a/khash.h b/khash.h
 new file mode 100644
 index 000..57ff603
 --- /dev/null
 +++ b/khash.h
 @@ -0,0 +1,338 @@
 +/* The MIT License
 +
 +   Copyright (c) 2008, 2009, 2011 by Attractive Chaos attrac...@live.co.uk
 +
 +   Permission is hereby granted, free of charge, to any person obtaining
 +   a copy of this software and associated documentation files (the
 +   Software), to deal in the Software without restriction, including
 +   without limitation the rights to use, copy, modify, merge, publish,
 +   distribute, sublicense, and/or sell copies of the Software, and to
 +   permit persons to whom the Software is furnished to do so, subject to
 +   the following conditions:
 +
 +   The above copyright notice and this permission notice shall be
 +   included in all copies or substantial portions of the Software.
 +
 +   THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND,
 +   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 +   MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 +   NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 +   BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 +   ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 +   CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 +   SOFTWARE.
 +*/
 +
 +#ifndef __AC_KHASH_H
 +#define __AC_KHASH_H
[...]
 +static inline khint_t __kh_oid_hash(const unsigned char *oid)
 +{
 + khint_t hash;
 + memcpy(hash, oid, sizeof(hash));
 + return hash;
 +}
 +
 +#define __kh_oid_cmp(a, b) (hashcmp(a, b) == 0)
 +
 +KHASH_INIT(sha1, const unsigned char *, void *, 1, __kh_oid_hash, 
 __kh_oid_cmp)
 +typedef kh_sha1_t khash_sha1;
 +
 +KHASH_INIT(sha1_pos, const unsigned char *, int, 1, __kh_oid_hash, 
 __kh_oid_cmp)
 +typedef kh_sha1_pos_t khash_sha1_pos;
 +
 +#endif /* __AC_KHASH_H */

AFAICS, the part after the [...] are additions specific to git.  Is that
right?

Can we store them in a separate khash-sha1.h or some such, to make it
clearer what's what?  As things stand, one has to look for an identifier
that is built from macros at the far end of a file that looks like it
was imported verbatim from klib(?).  I don't know about you, but that
just took me far too long to get right.

I think I'll also lend you a hand writing Documentation/technical/api-khash.txt
(expect it tomorrow) so that we also have documentation in the git
style, where gitters can be expected to find it on their own.

All that could then nicely fit into a commit that actually says where
you conjured khash.h from and such.  This information seems
conspicuously absent from this commit's message :-)

Furthermore, would it be a problem to name the second hash sha1_int
instead?  I have another use for such a hash, and I can't imagine I'm
the only one.  (That's not critical however, I can do the required
editing in that other series.)

-- 
Thomas Rast
t...@thomasrast.ch
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 10/21] pack-bitmap: add support for bitmap indexes

2013-11-14 Thread Jeff King
From: Vicent Marti tan...@gmail.com

A bitmap index is a `.bitmap` file that can be found inside
`$GIT_DIR/objects/pack/`, next to its corresponding packfile, and
contains precalculated reachability information for selected commits.
The full specification of the format for these bitmap indexes can be found
in `Documentation/technical/bitmap-format.txt`.

For a given commit SHA1, if it happens to be available in the bitmap
index, its bitmap will represent every single object that is reachable
from the commit itself. The nth bit in the bitmap is the nth object in
the packfile; if it's set to 1, the object is reachable.

By using the bitmaps available in the index, this commit implements
several new functions:

- `prepare_bitmap_git`
- `prepare_bitmap_walk`
- `traverse_bitmap_commit_list`
- `reuse_partial_packfile_from_bitmap`

The `prepare_bitmap_walk` function tries to build a bitmap of all the
objects that can be reached from the commit roots of a given `rev_info`
struct by using the following algorithm:

- If all the interesting commits for a revision walk are available in
the index, the resulting reachability bitmap is the bitwise OR of all
the individual bitmaps.

- When the full set of WANTs is not available in the index, we perform a
partial revision walk using the commits that don't have bitmaps as
roots, and limiting the revision walk as soon as we reach a commit that
has a corresponding bitmap. The earlier OR'ed bitmap with all the
indexed commits can now be completed as this walk progresses, so the end
result is the full reachability list.

- For revision walks with a HAVEs set (a set of commits that are deemed
uninteresting), first we perform the same method as for the WANTs, but
using our HAVEs as roots, in order to obtain a full reachability bitmap
of all the uninteresting commits. This bitmap then can be used to:

a) limit the subsequent walk when building the WANTs bitmap
b) finding the final set of interesting commits by performing an
   AND-NOT of the WANTs and the HAVEs.

If `prepare_bitmap_walk` runs successfully, the resulting bitmap is
stored and the equivalent of a `traverse_commit_list` call can be
performed by using `traverse_bitmap_commit_list`; the bitmap version
of this call yields the objects straight from the packfile index
(without having to look them up or parse them) and hence is several
orders of magnitude faster.

As an extra optimization, when `prepare_bitmap_walk` succeeds, the
`reuse_partial_packfile_from_bitmap` call can be attempted: it will find
the amount of objects at the beginning of the on-disk packfile that can
be reused as-is, and return an offset into the packfile. The source
packfile can then be loaded and the bytes up to `offset` can be written
directly to the result without having to consider the entires inside the
packfile individually.

If the `prepare_bitmap_walk` call fails (e.g. because no bitmap files
are available), the `rev_info` struct is left untouched, and can be used
to perform a manual rev-walk using `traverse_commit_list`.

Hence, this new set of functions are a generic API that allows to
perform the equivalent of

git rev-list --objects [roots...] [^uninteresting...]

for any set of commits, even if they don't have specific bitmaps
generated for them.

In further patches, we'll use this bitmap traversal optimization to
speed up the `pack-objects` and `rev-list` commands.

Signed-off-by: Vicent Marti tan...@gmail.com
Signed-off-by: Jeff King p...@peff.net
---
 Makefile  |   2 +
 khash.h   | 338 
 pack-bitmap.c | 970 ++
 pack-bitmap.h |  43 +++
 4 files changed, 1353 insertions(+)
 create mode 100644 khash.h
 create mode 100644 pack-bitmap.c
 create mode 100644 pack-bitmap.h

diff --git a/Makefile b/Makefile
index 64a1ed7..b983d78 100644
--- a/Makefile
+++ b/Makefile
@@ -699,6 +699,7 @@ LIB_H += object.h
 LIB_H += pack-objects.h
 LIB_H += pack-revindex.h
 LIB_H += pack.h
+LIB_H += pack-bitmap.h
 LIB_H += parse-options.h
 LIB_H += patch-ids.h
 LIB_H += pathspec.h
@@ -837,6 +838,7 @@ LIB_OBJS += notes-cache.o
 LIB_OBJS += notes-merge.o
 LIB_OBJS += notes-utils.o
 LIB_OBJS += object.o
+LIB_OBJS += pack-bitmap.o
 LIB_OBJS += pack-check.o
 LIB_OBJS += pack-objects.o
 LIB_OBJS += pack-revindex.o
diff --git a/khash.h b/khash.h
new file mode 100644
index 000..57ff603
--- /dev/null
+++ b/khash.h
@@ -0,0 +1,338 @@
+/* The MIT License
+
+   Copyright (c) 2008, 2009, 2011 by Attractive Chaos attrac...@live.co.uk
+
+   Permission is hereby granted, free of charge, to any person obtaining
+   a copy of this software and associated documentation files (the
+   Software), to deal in the Software without restriction, including
+   without limitation the rights to use, copy, modify, merge, publish,
+   distribute, sublicense, and/or sell copies of the Software, and to
+   permit persons to whom the Software is