Re: [PATCH v2 9/9] pack-objects: improve partial packfile reuse
First, the commit message could probably be reordered to start with the
main point (reusing of packfiles at object granularity instead of
allowing reuse of only one contiguous region). To specific points:
> The dynamic array of `struct reused_chunk` is useful
> because we need to know not just the number of zero bits,
> but the accumulated offset due to missing objects.
The number of zero bits is irrelevant, I think. (I know I introduced
this idea, but at that time I was somehow confused that the offset was a
number of objects and not a number of bytes.)
> If a client both asks for a tag by sha1 and specifies
> "include-tag", we may end up including the tag in the reuse
> bitmap (due to the first thing), and then later adding it to
> the packlist (due to the second). This results in duplicate
> objects in the pack, which git chokes on. We should notice
> that we are already including it when doing the include-tag
> portion, and avoid adding it to the packlist.
I still don't understand this part. Going off what's written in the
commit message here, it seems to me that the issue is that (1) an object
can be added to the reuse bitmap without going through to_pack, and (2)
this is done when the client asks for a tag by SHA-1. But all objects
when specified by SHA-1 go through to_pack, don't they?
On to the review of the code. The ordering of - and + lines are
confusing, so I have just removed all - lines.
> +/*
> + * Record the offsets needed in our reused packfile chunks due to
> + * "gaps" where we omitted some objects.
> + */
> +static struct reused_chunk {
> + off_t start;
> + off_t offset;
> +} *reused_chunks;
> +static int reused_chunks_nr;
> +static int reused_chunks_alloc;
A chunk is a set of objects from the original packfile that we are
reusing; all objects in a chunk have the same relative position in the
original packfile and the generated packfile. (This does not mean that
the bytes are exactly the same or, even, that the last object in the
chunk has the same size in the original packfile and the generated
packfile.)
"start" is the offset of the first object of the chunk in the original
packfile.
"offset" is "start" minus the offset of the first object of the chunk in
the generated packfile. (I think it is easier to understand if this was
"new minus old" instead of "old minus new", that is, the negation of its
current value.)
So I would prefer:
/*
* A reused set of objects. All objects in a chunk have the same
* relative position in the original packfile and the generated
* packfile.
*/
struct reused_chunk {
/* The offset of the first object of this chunk in the original
* packfile. */
off_t original;
/* The offset of the first object of this chunk in the generated
* packfile minus "original". */
off_t difference;
}
> +static void record_reused_object(off_t where, off_t offset)
Straightforward, other than the renaming of parameters.
> +/*
> + * Binary search to find the chunk that "where" is in. Note
> + * that we're not looking for an exact match, just the first
> + * chunk that contains it (which implicitly ends at the start
> + * of the next chunk.
> + */
> +static off_t find_reused_offset(off_t where)
Also straightforward.
> +static void write_reused_pack_one(size_t pos, struct hashfile *out,
> + struct pack_window **w_curs)
> +{
> + off_t offset, next, cur;
> + enum object_type type;
> + unsigned long size;
> +
> + offset = reuse_packfile->revindex[pos].offset;
> + next = reuse_packfile->revindex[pos + 1].offset;
>
> + record_reused_object(offset, offset - hashfile_total(out));
This is where I understood what "offset" meant in struct reused_chunk.
>
> + cur = offset;
> + type = unpack_object_header(reuse_packfile, w_curs, &cur, &size);
> + assert(type >= 0);
>
> + if (type == OBJ_OFS_DELTA) {
> + off_t base_offset;
> + off_t fixup;
> +
> + unsigned char header[MAX_PACK_OBJECT_HEADER];
> + unsigned len;
> +
> + base_offset = get_delta_base(reuse_packfile, w_curs, &cur,
> type, offset);
> + assert(base_offset != 0);
> +
> + /* Convert to REF_DELTA if we must... */
> + if (!allow_ofs_delta) {
> + int base_pos = find_revindex_position(reuse_packfile,
> base_offset);
> + const unsigned char *base_sha1 =
> + nth_packed_object_sha1(reuse_packfile,
> +
> reuse_packfile->revindex[base_pos].nr);
> +
> + len = encode_in_pack_object_header(header,
> sizeof(header),
> +OBJ_REF_DELTA, size);
> + hashwrite(out, header, len);
> + hashwrite(out, base_sha1, 20);
> + copy_pack_data(out, reuse_packfile, w_curs, cur, next
Re: [PATCH v2 9/9] pack-objects: improve partial packfile reuse
On 20/10/2019 00:23, Jeff King wrote:
On Sat, Oct 19, 2019 at 09:20:11PM +0200, Christian Couder wrote:
+static void write_reused_pack_one(size_t pos, struct hashfile *out,
+ struct pack_window **w_curs)
+{
+ off_t offset, next, cur;
+ enum object_type type;
+ unsigned long size;
Is this a mem_sized size or a counter for less that 4GiB items?
What I can see is that `&size` is passed as the last argument to
unpack_object_header() below. And unpack_object_header() is defined in
packfile.h like this:
int unpack_object_header(struct packed_git *, struct pack_window **,
off_t *, unsigned long *);
since at least 336226c259 (packfile.h: drop extern from function
declarations, 2019-04-05)
So fixing this, if it needs to be fixed, should probably be part of a
separate topic fixing unpack_object_header().
Yeah, this one definitely should be moved to whatever we used to
represent object sizes in the future (size_t, or I guess off_t if we
really want to handle huge objects on 32-bit systems too). But
definitely it shouldn't happen in this series, and I don't think anybody
interested in the other topic (converting the integer type for object
sizes) needs to keep tabs on it. When they convert
unpack_object_header(), the compiler will complain because of passing
it as a pointer (the more insidious ones will be where we return an
unsigned long to represent an object type, and somebody will have to
look into every caller).
Thanks, I'll add that one to my list of size values that I should check
when rebasing my current series.
If I understand correctly gcc is no longer detecting those size
conversions, but clang has -Wshorten-64-to-32 but I've still to check if
it catches some of these conversion issues (esp when int (32) is
extended to 64 bit size_t, rather than being 64 bit in the first place)
Philip
Re: [PATCH v2 9/9] pack-objects: improve partial packfile reuse
On Sat, Oct 19, 2019 at 09:20:11PM +0200, Christian Couder wrote:
> > > +static void write_reused_pack_one(size_t pos, struct hashfile *out,
> > > + struct pack_window **w_curs)
> > > +{
> > > + off_t offset, next, cur;
> > > + enum object_type type;
> > > + unsigned long size;
> >
> > Is this a mem_sized size or a counter for less that 4GiB items?
>
> What I can see is that `&size` is passed as the last argument to
> unpack_object_header() below. And unpack_object_header() is defined in
> packfile.h like this:
>
> int unpack_object_header(struct packed_git *, struct pack_window **,
> off_t *, unsigned long *);
>
> since at least 336226c259 (packfile.h: drop extern from function
> declarations, 2019-04-05)
>
> So fixing this, if it needs to be fixed, should probably be part of a
> separate topic fixing unpack_object_header().
Yeah, this one definitely should be moved to whatever we used to
represent object sizes in the future (size_t, or I guess off_t if we
really want to handle huge objects on 32-bit systems too). But
definitely it shouldn't happen in this series, and I don't think anybody
interested in the other topic (converting the integer type for object
sizes) needs to keep tabs on it. When they convert
unpack_object_header(), the compiler will complain because of passing
it as a pointer (the more insidious ones will be where we return an
unsigned long to represent an object type, and somebody will have to
look into every caller).
-Peff
Re: [PATCH v2 9/9] pack-objects: improve partial packfile reuse
Hi Philip,
On Sat, Oct 19, 2019 at 5:30 PM Philip Oakley wrote:
> On 19/10/2019 11:35, Christian Couder wrote:
> > +static void write_reused_pack_one(size_t pos, struct hashfile *out,
> > + struct pack_window **w_curs)
> > +{
> > + off_t offset, next, cur;
> > + enum object_type type;
> > + unsigned long size;
>
> Is this a mem_sized size or a counter for less that 4GiB items?
What I can see is that `&size` is passed as the last argument to
unpack_object_header() below. And unpack_object_header() is defined in
packfile.h like this:
int unpack_object_header(struct packed_git *, struct pack_window **,
off_t *, unsigned long *);
since at least 336226c259 (packfile.h: drop extern from function
declarations, 2019-04-05)
So fixing this, if it needs to be fixed, should probably be part of a
separate topic fixing unpack_object_header().
> > +
> > + offset = reuse_packfile->revindex[pos].offset;
> > + next = reuse_packfile->revindex[pos + 1].offset;
> >
> > - if (reuse_packfile_offset < 0)
> > - reuse_packfile_offset = reuse_packfile->pack_size -
> > the_hash_algo->rawsz;
> > + record_reused_object(offset, offset - hashfile_total(out));
> >
> > - total = to_write = reuse_packfile_offset - sizeof(struct pack_header);
> > + cur = offset;
> > + type = unpack_object_header(reuse_packfile, w_curs, &cur, &size);
> > + assert(type >= 0);
> > +static void try_partial_reuse(struct bitmap_index *bitmap_git,
> > + size_t pos,
> > + struct bitmap *reuse,
> > + struct pack_window **w_curs)
> > {
> > + struct revindex_entry *revidx;
> > + off_t offset;
> > + enum object_type type;
> > + unsigned long size;
>
> Is this mem_sized or a <4GiB size?
Again this `size` variable is passed as the last argument to
unpack_object_header() below.
...
> Apologies if these are dumb queries..
I think it's a valid concern, so it's certainly ok for you to ask
these questions.
Re: [PATCH v2 9/9] pack-objects: improve partial packfile reuse
Hi Christian,
a couple of mem_size questions?
On 19/10/2019 11:35, Christian Couder wrote:
From: Jeff King
Let's store the chunks of the packfile that we reuse in
a dynamic array of `struct reused_chunk`, and let's use
a reuse_packfile_bitmap to speed up reusing parts of
packfiles.
The dynamic array of `struct reused_chunk` is useful
because we need to know not just the number of zero bits,
but the accumulated offset due to missing objects. So
without the array we'd end up having to walk over the
revindex for that set of objects. The array is
basically caching those accumulated offsets (for the
parts we _do_ include), so we don't have to compute them
repeatedly.
The old code just tried to dump a whole segment of the
pack verbatim. That's faster than the traditional way of
actually adding objects to the packing list, but it
didn't kick in very often. This new code is really going
for a middle ground: do _some_ per-object work, but way
less than we'd traditionally do.
For instance, packing torvalds/linux on GitHub servers
just now reused 6.5M objects, but only needed ~50k
chunks.
Additional checks are added in have_duplicate_entry()
and obj_is_packed() to avoid duplicate objects in the
reuse bitmap. It was probably buggy to not have such a
check before.
If a client both asks for a tag by sha1 and specifies
"include-tag", we may end up including the tag in the reuse
bitmap (due to the first thing), and then later adding it to
the packlist (due to the second). This results in duplicate
objects in the pack, which git chokes on. We should notice
that we are already including it when doing the include-tag
portion, and avoid adding it to the packlist.
The simplest place to fix this is right in add_ref_tag,
where we could avoid peeling the tag at all if we know that
we are already including it. However, I've pushed the check
instead into have_duplicate_entry(). This fixes not only
this case, but also means that we cannot have any similar
problems lurking in other code.
No tests, because git does not actually exhibit this "ask
for it and also include-tag" behavior. We do one or the
other on clone, depending on whether --single-branch is set.
However, libgit2 does both.
Signed-off-by: Jeff King
Signed-off-by: Christian Couder
---
builtin/pack-objects.c | 214 -
pack-bitmap.c | 150 +
pack-bitmap.h | 3 +-
3 files changed, 280 insertions(+), 87 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 08898331ef..f710f8dde3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -92,7 +92,7 @@ static struct progress *progress_state;
static struct packed_git *reuse_packfile;
static uint32_t reuse_packfile_objects;
-static off_t reuse_packfile_offset;
+static struct bitmap *reuse_packfile_bitmap;
static int use_bitmap_index_default = 1;
static int use_bitmap_index = -1;
@@ -785,57 +785,177 @@ static struct object_entry **compute_write_order(void)
return wo;
}
-static off_t write_reused_pack(struct hashfile *f)
+/*
+ * Record the offsets needed in our reused packfile chunks due to
+ * "gaps" where we omitted some objects.
+ */
+static struct reused_chunk {
+ off_t start;
+ off_t offset;
+} *reused_chunks;
+static int reused_chunks_nr;
+static int reused_chunks_alloc;
+
+static void record_reused_object(off_t where, off_t offset)
{
- unsigned char buffer[8192];
- off_t to_write, total;
- int fd;
+ if (reused_chunks_nr && reused_chunks[reused_chunks_nr-1].offset ==
offset)
+ return;
- if (!is_pack_valid(reuse_packfile))
- die(_("packfile is invalid: %s"), reuse_packfile->pack_name);
+ ALLOC_GROW(reused_chunks, reused_chunks_nr + 1,
+ reused_chunks_alloc);
+ reused_chunks[reused_chunks_nr].start = where;
+ reused_chunks[reused_chunks_nr].offset = offset;
+ reused_chunks_nr++;
+}
- fd = git_open(reuse_packfile->pack_name);
- if (fd < 0)
- die_errno(_("unable to open packfile for reuse: %s"),
- reuse_packfile->pack_name);
+/*
+ * Binary search to find the chunk that "where" is in. Note
+ * that we're not looking for an exact match, just the first
+ * chunk that contains it (which implicitly ends at the start
+ * of the next chunk.
+ */
+static off_t find_reused_offset(off_t where)
+{
+ int lo = 0, hi = reused_chunks_nr;
+ while (lo < hi) {
+ int mi = lo + ((hi - lo) / 2);
+ if (where == reused_chunks[mi].start)
+ return reused_chunks[mi].offset;
+ if (where < reused_chunks[mi].start)
+ hi = mi;
+ else
+ lo = mi + 1;
+ }
- if (lseek(fd, sizeof(struct pack_header), SEEK_SET) == -1)
- die_errno(_("unable to seek in reused packfile"));
+ /*

