Re: [PATCH v2] pack-objects: fix performance issues on packing large deltas

2018-07-26 Thread Johannes Sixt

Am 22.07.2018 um 10:04 schrieb Nguyễn Thái Ngọc Duy:

+   if (size < pack->oe_delta_size_limit) {
+   e->delta_size_ = size;
+   e->delta_size_valid = 1;
+   } else {
+   packing_data_lock(pack);
+   if (!pack->delta_size)
+   ALLOC_ARRAY(pack->delta_size, pack->nr_alloc);
+   packing_data_unlock(pack);
+
+   pack->delta_size[e - pack->objects] = size;


My first thought was that this is wrong (falling prey to the same 
mistake as the double-checked locking pattern). But after thinking twice 
over it again, I think that this unprotected access of pack->delta_size 
is thread-safe.


Of course, I'm assuming that different threads never assign to the same 
array index.



+   e->delta_size_valid = 0;
+   }
  }


-- Hannes


Re: [PATCH v2] pack-objects: fix performance issues on packing large deltas

2018-07-23 Thread Jeff King
On Mon, Jul 23, 2018 at 08:49:59PM +0200, Duy Nguyen wrote:

> On Mon, Jul 23, 2018 at 8:38 PM Duy Nguyen  wrote:
> > I will have to study the thread dispatch code more to have a better
> > answer, unfortunately.
> 
> Well.. I thought I would need this weekend for this, but a quick look
> and ll_find_deltas() suggests that what we're doing is safe. At least
> you and Jeff are way to familiar with the delta window concept in
> pack-objects. So in multithread mode, we have a big list of all
> objects, then the list is cut in N sublists and each is owned entirely
> by one thread. Each thread then can slide a window over this sublist
> to search for the best delta.
> 
> Because of this partitioning, if trg_entry is being processed now, it
> will not be accessed by any other thread. It's owned by this thread
> and will be accessed again as src_entry when the next entry becomes
> the delta target in the same thread. As long as we don't access
> objects of another thread (and my v1 violates this) we should be ok.

Yes, that matches my knowledge of how this all works. And if it didn't,
I think the code would have been racy even _before_ your patches.

The only thing that this pack->delta_size approach is changing is that
managing that array needs to happen under lock, because it touches the
whole list.

And as long as we back-fill it from any arbitrary e->delta_size_, that
means that touching e->delta_size_ needs to be done under lock. That's
why I think keeping the individual valid flag in each entry makes the
most sense. Then whenever we overflow a particular e->delta_size_, we
don't have to care about anybody else's size.

Which I think is what your v2 patch is doing.

-Peff


Re: [PATCH v2] pack-objects: fix performance issues on packing large deltas

2018-07-23 Thread Duy Nguyen
On Mon, Jul 23, 2018 at 8:38 PM Duy Nguyen  wrote:
> I will have to study the thread dispatch code more to have a better
> answer, unfortunately.

Well.. I thought I would need this weekend for this, but a quick look
and ll_find_deltas() suggests that what we're doing is safe. At least
you and Jeff are way to familiar with the delta window concept in
pack-objects. So in multithread mode, we have a big list of all
objects, then the list is cut in N sublists and each is owned entirely
by one thread. Each thread then can slide a window over this sublist
to search for the best delta.

Because of this partitioning, if trg_entry is being processed now, it
will not be accessed by any other thread. It's owned by this thread
and will be accessed again as src_entry when the next entry becomes
the delta target in the same thread. As long as we don't access
objects of another thread (and my v1 violates this) we should be ok.

At the end of ll_find_deltas() though, we have the "stealing" stuff,
but it's protected by progress_lock() already, outside try_delta() so
we're sure nobody is updating any object_entry when the stealing
happens.

I will double check again this weekend just to be sure.
-- 
Duy


Re: [PATCH v2] pack-objects: fix performance issues on packing large deltas

2018-07-23 Thread Duy Nguyen
On Mon, Jul 23, 2018 at 8:04 PM Junio C Hamano  wrote:
>
> Nguyễn Thái Ngọc Duy   writes:
>
> > Access to e->delta_size_ (and by extension
> > pack->delta_size[e - pack->objects]) is unprotected as before, the
> > thread scheduler in pack-objects must make sure "e" is never updated
> > by two different threads.
>
> OK.  Do we need to worry about "e" (e.g. "e->delta_size_valid")
> being accessed while/before it is set by another thread?

I left some details out because I think it's getting to a gray area.
We already do this

if (!DELTA(trg_entry)) {
max_size = trg_size/2 - the_hash_algo->rawsz;
ref_depth = 1;
} else {
max_size = DELTA_SIZE(trg_entry);
...
SET_DELTA(trg_entry, src_entry);
SET_DELTA_SIZE(trg_entry, delta_size);

if the bottom half is running on one thread and stopped in the middle
while the top half is running in another thread, we have a problem.
But perhaps max_size is not that big a deal because incorrect max_size
may produce a bad pack but can't corrupt it.

I will have to study the thread dispatch code more to have a better
answer, unfortunately.
--
Duy


Re: [PATCH v2] pack-objects: fix performance issues on packing large deltas

2018-07-23 Thread Junio C Hamano
Nguyễn Thái Ngọc Duy   writes:

> Access to e->delta_size_ (and by extension
> pack->delta_size[e - pack->objects]) is unprotected as before, the
> thread scheduler in pack-objects must make sure "e" is never updated
> by two different threads.

OK.  Do we need to worry about "e" (e.g. "e->delta_size_valid")
being accessed while/before it is set by another thread?

oe_delta_size() makes unprotected accesses to .delta_size_ and
pack->delta_size[e - pack->objects], so we apparently do not, and
oe_set_delta_size() only protects the allocation call and does not
prevent a reader in oe_delta_size() from first reading the _valid
field, noticing that it is 0 as initialized, and goes on to read the
pack->delta_size[] slot for the entry, while the writer is setting
the size to .delta_size_ field and flipping _valid bit, without ever
storing the size in the pack->delta_size[] array.

> @@ -130,6 +131,7 @@ struct packing_data {
>   uint32_t index_size;
>  
>   unsigned int *in_pack_pos;
> + unsigned long *delta_size;
>  
>   /*
>* Only one of these can be non-NULL and they have different
> @@ -140,10 +142,29 @@ struct packing_data {
>   struct packed_git **in_pack_by_idx;
>   struct packed_git **in_pack;
>  
> +#ifndef NO_PTHREADS
> + pthread_mutex_t lock;

I am wondering if we want the variable to say what data it is
protecting from simultaneous accesses, or leave it as generic so
that any new caller that wants to lock any (new) thing that is
associated with a packing_data structure can grab it for other
purposes.  The design of this patch clearly is the latter, which is
OK for now, I think.

> @@ -332,18 +353,34 @@ static inline unsigned long oe_delta_size(struct 
> packing_data *pack,
>  {
>   if (e->delta_size_valid)
>   return e->delta_size_;
> - return oe_size(pack, e);
> +
> + /*
> +  * pack->detla_size[] can't be NULL because oe_set_delta_size()
> +  * must have been called when a new delta is saved with
> +  * oe_set_delta().
> +  * If oe_delta() returns NULL (i.e. default state, which means
> +  * delta_size_valid is also false), then the caller must never
> +  * call oe_delta_size().
> +  */
> + return pack->delta_size[e - pack->objects];
>  }
>  
>  static inline void oe_set_delta_size(struct packing_data *pack,
>struct object_entry *e,
>unsigned long size)
>  {
> - e->delta_size_ = size;
> - e->delta_size_valid = e->delta_size_ == size;
> - if (!e->delta_size_valid && size != oe_size(pack, e))
> - BUG("this can only happen in check_object() "
> - "where delta size is the same as entry size");
> + if (size < pack->oe_delta_size_limit) {
> + e->delta_size_ = size;
> + e->delta_size_valid = 1;
> + } else {
> + packing_data_lock(pack);
> + if (!pack->delta_size)
> + ALLOC_ARRAY(pack->delta_size, pack->nr_alloc);
> + packing_data_unlock(pack);
> +
> + pack->delta_size[e - pack->objects] = size;
> + e->delta_size_valid = 0;
> + }
>  }


[PATCH v2] pack-objects: fix performance issues on packing large deltas

2018-07-22 Thread Nguyễn Thái Ngọc Duy
Let's start with some background about oe_delta_size() and
oe_set_delta_size(). If you already know, skip the next paragraph.

These two are added in 0aca34e826 (pack-objects: shrink delta_size
field in struct object_entry - 2018-04-14) to help reduce 'struct
object_entry' size. The delta size field in this struct is reduced to
only contain max 1MB. So if any new delta is produced and larger than
1MB, it's dropped because we can't really save such a large size
anywhere. Fallback is provided in case existing packfiles already have
large deltas, then we can retrieve it from the pack.

While this should help small machines repacking large repos without
large deltas (i.e. less memory pressure), dropping large deltas during
the delta selection process could end up with worse pack files. And if
existing packfiles already have >1MB delta and pack-objects is
instructed to not reuse deltas, all of them will be dropped on the
floor, and the resulting pack would be definitely bigger.

There is also a regression in terms of CPU/IO if we have large on-disk
deltas because fallback code needs to parse the pack every time the
delta size is needed and just access to the mmap'd pack data is enough
for extra page faults when memory is under pressure.

Both of these issues were reported on the mailing list. Here's some
numbers for comparison.

Version  Pack (MB)  MaxRSS(kB)  Time (s)
---  -  --  
 2.17.0 5498 435136282494.85
 2.18.010531 404495964168.94

This patch provides a better fallback that is

- cheaper in terms of cpu and io because we won't have to read
  existing pack files as much

- better in terms of pack size because the pack heuristics is back to
  2.17.0 time, we do not drop large deltas at all

If we encounter any delta (on-disk or created during try_delta phase)
that is larger than the 1MB limit, we stop using delta_size_ field for
this because it can't contain such size anyway. A new array of delta
size is dynamically allocated and can hold all the deltas that 2.17.0
can. This array only contains delta sizes that delta_size_ can't
contain.

With this, we do not have to drop deltas in try_delta() anymore. Of
course the downside is we use slightly more memory, even compared to
2.17.0. But since this is considered an uncommon case, a bit more
memory consumption should not be a problem.

Delta size limit is also raised from 1MB to 16MB to better cover
common case and avoid that extra memory consumption (99.999% deltas in
this reported repo are under 12MB; Jeff noted binary artifacts topped
out at about 3MB in some other private repos). Other fields are
shuffled around to keep this struct packed tight. We don't use more
memory in common case even with this limit update.

A note about thread synchronization. Since this code can be run in
parallel during delta searching phase, we need a mutex. The realloc
part in packlist_alloc() is not protected because it only happens
during the object counting phase, which is always single-threaded.

Access to e->delta_size_ (and by extension
pack->delta_size[e - pack->objects]) is unprotected as before, the
thread scheduler in pack-objects must make sure "e" is never updated
by two different threads.

The area under the new lock is as small as possible, avoiding locking
at all in common case, since lock contention with high thread count
could be expensive (most blobs are small enough that delta compute
time is short and we end up taking the lock very often). The previous
attempt to always hold a lock in oe_delta_size() and
oe_set_delta_size() increases execution time by 33% when repacking
linux.git with with 40 threads.

Reported-by: Elijah Newren 
Helped-by: Elijah Newren 
Helped-by: Jeff King 
Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 v2 should keep execution time back to 2.17.0 range (I never thought
 locks are that expensive..). Changes since v1:

 - oe_delta_size() and oe_set_delta_size() rewritten to keep locking
   to minimum. This also brings back delta_size_valid field so that
   we avoid accessing shared data as much as possible.
 - delta_size[] is now an array of unsigned long instead of uint32_t
   (i.e. 64-bit on linux, 32-bit on windows, like in 2.17.0)
 - delta size limit is reduced to 16 MB
 - lock_initialized field is removed

 builtin/pack-objects.c|  5 +---
 ci/run-build-and-tests.sh |  1 +
 pack-objects.c|  4 +++
 pack-objects.h| 59 +++
 t/README  |  4 +++
 5 files changed, 58 insertions(+), 15 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index ebc8cefb53..a77659f9ce 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2023,10 +2023,6 @@ static int try_delta(struct unpacked *trg, struct 
unpacked *src,
delta_buf = create_delta(src->index, trg->data, trg_size, _size, 
max_size);
if (!delta_buf)
return 0;
-   if (delta_size >= (1U