Re: Fwd: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-07-13 Thread Claudio Freire
On Wed, Jul 12, 2017 at 1:29 PM, Claudio Freire  wrote:
> On Wed, Jul 12, 2017 at 1:08 PM, Claudio Freire  
> wrote:
>> On Wed, Jul 12, 2017 at 11:48 AM, Alexey Chernyshov
>>  wrote:
>>> Thank you for the patch and benchmark results; I have a couple of remarks.
>>> Firstly, padding in DeadTuplesSegment
>>>
>>> typedef struct DeadTuplesSegment
>>> {
>>>     ItemPointerData last_dead_tuple;    /* Copy of the last dead tuple
>>>                                          * (unset until the segment is
>>>                                          * fully populated). Keep it first
>>>                                          * to simplify binary searches */
>>>     unsigned short  padding;            /* Align dt_tids to 32-bits,
>>>                                          * sizeof(ItemPointerData) is
>>>                                          * aligned to short, so add a
>>>                                          * padding short, to make the size
>>>                                          * of DeadTuplesSegment a multiple
>>>                                          * of 32-bits and align integer
>>>                                          * components for better
>>>                                          * performance during lookups into
>>>                                          * the multiarray */
>>>     int             num_dead_tuples;    /* # of entries in the segment */
>>>     int             max_dead_tuples;    /* # of entries allocated in the
>>>                                          * segment */
>>>     ItemPointer     dt_tids;            /* Array of dead tuples */
>>> } DeadTuplesSegment;
>>>
>>> The comments to ItemPointerData say that it is 6 bytes long, but it can be
>>> padded to 8 bytes by some compilers, so if we add padding in the current
>>> way, there is no guarantee that it will be done as expected. The other way
>>> is to do it with pg_attribute_aligned. But in my opinion, there is no need
>>> to do it manually, because the compiler will do this optimization itself.
>>
>> I'll look into it. But my experience is that compilers won't align
>> struct size like this, only individual members, and this member is
>> composed of 16-bit fields, so it doesn't get aligned by default.
>
> Doing sizeof(DeadTuplesSegment) suggests you were indeed right, at
> least in GCC. I'll remove the padding.
>
> Seems I just got the wrong impression at some point.
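
For reference, this is the sort of quick standalone check that settles it.
ItemPointerLike below is a hypothetical 6-byte stand-in for ItemPointerData,
not the real type, and the exact numbers depend on platform and ABI:

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for ItemPointerData: three 16-bit fields, so 6 bytes with
 * 2-byte alignment. */
typedef struct
{
    uint16_t    blk_hi;             /* high half of the block number */
    uint16_t    blk_lo;             /* low half of the block number */
    uint16_t    offset;             /* offset number within the page */
} ItemPointerLike;

/* Stand-in for DeadTuplesSegment, without the manual padding short. */
typedef struct
{
    ItemPointerLike last_dead_tuple;
    int         num_dead_tuples;
    int         max_dead_tuples;
    ItemPointerLike *dt_tids;
} SegmentLike;

int
main(void)
{
    /* On a typical LP64 GCC target this prints 8 and 24: the compiler
     * already pads after the 6-byte member so that the int members land
     * on 4-byte boundaries, with no manual padding short needed. */
    printf("offsetof(num_dead_tuples) = %zu\n",
           offsetof(SegmentLike, num_dead_tuples));
    printf("sizeof(SegmentLike)       = %zu\n", sizeof(SegmentLike));
    return 0;
}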

Updated versions of the patches attached.

A few runs of the benchmark show no significant difference, as expected,
since the changes are all cosmetic.

The bigger benchmark will take longer.
From b2bf9a671c2c4d517d59628f12a7f564ccf152f6 Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.

Improve test ability to spot vacuum errors
---
 src/backend/commands/vacuumlazy.c| 402 ---
 src/test/regress/expected/vacuum.out |  26 +++
 src/test/regress/sql/vacuum.sql  |  19 ++
 3 files changed, 373 insertions(+), 74 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 30a0050..287accd 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -11,11 +11,24 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
+ * initially allocate an array of TIDs of 128MB, or an upper limit that
  * depends on table size (this limit ensures we don't allocate a huge area
- * uselessly for vacuuming small tables).  If the array threatens to overflow,
+ * uselessly for vacuuming small tables). Additional arrays of increasingly
+ * large sizes are allocated as they become necessary.
+ *
+ * The TID array is thus represented as a list of multiple segments of
+ * varying size, beginning with the initial size of up to 128MB, and growing
+ * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
+ * is used up.
+ *
+ * Lookup in that structure happens with a binary search of segments, and then
+ * with a binary search within each segment. Since segment's size grows
+ * exponentially, this retains O(log N) lookup complexity.
+ *
+ * If the multiarray's total size threatens to grow beyond maintenance_work_mem,
  * we suspend the heap scan phase and perform a pass of index cleanup and page
- * compaction, then resume the heap scan with an empty TID array.
+ * compaction, then resume the heap scan with an array of logically empty but
+ * already preallocated TID segments to be refilled with more dead tuple TIDs.
  *
  * If we're 
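
To make the layout described in the comment above concrete, here is a
hypothetical, heavily simplified sketch of the segmented multiarray and its
exponential growth. It is not the patch code: the type and function names
are made up, and palloc/memory-context handling and the work_mem budget
accounting are left out.

#include <stdlib.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData (block number + offset). */
typedef struct
{
    uint32_t    block;
    uint16_t    offset;
} TupleIdSketch;

/* One segment: a sorted array of dead tuple TIDs. */
typedef struct
{
    int         num_dead_tuples;    /* entries currently stored */
    int         max_dead_tuples;    /* entries allocated */
    TupleIdSketch *dt_tids;         /* the TIDs themselves */
} SegmentSketch;

/* The multiarray: a list of segments of exponentially growing size. */
typedef struct
{
    SegmentSketch *dt_segments;     /* array of segment descriptors */
    int         num_segs;           /* segments allocated so far */
    int         last_seg;           /* last segment currently holding data */
} MultiArraySketch;

/*
 * Move on to the next segment once the current one is full.  Preallocated
 * (logically empty) segments left over from a previous index-cleanup pass
 * are reused; otherwise a new segment twice the size of the previous one
 * is created, so the whole memory budget is reached in O(log N) segments.
 * The work_mem budget check and error handling are omitted.
 */
static void
sketch_add_segment(MultiArraySketch *dt)
{
    int         prev_size = dt->dt_segments[dt->last_seg].max_dead_tuples;

    dt->last_seg++;
    if (dt->last_seg < dt->num_segs)
        return;                     /* reuse an already-allocated segment */

    dt->dt_segments = realloc(dt->dt_segments,
                              (dt->num_segs + 1) * sizeof(SegmentSketch));
    dt->dt_segments[dt->num_segs].max_dead_tuples = prev_size * 2;
    dt->dt_segments[dt->num_segs].num_dead_tuples = 0;
    dt->dt_segments[dt->num_segs].dt_tids =
        malloc((size_t) prev_size * 2 * sizeof(TupleIdSketch));
    dt->num_segs++;
}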

Re: Fwd: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-07-12 Thread Claudio Freire
On Wed, Jul 12, 2017 at 1:08 PM, Claudio Freire  wrote:
> On Wed, Jul 12, 2017 at 11:48 AM, Alexey Chernyshov
>  wrote:
>> Thank you for the patch and benchmark results; I have a couple of remarks.
>> Firstly, padding in DeadTuplesSegment
>>
>> typedef struct DeadTuplesSegment
>> {
>>     ItemPointerData last_dead_tuple;    /* Copy of the last dead tuple
>>                                          * (unset until the segment is
>>                                          * fully populated). Keep it first
>>                                          * to simplify binary searches */
>>     unsigned short  padding;            /* Align dt_tids to 32-bits,
>>                                          * sizeof(ItemPointerData) is
>>                                          * aligned to short, so add a
>>                                          * padding short, to make the size
>>                                          * of DeadTuplesSegment a multiple
>>                                          * of 32-bits and align integer
>>                                          * components for better
>>                                          * performance during lookups into
>>                                          * the multiarray */
>>     int             num_dead_tuples;    /* # of entries in the segment */
>>     int             max_dead_tuples;    /* # of entries allocated in the
>>                                          * segment */
>>     ItemPointer     dt_tids;            /* Array of dead tuples */
>> } DeadTuplesSegment;
>>
>> The comments to ItemPointerData say that it is 6 bytes long, but it can be
>> padded to 8 bytes by some compilers, so if we add padding in the current
>> way, there is no guarantee that it will be done as expected. The other way
>> is to do it with pg_attribute_aligned. But in my opinion, there is no need
>> to do it manually, because the compiler will do this optimization itself.
>
> I'll look into it. But my experience is that compilers won't align
> struct size like this, only individual members, and this member is
> composed of 16-bit fields, so it doesn't get aligned by default.

Doing sizeof(DeadTuplesSegment) suggests you were indeed right, at
least in GCC. I'll remove the padding.

Seems I just got the wrong impression at some point.




Re: Fwd: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-07-12 Thread Claudio Freire
On Wed, Jul 12, 2017 at 11:48 AM, Alexey Chernyshov
 wrote:
> Thank you for the patch and benchmark results; I have a couple of remarks.
> Firstly, padding in DeadTuplesSegment
>
> typedef struct DeadTuplesSegment
> {
>     ItemPointerData last_dead_tuple;    /* Copy of the last dead tuple
>                                          * (unset until the segment is
>                                          * fully populated). Keep it first
>                                          * to simplify binary searches */
>     unsigned short  padding;            /* Align dt_tids to 32-bits,
>                                          * sizeof(ItemPointerData) is
>                                          * aligned to short, so add a
>                                          * padding short, to make the size
>                                          * of DeadTuplesSegment a multiple
>                                          * of 32-bits and align integer
>                                          * components for better
>                                          * performance during lookups into
>                                          * the multiarray */
>     int             num_dead_tuples;    /* # of entries in the segment */
>     int             max_dead_tuples;    /* # of entries allocated in the
>                                          * segment */
>     ItemPointer     dt_tids;            /* Array of dead tuples */
> } DeadTuplesSegment;
>
> The comments to ItemPointerData say that it is 6 bytes long, but it can be
> padded to 8 bytes by some compilers, so if we add padding in the current
> way, there is no guarantee that it will be done as expected. The other way
> is to do it with pg_attribute_aligned. But in my opinion, there is no need
> to do it manually, because the compiler will do this optimization itself.

I'll look into it. But my experience is that compilers won't align
struct size like this, only individual members, and this member is
composed of 16-bit fields, so it doesn't get aligned by default.

> On 11.07.2017 19:51, Claudio Freire wrote:
>>>
>>> -
>>>
>>> +   /* Search for the segment likely to contain the item pointer */
>>> +   iseg = vac_itemptr_binsrch(
>>> +   (void *) itemptr,
>>> +   (void *)
>>> &(vacrelstats->dead_tuples.dt_segments->last_dead_tuple),
>>> +   vacrelstats->dead_tuples.last_seg + 1,
>>> +   sizeof(DeadTuplesSegment));
>>> +
>>>
>>> I think that we can change the above to;
>>>
>>> +   /* Search for the segment likely to contain the item pointer */
>>> +   iseg = vac_itemptr_binsrch(
>>> +   (void *) itemptr,
>>> +   (void *) &(seg->last_dead_tuple),
>>> +   vacrelstats->dead_tuples.last_seg + 1,
>>> +   sizeof(DeadTuplesSegment));
>>>
>>> We set "seg = vacrelstats->dead_tuples.dt_segments" just before this.
>>
>> Right
>
> In my mind, if you replace vacrelstats->dead_tuples.last_seg + 1 with
> GetNumDeadTuplesSegments(vacrelstats), it would be more meaningful.

It's not the same thing. On the first run it might be, but after a reset of
the multiarray, num_segs is the allocated number of segments, while last_seg
is the last one actually filled with data.
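
To illustrate the distinction, here's a hypothetical, heavily simplified
sketch (not the patch's actual types or code): after an index-cleanup pass
resets the multiarray, the segments stay allocated for reuse, so num_segs
keeps its value while last_seg rewinds.

/* Hypothetical, simplified sketch -- not the patch's actual code. */
typedef struct
{
    int         num_dead_tuples;    /* entries stored in this segment */
    /* (the segment's allocation size and TID array are omitted) */
} SegmentCounters;

typedef struct
{
    SegmentCounters *dt_segments;
    int         num_segs;           /* segments allocated; never shrinks */
    int         last_seg;           /* last segment currently holding data */
} MultiArrayCounters;

static void
sketch_reset(MultiArrayCounters *dt)
{
    int         i;

    /* Mark the used segments logically empty, but keep their allocations. */
    for (i = 0; i <= dt->last_seg; i++)
        dt->dt_segments[i].num_dead_tuples = 0;

    dt->last_seg = 0;               /* refilling resumes at the first segment */

    /*
     * dt->num_segs is deliberately untouched: from this point on,
     * last_seg + 1 no longer equals the number of allocated segments,
     * which is why the search uses last_seg + 1 rather than num_segs.
     */
}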

> Besides, you can replace the vac_itemptr_binsrch within the segment with
> the stdlib bsearch, like:
>
> res = (ItemPointer) bsearch((void *) itemptr,
>                             (void *) seg->dt_tids,
>                             seg->num_dead_tuples,
>                             sizeof(ItemPointerData),
>                             vac_cmp_itemptr);
>
> return (res != NULL);

The stdlib's bsearch is considerably slower. The custom bsearch inlines the
comparison, which lets the compiler hoist quite a bit of logic out of the
loop and, in general, generate far more specialized assembly.

For the compiler to optimize the stdlib's bsearch call, whole-program
optimization would have to be enabled, which is unlikely. Even then, it may
not be able to, due to aliasing rules.

This is what I came up with to make the new approach's performance on par
with, or better than, the old one in CPU cycles. In fact, benchmarks show
that time spent on the CPU is lower now, in large part due to this.

It's not the first custom binary search in postgres, either.
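
For illustration, the shape of the specialized search is roughly the
following. This is a standalone sketch with hypothetical type and function
names, not the patch's actual vac_itemptr_binsrch; the key property is that
the comparison is visible to the compiler and gets inlined into the loop,
which stdlib bsearch's function-pointer callback prevents.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData (block number + offset). */
typedef struct { uint32_t block; uint16_t offset; } TidSketch;

/* The comparison the compiler gets to inline, unlike bsearch's callback. */
static inline int
tid_cmp(const TidSketch *a, const TidSketch *b)
{
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset)
        return (a->offset < b->offset) ? -1 : 1;
    return 0;
}

/*
 * Find the index of the first element whose TID is >= *key, among nelem
 * elements laid out every "stride" bytes starting at base.  With stride
 * equal to the segment struct's size and base pointing at the first
 * segment's last dead tuple, this picks the candidate segment; with
 * stride equal to sizeof(TidSketch), it searches within a segment's
 * TID array.
 */
static size_t
tid_binsrch(const TidSketch *key, const char *base, size_t nelem, size_t stride)
{
    size_t      lo = 0,
                hi = nelem;

    while (lo < hi)
    {
        size_t      mid = lo + (hi - lo) / 2;
        const TidSketch *elem = (const TidSketch *) (base + mid * stride);

        if (tid_cmp(elem, key) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;                      /* may equal nelem if key > all elements */
}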




Re: Fwd: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-07-12 Thread Alexey Chernyshov

Thank you for the patch and benchmark results; I have a couple of remarks.
Firstly, padding in DeadTuplesSegment

typedef struct DeadTuplesSegment
{
    ItemPointerData last_dead_tuple;    /* Copy of the last dead tuple
                                         * (unset until the segment is
                                         * fully populated). Keep it first
                                         * to simplify binary searches */
    unsigned short  padding;            /* Align dt_tids to 32-bits,
                                         * sizeof(ItemPointerData) is
                                         * aligned to short, so add a
                                         * padding short, to make the size
                                         * of DeadTuplesSegment a multiple
                                         * of 32-bits and align integer
                                         * components for better
                                         * performance during lookups into
                                         * the multiarray */
    int             num_dead_tuples;    /* # of entries in the segment */
    int             max_dead_tuples;    /* # of entries allocated in the
                                         * segment */
    ItemPointer     dt_tids;            /* Array of dead tuples */
} DeadTuplesSegment;

The comments to ItemPointerData say that it is 6 bytes long, but it can be
padded to 8 bytes by some compilers, so if we add padding in the current
way, there is no guarantee that it will be done as expected. The other way
is to do it with pg_attribute_aligned. But in my opinion, there is no need
to do it manually, because the compiler will do this optimization itself.


On 11.07.2017 19:51, Claudio Freire wrote:

-

+   /* Search for the segment likely to contain the item pointer */
+   iseg = vac_itemptr_binsrch(
+   (void *) itemptr,
+   (void *)
&(vacrelstats->dead_tuples.dt_segments->last_dead_tuple),
+   vacrelstats->dead_tuples.last_seg + 1,
+   sizeof(DeadTuplesSegment));
+

I think that we can change the above to;

+   /* Search for the segment likely to contain the item pointer */
+   iseg = vac_itemptr_binsrch(
+   (void *) itemptr,
+   (void *) &(seg->last_dead_tuple),
+   vacrelstats->dead_tuples.last_seg + 1,
+   sizeof(DeadTuplesSegment));

We set "seg = vacrelstats->dead_tuples.dt_segments" just before this.

Right
In my mind, if you replace vacrelstats->dead_tuples.last_seg + 1 with
GetNumDeadTuplesSegments(vacrelstats), it would be more meaningful.
Besides, you can replace the vac_itemptr_binsrch within the segment with
the stdlib bsearch, like:


res = (ItemPointer) bsearch((void *) itemptr,
                            (void *) seg->dt_tids,
                            seg->num_dead_tuples,
                            sizeof(ItemPointerData),
                            vac_cmp_itemptr);

return (res != NULL);


Those are before the changes in the review. While I don't expect any
change, I'll re-run some of them just in case, and try to investigate
the slowdown. But that will take forever. Each run takes about a week
on my test rig, and I don't have enough hardware to parallelize the
tests. I will run a test on a snapshot of a particularly troublesome
production database we have; that should be interesting.

Very interesting, waiting for the results.




Fwd: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-07-11 Thread Claudio Freire
Resending without the .tar.bz2 that got blocked


Sorry for the delay; I had an extended vacation that kept me away from my
test rigs, and afterward from testing too, for literally a few weeks.

I built a more thorough test script that produced some interesting
results. Will attach the results.

For now, to the review comments:

On Thu, Apr 27, 2017 at 4:25 AM, Masahiko Sawada  wrote:
> I've read this patch again and here are some review comments.
>
> + * Lookup in that structure proceeds sequentially in the list of segments,
> + * and with a binary search within each segment. Since segment's size grows
> + * exponentially, this retains O(log N) lookup complexity (2 log N to be
> + * precise).
>
> IIUC we now do binary search even over the list of segments.

Right

>
> -
>
> We often fetch a particular dead tuple segment. How about providing a
> macro for easier understanding?
> For example,
>
>  #define GetDeadTuplesSegment(lvrelstats, seg) \
>   (&(lvrelstats)->dead_tuples.dt_segments[(seg)])
>
> -
>
> +   if (vacrelstats->dead_tuples.num_segs == 0)
> +   return;
> +
>
> +   /* If uninitialized, we have no tuples to delete from the indexes */
> +   if (vacrelstats->dead_tuples.num_segs == 0)
> +   {
> +   return;
> +   }
>
> +   if (vacrelstats->dead_tuples.num_segs == 0)
> +   return false;
> +

Ok

> As I listed, there is code in several places in the actual vacuum path
> that checks whether the dead tuples array is initialized.
> I guess it should not happen that we attempt to vacuum a table/index page
> while not having any dead tuples. Would it be better to have an Assert or
> ereport instead?

I'm not sure. A non-empty dead tuples array is not necessary to honor the
contract in the docstring. Most of those functions clean dead tuples out of
the heap/index given the array of dead tuples, which is a no-op for an
empty array.

True, the code that calls those functions doesn't bother calling them when
the array is known to be empty, but there's no compelling reason to enforce
that at the interface. Doing so could cause subtle bugs rather than catch
them (in the form of unexpected assertion failures, if some caller forgot
to check the dead tuples array for emptiness).

If you're worried about the possibility that some bug fails to record dead
tuples in the array, and thus makes VACUUM silently ineffective, I have
instead added a test for that case. This should be a better approach, since
it's more likely to catch unexpected failure modes than an assert.

> @@ -1915,2 +2002,2 @@ count_nondeletable_pages(Relation onerel,
> LVRelStats *vacrelstats)
> -   BlockNumber prefetchStart;
> -   BlockNumber pblkno;
> +   BlockNumber prefetchStart;
> +   BlockNumber pblkno;
>
> I think that it's an unnecessary change.

Yep. But funnily enough, that's how it is now in master.

>
> -
>
> +   /* Search for the segment likely to contain the item pointer */
> +   iseg = vac_itemptr_binsrch(
> +   (void *) itemptr,
> +   (void *)
> &(vacrelstats->dead_tuples.dt_segments->last_dead_tuple),
> +   vacrelstats->dead_tuples.last_seg + 1,
> +   sizeof(DeadTuplesSegment));
> +
>
> I think that we can change the above to;
>
> +   /* Search for the segment likely to contain the item pointer */
> +   iseg = vac_itemptr_binsrch(
> +   (void *) itemptr,
> +   (void *) &(seg->last_dead_tuple),
> +   vacrelstats->dead_tuples.last_seg + 1,
> +   sizeof(DeadTuplesSegment));
>
> We set "seg = vacrelstats->dead_tuples.dt_segments" just before this.

Right

Attached is the current version of both patches, rebased while we're at it.

I'm also attaching the output from the latest benchmark runs, in raw
(tar.bz2) and digested (bench_report) forms, the script used to run
them (vacuumbench.sh) and to produce the reports
(vacuum_bench_report.sh).

Those are before the changes in the review. While I don't expect any
change, I'll re-run some of them just in case, and try to investigate
the slowdown. But that will take forever. Each run takes about a week
on my test rig, and I don't have enough hardware to parallelize the
tests. I will run a test on a snapshot of a particularly troublesome
production database we have; that should be interesting.

The benchmarks show a consistent improvement at scale 400, which may
be related to the search implementation being better somehow, and a
slowdown at scale 4000 in some variants. I believe this is due to
those variants having highly clustered indexes. While the "shuf"
(shuffled) variants were intended to be the opposite of that, I
suspect I somehow failed to get the desired outcome, so I'll be
double-checking that.

In any case, the slowdown only materializes when vacuuming with a large
maintenance_work_mem setting, which is something that shouldn't happen
unintentionally.