Re: [HACKERS] Small improvement to compactify_tuples

2017-11-08 Thread Claudio Freire
On Wed, Nov 8, 2017 at 12:33 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Nov 7, 2017 at 4:39 PM, Tom Lane  wrote:
>>> What I'm getting from the standard pgbench measurements, on both machines,
>>> is that this patch might be a couple percent slower than HEAD, but that is
>>> barely above the noise floor so I'm not too sure about it.
>
>> Hmm.  It seems like slowing down single client performance by a couple
>> of percent is something that we really don't want to do.
>
> I do not think there is any change here that can be proven to always be a
> win.  Certainly the original patch, which proposes to replace an O(n log n)
> sort algorithm with an O(n^2) one, should not be thought to be that.
> The question to focus on is what's the average case, and I'm not sure how
> to decide what the average case is.  But more than two test scenarios
> would be a good start.
>
> regards, tom lane

Making no change to the overall algorithm and simply replacing qsort with
an inlineable, type-specific one should be a net win in all cases.

Doing a bucket sort, with a qsort fallback for large buckets (or for small
tuple arrays), should also be a net win in all cases.

Whether to use shell sort is less clear-cut, but let's not forget that the
original patch only uses it on very small arrays, and very infrequently at
that.

What's perhaps not clear is whether there are better ideas, like
rebuilding the page as Tom proposes, which doesn't seem like a bad
idea. Bucket sort is already O(bytes), just like memcpy, only with a
lower constant factor (it's bytes/256 in the original patch), which
might make copying the whole page an extra time lose against bucket
sort in a few cases.

Settling that last point does need more benchmarking. That doesn't
mean the other improvements can't be pursued in the meantime, right?
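
To illustrate what I mean by bucket sort being O(bytes) with a small
constant, here is a rough sketch (not the patch's code; the names, the
32-byte bucket span for 8kB pages, and the ascending order are purely
illustrative):

#define PAGE_SIZE    8192
#define BUCKET_SPAN  32
#define NBUCKETS     (PAGE_SIZE / BUCKET_SPAN)   /* 256 buckets for 8kB pages */

typedef unsigned short OffsetNumber;             /* stand-in typedef */

static void
bucket_sort_offsets(const OffsetNumber *in, int n, OffsetNumber *out)
{
    int     counts[NBUCKETS] = {0};
    int     starts[NBUCKETS];
    int     i, b, pos = 0;

    /* counting pass: one bucket per 32 bytes of offset */
    for (i = 0; i < n; i++)
        counts[in[i] / BUCKET_SPAN]++;

    /* prefix sums give each bucket its slice of the output array */
    for (b = 0; b < NBUCKETS; b++)
    {
        starts[b] = pos;
        pos += counts[b];
    }

    /* scatter items into their buckets */
    for (i = 0; i < n; i++)
        out[starts[in[i] / BUCKET_SPAN]++] = in[i];

    /*
     * Every item now sits in its correct bucket, so the array is nearly
     * sorted; a single insertion-sort pass (standing in for the patch's
     * tiny per-bucket sorts) finishes cheaply.
     */
    for (i = 1; i < n; i++)
    {
        OffsetNumber tmp = out[i];
        int          j = i;

        while (j > 0 && out[j - 1] > tmp)
        {
            out[j] = out[j - 1];
            j--;
        }
        out[j] = tmp;
    }
}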




Re: [HACKERS] Small improvement to compactify_tuples

2017-11-07 Thread Claudio Freire
On Tue, Nov 7, 2017 at 11:42 AM, Юрий Соколов <funny.fal...@gmail.com> wrote:
>
>
> 2017-11-07 17:15 GMT+03:00 Claudio Freire <klaussfre...@gmail.com>:
>> Aside from requiring all that include magic, if you place specialized
>> sort functions in a reusable header, using it is as simple as
>> including the type-specific header (or declaring the type macros and
>> including the template), and using them as regular functions. There's
>> no runtime overhead involved, especially if you declare the comparison
>> function as a macro or a static inline function. The sort itself can
>> be declared static inline as well, and the compiler will decide
>> whether it's worth inlining.
>
> Ok, if no one complains about yet another qsort implementation,
> I will add a template header for qsort. Since qsort needs insertion sort,
> it will be in the same file.
> Do you approve of this?
>
> With regards,
> Sokolov Yura

If you need it, sure. I'm not particularly fond of writing code before it's needed.

If you can measure the impact for that particular case where qsort
might be needed, and it's a real-world case, then by all means.

Otherwise, if it's a rarely-encountered corner case, I'd recommend
simply calling the stdlib's qsort.
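
To be clear, that fallback needs nothing fancy. A minimal sketch, with
made-up names rather than the patch's code:

#include <stdlib.h>

typedef unsigned short OffsetNumber;    /* stand-in typedef */

static int
offset_cmp(const void *a, const void *b)
{
    OffsetNumber oa = *(const OffsetNumber *) a;
    OffsetNumber ob = *(const OffsetNumber *) b;

    return (oa > ob) - (oa < ob);
}

/* only reached when a corner case produces a bucket too big for insertion sort */
static void
sort_big_bucket(OffsetNumber *bucket, size_t nitems)
{
    qsort(bucket, nitems, sizeof(OffsetNumber), offset_cmp);
}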




Re: [HACKERS] Small improvement to compactify_tuples

2017-11-07 Thread Claudio Freire
On Mon, Nov 6, 2017 at 9:08 PM, Юрий Соколов <funny.fal...@gmail.com> wrote:
> 2017-11-07 1:14 GMT+03:00 Claudio Freire <klaussfre...@gmail.com>:
>>
>> I haven't seen this trick used in postgres, nor do I know whether it
>> would be well received, so this is more like throwing an idea to see
>> if it sticks...
>>
>> But a way to do this without macros is to have an includable
>> "template" algorithm that simply doesn't define the comparison
>> function/type, but rather assumes it:
>>
>> qsort_template.h
>>
>> #define QSORT_NAME qsort_ ## QSORT_SUFFIX
>>
>> static void QSORT_NAME(ELEM_TYPE arr, size_t num_elems)
>> {
>> ... if (ELEM_LESS(arr[a], arr[b]))
>> ...
>> }
>>
>> #undef QSORT_NAME
>>
>> Then, in "offset_qsort.h":
>>
>> #define QSORT_SUFFIX offset
>> #define ELEM_TYPE offset
>> #define ELEM_LESS(a,b) ((a) < (b))
>>
>> #include "qsort_template.h"
>>
>> #undef QSORT_SUFFIX
>> #undef ELEM_TYPE
>> #undef ELEM_LESS
>>
>> Now, I realize this may have its cons, but it does simplify
>> maintenance of type-specific or parameterized variants of
>> performance-critical functions.
>>
>> > I can do specialized qsort for this case. But it will be larger bunch of
>> > code, than
>> > shell sort.
>> >
>> >> And I'd recommend doing that when there is a need, and I don't think
>> >> this patch really needs it, since bucket sort handles most cases
>> >> anyway.
>> >
>> > And it still needs insertion sort for buckets.
>> > I can agree to get rid of shell sort. But insertion sort is necessary.
>>
>> I didn't suggest getting rid of insertion sort. But the trick above is
>> equally applicable to insertion sort.
>
> This trick is used in simplehash.h. I agree, it could be useful for qsort.
> It will not make qsort inlineable, but it will reduce the overhead a lot.
>
> This trick is too heavy-weight for insertion sort alone, though. Without
> shell sort, insertion sort could be expressed in a 14-line macro (8 lines
> without curly braces). But if insertion sort is defined together with
> qsort (because qsort still needs it), then it is justifiable.

What do you mean by heavy-weight?

Aside from requiring all that include magic, if you place specialized
sort functions in a reusable header, using it is as simple as
including the type-specific header (or declaring the type macros and
including the template), and using them as regular functions. There's
no runtime overhead involved, especially if you declare the comparison
function as a macro or a static inline function. The sort itself can
be declared static inline as well, and the compiler will decide
whether it's worth inlining.
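
For instance, a call site for a hypothetical offset variant would be as
simple as this (file, type and function names follow the template idea
sketched elsewhere in this thread; none of them exist in the tree):

#include "offset_qsort.h"   /* would provide: static void qsort_offset(offset *, size_t) */

static void
sort_item_offsets(offset *offsets, size_t nitems)
{
    /*
     * No function-pointer comparator and no library call: the compiler sees
     * the full body of qsort_offset() here and can inline it if it pays off.
     */
    qsort_offset(offsets, nitems);
}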




Re: [HACKERS] Small improvement to compactify_tuples

2017-11-06 Thread Claudio Freire
On Mon, Nov 6, 2017 at 6:58 PM, Юрий Соколов <funny.fal...@gmail.com> wrote:
>
> 2017-11-06 17:55 GMT+03:00 Claudio Freire <klaussfre...@gmail.com>:
>>
>> On Mon, Nov 6, 2017 at 11:50 AM, Юрий Соколов <funny.fal...@gmail.com>
>> wrote:
>> >> Maybe leave a fallback to qsort if some corner case produces big
>> >> buckets?
>> >
>> > For 8kb pages, each bucket is per 32 bytes. So, for heap pages it is at
>> > most 1 heap-tuple per bucket, and for index pages it is at most 2 index
>> > tuples per bucket. For 32kb pages it is 4 heap-tuples and 8 index-tuples
>> > per bucket.
>> > It will be unnecessary overhead to call non-inlineable qsort in this
>> > cases
>> >
>> > So, I think, shell sort could be removed, but insertion sort have to
>> > remain.
>> >
>> > I'd prefer shell sort to remain also. It could be useful in other places
>> > also,
>> > because it is easily inlinable, and provides comparable to qsort
>> > performance
>> > up to several hundreds of elements.
>>
>> I'd rather have an inlineable qsort.
>
> But qsort is recursive. It is quite hard to make it inlineable. And it will
> still be much heavier than insertion sort (btw, all qsort implementations
> use insertion sort for small arrays). And it will be heavier than shell
> sort for small arrays.

I haven't seen this trick used in postgres, nor do I know whether it
would be well received, so this is more like throwing an idea to see
if it sticks...

But a way to do this without macros is to have an includable
"template" algorithm that simply doesn't define the comparison
function/type, but rather assumes it:

qsort_template.h

#define QSORT_NAME qsort_ ## QSORT_SUFFIX

static void QSORT_NAME(ELEM_TYPE arr, size_t num_elems)
{
... if (ELEM_LESS(arr[a], arr[b]))
...
}

#undef QSORT_NAME

Then, in "offset_qsort.h":

#define QSORT_SUFFIX offset
#define ELEM_TYPE offset
#define ELEM_LESS(a,b) ((a) < (b))

#include "qsort_template.h"

#undef QSORT_SUFFIX
#undef ELEM_TYPE
#undef ELEM_LESS

Now, I realize this may have its cons, but it does simplify
maintenance of type-specific or parameterized variants of
performance-critical functions.

> I can do specialized qsort for this case. But it will be larger bunch of
> code, than
> shell sort.
>
>> And I'd recommend doing that when there is a need, and I don't think
>> this patch really needs it, since bucket sort handles most cases
>> anyway.
>
> And it still needs insertion sort for buckets.
> I can agree to get rid of shell sort. But insertion sort is necessary.

I didn't suggest getting rid of insertion sort. But the trick above is
equally applicable to insertion sort.
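
To make the sketch above concrete, a compilable variant could look like the
following (my own minimal version: the body is a textbook quicksort with an
insertion-sort cutoff rather than pg_qsort, and the extra level of macro
indirection is needed so that QSORT_SUFFIX actually expands when pasted):

/* sort_template.h: expects ELEM_TYPE, ELEM_LESS and QSORT_SUFFIX to be defined */
#include <stddef.h>

#define SORT_CAT_(a, b)  a##_##b
#define SORT_CAT(a, b)   SORT_CAT_(a, b)
#define SORT_NAME        SORT_CAT(qsort, QSORT_SUFFIX)

static void
SORT_NAME(ELEM_TYPE *arr, size_t n)
{
    size_t i;

    if (n < 2)
        return;

    if (n <= 8)
    {
        /* small inputs: plain insertion sort */
        for (i = 1; i < n; i++)
        {
            ELEM_TYPE tmp = arr[i];
            size_t    j = i;

            while (j > 0 && ELEM_LESS(tmp, arr[j - 1]))
            {
                arr[j] = arr[j - 1];
                j--;
            }
            arr[j] = tmp;
        }
        return;
    }

    /* quicksort step: middle element as pivot, Lomuto partition */
    {
        ELEM_TYPE pivot, t;
        size_t    store = 0;

        t = arr[n / 2];
        arr[n / 2] = arr[n - 1];
        arr[n - 1] = t;
        pivot = arr[n - 1];

        for (i = 0; i < n - 1; i++)
        {
            if (ELEM_LESS(arr[i], pivot))
            {
                t = arr[i];
                arr[i] = arr[store];
                arr[store] = t;
                store++;
            }
        }
        t = arr[n - 1];
        arr[n - 1] = arr[store];
        arr[store] = t;

        SORT_NAME(arr, store);
        SORT_NAME(arr + store + 1, n - store - 1);
    }
}

#undef SORT_NAME
#undef SORT_CAT
#undef SORT_CAT_

An "offset_qsort.h" would then just define QSORT_SUFFIX, ELEM_TYPE and
ELEM_LESS and include this file, exactly as in the sketch above.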




Re: [HACKERS] Small improvement to compactify_tuples

2017-11-06 Thread Claudio Freire
On Mon, Nov 6, 2017 at 11:50 AM, Юрий Соколов  wrote:
>> Maybe leave a fallback to qsort if some corner case produces big buckets?
>
> For 8kB pages, each bucket covers 32 bytes. So, for heap pages it is at
> most 1 heap tuple per bucket, and for index pages it is at most 2 index
> tuples per bucket. For 32kB pages it is 4 heap tuples and 8 index tuples
> per bucket.
> It would be unnecessary overhead to call a non-inlineable qsort in these cases.
>
> So, I think shell sort could be removed, but insertion sort has to remain.
>
> I'd prefer shell sort to remain as well. It could be useful in other places
> too, because it is easily inlinable, and it provides performance comparable
> to qsort for up to several hundred elements.

I'd rather have an inlineable qsort.

And I'd recommend doing that when there is a need, and I don't think
this patch really needs it, since bucket sort handles most cases
anyway.




Re: [HACKERS] Small improvement to compactify_tuples

2017-11-05 Thread Claudio Freire
On Sat, Nov 4, 2017 at 8:07 PM, Юрий Соколов <funny.fal...@gmail.com> wrote:
> 2017-11-03 5:46 GMT+03:00 Tom Lane <t...@sss.pgh.pa.us>:
>>
>> Sokolov Yura <funny.fal...@postgrespro.ru> writes:
>> > [ 0001-Improve-compactify_tuples.patch, v5 or thereabouts ]
>>
>> I went to check the shellsort algorithm against Wikipedia's entry,
>> and found that this appears to be an incorrect implementation of
>> shellsort: where pg_shell_sort_pass has
>>
>> for (_i = off; _i < _n; _i += off) \
>>
>> it seems to me that we need to have
>>
>> for (_i = off; _i < _n; _i += 1) \
>>
>> or maybe just _i++.
>
>
> Shame on me :-(
> I've written shell sort several times, so I forgot to recheck myself once
> again.
> And it looks like the best gap sequence from Wikipedia really is the best
> ({301, 132, 57, 23, 10, 4} in my notation),
>
>
> 2017-11-03 17:37 GMT+03:00 Claudio Freire <klaussfre...@gmail.com>:
>> On Thu, Nov 2, 2017 at 11:46 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> BTW, the originally given test case shows no measurable improvement
>>> on my box.
>>
>> I did manage to reproduce the original test and got a consistent
>> improvement.
>
> I've rechecked myself using my benchmark.
> Without memmove, compactify_tuples consumes:
> - with qsort, 11.66% cpu (pg_qsort + med3 + swapfunc + itemoffcompare +
> compactify_tuples = 5.97 + 0.51 + 2.87 + 1.88 + 0.44)
> - with just insertion sort, 6.65% cpu (the sort is inlined, itemoffcompare
> also inlined, so the whole cost is in compactify_tuples)
> - with just shell sort, 5.98% cpu (the sort is inlined again)
> - with bucket sort, 1.76% cpu (sort_itemIds + compactify_tuples = 1.30 +
> 0.46)

Is that just insertion sort without bucket sort?

I think shell sort has little impact in your original patch because
it's rarely exercised. With bucket sort, most buckets are very
small, too small for shell sort to do any useful work.

That's why I'm inclined to agree with Tom that we could safely
simplify it out (that is, remove it) without much impact.

Maybe leave a fallback to qsort if some corner case produces big buckets?




Re: [HACKERS] Small improvement to compactify_tuples

2017-11-03 Thread Claudio Freire
On Fri, Nov 3, 2017 at 4:30 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Claudio Freire <klaussfre...@gmail.com> writes:
>> On Thu, Nov 2, 2017 at 11:46 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> BTW, the originally given test case shows no measurable improvement
>>> on my box.
>
>> I did manage to reproduce the original test and got a consistent improvement.
>
> This gave me the idea to memcpy the page into some workspace and then use
> memcpy, not memmove, to put the tuples back into the caller's copy of the
> page.  That gave me about a 50% improvement in observed TPS, and a perf
> profile like this:
>
> +   38.50%  38.40%  299520  postmaster  postgres      [.] compactify_tuples
> +   31.11%  31.02%  241975  postmaster  libc-2.12.so  [.] memcpy
> +    8.74%   8.72%   68022  postmaster  postgres      [.] itemoffcompare
> +    6.51%   6.49%   50625  postmaster  postgres      [.] compactify_tuples_loop
> +    4.21%   4.19%   32719  postmaster  postgres      [.] pg_qsort
> +    1.70%   1.69%   13213  postmaster  postgres      [.] memcpy@plt
>
> There still doesn't seem to be any point in replacing the qsort,
> but it does seem like something like the second attached patch
> might be worth doing.
>
> So I'm now wondering why my results seem to be so much different
> from those of other people who have tried this, both as to whether
> compactify_tuples is worth working on at all and as to what needs
> to be done to it if so.  Thoughts?
>
> regards, tom lane
>

I'm going to venture a guess that the versions of gcc and libc, and the
build options used both for libc (ie: the distro's build) and for postgres,
may play a part here.

I'm running with glibc 2.22, for instance, and building with gcc 4.8.5.

I will try and benchmark memcpy vs memmove and see what the
performance difference is there with my versions, too. This may
heavily depend on compiler optimizations that may vary between
versions, since memcpy/memmove tend to be inlined a lot.
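
The benchmark I have in mind is just a standalone loop along these lines (a
throwaway sketch of my own, nothing from the tree; the buffer size, the
iteration count and the sink accumulator that keeps the compiler from
eliding the copies are arbitrary choices):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define PAGE_SZ 8192
#define ITERS   2000000L

static double
elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000.0 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int
main(void)
{
    char       *src = malloc(PAGE_SZ);
    char       *dst = malloc(PAGE_SZ);
    unsigned long sink = 0;
    struct timespec t0, t1;
    long        i;

    memset(src, 0x5a, PAGE_SZ);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++)
    {
        memcpy(dst, src, PAGE_SZ);
        sink += (unsigned char) dst[i & (PAGE_SZ - 1)];
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("memcpy:  %.1f ms\n", elapsed_ms(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++)
    {
        memmove(dst, src, PAGE_SZ);
        sink += (unsigned char) dst[i & (PAGE_SZ - 1)];
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("memmove: %.1f ms (sink %lu)\n", elapsed_ms(t0, t1), sink);

    free(src);
    free(dst);
    return 0;
}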




Re: [HACKERS] Small improvement to compactify_tuples

2017-11-03 Thread Claudio Freire
On Thu, Nov 2, 2017 at 11:46 PM, Tom Lane  wrote:
> Sokolov Yura  writes:
>> [ 0001-Improve-compactify_tuples.patch, v5 or thereabouts ]
>
> I started to review this patch.  I spent a fair amount of time on
> beautifying the code, because I found it rather ugly and drastically
> undercommented.  Once I had it to the point where it seemed readable,
> I went to check the shellsort algorithm against Wikipedia's entry,
> and found that this appears to be an incorrect implementation of
> shellsort: where pg_shell_sort_pass has
>
> for (_i = off; _i < _n; _i += off) \
>
> it seems to me that we need to have
>
> for (_i = off; _i < _n; _i += 1) \
>
> or maybe just _i++.  As-is, this isn't h-sorting the whole file,
> but just the subset of entries that have multiple-of-h indexes
> (ie, the first of the h distinct subfiles that should get sorted).
> The bug is masked by the final pass of plain insertion sort, but
> we are not getting the benefit we should get from the earlier passes.
>
> However, I'm a bit dubious that it's worth fixing that; instead
> my inclination would be to rip out the shellsort implementation
> entirely.  The code is only using it for the nitems <= 48 case
> (which makes the first three offset steps certainly no-ops) and
> I am really unconvinced that it's worth expending the code space
> for a shellsort rather than plain insertion sort in that case,
> especially when we have good reason to think that the input data
> is nearly sorted.

I actually noticed that and benchmarked some variants. Neither
made any noticeable difference in performance, so I decided not
to complain about them.

I guess the same case can be made for removing the shell sort.
So I'm inclined to agree.

> BTW, the originally given test case shows no measurable improvement
> on my box.

I did manage to reproduce the original test and got a consistent improvement.

> I was eventually able to convince myself by profiling
> that the patch makes us spend less time in compactify_tuples, but
> this test case isn't a very convincing one.
>
> So, quite aside from the bug, I'm not excited about committing the
> attached as-is.  I think we should remove pg_shell_sort and just
> use pg_insertion_sort.  If somebody can show a test case that
> provides a measurable speed improvement from the extra code,
> I could be persuaded to reconsider.

My tests modifying the shell sort didn't produce any measurable
difference, but I didn't test removing it altogether.
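
For reference, the corrected pass Tom describes boils down to something like
this standalone sketch (not the patch's macros; OffsetNumber is a stand-in
typedef, the sort is ascending just for illustration, and the gap sequence
is the one mentioned upthread):

typedef unsigned short OffsetNumber;    /* stand-in typedef */

static void
shell_sort_pass(OffsetNumber *arr, int n, int gap)
{
    int i;

    /* advance by 1, not by the gap, so every gap-subsequence gets h-sorted */
    for (i = gap; i < n; i++)
    {
        OffsetNumber tmp = arr[i];
        int          j = i;

        while (j >= gap && arr[j - gap] > tmp)
        {
            arr[j] = arr[j - gap];
            j -= gap;
        }
        arr[j] = tmp;
    }
}

static void
shell_sort(OffsetNumber *arr, int n)
{
    /* gaps from the thread ({301, 132, 57, 23, 10, 4}), plus the final 1 */
    static const int gaps[] = {301, 132, 57, 23, 10, 4, 1};
    int         g;

    for (g = 0; g < 7; g++)
        if (gaps[g] < n)
            shell_sort_pass(arr, n, gaps[g]);
}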




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-10-02 Thread Claudio Freire
On Sun, Oct 1, 2017 at 8:36 PM, Daniel Gustafsson <dan...@yesql.se> wrote:
>> On 18 Aug 2017, at 13:39, Claudio Freire <klaussfre...@gmail.com> wrote:
>>
>> On Fri, Apr 7, 2017 at 10:51 PM, Claudio Freire <klaussfre...@gmail.com> 
>> wrote:
>>> Indeed they do, and that's what motivated this patch. But I'd need
>>> TB-sized tables to set up something like that. I don't have the
>>> hardware or time available to do that (vacuum on bloated TB-sized
>>> tables can take days in my experience). Scale 4000 is as big as I can
>>> get without running out of space for the tests in my test hardware.
>>>
>>> If anybody else has the ability, I'd be thankful if they did test it
>>> under those conditions, but I cannot. I think Anastasia's test is
>>> closer to such a test, that's probably why it shows a bigger
>>> improvement in total elapsed time.
>>>
>>> Our production database could possibly be used, but it can take about
>>> a week to clone it, upgrade it (it's 9.5 currently), and run the
>>> relevant vacuum.
>>
>> It looks like I won't be able to do that test with a production
>> snapshot anytime soon.
>>
>> Getting approval for the budget required to do that looks like it's
>> going to take far longer than I thought.
>>
>> Regardless of that, I think the patch can move forward. I'm still
>> planning to do the test at some point, but this patch shouldn't block
>> on it.
>
> This patch has been marked Ready for committer after review, but wasn’t
> committed in the current commitfest so it will be moved to the next.  Since it
> no longer applies cleanly, it’s being reset to Waiting for author though.
>
> cheers ./daniel

Rebased version of the patches attached
From 8d4ea3bc7b67bc535ac26109a65932044e4a0ab7 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Mon, 2 Oct 2017 10:56:40 -0300
Subject: [PATCH 1/2] [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.

Improve test ability to spot vacuum errors
---
 src/backend/commands/vacuumlazy.c| 404 ---
 src/test/regress/expected/vacuum.out |  26 +++
 src/test/regress/sql/vacuum.sql  |  19 ++
 3 files changed, 375 insertions(+), 74 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 30b1c08c6c..ae1e9afbf3 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -11,11 +11,24 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
+ * initially allocate an array of TIDs of 128MB, or an upper limit that
  * depends on table size (this limit ensures we don't allocate a huge area
- * uselessly for vacuuming small tables).  If the array threatens to overflow,
+ * uselessly for vacuuming small tables). Additional arrays of increasingly
+ * large sizes are allocated as they become necessary.
+ *
+ * The TID array is thus represented as a list of multiple segments of
+ * varying size, beginning with the initial size of up to 128MB, and growing
+ * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
+ * is used up.
+ *
+ * Lookup in that structure happens with a binary search of segments, and then
+ * with a binary search within each segment. Since segment's size grows
+ * exponentially, this retains O(log N) lookup complexity.
+ *
+ * If the multiarray's total size threatens to grow beyond maintenance_work_mem,
  * we suspend the heap scan phase and perform a pass of index cleanup and page
- * compaction, then resume the heap scan with an empty TID array.
+ * compaction, then resume the heap scan with an array of logically empty but
+ * already preallocated TID segments to be refilled with more dead tuple TIDs.
  *
  * If we're processing a table with no indexes, we can just vacuum each page
  * as we go; there's no need to save up multiple tuples to minimize the number
@@ -92,6 +105,14 @@
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
 
 /*
+ * Minimum (starting) size of the dead_tuples array segments. Will allocate
+ * space for 128MB worth of tid pointers in the first segment, further segments
+ * will grow in size exponentially. Don't make it too small or the segment list
+ * will grow bigger than the sweetspot for search efficiency on big vacuums.
+ */
+#define LAZY_INIT_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) 
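
For readers skimming the thread, the two-level lookup described in the
patch's header comment amounts to roughly the following sketch. The struct
and field names are invented for illustration; only the item-pointer types
and ItemPointerCompare() are real PostgreSQL primitives:

#include "postgres.h"
#include "storage/itemptr.h"

typedef struct DeadTupleSegment
{
    ItemPointerData *tids;      /* sorted dead TIDs in this segment */
    int              num_tids;
} DeadTupleSegment;

typedef struct DeadTupleMultiArray
{
    DeadTupleSegment *segs;     /* segments in ascending TID order */
    int               num_segs;
} DeadTupleMultiArray;

static bool
dead_tuple_exists(DeadTupleMultiArray *dt, ItemPointer tid)
{
    int         lo = 0;
    int         hi = dt->num_segs - 1;

    /* binary search of segments: first one whose last TID is >= the probe */
    while (lo < hi)
    {
        int         mid = lo + (hi - lo) / 2;
        DeadTupleSegment *s = &dt->segs[mid];

        if (ItemPointerCompare(&s->tids[s->num_tids - 1], tid) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }

    if (lo < dt->num_segs)
    {
        /* plain binary search within the (exponentially larger) segment */
        DeadTupleSegment *s = &dt->segs[lo];
        int         l = 0;
        int         h = s->num_tids - 1;

        while (l <= h)
        {
            int         m = l + (h - l) / 2;
            int         c = ItemPointerCompare(&s->tids[m], tid);

            if (c == 0)
                return true;
            if (c < 0)
                l = m + 1;
            else
                h = m - 1;
        }
    }
    return false;
}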

Re: [HACKERS] Small improvement to compactify_tuples

2017-09-25 Thread Claudio Freire
On Sat, Sep 23, 2017 at 5:56 AM, Sokolov Yura
<funny.fal...@postgrespro.ru> wrote:
> Hello, Claudio.
>
> Thank you for the review and for confirming the improvement.
>
>
> On 2017-09-23 01:12, Claudio Freire wrote:
>>
>>
>> Patch 1 applies cleanly, builds, and make check runs fine.
>>
>> The code looks similar in style to surrounding code too, so I'm not
>> going to complain about the abundance of underscores in the macros :-p
>>
>> I can reproduce the results in the OP's benchmark, with slightly
>> different numbers, but an overall improvement of ~6%, which matches
>> the OP's relative improvement.
>>
>> Algorithmically, everything looks sound.
>>
>>
>> A few minor comments about patch 1:
>>
>> +if (max == 1)
>> +goto end;
>>
>> That goto is unnecessary, you could just as simply say
>>
>> if (max > 1)
>> {
>>...
>> }
>
>
> Done.
> (I don't like indentation, though :-( )
>
>>
>>
>> +#define pg_shell_sort_pass(elem_t, cmp, off) \
>> +do { \
>> +int _i, _j; \
>> +elem_t _temp; \
>> +for (_i = off; _i < _n; _i += off) \
>> +{ \
>>
>> _n right there isn't declared in the macro, and it isn't an argument
>> either. It should be an argument, having stuff inherited from the
>> enclosing context like that is confusing.
>>
>> Same with _arr, btw.
>
>
> pg_shell_sort_pass is not intended to be used outside pg_shell_sort
> and pg_insertion_sort, so I think stealing from their context is ok.
> Nonetheless, done.

Looks good.

Marking this patch as ready for committer




Re: [HACKERS] Small improvement to compactify_tuples

2017-09-22 Thread Claudio Freire
On Tue, Sep 12, 2017 at 12:49 PM, Sokolov Yura
 wrote:
> On 2017-07-21 13:49, Sokolov Yura wrote:
>>
>> On 2017-05-17 17:46, Sokolov Yura wrote:
>>>
>>> Alvaro Herrera wrote 2017-05-15 20:13:

>>>> As I understand, these patches are logically separate, so putting them
>>>> together in a single file isn't such a great idea.  If you don't edit
>>>> the patches further, then you're all set because we already have the
>>>> previously archived patches.  Next commitfest starts in a few months
>>>> yet, and if you feel the need to submit corrected versions in the
>>>> meantime, please do submit in separate files.  (Some would even argue
>>>> that each should be its own thread, but I don't think that's necessary.)
>>>
>>>
>>> Thank you for explanation.
>>>
>>> I'm adding new version of first patch with minor improvement:
>>> - I added detection of a case when all buckets are trivial
>>>   (ie 0 or 1 element). In this case no need to sort buckets at all.
>>
>>
>> I'm posting a rebased version of the second patch.
>
>
> Here is again a rebased version of both patches.
> Now the second patch applies cleanly, independent of the first patch.

Patch 1 applies cleanly, builds, and make check runs fine.

The code looks similar in style to surrounding code too, so I'm not
going to complain about the abundance of underscores in the macros :-p

I can reproduce the results in the OP's benchmark, with slightly
different numbers, but an overall improvement of ~6%, which matches
the OP's relative improvement.

Algorithmically, everything looks sound.


A few minor comments about patch 1:

+if (max == 1)
+goto end;

That goto is unnecessary; you could just as simply say

if (max > 1)
{
   ...
}


+#define pg_shell_sort_pass(elem_t, cmp, off) \
+do { \
+int _i, _j; \
+elem_t _temp; \
+for (_i = off; _i < _n; _i += off) \
+{ \

_n right there isn't declared in the macro, and it isn't an argument
either. It should be an argument; having stuff inherited from the
enclosing context like that is confusing.

Same with _arr, btw.


Patch 2 LGTM.




Re: [HACKERS] GUC for cleanup indexes threshold.

2017-09-22 Thread Claudio Freire
On Fri, Sep 22, 2017 at 4:46 AM, Kyotaro HORIGUCHI
<horiguchi.kyot...@lab.ntt.co.jp> wrote:
> I apologize in advance for possible silliness.
>
> At Thu, 21 Sep 2017 13:54:01 -0300, Claudio Freire <klaussfre...@gmail.com> 
> wrote in 

Re: [HACKERS] GUC for cleanup indexes threshold.

2017-09-21 Thread Claudio Freire
On Tue, Sep 19, 2017 at 8:55 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> On Tue, Sep 19, 2017 at 4:47 PM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> Maybe this is looking at the problem from the wrong direction.
>>
>> Why can't the page be added to the FSM immediately and the check be
>> done at runtime when looking for a reusable page?
>>
>> Index FSMs currently store only 0 or 255, couldn't they store 128 for
>> half-recyclable pages and make the caller re-check reusability before
>> using it?
>
> No, because it's impossible for them to know whether or not the page
> that their index scan just landed on recycled just a second ago, or
> was like this since before their xact began/snapshot was acquired.
>
> For your reference, this RecentGlobalXmin interlock stuff is what
> Lanin & Shasha call "The Drain Technique" within "2.5 Freeing Empty
> Nodes". Seems pretty hard to do it any other way.

I don't see the difference between a vacuum run and distributed
maintenance at _bt_getbuf time. In fact, the code seems to be in
place already.

_bt_page_recyclable seems to prevent old transactions from treating
those pages as recyclable already, and the description of the
technique in 2.5 doesn't seem to preclude doing the drain while doing
other operations. In fact, Lehman even considers the possibility of
multiple concurrent garbage collectors.

It's only a matter of making the page visible in the FSM in a way that
can be efficiently skipped if we want to go directly to a page that
actually CAN be recycled to avoid looping forever looking for a
recyclable page in _bt_getbuf. In fact, that's pretty much Lanin &
Shasha's drain technique right there. FSM entries with 128 would be "the
queue", and FSM entries with 255 would be "the freelist". _bt_getbuf
can be the GC getting a buffer to try to recycle, giving up after a few
tries and getting an actually recyclable buffer instead (or extending the
relation). In essence, microvacuum.

Unless I'm missing something and RecentGlobalXmin somehow doesn't
exclude all old transactions, I don't see a problem.

Lanin & Shasha use reference counting to do GC wait during the drain,
and most of the ordering of operations needed there is because of
that, but using the xmin seems to make all those considerations moot.
An xact earlier than RecentGlobalXmin cannot have active transactions
able to follow links to that page AFAIK.

TBH, I haven't read the whole papers yet, though I probably will.

But, in essence, what's the difference between vacuum doing

if (_bt_page_recyclable(page))
{
/* Okay to recycle this page */
RecordFreeIndexPage(rel, blkno);
vstate->totFreePages++;
stats->pages_deleted++;
}

and doing it in _bt_getbuf?

What am I missing?




Re: [HACKERS] GUC for cleanup indexes threshold.

2017-09-19 Thread Claudio Freire
On Tue, Sep 19, 2017 at 3:31 AM, Kyotaro HORIGUCHI
 wrote:
> I was just looking at the thread since it was found left alone for a
> long time in the CF app.
>
> At Mon, 18 Sep 2017 16:35:58 -0700, Peter Geoghegan  wrote in 
> 
>> On Wed, Apr 5, 2017 at 3:50 PM, Andres Freund  wrote:
>> > Hi,
>> >
>> > On 2017-04-01 03:05:07 +0900, Masahiko Sawada wrote:
>> >> On Fri, Mar 31, 2017 at 11:44 PM, Robert Haas  
>> >> wrote:
>> >> [ lots of valuable discussion ]
>> >
>> > I think this patch clearly still is in the design stage, and has
>> > received plenty feedback this CF.  I'll therefore move this to the next
>> > commitfest.
>>
>> Does anyone have ideas on a way forward here? I don't, but then I
>> haven't thought about it in detail in several months.
>
> Is the additional storage in the metapage to store the current status
> of vacuum still unacceptable, even if it can avoid useless
> full-page scans on indexes, especially for stable tables?
>
> Or, how about an additional 1 bit in pg_stat_*_index to indicate
> that the index *doesn't* require the vacuum cleanup stage? (The default
> value causes cleanup.)
>
> index_bulk_delete (or ambulkdelete) returns the flag in
> IndexBulkDeleteResult then lazy_scan_heap stores the flag in
> stats and in the next cycle it is looked up to decide the
> necessity of index cleanup.

Maybe this is looking at the problem from the wrong direction.

Why can't the page be added to the FSM immediately and the check be
done at runtime when looking for a reusable page?

Index FSMs currently store only 0 or 255, couldn't they store 128 for
half-recyclable pages and make the caller re-check reusability before
using it?

This would make the second pass totally unnecessary, for a slight
penalty when looking at the FSM. That is: two lookups in the FSM instead
of one. An initial one with min free space 128 to get a half-recyclable
page and, if the check fails on that page, a second lookup with min free
space 255 to get a surely-recyclable page.
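
In rough pseudo-C, the lookup side would then be (GetMaybeRecyclableIndexPage()
and the runtime re-check helper are hypothetical and don't exist today;
GetFreeIndexPage() and _bt_page_recyclable() are the existing primitives):

static BlockNumber
bt_find_reusable_block(Relation rel)
{
    BlockNumber blkno;

    /*
     * Hypothetical: an FSM lookup returning a block recorded with category
     * >= 128 ("maybe recyclable").  Today's index FSM records only 0 or 255.
     */
    blkno = GetMaybeRecyclableIndexPage(rel);

    if (blkno != InvalidBlockNumber &&
        bt_block_recyclable_now(rel, blkno))    /* re-check, e.g. read the
                                                 * page and apply
                                                 * _bt_page_recyclable() */
        return blkno;

    /* second lookup: only pages already recorded as fully recyclable (255) */
    return GetFreeIndexPage(rel);
}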




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-08-18 Thread Claudio Freire
On Fri, Apr 7, 2017 at 10:51 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> Indeed they do, and that's what motivated this patch. But I'd need
> TB-sized tables to set up something like that. I don't have the
> hardware or time available to do that (vacuum on bloated TB-sized
> tables can take days in my experience). Scale 4000 is as big as I can
> get without running out of space for the tests in my test hardware.
>
> If anybody else has the ability, I'd be thankful if they did test it
> under those conditions, but I cannot. I think Anastasia's test is
> closer to such a test, that's probably why it shows a bigger
> improvement in total elapsed time.
>
> Our production database could possibly be used, but it can take about
> a week to clone it, upgrade it (it's 9.5 currently), and run the
> relevant vacuum.

It looks like I won't be able to do that test with a production
snapshot anytime soon.

Getting approval for the budget required to do that looks like it's
going to take far longer than I thought.

Regardless of that, I think the patch can move forward. I'm still
planning to do the test at some point, but this patch shouldn't block
on it.




Re: In-place index updates and HOT (Was: [HACKERS] Patch: Write Amplification Reduction Method (WARM))

2017-07-28 Thread Claudio Freire
On Fri, Jul 28, 2017 at 8:32 PM, Peter Geoghegan  wrote:
> README.HOT says that that cost is not worth the benefit of
> preventing a new index write, but I think that it ought to take into
> account that not all index writes are equal. There is an appreciable
> difference between inserting a new tuple, and updating one in-place. We
> can remove the cost (hurting new snapshots by making them go through old
> heap pages) while preserving most of the benefits (no logically
> unnecessary index bloat).

It's a neat idea.

And, well, now that you mention it, you don't need to touch indexes at all.

You can create the new chain and "update" the index to point to it
without ever touching the index itself: you can repoint the old
HOT chain's start line pointer to point to the new HOT chain, create
a new pointer for the old one, and point to it from the new HOT chain's
t_tid.

Existing index tuples thus now point to the right HOT chain without
having to go into the index and make any changes.

You do need the new HOT chain to live in the same page for this,
however.




[HACKERS] [PATCH] Vacuum: Update FSM more frequently

2017-07-28 Thread Claudio Freire
As discussed in the vacuum ring buffer thread:

Make Vacuum update the FSM more frequently, to avoid the case where
autovacuum fails to reach the point where it updates the FSM in
highly contended tables.

Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids
recursing into branches that already contain enough free space, to
avoid having to traverse the whole FSM and thus induce quadratic
costs. Intermediate FSM vacuums are only supposed to make enough
free space visible to avoid extension until the final (non-partial)
FSM vacuum.

Run partial FSM vacuums after each index pass, which is a point
at which whole ranges of the heap have been thoroughly cleaned, and
we can expect no further updates to those ranges of the FSM save
for concurrent activity. When there are no indexes, and thus no
index passes, run partial FSM vacuums every 8GB of dirtied pages
or 1/8th of the relation, whichever is larger. This allows some
partial work to be made visible without incurring quadratic cost.
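
As a rough illustration of that cadence (not a hunk from the patch: the
counter handling and the pruning-threshold value are made up, and
FreeSpaceMapVacuum() is used in the two-argument form the patch introduces):

/* 8GB worth of heap pages, expressed in blocks */
#define FSM_VACUUM_PAGE_INTERVAL  ((BlockNumber) ((8ULL * 1024 * 1024 * 1024) / BLCKSZ))

static void
maybe_vacuum_fsm(Relation onerel, BlockNumber nblocks,
                 BlockNumber *pages_since_fsm_vacuum)
{
    /* every 8GB of dirtied pages, or 1/8th of the relation, whichever is larger */
    BlockNumber threshold = Max(FSM_VACUUM_PAGE_INTERVAL, nblocks / 8);

    if (*pages_since_fsm_vacuum >= threshold)
    {
        FreeSpaceMapVacuum(onerel, 64);     /* 64 is an arbitrary pruning threshold */
        *pages_since_fsm_vacuum = 0;
    }
}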

In any case, the FSM is small in relation to the table, so even when
quadratic cost is involved, it should not be problematic. Index
passes already incur quadratic cost, and the addition of the FSM work
is unlikely to be measurable.

Run a partial FSM vacuum with a low pruning threshold right at
the beginning of the VACUUM to finish up any work left over from
prior canceled vacuum runs, something that is common in highly
contended tables when running only autovacuum.

Autovacuum canceling is thus handled by updating the FSM first-thing
when autovacuum retries vacuuming a relation. I attempted to add
an autovacuum work item for performing the FSM update shortly
after the cancel, but that had some issues so I abandoned the approach.

For one, it would sometimes crash when adding the work item from
inside autovacuum itself. I didn't find the cause of the crash, but I suspect
AutoVacuumRequestWork was being called in a context where it
was not safe. In any case, the way work items work didn't seem like
it would have suited our purposes anyway: they are only ever
processed after processing all tables in the database, something that
could take ages; work items are limited to 256, which would make
the approach troublesome for databases with more than 256 tables that
trigger this case; and it required de-duplicating work items, which had
quadratic cost without major refactoring of the feature.

So, patch attached. I'll add it to the next CF as well.
From 791b9d2006b5abd67a8efb3f7c6cc99141ddbb09 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Fri, 28 Jul 2017 21:31:59 -0300
Subject: [PATCH] Vacuum: Update FSM more frequently

Make Vacuum update the FSM more frequently, to avoid the case where
autovacuum fails to reach the point where it updates the FSM in
highly contended tables.

Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids
recursing into branches that already contain enough free space, to
avoid having to traverse the whole FSM and thus induce quadratic
costs. Intermediate FSM vacuums are only supposed to make enough
free space visible to avoid extension until the final (non-partial)
FSM vacuum.

Run partial FSM vacuums after each index pass, which is a point
at which whole ranges of the heap have been thoroughly cleaned, and
we can expect no further updates to those ranges of the FSM save
for concurrent activity. When there are no indexes, and thus no
index passes, run partial FSM vacuums every 8GB of dirtied pages
or 1/8th of the relation, whichever is larger. This allows some
partial work to be made visible without incurring quadratic cost.

In any case, the FSM is small in relation to the table, so even when
quadratic cost is involved, it should not be problematic. Index
passes already incur quadratic cost, and the addition of the FSM work
is unlikely to be measurable.

Run a partial FSM vacuum with a low pruning threshold right at
the beginning of the VACUUM to finish up any work left over from
prior canceled vacuum runs, something that is common in highly
contended tables when running only autovacuum.
---
 src/backend/access/brin/brin.c|  2 +-
 src/backend/access/brin/brin_pageops.c| 10 +++---
 src/backend/commands/vacuumlazy.c | 60 +--
 src/backend/storage/freespace/freespace.c | 31 
 src/backend/storage/freespace/indexfsm.c  |  2 +-
 src/include/storage/freespace.h   |  2 +-
 6 files changed, 90 insertions(+), 17 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index efebeb0..bb80edd 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1424,5 +1424,5 @@ brin_vacuum_scan(Relation idxrel, BufferAccessStrategy strategy)
 	 * the way to the top.
 	 */
 	if (vacuum_fsm)
-		FreeSpaceMapVacuum(idxrel);
+		FreeSpaceMapVacuum(idxrel, 0);
 }
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/acces

Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-27 Thread Claudio Freire
On Thu, Jul 27, 2017 at 2:10 PM, Alvaro Herrera
<alvhe...@2ndquadrant.com> wrote:
> Claudio Freire wrote:
>> On Thu, Jul 27, 2017 at 1:46 PM, Alvaro Herrera
>> <alvhe...@2ndquadrant.com> wrote:
>> > Claudio Freire wrote:
>> >
>> >> > The vacuuming the very large table with no index could also take a
>> >> > long time, and it scans and vacuums blocks one by one. So I imagined
>> >> > that we can vacuum the FSM once vacuumed a certain amount of blocks.
>> >> > And that can avoid bloating table during the long-time vacuum.
>> >>
>> >> Could do that. I'll see about doing something of the sort and
>> >> submitting it as a separate patch.
>> >
>> > Vacuuming of the FSM is in no way strictly tied to vacuuming the heap
>> > (it's not really "vacuuming", it's just about updating the upper layers
>> > to match the data in the leaves).  I think we could use the new autovac
>> > "workitem" infrastructure and tack an item there once in a while for FSM
>> > vacuuming ...
>>
>> Well, it *is* tied in the sense that vacuum is the one massively
>> adding free space.
>
> Yes, but if vacuum dies after releasing lots of space but before
> vacuuming FSM, then it's not tied anymore and you could just as well run
> it anytime.

I see your point.




Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-27 Thread Claudio Freire
On Thu, Jul 27, 2017 at 1:46 PM, Alvaro Herrera
<alvhe...@2ndquadrant.com> wrote:
> Claudio Freire wrote:
>
>> > The vacuuming the very large table with no index could also take a
>> > long time, and it scans and vacuums blocks one by one. So I imagined
>> > that we can vacuum the FSM once vacuumed a certain amount of blocks.
>> > And that can avoid bloating table during the long-time vacuum.
>>
>> Could do that. I'll see about doing something of the sort and
>> submitting it as a separate patch.
>
> Vacuuming of the FSM is in no way strictly tied to vacuuming the heap
> (it's not really "vacuuming", it's just about updating the upper layers
> to match the data in the leaves).  I think we could use the new autovac
> "workitem" infrastructure and tack an item there once in a while for FSM
> vacuuming ...

Well, it *is* tied in the sense that vacuum is the one massively
adding free space.




Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-27 Thread Claudio Freire
On Thu, Jul 27, 2017 at 6:16 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Thu, Jul 27, 2017 at 5:48 PM, Sokolov Yura
> <funny.fal...@postgrespro.ru> wrote:
>> On 2017-07-27 11:30, Masahiko Sawada wrote:
>>>
>>> On Tue, Jul 25, 2017 at 2:27 AM, Claudio Freire <klaussfre...@gmail.com>
>>> wrote:
>>>>
>>>> On Mon, Jul 24, 2017 at 2:20 PM, Claudio Freire <klaussfre...@gmail.com>
>>>> wrote:
>>>>>
>>>>> On Mon, Jul 24, 2017 at 2:10 PM, Sokolov Yura
>>>>> <funny.fal...@postgrespro.ru> wrote:
>>>>>>
>>>>>> On 2017-07-24 19:11, Claudio Freire wrote:
>>>>>>>
>>>>>>> I was mostly thinking about something like the attached patch.
>>>>>>>
>>>>>>> Simple, unintrusive, and shouldn't cause any noticeable slowdown.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Your change is small, clear, and currently useful for huge tables under
>>>>>> high update load (until "allowing vacuum to use more than 1GB memory"
>>>>>> is merged).
>>>>>
>>>>>
>>>>> In high-bloat conditions, it doesn't take long to accumulate 1GB of
>>>>> dead tuples (which is about 178M tuples, btw).
>>>>>
>>>>> The index scan takes way longer than the heap scan in that case.
>>>>>
>>>>>> But it still delays updating fsm until whole first batch of dead tuples
>>>>>> cleared (ie all indices scanned, and all heap pages cleared), and on
>>>>>> such
>>>>>> huge table it will be hours.
>>>>>
>>>>>
>>>>> So, true, it will get delayed considerably. But as you realized,
>>>>> there's not much point in trying to vacuum the FSM sooner, since it
>>>>> won't be accurate shortly afterwards anyway. Dead line pointers do use
>>>>> up a fair bit of space, especially on narrow tables.
>>>>>
>>>>> In a particular table I have that exhibits this problem, most of the
>>>>> time is spent scanning the index. It performs dozens of index scans
>>>>> before it's done, so it would vacuum the FSM quite often enough, even
>>>>> if I were to increase the mwm setting n-fold.
>>>>
>>>>
>>>> I hate to reply to myself, but I wanted to add: in any case, the case
>>>> I'm trying to avoid is the case where the FSM *never* gets vacuumed.
>>>> That's bad. But it may not be the phenomenon you're experiencing in
>>>> your tests.
>>>>
>>>
>>> I think the frequently vacuuming the FSM during long-time vacuum would
>>> be worth to have in order to avoid a table bloating. The patch
>>> triggers to vacuum the FSM after vacuumed the table and indexes but I
>>> think that we can have a similar mechanism for a table with no index.
>>>
>>> Regards,
>>>
>>> --
>>> Masahiko Sawada
>>> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
>>> NTT Open Source Software Center
>>
>>
>> I could be wrong, but it looks like a table without an index doesn't
>> suffer that much. Since there are no indices, there is only one stage,
>> scanning the heap, and no quadratic behavior caused by a full dead-tuple
>> array and repeated index vacuuming.
>>
>
> Vacuuming a very large table with no index could also take a
> long time, and it scans and vacuums blocks one by one. So I imagined
> that we could vacuum the FSM once a certain number of blocks has been
> vacuumed. That could avoid bloating the table during a long vacuum.

Could do that. I'll see about doing something of the sort and
submitting it as a separate patch.




Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-26 Thread Claudio Freire
On Wed, Jul 26, 2017 at 1:39 PM, Sokolov Yura
 wrote:
> On 2017-07-24 12:41, Sokolov Yura wrote:
> test_master_1/pretty.log
...
> time   activity   tps   latency   stddev   min    max
> 11130  av+ch      198   198ms     374ms    7ms    1956ms
> 11160  av+ch      248   163ms     401ms    7ms    2601ms
> 11190  av+ch      321   125ms     363ms    7ms    2722ms
> 11220  av+ch     1155    35ms     123ms    7ms    2668ms
> 11250  av+ch     1390    29ms      79ms    7ms    1422ms

vs

> test_master_ring16_1/pretty.log
> time   activity   tps   latency   stddev   min    max
> 11130  av+ch       26   1575ms    635ms    101ms  2536ms
> 11160  av+ch       25   1552ms    648ms    58ms   2376ms
> 11190  av+ch       32   1275ms    726ms    16ms   2493ms
> 11220  av+ch       23   1584ms    674ms    48ms   2454ms
> 11250  av+ch       35   1235ms    777ms    22ms   3627ms

That's a huge change in latency for the worse.

Are you sure that's the ring buffer's doing and not some methodology snafu?




Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2017-07-24 Thread Claudio Freire
On Mon, Jul 24, 2017 at 2:43 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> On Mon, Jul 24, 2017 at 10:13 AM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> Vacuum *might* be able to redistribute tuples too, while holding locks
>> to all 3 pages (the two children and the parent page), since it can
>> move the TID to the middle point of two actual child TIDs, mindlessly,
>> without checking to see if such TID actually exists - it just needs to
>> choose a TID cutoff point that will distribute item pointers evenly.
>> I haven't tried this, but it is theoretically possible.
>
> I don't understand what you mean here. As long as the TIDs are a first
> class part of the keyspace, what VACUUM does or doesn't do should not
> matter. Page deletion stashes a TID in the highkey position of a leaf
> page within _bt_mark_page_halfdead(), but that shouldn't matter to
> you.
>
> I guess you're talking about contriving a TID value that is the mean
> of two real TID values -- the two that are on each side of the split
> point during a leaf page split. While that seems possible, I don't see
> much value in it. Unless you have normalized keys, you can only really
> truncate whole attributes. And, I think it's a bad idea to truncate
> anything other than the new high key for leaf pages, with or without
> normalized keys. Changing the keys once they're in internal pages is
> never going to be worth it.

That's what I'm saying. It might not be worth it. I haven't tested it yet.




Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-24 Thread Claudio Freire
On Mon, Jul 24, 2017 at 2:20 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Mon, Jul 24, 2017 at 2:10 PM, Sokolov Yura
> <funny.fal...@postgrespro.ru> wrote:
>> On 2017-07-24 19:11, Claudio Freire wrote:
>>> I was mostly thinking about something like the attached patch.
>>>
>>> Simple, unintrusive, and shouldn't cause any noticeable slowdown.
>>
>>
>> Your change is small, clear, and currently useful for huge tables under
>> high update load (until "allowing vacuum to use more than 1GB memory"
>> is merged).
>
> In high-bloat conditions, it doesn't take long to accumulate 1GB of
> dead tuples (which is about 178M tuples, btw).
>
> The index scan takes way longer than the heap scan in that case.
>
>> But it still delays updating fsm until whole first batch of dead tuples
>> cleared (ie all indices scanned, and all heap pages cleared), and on such
>> huge table it will be hours.
>
> So, true, it will get delayed considerably. But as you realized,
> there's not much point in trying to vacuum the FSM sooner, since it
> won't be accurate shortly afterwards anyway. Dead line pointers do use
> up a fair bit of space, especially on narrow tables.
>
> In a particular table I have that exhibits this problem, most of the
> time is spent scanning the index. It performs dozens of index scans
> before it's done, so it would vacuum the FSM quite often enough, even
> if I were to increase the mwm setting n-fold.

I hate to reply to myself, but I wanted to add: in any case, the case
I'm trying to avoid is the case where the FSM *never* gets vacuumed.
That's bad. But it may not be the phenomenon you're experiencing in
your tests.




Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-24 Thread Claudio Freire
On Mon, Jul 24, 2017 at 2:10 PM, Sokolov Yura
<funny.fal...@postgrespro.ru> wrote:
> On 2017-07-24 19:11, Claudio Freire wrote:
>>
>> On Mon, Jul 24, 2017 at 6:37 AM, Sokolov Yura
>> <funny.fal...@postgrespro.ru> wrote:
>>>
>>> Good day, Claudio
>>>
>>>
>>> On 2017-07-22 00:27, Claudio Freire wrote:
>>>>
>>>>
>>>> On Fri, Jul 21, 2017 at 2:41 PM, Sokolov Yura
>>>> <funny.fal...@postgrespro.ru> wrote:
>>>>>
>>>>>
>>>>>
>>>>> My friend noticed, that I didn't said why I bother with autovacuum.
>>>>> Our customers suffers from table bloating. I've made synthetic
>>>>> bloating test, and started experiments with modifying micro- and
>>>>> auto-vacuum. My first attempts were to update FSM early (both in
>>>>> micro and autovacuum) and update it upto root, not only low level.
>>>>
>>>>
>>>>
>>>> This FSM thing is probably not a bad idea as well.
>>>>
>>>> We're forced to run regular manual vacuums because for some tables
>>>> autovacuums seems to never be enough, no matter how it's configured,
>>>> mostly because it gets canceled all the time. These are high-churn,
>>>> huge tables, so vacuuming them takes hours or days, there's always
>>>> someone with a conflicting lock at some point that ends up canceling
>>>> the autovacuum task.
>>>>
>>>> The above paragraph triggered me to go check, and it seems in those
>>>> cases the FSM never gets vacuumed. That's probably not a good thing,
>>>> but I don't see how to vacuum the FSM after a cancel. So vacuuming the
>>>> FSM from time to time during long-running vacuums seems like a good
>>>> idea at this point.
>>>
>>>
>>>
>>> Attached patch changes fsm update: instead of updating only lowest
>>> level, it propagates space increase up to root.
>>>
>>> It slows autovacuum a bit, so that I didn't propose it together with
>>> ring buffer increase.
>>
>>
>> I was mostly thinking about something like the attached patch.
>>
>> Simple, unintrusive, and shouldn't cause any noticeable slowdown.
>
>
> Your change is small, clear, and currently useful for huge tables under
> high update load (until "allowing vacuum to use more than 1GB memory"
> is merged).

In high-bloat conditions, it doesn't take long to accumulate 1GB of
dead tuples (which is about 178M tuples, btw).

The index scan takes way longer than the heap scan in that case.

> But it still delays updating the FSM until the whole first batch of dead
> tuples is cleared (ie all indices scanned, and all heap pages cleared), and
> on such a huge table that will be hours.

So, true, it will get delayed considerably. But as you realized,
there's not much point in trying to vacuum the FSM sooner, since it
won't be accurate shortly afterwards anyway. Dead line pointers do use
up a fair bit of space, especially on narrow tables.

In a particular table I have that exhibits this problem, most of the
time is spent scanning the index. It performs dozens of index scans
before it's done, so it would vacuum the FSM quite often enough, even
if I were to increase the mwm setting n-fold.




Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2017-07-24 Thread Claudio Freire
On Mon, Jul 24, 2017 at 2:02 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> On Mon, Jul 24, 2017 at 9:51 AM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> My point was that the TID doesn't have to point to an actual tuple.
>>
>> It's more of a keyspace thing, so it doesn't need to match real
>> tuples; it can just divide the keyspace with an arbitrary cutoff, and it
>> should be cheaper to maintain without that requirement.
>
> I agree, but unless you're using normalized keys, then I don't see
> that you get much useful leeway from using fake or truncated TID
> values. Presumably the comparison logic will be based on comparing an
> ItemPointerData field, which is impractical to truncate.

In most cases, the TID itself can be omitted when the keys already
differ.

When a split happens, a TID will be added referring to a real tid from
a child page iff necessary.

This gives a lot of leeway in choosing the cutoff point, since a
cutoff point is only added when necessary.

Vacuum *might* be able to redistribute tuples too, while holding locks
to all 3 pages (the two children and the parent page), since it can
move the TID to the middle point of two actual child TIDs, mindlessly,
without checking to see if such TID actually exists - it just needs to
choose a TID cutoff point that will distribute item pointers evenly.
I haven't tried this, but it is theoretically possible.
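
For illustration, contriving such a cutoff could be as simple as the
following sketch (my own, not from any patch; it treats block and offset as
one 48-bit number and takes the midpoint, and the result need not point at
any live tuple):

#include "postgres.h"
#include "storage/itemptr.h"

static void
tid_midpoint(ItemPointer lo_tid, ItemPointer hi_tid, ItemPointer result)
{
    uint64      lo = ((uint64) ItemPointerGetBlockNumber(lo_tid) << 16) |
                     ItemPointerGetOffsetNumber(lo_tid);
    uint64      hi = ((uint64) ItemPointerGetBlockNumber(hi_tid) << 16) |
                     ItemPointerGetOffsetNumber(hi_tid);
    uint64      mid = lo + (hi - lo) / 2;

    /* the cutoff only has to split the keyspace between the two real TIDs */
    ItemPointerSet(result, (BlockNumber) (mid >> 16),
                   (OffsetNumber) (mid & 0xFFFF));
}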




Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2017-07-24 Thread Claudio Freire
On Wed, Nov 23, 2016 at 12:27 AM, Peter Geoghegan <p...@heroku.com> wrote:
> On Mon, Nov 21, 2016 at 5:15 PM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>>> There are a couple
>>> of tricky issues with that that you'd have to look out for, like
>>> making sure that the high key continues to hold a real TID, which at a
>>> glance looks like something that just happens already. I'm not sure
>>> that we preserve the item pointer TID in all cases, since the current
>>> assumption is that a high key's TID is just filler -- maybe we can
>>> lose that at some point.
>>
>> I thought long about that, and inner pages don't need a real TID in
>> their keys, as those keys only specify key space cutoff points and not
>> real tids, and high tids in leaf pages are always real.
>
> Not if there are duplicates in the inner pages. Then, you have to add
> in the TID, and separate the key space, to guide insertions to the
> right leaf page directly (particularly for non-HOT UPDATEs). That's
> what I'm mostly interested in investigating, here.

My point was that the TID doesn't have to point to an actual tuple.

It's more of a keyspace thing, so it doesn't need to match real
tuples; it can just divide the keyspace with an arbitrary cutoff, and it
should be cheaper to maintain without that requirement.




Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-24 Thread Claudio Freire
On Mon, Jul 24, 2017 at 6:37 AM, Sokolov Yura
<funny.fal...@postgrespro.ru> wrote:
> Good day, Claudio
>
>
> On 2017-07-22 00:27, Claudio Freire wrote:
>>
>> On Fri, Jul 21, 2017 at 2:41 PM, Sokolov Yura
>> <funny.fal...@postgrespro.ru> wrote:
>>>
>>>
>>> My friend noticed, that I didn't said why I bother with autovacuum.
>>> Our customers suffers from table bloating. I've made synthetic
>>> bloating test, and started experiments with modifying micro- and
>>> auto-vacuum. My first attempts were to update FSM early (both in
>>> micro and autovacuum) and update it upto root, not only low level.
>>
>>
>> This FSM thing is probably not a bad idea as well.
>>
>> We're forced to run regular manual vacuums because for some tables
>> autovacuums seems to never be enough, no matter how it's configured,
>> mostly because it gets canceled all the time. These are high-churn,
>> huge tables, so vacuuming them takes hours or days, there's always
>> someone with a conflicting lock at some point that ends up canceling
>> the autovacuum task.
>>
>> The above paragraph triggered me to go check, and it seems in those
>> cases the FSM never gets vacuumed. That's probably not a good thing,
>> but I don't see how to vacuum the FSM after a cancel. So vacuuming the
>> FSM from time to time during long-running vacuums seems like a good
>> idea at this point.
>
>
> Attached patch changes fsm update: instead of updating only lowest
> level, it propagates space increase up to root.
>
> It slows autovacuum a bit, so that I didn't propose it together with
> ring buffer increase.

I was mostly thinking about something like the attached patch.

Simple, unintrusive, and shouldn't cause any noticeable slowdown.
From 5da264507058175e614f6ce7c77d2bd0491b1416 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Mon, 24 Jul 2017 13:09:10 -0300
Subject: [PATCH] Vacuum FSM after each index pass

This prevents concurrent writes from accumulating bloat due to
recently freed space being invisible in the FSM yet. When vacuum
can run for hours or days, this can make a huge difference.
---
 src/backend/commands/vacuumlazy.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index fabb2f8d52..4d8d90e833 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -735,6 +735,9 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 			/* Remove tuples from heap */
 			lazy_vacuum_heap(onerel, vacrelstats);
 
+			/* Vacuum the Free Space Map */
+			FreeSpaceMapVacuum(onerel);
+
 			/*
 			 * Forget the now-vacuumed tuples, and press on, but be careful
 			 * not to reset latestRemovedXid since we want that value to be
-- 
2.12.0


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2017-07-23 Thread Claudio Freire
On Fri, Jul 21, 2017 at 10:31 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> On Wed, Aug 17, 2016 at 7:54 PM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> The attached patch tries to maintain the initial status of B-Tree
>> indexes, which are created with equal-key runs in physical order,
>> during the whole life of the B-Tree, and make key-tid pairs
>> efficiently searchable in the process.
>
> I don't see an entry for this in the next CF. Do you have a plan for it?
>
> BTW, I did post some remarks on this patch on another thread recently
> [1]. Not sure if any of what I said there is news to you at this
> point.
>
> [1] 
> postgr.es/m/CAH2-Wzn=6Lc3OVA88x=E6SKG72ojNUE6ut6RZAqNnQx-YLcw=q...@mail.gmail.com
> --
> Peter Geoghegan

I plan to restart work on it soonishly, but ATM most of my time is
dedicated to the vacuum patch, which is almost done.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-21 Thread Claudio Freire
On Fri, Jul 21, 2017 at 2:41 PM, Sokolov Yura
 wrote:
>
> My friend noticed, that I didn't said why I bother with autovacuum.
> Our customers suffers from table bloating. I've made synthetic
> bloating test, and started experiments with modifying micro- and
> auto-vacuum. My first attempts were to update FSM early (both in
> micro and autovacuum) and update it upto root, not only low level.

This FSM thing is probably not a bad idea as well.

We're forced to run regular manual vacuums because for some tables
autovacuum never seems to be enough, no matter how it's configured,
mostly because it gets canceled all the time. These are high-churn,
huge tables, so vacuuming them takes hours or days, and there's always
someone with a conflicting lock at some point that ends up canceling
the autovacuum task.

The above paragraph triggered me to go check, and it seems in those
cases the FSM never gets vacuumed. That's probably not a good thing,
but I don't see how to vacuum the FSM after a cancel. So vacuuming the
FSM from time to time during long-running vacuums seems like a good
idea at this point.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-20 Thread Claudio Freire
On Thu, Jul 20, 2017 at 1:08 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> So, the 64x increase may be justifiable in absolute terms: it's not
> unlikely that a 16MB buffer will be evicted from the OS cache before
> vacuum is done with it, even in heavily throttled vacuums.

Sorry, that should read "It's not *likely* that a 16MB buffer will be evicted"


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-20 Thread Claudio Freire
On Thu, Jul 20, 2017 at 12:51 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Jul 20, 2017 at 11:09 AM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>>> It's no secret that making the ring buffer larger will improve
>>> performance -- in fact, not having a ring buffer at all would improve
>>> performance even more.  But it would also increase the likelihood that
>>> the background work of vacuum would impact the performance of
>>> foreground operations, which is already a pretty serious problem that
>>> we probably don't want to make worse.  I'm not certain what the right
>>> decision is here, but I think that a careful analysis of possible
>>> downsides is needed.
>>
>> IIRC, originally, the default shared_buffers settings was tiny.
>
> It is true that we increased the default shared_buffers value from
> 32MB to 128MB in f358428280953643313ee7756e0a8b8ccfde7660, but it's
> also true ring buffers are capped at 1/8th of shared_buffers
> regardless of anything else, so I don't think that's the explanation
> here.  Even if that weren't the case, how would a 4x increase in the
> default size of shared_buffers (which is probably the most-commonly
> changed GUC of any that PostgreSQL has) justify a 64x increase in the
> size of the ring buffer?

I'm theorizing here, because I've not been involved in any of those
decisions. But I have been stracing and checking on vacuum quite
frequently lately, so my 2 cents:

The 4x increase in shared_buffers acknowledges increases in available
host memory over the years. It's not just about how much of
shared_buffers is dedicated to the ring buffer, but also whether we
can reasonably expect the whole ring to remain in the OS cache while
it's getting dirtied.

Vacuum will almost always dirty pages once and never again, and
flushing dirty pages back to the OS cache ASAP helps avoid a
read-modify-write cycle if the page didn't leave the OS cache. That's
more likely to happen with smaller rings than with bigger rings. But
as memory increases, the ring can be made bigger without fear of it
falling from the OS cache prematurely.

So, the 64x increase may be justifiable in absolute terms: it's not
unlikely that a 16MB buffer will be evicted from the OS cache before
vacuum is done with it, even in heavily throttled vacuums. Memory
pressure on the host would have to be insane to cause that, in modern
systems with GBs of RAM. That might not have been true in the early
days.

Now, whether autovacuum would suck up a big portion of shared_buffers
or not is another matter. Perhaps the ring buffer could tune itself to
whatever limit seems comfortable in that regard, the way autovacuum
already does with other GUCs (like cost_limit): divide it among the
number of workers?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-20 Thread Claudio Freire
On Thu, Jul 20, 2017 at 11:59 AM, Robert Haas  wrote:
> On Tue, Jul 18, 2017 at 6:09 AM, Sokolov Yura
>  wrote:
>> I investigated autovacuum performance, and found that it suffers a lot
>> from small ring buffer. It suffers in a same way bulk writer suffered
>> before Tom Lane's commit 6382448cf96:
>>
>>> Tom Lane   2009-06-23 00:04:28
>>> For bulk write operations (eg COPY IN), use a ring buffer of 16MB
>>> instead of the 256KB limit originally enforced by a patch committed
>>> 2008-11-06. Per recent test results, the smaller size resulted in an
>>> undesirable decrease in bulk data loading speed, due to COPY
>>> processing frequently getting blocked for WAL flushing. This area
>>> might need more tweaking later, but this setting seems to be good
>>> enough for 8.4.
>>
>> It is especially noticable when database doesn't fit in shared_buffers
>> but fit into OS file cache, and data is intensively updated (ie OLTP
>> load). In this scenario autovacuum with current 256kB (32 pages) ring
>> buffer lasts 3-10 times longer than with increased to 16MB ring buffer.
>>
>> I've tested with synthetic load with 256MB or 1GB shared buffers and
>> 2-6GB (with indices) tables, with different load factor and with/without
>> secondary indices on updated columns. Table were randomly updated with
>> hot and non-hot updates. Times before/after buffer increase (depending
>> on load) were 7500sec/1200sec, 75000sec/11500sec. So benefit is
>> consistently reproducible.
>>
>> I didn't tested cases when database doesn't fit OS file cache. Probably
>> in this case benefit will be smaller cause more time will be spent in
>> disk read.
>> I didn't test intensively OLAP load. I've seen once that increased
>> buffer slows a bit scanning almost immutable huge table, perhaps cause
>> of decreased CPU cache locality. But given such scan is already fast,
>> and autovacuum of "almost immutable table" runs rarely, I don't think
>> it is very important.
>>
>> Initially I wanted to make BAS_BULKWRITE and BAS_VACUUM ring sizes
>> configurable, but after testing I don't see much gain from increasing
>> ring buffer above 16MB. So I propose just 1 line change.
>
> I think the question for this patch is "so, why didn't we do it this
> way originally?".
>
> It's no secret that making the ring buffer larger will improve
> performance -- in fact, not having a ring buffer at all would improve
> performance even more.  But it would also increase the likelihood that
> the background work of vacuum would impact the performance of
> foreground operations, which is already a pretty serious problem that
> we probably don't want to make worse.  I'm not certain what the right
> decision is here, but I think that a careful analysis of possible
> downsides is needed.

IIRC, originally, the default shared_buffers setting was tiny.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

2017-07-20 Thread Claudio Freire
On Thu, Jul 20, 2017 at 12:08 AM, Peter Geoghegan  wrote:
>> The traditional
>> wisdom about btrees, for instance, is that no matter how full you pack
>> them to start with, the steady state is going to involve something like
>> 1/3rd free space.  You can call that bloat if you want, but it's not
>> likely that you'll be able to reduce the number significantly without
>> paying exorbitant costs.
>
> For the purposes of this discussion, I'm mostly talking about
> duplicates within a page on a unique index. If the keyspace owned by
> an int4 unique index page only covers 20 distinct values, it will only
> ever cover 20 distinct values, now and forever, despite the fact that
> there is room for about 400 (a 90/10 split leaves you with 366 items +
> 1 high key).

Microvacuum could also help.

If a scan finds index pointers that point to dead (in vacuum terms)
tuples, those pointers could be deleted. That could be done during
insertion into unique indexes before a split, to avoid the split.

Chances are, if there are duplicates, at least a few of them will be dead.
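
Roughly what I have in mind, as a sketch only (loosely modeled on the
existing handling of killed items; the function name is made up, and
locking/WAL logging are omitted): before falling back to a split, sweep
the page for entries already marked LP_DEAD by earlier scans and delete
them.

#include "postgres.h"

#include "access/itup.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"

/*
 * Sketch: delete index entries already marked dead by earlier index scans.
 * Returns true if any were removed, i.e. the split may now be avoidable.
 */
static bool
sketch_microvacuum_page(Page page)
{
    OffsetNumber deletable[MaxIndexTuplesPerPage];
    int          ndeletable = 0;
    OffsetNumber offnum;
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

    for (offnum = FirstOffsetNumber; offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      itemid = PageGetItemId(page, offnum);

        if (ItemIdIsDead(itemid))   /* killed by a previous index scan */
            deletable[ndeletable++] = offnum;
    }

    if (ndeletable > 0)
        PageIndexMultiDelete(page, deletable, ndeletable);

    return ndeletable > 0;
}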


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Fwd: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-07-13 Thread Claudio Freire
On Wed, Jul 12, 2017 at 1:29 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Wed, Jul 12, 2017 at 1:08 PM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> On Wed, Jul 12, 2017 at 11:48 AM, Alexey Chernyshov
>> <a.chernys...@postgrespro.ru> wrote:
>>> Thank you for the patch and benchmark results, I have a couple remarks.
>>> Firstly, padding in DeadTuplesSegment
>>>
>>> typedef struct DeadTuplesSegment
>>>
>>> {
>>>
>>> ItemPointerData last_dead_tuple;/* Copy of the last dead tuple
>>> (unset
>>>
>>>  * until the segment is fully
>>>
>>>  * populated). Keep it first to
>>> simplify
>>>
>>>  * binary searches */
>>>
>>> unsigned short padding;/* Align dt_tids to 32-bits,
>>>
>>>  * sizeof(ItemPointerData) is aligned to
>>>
>>>  * short, so add a padding short, to make
>>> the
>>>
>>>  * size of DeadTuplesSegment a multiple of
>>>
>>>  * 32-bits and align integer components for
>>>
>>>  * better performance during lookups into
>>> the
>>>
>>>  * multiarray */
>>>
>>> intnum_dead_tuples;/* # of entries in the segment */
>>>
>>> intmax_dead_tuples;/* # of entries allocated in the
>>> segment */
>>>
>>> ItemPointer dt_tids;/* Array of dead tuples */
>>>
>>> }DeadTuplesSegment;
>>>
>>> In the comments to ItemPointerData is written that it is 6 bytes long, but
>>> can be padded to 8 bytes by some compilers, so if we add padding in a
>>> current way, there is no guaranty that it will be done as it is expected.
>>> The other way to do it with pg_attribute_alligned. But in my opinion, there
>>> is no need to do it manually, because the compiler will do this optimization
>>> itself.
>>
>> I'll look into it. But my experience is that compilers won't align
>> struct size like this, only attributes, and this attribute is composed
>> of 16-bit attributes so it doesn't get aligned by default.
>
> Doing sizeof(DeadTuplesSegment) suggests you were indeed right, at
> least in GCC. I'll remove the padding.
>
> Seems I just got the wrong impression at some point.

Updated versions of the patches attached.

A few runs of the benchmark show no significant difference, as
expected (the changes are all cosmetic).

The bigger benchmark will take longer.
From b2bf9a671c2c4d517d59628f12a7f564ccf152f6 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.

Improve test ability to spot vacuum errors
---
 src/backend/commands/vacuumlazy.c| 402 ---
 src/test/regress/expected/vacuum.out |  26 +++
 src/test/regress/sql/vacuum.sql  |  19 ++
 3 files changed, 373 insertions(+), 74 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 30a0050..287accd 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -11,11 +11,24 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
+ * initially allocate an array of TIDs of 128MB, or an upper limit that
  * depends on table size (this limit ensures we don't allocate a huge area
- * uselessly for vacuuming small tables).  If the array threatens to overflow,
+ * uselessly for vacuuming small tables). Additional arrays of increasingly
+ * large sizes are allocated as they become necessary.
+ *
+ * The TID array is thus represented as a list of multiple segments of
+ * varying size, beginning with the initial size of up to 128MB, and growing
+ * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
+ * is used up.
+ *
+ * Lookup in that structure happens with a binary search of segments, and then
+ * with a binary search wit

Re: Fwd: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-07-12 Thread Claudio Freire
On Wed, Jul 12, 2017 at 1:08 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Wed, Jul 12, 2017 at 11:48 AM, Alexey Chernyshov
> <a.chernys...@postgrespro.ru> wrote:
>> Thank you for the patch and benchmark results, I have a couple remarks.
>> Firstly, padding in DeadTuplesSegment
>>
>> typedef struct DeadTuplesSegment
>>
>> {
>>
>> ItemPointerData last_dead_tuple;/* Copy of the last dead tuple
>> (unset
>>
>>  * until the segment is fully
>>
>>  * populated). Keep it first to
>> simplify
>>
>>  * binary searches */
>>
>> unsigned short padding;/* Align dt_tids to 32-bits,
>>
>>  * sizeof(ItemPointerData) is aligned to
>>
>>  * short, so add a padding short, to make
>> the
>>
>>  * size of DeadTuplesSegment a multiple of
>>
>>  * 32-bits and align integer components for
>>
>>  * better performance during lookups into
>> the
>>
>>  * multiarray */
>>
>> intnum_dead_tuples;/* # of entries in the segment */
>>
>> intmax_dead_tuples;/* # of entries allocated in the
>> segment */
>>
>> ItemPointer dt_tids;/* Array of dead tuples */
>>
>> }DeadTuplesSegment;
>>
>> In the comments to ItemPointerData is written that it is 6 bytes long, but
>> can be padded to 8 bytes by some compilers, so if we add padding in a
>> current way, there is no guaranty that it will be done as it is expected.
>> The other way to do it with pg_attribute_alligned. But in my opinion, there
>> is no need to do it manually, because the compiler will do this optimization
>> itself.
>
> I'll look into it. But my experience is that compilers won't align
> struct size like this, only attributes, and this attribute is composed
> of 16-bit attributes so it doesn't get aligned by default.

Doing sizeof(DeadTuplesSegment) suggests you were indeed right, at
least in GCC. I'll remove the padding.

Seems I just got the wrong impression at some point.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Fwd: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-07-12 Thread Claudio Freire
On Wed, Jul 12, 2017 at 11:48 AM, Alexey Chernyshov
<a.chernys...@postgrespro.ru> wrote:
> Thank you for the patch and benchmark results, I have a couple remarks.
> Firstly, padding in DeadTuplesSegment
>
> typedef struct DeadTuplesSegment
>
> {
>
> ItemPointerData last_dead_tuple;/* Copy of the last dead tuple
> (unset
>
>  * until the segment is fully
>
>  * populated). Keep it first to
> simplify
>
>  * binary searches */
>
> unsigned short padding;/* Align dt_tids to 32-bits,
>
>  * sizeof(ItemPointerData) is aligned to
>
>  * short, so add a padding short, to make
> the
>
>  * size of DeadTuplesSegment a multiple of
>
>  * 32-bits and align integer components for
>
>  * better performance during lookups into
> the
>
>  * multiarray */
>
> intnum_dead_tuples;/* # of entries in the segment */
>
> intmax_dead_tuples;/* # of entries allocated in the
> segment */
>
> ItemPointer dt_tids;/* Array of dead tuples */
>
> }DeadTuplesSegment;
>
> In the comments to ItemPointerData is written that it is 6 bytes long, but
> can be padded to 8 bytes by some compilers, so if we add padding in a
> current way, there is no guaranty that it will be done as it is expected.
> The other way to do it with pg_attribute_alligned. But in my opinion, there
> is no need to do it manually, because the compiler will do this optimization
> itself.

I'll look into it. But my experience is that compilers won't align
struct size like this, only attributes, and this attribute is composed
of 16-bit attributes so it doesn't get aligned by default.

> On 11.07.2017 19:51, Claudio Freire wrote:
>>>
>>> -
>>>
>>> +   /* Search for the segment likely to contain the item pointer */
>>> +   iseg = vac_itemptr_binsrch(
>>> +   (void *) itemptr,
>>> +   (void *)
>>> &(vacrelstats->dead_tuples.dt_segments->last_dead_tuple),
>>> +   vacrelstats->dead_tuples.last_seg + 1,
>>> +   sizeof(DeadTuplesSegment));
>>> +
>>>
>>> I think that we can change the above to;
>>>
>>> +   /* Search for the segment likely to contain the item pointer */
>>> +   iseg = vac_itemptr_binsrch(
>>> +   (void *) itemptr,
>>> +   (void *) &(seg->last_dead_tuple),
>>> +   vacrelstats->dead_tuples.last_seg + 1,
>>> +   sizeof(DeadTuplesSegment));
>>>
>>> We set "seg = vacrelstats->dead_tuples.dt_segments" just before this.
>>
>> Right
>
> In my mind, if you change vacrelstats->dead_tuples.last_seg + 1 with
> GetNumDeadTuplesSegments(vacrelstats), it would be more meaningful.

It's not the same thing. On the first run they might match, but after a
reset of the multiarray, num_segs is the number of allocated segments,
while last_seg is the last one actually filled with data.

> Besides, you can change the vac_itemptr_binsrch within the segment with
> stdlib bsearch, like:
>
> res = (ItemPointer) bsearch((void *) itemptr,
>
> (void *) seg->dt_tids,
>
> seg->num_dead_tuples,
>
> sizeof(ItemPointerData),
>
> vac_cmp_itemptr);
>
> return (res != NULL);

The stdlib's bsearch is quite a bit slower. The custom bsearch inlines
the comparison, which lets the compiler factor quite a bit of logic out
of the loop and, in general, generate far more specialized assembly.

For the compiler to optimize the stdlib's bsearch call, whole-program
optimization would have to be enabled, which is unlikely. Even then, it
may not be able to, due to aliasing rules.

This is what I came up with to make the new approach's performance on
par with or better than the old one, in CPU cycles. In fact, benchmarks
show that the time spent on the CPU is lower now, in large part due to
this.

It's not like it's the first custom binary search in postgres, either.
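
For illustration, here's a standalone toy version of both approaches
(not the patch's code; SimpleTid is a simplified stand-in for
ItemPointerData). The point is only that the specialized search exposes
the comparison to the optimizer, while bsearch() pays an opaque
indirect call per compare.

#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>

typedef struct SimpleTid
{
    uint32_t block;
    uint16_t offset;
} SimpleTid;

static inline int
tid_cmp(const SimpleTid *a, const SimpleTid *b)
{
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset)
        return (a->offset < b->offset) ? -1 : 1;
    return 0;
}

/* Generic comparator: bsearch() makes one indirect call per comparison. */
static int
tid_cmp_generic(const void *a, const void *b)
{
    return tid_cmp((const SimpleTid *) a, (const SimpleTid *) b);
}

/* Specialized search: the comparison is inlined into the loop. */
static inline bool
tid_binsrch(const SimpleTid *key, const SimpleTid *arr, size_t n)
{
    size_t lo = 0;
    size_t hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        int    c = tid_cmp(key, &arr[mid]);

        if (c == 0)
            return true;
        if (c < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return false;
}

bool
tid_present(const SimpleTid *key, const SimpleTid *arr, size_t n)
{
    /* Both return the same answer; only the generated code differs. */
    bool generic = bsearch(key, arr, n, sizeof(SimpleTid),
                           tid_cmp_generic) != NULL;
    bool special = tid_binsrch(key, arr, n);

    return generic && special;
}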


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Fwd: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-07-11 Thread Claudio Freire
re intended to be the opposite of that, I
suspect I somehow failed to get the desired outcome, so I'll be
double-checking that.

In any case, the slowdown only materializes when vacuuming with a
large mwm setting, which is something that shouldn't happen
unintentionally.
From 1886860a97245219b328b50b9aca9f65c1da30d7 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.

Improve test ability to spot vacuum errors
---
 src/backend/commands/vacuumlazy.c| 409 ---
 src/test/regress/expected/vacuum.out |  26 +++
 src/test/regress/sql/vacuum.sql  |  19 ++
 3 files changed, 380 insertions(+), 74 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 30a0050..69fc00d 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -11,11 +11,24 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
+ * initially allocate an array of TIDs of 128MB, or an upper limit that
  * depends on table size (this limit ensures we don't allocate a huge area
- * uselessly for vacuuming small tables).  If the array threatens to overflow,
+ * uselessly for vacuuming small tables). Additional arrays of increasingly
+ * large sizes are allocated as they become necessary.
+ *
+ * The TID array is thus represented as a list of multiple segments of
+ * varying size, beginning with the initial size of up to 128MB, and growing
+ * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
+ * is used up.
+ *
+ * Lookup in that structure happens with a binary search of segments, and then
+ * with a binary search within each segment. Since segment's size grows
+ * exponentially, this retains O(log N) lookup complexity.
+ *
+ * If the multiarray's total size threatens to grow beyond maintenance_work_mem,
  * we suspend the heap scan phase and perform a pass of index cleanup and page
- * compaction, then resume the heap scan with an empty TID array.
+ * compaction, then resume the heap scan with an array of logically empty but
+ * already preallocated TID segments to be refilled with more dead tuple TIDs.
  *
  * If we're processing a table with no indexes, we can just vacuum each page
  * as we go; there's no need to save up multiple tuples to minimize the number
@@ -92,6 +105,14 @@
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
 
 /*
+ * Minimum (starting) size of the dead_tuples array segments. Will allocate
+ * space for 128MB worth of tid pointers in the first segment, further segments
+ * will grow in size exponentially. Don't make it too small or the segment list
+ * will grow bigger than the sweetspot for search efficiency on big vacuums.
+ */
+#define LAZY_INIT_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
+
+/*
  * Before we consider skipping a page that's marked as clean in
  * visibility map, we must've seen at least this many clean pages.
  */
@@ -103,6 +124,34 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+typedef struct DeadTuplesSegment
+{
+	ItemPointerData last_dead_tuple;	/* Copy of the last dead tuple (unset
+		 * until the segment is fully
+		 * populated). Keep it first to simplify
+		 * binary searches */
+	unsigned short padding;		/* Align dt_tids to 32-bits,
+ * sizeof(ItemPointerData) is aligned to
+ * short, so add a padding short, to make the
+ * size of DeadTuplesSegment a multiple of
+ * 32-bits and align integer components for
+ * better performance during lookups into the
+ * multiarray */
+	int			num_dead_tuples;	/* # of entries in the segment */
+	int			max_dead_tuples;	/* # of entries allocated in the segment */
+	ItemPointer dt_tids;		/* Array of dead tuples */
+}	DeadTuplesSegment;
+
+typedef struct DeadTuplesMultiArray
+{
+	int			num_entries;	/* current # of entries */
+	int			max_entries;	/* total # of slots that can be allocated in
+ * array */
+	int			num_segs;		/* number of dead tuple segments allocated */
+	int			last_seg;		/* last dead tuple segment with data (or 0) */
+	DeadTuplesSegment *dt_segments;		/* array of num_segs segments */
+}	DeadTuplesMultiArray;
+
 typedef struct LVRelStats
 {
 	/* hasindex = true means two-pass strategy; false means one-pass */
@@ -123,14 +172,20 @@ typedef struct LVRelStats
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
 	/* List of TIDs of tup

Re: [HACKERS] RFC: Key normalization for nbtree

2017-07-10 Thread Claudio Freire
On Mon, Jul 10, 2017 at 3:40 PM, Peter Geoghegan  wrote:
> It might appear excessive to talk about several different techniques
> in one place, but that seemed like the best way to me, because there
> are subtle dependencies. If most of the optimizations are pursued as a
> project all at once (say, key normalization, suffix truncation, and
> treating heap TID as a unique-ifier), that may actually be more likely
> to succeed than a project to do just one. The techniques don't appear
> to be related at first, but they really are.

I do have a patch that attacks suffix truncation, heap tid unification
and prefix compression all at once.

It's on hiatus ATM, but, as you say, the implementations are highly
correlated, so attacking them at once makes a lot of sense. Or, at
least, attacking one while keeping the others in the back of your mind.

Key normalization would simplify prefix compression considerably, for instance.

A missing optimization is that TID unification allows VACUUM to
implement a different strategy when it needs to clean up only a tiny
fraction of the index: it can look entries up by key-TID instead of
scanning the whole index, which can be a win if the index is large and
the number of index pointers to kill is small.
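
As a back-of-the-envelope illustration of when that wins (a made-up
cost model of my own, not anything from a patch), the decision would
look something like this:

#include <stdbool.h>

/*
 * Targeted cleanup does one B-tree descent per dead TID, so it costs
 * roughly ndead_tids * tree_height page reads; a full index scan reads
 * every index page.  Prefer the targeted strategy only when it's cheaper.
 */
bool
prefer_targeted_index_cleanup(double index_pages, double tree_height,
                              double ndead_tids)
{
    double full_scan_cost = index_pages;
    double targeted_cost = ndead_tids * tree_height;

    return targeted_cost < full_scan_cost;
}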


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Perfomance bug in v10

2017-06-02 Thread Claudio Freire
On Fri, Jun 2, 2017 at 11:46 AM, Teodor Sigaev  wrote:
>> There were old threads about considering a risk factor when estimating
>> plans, and I'm thinking this issue is the planner failing to do
>> exactly that.
>>
> I'm afraid it's tool late for v10

Clearly


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Perfomance bug in v10

2017-06-02 Thread Claudio Freire
On Fri, Jun 2, 2017 at 10:27 AM, Tom Lane  wrote:
> Teodor Sigaev  writes:
>>> Teodor, could you check if this patch fixes your real-world problem?
>
>> It works fine with original query, thank you. But some other query slowdowns 
>> for
>> ~10% (9 secs vs 10 secs). Look at following part of plans of huge query:
>> ...
>> As you said, it removes Materialize node, although it's useful here.
>
> Meh.  If it's expecting only one outer row, it shouldn't be using a
> Materialize on the inner side, period.  That's not sane per the cost
> model.  You haven't shown us enough to guess why the rowcount estimates
> are so far off reality in this query, but I don't think it's the fault
> of this patch if the result is slightly slower given that much error.
>
> Most likely, the answer for your real-world problem is "you need to
> ANALYZE before running the query".
>
> regards, tom lane

I don't know. Perhaps the risky part is assuming rows=1 for non-unique
inner scans. In fact a wrongly estimated rows=1 outer scan would be
just as risky.

There were old threads about considering a risk factor when estimating
plans, and I'm thinking this issue is the planner failing to do
exactly that.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Use of non-restart-safe storage by temp_tablespaces

2017-05-29 Thread Claudio Freire
On Mon, May 29, 2017 at 3:53 PM, Bruce Momjian  wrote:
> Right now we don't document that temp_tablespaces can use
> non-restart-safe storage, e.g. /tmp, ramdisks.  Would this be safe?
> Should we document this?

I have set up things like that, but it's nontrivial.

Just pointing the tablespace to non-restart-safe storage will get you
an installation that fails to boot after a restart, since there's a
directory tree that is expected to survive, and when it's not found,
postgres just fails to boot - or it fails when you try to use the
tablespace, I forget which. In essence, it won't do to just do that.
What I ended up doing was backing up the empty tablespace and adding a
restore step to the system bootup scripts, so that after a restart
postgres finds an empty tablespace there.

So, there's a lot of internal hackery to document if you'd like to
document that kind of usage.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CTE inlining

2017-05-03 Thread Claudio Freire
On Wed, May 3, 2017 at 11:31 AM, David Fetter <da...@fetter.org> wrote:
> On Wed, May 03, 2017 at 11:26:27AM -0300, Claudio Freire wrote:
>> On Wed, May 3, 2017 at 2:19 AM, Craig Ringer <cr...@2ndquadrant.com> wrote:
>> >> Or we will choose WITH MATERIALIZE, and then the users aware of
>> >> the fencing (and using the CTEs for that purpose) will have to
>> >> modify the queries. But does adding MATERIALIZE quality as major
>> >> query rewrite?
>> >
>> > Hardly.
>> >
>> >> Perhaps combining this with a GUC would be a solution. I mean, a
>> >> GUC specifying the default behavior, and then INLINE /
>> >> MATERIALIZE for individual CTEs in a query?
>> >
>> > It'd be nice if we could do that for a couple of releases as an
>> > interim measure, but people will then get locked into relying on
>> > it, and we'll never be able to remove it.
>>
>> The proposed guc seems like a good idea, without which ORMs that
>> support CTEs would be at a loss.
>
> Are you aware of such an ORM which both supports WITH and doesn't also
> closely track PostgreSQL development?  I'm not.
>
> Even assuming that such a thing exists, it's not at all obvious to me
> that we should be stalling and/or putting in what will turn out to be
> misfeatures to accommodate it.

I know SQLAlchemy does support CTEs, but it lags quite considerably in
its support of the latest syntactic elements.

For instance, it took them 8 months to support the "skip locked" option.

Not sure whether that qualifies as "closely tracking" postgres for
you. Clearly they do track it, but that doesn't mean they do it as fast
as one would like/need.

Sure, that might not be enough to warrant the GUC. I would think it
is, but those are my 2 cents. YMMV.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CTE inlining

2017-05-03 Thread Claudio Freire
On Wed, May 3, 2017 at 2:19 AM, Craig Ringer  wrote:
>> Or we will choose WITH MATERIALIZE, and then the users aware of the fencing
>> (and using the CTEs for that purpose) will have to modify the queries. But
>> does adding MATERIALIZE quality as major query rewrite?
>
> Hardly.
>
>> Perhaps combining this with a GUC would be a solution. I mean, a GUC
>> specifying the default behavior, and then INLINE / MATERIALIZE for
>> individual CTEs in a query?
>
> It'd be nice if we could do that for a couple of releases as an
> interim measure, but people will then get locked into relying on it,
> and we'll never be able to remove it.

The proposed GUC seems like a good idea, without which ORMs that
support CTEs would be at a loss. People using those ORMs who need
materialized behavior would have to wait for the ORM to catch up with
postgres syntax before upgrading, and that wouldn't be a nice thing.

It's not about requiring testing before upgrading - of course users
should/will do that. But if said testing shows that inlined CTEs
perform horribly, and the ORM has no support for the materialized
keyword, the only option is to not upgrade. With the GUC, people can
upgrade while changing the default behavior back to what it was.

That seems to me a useful thing to have.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PG 10 release notes

2017-04-25 Thread Claudio Freire
On Tue, Apr 25, 2017 at 2:45 PM, Bruce Momjian  wrote:
> On Tue, Apr 25, 2017 at 10:37:48AM -0700, Andres Freund wrote:
>> On 2017-04-25 13:11:32 -0400, Bruce Momjian wrote:
>> > I don't think this warrants inclusion in the release notes for reasons
>> > already discussed.  The vacuum truncation operation is a rare one and
>> > an implementation detail.
>>
>> I think that's backwards. The truncate operation is quite delicate
>> because it happens with AccessExclusiveLock held.  This regularly does
>> cause issues in production.  When users look for things they possibly
>> should update for, something like "performance improvents in final
>> vacuum phase" + oneliner is going to be a lot more interesting than,
>> say, "Add MONEY operators for multiplication and division with INT8
>> values".
>>
>> More and more users are going to be primarily interested in three
>> classes of new things in postgres: 1) performance 2) replication 3)
>> easier management. Arbitrarily excluding one of the major categories
>> from release notes isn't a useful policy, especially if we continue to
>> list new feature that'll effect no more than a handful of people.
>
> I disputed that my logic is arbitrary.  However, given your
> explanation, I have added the item:
>
>Improve speed of VACUUM's removal of trailing empty
>heap pages (Alvaro Herrera)

That's enough for me, thanks.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PG 10 release notes

2017-04-25 Thread Claudio Freire
On Tue, Apr 25, 2017 at 2:11 PM, Bruce Momjian <br...@momjian.us> wrote:
> On Tue, Apr 25, 2017 at 01:37:13PM -0300, Claudio Freire wrote:
>> > I think it has been pretty common to accumulate a lot of such changes
>> > into generic entries like, say, "speedups for hash joins".  More detail
>> > than that simply isn't useful to end users; and as a rule, our release
>> > notes are too long anyway.
>>
>> In that spirit, the truncation speedups it seems are missing:
>>
>> Might be summarized simply as:
>>
>> Vacuum truncation has been sped up for rotating media, sometimes
>> considerably (up to five times in certain configurations).
>>
>> Full commit, for reference:
>>
>> commit 7e26e02eec90370dd222f35f00042f8188488ac4
>> Author: Alvaro Herrera <alvhe...@alvh.no-ip.org>
>> Date:   Mon Jan 23 12:55:18 2017 -0300
>>
>> Prefetch blocks during lazy vacuum's truncation scan
>>
>> Vacuum truncation scan can be sped up on rotating media by prefetching
>> blocks in forward direction.  That makes the blocks already present in
>> memory by the time they are needed, while also letting OS read-ahead
>> kick in.
>>
>> The truncate scan has been measured to be five times faster than without
>> this patch (that was on a slow disk, but it shouldn't hurt on fast
>> disks.)
>>
>> Author: Álvaro Herrera, loosely based on a submission by Claudio Freire
>> Discussion:
>> https://postgr.es/m/cagtbqpa6nfgo_6g_y_7zqx8l9gchdsqkydo1tguh791z6py...@mail.gmail.com
>
> I don't think this warrants inclusion in the release notes for reasons
> already discussed.  The vacuum truncation operation is a rare one and
> an implementation detail.

\_(0_0)_/

As you wish.

Though if I weren't already aware of it, I would want to know, because
it's been a source of trouble in the past.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PG 10 release notes

2017-04-25 Thread Claudio Freire
On Tue, Apr 25, 2017 at 12:45 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Andres Freund <and...@anarazel.de> writes:
>> On 2017-04-24 23:37:42 -0400, Bruce Momjian wrote:
>>> I remember seeing those and those are normally details I do not put in
>>> the release notes as there isn't a clear user experience change except
>>> "Postgres is faster".  Yeah, a bummer, and I can change my filter, but
>>> it would require discussion.
>
>> I think "postgres is faster" is one of the bigger user demands, so I
>> don't think that policy makes much sense.  A large number of the changes
>> over the next few releases will focus solely on that.  Nor do I think
>> past release notes particularly filtered such changes out.
>
> I think it has been pretty common to accumulate a lot of such changes
> into generic entries like, say, "speedups for hash joins".  More detail
> than that simply isn't useful to end users; and as a rule, our release
> notes are too long anyway.

In that spirit, it seems the truncation speedups are missing.

They might be summarized simply as:

Vacuum truncation has been sped up for rotating media, sometimes
considerably (up to five times in certain configurations).

Full commit, for reference:

commit 7e26e02eec90370dd222f35f00042f8188488ac4
Author: Alvaro Herrera <alvhe...@alvh.no-ip.org>
Date:   Mon Jan 23 12:55:18 2017 -0300

Prefetch blocks during lazy vacuum's truncation scan

Vacuum truncation scan can be sped up on rotating media by prefetching
blocks in forward direction.  That makes the blocks already present in
memory by the time they are needed, while also letting OS read-ahead
kick in.

The truncate scan has been measured to be five times faster than without
this patch (that was on a slow disk, but it shouldn't hurt on fast
disks.)

Author: Álvaro Herrera, loosely based on a submission by Claudio Freire
Discussion:
https://postgr.es/m/cagtbqpa6nfgo_6g_y_7zqx8l9gchdsqkydo1tguh791z6py...@mail.gmail.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-24 Thread Claudio Freire
On Sun, Apr 23, 2017 at 12:41 PM, Robert Haas  wrote:
>> That's after inlining the compare on both the linear and sequential
>> code, and it seems it lets the compiler optimize the binary search to
>> the point where it outperforms the sequential search.
>>
>> That's not the case when the compare isn't inlined.
>>
>> That seems in line with [1], that show the impact of various
>> optimizations on both algorithms. It's clearly a close enough race
>> that optimizations play a huge role.
>>
>> Since we're not likely to go and implement SSE2-optimized versions, I
>> believe I'll leave the binary search only. That's the attached patch
>> set.
>
> That sounds reasonable based on your test results.  I guess part of
> what I was wondering is whether a vacuum on a table large enough to
> require multiple gigabytes of work_mem isn't likely to be I/O-bound
> anyway.  If so, a few cycles one way or the other other isn't likely
> to matter much.  If not, where exactly are all of those CPU cycles
> going?

I haven't been able to produce a table large enough to get a CPU-bound
vacuum, so such a case is likely to require huge storage and a very
powerful I/O system. Mine can only get about 100MB/s tops, and at that
speed, vacuum is I/O-bound even for multi-GB work_mem. That's why I've
been using the reported CPU time as the benchmark.

BTW, I left the benchmark script running all weekend at the office,
and when I got back a power outage had aborted it. In a few days I'll
be out on vacation, so I'm not sure I'll get the benchmark results
anytime soon. But since this patch moved to 11.0, I guess there's no rush.

Just FTR, in case I leave before the script is done, the script got to
scale 400 before the outage:

INFO:  vacuuming "public.pgbench_accounts"
INFO:  scanned index "pgbench_accounts_pkey" to remove 4000 row versions
DETAIL:  CPU: user: 5.94 s, system: 1.26 s, elapsed: 26.77 s.
INFO:  "pgbench_accounts": removed 4000 row versions in 655739 pages
DETAIL:  CPU: user: 3.36 s, system: 2.57 s, elapsed: 61.67 s.
INFO:  index "pgbench_accounts_pkey" now contains 0 row versions in 109679 pages
DETAIL:  4000 index row versions were removed.
109289 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.06 s.
INFO:  "pgbench_accounts": found 38925546 removable, 0 nonremovable
row versions in 655738 out of 655738 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 1098
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 15.34 s, system: 6.95 s, elapsed: 126.21 s.
INFO:  "pgbench_accounts": truncated 655738 to 0 pages
DETAIL:  CPU: user: 0.22 s, system: 2.10 s, elapsed: 8.10 s.

In summary:

binsrch v10:

s100: CPU: user: 3.02 s, system: 1.51 s, elapsed: 16.43 s.
s400: CPU: user: 15.34 s, system: 6.95 s, elapsed: 126.21 s.

The old results:

Old Patched (sequential search):

s100: CPU: user: 3.21 s, system: 1.54 s, elapsed: 18.95 s.
s400: CPU: user: 14.03 s, system: 6.35 s, elapsed: 107.71 s.
s4000: CPU: user: 228.17 s, system: 108.33 s, elapsed: 3017.30 s.

Unpatched:

s100: CPU: user: 3.39 s, system: 1.64 s, elapsed: 18.67 s.
s400: CPU: user: 15.39 s, system: 7.03 s, elapsed: 114.91 s.
s4000: CPU: user: 282.21 s, system: 105.95 s, elapsed: 3017.28 s.

I wouldn't fret over the slight slowdown vs. the old patch; it could be
noise (the script only completed a single run at scale 400).


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-20 Thread Claudio Freire
On Wed, Apr 12, 2017 at 4:35 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Apr 11, 2017 at 4:38 PM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> In essence, the patch as it is proposed, doesn't *need* a binary
>> search, because the segment list can only grow up to 15 segments at
>> its biggest, and that's a size small enough that linear search will
>> outperform (or at least perform as well as) binary search. Reducing
>> the initial segment size wouldn't change that. If the 12GB limit is
>> lifted, or the maximum segment size reduced (from 1GB to 128MB for
>> example), however, that would change.
>>
>> I'd be more in favor of lifting the 12GB limit than of reducing the
>> maximum segment size, for the reasons above. Raising the 12GB limit
>> has concrete and readily apparent benefits, whereas using bigger (or
>> smaller) segments is far more debatable. Yes, that will need a binary
>> search. But, I was hoping that could be a second (or third) patch, to
>> keep things simple, and benefits measurable.
>
> To me, it seems a bit short-sighted to say, OK, let's use a linear
> search because there's this 12GB limit so we can limit ourselves to 15
> segments.  Because somebody will want to remove that 12GB limit, and
> then we'll have to revisit the whole thing anyway.  I think, anyway.

Ok, attached is an updated patch that implements the binary search.

> What's not clear to me is how sensitive the performance of vacuum is
> to the number of cycles used here.  For a large index, the number of
> searches will presumably be quite large, so it does seem worth
> worrying about performance.  But if we just always used a binary
> search, would that lose enough performance with small numbers of
> segments that anyone would care?  If so, maybe we need to use linear
> search for small numbers of segments and switch to binary search with
> larger numbers of segments.

I just went and tested.

I implemented the hybrid binary search attached, and ran a few tests
with and without the sequential code enabled, at small scales.

The difference is statistically significant, but small (less than 3%).
With proper optimization of the binary search, however, the difference
flips:

claudiofreire@klaumpp:~/src/postgresql.vacuum> fgrep shufp80
fullbinary.s100.times
vacuum_bench_s100.1.shufp80.log:CPU: user: 6.20 s, system: 1.42 s,
elapsed: 18.34 s.
vacuum_bench_s100.2.shufp80.log:CPU: user: 6.44 s, system: 1.40 s,
elapsed: 19.75 s.
vacuum_bench_s100.3.shufp80.log:CPU: user: 6.28 s, system: 1.41 s,
elapsed: 18.48 s.
vacuum_bench_s100.4.shufp80.log:CPU: user: 6.39 s, system: 1.51 s,
elapsed: 20.60 s.
vacuum_bench_s100.5.shufp80.log:CPU: user: 6.26 s, system: 1.42 s,
elapsed: 19.16 s.

claudiofreire@klaumpp:~/src/postgresql.vacuum> fgrep shufp80
hybridbinary.s100.times
vacuum_bench_s100.1.shufp80.log:CPU: user: 6.49 s, system: 1.39 s,
elapsed: 19.15 s.
vacuum_bench_s100.2.shufp80.log:CPU: user: 6.36 s, system: 1.33 s,
elapsed: 18.40 s.
vacuum_bench_s100.3.shufp80.log:CPU: user: 6.36 s, system: 1.31 s,
elapsed: 18.87 s.
vacuum_bench_s100.4.shufp80.log:CPU: user: 6.59 s, system: 1.35 s,
elapsed: 26.43 s.
vacuum_bench_s100.5.shufp80.log:CPU: user: 6.54 s, system: 1.28 s,
elapsed: 20.02 s.

That's after inlining the compare in both the linear-search and
binary-search code, and it seems that lets the compiler optimize the
binary search to the point where it outperforms the sequential search.

That's not the case when the compare isn't inlined.

That seems in line with [1], which shows the impact of various
optimizations on both algorithms. It's clearly a close enough race that
optimizations play a huge role.

Since we're not likely to go and implement SSE2-optimized versions, I
believe I'll leave the binary search only. That's the attached patch
set.

I'm running the full test suite, but that takes a very long while.
I'll post the results when they're done.

[1] https://schani.wordpress.com/2010/04/30/linear-vs-binary-search/
From 709428f67960dd20eaf34846a0b579766e0701f6 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.
---
 src/backend/commands/vacuumlazy.c| 408 ---
 src/test/regress/expected/vacuum.out |   8 +
 src/test/regress/sql/vacuum.sql  |  10 +
 3 files changed, 350 insertions(+), 76 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 5b43a66..ddf19d7 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-11 Thread Claudio Freire
On Tue, Apr 11, 2017 at 4:17 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Apr 11, 2017 at 2:59 PM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> On Tue, Apr 11, 2017 at 3:53 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>> 1TB / 8kB per page * 60 tuples/page * 20% * 6 bytes/tuple = 9216MB of
>>> maintenance_work_mem
>>>
>>> So we'll allocate 128MB+256MB+512MB+1GB+2GB+4GB which won't be quite
>>> enough so we'll allocate another 8GB, for a total of 16256MB, but more
>>> than three-quarters of that last allocation ends up being wasted.
>>> I've been told on this list before that doubling is the one true way
>>> of increasing the size of an allocated chunk of memory, but I'm still
>>> a bit unconvinced.
>>
>> There you're wrong. The allocation is capped to 1GB, so wastage has an
>> upper bound of 1GB.
>
> Ah, OK.  Sorry, didn't really look at the code.  I stand corrected,
> but then it seems a bit strange to me that the largest and smallest
> allocations are only 8x different. I still don't really understand
> what that buys us.

Basically, it attacks the problem (which, I think, you mentioned) of
very small systems in which overallocation for small vacuums was an
issue.

The "slow start" behavior of starting with smaller segments tries to
improve the situation for small vacuums, not big ones.

By starting at 128M and growing up to 1GB, overallocation is bound to
the range 128M-1GB and is proportional to the amount of dead tuples,
not table size, as it was before. Starting at 128M helps the initial
segment search, but I could readily go for starting at 64M, I don't
think it would make a huge difference. Removing exponential growth,
however, would.

As the patch stands, small systems (say 32-bit systems) without
overcommit and with slowly-changing data can now set high m_w_m
without running into overallocation issues with autovacuum reserving
too much virtual space, as it will reserve memory only proportional to
the amount of dead tuples. Previously, it would reserve all of m_w_m
regardless of whether it was needed or not, with the only exception
being really small tables, so m_w_m=1GB was unworkable in those cases.
Now it should be fine.

> What would we lose if we just made 'em all 128MB?

TBH, not that much. We'd need 8x the compares to find the segment,
which forces a switch to a binary search of the segments, and that is
less cache-friendly. So it's more complex code and less cache locality,
and I'm just not sure what the benefit is, given the current limits.

The only aim of this multiarray approach was making *virtual address
space reservations* proportional to the amount of actual memory
needed, as opposed to configured limits. It doesn't need to be a tight
fit, because calling palloc on its own doesn't actually use that
memory, at least on big allocations like these - the OS will not map
the memory pages until they're first touched. That's true in most
modern systems, and many ancient ones too.

In essence, the patch as proposed doesn't *need* a binary search,
because the segment list can only grow up to 15 segments at its
biggest, and that's small enough that linear search will outperform (or
at least perform as well as) binary search. Reducing the initial
segment size wouldn't change that. If the 12GB limit is lifted, or the
maximum segment size reduced (from 1GB to 128MB, for example), however,
that would change.
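
For the record, the arithmetic behind the 15-segment figure, as a
throwaway standalone check (not patch code): segments start at 128MB,
double until the 1GB cap, and stop once the 12GB ceiling is covered.

#include <stdio.h>

int
main(void)
{
    const long long budget = 12LL << 30;    /* 12GB m_w_m ceiling for vacuum */
    const long long cap = 1LL << 30;        /* maximum segment size */
    long long seg = 128LL << 20;            /* initial segment: 128MB */
    long long total = 0;
    int nsegs = 0;

    while (total < budget)
    {
        total += seg;
        nsegs++;
        if (seg < cap)
            seg *= 2;                       /* exponential growth until capped */
    }

    /* Prints 15: 128MB+256MB+512MB+1GB, then 11 more 1GB segments. */
    printf("%d segments\n", nsegs);
    return 0;
}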

I'd be more in favor of lifting the 12GB limit than of reducing the
maximum segment size, for the reasons above. Raising the 12GB limit
has concrete and readily apparent benefits, whereas using bigger (or
smaller) segments is far more debatable. Yes, that will need a binary
search. But I was hoping that could be a second (or third) patch, to
keep things simple and the benefits measurable.

Also, the plan as discussed in this very long thread was to eventually
try to turn segments into bitmaps if dead tuple density is high enough.
That benefits considerably from big segments, since lookup in a bitmap
is O(1): the bigger the segments, the fewer of them there are, and the
faster the overall lookup, since the search over the segment list
becomes the dominant cost.
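
To sketch the bitmap idea (just an illustration of my own, nothing that
exists in the patch), membership within a segment becomes a single bit
test once the segment densely covers a block range; offsets are assumed
bounded by OFFSETS_PER_BLOCK here.

#include <stdint.h>

#define OFFSETS_PER_BLOCK 292           /* illustrative per-page upper bound */

typedef struct BitmapSegment
{
    uint32_t first_block;               /* first heap block covered */
    uint32_t nblocks;                   /* number of blocks covered */
    uint8_t *bits;                      /* nblocks * OFFSETS_PER_BLOCK bits */
} BitmapSegment;

/* O(1) lookup: no binary search within the segment. */
static inline int
bitmap_seg_contains(const BitmapSegment *seg, uint32_t block, uint16_t offset)
{
    uint64_t bit;

    if (block < seg->first_block ||
        block >= seg->first_block + seg->nblocks)
        return 0;

    bit = (uint64_t) (block - seg->first_block) * OFFSETS_PER_BLOCK + offset;
    return (seg->bits[bit >> 3] >> (bit & 7)) & 1;
}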

So... what shall we do?

At this point, I've given all my arguments for the current design. If
the more senior developers don't agree, I'll be happy to try your way.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-11 Thread Claudio Freire
On Tue, Apr 11, 2017 at 3:59 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Tue, Apr 11, 2017 at 3:53 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> 1TB / 8kB per page * 60 tuples/page * 20% * 6 bytes/tuple = 9216MB of
>> maintenance_work_mem
>>
>> So we'll allocate 128MB+256MB+512MB+1GB+2GB+4GB which won't be quite
>> enough so we'll allocate another 8GB, for a total of 16256MB, but more
>> than three-quarters of that last allocation ends up being wasted.
>> I've been told on this list before that doubling is the one true way
>> of increasing the size of an allocated chunk of memory, but I'm still
>> a bit unconvinced.
>
> There you're wrong. The allocation is capped to 1GB, so wastage has an
> upper bound of 1GB.

And total m_w_m for vacuum is still capped at 12GB (as big as you can
get with 32-bit integer indices).

So you can get at most 15 segments (a binary search is thus not worth
it), and overallocate by at most 1GB (the maximum segment size).

At least that's my rationale.

Removing the 12GB limit requires a bit of care (there are some 32-bit
counters still around I believe).


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-11 Thread Claudio Freire
On Tue, Apr 11, 2017 at 3:53 PM, Robert Haas  wrote:
> 1TB / 8kB per page * 60 tuples/page * 20% * 6 bytes/tuple = 9216MB of
> maintenance_work_mem
>
> So we'll allocate 128MB+256MB+512MB+1GB+2GB+4GB which won't be quite
> enough so we'll allocate another 8GB, for a total of 16256MB, but more
> than three-quarters of that last allocation ends up being wasted.
> I've been told on this list before that doubling is the one true way
> of increasing the size of an allocated chunk of memory, but I'm still
> a bit unconvinced.

There you're wrong. The allocation is capped to 1GB, so wastage has an
upper bound of 1GB.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-07 Thread Claudio Freire
On Fri, Apr 7, 2017 at 10:06 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
>>> >> + if (seg->num_dead_tuples >= seg->max_dead_tuples)
>>> >> + {
>>> >> + /*
>>> >> +  * The segment is overflowing, so we must allocate 
>>> >> a new segment.
>>> >> +  * We could have a preallocated segment descriptor 
>>> >> already, in
>>> >> +  * which case we just reinitialize it, or we may 
>>> >> need to repalloc
>>> >> +  * the vacrelstats->dead_tuples array. In that 
>>> >> case, seg will no
>>> >> +  * longer be valid, so we must be careful about 
>>> >> that. In any case,
>>> >> +  * we must update the last_dead_tuple copy in the 
>>> >> overflowing
>>> >> +  * segment descriptor.
>>> >> +  */
>>> >> + Assert(seg->num_dead_tuples == 
>>> >> seg->max_dead_tuples);
>>> >> + seg->last_dead_tuple = 
>>> >> seg->dt_tids[seg->num_dead_tuples - 1];
>>> >> + if (vacrelstats->dead_tuples.last_seg + 1 >= 
>>> >> vacrelstats->dead_tuples.num_segs)
>>> >> + {
>>> >> + int new_num_segs = 
>>> >> vacrelstats->dead_tuples.num_segs * 2;
>>> >> +
>>> >> + vacrelstats->dead_tuples.dt_segments = 
>>> >> (DeadTuplesSegment *) repalloc(
>>> >> +(void *) 
>>> >> vacrelstats->dead_tuples.dt_segments,
>>> >> +
>>> >> new_num_segs * sizeof(DeadTuplesSegment));
>>> >
>>> > Might be worth breaking this into some sub-statements, it's quite hard
>>> > to read.
>>>
>>> Breaking what precisely? The comment?
>>
>> No, the three-line statement computing the new value of
>> dead_tuples.dt_segments.  I'd at least assign dead_tuples to a local
>> variable, to cut the length of the statement down.
>
> Ah, alright. Will try to do that.

Attached is an updated patch set with the requested changes.

Segment allocation still follows the exponential strategy, and segment
lookup is still linear.

I rebased the early free patch (patch 3) to apply on top of the v9
patch 2 (it needed some changes). I recognize the early free patch
didn't get nearly as much scrutiny, so I'm fine with committing only 2
if that one's ready to go but 3 isn't.

If it's decided to go for fixed 128M segments and a binary search of
segments, I don't think I can get that ready and tested before the
commitfest ends.
From 9b8f90f19d558a7e6a32cb253d89819f7c300598 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.
---
 src/backend/commands/vacuumlazy.c    | 346 ++++++++++++++++++++++++++++-----
 src/test/regress/expected/vacuum.out |   8 +
 src/test/regress/sql/vacuum.sql      |  10 +
 3 files changed, 299 insertions(+), 65 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 005440e..4f0cf1b 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -12,11 +12,25 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
+ * initially allocate an array of TIDs of 128MB, or an upper limit that
  * depends on table size (this limit ensures we don't allocate a huge area
- * uselessly for vacuuming small tables).  If the array threatens to overflow,
+ * uselessly for vacuuming small tables). Additional arrays of increasingly
+ * large sizes are allocated as they become necessary.
+ *
+ * The TID array is thus represented as a list of multiple segments of
+ * varying size, beginning with the initial size of up to 128MB, and growing
+ * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
+ * is used up.

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-07 Thread Claudio Freire
On Fri, Apr 7, 2017 at 10:12 PM, Andres Freund <and...@anarazel.de> wrote:
> On 2017-04-07 22:06:13 -0300, Claudio Freire wrote:
>> On Fri, Apr 7, 2017 at 9:56 PM, Andres Freund <and...@anarazel.de> wrote:
>> > Hi,
>> >
>> >
>> > On 2017-04-07 19:43:39 -0300, Claudio Freire wrote:
>> >> On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund <and...@anarazel.de> wrote:
>> >> > Hi,
>> >> >
>> >> > I've *not* read the history of this thread.  So I really might be
>> >> > missing some context.
>> >> >
>> >> >
>> >> >> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
>> >> >> From: Claudio Freire <klaussfre...@gmail.com>
>> >> >> Date: Mon, 12 Sep 2016 23:36:42 -0300
>> >> >> Subject: [PATCH] Vacuum: allow using more than 1GB work mem
>> >> >>
>> >> >> Turn the dead_tuples array into a structure composed of several
>> >> >> exponentially bigger arrays, to enable usage of more than 1GB
>> >> >> of work mem during vacuum and thus reduce the number of full
>> >> >> index scans necessary to remove all dead tids when the memory is
>> >> >> available.
>> >> >
>> >> >>   * We are willing to use at most maintenance_work_mem (or perhaps
>> >> >>   * autovacuum_work_mem) memory space to keep track of dead tuples.  We
>> >> >> - * initially allocate an array of TIDs of that size, with an upper 
>> >> >> limit that
>> >> >> + * initially allocate an array of TIDs of 128MB, or an upper limit 
>> >> >> that
>> >> >>   * depends on table size (this limit ensures we don't allocate a huge 
>> >> >> area
>> >> >> - * uselessly for vacuuming small tables).  If the array threatens to 
>> >> >> overflow,
>> >> >> - * we suspend the heap scan phase and perform a pass of index cleanup 
>> >> >> and page
>> >> >> - * compaction, then resume the heap scan with an empty TID array.
>> >> >> + * uselessly for vacuuming small tables). Additional arrays of 
>> >> >> increasingly
>> >> >> + * large sizes are allocated as they become necessary.
>> >> >> + *
>> >> >> + * The TID array is thus represented as a list of multiple segments of
>> >> >> + * varying size, beginning with the initial size of up to 128MB, and 
>> >> >> growing
>> >> >> + * exponentially until the whole budget of 
>> >> >> (autovacuum_)maintenance_work_mem
>> >> >> + * is used up.
>> >> >
>> >> > When the chunk size is 128MB, I'm a bit unconvinced that using
>> >> > exponential growth is worth it. The allocator overhead can't be
>> >> > meaningful in comparison to collecting 128MB dead tuples, the potential
>> >> > waste is pretty big, and it increases memory fragmentation.
>> >>
>> >> The exponential strategy is mainly to improve lookup time (ie: to
>> >> avoid large segment lists).
>> >
>> > Well, if we were to do binary search on the segment list, that'd not be
>> > necessary.
>>
>> True, but the initial lookup might be slower in the end, since the
>> array would be bigger and cache locality worse.
>>
>> Why do you say exponential growth fragments memory? AFAIK, all those
>> allocations are well beyond the point where malloc starts mmaping
>> memory, so each of those segments should be a mmap segment,
>> independently freeable.
>
> Not all platforms have that, and even on platforms with it, frequent,
> unevenly sized, very large allocations can lead to enough fragmentation
> that further allocations are harder and fragment / enlarge the
> pagetable.

I wouldn't call this frequent. You can get at most slightly more than
a dozen such allocations given the current limits.
And allocation sizes are quite regular - you get 128M or multiples of
128M, so each free block can be reused for N smaller allocations if
needed. I don't think it has much potential to fragment memory.

This isn't significantly different from tuplesort or any other code
that can do big allocations, and the differences favor less
fragmentation than those, so I don't see why this would need special
treatment.

My point being that it's not been simple getting to a point where this
beats even in CPU time the original single binary search.

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-07 Thread Claudio Freire
On Fri, Apr 7, 2017 at 9:56 PM, Andres Freund <and...@anarazel.de> wrote:
> Hi,
>
>
> On 2017-04-07 19:43:39 -0300, Claudio Freire wrote:
>> On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund <and...@anarazel.de> wrote:
>> > Hi,
>> >
>> > I've *not* read the history of this thread.  So I really might be
>> > missing some context.
>> >
>> >
>> >> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
>> >> From: Claudio Freire <klaussfre...@gmail.com>
>> >> Date: Mon, 12 Sep 2016 23:36:42 -0300
>> >> Subject: [PATCH] Vacuum: allow using more than 1GB work mem
>> >>
>> >> Turn the dead_tuples array into a structure composed of several
>> >> exponentially bigger arrays, to enable usage of more than 1GB
>> >> of work mem during vacuum and thus reduce the number of full
>> >> index scans necessary to remove all dead tids when the memory is
>> >> available.
>> >
>> >>   * We are willing to use at most maintenance_work_mem (or perhaps
>> >>   * autovacuum_work_mem) memory space to keep track of dead tuples.  We
>> >> - * initially allocate an array of TIDs of that size, with an upper limit 
>> >> that
>> >> + * initially allocate an array of TIDs of 128MB, or an upper limit that
>> >>   * depends on table size (this limit ensures we don't allocate a huge 
>> >> area
>> >> - * uselessly for vacuuming small tables).  If the array threatens to 
>> >> overflow,
>> >> - * we suspend the heap scan phase and perform a pass of index cleanup 
>> >> and page
>> >> - * compaction, then resume the heap scan with an empty TID array.
>> >> + * uselessly for vacuuming small tables). Additional arrays of 
>> >> increasingly
>> >> + * large sizes are allocated as they become necessary.
>> >> + *
>> >> + * The TID array is thus represented as a list of multiple segments of
>> >> + * varying size, beginning with the initial size of up to 128MB, and 
>> >> growing
>> >> + * exponentially until the whole budget of 
>> >> (autovacuum_)maintenance_work_mem
>> >> + * is used up.
>> >
>> > When the chunk size is 128MB, I'm a bit unconvinced that using
>> > exponential growth is worth it. The allocator overhead can't be
>> > meaningful in comparison to collecting 128MB dead tuples, the potential
>> > waste is pretty big, and it increases memory fragmentation.
>>
>> The exponential strategy is mainly to improve lookup time (ie: to
>> avoid large segment lists).
>
> Well, if we were to do binary search on the segment list, that'd not be
> necessary.

True, but the initial lookup might be slower in the end, since the
array would be bigger and cache locality worse.

Why do you say exponential growth fragments memory? AFAIK, all those
allocations are well beyond the point where malloc starts mmaping
memory, so each of those segments should be a mmap segment,
independently freeable.

>> >> + if (seg->num_dead_tuples >= seg->max_dead_tuples)
>> >> + {
>> >> + /*
>> >> +  * The segment is overflowing, so we must allocate 
>> >> a new segment.
>> >> +  * We could have a preallocated segment descriptor 
>> >> already, in
>> >> +  * which case we just reinitialize it, or we may 
>> >> need to repalloc
>> >> +  * the vacrelstats->dead_tuples array. In that 
>> >> case, seg will no
>> >> +  * longer be valid, so we must be careful about 
>> >> that. In any case,
>> >> +  * we must update the last_dead_tuple copy in the 
>> >> overflowing
>> >> +  * segment descriptor.
>> >> +  */
>> >> + Assert(seg->num_dead_tuples == 
>> >> seg->max_dead_tuples);
>> >> + seg->last_dead_tuple = 
>> >> seg->dt_tids[seg->num_dead_tuples - 1];
>> >> + if (vacrelstats->dead_tuples.last_seg + 1 >= 
>> >> vacrelstats->dead_tuples.num_segs)
>> >> + {
>> >> + int new_num_segs = 
>> >> vacrelstats->dead_tuples.num_segs * 2;
>> >> +
>

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-07 Thread Claudio Freire
On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund <and...@anarazel.de> wrote:
> Hi,
>
> I've *not* read the history of this thread.  So I really might be
> missing some context.
>
>
>> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
>> From: Claudio Freire <klaussfre...@gmail.com>
>> Date: Mon, 12 Sep 2016 23:36:42 -0300
>> Subject: [PATCH] Vacuum: allow using more than 1GB work mem
>>
>> Turn the dead_tuples array into a structure composed of several
>> exponentially bigger arrays, to enable usage of more than 1GB
>> of work mem during vacuum and thus reduce the number of full
>> index scans necessary to remove all dead tids when the memory is
>> available.
>
>>   * We are willing to use at most maintenance_work_mem (or perhaps
>>   * autovacuum_work_mem) memory space to keep track of dead tuples.  We
>> - * initially allocate an array of TIDs of that size, with an upper limit 
>> that
>> + * initially allocate an array of TIDs of 128MB, or an upper limit that
>>   * depends on table size (this limit ensures we don't allocate a huge area
>> - * uselessly for vacuuming small tables).  If the array threatens to 
>> overflow,
>> - * we suspend the heap scan phase and perform a pass of index cleanup and 
>> page
>> - * compaction, then resume the heap scan with an empty TID array.
>> + * uselessly for vacuuming small tables). Additional arrays of increasingly
>> + * large sizes are allocated as they become necessary.
>> + *
>> + * The TID array is thus represented as a list of multiple segments of
>> + * varying size, beginning with the initial size of up to 128MB, and growing
>> + * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
>> + * is used up.
>
> When the chunk size is 128MB, I'm a bit unconvinced that using
> exponential growth is worth it. The allocator overhead can't be
> meaningful in comparison to collecting 128MB dead tuples, the potential
> waste is pretty big, and it increases memory fragmentation.

The exponential strategy is mainly to improve lookup time (ie: to
avoid large segment lists).

>> + * Lookup in that structure proceeds sequentially in the list of segments,
>> + * and with a binary search within each segment. Since segment's size grows
>> + * exponentially, this retains O(N log N) lookup complexity.
>
> N log N is a horrible lookup complexity.  That's the complexity of
> *sorting* an entire array.  I think you might be trying to argue that
> it's log(N) * log(N)? Once log(n) for the exponentially growing size of
> segments, one for the binary search?
>
> Afaics you could quite easily make it O(2 log(N)) by simply also doing
> binary search over the segments.  Might not be worth it due to the small
> constant involved normally.

It's a typo, yes, I meant O(log N) (which is equivalent to O(2 log N))

>> + * If the array threatens to overflow, we suspend the heap scan phase and
>> + * perform a pass of index cleanup and page compaction, then resume the heap
>> + * scan with an array of logically empty but already preallocated TID 
>> segments
>> + * to be refilled with more dead tuple TIDs.
>
> Hm, it's not really the array that overflows, it's m_w_m that'd be
> exceeded, right?

Yes, will rephrase. Although that's how the original comment expressed
the same concept.

>>  /*
>> + * Minimum (starting) size of the dead_tuples array segments. Will allocate
>> + * space for 128MB worth of tid pointers in the first segment, further 
>> segments
>> + * will grow in size exponentially. Don't make it too small or the segment 
>> list
>> + * will grow bigger than the sweetspot for search efficiency on big vacuums.
>> + */
>> +#define LAZY_MIN_TUPLES  Max(MaxHeapTuplesPerPage, (128<<20) / 
>> sizeof(ItemPointerData))
>
> That's not really the minimum, no? s/MIN/INIT/?

Ok

>> +typedef struct DeadTuplesSegment
>> +{
>> + int num_dead_tuples;/* # of entries in the 
>> segment */
>> + int max_dead_tuples;/* # of entries 
>> allocated in the segment */
>> + ItemPointerData last_dead_tuple;/* Copy of the last dead tuple 
>> (unset
>> +
>>   * until the segment is fully
>> +
>>   * populated) */
>> + unsigned short padding;
>> + ItemPointer dt_tids;/* Array of dead tuples */
>> +}DeadTuplesSegment;
>
> Whenever padding is needed,

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-07 Thread Claudio Freire
On Fri, Apr 7, 2017 at 7:43 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
>>> + * Lookup in that structure proceeds sequentially in the list of segments,
>>> + * and with a binary search within each segment. Since segment's size grows
>>> + * exponentially, this retains O(N log N) lookup complexity.
>>
>> N log N is a horrible lookup complexity.  That's the complexity of
>> *sorting* an entire array.  I think you might be trying to argue that
>> it's log(N) * log(N)? Once log(n) for the exponentially growing size of
>> segments, one for the binary search?
>>
>> Afaics you could quite easily make it O(2 log(N)) by simply also doing
>> binary search over the segments.  Might not be worth it due to the small
>> constant involved normally.
>
> It's a typo, yes, I meant O(log N) (which is equivalent to O(2 log N))


To clarify, lookup over the segments is linear, so it's O(M) with M
the number of segments, then the binary search is O(log N) with N the
number of dead tuples.

So lookup is O(M + log N), but M <= log N because of the segments'
exponential growth, therefore the lookup is O(2 log N), i.e. O(log N).
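In code form, that lookup is roughly the following - a simplified
sketch with plain integer keys instead of ItemPointers, not the patch
itself. With many more segments, the outer linear scan is what would
become a binary search over last_key:

#include <stdbool.h>
#include <stdint.h>

typedef struct Segment
{
	int64_t    *keys;			/* sorted keys stored in this segment */
	int			nkeys;			/* entries actually used */
	int64_t		last_key;		/* copy of keys[nkeys - 1] */
} Segment;

static bool
segmented_lookup(const Segment *segs, int nsegs, int64_t target)
{
	int			i;

	for (i = 0; i < nsegs; i++)	/* O(M), M = number of segments */
	{
		if (target <= segs[i].last_key)
		{
			int			lo = 0,
						hi = segs[i].nkeys - 1;

			while (lo <= hi)	/* O(log N) within the matching segment */
			{
				int			mid = lo + (hi - lo) / 2;

				if (segs[i].keys[mid] == target)
					return true;
				else if (segs[i].keys[mid] < target)
					lo = mid + 1;
				else
					hi = mid - 1;
			}
			return false;		/* segments are ordered and disjoint */
		}
	}
	return false;
}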


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making clausesel.c Smarter

2017-04-07 Thread Claudio Freire
On Fri, Apr 7, 2017 at 2:28 AM, David Rowley
 wrote:
>> + if (rqlist->hibound == DEFAULT_INEQ_SEL || rqlist->lobound ==
>> DEFAULT_INEQ_SEL)
>> + {
>> + /* No point in checking null selectivity, DEFAULT_INEQ_SEL
>> implies we have no stats */
>> + s2 = DEFAULT_RANGE_INEQ_SEL;
>> + }
>> + else
>> + {
>> ...
>> +   s2 = rqlist->hibound + rqlist->lobound - 1.0;
>> +   nullsel = nulltestsel(root, IS_NULL, rqlist->var, varRelid);
>> +   notnullsel = 1.0 - nullsel;
>> +
>> +   /* Adjust for double-exclusion of NULLs */
>> +   s2 += nullsel;
>> +   if (s2 <= 0.0) {
>> +  if (s2 <= (-1.0e-4 * notnullsel - 1.0e-6))
>> +  {
>> +   /* Most likely contradictory clauses found */
>> +   s2 = (1.0e-10 < notnullsel) ? 1.0e-10 : notnullsel;
>> +  }
>> +  else
>> +  {
>> +   /* Could be a rounding error */
>> +   s2 = DEFAULT_RANGE_INEQ_SEL * notnullsel;
>> +  }
>> +   }
>> + }
>>
>> Where (-1.0e-4 * notnullsel - 1.0e-6) is just a very rough (and overly
>> cautious) estimation of the amount of rounding error that could be
>> there with 32-bit floats.
>>
>> The above assumes a non-DEFAULT_INEQ_SEL value in lobound/hibound is
>> an estimation based on stats, maybe approximate, but one that makes
>> sense (ie: preserves the monotonicity of the CPF), and as such
>> negative values are either sign of a contradiction or rounding error.
>> The previous code left the possibility of negatives coming out of
>> default selectivities creeping up on non-DEFAULT_INEQ_SEL estimates,
>> but that doesn't seem possible coming out of scalarineqsel.
>
> I notice this does change the estimates for contradictory clauses such as:
>
> create table a (value int);
> insert into a select x/100 from generate_Series(1,10000) x;
> analyze a;
> explain analyze select * from a where value between 101 and -1;
>
> We now get:
>
>  QUERY PLAN
> ---------------------------------------------------------------------------------------------
>  Seq Scan on a  (cost=0.00..195.00 rows=1 width=4) (actual
> time=1.904..1.904 rows=0 loops=1)
>    Filter: ((value >= 101) AND (value <= '-1'::integer))
>    Rows Removed by Filter: 10000
>  Planning time: 0.671 ms
>  Execution time: 1.950 ms
> (5 rows)
>
> where before we'd get:
>
>   QUERY PLAN
> ----------------------------------------------------------------------------------------------
>  Seq Scan on a  (cost=0.00..195.00 rows=50 width=4) (actual
> time=0.903..0.903 rows=0 loops=1)
>    Filter: ((value >= 101) AND (value <= '-1'::integer))
>    Rows Removed by Filter: 10000
>  Planning time: 0.162 ms
>  Execution time: 0.925 ms
> (5 rows)
>
> Which comes from the 10000 * 0.005 selectivity estimate from tuples *
> DEFAULT_RANGE_INEQ_SEL
>
> I've attached a patch to this effect.

+/*
+ * A zero or slightly negative s2 should be converted into
+ * a small positive value; we probably are dealing with a
+ * very tight range and got a bogus result due to roundoff
+ * errors. However, if s2 is very negative, then we
+ * probably have default selectivity estimates on one or
+ * both sides of the range that we failed to recognize
+ * above for some reason.
+ */
+if (s2 <= 0.0)

That comment seems outdated

Otherwise, the patch LGTM, but I'd like to solve the quadratic
behavior too... are you going to try? Otherwise I could take a stab at
it myself. It doesn't seem very difficult.

Also, can you add a test case? Cost values could make the test
fragile, so if that gives you trouble I'll understand, but it'd be
best to try and test this if possible.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making clausesel.c Smarter

2017-04-06 Thread Claudio Freire
On Tue, Apr 4, 2017 at 1:00 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
>>> If you ask me, I'd just leave:
>>>
>>> + if (rqlist->hibound == DEFAULT_INEQ_SEL || rqlist->lobound ==
>>> DEFAULT_INEQ_SEL)
>>> + {
>>> + /* No point in checking null selectivity, DEFAULT_INEQ_SEL
>>> implies we have no stats */
>>> + s2 = DEFAULT_RANGE_INEQ_SEL;
>>> + }
>>> + else
>>> + {
>>> ...
>>> +   s2 = rqlist->hibound + rqlist->lobound - 1.0;
>>> +   nullsel = nulltestsel(root, IS_NULL, rqlist->var, varRelid);
>>> +   notnullsel = 1.0 - nullsel;
>>> +
>>> +   /* Adjust for double-exclusion of NULLs */
>>> +   s2 += nullsel;
>>> +   if (s2 <= 0.0) {
>>> +  if (s2 <= (-1.0e-4 * notnullsel - 1.0e-6))
>>> +  {
>>> +   /* Most likely contradictory clauses found */
>>> +   s2 = (1.0e-10 < notnullsel) ? 1.0e-10 : notnullsel;
>>> +  }
>>> +  else
>>> +  {
>>> +   /* Could be a rounding error */
>>> +   s2 = DEFAULT_RANGE_INEQ_SEL * notnullsel;
>>> +  }
>>> +   }
>>> + }
>>>
>>> Where (-1.0e-4 * notnullsel - 1.0e-6) is just a very rough (and overly
>>> cautious) estimation of the amount of rounding error that could be
>>> there with 32-bit floats.
>>>
>>> The above assumes a non-DEFAULT_INEQ_SEL value in lobound/hibound is
>>> an estimation based on stats, maybe approximate, but one that makes
>>> sense (ie: preserves the monotonicity of the CPF), and as such
>>> negative values are either sign of a contradiction or rounding error.
>>> The previous code left the possibility of negatives coming out of
>>> default selectivities creeping up on non-DEFAULT_INEQ_SEL estimates,
>>> but that doesn't seem possible coming out of scalarineqsel.
>>>
>>> But I'd like it if Tom could comment on that, since he's the one that
>>> did that in commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a (way back
>>> in 2004).
>>>
>>> Barring that, I'd just change the
>>>
>>> s2 = DEFAULT_RANGE_INEQ_SEL;
>>>
>>> To
>>>
>>> s2 = DEFAULT_RANGE_INEQ_SEL * notnullsel;
>>>
>>> Which makes a lot more sense.
>>
>> I think to fix this properly the selfuncs would have to return the
>> estimate, and nullfrac separately, so the nullfrac could just be
>> applied once per expression. There's already hacks in there to get rid
>> of the double null filtering. I'm not proposing that as something for
>> this patch. It would be pretty invasive to change this.
>
> There's no need, you can compute the nullfrac with nulltestsel. While
> there's some rework involved, it's not a big deal (it just reads the
> stats tuple), and it's a clean solution.

I'm marking this Waiting on Author until we figure out what to do with
those issues (the null issue quoted above, and the quadratic behavior)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making clausesel.c Smarter

2017-04-04 Thread Claudio Freire
On Tue, Apr 4, 2017 at 8:21 AM, David Rowley
<david.row...@2ndquadrant.com> wrote:
> On 4 April 2017 at 13:35, Claudio Freire <klaussfre...@gmail.com> wrote:
>> On Mon, Apr 3, 2017 at 9:19 PM, Claudio Freire <klaussfre...@gmail.com> 
>> wrote:
>>> On Mon, Apr 3, 2017 at 8:52 PM, David Rowley
>>> <david.row...@2ndquadrant.com> wrote:
>>>>> One last observation:
>>>>>
>>>>> +/*
>>>>> + * An IS NOT NULL test is a no-op if there's any other strict 
>>>>> quals,
>>>>> + * so if that's the case, then we'll only apply this, otherwise 
>>>>> we'll
>>>>> + * ignore it.
>>>>> + */
>>>>> +else if (cslist->selmask == CACHESEL_NOTNULLTEST)
>>>>> +s1 *= cslist->notnullsel;
>>>>>
>>>>> In some paths, the range selectivity estimation code punts and returns
>>>>> DEFAULT_RANGE_INEQ_SEL. In those cases, if a not null test was
>>>>> present, it should still be applied. It could make a big difference if
>>>>> the selectivity for the nulltest is high.
>>>>
>>>> I'd say that the location that applies the DEFAULT_RANGE_INEQ_SEL
>>>> should apply the nullfrac. Seems wrong to rely on a IS NOT NULL test
>>>> to exists to estimate that properly. I don't think that needs done as
>>>> part of this patch. I'd rather limit the scope since "returned with
>>>> feedback" is already knocking at the door of this patch.
>>>
>>> I agree, but this patch regresses for those cases if the user
>>> compensated with a not null test, leaving little recourse, so I'd fix
>>> it in this patch to avoid the regression.
>>>
>>> Maybe you're right that the null fraction should be applied regardless
>>> of whether there's an explicit null check, though.
>>
>> The more I read that code, the more I'm convinced there's no way to
>> get a DEFAULT_RANGE_INEQ_SEL if (rqlist->hibound != DEFAULT_INEQ_SEL
>> &&
>>  rqlist->lobound != DEFAULT_INEQ_SEL) other than from contradictory
>> clauses, a case the current code handles very poorly, returning
>> DEFAULT_RANGE_INEQ_SEL and ignoring nullfrac, something that could
>> give way-off estimates if the table had lots of nulls.
>>
>> Say, consider if "value" was mostly null and you write:
>>
>> SELECT * FROM testdata WHERE value BETWEEN 10 AND 20 AND value BETWEEN
>> 50 AND 60;
>>
>> All other cases I can think of would end with either hibound or
>> lobound being equal to DEFAULT_INEQ_SEL.
>>
>> It seems to me, doing a git blame, that the checks in the else branch
>> were left only as a precaution, one that seems unnecessary.
>>
>> If you ask me, I'd just leave:
>>
>> + if (rqlist->hibound == DEFAULT_INEQ_SEL || rqlist->lobound ==
>> DEFAULT_INEQ_SEL)
>> + {
>> + /* No point in checking null selectivity, DEFAULT_INEQ_SEL
>> implies we have no stats */
>> + s2 = DEFAULT_RANGE_INEQ_SEL;
>> + }
>> + else
>> + {
>> ...
>> +   s2 = rqlist->hibound + rqlist->lobound - 1.0;
>> +   nullsel = nulltestsel(root, IS_NULL, rqlist->var, varRelid);
>> +   notnullsel = 1.0 - nullsel;
>> +
>> +   /* Adjust for double-exclusion of NULLs */
>> +   s2 += nullsel;
>> +   if (s2 <= 0.0) {
>> +  if (s2 <= (-1.0e-4 * notnullsel - 1.0e-6))
>> +  {
>> +   /* Most likely contradictory clauses found */
>> +   s2 = (1.0e-10 < notnullsel) ? 1.0e-10 : notnullsel;
>> +  }
>> +  else
>> +  {
>> +   /* Could be a rounding error */
>> +   s2 = DEFAULT_RANGE_INEQ_SEL * notnullsel;
>> +  }
>> +   }
>> + }
>>
>> Where (-1.0e-4 * notnullsel - 1.0e-6) is just a very rough (and overly
>> cautious) estimation of the amount of rounding error that could be
>> there with 32-bit floats.
>>
>> The above assumes a non-DEFAULT_INEQ_SEL value in lobound/hibound is
>> an estimation based on stats, maybe approximate, but one that makes
>> sense (ie: preserves the monotonicity of the CPF), and as such
>> negative values are either sign of a contradiction or rounding error.
>> The previous code left the possibility of negatives coming out of
>> default selectivities creeping up on non-DEFAULT_INEQ_SEL estimates,
>> but that doesn't seem possible coming out of scalarineqsel.
>>
>> B

Re: [HACKERS] Making clausesel.c Smarter

2017-04-04 Thread Claudio Freire
On Tue, Apr 4, 2017 at 8:12 AM, David Rowley
 wrote:
> Result Comparison
>
> Test     Master median tps   Patch median tps   Comparison
> Test 1   6993.7              6714.3             104.16%
> Test 2   7053.1              6921.6             101.90%
> Test 3   5137.2              4954.2             103.69%
> Test 4   27.1                19.4               139.72%
> Test 5   54.1                51.4               105.28%
> Test 6   9328.1              9478.2             98.42%
>
> Results Analyzed:
>
> Test 1 has caused planning to slow down 4.16%. There's quite a bit of
> noise from the results, but I think this indicates there is some
> overhead to having to add items to the cslist and searching the cslist
> when new quals are seen.
>
> Test 2 has a lesser slowdown than test 1, as this test will excercise
> the existing rqlist caching in master too. Patched does a little more
> work adding the equality condition to the list too.
>
> Test 3 similar to test 1
>
> Test 4 adds quite an overhead and causes 0.5 million comparisons to
> find the expressions in the cslist.
>
> Test 5 shows less overhead than test 4 since the Master code has to
> also do the expression caching and searching.
>
> Test 6 is a control test

That's consistent with the operation being quadratic.

I was about to point that out, but the old code was quadratic too, so
it sucks in both versions. This version only adds other opexprs to the
list, which makes the quadratic cost easier to run into, but it's
nothing new.

I don't think there's an easy fix for that. You'd have to build a hash
table of expression nodes or something of the sort to avoid the
quadratic cost. If you can do that, by all means, but that's a much
bigger patch.
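For the record, the shape of that idea is simple enough. A generic
sketch, not clausesel.c code, keyed by the expression's text form
purely for illustration (a real version would hash and compare the
expression nodes themselves):

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 256

typedef struct Entry
{
	char	   *key;			/* stands in for the expression node */
	struct Entry *next;			/* collision chain */
} Entry;

static unsigned int
hash_key(const char *s)
{
	unsigned int h = 5381;

	while (*s)
		h = h * 33 + (unsigned char) *s++;
	return h;
}

/*
 * Return the existing entry for key, or insert a new one.  Expected O(1)
 * per qual, versus scanning the whole list of previously seen quals.
 */
static Entry *
lookup_or_insert(Entry **buckets, const char *key)
{
	unsigned int h = hash_key(key) % NBUCKETS;
	Entry	   *e;

	for (e = buckets[h]; e != NULL; e = e->next)
		if (strcmp(e->key, key) == 0)
			return e;

	e = malloc(sizeof(Entry));
	e->key = strdup(key);
	e->next = buckets[h];
	buckets[h] = e;
	return e;
}

The caller would keep an Entry *buckets[NBUCKETS] zero-initialized per
clause list, so each qual costs one hash plus a short chain walk
instead of a scan over everything seen so far.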


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making clausesel.c Smarter

2017-04-03 Thread Claudio Freire
On Mon, Apr 3, 2017 at 9:19 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Mon, Apr 3, 2017 at 8:52 PM, David Rowley
> <david.row...@2ndquadrant.com> wrote:
>>> One last observation:
>>>
>>> +/*
>>> + * An IS NOT NULL test is a no-op if there's any other strict 
>>> quals,
>>> + * so if that's the case, then we'll only apply this, otherwise 
>>> we'll
>>> + * ignore it.
>>> + */
>>> +else if (cslist->selmask == CACHESEL_NOTNULLTEST)
>>> +s1 *= cslist->notnullsel;
>>>
>>> In some paths, the range selectivity estimation code punts and returns
>>> DEFAULT_RANGE_INEQ_SEL. In those cases, if a not null test was
>>> present, it should still be applied. It could make a big difference if
>>> the selectivity for the nulltest is high.
>>
>> I'd say that the location that applies the DEFAULT_RANGE_INEQ_SEL
>> should apply the nullfrac. Seems wrong to rely on a IS NOT NULL test
>> to exists to estimate that properly. I don't think that needs done as
>> part of this patch. I'd rather limit the scope since "returned with
>> feedback" is already knocking at the door of this patch.
>
> I agree, but this patch regresses for those cases if the user
> compensated with a not null test, leaving little recourse, so I'd fix
> it in this patch to avoid the regression.
>
> Maybe you're right that the null fraction should be applied regardless
> of whether there's an explicit null check, though.

The more I read that code, the more I'm convinced there's no way to
get a DEFAULT_RANGE_INEQ_SEL if (rqlist->hibound != DEFAULT_INEQ_SEL
&&
 rqlist->lobound != DEFAULT_INEQ_SEL) other than from contradictory
clauses, a case the current code handles very poorly, returning
DEFAULT_RANGE_INEQ_SEL and ignoring nullfrac, something that could
give way-off estimates if the table had lots of nulls.

Say, consider if "value" was mostly null and you write:

SELECT * FROM testdata WHERE value BETWEEN 10 AND 20 AND value BETWEEN
50 AND 60;

All other cases I can think of would end with either hibound or
lobound being equal to DEFAULT_INEQ_SEL.

It seems to me, doing a git blame, that the checks in the else branch
were left only as a precaution, one that seems unnecessary.

If you ask me, I'd just leave:

+ if (rqlist->hibound == DEFAULT_INEQ_SEL || rqlist->lobound ==
DEFAULT_INEQ_SEL)
+ {
+ /* No point in checking null selectivity, DEFAULT_INEQ_SEL
implies we have no stats */
+ s2 = DEFAULT_RANGE_INEQ_SEL;
+ }
+ else
+ {
...
+   s2 = rqlist->hibound + rqlist->lobound - 1.0;
+   nullsel = nulltestsel(root, IS_NULL, rqlist->var, varRelid);
+   notnullsel = 1.0 - nullsel;
+
+   /* Adjust for double-exclusion of NULLs */
+   s2 += nullsel;
+   if (s2 <= 0.0) {
+  if (s2 <= (-1.0e-4 * notnullsel - 1.0e-6))
+  {
+   /* Most likely contradictory clauses found */
+   s2 = (1.0e-10 < notnullsel) ? 1.0e-10 : notnullsel;
+  }
+  else
+  {
+   /* Could be a rounding error */
+   s2 = DEFAULT_RANGE_INEQ_SEL * notnullsel;
+  }
+   }
+ }

Where (-1.0e-4 * notnullsel - 1.0e-6) is just a very rough (and overly
cautious) estimation of the amount of rounding error that could be
there with 32-bit floats.

The above assumes a non-DEFAULT_INEQ_SEL value in lobound/hibound is
an estimation based on stats, maybe approximate, but one that makes
sense (ie: preserves the monotonicity of the CPF), and as such
negative values are either sign of a contradiction or rounding error.
The previous code left the possibility of negatives coming out of
default selectivities creeping up on non-DEFAULT_INEQ_SEL estimates,
but that doesn't seem possible coming out of scalarineqsel.

But I'd like it if Tom could comment on that, since he's the one that
did that in commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a (way back
in 2004).

Barring that, I'd just change the

s2 = DEFAULT_RANGE_INEQ_SEL;

To

s2 = DEFAULT_RANGE_INEQ_SEL * notnullsel;

Which makes a lot more sense.
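To spell out why a clearly negative s2 signals a contradiction (under
the assumption, implied by the existing "double-exclusion of NULLs"
adjustment, that scalarineqsel's results already exclude nulls): with
$\nu$ the null fraction and $f_{hi}$, $f_{lo}$ the fractions of
non-null values passing each bound,

    $s_2 = f_{hi}(1-\nu) + f_{lo}(1-\nu) - 1 + \nu = (f_{hi} + f_{lo} - 1)(1-\nu)$

For a satisfiable range, $f_{hi} + f_{lo} = 1 + P(lo \le x \le hi)$
over the non-null values, so $s_2 \ge 0$ up to rounding; for
contradictory bounds the sum drops below 1 and $s_2$ goes genuinely
negative, which is exactly what the check above keys on.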


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making clausesel.c Smarter

2017-04-03 Thread Claudio Freire
On Mon, Apr 3, 2017 at 8:52 PM, David Rowley
 wrote:
>> One last observation:
>>
>> +/*
>> + * An IS NOT NULL test is a no-op if there's any other strict quals,
>> + * so if that's the case, then we'll only apply this, otherwise 
>> we'll
>> + * ignore it.
>> + */
>> +else if (cslist->selmask == CACHESEL_NOTNULLTEST)
>> +s1 *= cslist->notnullsel;
>>
>> In some paths, the range selectivity estimation code punts and returns
>> DEFAULT_RANGE_INEQ_SEL. In those cases, if a not null test was
>> present, it should still be applied. It could make a big difference if
>> the selectivity for the nulltest is high.
>
> I'd say that the location that applies the DEFAULT_RANGE_INEQ_SEL
> should apply the nullfrac. Seems wrong to rely on a IS NOT NULL test
> to exists to estimate that properly. I don't think that needs done as
> part of this patch. I'd rather limit the scope since "returned with
> feedback" is already knocking at the door of this patch.

I agree, but this patch regresses for those cases if the user
compensated with a not null test, leaving little recourse, so I'd fix
it in this patch to avoid the regression.

Maybe you're right that the null fraction should be applied regardless
of whether there's an explicit null check, though.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making clausesel.c Smarter

2017-04-03 Thread Claudio Freire
On Mon, Apr 3, 2017 at 5:59 AM, David Rowley
 wrote:
>> +static void addCachedSelectivityRangeVar(CachedSelectivityClause
>> **cslist, Node *var,
>> bool varonleft, bool isLTsel, Selectivity s2);
>>
>> You're changing clause -> var throughout the code when dealing with
>> nodes, but looking at their use, they hold neither. All those
>> addCachedSelectivity functions are usually passed expression nodes. If
>> you're renaming, perhaps you should rename to "expr".
>
> hmm, this is true. I kind of followed the lead of the name of the
> variable in the old RangeQueryClause struct. I have changed the names
> of these to be expr in the attached, but I didn't change the name of
> the Var in the new CachedSelectivityClause struct. It seems like a
> change not related to this patch.
>
> Do you think I should change that too?

I'd prefer it if all occurrences of the concept were changed, to
maintain consistency.
That would include variables used to hold expressions that refer to
these as well, as in the case of:

+Node   *var;
+
+if (varonleft)
+var = leftexpr;
+else
+var = rightexpr;
+


>> I would suggest avoiding making the assumption that it doesn't unless
>> the operator is strict.
>>
>> One could argue that such an operator should provide its own
>> selectivity estimator, but the strictness check shouldn't be too
>> difficult to add, and I think it's a better choice.
>>
>> So you'd have to check that:
>>
>> default:
>> if (op_strict(expr->opno) && func_strict(expr->opfuncid))
>> addCachedSelectivityOtherClause(, var, s2);
>> else
>> s1 = s1 * s2;
>> break;
>>
>
> I've changed it to something along those lines. I don't think the
> func_strict here is required, though, so I've gone with just
> op_strict().

Indeed, I realized that right after typing it; I thought I had
corrected it before I hit send. Seems I hadn't. Good thing you noticed.

One last observation:

+/*
+ * An IS NOT NULL test is a no-op if there's any other strict quals,
+ * so if that's the case, then we'll only apply this, otherwise we'll
+ * ignore it.
+ */
+else if (cslist->selmask == CACHESEL_NOTNULLTEST)
+s1 *= cslist->notnullsel;

In some paths, the range selectivity estimation code punts and returns
DEFAULT_RANGE_INEQ_SEL. In those cases, if a not null test was
present, it should still be applied. It could make a big difference if
the selectivity for the nulltest is high.
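For a sense of scale (illustrative numbers only): with the ~0.97 null
fraction from the case that motivated this patch, notnullsel is about
0.03, so dropping that factor in the punt path inflates the row
estimate by roughly 1/0.03, i.e. about 33x.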


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making clausesel.c Smarter

2017-04-03 Thread Claudio Freire
On Mon, Apr 3, 2017 at 6:22 PM, David Rowley
 wrote:
> On 4 April 2017 at 08:24, Andres Freund  wrote:
>> On 2017-04-03 20:59:42 +1200, David Rowley wrote:
>>> Updated patch attached.
>>>
>>> Thanks for reviewing it.
>>
>> Given the time in the release cycle I'm afraid that this it's too late
>> to get this into v10.  Does anybody disagree?  If not, it should be
>> moved to the next CF.
>
> I strongly disagree. The time in the release cycle is the final
> commitfest. This is when patches are commited to the repository. The
> exception to this is that no large invasive patches should arrive new
> in the final commitfest. This is not one of those.

Selectivity misestimations can have a huge impact on query
performance, and that could be a big deal if there's a big regression
on some queries.

But, the beta period could be enough to detect any issues there. I'll
leave that decision to committers or commitfest managers I guess.

>
> Tom has already mentioned he'd like to look at this. If you're not
> willing, then please just don't look at it, but please also don't
> remove other peoples opportunity for doing so.
>
> Please explain your logic for thinking otherwise. Have I somehow
> misunderstood what the 1 week extension means?

But Tom hasn't yet, and even if he does, we'd have to be done with the
review within this week.

In any case, I intend to go on with the review later today, and at
first look the patch seems in good shape (though it'll take a deeper
look before I make that my official review). Most other patches in
need of review are a bit complex for my knowledge of the code base,
and I already started reviewing this one. We could perhaps be done by
the end of the week.

We could defer that decision to within 4 days. If we're done, we're
done. If we're not, move to next CF.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making clausesel.c Smarter

2017-04-03 Thread Claudio Freire
On Mon, Apr 3, 2017 at 5:24 PM, Andres Freund  wrote:
> On 2017-04-03 20:59:42 +1200, David Rowley wrote:
>> Updated patch attached.
>>
>> Thanks for reviewing it.
>
> Given the time in the release cycle I'm afraid that this it's too late
> to get this into v10.  Does anybody disagree?  If not, it should be
> moved to the next CF.
>
> Greetings,
>
> Andres Freund

While personally I'd like to see this patch make it (I got bitten by
this more than once), I can't argue with that logic with the feature
freeze almost upon us, especially given the stage the review is at
(ie: just beginning).

I'm still going to review it though.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making clausesel.c Smarter

2017-03-31 Thread Claudio Freire
On Fri, Feb 24, 2017 at 7:00 AM, David Rowley
 wrote:
> I've stumbled over an interesting query out in the wild where the
> query was testing a nullable timestamptz column was earlier than some
> point in time, and also checking the column IS NOT NULL.
>
> The use pattern here was that records which required processing had
> this timestamp set to something other than NULL, a worker would come
> along and process those, and UPDATE the record to NULL to mark the
> fact that it was now processed. So what we are left with was a table
> with a small number of rows with a value in this timestamp column, and
> an ever-growing number of rows with a NULL value.
>
> A highly simplified version of the query was checking for records that
> required processing before some date, say:
>
> SELECT * FROM a WHERE ts < '2017-02-24' AND ts IS NOT NULL;
>
> Of course, the ts IS NOT NULL is not required here, but I can
> understand how someone might make the mistake of adding this. The
> simple solution to the problem was to have that null test removed, but
> that seemingly innocent little mistake caused some pain due to very
> slow running queries which held on to a nested loop plan 33 times
> longer than it should have been doing. Basically what was happening
> here is that clauselist_selectivity() was applying the selectivity
> with the ~0.97 null_fact from pg_statistic over the top of the already
> applied estimate on the number of rows below the constant timestamp.
>
> Since the diagnosis of this problem was not instant, and some amount
> of pain was suffered before the solution was found,  I wondered if it
> might be worth smartening up the planner a bit in this area.
>
> We're already pretty good at handling conditions like: SELECT * FROM a
> WHERE x < 10 and x < 1; where we'll effectively ignore the x < 10
> estimate since x < 1 is more restrictive, so if we just build on that
> ability a bit we could get enough code to cover the above case.
>
> I've attached a draft patch which improves the situation here.

I thought I'd take a quick look at this patch. I'll probably take a
deeper look later, but some initial comments:

-typedef struct RangeQueryClause
+typedef struct CachedSelectivityClause
 {
-struct RangeQueryClause *next;/* next in linked list */
+struct CachedSelectivityClause *next;/* next in linked list */
 Node   *var;/* The common variable of the clauses */
-boolhave_lobound;/* found a low-bound clause yet? */
-boolhave_hibound;/* found a high-bound clause yet? */
+intselmask;/* Bitmask of which sel types are stored */
 Selectivity lobound;/* Selectivity of a var > something clause */
 Selectivity hibound;/* Selectivity of a var < something clause */
-} RangeQueryClause;


As seems customary in other code, perhaps you should define some
HAS_X() macros for dealing with the selmask.

Something like

#define SELMASK_HAS_LOBOUND(selmask) (((selmask) & CACHESEL_LOBOUND) != 0)
etc.

+static void addCachedSelectivityRangeVar(CachedSelectivityClause
**cslist, Node *var,
bool varonleft, bool isLTsel, Selectivity s2);

You're changing clause -> var throughout the code when dealing with
nodes, but looking at their use, they hold neither. All those
addCachedSelectivity functions are usually passed expression nodes. If
you're renaming, perhaps you should rename to "expr".


On Fri, Feb 24, 2017 at 7:00 AM, David Rowley
 wrote:
> Now one thing I was unsure of in the patch was this "Other clauses"
> concept that I've come up with. In the patch we have:
>
> default:
> addCachedSelectivityOtherClause(, var, s2);
> break;
>
> I was unsure if this should only apply to F_EQSEL and F_NEQSEL. My
> imagination here won't stretch far enough to imagine a OpExpr which
> passes with a NULL operand. If it did my logic is broken, and if
> that's the case then I think limiting "Others" to mean F_EQSEL and
> F_NEQSEL would be the workaround.

While not specifically applicable only to "Others", something needs
consideration here:

User-defined functions can be nonstrict. An OpExpr involving such a
user-defined function could possibly pass on null.

I would suggest avoiding making the assumption that it doesn't unless
the operator is strict.

One could argue that such an operator should provide its own
selectivity estimator, but the strictness check shouldn't be too
difficult to add, and I think it's a better choice.

So you'd have to check that:

default:
if (op_strict(expr->opno) && func_strict(expr->opfuncid))
addCachedSelectivityOtherClause(, var, s2);
else
s1 = s1 * s2;
break;

So, I went ahead and did that.

Considering this setup:

createdb pgtest
cat 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-03-28 Thread Claudio Freire
On Wed, Feb 1, 2017 at 7:55 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Wed, Feb 1, 2017 at 6:13 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
>> On Wed, Feb 1, 2017 at 10:02 PM, Claudio Freire <klaussfre...@gmail.com> 
>> wrote:
>>> On Wed, Feb 1, 2017 at 5:47 PM, Masahiko Sawada <sawada.m...@gmail.com> 
>>> wrote:
>>>> Thank you for updating the patch.
>>>>
>>>> Whole patch looks good to me except for the following one comment.
>>>> This is the final comment from me.
>>>>
>>>> /*
>>>>  *  lazy_tid_reaped() -- is a particular tid deletable?
>>>>  *
>>>>  *  This has the right signature to be an IndexBulkDeleteCallback.
>>>>  *
>>>>  *  Assumes dead_tuples array is in sorted order.
>>>>  */
>>>> static bool
>>>> lazy_tid_reaped(ItemPointer itemptr, void *state)
>>>> {
>>>> LVRelStats *vacrelstats = (LVRelStats *) state;
>>>>
>>>> You might want to update the comment of lazy_tid_reaped() as well.
>>>
>>> I don't see the mismatch with reality there (if you consider
>>> "dead_tples array" in the proper context, that is, the multiarray).
>>>
>>> What in particular do you find out of sync there?
>>
>> The current lazy_tid_reaped just find a tid from a tid array using
>> bsearch but in your patch lazy_tid_reaped handles multiple tid arrays
>> and processing method become complicated. So I thought it's better to
>> add the description of this function.
>
> Alright, updated with some more remarks that seemed relevant

I just realized I never updated the early free patch after the
multiarray version.

So attached is a patch that frees the multiarray as early as possible
(just after finishing with index bulk deletes, right before doing
index cleanup and attempting truncation).

This should make the possibly big amount of memory available to other
processes for the duration of those tasks, which could be a long time
in some cases.
From f080f8377b7200ae9c2a02abfb0ef0bf6e2d5d69 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Tue, 28 Mar 2017 22:40:39 -0300
Subject: [PATCH] Vacuum: free dead tuples array as early as possible

Allow other parts of the system to benefit from the possibly
large amount of memory reserved for dead tuples after they're
no longer necessary.

While the memory would be freed when the command finishes, it
can sometimes be some considerable time between the time vacuum
is done with the array until the command finishes - mostly due
to truncate taking a long time.
---
 src/backend/commands/vacuumlazy.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 4a6895c..60a6c18 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -206,6 +206,7 @@ static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
 static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
 	   ItemPointer itemptr);
 static void lazy_clear_dead_tuples(LVRelStats *vacrelstats);
+static void lazy_free_dead_tuples(LVRelStats *vacrelstats);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
 static inline int vac_cmp_itemptr(const void *left, const void *right);
 static bool heap_page_is_all_visible(Relation rel, Buffer buf,
@@ -1358,6 +1359,9 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
  PROGRESS_VACUUM_PHASE_INDEX_CLEANUP);
 
+	/* dead tuples no longer needed past this point */
+	lazy_free_dead_tuples(vacrelstats);
+
 	/* Do post-vacuum cleanup and statistics update for each index */
 	for (i = 0; i < nindexes; i++)
 		lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
@@ -2165,6 +2169,24 @@ lazy_clear_dead_tuples(LVRelStats *vacrelstats)
 }
 
 /*
+ * lazy_free_dead_tuples - reset all dead tuples segments
+ * and free all allocated memory
+ */
+static void
+lazy_free_dead_tuples(LVRelStats *vacrelstats)
+{
+	int			nseg;
+
+	for (nseg = 0; nseg < vacrelstats->dead_tuples.num_segs; nseg++)
+		pfree(vacrelstats->dead_tuples.dt_segments[nseg].dt_tids);
+	pfree(vacrelstats->dead_tuples.dt_segments);
+	vacrelstats->dead_tuples.last_seg = 0;
+	vacrelstats->dead_tuples.num_segs = 0;
+	vacrelstats->dead_tuples.num_entries = 0;
+	vacrelstats->dead_tuples.dt_segments = NULL;
+}
+
+/*
  *	lazy_tid_reaped() -- is a particular tid deletable?
  *
  *		This has the right signature to be an IndexBulkDeleteCallback.
-- 
1.8.4.5


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-24 Thread Claudio Freire
On Fri, Feb 24, 2017 at 1:24 PM, Robert Haas  wrote:
> On Fri, Feb 24, 2017 at 7:29 PM, Kohei KaiGai  wrote:
>> This attached patch re-designed the previous v2 according to Robert's 
>> comment.
>> All I had to do is putting entrypoint for ForeignScan/CustomScan at
>> ExecShutdownNode,
>> because it is already modified to call child node first, earlier than
>> ExecShutdownGather
>> which releases DSM segment.
>
> Aside from the documentation, which needs some work, this looks fine
> to me on a quick read.

LGTM too.

You may want to clarify in the docs when the hook is called in
relation to other hooks too.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sum aggregate calculation for single precsion real

2017-02-15 Thread Claudio Freire
On Wed, Feb 15, 2017 at 9:52 AM, Robert Haas  wrote:
> On Tue, Feb 14, 2017 at 11:45 PM, Tom Lane  wrote:
>> You could perhaps make an argument that sum(float4) would have less risk
>> of overflow if it accumulated in and returned float8, but frankly that
>> seems a bit thin.
>
> I think that's more or less the argument Konstantin is in fact making.
> Whether it's a good argument or a thin one is a value judgement.
> Personally, I find it somewhere in the middle: I think the way it
> works now is reasonable, and I think what he wants would have been
> reasonable as well.  However, I find it hard to believe it would be
> worth changing now on backward compatibility grounds.  He doesn't like
> the way it works currently, but we have no way of knowing how many
> people who are happy with the way it works today would become unhappy
> if we changed it.  We need a fairly compelling reason to risk breaking
> somebody's SQL, and I don't think this rises to that level.


I know this comes up from time to time on this list, but a third
option that wouldn't break anybody's SQL would be using compensated
summation in the input type.

AFAIK, that can only increase precision, but it will cost cycles. The
impact could however fall below the noise and be worth a try.
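For the archives, compensated (Kahan) summation is only a few lines. A
minimal sketch in plain C over an array of floats, not the actual
float4 aggregate transition function (and it has to be compiled without
-ffast-math, or the compensation gets optimized away):

#include <stddef.h>

static float
kahan_sumf(const float *vals, size_t n)
{
	float		sum = 0.0f;
	float		comp = 0.0f;	/* compensation for lost low-order bits */
	size_t		i;

	for (i = 0; i < n; i++)
	{
		float		y = vals[i] - comp;
		float		t = sum + y;	/* low-order bits of y get lost here... */

		comp = (t - sum) - y;		/* ...and are recovered into comp */
		sum = t;
	}
	return sum;
}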


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

2017-02-09 Thread Claudio Freire
On Thu, Feb 9, 2017 at 9:50 PM, Jim Nasby  wrote:
> WHERE t1 IN ('a','b') OR t2 IN ('c','d')
>
> into
>
> WHERE f1 IN (1,2) OR f2 IN (3,4)
>
> (assuming a,b,c,d maps to 1,2,3,4)
>
> BTW, there's an important caveat here: users generally do NOT want duplicate
> rows from the fact table if the dimension table results aren't unique. I
> thought my array solution was equivalent to what the JOINs would do in that
> case but that's actually wrong. The array solution does provide the behavior
> users generally want here though. JOIN is the easiest tool to pick up for
> this, so it's what people gravitate to, but I suspect most users would be
> happier with a construct that worked like the array trick does, but was
> easier to accomplish.
>
> I wonder if any other databases have come up with non-standard syntax to do
> this.

What I've been doing is doing those transforms (tn -> fn) in
application code. While it's a chore, the improvement in plans is
usually well worth the trouble.

If there's an FK between the fact and dimension tables, you can be
certain the transform will yield equivalent results, because you'll be
certain the joins don't duplicate rows.

So the transform should be rather general and useful.

If you have a join of the form:

a join b on a.f1 = b.id

Where a.f1 has an FK referencing b.id, and there is a filter X on b of
any form, you can turn the plan into:

with b_ids as (select id from b where X)
...
a join b on a.f1 = b.id and a.f1 in (select id from b_ids)

In order to be useful, the expected row count from b_ids should be rather small.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-05 Thread Claudio Freire
On Mon, Feb 6, 2017 at 1:42 AM, Kouhei Kaigai  wrote:
> I also had thought an idea to have extra space to Instrumentation structure,
> however, it needs to make Instrumentation flexible-length structure according
> to the custom format by CSP/FDW. Likely, it is not a good design.
> As long as extension can retrieve its custom statistics on DSM area required
> by ExecParallelEstimate(), I have no preference on the hook location.

That's what I had in mind: the hook happens there, but the extension
retrieves the information from some extension-specific DSM area, just
as it would on the ParallelFinish hook.

> One thing we may pay attention is, some extension (not mine) may want to
> collect worker's statistics regardless of Instrumentation (in other words,
> even if plan is not under EXPLAIN ANALYZE).
> It is the reason why I didn't put a hook under the 
> ExecParallelRetrieveInstrumentation().

I don't think you should worry about that as long as it's a hypothetical case.

If/when some extension actually needs to do that, the design can be
discussed with a real use case at hand, and not a hypothetical one.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-05 Thread Claudio Freire
On Mon, Feb 6, 2017 at 1:00 AM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Sun, Feb 5, 2017 at 9:19 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
>>> If the use case for this is to gather instrumentation, I'd suggest calling
>>> the hook in RetrieveInstrumentation, and calling it appropriately. It would
>>> make the intended use far clearer than it is now.
>>>
>>> And if it saves some work, all the better.
>>>
>>> Until there's a use case for a non-instrumentation hook in that place, I
>>> wouldn't add it. This level of generality sounds like a solution waiting
>>> for a problem to solve.
>>>
>> The use cases I'd like to add are extension specific but significant for
>> performance analytics. These statistics are not included in Instrumentation.
>> For example, my problems are GPU execution time, data transfer ratio over
>> DMA, synchronization time for GPU task completion, and so on.
>> Only extension can know which attributes are collected during the execution,
>> and its data format. I don't think Instrumentation fits these requirements.
>> This is a problem I faced on the v9.6 based interface design, so I could
>> notice it.
>
>
> What RetrieveInstrumentation currently does may not cover the
> extension's specific instrumentation, but what I'm suggesting is
> adding a hook, like the one you're proposing for ParallelFinish, that
> would collect extension-specific instrumentation. Couple that with
> extension-specific explain output and you'll get your use cases
> covered, I think.


In essence, you added a ParallelFinish hook that is called after
RetrieveInstrumentation. That's overreaching for your use case.

If instrumentation is what you're collecting, it's far more correct,
and more readable, to have a hook called from inside
RetrieveInstrumentation that will retrieve extension-specific
instrumentation.

RetrieveInstrumentation itself doesn't need to parse that information;
it will be loaded into the custom scan nodes, and those are the ones
that will parse it when generating explain output.

So, in essence it's almost what you did with ParallelFinish, more
clearly aimed at collecting runtime statistics.
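
Purely to make the shape of that concrete (hypothetical names, nothing
that exists today or that I'm proposing verbatim), I'm thinking of
something along these lines:

/* Hypothetical: a per-node callback invoked while instrumentation is
 * being retrieved, handing the custom scan node its extension-specific
 * DSM area so it can copy whatever it needs into backend-local memory,
 * to be formatted later when generating explain output. */
typedef void (*RetrieveCustomInstrumentation_function) (struct CustomScanState *node,
                                                        void *coordinate);

The exact signature doesn't matter much; the point is that it hangs off
the instrumentation-retrieval path rather than being a general-purpose
finish hook.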


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-05 Thread Claudio Freire
On Sun, Feb 5, 2017 at 9:19 PM, Kouhei Kaigai  wrote:
>> If the use case for this is to gather instrumentation, I'd suggest calling
>> the hook in RetrieveInstrumentation, and calling it appropriately. It would
>> make the intended use far clearer than it is now.
>>
>> And if it saves some work, all the better.
>>
>> Until there's a use case for a non-instrumentation hook in that place, I
>> wouldn't add it. This level of generality sounds like a solution waiting
>> for a problem to solve.
>>
> The use cases I'd like to add are extension specific but significant for
> performance analytics. These statistics are not included in Instrumentation.
> For example, my problems are GPU execution time, data transfer ratio over
> DMA, synchronization time for GPU task completion, and so on.
> Only extension can know which attributes are collected during the execution,
> and its data format. I don't think Instrumentation fits these requirements.
> This is a problem I faced on the v9.6 based interface design, so I could
> notice it.


What RetrieveInstrumentation currently does may not cover the
extension's specific instrumentation, but what I'm suggesting is
adding a hook, like the one you're proposing for ParallelFinish, that
would collect extension-specific instrumentation. Couple that with
extension-specific explain output and you'll get your use cases
covered, I think.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-03 Thread Claudio Freire
On Mon, Oct 31, 2016 at 11:33 AM, Kouhei Kaigai  wrote:
> Hello,
>
> The attached patch implements the suggestion by Amit before.
>
> What I'm motivated is to collect extra run-time statistics specific
> to a particular ForeignScan/CustomScan, not only the standard
> Instrumentation; like DMA transfer rate or execution time of GPU
> kernels in my case.
>
> Per-node DSM toc is one of the best way to return run-time statistics
> to the master backend, because FDW/CSP can assign arbitrary length of
> the region according to its needs. It is quite easy to require.
> However, one problem is, the per-node DSM toc is already released when
> ExecEndNode() is called on the child node of Gather.
>
> This patch allows extensions to get control on the master backend's
> context when all the worker node gets finished but prior to release
> of the DSM segment. If FDW/CSP has its special statistics on the
> segment, it can move to the private memory area for EXPLAIN output
> or something other purpose.
>
> One design consideration is whether the hook shall be called from
> ExecParallelRetrieveInstrumentation() or ExecParallelFinish().
> The former is a function to retrieve the standard Instrumentation
> information, thus, it is valid only if EXPLAIN ANALYZE.
> On the other hands, if we put entrypoint at ExecParallelFinish(),
> extension can get control regardless of EXPLAIN ANALYZE, however,
> it also needs an extra planstate_tree_walker().

If the use case for this is to gather instrumentation, I'd suggest calling the
hook in RetrieveInstrumentation, and calling it appropriately. It would
make the intended use far clearer than it is now.

And if it saves some work, all the better.

Until there's a use case for a non-instrumentation hook in that place,
I wouldn't add it. This level of generality sounds like a solution waiting
for a problem to solve.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-02-01 Thread Claudio Freire
On Wed, Feb 1, 2017 at 6:13 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Wed, Feb 1, 2017 at 10:02 PM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> On Wed, Feb 1, 2017 at 5:47 PM, Masahiko Sawada <sawada.m...@gmail.com> 
>> wrote:
>>> Thank you for updating the patch.
>>>
>>> Whole patch looks good to me except for the following one comment.
>>> This is the final comment from me.
>>>
>>> /*
>>>  *  lazy_tid_reaped() -- is a particular tid deletable?
>>>  *
>>>  *  This has the right signature to be an IndexBulkDeleteCallback.
>>>  *
>>>  *  Assumes dead_tuples array is in sorted order.
>>>  */
>>> static bool
>>> lazy_tid_reaped(ItemPointer itemptr, void *state)
>>> {
>>> LVRelStats *vacrelstats = (LVRelStats *) state;
>>>
>>> You might want to update the comment of lazy_tid_reaped() as well.
>>
>> I don't see the mismatch with reality there (if you consider
>> "dead_tples array" in the proper context, that is, the multiarray).
>>
>> What in particular do you find out of sync there?
>
> The current lazy_tid_reaped just find a tid from a tid array using
> bsearch but in your patch lazy_tid_reaped handles multiple tid arrays
> and processing method become complicated. So I thought it's better to
> add the description of this function.

Alright, updated with some more remarks that seemed relevant
From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.
---
 src/backend/commands/vacuumlazy.c| 318 ---
 src/test/regress/expected/vacuum.out |   8 +
 src/test/regress/sql/vacuum.sql  |  10 ++
 3 files changed, 271 insertions(+), 65 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 005440e..4a6895c 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -12,11 +12,24 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
+ * initially allocate an array of TIDs of 128MB, or an upper limit that
  * depends on table size (this limit ensures we don't allocate a huge area
- * uselessly for vacuuming small tables).  If the array threatens to overflow,
- * we suspend the heap scan phase and perform a pass of index cleanup and page
- * compaction, then resume the heap scan with an empty TID array.
+ * uselessly for vacuuming small tables). Additional arrays of increasingly
+ * large sizes are allocated as they become necessary.
+ *
+ * The TID array is thus represented as a list of multiple segments of
+ * varying size, beginning with the initial size of up to 128MB, and growing
+ * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
+ * is used up.
+ *
+ * Lookup in that structure proceeds sequentially in the list of segments,
+ * and with a binary search within each segment. Since segment's size grows
+ * exponentially, this retains O(N log N) lookup complexity.
+ *
+ * If the array threatens to overflow, we suspend the heap scan phase and
+ * perform a pass of index cleanup and page compaction, then resume the heap
+ * scan with an array of logically empty but already preallocated TID segments
+ * to be refilled with more dead tuple TIDs.
  *
  * If we're processing a table with no indexes, we can just vacuum each page
  * as we go; there's no need to save up multiple tuples to minimize the number
@@ -93,6 +106,14 @@
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
 
 /*
+ * Minimum (starting) size of the dead_tuples array segments. Will allocate
+ * space for 128MB worth of tid pointers in the first segment, further segments
+ * will grow in size exponentially. Don't make it too small or the segment list
+ * will grow bigger than the sweetspot for search efficiency on big vacuums.
+ */
+#define LAZY_MIN_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
+
+/*
  * Before we consider skipping a page that's marked as clean in
  * visibility map, we must've seen at least this many clean pages.
  */
@@ -104,6 +125,27 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+typedef struct DeadTuplesSegment
+{
+	int			num_dead_tuples;	/* # of entries in the segment */
+	int			max_dead_t

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-02-01 Thread Claudio Freire
On Wed, Feb 1, 2017 at 5:47 PM, Masahiko Sawada  wrote:
> Thank you for updating the patch.
>
> Whole patch looks good to me except for the following one comment.
> This is the final comment from me.
>
> /*
>  *  lazy_tid_reaped() -- is a particular tid deletable?
>  *
>  *  This has the right signature to be an IndexBulkDeleteCallback.
>  *
>  *  Assumes dead_tuples array is in sorted order.
>  */
> static bool
> lazy_tid_reaped(ItemPointer itemptr, void *state)
> {
> LVRelStats *vacrelstats = (LVRelStats *) state;
>
> You might want to update the comment of lazy_tid_reaped() as well.

I don't see the mismatch with reality there (if you consider
"dead_tples array" in the proper context, that is, the multiarray).

What in particular do you find out of sync there?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-01-30 Thread Claudio Freire
On Mon, Jan 30, 2017 at 5:51 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> 
>  * We are willing to use at most maintenance_work_mem (or perhaps
>  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
>  * initially allocate an array of TIDs of that size, with an upper limit that
>  * depends on table size (this limit ensures we don't allocate a huge area
>  * uselessly for vacuuming small tables).  If the array threatens to overflow,
>
> I think that we need to update the above paragraph comment at top of
> vacuumlazy.c file.

Indeed, I missed that one. Fixing.

>
> 
> +   numtuples = Max(numtuples,
> MaxHeapTuplesPerPage);
> +   numtuples = Min(numtuples, INT_MAX / 2);
> +   numtuples = Min(numtuples, 2 *
> pseg->max_dead_tuples);
> +   numtuples = Min(numtuples,
> MaxAllocSize / sizeof(ItemPointerData));
> +   seg->dt_tids = (ItemPointer)
> palloc(sizeof(ItemPointerData) * numtuples);
>
> Why numtuples is limited to "INT_MAX / 2" but not INT_MAX?

I forgot to mention this one in the OP.

Googling around, I found that some implementations of bsearch break with
array sizes beyond INT_MAX/2 [1] (they'd overflow when computing the
midpoint).

Before this patch, this bsearch call had no way of reaching that size.
An initial version of the patch (the one that allocated a big array
with huge allocation) could reach that point, though, so I reduced the
limit to play it safe. This latest version is back to the starting
point, since it cannot allocate segments bigger than 1GB, but I opted
to keep playing it safe and leave the reduced limit just in case.
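
Just to illustrate the failure mode (a standalone sketch, not code from
any libc or from the patch):

#include <stdio.h>
#include <limits.h>

int
main(void)
{
    int     low = INT_MAX / 2;
    int     high = INT_MAX - 1;
    int     mid;

    /*
     * The naive midpoint, (low + high) / 2, overflows here: low + high
     * exceeds INT_MAX, which is undefined behaviour for signed int and
     * in practice tends to yield a negative index.
     */

    /* The overflow-safe form never exceeds high: */
    mid = low + (high - low) / 2;

    printf("safe midpoint: %d\n", mid);
    return 0;
}

Implementations that compute the midpoint the naive way are the ones
that break past INT_MAX/2 elements, hence the conservative limit.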

> 
> @@ -1376,35 +1411,43 @@ lazy_vacuum_heap(Relation onerel, LVRelStats
> *vacrelstats)
> pg_rusage_init(&ru0);
> npages = 0;
>
> -   tupindex = 0;
> -   while (tupindex < vacrelstats->num_dead_tuples)
> +   segindex = 0;
> +   tottuples = 0;
> +   for (segindex = tupindex = 0; segindex <=
> vacrelstats->dead_tuples.last_seg; tupindex = 0, segindex++)
> {
> -   BlockNumber tblk;
> -   Buffer  buf;
> -   Pagepage;
> -   Sizefreespace;
>
> This is a minute thing but tupindex can be define inside of for loop.

Right, changing.

>
> 
> @@ -1129,10 +1159,13 @@ lazy_scan_heap(Relation onerel, int options,
> LVRelStats *vacrelstats,
>   * instead of doing a second scan.
>   */
>  if (nindexes == 0 &&
> -vacrelstats->num_dead_tuples > 0)
> +vacrelstats->dead_tuples.num_entries > 0)
>  {
>  /* Remove tuples from heap */
> +lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
> +Assert(vacrelstats->dead_tuples.last_seg == 0);/*
> Should not need more
> + *
> than one segment per
> + * page */
>
> I'm not sure we need to add Assert() here but it seems to me that the
> comment and code is not properly correspond and the comment for
> Assert() should be wrote above of Assert() line.

Well, that assert is the one that found the second bug in
lazy_clear_dead_tuples, so clearly it's not without merit.

I'll rearrange the comments as you ask though.


Updated and rebased v7 attached.


[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=776671
From d32610b0ad6b9413aa4b2d808019d3c67d578f3c Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.
---
 src/backend/commands/vacuumlazy.c| 314 ---
 src/test/regress/expected/vacuum.out |   8 +
 src/test/regress/sql/vacuum.sql  |  10 ++
 3 files changed, 268 insertions(+), 64 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 005440e..8cb614f 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -12,11 +12,24 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
+ * initially allocate an array of TIDs of

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-01-25 Thread Claudio Freire
On Wed, Jan 25, 2017 at 1:54 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> Thank you for updating the patch!
>
> +   /*
> +* Quickly rule out by lower bound (should happen a lot) Upper bound 
> was
> +* already checked by segment search
> +*/
> +   if (vac_cmp_itemptr((void *) itemptr, (void *) rseg->dead_tuples) < 0)
> +   return false;
>
> I think that if the above result is 0, we can return true as itemptr
> matched lower bound item pointer in rseg->dead_tuples.

That's right. Possibly not a great speedup but... why not?

>
>  +typedef struct DeadTuplesSegment
> +{
> +   int num_dead_tuples;/* # of
> entries in the segment */
> +   int max_dead_tuples;/* # of
> entries allocated in the segment */
> +   ItemPointerData last_dead_tuple;/* Copy of the last
> dead tuple (unset
> +
>   * until the segment is fully
> +
>   * populated) */
> +   unsigned short padding;
> +   ItemPointer dead_tuples;/* Array of dead tuples */
> +}  DeadTuplesSegment;
> +
> +typedef struct DeadTuplesMultiArray
> +{
> +   int num_entries;/* current # of entries */
> +   int max_entries;/* total # of slots
> that can be allocated in
> +* array */
> +   int num_segs;   /* number of
> dead tuple segments allocated */
> +   int last_seg;   /* last dead
> tuple segment with data (or 0) */
> +   DeadTuplesSegment *dead_tuples; /* array of num_segs segments 
> */
> +}  DeadTuplesMultiArray;
>
> It's a matter of personal preference but some same dead_tuples
> variables having different meaning confused me.
> If we want to access first dead tuple location of first segment, we
> need to do 'vacrelstats->dead_tuples.dead_tuples.dead_tuples'. For
> example, 'vacrelstats->dead_tuples.dt_segment.dt_array' is better to
> me.

Yes, I can see how that could be confusing.

I went for vacrelstats->dead_tuples.dt_segments[i].dt_tids[j]

> +   nseg->num_dead_tuples = 0;
> +   nseg->max_dead_tuples = 0;
> +   nseg->dead_tuples = NULL;
> +   vacrelstats->dead_tuples.num_segs++;
> +   }
> +   seg = DeadTuplesCurrentSegment(vacrelstats);
> +   }
> +   vacrelstats->dead_tuples.last_seg++;
> +   seg = DeadTuplesCurrentSegment(vacrelstats);
>
> Because seg is always set later I think the first line starting with
> "seg = ..." is not necessary. Thought?

That's correct.

Attached a v6 with those changes (and rebased).

Make check still passes.
From c89019089a71517befac0920f22ed75577dda6c6 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.
---
 src/backend/commands/vacuumlazy.c| 291 ---
 src/test/regress/expected/vacuum.out |   8 +
 src/test/regress/sql/vacuum.sql  |  10 ++
 3 files changed, 250 insertions(+), 59 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 005440e..1d2441f 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -93,6 +93,14 @@
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
 
 /*
+ * Minimum (starting) size of the dead_tuples array segments. Will allocate
+ * space for 128MB worth of tid pointers in the first segment, further segments
+ * will grow in size exponentially. Don't make it too small or the segment list
+ * will grow bigger than the sweetspot for search efficiency on big vacuums.
+ */
+#define LAZY_MIN_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
+
+/*
  * Before we consider skipping a page that's marked as clean in
  * visibility map, we must've seen at least this many clean pages.
  */
@@ -104,6 +112,27 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+typedef struct DeadTuplesSegment
+{
+	int			num_dead_tuples;	/* # of entries in the segment */
+	int			max_dead_tuples;	/* # of entries allocated in the segment */
+	ItemPointer

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-01-23 Thread Claudio Freire
On Fri, Jan 20, 2017 at 6:24 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Thu, Jan 19, 2017 at 8:31 PM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> On Thu, Jan 19, 2017 at 6:33 AM, Anastasia Lubennikova
>> <a.lubennik...@postgrespro.ru> wrote:
>>> 28.12.2016 23:43, Claudio Freire:
>>>
>>> Attached v4 patches with the requested fixes.
>>>
>>>
>>> Sorry for being late, but the tests took a lot of time.
>>
>> I know. Takes me several days to run my test scripts once.
>>
>>> create table t1 as select i, md5(random()::text) from
>>> generate_series(0,4) as i;
>>> create index md5_idx ON  t1(md5);
>>> update t1 set md5 = md5((random() * (100 + 500))::text);
>>> vacuum;
>>>
>>> Patched vacuum used 2.9Gb of memory and vacuumed the index in one pass,
>>> while for old version it took three passes (1GB+1GB+0.9GB).
>>> Vacuum duration results:
>>>
>>> vanilla:
>>> LOG: duration: 4359006.327 ms  statement: vacuum verbose t1;
>>> patched:
>>> LOG: duration: 3076827.378 ms  statement: vacuum verbose t1;
>>>
>>> We can see 30% vacuum speedup. I should note that this case can be
>>> considered
>>> as favorable to vanilla vacuum: the table is not that big, it has just one
>>> index
>>> and disk used is a fast fusionIO. We can expect even more gain on slower
>>> disks.
>>>
>>> Thank you again for the patch. Hope to see it in 10.0.
>>
>> Cool. Thanks for the review and the tests.
>>
>
> I encountered a bug with following scenario.
> 1. Create table and disable autovacuum on that table.
> 2. Make about 20 dead tuples on the table.
> 3. SET maintenance_work_mem TO 1024
> 4. VACUUM
>
> @@ -729,7 +759,7 @@ lazy_scan_heap(Relation onerel, int options,
> LVRelStats *vacrelstats,
>  * not to reset latestRemovedXid since we want
> that value to be
>  * valid.
>  */
> -   vacrelstats->num_dead_tuples = 0;
> +   lazy_clear_dead_tuples(vacrelstats);
> vacrelstats->num_index_scans++;
>
> /* Report that we are once again scanning the heap */
>
> I think that we should do vacrelstats->dead_tuples.num_entries = 0 as
> well in lazy_clear_dead_tuples(). Once the amount of dead tuples
> reached to maintenance_work_mem, lazy_scan_heap can never finish.

That's right.

I added a test for it in the attached patch set, which uncovered
another bug in lazy_clear_dead_tuples, and took the opportunity to
rebase.

On Mon, Jan 23, 2017 at 1:06 PM, Alvaro Herrera
<alvhe...@2ndquadrant.com> wrote:
> I pushed this patch after rewriting it rather completely.  I added
> tracing notices to inspect the blocks it was prefetching and observed
> that the original coding was failing to prefetch the final streak of
> blocks in the table, which is an important oversight considering that it
> may very well be that those are the only blocks to read at all.
>
> I timed vacuuming a 4000-block table in my laptop (single SSD disk;
> dropped FS caches after deleting all rows in table, so that vacuum has
> to read all blocks from disk); it changes from 387ms without patch to
> 155ms with patch.  I didn't measure how much it takes to run the other
> steps in the vacuum, but it's clear that the speedup for the truncation
> phase is considerable.
>
> ¡Thanks, Claudio!

Cool.

Though it wasn't the first time this idea has been floating around, I
can't take all the credit.


On Fri, Jan 20, 2017 at 6:25 PM, Alvaro Herrera
<alvhe...@2ndquadrant.com> wrote:
> FWIW, I think this patch is completely separate from the maint_work_mem
> patch and should have had its own thread and its own commitfest entry.
> I intend to get a look at the other patch next week, after pushing this
> one.

That's because it did have it, and was left in limbo due to lack of
testing on SSDs. I just had to adopt it here because otherwise tests
took way too long.
From 92c610b4ed064afba0f914efd78a03c61f4742c6 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.
---
 src/backend/commands/vacuumlazy.c| 288 ---
 src/test/regress/expected/vac

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-01-19 Thread Claudio Freire
On Thu, Jan 19, 2017 at 6:33 AM, Anastasia Lubennikova
<a.lubennik...@postgrespro.ru> wrote:
> 28.12.2016 23:43, Claudio Freire:
>
> Attached v4 patches with the requested fixes.
>
>
> Sorry for being late, but the tests took a lot of time.

I know. Takes me several days to run my test scripts once.

> create table t1 as select i, md5(random()::text) from
> generate_series(0,4) as i;
> create index md5_idx ON  t1(md5);
> update t1 set md5 = md5((random() * (100 + 500))::text);
> vacuum;
>
> Patched vacuum used 2.9Gb of memory and vacuumed the index in one pass,
> while for old version it took three passes (1GB+1GB+0.9GB).
> Vacuum duration results:
>
> vanilla:
> LOG: duration: 4359006.327 ms  statement: vacuum verbose t1;
> patched:
> LOG: duration: 3076827.378 ms  statement: vacuum verbose t1;
>
> We can see 30% vacuum speedup. I should note that this case can be
> considered
> as favorable to vanilla vacuum: the table is not that big, it has just one
> index
> and disk used is a fast fusionIO. We can expect even more gain on slower
> disks.
>
> Thank you again for the patch. Hope to see it in 10.0.

Cool. Thanks for the review and the tests.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Block level parallel vacuum WIP

2017-01-10 Thread Claudio Freire
On Tue, Jan 10, 2017 at 6:42 AM, Masahiko Sawada  wrote:
>> Does this work negate the other work to allow VACUUM to use >
>> 1GB memory?
>
> Partly yes. Because memory space for dead TIDs needs to be allocated
> in DSM before vacuum worker launches, parallel lazy vacuum cannot use
> such a variable amount of memory as that work does. But in
> non-parallel lazy vacuum, that work would be effective. We might be
> able to do similar thing using DSA but I'm not sure that is better.

I think it would work well with DSA as well.

Just instead of having a single segment list, you'd have one per worker.

Since workers work on disjoint tid sets, that shouldn't pose a problem.

The segment list can be joined together later rather efficiently
(simple logical joining of the segment pointer arrays) for the index
scan phases.
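
A rough sketch of that joining step, with made-up types standing in for
whatever the DSA-based version would actually use (lookup and ordering
details glossed over):

#include "postgres.h"
#include "storage/itemptr.h"

typedef struct TidSegment
{
    ItemPointerData *tids;      /* lives in DSA, not copied here */
    int         ntids;
} TidSegment;

typedef struct TidSegmentList
{
    TidSegment *segs;
    int         nsegs;
} TidSegmentList;

/*
 * Concatenate the per-worker segment pointer arrays into a single list
 * for the index scan phase; only the small descriptor entries are
 * copied, the TID arrays themselves stay where they are.
 */
static void
join_segment_lists(TidSegmentList *dst, TidSegmentList *workers, int nworkers)
{
    int         total = 0;
    int         i;
    int         k = 0;

    for (i = 0; i < nworkers; i++)
        total += workers[i].nsegs;

    dst->segs = palloc(sizeof(TidSegment) * total);
    dst->nsegs = total;

    for (i = 0; i < nworkers; i++)
    {
        memcpy(dst->segs + k, workers[i].segs,
               sizeof(TidSegment) * workers[i].nsegs);
        k += workers[i].nsegs;
    }
}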


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Block level parallel vacuum WIP

2017-01-10 Thread Claudio Freire
On Tue, Jan 10, 2017 at 6:42 AM, Masahiko Sawada  wrote:
> Attached result of performance test with scale factor = 500 and the
> test script I used. I measured each test at four times and plot
> average of last three execution times to sf_500.png file. When table
> has index, vacuum execution time is smallest when number of index and
> parallel degree is same.

It does seem from those results that parallel heap scans aren't paying
off, and in fact are hurting.

It could be your I/O that's at odds with the parallel degree settings
rather than the approach (ie: your I/O system can't handle that many
parallel scans), but in any case it does warrant a few more tests.

I'd suggest you try to:

1. Disable parallel lazy vacuum, leave parallel index scans
2. Limit parallel degree to number of indexes, leaving parallel lazy
vacuum enabled
3. Cap lazy vacuum parallel degree by effective_io_concurrency, and
index scan parallel degree to number of indexes

And compare against your earlier test results.

I suspect 1 could be the winner, but 3 has a chance too (if e_i_c is
properly set up for your I/O system).


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Block level parallel vacuum WIP

2017-01-06 Thread Claudio Freire
On Fri, Jan 6, 2017 at 2:38 PM, Masahiko Sawada  wrote:
>  table_size | indexes | parallel_degree |   time
> ------------+---------+-----------------+----------
>  6.5GB  |   0 |   1 | 00:00:14
>  6.5GB  |   0 |   2 | 00:00:02
>  6.5GB  |   0 |   4 | 00:00:02

Those numbers look highly suspect.

Are you sure you're not experiencing caching effects? (ie: maybe you
ran the second and third vacuums after the first, and didn't flush the
page cache, so the table was cached)

>  6.5GB  |   2 |   1 | 00:02:18
>  6.5GB  |   2 |   2 | 00:00:38
>  6.5GB  |   2 |   4 | 00:00:46
...
>  13GB   |   0 |   1 | 00:03:52
>  13GB   |   0 |   2 | 00:00:49
>  13GB   |   0 |   4 | 00:00:50
..
>  13GB   |   2 |   1 | 00:12:42
>  13GB   |   2 |   2 | 00:01:17
>  13GB   |   2 |   4 | 00:02:12

These would also be consistent with caching effects.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-28 Thread Claudio Freire
On Wed, Dec 28, 2016 at 3:41 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
>> Anyway, I found the problem that had caused segfault.
>>
>> for (segindex = 0; segindex <= vacrelstats->dead_tuples.last_seg; tupindex =
>> 0, segindex++)
>> {
>> DeadTuplesSegment *seg =
>> &(vacrelstats->dead_tuples.dead_tuples[segindex]);
>> intnum_dead_tuples = seg->num_dead_tuples;
>>
>> while (tupindex < num_dead_tuples)
>> ...
>>
>> You rely on the value of tupindex here, while during the very first pass the
>> 'tupindex' variable
>> may contain any garbage. And it happend that on my system there was negative
>> value
>> as I found inspecting core dump:
>>
>> (gdb) info locals
>> num_dead_tuples = 5
>> tottuples = 0
>> tupindex = -1819017215
>>
>> Which leads to failure in the next line
>> tblk = ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);
>>
>> The solution is to move this assignment inside the cycle.
>
> Good catch. I read that line suspecting that very same thing but
> somehow I was blind to it.

Attached v4 patches with the requested fixes.
From 988527b49ec0f48ea7fc85c7533e8eaa07d6fc3b Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Fri, 2 Sep 2016 23:33:48 -0300
Subject: [PATCH 1/2] Vacuum: prefetch buffers on backward scan

This patch speeds up truncation by prefetching buffers during
the backward scan phase just prior to truncation. This optimization
helps rotating media because disk rotation induces a buffer-to-buffer
latency of roughly a whole rotation when reading backwards, such
disks are usually optimized for forward scans. Prefetching buffers
in larger chunks ameliorates the issue considerably and is otherwise
harmless. The prefetch amount should also not be an issue on SSD,
which won't benefit from the optimization, but won't be hurt either.
---
 src/backend/commands/vacuumlazy.c | 21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index b5fb325..854bce3 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1825,7 +1825,7 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 static BlockNumber
 count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 {
-	BlockNumber blkno;
+	BlockNumber blkno, prefetchBlkno;
 	instr_time	starttime;
 
 	/* Initialize the starttime if we check for conflicting lock requests */
@@ -1833,6 +1833,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 
 	/* Strange coding of loop control is needed because blkno is unsigned */
 	blkno = vacrelstats->rel_pages;
+	prefetchBlkno = blkno & ~0x1f;
+	prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0;
 	while (blkno > vacrelstats->nonempty_pages)
 	{
 		Buffer		buf;
@@ -1882,6 +1884,23 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 
 		blkno--;
 
+		/*
+		 * Speed up reads on rotating rust by prefetching a few blocks ahead.
+		 * Backward scans on rotary disks are slow if done one buffer at a time
+		 * because readahead won't kick in on most OSes, and each buffer will
+		 * have to wait for the platter to do a full rotation.
+		 * Should be harmless on SSD.
+		 */
+		if (blkno <= prefetchBlkno) {
+			BlockNumber pblkno;
+			if (prefetchBlkno >= 32)
+prefetchBlkno -= 32;
+			else
+prefetchBlkno = 0;
+			for (pblkno = prefetchBlkno; pblkno < blkno; pblkno++)
+PrefetchBuffer(onerel, MAIN_FORKNUM, pblkno);
+		}
+
 		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
  RBM_NORMAL, vac_strategy);
 
-- 
1.8.4.5

From d360720415a2db5d5285a757c9a06e7ed854c5c5 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 2/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.
---
 src/backend/commands/vacuumlazy.c | 302 ++
 1 file changed, 237 insertions(+), 65 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 854bce3..8b714e0 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -93,11 +93,40 @@
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
 
 /*
+ * Minimum (starting) size of the dead_tuples array segments. Will allocate
+ * space for 128MB worth of tid pointers in the first segment, further segments
+ * will grow in size exponentially. Don't make it too small 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-28 Thread Claudio Freire
On Wed, Dec 28, 2016 at 10:26 AM, Anastasia Lubennikova
<a.lubennik...@postgrespro.ru> wrote:
> 27.12.2016 20:14, Claudio Freire:
>
> On Tue, Dec 27, 2016 at 10:41 AM, Anastasia Lubennikova
> <a.lubennik...@postgrespro.ru> wrote:
>
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
> vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
> 1417tblk =
> ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);
> (gdb) bt
> #0  0x006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
> vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
> #1  0x00693dfe in lazy_scan_heap (onerel=0x1ec2360, options=9,
> vacrelstats=0x1ef6e00, Irel=0x1ef7168, nindexes=2, aggressive=1 '\001')
> at vacuumlazy.c:1337
> #2  0x00691e66 in lazy_vacuum_rel (onerel=0x1ec2360, options=9,
> params=0x7ffe0f866310, bstrategy=0x1f1c4a8) at vacuumlazy.c:290
> #3  0x0069191f in vacuum_rel (relid=1247, relation=0x0, options=9,
> params=0x7ffe0f866310) at vacuum.c:1418
>
> Those line numbers don't match my code.
>
> Which commit are you based on?
>
> My tree is (currently) based on 71f996d22125eb6cfdbee6094f44370aa8dec610
>
>
> Hm, my branch is based on 71f996d22125eb6cfdbee6094f44370aa8dec610 as well.
> I merely applied patches
> 0001-Vacuum-prefetch-buffers-on-backward-scan-v3.patch
> and 0002-Vacuum-allow-using-more-than-1GB-work-mem-v3.patch
> then ran configure and make as usual.
> Am I doing something wrong?

Doesn't sound wrong. Maybe it's my tree the unclean one. I'll have to
do a clean checkout to verify.

> Anyway, I found the problem that had caused segfault.
>
> for (segindex = 0; segindex <= vacrelstats->dead_tuples.last_seg; tupindex =
> 0, segindex++)
> {
> DeadTuplesSegment *seg =
> &(vacrelstats->dead_tuples.dead_tuples[segindex]);
> intnum_dead_tuples = seg->num_dead_tuples;
>
> while (tupindex < num_dead_tuples)
> ...
>
> You rely on the value of tupindex here, while during the very first pass the
> 'tupindex' variable
> may contain any garbage. And it happend that on my system there was negative
> value
> as I found inspecting core dump:
>
> (gdb) info locals
> num_dead_tuples = 5
> tottuples = 0
> tupindex = -1819017215
>
> Which leads to failure in the next line
> tblk = ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);
>
> The solution is to move this assignment inside the cycle.

Good catch. I read that line suspecting that very same thing but
somehow I was blind to it.

> I've read the second patch.
>
> 1. What is the reason to inline vac_cmp_itemptr() ?

Performance, mostly. By inlining, some transformations can be applied
that wouldn't be possible otherwise. During the binary search, this
matters performance-wise.
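
For reference, a static inline version of that comparator would look
roughly like this (a sketch mirroring what vac_cmp_itemptr does today,
not necessarily the patch's exact code):

#include "postgres.h"
#include "storage/itemptr.h"

static inline int
vac_cmp_itemptr_inline(const void *left, const void *right)
{
    BlockNumber lblk,
                rblk;
    OffsetNumber loff,
                roff;

    lblk = ItemPointerGetBlockNumber((ItemPointer) left);
    rblk = ItemPointerGetBlockNumber((ItemPointer) right);

    if (lblk < rblk)
        return -1;
    if (lblk > rblk)
        return 1;

    loff = ItemPointerGetOffsetNumber((ItemPointer) left);
    roff = ItemPointerGetOffsetNumber((ItemPointer) right);

    if (loff < roff)
        return -1;
    if (loff > roff)
        return 1;

    return 0;
}

Declared like that in a header (or above the call site), the compiler
can fold the comparison straight into the binary-search loop instead of
going through a function pointer, which is what bsearch() would force.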

> 2. +#define LAZY_MIN_TUPLESMax(MaxHeapTuplesPerPage, (128<<20) /
> sizeof(ItemPointerData))
> What does 128<<20 mean? Why not 1<<27? I'd ask you to replace it with named
> constant,
> or at least add a comment.

I thought it was more readable like that: since 1<<20 is known to be
"MB", it reads as "128 MB".

I guess I can add a comment saying so.

> I'll share my results of performance testing it in a few days.

Thanks


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-27 Thread Claudio Freire
On Tue, Dec 27, 2016 at 10:41 AM, Anastasia Lubennikova
 wrote:
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
> vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
> 1417tblk =
> ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);
> (gdb) bt
> #0  0x006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
> vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
> #1  0x00693dfe in lazy_scan_heap (onerel=0x1ec2360, options=9,
> vacrelstats=0x1ef6e00, Irel=0x1ef7168, nindexes=2, aggressive=1 '\001')
> at vacuumlazy.c:1337
> #2  0x00691e66 in lazy_vacuum_rel (onerel=0x1ec2360, options=9,
> params=0x7ffe0f866310, bstrategy=0x1f1c4a8) at vacuumlazy.c:290
> #3  0x0069191f in vacuum_rel (relid=1247, relation=0x0, options=9,
> params=0x7ffe0f866310) at vacuum.c:1418


Those line numbers don't match my code.

Which commit are you based on?

My tree is (currently) based on 71f996d22125eb6cfdbee6094f44370aa8dec610


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-27 Thread Claudio Freire
On Tue, Dec 27, 2016 at 10:54 AM, Alvaro Herrera
<alvhe...@2ndquadrant.com> wrote:
> Anastasia Lubennikova wrote:
>
>> I ran configure using following set of flags:
>>  ./configure --enable-tap-tests --enable-cassert --enable-debug
>> --enable-depend CFLAGS="-O0 -g3 -fno-omit-frame-pointer"
>> And then ran make check. Here is the stacktrace:
>>
>> Program terminated with signal SIGSEGV, Segmentation fault.
>> #0  0x006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
>> vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
>> 1417tblk =
>> ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);
>
> This doesn't make sense, since the patch removes the "tupindex"
> variable in that function.

The variable is still there. It just has a slightly different meaning
(index within the current segment, rather than global index).


On Tue, Dec 27, 2016 at 10:41 AM, Anastasia Lubennikova
<a.lubennik...@postgrespro.ru> wrote:
> 23.12.2016 22:54, Claudio Freire:
>
> On Fri, Dec 23, 2016 at 1:39 PM, Anastasia Lubennikova
> <a.lubennik...@postgrespro.ru> wrote:
>
> I found the reason. I configure postgres with CFLAGS="-O0" and it causes
> Segfault on initdb.
> It works fine and passes tests with default configure flags, but I'm pretty
> sure that we should fix segfault before testing the feature.
> If you need it, I'll send a core dump.
>
> I just ran it with CFLAGS="-O0" and it passes all checks too:
>
> CFLAGS='-O0' ./configure --enable-debug --enable-cassert
> make clean && make -j8 && make check-world
>
> A stacktrace and a thorough description of your build environment
> would be helpful to understand why it breaks on your system.
>
>
> I ran configure using following set of flags:
>  ./configure --enable-tap-tests --enable-cassert --enable-debug
> --enable-depend CFLAGS="-O0 -g3 -fno-omit-frame-pointer"
> And then ran make check. Here is the stacktrace:

Same procedure runs fine on my end.

> core file is quite big, so I didn't attach it to the mail. You can download 
> it here: core dump file.

Can you attach your binary as well? (it needs to be identical to be
able to inspect the core dump, and quite clearly my build is
different)

I'll keep looking for ways it could crash there, but being unable to
reproduce the crash is a big hindrance, so if you can send the binary
that could help speed things up.


On Tue, Dec 27, 2016 at 10:41 AM, Anastasia Lubennikova
<a.lubennik...@postgrespro.ru> wrote:
> 1. prefetchBlkno = blkno & ~0x1f;
> prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0;
>
> I didn't get it what for we need these tricks. How does it differ from:
> prefetchBlkno = (blkno > 32) ? blkno - 32 : 0;

It makes all prefetches cover 32-block ranges aligned to 32-block boundaries.

It helps because it's at 32-block boundaries that the truncate stops to
check for lock conflicts and possibly abort, guaranteeing no prefetch
will be needless (if it goes into that code, it *will* read the next 32
blocks).

> 2. Why do we decrease prefetchBlckno twice?
>
> Here:
> +prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0;
> And here:
> if (prefetchBlkno >= 32)
> +prefetchBlkno -= 32;

The first one is outside the loop: it finds the first prefetch range
that is boundary-aligned, taking care not to cause underflow.

The second one is inside the loop: it moves the prefetch window down as
the truncate moves along. Since the window is already aligned, it
doesn't need to be realigned, just clamped to zero if it happens to
reach the bottom.
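
To make the two decrements concrete, here's a standalone toy (not patch
code) walking through the arithmetic for a 1000-page relation:

#include <stdio.h>

int
main(void)
{
    unsigned int blkno = 1000;      /* rel_pages */
    unsigned int prefetchBlkno;

    prefetchBlkno = blkno & ~0x1fU;                                 /* 992: aligned down to 32 */
    prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0;  /* 960 */

    /*
     * The backward scan reads blocks 999..961 without prefetching; once it
     * reaches block 960 (<= prefetchBlkno) the window moves down to 928 and
     * blocks 928..959 get prefetched before they are read.
     */
    printf("trigger at block %u, prefetching [%u, %u)\n",
           prefetchBlkno, prefetchBlkno - 32, prefetchBlkno);
    return 0;
}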


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-23 Thread Claudio Freire
On Fri, Dec 23, 2016 at 1:39 PM, Anastasia Lubennikova
<a.lubennik...@postgrespro.ru> wrote:
>> On Thu, Dec 22, 2016 at 12:22 PM, Claudio Freire <klaussfre...@gmail.com>
>> wrote:
>>>
>>> On Thu, Dec 22, 2016 at 12:15 PM, Anastasia Lubennikova
>>> <lubennikov...@gmail.com> wrote:
>>>>
>>>> The following review has been posted through the commitfest application:
>>>> make installcheck-world:  tested, failed
>>>> Implements feature:   not tested
>>>> Spec compliant:   not tested
>>>> Documentation:not tested
>>>>
>>>> Hi,
>>>> I haven't read through the thread yet, just tried to apply the patch and
>>>> run tests.
>>>> And it seems that the last attached version is outdated now. It doesn't
>>>> apply to the master
>>>> and after I've tried to fix merge conflict, it segfaults at initdb.
>>>
>>>
>>> I'll rebase when I get some time to do it and post an updated version
>>
>> Attached rebased patches. I called them both v3 to be consistent.
>>
>> I'm not sure how you ran it, but this works fine for me:
>>
>> ./configure --enable-debug --enable-cassert
>> make clean
>> make check
>>
>> ... after a while ...
>>
>> ===
>>   All 168 tests passed.
>> ===
>>
>> I reckon the above is equivalent to installcheck, but doesn't require
>> sudo or actually installing the server, so installcheck should work
>> assuming the install went ok.
>
>
> I found the reason. I configure postgres with CFLAGS="-O0" and it causes
> Segfault on initdb.
> It works fine and passes tests with default configure flags, but I'm pretty
> sure that we should fix segfault before testing the feature.
> If you need it, I'll send a core dump.

I just ran it with CFLAGS="-O0" and it passes all checks too:

CFLAGS='-O0' ./configure --enable-debug --enable-cassert
make clean && make -j8 && make check-world

A stacktrace and a thorough description of your build environment
would be helpful to understand why it breaks on your system.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-22 Thread Claudio Freire
On Thu, Dec 22, 2016 at 12:22 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Thu, Dec 22, 2016 at 12:15 PM, Anastasia Lubennikova
> <lubennikov...@gmail.com> wrote:
>> The following review has been posted through the commitfest application:
>> make installcheck-world:  tested, failed
>> Implements feature:   not tested
>> Spec compliant:   not tested
>> Documentation:not tested
>>
>> Hi,
>> I haven't read through the thread yet, just tried to apply the patch and run 
>> tests.
>> And it seems that the last attached version is outdated now. It doesn't 
>> apply to the master
>> and after I've tried to fix merge conflict, it segfaults at initdb.
>
>
> I'll rebase when I get some time to do it and post an updated version

Attached rebased patches. I called them both v3 to be consistent.

I'm not sure how you ran it, but this works fine for me:

./configure --enable-debug --enable-cassert
make clean
make check

... after a while ...

===
 All 168 tests passed.
===

I reckon the above is equivalent to installcheck, but doesn't require
sudo or actually installing the server, so installcheck should work
assuming the install went ok.
From f7c6a46ea2d721e472237e00fd2404081a4ddc75 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Fri, 2 Sep 2016 23:33:48 -0300
Subject: [PATCH 1/2] Vacuum: prefetch buffers on backward scan

This patch speeds up truncation by prefetching buffers during
the backward scan phase just prior to truncation. This optimization
helps rotating media because disk rotation induces a buffer-to-buffer
latency of roughly a whole rotation when reading backwards, such
disks are usually optimized for forward scans. Prefetching buffers
in larger chunks ameliorates the issue considerably and is otherwise
harmless. The prefetch amount should also not be an issue on SSD,
which won't benefit from the optimization, but won't be hurt either.
---
 src/backend/commands/vacuumlazy.c | 21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index b5fb325..854bce3 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1825,7 +1825,7 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 static BlockNumber
 count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 {
-	BlockNumber blkno;
+	BlockNumber blkno, prefetchBlkno;
 	instr_time	starttime;
 
 	/* Initialize the starttime if we check for conflicting lock requests */
@@ -1833,6 +1833,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 
 	/* Strange coding of loop control is needed because blkno is unsigned */
 	blkno = vacrelstats->rel_pages;
+	prefetchBlkno = blkno & ~0x1f;
+	prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0;
 	while (blkno > vacrelstats->nonempty_pages)
 	{
 		Buffer		buf;
@@ -1882,6 +1884,23 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 
 		blkno--;
 
+		/*
+		 * Speed up reads on rotating rust by prefetching a few blocks ahead.
+		 * Backward scans on rotary disks are slow if done one buffer at a time
+		 * because readahead won't kick in on most OSes, and each buffer will
+		 * have to wait for the platter to do a full rotation.
+		 * Should be harmless on SSD.
+		 */
+		if (blkno <= prefetchBlkno) {
+			BlockNumber pblkno;
+			if (prefetchBlkno >= 32)
+prefetchBlkno -= 32;
+			else
+prefetchBlkno = 0;
+			for (pblkno = prefetchBlkno; pblkno < blkno; pblkno++)
+PrefetchBuffer(onerel, MAIN_FORKNUM, pblkno);
+		}
+
 		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
 				 RBM_NORMAL, vac_strategy);
 
-- 
1.8.4.5

From 084805d565dced136903262cfad4fd395468a9bb Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 2/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.
---
 src/backend/commands/vacuumlazy.c | 295 +-
 1 file changed, 230 insertions(+), 65 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 854bce3..dfb0612 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -91,6 +91,7 @@
  * tables.
  */
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
+#define LAZY_MIN_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
 
 /*
  * Before we consider skipping a page that's marked as clean in
@@ -98,6 +

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-22 Thread Claudio Freire
On Thu, Dec 22, 2016 at 12:15 PM, Anastasia Lubennikova
 wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, failed
> Implements feature:   not tested
> Spec compliant:   not tested
> Documentation:not tested
>
> Hi,
> I haven't read through the thread yet, just tried to apply the patch and run 
> tests.
> And it seems that the last attached version is outdated now. It doesn't apply 
> to the master
> and after I've tried to fix merge conflict, it segfaults at initdb.


I'll rebase when I get some time to do it and post an updated version


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2016-11-21 Thread Claudio Freire
On Mon, Nov 21, 2016 at 5:42 PM, Peter Geoghegan  wrote:
>>> Or, you might want to make
>>> sure that B-Tree tuple insertions still find that "first page" to
>>> check, despite still generally treating heap item pointer as part of
>>> the key proper (in unique indexes, it can still behave like NULL, and
>>> not participate in uniqueness checking, while still essentially being
>>> part of the key/sort order).
>>
>> It behaves like that (even though it's not coded like that)
>
> Why not? What do you mean?
>
> What I'm interested in evaluating is an implementation where an index
> on (foo) has a sort order/key space that is precisely the same as an
> index on (foo, ctid), with zero exceptions.

Well, the patch does the same sorting, but without explicitly adding
the ctid to the scankey.

When inserting into a unique index, the lookup for the insertion point
proceeds as it would if the key were (foo, ctid), finding a page in the
middle of the range of pages that contain item pointers for foo.

When that happens on a regular index, the insertion is done at that
point and nothing else needs to be done. But when it happens on a
unique index, the code first checks to see if the first item pointer
for foo is on the same page (allegedly a frequent case). If so, the
uniqueness check is done in a very straightforward and efficient
manner. If not, however, the code retraces its steps up the tree to
the first time it followed a key other than foo (that's the only way
it could land at a middle page), and then follows the downlinks
looking at a truncated key (just foo, ignoring ctid).

Kind of what you say, but without the explicit null.

>
>>> It also occurs to me that our performance for updates in the event of
>>> many NULL values in indexes is probably really bad, and could be
>>> helped by this. You'll want to test all of this with a workload that
>>> isn't very sympathetic to HOT, of course -- most of these benefits are
>>> seen when HOT doesn't help.
>>
>> I haven't really tested with an overabundance of NULLs, I'll add that
>> to the tests. I have tested low-cardinality indexes though, but only
>> for correctness.
>
> I think we'll have to model data distribution to evaluate the idea
> well. After all, if there is a uniform distribution of values anyway,
> you're going to end up with many dirty leaf pages. Whereas, in the
> more realistic case where updates are localized, we can hope to better
> amortize the cost of inserting index tuples, and recycling index
> tuples.

Quite possibly

>> What do you mean with "first class part"?
>>
>> It's not included in the scankey for a variety of reasons, not the
>> least of which is not breaking the interface for current uses, and
>> because the tid type is quite limited in its capabilities ATM. Also,
>> the heap TID is usually there on most AM calls that care about it (ie:
>> insertions have it, of course), so adding it to the scankey also
>> seemed redundant.
>
> I mean that insertions into indexes that are low cardinality should be
> largely guided by TID comparisons.

...

>> If not adding to the scankey, what do you mean by "first class" then?
>
> Just what I said about the sort order of an index on "(foo)" now
> precisely matching what we'd get for "(foo, ctid)".

It is the case in both versions of the patch

> There are a couple
> of tricky issues with that that you'd have to look out for, like
> making sure that the high key continues to hold a real TID, which at a
> glance looks like something that just happens already. I'm not sure
> that we preserve the item pointer TID in all cases, since the current
> assumption is that a high key's TID is just filler -- maybe we can
> lose that at some point.

I thought long about that: inner pages don't need a real TID in their
keys, as those keys only specify key-space cutoff points and not real
tids, and the TIDs in leaf-page high keys are always real.

I haven't had any issue with that, aside from the headaches resulting
from thinking about that for protracted periods of time.

> You should use amcheck to specifically verify that that happens
> reliably in all cases. Presumably, its use of an insertion scankey
> would automatically see the use of TID as a tie-breaker with patched
> Postgres amcheck verification, and so amcheck will work for this
> purpose unmodified.

It's totally on my roadmap. I'll have to adapt it for the new index
format though, it doesn't work without modification.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-11-21 Thread Claudio Freire
On Mon, Nov 21, 2016 at 2:15 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Fri, Nov 18, 2016 at 6:54 AM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> On Thu, Nov 17, 2016 at 6:34 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>> On Thu, Nov 17, 2016 at 1:42 PM, Claudio Freire <klaussfre...@gmail.com> 
>>> wrote:
>>>> Attached is patch 0002 with pgindent applied over it
>>>>
>>>> I don't think there's any other formatting issue, but feel free to
>>>> point a finger to it if I missed any
>>>
>>> Hmm, I had imagined making all of the segments the same size rather
>>> than having the size grow exponentially.  The whole point of this is
>>> to save memory, and even in the worst case you don't end up with that
>>> many segments as long as you pick a reasonable base size (e.g. 64MB).
>>
>> Wastage is bound by a fraction of the total required RAM, that is,
>> it's proportional to the amount of required RAM, not the amount
>> allocated. So it should still be fine, and the exponential strategy
>> should improve lookup performance considerably.
>
> I'm concerned that it could use repalloc for large memory area when
> vacrelstats->dead_tuples.dead_tuple is bloated. It would be overhead
> and slow.

How large?

That array cannot be very large. It contains pointers to
exponentially-growing arrays, but the repalloc'd array itself is
small: one struct per segment; each segment starts at 128MB and grows
exponentially.

In fact, IIRC, it can be proven that such a repalloc strategy has an
amortized cost of O(log log n) per tuple. If it repalloc'd the whole
tid array it would be O(log n), but since it only handles pointers to
exponentially growing segments of tuples, it becomes O(log log n).

Furthermore, n there is still limited to INT_MAX, which means the cost
per tuple is bounded by O(log log 2^32) = 5. And that's an absolute
worst case that ignores the 128MB constant factor, which is indeed
relevant.
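
To make the shape of the structure concrete, here is a minimal sketch
of the idea, assuming doubling growth; the names are simplified
stand-ins for the patch's DeadTuplesMultiArray/DeadTuplesSegment (the
real definitions are in the attached 0002 patch further down), not the
actual code:

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * Sketch only: exponentially growing segments behind a small,
 * repalloc'd directory of segment descriptors.  The per-segment TID
 * arrays are never moved; only the tiny directory is repalloc'd when
 * a new segment is added.
 */
typedef struct TidSegment
{
	int			num_tuples;		/* entries used in this segment */
	int			max_tuples;		/* entries allocated in this segment */
	ItemPointer	tuples;			/* palloc'd TID array, never repalloc'd */
} TidSegment;

typedef struct TidMultiArray
{
	int			num_segs;		/* segments allocated so far */
	TidSegment *segs;			/* directory: one small struct per segment */
} TidMultiArray;

static void
tid_multiarray_add_segment(TidMultiArray *ma, int first_seg_tuples)
{
	int			newsize;

	if (ma->num_segs == 0)
	{
		newsize = first_seg_tuples;
		ma->segs = palloc(sizeof(TidSegment));
	}
	else
	{
		/* each new segment doubles the previous one */
		newsize = ma->segs[ma->num_segs - 1].max_tuples * 2;
		ma->segs = repalloc(ma->segs,
							(ma->num_segs + 1) * sizeof(TidSegment));
	}

	ma->segs[ma->num_segs].num_tuples = 0;
	ma->segs[ma->num_segs].max_tuples = newsize;
	ma->segs[ma->num_segs].tuples = palloc(newsize * sizeof(ItemPointerData));
	ma->num_segs++;
}

Only the directory of descriptors moves on growth; the TID arrays
themselves stay put, so no large memcpy of tids is ever involved.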

> What about using semi-fixed memroy space without repalloc;
> Allocate the array of ItemPointerData array, and each ItemPointerData
> array stores the dead tuple locations. The size of ItemPointerData
> array starts with small size (e.g. 16MB or 32MB). After we used an
> array up, we then allocate next segment with twice size as previous
> segment.

That's what the patch does.

> But to prevent over allocating memory, it would be better to
> set a limit of allocating size of ItemPointerData array to 512MB or
> 1GB.

There already is a 1GB limit: the maximum amount palloc can allocate
in a single chunk.
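
For reference, that ceiling is MaxAllocSize, defined in
src/include/utils/memutils.h:

#define MaxAllocSize	((Size) 0x3fffffff) /* 1 gigabyte - 1 */

A plain palloc request above that size fails, which is what caps the
dead-tuple array at 1GB today.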




Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2016-11-21 Thread Claudio Freire
On Fri, Nov 18, 2016 at 11:09 PM, Peter Geoghegan <p...@heroku.com> wrote:
> On Fri, Nov 18, 2016 at 5:27 PM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> I've been changing the on-disk format considerably, to one that makes
>> more sense.
>
> I can see how that would be necessary. I'm going to suggest some more
> things that need a new btree version number now, too.
>
> I realized today, quite suddenly, why I disliked your original patch:
> it didn't go far enough with embracing the idea of
> heap-item-pointer-as-key. In other words, I didn't like that you
> didn't change anything in the internal pages.

But it did. In fact it only changed internal pages.

> Maybe you should put
> heap TIDs someplace in new internal pages, so that even there we treat
> them as part of the key.

That is indeed what's happening (in both the old and new versions).

The new version also opens up the possibility of omitting the heap TID
in inner pages when it's redundant (ie: if two consecutive keys already
differ, the heap TID part of the key is redundant, since it's not
needed to make tree descent decisions).
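
To make the intended sort order concrete, here is a hedged sketch of a
key comparison with the TID tie-breaker; the function name and exact
shape are illustrative, not taken from the patch:

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * Sketch only: once the user-visible key columns compare equal, the
 * heap TID decides the order, so an index on (foo) sorts like
 * (foo, ctid).  A missing TID (e.g. an inner-page key where it was
 * omitted as redundant) simply doesn't participate in the comparison.
 */
static int
key_with_tid_compare(int keycmp, ItemPointer a, ItemPointer b)
{
	if (keycmp != 0)
		return keycmp;			/* key columns already decide the order */
	if (a == NULL || b == NULL)
		return 0;				/* TID omitted: treat as equal */
	return ItemPointerCompare(a, b);	/* heap TID breaks the tie */
}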

> This may significantly benefit UPDATE-heavy
> workloads where HOT doesn't work out so well. Particularly for
> non-unique indexes, we currently have to wade through a lot of bloat
> from the leaf level, rather than jumping straight to the correct leaf
> page when updating or inserting.

That is improved in the patch as well. The lookup of an insertion
point in a heavily bloated (unique or not) index is done efficiently,
instead of the O(N^2) way it used to be.

> You might not want to do the same with unique indexes, where we must
> initially buffer lock "the first leaf page that a duplicate could be
> on" while we potentially look in one or more additional sibling pages
> (but bloat won't be so bad there, perhaps).

It's done even for unique indexes. Locking in that case becomes
complex, true, but I believe it's not an insurmountable problem.

> Or, you might want to make
> sure that B-Tree tuple insertions still find that "first page" to
> check, despite still generally treating heap item pointer as part of
> the key proper (in unique indexes, it can still behave like NULL, and
> not participate in uniqueness checking, while still essentially being
> part of the key/sort order).

It behaves like that (even though it's not coded that way).

> It also occurs to me that our performance for updates in the event of
> many NULL values in indexes is probably really bad, and could be
> helped by this. You'll want to test all of this with a workload that
> isn't very sympathetic to HOT, of course -- most of these benefits are
> seen when HOT doesn't help.

I haven't really tested with an overabundance of NULLs; I'll add that
to the tests. I have tested low-cardinality indexes, though only for
correctness.

>> However, I haven't tested the current state of the patch thoroughly.
>>
>> If a WIP is fine, I can post the what I have, rebased.
>
> I'm mostly curious about the effects on bloat. I now feel like you
> haven't sufficiently examined the potential benefits there, since you
> never made heap item pointer a first class part of the key space.
> Maybe you'd like to look into that yourself first?

What do you mean by "first class part"?

It's not included in the scankey for a variety of reasons, not the
least of which is not breaking the interface for current uses, and
because the tid type is quite limited in its capabilities ATM. Also,
the heap TID is usually there on most AM calls that care about it (ie:
insertions have it, of course), so adding it to the scankey also
seemed redundant.

If not adding to the scankey, what do you mean by "first class" then?




Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2016-11-18 Thread Claudio Freire
On Fri, Nov 18, 2016 at 10:05 PM, Peter Geoghegan  wrote:
> On Thu, Aug 18, 2016 at 2:15 PM, Peter Geoghegan  wrote:
>> I think that this is a bad idea. We need to implement suffix
>> truncation of internal page index tuples at some point, to make them
>> contain less information from the original leaf page index tuple.
>> That's an important optimization, because it increases fan-in. This
>> seems like a move in the opposite direction.
>
> Maybe I was too hasty, here. Can you rebase this patch, Claudio?

I've been changing the on-disk format considerably, to one that makes
more sense.

I haven't posted it because it still doesn't have suffix (and prefix)
truncation, but it should be compatible with it (ie: it could be
implemented as an optimization that doesn't change the on-disk
format).

However, I haven't tested the current state of the patch thoroughly.

If a WIP is fine, I can post what I have, rebased.

> It's possible that this idea could have further non-obvious benefits.
> I see some potential for this to help with nbtree page deletion, item
> recycling, etc. For example, maybe VACUUM can manage to do something
> much closer to a regular index scan when that seems to make sense --

That was on my mind too.

> I also think that this might help with bitmap index scans.

How so?

AFAIK it helps regular scans on low-cardinality indexes more than
bitmap index scans. With low-cardinality indexes, the resulting heap
access pattern becomes an unordered series of sequential range scans,
which is better than the fully random access the current layout causes.
Bitmap index scans, however, already order heap access themselves, so
they don't see that benefit.




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-11-17 Thread Claudio Freire
On Thu, Nov 17, 2016 at 6:34 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Nov 17, 2016 at 1:42 PM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> Attached is patch 0002 with pgindent applied over it
>>
>> I don't think there's any other formatting issue, but feel free to
>> point a finger to it if I missed any
>
> Hmm, I had imagined making all of the segments the same size rather
> than having the size grow exponentially.  The whole point of this is
> to save memory, and even in the worst case you don't end up with that
> many segments as long as you pick a reasonable base size (e.g. 64MB).

Wastage is bound by a fraction of the total required RAM, that is,
it's proportional to the amount of required RAM, not the amount
allocated. So it should still be fine, and the exponential strategy
should improve lookup performance considerably.




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-11-17 Thread Claudio Freire
On Thu, Nov 17, 2016 at 2:51 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Thu, Nov 17, 2016 at 2:34 PM, Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
>> I glanced at the patches but the both patches don't obey the coding
>> style of PostgreSQL.
>> Please refer to [1].
>>
>> [1] 
>> http://wiki.postgresql.org/wiki/Developer_FAQ#What.27s_the_formatting_style_used_in_PostgreSQL_source_code.3F.
>
> I thought I had. I'll go through that list to check what I missed.

Attached is patch 0002 with pgindent applied over it

I don't think there's any other formatting issue, but feel free to
point a finger to it if I missed any
From b9782cef0d971de341115bd3474d192419674219 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 2/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.
---
 src/backend/commands/vacuumlazy.c | 295 +-
 1 file changed, 230 insertions(+), 65 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 854bce3..dfb0612 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -91,6 +91,7 @@
  * tables.
  */
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
+#define LAZY_MIN_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
 
 /*
  * Before we consider skipping a page that's marked as clean in
@@ -98,6 +99,27 @@
  */
 #define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
 
+typedef struct DeadTuplesSegment
+{
+	int			num_dead_tuples;	/* # of entries in the segment */
+	int			max_dead_tuples;	/* # of entries allocated in the segment */
+	ItemPointerData last_dead_tuple;	/* Copy of the last dead tuple (unset
+		 * until the segment is fully
+		 * populated) */
+	unsigned short padding;
+	ItemPointer dead_tuples;	/* Array of dead tuples */
+}	DeadTuplesSegment;
+
+typedef struct DeadTuplesMultiArray
+{
+	int			num_entries;	/* current # of entries */
+	int			max_entries;	/* total # of slots that can be allocated in
+ * array */
+	int			num_segs;		/* number of dead tuple segments allocated */
+	int			last_seg;		/* last dead tuple segment with data (or 0) */
+	DeadTuplesSegment *dead_tuples;		/* array of num_segs segments */
+}	DeadTuplesMultiArray;
+
 typedef struct LVRelStats
 {
 	/* hasindex = true means two-pass strategy; false means one-pass */
@@ -117,14 +139,14 @@ typedef struct LVRelStats
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
 	/* List of TIDs of tuples we intend to delete */
 	/* NB: this list is ordered by TID address */
-	int			num_dead_tuples;	/* current # of entries */
-	int			max_dead_tuples;	/* # slots allocated in array */
-	ItemPointer dead_tuples;	/* array of ItemPointerData */
+	DeadTuplesMultiArray dead_tuples;
 	int			num_index_scans;
 	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
 } LVRelStats;
 
+#define DeadTuplesCurrentSegment(lvrelstats) (&((lvrelstats)->dead_tuples.dead_tuples[ \
+	(lvrelstats)->dead_tuples.last_seg ]))
 
 /* A few variables that don't seem worth passing around as parameters */
 static int	elevel = -1;
@@ -149,7 +171,7 @@ static void lazy_cleanup_index(Relation indrel,
    IndexBulkDeleteResult *stats,
    LVRelStats *vacrelstats);
 static int lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
- int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
+ int tupindex, LVRelStats *vacrelstats, DeadTuplesSegment * seg, Buffer *vmbuffer);
 static bool should_attempt_truncation(LVRelStats *vacrelstats);
 static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
 static BlockNumber count_nondeletable_pages(Relation onerel,
@@ -157,8 +179,9 @@ static BlockNumber count_nondeletable_pages(Relation onerel,
 static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
 static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
 	   ItemPointer itemptr);
+static void lazy_clear_dead_tuples(LVRelStats *vacrelstats);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
-static int	vac_cmp_itemptr(const void *left, const void *right);
+static inline int vac_cmp_itemptr(const void *left, const void *right);
 static bool heap_page_is_all_visible(Relation rel, Buffer buf,
 	 TransactionId *visibility_cutoff_xid, bool *all_frozen);
 
@@ -498,7 +521,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	/* Report that we're scanning the heap, advertising total # of blocks */
 	initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
 	initprog_val[1] = nblocks;

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-11-17 Thread Claudio Freire
On Thu, Nov 17, 2016 at 2:34 PM, Masahiko Sawada  wrote:
> I glanced at the patches but the both patches don't obey the coding
> style of PostgreSQL.
> Please refer to [1].
>
> [1] 
> http://wiki.postgresql.org/wiki/Developer_FAQ#What.27s_the_formatting_style_used_in_PostgreSQL_source_code.3F.

I thought I had. I'll go through that list to check what I missed.

>> In the results archive, the .vacuum prefix is the patched version with
>> both patch 1 and 2, .git.ref is just patch 1 (without which the
>> truncate takes unholily long).
>
> Did you measure the performance benefit of 0001 patch by comparing
> HEAD with HEAD+0001 patch?

Not the whole test, but yes. Without the 0001 patch the backward scan
step during truncate runs between 3 and 5 times slower. I don't have
timings because the test never finished without the patch: it would
have finished eventually, but it would have taken about a day.

>> Grepping the results a bit, picking an average run out of all runs on
>> each scale:
>>
>> Timings:
>>
>> Patched:
>>
>> s100: CPU: user: 3.21 s, system: 1.54 s, elapsed: 18.95 s.
>> s400: CPU: user: 14.03 s, system: 6.35 s, elapsed: 107.71 s.
>> s4000: CPU: user: 228.17 s, system: 108.33 s, elapsed: 3017.30 s.
>>
>> Unpatched:
>>
>> s100: CPU: user: 3.39 s, system: 1.64 s, elapsed: 18.67 s.
>> s400: CPU: user: 15.39 s, system: 7.03 s, elapsed: 114.91 s.
>> s4000: CPU: user: 282.21 s, system: 105.95 s, elapsed: 3017.28 s.
>>
>> Total I/O (in MB)
>>
>> Patched:
>>
>> s100: R:2.4 - W:5862
>> s400: R:1337.4 - W:29385.6
>> s4000: R:318631 - W:370154
>>
>> Unpatched:
>>
>> s100: R:1412.4 - W:7644.6
>> s400: R:3180.6 - W:36281.4
>> s4000: R:330683 - W:370391
>>
>>
>> So, in essence, CPU time didn't get adversely affected. If anything,
>> it got improved by about 20% on the biggest case (scale 4000).
>
> And this test case deletes all tuples in relation and then measure
> duration of vacuum.
> It would not be effect much in practical use case.

Well, this patch set started because I had to do exactly that, and I
realized just how inefficient vacuum was in that case.

But that doesn't mean it won't benefit more realistic use cases. Almost
any big database ends up hitting this 1GB limit, because big tables
take a very long time to vacuum and accumulate a lot of bloat between
vacuums.

If you have a specific test case in mind I can try to run it.

>> While total runtime didn't change much, I believe this is only due to
>> the fact that the index is perfectly correlated (clustered?) since
>> it's a pristine index, so index scans either remove or skip full
>> pages, never leaving things half-way. A bloated index would probably
>> show a substantially different behavior, I'll try to get a script that
>> does it by running pgbench a while before the vacuum or something like
>> that.
>>
>> However, the total I/O metric already shows remarkable improvement.
>> This metric is measuring all the I/O including pgbench init, the
>> initial vacuum pgbench init does, the delete and the final vacuum. So
>> it's not just the I/O for the vacuum itself, but the whole run. We can
>> see the patched version reading a lot less (less scans over the
>> indexes will do that), and in some cases writing less too (again, less
>> index scans may be performing less redundant writes when cleaning
>> up/reclaiming index pages).
>
> What value of maintenance_work_mem did you use for this test?

4GB on both, patched and HEAD.

> Since
> DeadTuplesSegment struct still stores array of ItemPointerData(6byte)
> representing dead tuple I supposed that the frequency of index vacuum
> does not change. But according to the test result, a index vacuum is
> invoked once and removes 4 rows at the time. It means that the
> vacuum stored about 2289 MB memory during heap vacuum. On the other
> side, the result of test without 0002 patch show that a index vacuum
> remove 178956737 rows at the time, which means 1GB memory was used.

1GB is a hardcoded limit in HEAD for vacuum work mem. This patch
removes that limit and lets vacuum use all the memory the user gave it.

In the test, 4GB was used as the maintenance_work_mem value in both
cases, but HEAD cannot use all of the 4GB; it limits itself to just
1GB.




Re: [HACKERS] CLUSTER, reform_and_rewrite_tuple(), and parallelism

2016-10-27 Thread Claudio Freire
On Wed, Oct 26, 2016 at 9:33 PM, Peter Geoghegan  wrote:
> Besides, parallel CREATE INDEX alone will probably
> be quite effective at speeding up CLUSTER in practice, simply because
> that's often where most wall clock time is spent during a CLUSTER
> operation.

Creating all indexes in parallel could also help considerably when
there are multiple indexes, and seems far simpler.




Re: [HACKERS] Indirect indexes

2016-10-20 Thread Claudio Freire
On Thu, Oct 20, 2016 at 1:08 PM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
> On Thu, Oct 20, 2016 at 9:20 PM, Claudio Freire <klaussfre...@gmail.com>
> wrote:
>>
>>
>>
>> With indirect indexes, since you don't need to insert a tid, you can
>> just "insert on conflict do nothing" on the index.
>
>
> Would that work with non-unique indexes? Anyways, the point I was trying to
> make is that there are a similar technical challenges and we could solve it
> for WARM as well with your work for finding conflicting <key, tid> pairs and
> then not doing inserts. My thinking currently is that it will lead to other
> challenges, especially around vacuum, but I could be wrong.

Consider this:

Have a vacuum_cycle_id field in the metapage, with one bit reserved to
indicate whether a vacuum is in progress.

While a vacuum is in progress on the index, every modification looks up
the <key, pk> entry and stores the current vacuum_cycle_id in the
otherwise unused space for the tid pointer on the index entry. When no
vacuum is in progress, only new entries get added (stamped with the
current vacuum cycle id). So indirect indexes incur a cost similar to
that of regular indexes, but only while a vacuum is running.

When vacuuming, allocate half of maintenance_work_mem for a bloom
filter, bump the vacuum cycle ids (on each index's metapage), and mark
a vacuum in progress.

Scan the heap, adding the <key, pk> pairs of *non-dead* tuples to the
bloom filter. That's one BF per index, sadly, but bear with me.

Then scan the indexes: <key, pk> pairs *not* in the BF that carry the
*old* vacuum cycle id get removed.

Finally, clear the vacuum-in-progress flag on all indexes' metapages.

The only drawback here is that maintenance_work_mem dictates the
amount of uncleanable waste left in the indexes (BF false positives).
Sure, the BF could be replaced with an exact set rather than an
approximate one, but that could require a lot of memory if keys are
big, plus a lot of scans of the indexes. The BF trick bounds the amount
of waste left behind while minimizing work.
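
A rough sketch of those two passes, with hypothetical helpers
(bloom_create/bloom_add/bloom_might_contain and the scan callbacks)
standing in for whatever the real implementation would use; only the
control flow is the point:

#include "postgres.h"

/* Hypothetical bloom filter API -- not an existing PostgreSQL facility. */
typedef struct BloomFilter BloomFilter;
extern BloomFilter *bloom_create(Size work_mem_bytes);
extern void bloom_add(BloomFilter *bf, const char *key_pk, Size len);
extern bool bloom_might_contain(BloomFilter *bf, const char *key_pk, Size len);

/* Hypothetical iteration/removal hooks over the heap and the indirect index. */
typedef struct IndexEntry
{
	uint32		vacuum_cycle_id;	/* cycle id stamped on the entry */
	const char *key_pk;				/* serialized <key, pk> pair */
	Size		key_pk_len;
} IndexEntry;

extern bool heap_next_live_key_pk(void *heapscan, const char **key_pk, Size *len);
extern bool index_next_entry(void *indexscan, IndexEntry *entry);
extern void index_remove_entry(void *indexscan, IndexEntry *entry);

static void
indirect_index_vacuum(void *heapscan, void *indexscan,
					  uint32 old_cycle_id, Size bf_mem)
{
	BloomFilter *bf = bloom_create(bf_mem);
	const char *key_pk;
	Size		len;
	IndexEntry	entry;

	/* Pass 1: remember every <key, pk> pair still live in the heap. */
	while (heap_next_live_key_pk(heapscan, &key_pk, &len))
		bloom_add(bf, key_pk, len);

	/*
	 * Pass 2: an index entry may be removed only if it is stamped with
	 * the old vacuum cycle id (so it predates this vacuum) and it is
	 * definitely absent from the heap (not in the BF).  False positives
	 * merely leave some waste behind; a live entry is never removed.
	 */
	while (index_next_entry(indexscan, &entry))
	{
		if (entry.vacuum_cycle_id == old_cycle_id &&
			!bloom_might_contain(bf, entry.key_pk, entry.key_pk_len))
			index_remove_entry(indexscan, &entry);
	}
}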




Re: [HACKERS] Indirect indexes

2016-10-20 Thread Claudio Freire
On Thu, Oct 20, 2016 at 12:30 PM, Pavan Deolasee
 wrote:
>
>
> On Thu, Oct 20, 2016 at 8:44 PM, Petr Jelinek  wrote:
>>
>>
>>
>> WARM can do WARM update 50% of time, indirect index can do HOT update
>> 100% of time (provided the column is not changed), I don't see why we
>> could not have both solutions.
>>
>
> I think the reason why I restricted WARM to one update per chain, also
> applies to indirect index. For example, if a indirect column value is
> changed from 'a' to 'b' and back to 'a', there will be two pointers from 'a'
> to the PK and AFAICS that would lead to the same duplicate scan issue.
>
> We have a design to convert WARM chains back to HOT and that will increase
> the percentage of WARM updates much beyond 50%. I was waiting for feedback
> on the basic patch before putting in more efforts, but it went unnoticed
> last CF.

With indirect indexes, since you don't need to insert a tid, you can
just "insert on conflict do nothing" on the index.



