Re: [HACKERS] Small improvement to compactify_tuples

2017-11-08 Thread Юрий Соколов
2017-11-08 20:02 GMT+03:00 Tom Lane :
>
> Claudio Freire  writes:
> > What's perhaps not clear is whether there are better ideas. Like
> > rebuilding the page as Tom proposes, which doesn't seem like a bad
> > idea. Bucket sort already is O(bytes), just as memcopy, only it has a
> > lower constant factor (it's bytes/256 in the original patch), which
> > might make copying the whole page an extra time lose against bucket
> > sort in a few cases.
>
> > Deciding that last point does need more benchmarking. That doesn't
> > mean the other improvements can't be pursued in the meanwhile, right?
>
> Well, I doubt we're going to end up committing more than one of these
> ideas.  The question is which way is best.  If people are willing to
> put in the work to test all of them, let's do it.
>
> BTW, it strikes me that in considering the rebuild-the-page approach,
> we should not have blinders on and just measure the speed of
> PageRepairFragmentation.  Rather, we should take a look at what happens
> subsequently given a physically-ordered set of tuples.  I can recall
> Andres or someone moaning awhile ago about lack of locality of access in
> index page searches --- maybe applying that approach while vacuuming
> indexes will help?
>
> regards, tom lane

I'd like to add qsort_template.h as Claudio suggested, i.e. in a way similar to
simplehash.h. With such a template header, there will be no need for
gen_qsort_tuple.pl.

With regards,
Sokolov Yura


Re: [HACKERS] Small improvement to compactify_tuples

2017-11-07 Thread Юрий Соколов
2017-11-08 1:11 GMT+03:00 Peter Geoghegan :
>
> The same is true of unique indexes vs. non-unique.

Off-topic: I recently had a look at setting LP_DEAD in indexes.
I didn't find a huge difference between unique and non-unique indexes.
There is a code path that works only for unique indexes, but it is called less
frequently than the common code path that also sets LP_DEAD.


Re: [HACKERS] Small improvement to compactify_tuples

2017-11-07 Thread Юрий Соколов
2017-11-07 17:15 GMT+03:00 Claudio Freire <klaussfre...@gmail.com>:
>
> On Mon, Nov 6, 2017 at 9:08 PM, Юрий Соколов <funny.fal...@gmail.com> wrote:
> > 2017-11-07 1:14 GMT+03:00 Claudio Freire <klaussfre...@gmail.com>:
> >>
> >> I haven't seen this trick used in postgres, nor do I know whether it
> >> would be well received, so this is more like throwing an idea to see
> >> if it sticks...
> >>
> >> But a way to do this without macros is to have an includable
> >> "template" algorithm that simply doesn't define the comparison
> >> function/type, it rather assumes it:
> >>
> >> qsort_template.h
> >>
> >> #define QSORT_NAME qsort_ ## QSORT_SUFFIX
> >>
> >> static void QSORT_NAME(ELEM_TYPE arr, size_t num_elems)
> >> {
> >> ... if (ELEM_LESS(arr[a], arr[b]))
> >> ...
> >> }
> >>
> >> #undef QSORT_NAME
> >>
> >> Then, in "offset_qsort.h":
> >>
> >> #define QSORT_SUFFIX offset
> >> #define ELEM_TYPE offset
> >> #define ELEM_LESS(a,b) ((a) < (b))
> >>
> >> #include "qsort_template.h"
> >>
> >> #undef QSORT_SUFFIX
> >> #undef ELEM_TYPE
> >> #undef ELEM_LESS
> >>
> >> Now, I realize this may have its cons, but it does simplify
> >> maintenance of type-specific or parameterized variants of
> >> performance-critical functions.
> >>
> >> > I can do specialized qsort for this case. But it will be larger bunch
> >> > of code, than shell sort.
> >> >
> >> >> And I'd recommend doing that when there is a need, and I don't think
> >> >> this patch really needs it, since bucket sort handles most cases
> >> >> anyway.
> >> >
> >> > And it still needs insertion sort for buckets.
> >> > I can agree to get rid of shell sort. But insertion sort is necessary.
> >>
> >> I didn't suggest getting rid of insertion sort. But the trick above is
> >> equally applicable to insertion sort.
> >
> > This trick is used in simplehash.h . I agree, it could be useful for qsort.
> > This will not make qsort inlineable, but will reduce overhead much.
> >
> > This trick is too heavy-weight for insertion sort alone, though. Without
> > shellsort, insertion sort could be expressed in 14 line macros ( 8 lines
> > without curly braces). But if insertion sort will be defined together with
> > qsort (because qsort still needs it), then it is justifiable.
>
> What do you mean by heavy-weight?


I mean, I've already made a reusable sort implementation with macros
that is called like a function (with a type parameter). If we are talking
only about insertion sort, then such a macro looks much prettier than
including a file.

But qsort is better implemented with an included template header.
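
As a rough illustration of the included-template approach (this is not code
from the patch), the sketch below simulates it in a single file. The middle
part is what a header such as qsort_template.h could contain, and the parts
around it are what the including file would write. All ST_* names and the
uint16_t element type are assumptions made for the example, and only the
insertion sort is generated to keep it short; a quicksort body would be
parameterized the same way. Note the two-level paste helper: a plain
`qsort_ ## QSORT_SUFFIX` pastes the literal token QSORT_SUFFIX instead of
its expansion.

```c
#include <stddef.h>
#include <stdint.h>

/* ---- what the including file would define (hypothetical names) ---- */
#define ST_SUFFIX offset
#define ST_ELEM_TYPE uint16_t
#define ST_ELEM_LESS(a, b) ((a) < (b))

/* ---- what the template header itself could contain ---- */
/* Two-level paste so that ST_SUFFIX is expanded before ## is applied. */
#define ST_PASTE_(a, b) a##b
#define ST_PASTE(a, b) ST_PASTE_(a, b)
#define ST_INSERTION_SORT ST_PASTE(insertion_sort_, ST_SUFFIX)

/* Expands to: static inline void insertion_sort_offset(uint16_t *, size_t) */
static inline void
ST_INSERTION_SORT(ST_ELEM_TYPE *arr, size_t n)
{
    for (size_t i = 1; i < n; i++)
    {
        ST_ELEM_TYPE tmp = arr[i];
        size_t      j = i;

        /* shift larger elements right until tmp's slot is found */
        while (j > 0 && ST_ELEM_LESS(tmp, arr[j - 1]))
        {
            arr[j] = arr[j - 1];
            j--;
        }
        arr[j] = tmp;
    }
}

#undef ST_INSERTION_SORT
#undef ST_PASTE
#undef ST_PASTE_

/* ---- back in the including file ---- */
#undef ST_SUFFIX
#undef ST_ELEM_TYPE
#undef ST_ELEM_LESS
```

After this, insertion_sort_offset() is an ordinary static function, and the
compiler is free to inline both it and the comparison.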

BTW, there is an example of defining many functions with a call to a template
macro instead of including a template header:
https://github.com/attractivechaos/klib/blob/master/khash.h
But it looks ugly.

>
> Aside from requiring all that include magic, if you place specialized
> sort functions in a reusable header, using it is as simple as
> including the type-specific header (or declaring the type macros and
> including the template), and using them as regular functions. There's
> no runtime overhead involved, especially if you declare the comparison
> function as a macro or a static inline function. The sort itself can
> be declared static inline as well, and the compiler will decide
> whether it's worth inlining.

OK, if no one objects to yet another qsort implementation,
I will add a template header for qsort. Since qsort needs insertion sort,
it will be in the same file.
Do you approve of this?

With regards,
Sokolov Yura


Re: [HACKERS] Small improvement to compactify_tuples

2017-11-06 Thread Юрий Соколов
2017-11-07 1:14 GMT+03:00 Claudio Freire <klaussfre...@gmail.com>:
>
> On Mon, Nov 6, 2017 at 6:58 PM, Юрий Соколов <funny.fal...@gmail.com> wrote:
> >
> > 2017-11-06 17:55 GMT+03:00 Claudio Freire <klaussfre...@gmail.com>:
> >>
> >> On Mon, Nov 6, 2017 at 11:50 AM, Юрий Соколов <funny.fal...@gmail.com>
> >> wrote:
> >> >> Maybe leave a fallback to qsort if some corner case produces big
> >> >> buckets?
> >> >
> >> > For 8kb pages, each bucket is per 32 bytes. So, for heap pages it is
> >> > at most 1 heap-tuple per bucket, and for index pages it is at most 2
> >> > index tuples per bucket. For 32kb pages it is 4 heap-tuples and 8
> >> > index-tuples per bucket.
> >> > It will be unnecessary overhead to call non-inlineable qsort in this
> >> > cases
> >> >
> >> > So, I think, shell sort could be removed, but insertion sort have to
> >> > remain.
> >> >
> >> > I'd prefer shell sort to remain also. It could be useful in other
> >> > places also, because it is easily inlinable, and provides comparable
> >> > to qsort performance up to several hundreds of elements.
> >>
> >> I'd rather have an inlineable qsort.
> >
> > But qsort is recursive. It is quite hard to make it inlineable. And
> > still it will be much heavier than insertion sort (btw, all qsort
> > implementations uses insertion sort for small arrays). And it will be
> > heavier than shell sort for small arrays.
>
> I haven't seen this trick used in postgres, nor do I know whether it
> would be well received, so this is more like throwing an idea to see
> if it sticks...
>
> But a way to do this without macros is to have an includable
> "template" algorithm that simply doesn't define the comparison
> function/type, it rather assumes it:
>
> qsort_template.h
>
> #define QSORT_NAME qsort_ ## QSORT_SUFFIX
>
> static void QSORT_NAME(ELEM_TYPE arr, size_t num_elems)
> {
> ... if (ELEM_LESS(arr[a], arr[b]))
> ...
> }
>
> #undef QSORT_NAME
>
> Then, in "offset_qsort.h":
>
> #define QSORT_SUFFIX offset
> #define ELEM_TYPE offset
> #define ELEM_LESS(a,b) ((a) < (b))
>
> #include "qsort_template.h"
>
> #undef QSORT_SUFFIX
> #undef ELEM_TYPE
> #undef ELEM_LESS
>
> Now, I realize this may have its cons, but it does simplify
> maintenance of type-specific or parameterized variants of
> performance-critical functions.
>
> > I can do specialized qsort for this case. But it will be larger bunch of
> > code, than
> > shell sort.
> >
> >> And I'd recommend doing that when there is a need, and I don't think
> >> this patch really needs it, since bucket sort handles most cases
> >> anyway.
> >
> > And it still needs insertion sort for buckets.
> > I can agree to get rid of shell sort. But insertion sort is necessary.
>
> I didn't suggest getting rid of insertion sort. But the trick above is
> equally applicable to insertion sort.

This trick is used in simplehash.h. I agree, it could be useful for qsort.
It will not make qsort inlineable, but it will reduce the overhead a lot.

This trick is too heavyweight for insertion sort alone, though. Without
shell sort, insertion sort can be expressed as a 14-line macro (8 lines
without curly braces). But if insertion sort is defined together with
qsort (because qsort still needs it), then it is justifiable.
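
For comparison, a macro-only insertion sort that takes the element type as a
parameter might look roughly like the sketch below. It is not the macro from
the patch; INSERTION_SORT, OFFSET_LESS and sort_offsets are hypothetical
names, and the is_ prefix on locals is only there to reduce the chance of
clashing with the caller's variables.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical macro: sorts arr[0..n-1] of element type `type` in ascending
 * order. `less` is a macro or function taking two elements.
 */
#define INSERTION_SORT(type, arr, n, less) \
    do { \
        type   *is_a = (arr); \
        size_t  is_n = (n); \
        for (size_t is_i = 1; is_i < is_n; is_i++) \
        { \
            type    is_tmp = is_a[is_i]; \
            size_t  is_j = is_i; \
            /* shift larger elements right until the slot is found */ \
            while (is_j > 0 && less(is_tmp, is_a[is_j - 1])) \
            { \
                is_a[is_j] = is_a[is_j - 1]; \
                is_j--; \
            } \
            is_a[is_j] = is_tmp; \
        } \
    } while (0)

#define OFFSET_LESS(a, b) ((a) < (b))

/* The macro is used like a function call with a type argument. */
static inline void
sort_offsets(uint16_t *offs, size_t n)
{
    INSERTION_SORT(uint16_t, offs, n, OFFSET_LESS);
}
```

The whole sort expands at the call site, so there is nothing to include and
nothing to #undef, which is why it feels lighter for this one small algorithm.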


Re: [HACKERS] Small improvement to compactify_tuples

2017-11-06 Thread Юрий Соколов
2017-11-06 17:55 GMT+03:00 Claudio Freire <klaussfre...@gmail.com>:
>
> On Mon, Nov 6, 2017 at 11:50 AM, Юрий Соколов <funny.fal...@gmail.com> wrote:
> >> Maybe leave a fallback to qsort if some corner case produces big buckets?
> >
> > For 8kb pages, each bucket is per 32 bytes. So, for heap pages it is at
> > most 1 heap-tuple per bucket, and for index pages it is at most 2 index
> > tuples per bucket. For 32kb pages it is 4 heap-tuples and 8 index-tuples
> > per bucket.
> > It will be unnecessary overhead to call non-inlineable qsort in this cases
> >
> > So, I think, shell sort could be removed, but insertion sort have to remain.
> >
> > I'd prefer shell sort to remain also. It could be useful in other places
> > also, because it is easily inlinable, and provides comparable to qsort
> > performance up to several hundreds of elements.
>
> I'd rather have an inlineable qsort.

But qsort is recursive, so it is quite hard to make it inlineable. And it
will still be much heavier than insertion sort (btw, all qsort implementations
use insertion sort for small arrays). And it will be heavier than shell sort
for small arrays.

I can write a specialized qsort for this case, but it will be a larger chunk of
code than shell sort.
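
To give a feeling for the size difference, here is a minimal sketch of what a
qsort specialized for uint16_t offsets could look like. It is not pg_qsort and
not part of the patch; the function names and the cutoff of 8 are assumptions.
It shows the two points made above: the recursion, and the insertion sort that
finishes small subarrays.

```c
#include <stddef.h>
#include <stdint.h>

#define SMALL_ARRAY_CUTOFF 8    /* assumed threshold, not taken from pg_qsort */

static void
offset_insertion_sort(uint16_t *a, ptrdiff_t n)
{
    for (ptrdiff_t i = 1; i < n; i++)
    {
        uint16_t    tmp = a[i];
        ptrdiff_t   j = i;

        while (j > 0 && tmp < a[j - 1])
        {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = tmp;
    }
}

static inline uint16_t
median3(uint16_t x, uint16_t y, uint16_t z)
{
    if (x < y)
        return (y < z) ? y : ((x < z) ? z : x);
    return (x < z) ? x : ((y < z) ? z : y);
}

static void
offset_qsort(uint16_t *a, ptrdiff_t n)
{
    while (n > SMALL_ARRAY_CUTOFF)
    {
        uint16_t    pivot = median3(a[0], a[n / 2], a[n - 1]);
        ptrdiff_t   i = 0;
        ptrdiff_t   j = n - 1;

        /* Hoare-style partition around the pivot value */
        while (i <= j)
        {
            while (a[i] < pivot)
                i++;
            while (a[j] > pivot)
                j--;
            if (i <= j)
            {
                uint16_t    t = a[i];

                a[i] = a[j];
                a[j] = t;
                i++;
                j--;
            }
        }

        /* a[0..j] <= pivot <= a[i..n-1]: recurse on the smaller part */
        if (j + 1 < n - i)
        {
            offset_qsort(a, j + 1);
            a += i;
            n -= i;
        }
        else
        {
            offset_qsort(a + i, n - i);
            n = j + 1;
        }
    }
    /* small remainder: finish with insertion sort */
    offset_insertion_sort(a, n);
}
```

Even in this stripped-down form it is several times longer than an insertion
or shell sort, and the recursive call is what keeps the compiler from simply
inlining the whole thing.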

> And I'd recommend doing that when there is a need, and I don't think
> this patch really needs it, since bucket sort handles most cases
> anyway.

And it still needs insertion sort for buckets.
I can agree to get rid of shell sort. But insertion sort is necessary.


Re: [HACKERS] Small improvement to compactify_tuples

2017-11-06 Thread Юрий Соколов
2017-11-05 20:44 GMT+03:00 Claudio Freire <klaussfre...@gmail.com>:
>
> On Sat, Nov 4, 2017 at 8:07 PM, Юрий Соколов <funny.fal...@gmail.com> wrote:
> > 2017-11-03 5:46 GMT+03:00 Tom Lane <t...@sss.pgh.pa.us>:
> >>
> >> Sokolov Yura <funny.fal...@postgrespro.ru> writes:
> >> > [ 0001-Improve-compactify_tuples.patch, v5 or thereabouts ]
> >>
> >> I went to check the shellsort algorithm against Wikipedia's entry,
> >> and found that this appears to be an incorrect implementation of
> >> shellsort: where pg_shell_sort_pass has
> >>
> >> for (_i = off; _i < _n; _i += off) \
> >>
> >> it seems to me that we need to have
> >>
> >> for (_i = off; _i < _n; _i += 1) \
> >>
> >> or maybe just _i++.
> >
> >
> > Shame on me :-(
> > I've wrote shell sort several times, so I forgot to recheck myself once
> > again.
> > And looks like best gap sequence from wikipedia is really best
> > ( {301, 132, 57, 23, 10 , 4} in my notation),
> >
> >
> > 2017-11-03 17:37 GMT+03:00 Claudio Freire <klaussfre...@gmail.com>:
> >> On Thu, Nov 2, 2017 at 11:46 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> >>> BTW, the originally given test case shows no measurable improvement
> >>> on my box.
> >>
> >> I did manage to reproduce the original test and got a consistent
> >> improvement.
> >
> > I've rechecked my self using my benchmark.
> > Without memmove, compactify_tuples consumes:
> > - with qsort 11.66% cpu (pg_qsort + med3 + swapfunc + itemoffcompare +
> > compactify_tuples = 5.97 + 0.51 + 2.87 + 1.88 + 0.44)
> > - with just insertion sort 6.65% cpu (sort is inlined, itemoffcompare also
> > inlined, so whole is compactify_tuples)
> > - with just shell sort 5.98% cpu (sort is inlined again)
> > - with bucket sort 1.76% cpu (sort_itemIds + compactify_tuples = 1.30 +
> > 0.46)
>
> Is that just insertion sort without bucket sort?

Yes. Just to show that inlined insertion sort is better than non-inlined
qsort in this particular use case.

> Because I think shell sort has little impact in your original patch
> because it's rarely exercised. With bucket sort, most buckets are very
> small, too small for shell sort to do any useful work.

Yes. In the patch, buckets are sorted with insertion sort. Shell sort is used
only on the full array, and only if its size is less than 48.
Bucket sort has a constant overhead of traversing all buckets, even if they
are empty. That is why I think shell sort is better for small arrays, though
I didn't measure that carefully. Insertion sort for small arrays will
probably be just enough.

> Maybe leave a fallback to qsort if some corner case produces big buckets?

For 8kB pages, each bucket covers 32 bytes. So, for heap pages there is at
most 1 heap tuple per bucket, and for index pages at most 2 index
tuples per bucket. For 32kB pages it is 4 heap tuples and 8 index tuples
per bucket.
It would be unnecessary overhead to call a non-inlineable qsort in these cases.
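
To make the arithmetic concrete: with 256 buckets (the bytes/256 figure quoted
elsewhere in this thread), one bucket covers page_size / 256 bytes, i.e. 32
bytes for 8kB pages and 128 bytes for 32kB pages. The sketch below is not the
patch's sort_itemIds; the names are hypothetical, it sorts in ascending order
for simplicity, and it assumes a page size between 256 and 32768 bytes with
every offset smaller than the page size.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define N_BUCKETS 256

/* Bytes covered by one bucket: 8192 / 256 = 32, 32768 / 256 = 128. */
static inline unsigned
bucket_width(unsigned page_size)
{
    return page_size / N_BUCKETS;
}

/*
 * Sort n tuple offsets ascending. tmp must have room for n elements.
 */
static void
bucket_sort_offsets(uint16_t *offs, uint16_t *tmp, size_t n,
                    unsigned page_size)
{
    size_t      start[N_BUCKETS + 1] = {0};
    unsigned    width = bucket_width(page_size);

    /* count elements per bucket; start[b + 1] holds the count of bucket b */
    for (size_t i = 0; i < n; i++)
        start[offs[i] / width + 1]++;

    /* prefix sums turn the counts into each bucket's first output slot */
    for (int b = 0; b < N_BUCKETS; b++)
        start[b + 1] += start[b];

    /* scatter into tmp, using start[b] as the write cursor of bucket b */
    for (size_t i = 0; i < n; i++)
        tmp[start[offs[i] / width]++] = offs[i];
    memcpy(offs, tmp, n * sizeof(uint16_t));

    /*
     * One insertion sort pass over the whole array. Buckets are already in
     * order relative to each other, so elements only move within their own
     * bucket and the pass is cheap when buckets hold one or two entries.
     */
    for (size_t i = 1; i < n; i++)
    {
        uint16_t    v = offs[i];
        size_t      j = i;

        while (j > 0 && v < offs[j - 1])
        {
            offs[j] = offs[j - 1];
            j--;
        }
        offs[j] = v;
    }
}
```

The loop over all 256 buckets runs no matter how few offsets there are, which
is the constant overhead that makes a plain inlined sort attractive for very
small arrays.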

So, I think shell sort could be removed, but insertion sort has to remain.

I'd prefer that shell sort remain as well. It could be useful in other places
too, because it is easily inlinable and provides performance comparable to
qsort up to several hundred elements.

With regards,
Sokolov Yura aka funny_falcon.


Re: [HACKERS] Small improvement to compactify_tuples

2017-11-04 Thread Юрий Соколов
2017-11-03 5:46 GMT+03:00 Tom Lane :
>
> Sokolov Yura  writes:
> > [ 0001-Improve-compactify_tuples.patch, v5 or thereabouts ]
>
> I went to check the shellsort algorithm against Wikipedia's entry,
> and found that this appears to be an incorrect implementation of
> shellsort: where pg_shell_sort_pass has
>
> for (_i = off; _i < _n; _i += off) \
>
> it seems to me that we need to have
>
> for (_i = off; _i < _n; _i += 1) \
>
> or maybe just _i++.


Shame on me :-(
I've written shell sort several times, so I forgot to recheck myself once
again.
And it looks like the best gap sequence from Wikipedia really is the best
({301, 132, 57, 23, 10, 4} in my notation).
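
For reference, a shell sort with the corrected inner loop could be sketched
as below. This is a plain function, not the pg_shell_sort_pass macro from the
patch; the function name is hypothetical, and the trailing gap of 1 (a final
plain insertion sort pass) is my assumption about what the notation above
leaves implicit.

```c
#include <stddef.h>
#include <stdint.h>

/* Gap sequence from the message, plus an assumed final gap of 1. */
static const size_t shell_gaps[] = {301, 132, 57, 23, 10, 4, 1};

static void
shell_sort_offsets(uint16_t *a, size_t n)
{
    for (size_t g = 0; g < sizeof(shell_gaps) / sizeof(shell_gaps[0]); g++)
    {
        size_t      off = shell_gaps[g];

        /*
         * i advances by 1, not by off. Stepping by off would visit only the
         * elements at multiples of off, so just one of the off interleaved
         * chains would be sorted in this pass.
         */
        for (size_t i = off; i < n; i++)
        {
            uint16_t    tmp = a[i];
            size_t      j = i;

            while (j >= off && tmp < a[j - off])
            {
                a[j] = a[j - off];
                j -= off;
            }
            a[j] = tmp;
        }
    }
}
```

With the `_i += off` version the result is still sorted, because the last
pass is an ordinary insertion sort, but the earlier passes do much less useful
work than intended.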


2017-11-03 17:37 GMT+03:00 Claudio Freire :
> On Thu, Nov 2, 2017 at 11:46 PM, Tom Lane  wrote:
>> BTW, the originally given test case shows no measurable improvement
>> on my box.
>
> I did manage to reproduce the original test and got a consistent
> improvement.

I've rechecked myself using my benchmark.
Without memmove, compactify_tuples consumes:
- with qsort 11.66% cpu (pg_qsort + med3 + swapfunc + itemoffcompare +
compactify_tuples = 5.97 + 0.51 + 2.87 + 1.88 + 0.44)
- with just insertion sort 6.65% cpu (sort is inlined, itemoffcompare is also
inlined, so the whole cost is in compactify_tuples)
- with just shell sort 5.98% cpu (sort is inlined again)
- with bucket sort 1.76% cpu (sort_itemIds + compactify_tuples = 1.30 +
0.46)

(memmove consumes 1.29% cpu)

tps also reflects the changes:
~17ktps with qsort
~19ktps with bucket sort

Vacuum of the benchmark's table is also improved:
~3s with qsort,
~2.4s with bucket sort

Of course, this benchmark is quite synthetic: the table is unlogged, the
tuples are small, and synchronous commit is off. Still, such a table is useful
in some situations (think of not-too-important but useful counters, like a
"photo view count").
And the patch affects not only this synthetic benchmark. It affects restore
performance, as Heikki mentioned, and the cpu consumption of vacuum (though
vacuum is more I/O bound).

> I think we should remove pg_shell_sort and just use pg_insertion_sort.

Using shell sort is just a bit safer. The worst-case pattern (for insertion
sort) is unlikely to appear, but what if it does? Shell sort is a bit better
on the whole array (5.98% vs 6.65%), though on small arrays the difference
will be much smaller.

With regards,
Sokolov Yura aka funny_falcon