Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-15 Thread Peter Geoghegan
On 15 February 2012 06:16, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Feb 10, 2012 at 10:30 AM, Peter Geoghegan pe...@2ndquadrant.com 
 wrote:
 [ new patch ]

 I spent quite a bit of time looking at this today - the patch
 specifically, and the issue of making quicksort fast more generally.
 It seemed to me that if we're going to have separate copies of the
 quicksort code for tuple sorting, we might as well go whole hog and
 specialize those copies to the particular needs of tuplesort.c as much
 as possible.  Accordingly, I whacked the code around so that it knows
 that it is sorting SortTuple objects and need not conditionalize at
 runtime on the size of the objects being swapped.  You suggested
 upthread that this might be worthwhile, and it seems that it is, so I
 think we should do it.

Cool. I agree that we should do this. It doesn't need to be justified
as a performance optimisation - it makes sense to refactor in this
way. If that makes things faster, then so much the better.

 Your patch removes the CHECK_FOR_INTERRUPTS() call from
 comparetup_heap, which is no good.  However, since I'd already decided
 to specialize the copies of quicksort intended for sorting as much as
 possible, it made sense to me to refactor things so that the qsort
 routine itself, rather than the comparator, is responsible for calling
 CHECK_FOR_INTERRUPTS().  This slightly reduces the number of times we
 CHECK_FOR_INTERRUPTS(), but never allows more than a few comparisons
 before doing it.

Well, the idea of that was to have one and only one
CHECK_FOR_INTERRUPTS() call, in the leading comparator. I should
perhaps have further stressed that that patch was only intended to
establish the idea of having the inlined leading-key comparator call
the outer comparator through that function pointer.

 I find that your pg_always_inline macro is equivalent to just plain
 inline on my system (Mac OS X v10.6.8, gcc 4.2.1).  It seems to need
 something like this:

 +#elif __GNUC__
 +#define pg_always_inline inline __attribute__((always_inline))

 ...but I'm not very happy about relying on that, because I don't know
 that it will work on every gcc version (never mind non-gcc compilers),
 and I'm not convinced it's actually improving performance even on this
 one.  The documentation seems to indicate that this is intended to
 force inlining even when not optimizing, which may have something to
 do with the lack of effect: that's not really the point here anyway.

There is a scant reference that suggests this, yes, but that's
contradicted by other, equally scanty sources. The macro was observed
to be helpful in the multi-key case, and the effect was quite apparent
and reproducible. At any rate its appearance in my most recent patch
was vestigial - I think that there ought to be no need for it, now
that the compiler's cost/benefit analysis probably has things right by
inlining.
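
For reference, a sketch of how such a macro is commonly made portable
(hedged: the MSVC branch is an assumption on my part and was not in any
posted patch; the fallback degrades to a plain hint):

#if defined(_MSC_VER)
#define pg_always_inline __forceinline
#elif defined(__GNUC__)
#define pg_always_inline inline __attribute__((always_inline))
#else
#define pg_always_inline inline		/* plain hint as a fallback */
#endif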

 What I did instead is to replace template_qsort_arg.h with a script
 called gen_qsort_tuple.pl, which generates a file called qsort_tuple.c
 that tuplesort.c then #includes.  This seems more flexible to me than
 the macro-based approach.  In particular, it allows me to generate
 versions of qsort with different call signatures.  The attached patch
 generates two:

 static void qsort_tuple(SortTuple *a, size_t n, SortTupleComparator
 cmp_tuple, Tuplesortstate *state);
 static void qsort_ssup(SortTuple *a, size_t n, SortSupport ssup);

 The first of these is a drop-in replacement for qsort_arg() - any
 tuplesort can use it, not just heap sorts.  But it is faster than
 qsort_arg() because of the specializations for the SortTuple data
 type.  The second handles the special case where we are sorting by a
 single key that has an associated SortSupport object.  In this case we
 don't need to carry the overhead of passing around the Tuplesortstate
 and dereferencing it, nor do we need the SortTupleComparator: we can
 just pass the SortSupport itself.  Maybe there's a way to get this
 effect using macros, but I couldn't figure it out.

I've seen the pattern where generic programming is aped with the
preprocessor in the style of my patch in, of all things, wxWidgets,
where it continues to be used as part of pgAdmin, or was when I worked
on wxWidgets 3 support exactly one year ago. One thing that is
particularly bizarre about that code is that clients actually #include
a cpp file. This is, I'd guess, a legacy of having to target pre-C++98
compilers without decent template support. MSVC 6, in particular, was
completely inadequate in this area.

I am inclined to agree that given that we already use Perl to generate
source code like this, it seems natural that we should prefer to do
that, if only to avoid paranoia about the inclusion of a dial-a-bloat
knob. I am at a disadvantage here, since I've never written a line of
Perl.

 With this patch, I get the following results, as compared with your
 2012-02-10 version and master, using the same test cases I tested
 before.

 select * 

Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-15 Thread Robert Haas
On Wed, Feb 15, 2012 at 8:29 AM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 Cool. I agree that we should do this. It doesn't need to be justified
 as a performance optimisation - it makes sense to refactor in this
 way. If that makes things faster, then so much the better.

Well, maybe so, but I think if the performance had been the same I
might not have messed with it.

 I am inclined to agree that given that we already use Perl to generate
 source code like this, it seems natural that we should prefer to do
 that, if only to avoid paranoia about the inclusion of a dial-a-bloat
 knob. I am at a disadvantage here, since I've never written a line of
 Perl.

I think it's still dial-a-bloat, but I feel pretty comfortable about
how we've got that knob adjusted in this version.  It's almost as much
improvement as any previous version, it applies to more cases, and the
code footprint is the least of any version I've measured.

 Hmm. I'll run some tests myself when I get a chance, and attempt to
 determine what's going on here. I don't have immediate access to a
 good benchmarking server, which would be preferable, and I'd like to
 post a revision of pg_stat_statements normalisation today, as it's
 been too long. Nice work though.

Thanks.

 It strikes me that if we wanted to take this further, we could look at
 squeezing out ApplySortComparator.  For example, suppose that, upon
 discovering that we can do an in-memory quicksort on a single sort
 key, we make an initial pass over the data where we check whether it's
 sorted and, as we go, swap all the entries with isnull1 = true to the
 end of the memtuples array.  We then sort the isnull1 = true entries
 with the standard comparator, and the isnull1 = false entries with an
 optimized comparator that elides most of ApplySortComparator and
 instead just calls the comparison function directly.  We then decide
 on whether to emit the isnull1 = true entries first or last based on
 NULLS FIRST/LAST, and decide whether to emit the remaining entries in
 forward or reverse order based on ASC/DESC.  Or maybe not exactly that
 thing, but something like that, so that we sort the null and non-null
 entries separately.  The additional complexity in the read-out logic
 would probably be more than offset by being able to use a simpler
 comparator.

 While it's really easy to be wrong about these things, I'd venture to
 guess that branch prediction makes this a less than compelling win in
 practice. That said, if you want to know how good a win it might be,
 you need only knock out a quick-and-dirty prototype and see how fast a
 sort is - however well this does will be pretty close to your best
 case, but not quite, since presumably such a prototype wouldn't worry
 about the applicability of the optimisation.

You might be right.  But note that the swapcode improvements were
also vulnerable to branch prediction, though there could be other
effects there.  One effect that should perhaps be considered is the
cost of saving and restoring registers.  Even if the branch itself
doesn't cost much because we know what the result will be, the
operands still have to be loaded into registers, which isn't free of
itself and also potentially forces those registers to be saved and
restored across function calls.  Still, I'm inclined not to poke at
this right now: there are a lot of patches that still need to be
looked at, and at the rate we're currently disposing of patches this
CommitFest will still be ongoing come Easter.
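
For concreteness, the null-partitioning pre-pass quoted above might
look something like the following sketch. This is an illustration of
the idea only, not code from any patch; SortTupleSketch is a
hypothetical stand-in for tuplesort.c's SortTuple, keeping just the
isnull1 flag that matters here:

#include <stdbool.h>
#include <stddef.h>

typedef struct
{
	long		datum1;			/* stand-in for the real SortTuple fields */
	bool		isnull1;
} SortTupleSketch;

/*
 * Swap every isnull1 entry to the tail of the array and return the
 * number of non-null entries left at the front.  The two groups can
 * then be sorted separately, and the read-out order decided by
 * NULLS FIRST/LAST and ASC/DESC.
 */
static size_t
partition_nulls(SortTupleSketch *a, size_t n)
{
	size_t		i = 0;
	size_t		j = n;

	while (i < j)
	{
		if (!a[i].isnull1)
			i++;				/* non-null: already in the front group */
		else
		{
			SortTupleSketch swap = a[i];	/* null: move to the tail */

			a[i] = a[--j];
			a[j] = swap;
		}
	}
	return i;					/* count of non-null entries */
}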

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-15 Thread Peter Geoghegan
On 15 February 2012 15:27, Robert Haas robertmh...@gmail.com wrote:
 I am inclined to agree that given that we already use Perl to generate
 source code like this, it seems natural that we should prefer to do
 that, if only to avoid paranoia about the inclusion of a dial-a-bloat
 knob. I am at a disadvantage here, since I've never written a line of
 Perl.

 I think it's still dial-a-bloat, but I feel pretty comfortable about
 how we've got that knob adjusted in this version.  It's almost as much
 improvement as any previous version, it applies to more cases, and the
 code footprint is the least of any version I've measured.

I'm happy that the principle that a dial-a-bloat knob isn't
necessarily a bad thing has been accepted, though that term is kind of
pejorative in that it implies that the knob necessarily adds bloat to
the binary.

I define bloat here as the addition of dead instructions to the
binary, or at least code that doesn't pull its weight. Clearly, that
isn't the case here, and I suspect that we will find that it isn't the
case in other places too.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-14 Thread Robert Haas
On Fri, Feb 10, 2012 at 10:30 AM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 [ new patch ]

I spent quite a bit of time looking at this today - the patch
specifically, and the issue of making quicksort fast more generally.
It seemed to me that if we're going to have separate copies of the
quicksort code for tuple sorting, we might as well go whole hog and
specialize those copies to the particular needs of tuplesort.c as much
as possible.  Accordingly, I whacked the code around so that it knows
that it is sorting SortTuple objects and need not conditionalize at
runtime on the size of the objects being swapped.  You suggested
upthread that this might be worthwhile, and it seems that it is, so I
think we should do it.
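
To illustrate what the specialization buys (a sketch only, assuming
the SortTuple struct declared in tuplesort.c): the generic qsort has
to swap elements through byte-copying swapcode driven by a runtime
element size, whereas the specialized copy can swap with a plain
struct assignment that the compiler lowers to a few moves:

static inline void
swap_sorttuple(SortTuple *a, SortTuple *b)
{
	SortTuple	tmp = *a;		/* element size known at compile time */

	*a = *b;
	*b = tmp;
}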

Your patch removes the CHECK_FOR_INTERRUPTS() call from
comparetup_heap, which is no good.  However, since I'd already decided
to specialize the copies of quicksort intended for sorting as much as
possible, it made sense to me to refactor things so that the qsort
routine itself, rather than the comparator, is responsible for calling
CHECK_FOR_INTERRUPTS().  This slightly reduces the number of times we
CHECK_FOR_INTERRUPTS(), but never allows more than a few comparisons
before doing it.
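
As a sketch of the placement being described - an insertion sort
stands in for the real quicksort, and CHECK_FOR_INTERRUPTS() is
stubbed out, since the point here is only where the check lands, once
per outer pass rather than once per comparison:

#include <stddef.h>

#define CHECK_FOR_INTERRUPTS()	((void) 0)	/* stub for illustration */

typedef struct { long datum1; } SortTupleSketch;	/* simplified */
typedef int (*TupleComparator) (const SortTupleSketch *,
								const SortTupleSketch *);

static void
sort_tuple_sketch(SortTupleSketch *a, size_t n, TupleComparator cmp)
{
	for (size_t i = 1; i < n; i++)
	{
		SortTupleSketch tmp = a[i];
		size_t		j = i;

		CHECK_FOR_INTERRUPTS();	/* in the sort loop, not the comparator */

		while (j > 0 && cmp(&tmp, &a[j - 1]) < 0)
		{
			a[j] = a[j - 1];
			j--;
		}
		a[j] = tmp;
	}
}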

I find that your pg_always_inline macro is equivalent to just plain
inline on my system (Mac OS X v10.6.8, gcc 4.2.1).  It seems to need
something like this:

+#elif __GNUC__
+#define pg_always_inline inline __attribute__((always_inline))

...but I'm not very happy about relying on that, because I don't know
that it will work on every gcc version (never mind non-gcc compilers),
and I'm not convinced it's actually improving performance even on this
one.  The documentation seems to indicate that this is intended to
force inlining even when not optimizing, which may have something to
do with the lack of effect: that's not really the point here anyway.
What I did instead is to replace template_qsort_arg.h with a script
called gen_qsort_tuple.pl, which generates a file called qsort_tuple.c
that tuplesort.c then #includes.  This seems more flexible to me than
the macro-based approach.  In particular, it allows me to generate
versions of qsort with different call signatures.  The attached patch
generates two:

static void qsort_tuple(SortTuple *a, size_t n, SortTupleComparator
cmp_tuple, Tuplesortstate *state);
static void qsort_ssup(SortTuple *a, size_t n, SortSupport ssup);

The first of these is a drop-in replacement for qsort_arg() - any
tuplesort can use it, not just heap sorts.  But it is faster than
qsort_arg() because of the specializations for the SortTuple data
type.  The second handles the special case where we are sorting by a
single key that has an associated SortSupport object.  In this case we
don't need to carry the overhead of passing around the Tuplesortstate
and dereferencing it, nor do we need the SortTupleComparator: we can
just pass the SortSupport itself.  Maybe there's a way to get this
effect using macros, but I couldn't figure it out.  At any rate, at
least for the single-key case, this approach effectively forces the
comparator to be inlined without requiring pg_always_inline.
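
For illustration, the comparison that the generated qsort_ssup() can
then make directly might look like this - a sketch assuming the
backend's sortsupport.h and tuplesort.c's SortTuple, whose datum1 and
isnull1 fields cache the leading key:

static inline int
qsort_ssup_compare(const SortTuple *a, const SortTuple *b,
				   SortSupport ssup)
{
	/*
	 * One call, fully visible to the compiler: no Tuplesortstate
	 * dereference and no SortTupleComparator indirection.
	 */
	return ApplySortComparator(a->datum1, a->isnull1,
							   b->datum1, b->isnull1,
							   ssup);
}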

With this patch, I get the following results, as compared with your
2012-02-10 version and master, using the same test cases I tested
before.

select * from nodups order by g offset 10001;
tps on master: 289.471274, 289.967984, 289.595958
tps on 2012-02-10 version: 359.150280, 356.284723, 356.888900
tps on attached version: 388.212793, 386.085083, 386.867478

select * from twocol order by a, b offset 1;
tps on master: 261.676611, 260.440886, 259.529362
tps on 2012-02-10 version: 283.941312, 279.981723, 283.140208
tps on attached version: 283.146463, 278.344827, 280.727798

select * from f8 order by g offset 1;
tps on master: 228.299924, 222.650355, 227.408506
tps on 2012-02-10 version: 260.289273, 257.181035, 256.377456
tps on attached version: 276.985299, 275.341575, 274.428095

There's some variation (which I can't account for) between the results
on master now and the results on master before - possibly just code
shifting around between cache lines due to unrelated changes, or maybe
some inadvertent change in my test setup.  But it looks to me like
your 2012-02-10 version, without any type-specific optimizations, does
pretty much just as well on multi-key sorting as your previous
version, which had them - or if there is a difference, it's pretty
small.

Overall, I think the numbers for the version I'm attaching here look
pretty good: the single-key performance is clearly better than your
last version, and the multi-key performance is very slightly worse.  I
think that slight worsening is a good trade-off, though, because this
version can use qsort_tuple() for all kinds of tuplesorts, not just
heap tuplesorts.  Still, it seems like we ought to be able to do even
better: the multi-key specialization that you had in your patch can be
coded in this framework, too, and in theory those are independent of
the 

Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-10 Thread Peter Geoghegan
On 9 February 2012 14:51, Robert Haas robertmh...@gmail.com wrote:
 I'm not sure I entirely follow all this, but I'll look at the code
 once you have it.

I have attached a revision of the patch, with the adjustments already
described. Note that I have not made this support btree tuplesorting
yet, as there is an impedance mismatch that must be resolved,
particularly with the SortSupport stuff, and I wanted to know what you
think of the multiple key specialisation first. Arguably, we could get
away with only a single specialisation - I haven't really thought about
it much.

You say "Well, how often will we sort 10,000 integers?", and I think
that btree index creation is one very common and useful case, so I'd
like to look at that in more detail. I certainly don't see any reason
to not do it too.

This should give you multiple-key sorting performance that is
almost as good as the single-key optimisation that you found to be
more compelling. Obviously the need to actually call comparetup_heap
to look at non-leading sortkeys will vary from case to case, and this
is based on your test case, where there are no duplicates and thus no
need to ever do that. That isn't too far from representative, as I
think that in general, a majority of comparisons won't result in
equality.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 1452e8c..e040ace
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
***
*** 113,118 
--- 113,119 
  #include "utils/pg_rusage.h"
  #include "utils/rel.h"
  #include "utils/sortsupport.h"
+ #include "utils/template_qsort_arg.h"
  #include "utils/tuplesort.h"
  
  
*** struct Tuplesortstate
*** 281,286 
--- 282,301 
  	int			currentRun;
  
  	/*
+ 	 * Will invocations of a tuple-class encapsulating comparator
+ 	 * (comparetup_heap, comparetup_index_btree, etc) skip the leading key,
+ 	 * because that has already been compared elsewhere?
+ 	 */
+ 	bool		skipLeadingkey;
+ 
+ 	/*
+ 	 * This specialization function pointer is sometimes used as an alternative
+ 	 * to the standard qsort_arg, when it has been determined that we can
+ 	 * benefit from various per number-of-sortkey performance optimizations.
+ 	 */
+ 	void (*qsort_arg_spec)(void *a, size_t n, size_t es, void *arg);
+ 
+ 	/*
  	 * Unless otherwise noted, all pointer variables below are pointers to
  	 * arrays of length maxTapes, holding per-tape data.
  	 */
*** static void readtup_datum(Tuplesortstate
*** 492,497 
--- 507,519 
  static void reversedirection_datum(Tuplesortstate *state);
  static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
  
+ /*
+  * A set of type-neutral specializations, for single sortkey and multiple
+  * sortkey sorts. These specializations are used for sorting both heap and
+  * btree index tuples that meet those criteria.
+  */
+ DO_TEMPLATE_QSORT_ARG(single, ONE_ADDITIONAL_CODE, single_comparetup_inline)
+ DO_TEMPLATE_QSORT_ARG(mult, MULTI_ADDITIONAL_CODE, mult_comparetup_inline)
  
  /*
   *		tuplesort_begin_xxx
*** tuplesort_begin_heap(TupleDesc tupDesc,
*** 631,636 
--- 653,661 
  		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
  	}
  
+ 	state->qsort_arg_spec =
+ 		nkeys == 1 ? single_qsort_arg : mult_qsort_arg;
+ 	state->skipLeadingkey = true;
  	MemoryContextSwitchTo(oldcontext);
  
  	return state;
*** tuplesort_performsort(Tuplesortstate *st
*** 1222,1232 
  			 * amount of memory.  Just qsort 'em and we're done.
  			 */
  			if (state->memtupcount > 1)
! qsort_arg((void *) state->memtuples,
! 		  state->memtupcount,
! 		  sizeof(SortTuple),
! 		  (qsort_arg_comparator) state->comparetup,
! 		  (void *) state);
  			state-current = 0;
  			state-eof_reached = false;
  			state-markpos_offset = 0;
--- 1247,1273 
  			 * amount of memory.  Just qsort 'em and we're done.
  			 */
  			if (state->memtupcount > 1)
! 			{
! /* Use a sorting specialization if available */
! if (state->qsort_arg_spec)
! 	/* specialization available */
! 	state->qsort_arg_spec(
! (void *) state->memtuples,
! state->memtupcount,
! sizeof(SortTuple),
! (void *) state);
! else
! 	/*
! 	 * Fall back on regular qsort_arg, with function pointer
! 	 * comparator, making no compile-time assumptions about the
! 	 * number of sortkeys or the datatype(s) to be sorted.
! 	 */
! 	qsort_arg((void *) state->memtuples,
!   state->memtupcount,
!   sizeof(SortTuple),
!   (qsort_arg_comparator) state->comparetup,
!   (void *) state);
! 			}
! 			}
  			state-current = 0;
  			state-eof_reached = false;
  			state-markpos_offset = 0;
*** make_bounded_heap(Tuplesortstate *state)
*** 2407,2412 
--- 2448,2454 

Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-09 Thread Noah Misch
On Tue, Feb 07, 2012 at 09:38:39PM -0500, Robert Haas wrote:
 Second, there's a concern about binary bloat: duplicating lots of code
 with different comparators inlined generates, well, a lot of machine
 code.  Of course, an 0.8% increase in the size of the resulting binary
 is very unlikely to produce any measurable performance regression on
 its own, but that's not really the issue.  The issue is rather that we
 could easily go nuts applying this technique in lots of different
 places - or even just in one place to lots of different types.  Let's
 say that we find that even for types with very complex comparators, we
 can get a 5% speedup on quick-sorting speed using this technique.
 Should we, then, apply the technique to every type we have?  Perhaps
 the binary would grow by, I don't know, 10%.  Maybe that's still not
 enough to start hurting performance (or making packagers complain),
 but surely it's getting there, and what happens when we discover that
 similar speedups are possible using the same trick in five other
 scenarios?  I think it's a bit like going out to dinner and spending
 $50.  It's nicer than eating at home, and many people can afford to do
 it occasionally, but if you do it every night, it gets expensive (and
 possibly fattening).
 
 So we need some principled way of deciding how much inlining is
 reasonable, because I am 100% certain this is not going to be the last
 time someone discovers that a massive exercise in inlining can yield a
 nifty performance benefit in some case or another: index builds and
 COPY have already been mentioned on this thread, and I expect that
 people will find other cases as well.  I'm not really sure what the
 budget is - i.e. how much binary bloat we can afford to add - or how
 many cases there are that can benefit, but the first isn't infinite
 and the second is more than the first.

Having such a metric would resolve this discussion, but formulating it will be
all but impossible.  We don't know how much time PostgreSQL installations now
spend sorting or how much additional time they would spend on cache misses,
TLB misses, page faults, etc.  That data won't show up on this thread.

You posed[1] the straw man of a sin(numeric) optimization requiring a 40GB
lookup table.  I would not feel bad about rejecting that, because it can live
quite comfortably as an external module.  Indeed, I think the best thing we
can do to constrain long-term bloat in the postgres executable is to improve
pluggability.  Let a wider range of features live outside the postmaster
binary.  For example, if we had the infrastructure to let hash, GIN, GiST and
SP-GiST index access methods live in their own DSOs (like PL/pgSQL does
today), I would support doing that.  Extensibility is a hallmark of
PostgreSQL.  It's a bad day when we reject a minority-interest feature or
optimization that has no other way to exist.

Better pluggability can't ease the decision process for Peter's patch, because
its specializations essentially must live in the postgres executable to live
at all.  Nominally, we could let external modules inject sort specializations
for core types, and Peter could publish an all_fast_sorts extension.  That
would be useless punting: if we can't make a principled decision on whether
these accelerated sorts justify 85 KiB of binary, how will a DBA who discovers
the module make that decision?

This patch has gotten more than its fair share of attention for bloat, and I
think that's mostly because there's a dial-a-bloat-level knob sticking out and
begging to be frobbed.  On my system, fastpath_sort_2012_01_19.patch adds 85
KiB to the postgres binary.  A recent commit added 57 KiB and bloat never came
up in discussions thereof.  I appreciate your concern over a slippery slope of
inlining proposals, but I also don't wish to see the day when every patch
needs to declare and justify its binary byte count.  Unless, of course, we
discover that elusive metric for evaluating it fairly.

If we'd like to start taking interest in binary bloat, how about having the
buildfarm log capture an ls -l on the bin directory?  We'll then have data
to mine if we ever wish to really take action.


All that being said, I'd want to see a 15-30% (depending on how contrived)
improvement to a microbenchmark or a 5% improvement to a generic benchmark
(pgbench, DBT-N) before adopting an optimization of this complexity.  Based
on your measurements[2], per-type inlining gives us a 7% microbenchmark
improvement.  I'd therefore excise the per-type inlining.  (For the record,
given a 20% improvement to the same benchmark, I'd vote yea on the submission
and perhaps even suggest int2 and uuid support.)

nm

[1] 
http://archives.postgresql.org/message-id/ca+tgmozo1xsz+yiqz2mrokmcmqtb+jir0lz43cne6de7--q...@mail.gmail.com
[2] 
http://archives.postgresql.org/message-id/ca+tgmobfz8sqtgtaaryerywuzfodgetprbuygbhsplr4+li...@mail.gmail.com


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-09 Thread Robert Haas
On Thu, Feb 9, 2012 at 7:24 AM, Noah Misch n...@leadboat.com wrote:
 On Tue, Feb 07, 2012 at 09:38:39PM -0500, Robert Haas wrote:
 Second, there's a concern about binary bloat: duplicating lots of code
 with different comparators inlined generates, well, a lot of machine
 code.  Of course, an 0.8% increase in the size of the resulting binary
 is very unlikely to produce any measurable performance regression on
 its own, but that's not really the issue.  The issue is rather that we
 could easily go nuts applying this technique in lots of different
 places - or even just in one place to lots of different types.  Let's
 say that we find that even for types with very complex comparators, we
 can get a 5% speedup on quick-sorting speed using this technique.
 Should we, then, apply the technique to every type we have?  Perhaps
 the binary would grow by, I don't know, 10%.  Maybe that's still not
 enough to start hurting performance (or making packagers complain),
 but surely it's getting there, and what happens when we discover that
 similar speedups are possible using the same trick in five other
 scenarios?  I think it's a bit like going out to dinner and spending
 $50.  It's nicer than eating at home, and many people can afford to do
 it occasionally, but if you do it every night, it gets expensive (and
 possibly fattening).

 So we need some principled way of deciding how much inlining is
 reasonable, because I am 100% certain this is not going to be the last
 time someone discovers that a massive exercise in inlining can yield a
 nifty performance benefit in some case or another: index builds and
 COPY have already been mentioned on this thread, and I expect that
 people will find other cases as well.  I'm not really sure what the
 budget is - i.e. how much binary bloat we can afford to add - or how
 many cases there are that can benefit, but the first isn't infinite
 and the second is more than the first.

 Having such a metric would resolve this discussion, but formulating it will be
 all but impossible.  We don't know how much time PostgreSQL installations now
 spend sorting or how much additional time they would spend on cache misses,
 TLB misses, page faults, etc.  That data won't show up on this thread.

 You posed[1] the straw man of a sin(numeric) optimization requiring a 40GB
 lookup table.  I would not feel bad about rejecting that, because it can live
 quite comfortably as an external module.  Indeed, I think the best thing we
 can do to constrain long-term bloat in the postgres executable is to improve
 pluggability.  Let a wider range of features live outside the postmaster
 binary.  For example, if we had the infrastructure to let hash, GIN, GiST and
 SP-GiST index access methods live in their own DSOs (like PL/pgSQL does
 today), I would support doing that.  Extensibility is a hallmark of
 PostgreSQL.  It's a bad day when we reject a minority-interest feature or
 optimization that has no other way to exist.

 Better pluggability can't ease the decision process for Peter's patch, because
 its specializations essentially must live in the postgres executable to live
 at all.  Nominally, we could let external modules inject sort specializations
 for core types, and Peter could publish an all_fast_sorts extension.  That
 would be useless punting: if we can't make a principled decision on whether
 these accelerated sorts justify 85 KiB of binary, how will a DBA who discovers
 the module make that decision?

 This patch has gotten more than its fair share of attention for bloat, and I
 think that's mostly because there's a dial-a-bloat-level knob sticking out and
 begging to be frobbed.  On my system, fastpath_sort_2012_01_19.patch adds 85
 KiB to the postgres binary.  A recent commit added 57 KiB and bloat never came
 up in discussions thereof.  I appreciate your concern over a slippery slope of
 inlining proposals, but I also don't wish to see the day when every patch
 needs to declare and justify its binary byte count.  Unless, of course, we
 discover that elusive metric for evaluating it fairly.

 If we'd like to start taking interest in binary bloat, how about having the
 buildfarm log capture an ls -l on the bin directory?  We'll then have data
 to mine if we ever wish to really take action.


 All that being said, I'd want to see a 15-30% (depending on how contrived)
 improvement to a microbenchmark or a 5% improvement to a generic benchmark
 (pgbench, DBT-N) before adopting an optimization of this complexity.  Based
 on your measurements[2], per-type inlining gives us a 7% microbenchmark
 improvement.  I'd therefore excise the per-type inlining.  (For the record,
 given a 20% improvement to the same benchmark, I'd vote yea on the submission
 and perhaps even suggest int2 and uuid support.)

That's all well-said and I mostly agree with all of it, including your
suggested percentages in the last paragraph (but does anyone really
sort int2s or uuids?).  What I think is different about inlining 

Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-09 Thread Peter Geoghegan
On 9 February 2012 13:50, Robert Haas robertmh...@gmail.com wrote:
 I'm also more than slightly concerned that we are losing sight of the
 forest for the trees.  I have heard reports that sorting large amounts
 of data is many TIMES slower for us than for a certain other database
 product.  I frankly wonder if this entire patch is a distraction.
 When inlining is the tool of choice for further performance
 improvement, it suggests that we're pretty much out of ideas, and that
 whatever we get out of inlining - be it 6% or 30% - will be the end.
 I don't like that idea very much.

If you're talking about the product I think you're talking about,
well, their contracts forbid any actual benchmarks, so we won't have
any hard figures, but I find it intuitively difficult to believe that
the difference is that large, mostly because I've demonstrated that
the performance of qsort is a pretty good proxy for the performance of
a query like "select * from foo order by bar", and qsort can't be too
far from optimal. Besides, didn't you yourself say "I've been
skeptical of this patch for a number of reasons, the major one of
which is that most workloads spend only a very small amount of time in
doing quicksorts"?

I have a new plan. I think you'll like it:

* drop the type specific specialisations entirely.

* drop qsort_arg (the original, unspecialised version). We now have
exactly the same number of source code copies of qsort as before: 2.

* gut the comparison of the leading sort key only from
comparetup_heap, comparetup_index_btree, etc (I collectively refer to
these functions as tuple-class encapsulating comparators, distinct
from their comparator proper).

* Instantiate two specialisations of qsort_arg: single key and
multiple key (exactly the same number as has already been agreed to,
since we lost the generic version). However, there is one key
difference now: we call one of the tuple class encapsulating functions
through a function pointer if the first comparison gives equality - .
Early indications are that this is almost as much of a win as the
single-key case. So the code effectively does this (but through a
function pointer):

#define MULTI_ADDITIONAL_CODE(COMPAR) \
{ \
	return comparetup_heap(aT, bT, state); \
}

This works because most comparisons won't actually need to look at the
second key, and because the cut-down tuple-class encapsulating
comparator that is used in the last revision of the patch doesn't
actually make any assumption about the particular tuple-class
encapsulating comparator in use.

It may not even be worthwhile having a separate copy of the qsort
function for the multiple key case, so the question of binary bloat
becomes entirely irrelevant, as there would be a net gain of zero
copies of qsort_arg.
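
A sketch of the comparator shape just described, with illustrative
names rather than the patch's actual ones: the leading key is compared
inline, and only on equality do we take the indirect call to the
tuple-class encapsulating comparator:

static int
mult_key_compare_sketch(const SortTuple *a, const SortTuple *b,
						Tuplesortstate *state)
{
	/* inlined leading-key comparison decides most calls outright */
	int			compare = ApplySortComparator(a->datum1, a->isnull1,
											  b->datum1, b->isnull1,
											  state->sortKeys);

	if (compare != 0)
		return compare;

	/*
	 * Leading keys are equal: indirect call to comparetup_heap,
	 * comparetup_index_btree, etc, to examine the remaining keys.
	 */
	return state->comparetup(a, b, state);
}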

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-09 Thread Robert Haas
On Thu, Feb 9, 2012 at 9:37 AM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 On 9 February 2012 13:50, Robert Haas robertmh...@gmail.com wrote:
 I'm also more than slightly concerned that we are losing sight of the
 forest for the trees.  I have heard reports that sorting large amounts
 of data is many TIMES slower for us than for a certain other database
 product.  I frankly wonder if this entire patch is a distraction.
 When inlining is the tool of choice for further performance
 improvement, it suggests that we're pretty much out of ideas, and that
 whatever we get out of inlining - be it 6% or 30% - will be the end.
 I don't like that idea very much.

 If you're talking about the product I think you're talking about,
 well, their contracts forbid any actual benchmarks, so we won't have
 any hard figures, but I find it intuitively difficult to believe that
 the difference is that large, mostly because I've demonstrated that
 the performance of qsort is a pretty good proxy for the performance of
 a query like "select * from foo order by bar", and qsort can't be too
 far from optimal. Besides, didn't you yourself say "I've been
 skeptical of this patch for a number of reasons, the major one of
 which is that most workloads spend only a very small amount of time in
 doing quicksorts"?

Emphasis on large amounts of data.

I've said from the beginning that the small-sort case is less
interesting to me: it's nice to make it faster, but nobody's going to
quit using PostgreSQL because a quicksort takes 8ms instead of 6ms.
It's when we start talking about operations that take minutes or hours
that the differences become material.

 I have a new plan. I think you'll like it:

 * drop the type specific specialisations entirely.

Check.

 * drop qsort_arg (the original, unspecialised version). We now have
 exactly the same number of source code copies of qsort as before: 2.

This may be useful elsewhere some day, but I suppose we can always
pull it back out of git if we need it.

 * gut the comparison of the leading sort key only from
 comparetup_heap, comparetup_index_btree, etc (I collectively refer to
 these functions as tuple-class encapsulating comparators, distinct
 from their comparator proper).

 * Instantiate two specialisations of qsort_arg: single key and
 multiple key (exactly the same number as has already been agreed to,
 since we lost the generic version). However, there is one key
 difference now: we call one of the tuple class encapsulating functions
 through a function pointer if the first comparison gives equality.
 Early indications are that this is almost as much of a win as the
 single-key case. So the code effectively does this (but through a
 function pointer):

 #define MULTI_ADDITIONAL_CODE(COMPAR) \
 { \
 	return comparetup_heap(aT, bT, state); \
 }

 This works because most comparisons won't actually need to look at the
 second key, and because the cut-down tuple-class encapsulating
 comparator that is used in the last revision of the patch doesn't
 actually make any assumption about the particular tuple-class
 encapsulating comparator in use.

 It may not even be worthwhile having a separate copy of the qsort
 function for the multiple key case, so the question of binary bloat
 becomes entirely irrelevant, as there would be a net gain of zero
 copies of qsort_arg.

I'm not sure I entirely follow all this, but I'll look at the code
once you have it.  Are you saying that all the comparetup_yadda
functions are redundant to each other in the single-key case?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-09 Thread Bruce Momjian
On Thu, Feb 09, 2012 at 07:24:49AM -0500, Noah Misch wrote:
 This patch has gotten more than its fair share of attention for bloat, and I
 think that's mostly because there's a dial-a-bloat-level knob sticking out and
 begging to be frobbed.

I already emailed Peter privately stating that he shouldn't feel bad
about the many critiques of his patch --- this dial-a-bloat-level knob
is new for us, and we have to get used to the idea.  We are more
discussing the idea of a dial-a-bloat-level knob rather than his patch.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-09 Thread Peter Geoghegan
On 9 February 2012 14:51, Robert Haas robertmh...@gmail.com wrote:
 I'm not sure I entirely follow all this, but I'll look at the code
 once you have it.  Are you saying that all the comparetup_yadda
 functions are redundant to each other in the single-key case?

Yes, I am. The main reason that the loops exist in those functions
(which is the only way that they substantially differ) is because they
each have to get the other keys through various ways that characterise
the tuple class that they encapsulate (index_getattr(),
heap_getattr(), etc).
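
In other words, the bodies share a shape like the fragment below (an
illustrative simplification, not actual tuplesort.c code); only the
marked attribute-fetching calls vary with the tuple class:

	for (nkey = 2; nkey <= state->nKeys; nkey++, sortKey++)
	{
		AttrNumber	attno = sortKey->ssup_attno;

		datum1 = heap_getattr(ltup, attno, tupDesc, &isnull1);	/* varies */
		datum2 = heap_getattr(rtup, attno, tupDesc, &isnull2);	/* varies */

		compare = ApplySortComparator(datum1, isnull1,
									  datum2, isnull2,
									  sortKey);
		if (compare != 0)
			return compare;
	}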

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-09 Thread Bruce Momjian
On Thu, Feb 09, 2012 at 03:36:23PM +, Peter Geoghegan wrote:
 On 9 February 2012 14:51, Robert Haas robertmh...@gmail.com wrote:
  I'm not sure I entirely follow all this, but I'll look at the code
  once you have it.  Are you saying that all the comparetup_yadda
  functions are redundant to each other in the single-key case?
 
 Yes, I am. The main reason that the loops exist in those functions
 (which is the only way that they substantially differ) is because they
 each have to get the other keys through various ways that characterise
 the tuple class that they encapsulate (index_getattr(),
 heap_getattr(), etc).

Does this help all types for sorting, including strings?

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-09 Thread Peter Geoghegan
On 9 February 2012 17:16, Bruce Momjian br...@momjian.us wrote:
 Yes, I am. The main reason that the loops exist in those functions
 (which is the only way that they substantially differ) is because they
 each have to get the other keys through various ways that characterise
 the tuple class that they encapsulate (index_getattr(),
 heap_getattr(), etc).

 Does this help all types for sorting, including strings?

Yes, it does, though of course that's not expected to make too much
difference with text when the C locale isn't used, because the
comparisons are inherently much more expensive, and there isn't a
whole lot we can do about that, at least here.
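
A rough sketch of the gap in question (not the backend's actual
varstr_cmp code): in the C locale a text comparison can reduce to
memcmp plus a length tie-break, while other locales must route each
comparison through strcoll(), which is far more expensive per call:

#include <string.h>

static int
compare_text_sketch(const char *a, size_t alen,
					const char *b, size_t blen, int c_locale)
{
	if (c_locale)
	{
		/* C locale: byte comparison, length breaks the tie */
		int			cmp = memcmp(a, b, alen < blen ? alen : blen);

		if (cmp != 0)
			return cmp;
		return (alen < blen) ? -1 : (alen > blen) ? 1 : 0;
	}

	/*
	 * Other locales: defer to strcoll().  Assumes a and b are
	 * NUL-terminated here; the backend arranges that by copying.
	 */
	return strcoll(a, b);
}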

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Peter Geoghegan
It doesn't necessarily matter if we increase the size of the postgres
binary by 10%, precisely because most of that is not going to be in
play from one instant to the next. I'm thinking, in particular, of
btree index specialisations, where it could make perfect sense to go
crazy. You cannot have a reasonable discussion about such costs
without considering that they will perhaps never be paid, given any
plausible workload. That's why the percentage that the postgres binary
size has been shown to increase by really isn't pertinent at all. At
best, it's a weak proxy for such costs, assuming you don't have a
demonscene-like preoccupation with reducing binary size, and I don't
believe that we should.

It would be difficult for me to measure such things objectively, but
I'd speculate that the proprietary databases have much larger binaries
than ours, while having far fewer features, precisely because they
started applying tricks like this a long time ago. You could counter
that their code bases probably look terrible, and you'd have a point,
but so do I.

On 8 February 2012 02:38, Robert Haas robertmh...@gmail.com wrote:
 I've been skeptical of this patch for a number of reasons, the major
 one of which is that most workloads spend only a very small amount of
 time in doing quicksorts, and most quicksorts are of very small amounts
 of data and therefore fast anyway.   It is easy to construct an
 artificial test case that does lots and lots of in-memory sorting, but
 in real life I think that's not the great part of what people use
 databases for.

Fair enough, but if that's true, then it's also true that the cost due
to cache marginalisation - the only cost that I think is worth
considering at all - is correspondingly a small fraction of that very
small amount of sorting.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Tom Lane
Peter Geoghegan pe...@2ndquadrant.com writes:
 It doesn't necessarily matter if we increase the size of the postgres
 binary by 10%, precisely because most of that is not going to be in
 play from one instant to the next.

You've heard of swapping, no?  Basically what I'm hearing from you is
total denial that binary bloat costs anything, and that just does not
square with reality.  Even if the costs in performance terms are
negligible in many common situations (something that you've asserted
but without offering any proof), there are other costs; software
distribution for instance is definitely sensitive to size.  As a Red Hat
person I've had to spend time fighting to keep Postgres included in the
minimum "does it fit on a DVD" distribution of RHEL/Fedora.  That case
gets harder to make every year, and yet it's pretty critical to the
success of the project --- if you don't get distributed, you lose users.

IMO this patch is already well past the point of diminishing returns in
value-per-byte-added.  I'd like to see it trimmed back to provide a fast
path for just single-column int4/int8/float4/float8 sorts.  The other
cases aren't going to offer enough of a win to justify the code space.

regards, tom lane



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Robert Haas
On Wed, Feb 8, 2012 at 8:33 AM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 It doesn't necessarily matter if we increase the size of the postgres
 binary by 10%, precisely because most of that is not going to be in
 play from one instant to the next.

As Tom says, that doesn't jibe with my experience.  If you add on
enough binary bloat, you will have more page faults.  It's true (as I
think you pointed out upthread) that packing all the copies of
quicksort into the binary one after the other minimizes the effect on
other things, but it doesn't eliminate it entirely.  If you've got
this:

<random other stuff> ... <a zillion copies of quicksort> ... <more other stuff>

...then a branch from the <random other stuff> section of the binary
to the <more other stuff> section of the binary may cost more.  For
example, suppose the OS does X amount of readahead.  By stuffing all
those copies of quicksort into the middle there, you increase the
chances that the page you need was beyond the readahead window.  Or,
if it wasn't going to be in the readahead window either way, then you
increase the distance that the disk head needs to move to find the
required block.

These costs are very small, no question about it.  They are almost
impossible to measure individually, in much the same way that the cost
of pollution or greenhouse gas emissions is difficult to measure.  But
it's an error to assume that because the costs are individually small
that they will never add up to anything material.  As long as you keep
insisting on that, it's hard to have a rational conversation.  We can
reasonably debate the magnitude of the costs, but to assert that they
don't exist gets us nowhere.  Suppose we could get a 10% speedup on
sin(numeric) by adding 40GB to the binary size.  Would you be in favor
of that?  Do you think it would hurt performance on any other
workloads?  Would our packagers complain at all?  Surely your same
argument would apply to that case in spades: anyone who is not using
the gigantic hard-coded lookup table will not pay any portion of the
cost of it.

 It would be difficult for me to measure such things objectively, but
 I'd speculate that the proprietary databases have much larger binaries
 than ours, while having far fewer features, precisely because they
 started applying tricks like this a long time ago. You could counter
 that their code bases probably look terrible, and you'd have a point,
 but so do I.

That might be true; I have no idea.  There are probably lots of
reasons why their code bases look terrible, including a long history
of backward compatibility with now-defunct versions, a variety of
commercial pressures, and the fact that they don't have to take flak
in the public square for what their code looks like.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Robert Haas
On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 IMO this patch is already well past the point of diminishing returns in
 value-per-byte-added.  I'd like to see it trimmed back to provide a fast
 path for just single-column int4/int8/float4/float8 sorts.  The other
 cases aren't going to offer enough of a win to justify the code space.

I'm curious about how much we're gaining from the single-column
specializations vs. the type-specific specializations.  I think I'm
going to go try to characterize that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Bruce Momjian
On Wed, Feb 08, 2012 at 01:33:30PM +, Peter Geoghegan wrote:
 It doesn't necessarily matter if we increase the size of the postgres
 binary by 10%, precisely because most of that is not going to be in
 play from one instant to the next. I'm thinking, in particular, of
 btree index specialisations, where it could make perfect sense to go
 crazy. You cannot have a reasonable discussion about such costs
 without considering that they will perhaps never be paid, given any
 plausible workload. That's why the percentage that the postgres binary
 size has been shown to increase by really isn't pertinent at all. At
 best, it's a weak proxy for such costs, assuming you don't have a
 demonscene-like preoccupation with reducing binary size, and I don't
 believe that we should.

When you start a binary, your executable is mapped to a file system
binary, and you page-fault in the pages you need.  Now, if your
optimization was alone in its own 4k (x86) virtual pages, and you never
called the functions, odds are you would not pay a penalty, aside from
distribution penalty, and perhaps a small penalty if useful code was
before and after your block.

The sort code expansion, however, is done in-line, in the middle of the
sort code, so you are clearly filling 64-byte (x86) CPU cache lines
with type-specific expansion code for every sort case, whether we use
the code or not.  Now, I don't think it is a big problem, and I think
the speedup is worth it for common data types, but we can't say the cost
is zero.

Saying it another way, having a binary in your file system that you
never call is not overhead except for storage, but in this case, the
sort code expansion is inside critical functions we are already calling.

Frankly, it is the cost that has kept us away from using such
optimizations for a long time.  I recently posted that the zero-cost
optimizations are mostly completed for sort and COPY, and we have to
start considering non-zero-cost optimizations --- sad, but true.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Bruce Momjian
On Wed, Feb 08, 2012 at 10:17:36AM -0500, Robert Haas wrote:
 On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  IMO this patch is already well past the point of diminishing returns in
  value-per-byte-added.  I'd like to see it trimmed back to provide a fast
  path for just single-column int4/int8/float4/float8 sorts.  The other
  cases aren't going to offer enough of a win to justify the code space.
 
 I'm curious about how much we're gaining from the single-column
 specializations vs. the type-specific specializations.  I think I'm
 going to go try to characterize that.

Yes, please.  That would be a big help.   Is there no optimization for
strings?  I assume they are sorted a lot.  

We can always add more data types later.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 Yes, please.  That would be a big help.   Is there no optimization for
 strings?  I assume they are sorted a lot.  

It seems unlikely that it'd be worth including strings, especially if
your locale is not C.  This whole thing only makes sense for datatypes
that are comparable in approximately 1 machine instruction.
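
For concreteness, the class of comparator meant here (a sketch, not
PostgreSQL code) really is a couple of instructions once inlined:

#include <stdint.h>

static inline int
compare_int32_sketch(int32_t x, int32_t y)
{
	/* branch-free -1 / 0 / +1 */
	return (x > y) - (x < y);
}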

regards, tom lane



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Bruce Momjian
On Wed, Feb 08, 2012 at 11:35:46AM -0500, Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  Yes, please.  That would be a big help.   Is there no optimization for
  strings?  I assume they are sorted a lot.  
 
 It seems unlikely that it'd be worth including strings, especially if
 your locale is not C.  This whole thing only makes sense for datatypes
 that are comparable in approximately 1 machine instruction.

Ah, OK, interesting.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Peter Geoghegan
On 8 February 2012 15:17, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 IMO this patch is already well past the point of diminishing returns in
 value-per-byte-added.  I'd like to see it trimmed back to provide a fast
 path for just single-column int4/int8/float4/float8 sorts.  The other
 cases aren't going to offer enough of a win to justify the code space.

 I'm curious about how much we're gaining from the single-column
 specializations vs. the type-specific specializations.  I think I'm
 going to go try to characterize that.

I think it might make more sense to lose the type-specific
specialisations for the multi-key case while adding a generic
multi-key specialisation, than to lose all multi-key specialisations,
though I have not considered that question at length, and would think
that we'd still want to keep an int4 version in that case. Note that I
*did not* include a generic multi-key specialisation, though only
because I saw little point, having already covered by far the most
common cases.

While you're at it, I'd like to suggest that you perform a benchmark
on a multi-key specialisation, so we can see just what we're throwing
away before we do so. Better to have those numbers come from you.

I continue to maintain that the most appropriate course of action is
to provisionally commit all specialisations. If it's hard to know what
effect this is going to have on real workloads, let's defer to beta
testers, who presumably try the new release out with their
application. It's a question you could squarely put to them, without
gradually rolling back from that initial position being much of a
problem.

The mysql-server package is 45 MB on Fedora 16. That "1% of Postgres
binary" figure is for my earlier patch with btree specialisations,
right? I'm not asking you to look at that right now. I also don't
think that "where do we eventually draw the line with specialisations
like this in Postgres generally?" is a question that you should expect
me to answer, though I will say that we should look at each case on
its merits.

I have not totally denied binary bloat costs. I have attempted to
quantify them, while acknowledging that such a task is difficult, as
was evident from the fact that Robert wasn't surprised that I could
not demonstrate any regression. Granted, my definition of a regression
is that there is very clearly no net loss in performance at some
reasonable granularity, which is a very practical definition. You can
quite easily contrive a case that HOT handles really badly. Some
people did, I believe, but HOT won out because it was clearly very
useful in the real world.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Robert Haas
On Wed, Feb 8, 2012 at 10:59 AM, Bruce Momjian br...@momjian.us wrote:
 On Wed, Feb 08, 2012 at 10:17:36AM -0500, Robert Haas wrote:
 On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  IMO this patch is already well past the point of diminishing returns in
  value-per-byte-added.  I'd like to see it trimmed back to provide a fast
  path for just single-column int4/int8/float4/float8 sorts.  The other
  cases aren't going to offer enough of a win to justify the code space.

 I'm curious about how much we're gaining from the single-column
 specializations vs. the type-specific specializations.  I think I'm
 going to go try to characterize that.

 Yes, please.  That would be a big help.   Is there no optimization for
 strings?  I assume they are sorted a lot.

They are, but the value of the optimization is expected to diminish
for more complex comparators.

I did a quick test of this using the same setup I mentioned upthread:

create table nodups (g) as select (g%10)*1000+g/10 from
generate_series(1,10000) g;

This was the actual test query:

select * from nodups order by g offset 10001;

I did a ten-second warmup after starting each postmaster, followed
by three one-minute test runs:

master:
tps = 298.824321 (including connections establishing)
tps = 301.741619 (including connections establishing)
tps = 302.238016 (including connections establishing)

Peter's patch, but with the type-specific optimizations disabled, so
using just the single-sortkey optimizations:
tps = 357.553690 (including connections establishing)
tps = 358.765875 (including connections establishing)
tps = 358.825051 (including connections establishing)

Peter's patch, intact:
tps = 394.566994 (including connections establishing)
tps = 394.228667 (including connections establishing)
tps = 394.492677 (including connections establishing)

So that's about a 19% speedup for just the single sort-key
optimizations, and about a 30% speedup in total.  Compared to a
baseline that includes the single sort-key optimizations, the
additional type-specific optimization is buying about 10% on this
test.

I tried changing the test query to return the data, like this:

select * from nodups order by g;

master:
tps = 181.301129 (including connections establishing)
tps = 180.348333 (excluding connections establishing)
tps = 179.600825 (including connections establishing)

Peter's patch, but with the type-specific optimizations disabled, so
using just the single-sortkey optimizations:
tps = 201.728269 (including connections establishing)
tps = 198.022527 (including connections establishing)
tps = 199.663082 (including connections establishing)

Peter's patch, intact:
tps = 214.037387 (including connections establishing)
tps = 212.895145 (including connections establishing)
tps = 211.596558 (including connections establishing)

So, with the overhead of actually returning the sorted data, we get
10% from the generic single-sortkey optimizations and 18% from the two
combined.  Compared to a baseline that includes the single-sortkey
optimizations, the incremental improvement from adding the
type-specific optimizations is about 6.6%.  It would be less on a
wider table, presumably.

I also tested a two-column sort:

create table twocol (a, b) as select g%101, g%17 from
generate_series(1,10000) g;
select * from twocol order by a, b offset 10000;

master:
tps = 270.218604 (including connections establishing)
tps = 265.634964 (including connections establishing)
tps = 268.520805 (including connections establishing)

Peter's patch, intact:
tps = 285.528626 (including connections establishing)
tps = 285.140345 (including connections establishing)
tps = 282.388457 (including connections establishing)

So here the type-specific optimizations are buying us even less: only
about 6%, and that's with no client overhead.

And I tested float8:

create table f8 (g) as select random() from generate_series(1,10000);
select * from f8 order by g offset 10000;

master:
tps = 228.373239 (including connections establishing)
tps = 225.117704 (including connections establishing)
tps = 225.196197 (including connections establishing)

Peter's patch, but with the type-specific optimizations disabled, so
using just the single-sortkey optimizations:
tps = 263.060820 (including connections establishing)
tps = 262.788926 (including connections establishing)
tps = 263.041186 (including connections establishing)

Peter's patch, intact:
tps = 283.274797 (including connections establishing)
tps = 280.597965 (including connections establishing)
tps = 280.151349 (including connections establishing)

That's +17% for the single-sortkey stuff, +25% for everything, and the
incremental improvement of the type-specific optimizations over the
generic single-key optimizations is about 6.7%.

It seems clear that the single sort-key optimizations are a much
better value per byte of code than the type-specific optimizations.
Ignoring client overhead, we get between half and two-thirds of the
benefit, and it costs us just one extra copy of the quicksort logic
for all data types.

Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 [ lots of numbers ]

 ... I just can't get excited about that.  However, I
 find the single-key optimizations much more compelling, for the
 reasons stated above, and feel we ought to include those.

This conclusion seems sound to me, for the reasons you stated and one
more: optimizations for a single sort key are going to be applicable
to a very wide variety of queries, whereas all the other cases are
necessarily less widely applicable.

regards, tom lane



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Peter Geoghegan
On 8 February 2012 17:58, Robert Haas robertmh...@gmail.com wrote:
 It seems clear that the single sort-key optimizations are a much
 better value per byte of code than the type-specific optimizations.
 Ignoring client overhead, we get between half and two-thirds of the
 benefit, and it costs us just one extra copy of the quicksort logic
 for all data types.

That was clear from an early stage, and is something that I
acknowledged way back in September - it's pretty obvious that chasing
diminishing returns is the nature of this sort of thing, which is
fine, provided that they are not chased beyond some point that doesn't
make sense. However, I do not see why we should not at least include
full integer specialisations as well (for single-keys at least), given
that we get the additional, not inconsiderable improvements. I do not
accept the premise that we need to find the optimal bytes added to
performance increase ratio, because, as I've already stressed, adding
bytes to the application binary does not have a linear cost.

Here's what might be a good compromise:

singlekey generic, singlekey int4, singlekey int8, multikey generic.

That is a total of 4 copies of the qsort code, because
we've obsoleted the original qsort_arg (except for btree index
sorting, which likely matters far less). We get the most benefit for 5
very common types - int4, int8, date, timestamp and timestamptz. A 30%
improvement has to be better than the 19% speedup for just the single
sort-key optimisations, considering that in practice these types are
very commonly used.
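
To be concrete about why one int8 specialisation covers the datetime
types too: with integer datetimes, timestamp and timestamptz are int64
microsecond counts, and date is an int32 day count, so they compare
exactly like the integer types. A sketch of the shared comparator body
(illustrative, not the patch's code):

#include <stdint.h>

/* serves int8, timestamp and timestamptz; an int32 twin serves int4 and date */
static inline int
cmp_int64(int64_t a, int64_t b)
{
	if (a < b)
		return -1;
	if (a > b)
		return 1;
	return 0;
}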

I think that there may be additional benefits from making the
qsort_arg specialisation look less like a C stdlib one, like refining
the swap logic to have compile-time knowledge of the type it is
sorting. I'm thinking that we could usefully trim quite a bit from
this:

#define SWAPINIT(a, es) swaptype = ((char *)(a) - (char *)0) % sizeof(long) || \
	(es) % sizeof(long) ? 2 : (es) == sizeof(long)? 0 : 1;

#define thistype_vecswap(a, b, n)					\
	if ((n) > 0) inl_swapfunc((a), (b), (size_t)(n), swaptype)

#define thistype_swap(a, b)						\
	if (swaptype == 0) {						\
		long t = *(long *)(void *)(a);				\
		*(long *)(void *)(a) = *(long *)(void *)(b);		\
		*(long *)(void *)(b) = t;				\
	} else								\
		inl_swapfunc(a, b, es, swaptype)

inline static void
inl_swapfunc(char *a, char *b, size_t n, int swaptype)
{
	if (swaptype <= 1)
		swapcode(long, a, b, n);
	else
		swapcode(char, a, b, n);
}
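
For an array of fixed-size SortTuple structs, all of that machinery can
collapse to a plain struct assignment, since the element size is known
at compile time. A sketch of the idea (assuming a SortTuple typedef as
in tuplesort.c; this is not code from the patch):

static inline void
swap_sorttuple(SortTuple *a, SortTuple *b)
{
	SortTuple	t = *a;

	*a = *b;
	*b = t;
}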


-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Peter Geoghegan
On 8 February 2012 18:48, Peter Geoghegan pe...@2ndquadrant.com wrote:
 I think that there may be additional benefits from making the
 qsort_arg specialisation look less like a C stdlib one, like refining
 the swap logic to have compile-time knowledge of the type it is
 sorting. I'm thinking that we could usefully trim quite a bit from
 this:

It seems that while we cannot get any better performance, no doubt
because the compiler already applies all manner of optimisations,
perhaps including constant propagation, we can simplify this part of
the code quite a bit. That seems fairly incidental, though.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Robert Haas
On Wed, Feb 8, 2012 at 1:48 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 That was clear from an early stage, and is something that I
 acknowledged way back in September

OK, so why didn't/don't we do and commit that part first, and then
proceed to argue about the remainder once it's in?

 I think that there may be additional benefits from making the
 qsort_arg specialisation look less like a C stdlib one, like refining
 the swap logic to have compile-time knowledge of the type it is
 sorting. I'm thinking that we could usefully trim quite a bit from
 this:

That's an interesting idea, which seems worth pursuing, though
possibly not for 9.2.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-08 Thread Peter Geoghegan
On 8 February 2012 23:33, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Feb 8, 2012 at 1:48 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 That was clear from an early stage, and is something that I
 acknowledged way back in September

 OK, so why didn't/don't we do and commit that part first, and then
 proceed to argue about the remainder once it's in?

I have no objection to that. I'm not sure that it's going to be
possible to agree to anything beyond what has already been agreed. We
don't seem to be covering new ground.

What would it take to convince you to be more inclusive, and have at
least a few full specialisations, in particular, int4 and int8 single
key full specialisations? I'm sure that you'll agree that they're the
second most compelling case. There doesn't seem to be a standard that
I can meet that you can even /describe/ to have those additional
specialisations committed, which is a shame, since they appear to be
pretty decent wins above and beyond the generic single key
specialisation, all things considered.

I suppose that the standard refrain here is "it's your patch; you must
prove the case for committing it". That would be fine by me, but this
isn't just about me, and it seems to be a practical impossibility in
this case. It may be quite impossible even with far greater resources
than I can afford to apply here, since it seems like you're hinting at
some kind of rigorous proof that this cannot cause a regression for a
single client, even though in order to see that regression multiple
other clients would have to be benefiting. I cannot provide such a
proof, since it would probably have to consider all commercially
available CPUs and all possible workloads - I doubt that anyone can.
That being the case, I'm eager not to waste any more time on this. I
bear no ill will; that just seems to be the situation that we find
ourselves in.

That said, if you can describe the standard, I'll try and meet it -
you seem to be suggesting that months and months of benchmarks
wouldn't really change things, since this is the first step on the
road to binary bloat ruin. The only fundamentally new argument that I
can offer is that by applying the standard that the community does
here, it severely limits the number of performance improvements that
can be added, and the aggregate effect is that we're quite certainly
worse off. It's a pity this isn't a really easy decision for you to
make, but that's the nature of these things, and we should expect that
to increasingly be the case. Is it really so unacceptable for us to
risk doing a little worse under some rare circumstances in order to do
better under some common circumstances? Why should it be an easy
decision - when are important decisions ever easy?

 I think that there may be additional benefits from making the
 qsort_arg specialisation look less like a C stdlib one, like refining
 the swap logic to have compile-time knowledge of the type it is
 sorting. I'm thinking that we could usefully trim quite a bit from
 this:

 That's an interesting idea, which seems worth pursuing, though
 possibly not for 9.2.

Well, it's really just a code clean-up. I suspect that qsort_arg is a
minimally modified version of the NetBSD one that wasn't necessarily
thoroughly understood when it was initially added (not that I'm
suggesting that it ought to have been). Then again, you might prefer
to keep it as consistent as possible with qsort (the other copy of
that sort function that we already have).

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-07 Thread Jim Decibel! Nasby

On 2/6/12 3:19 PM, Bruce Momjian wrote:

  While we're waiting for anyone else to weigh in with an opinion on the
  right place to draw the line here, do you want to post an updated
  patch with the changes previously discussed?

 Well, I think we have to ask not only how many people are using
 float4/8, but how many people are sorting or creating indexes on them.
 I think it would be few and perhaps should be eliminated.

I agree that it's probably pretty unusual to index floats. My objection
was on the assumption that float8 is valid but float4 isn't. If we are
going to provide a fast-path for one then we should do it for both, if
for no other reason than least surprise.


--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-07 Thread Jay Levitt

Jim Decibel! Nasby wrote:

I agree that it's probably pretty unusual to index floats.


FWIW: Cubes and points are floats, right? So would spatial indexes benefit 
from this optimization, or is it only raw floats?


Jay Levitt



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-07 Thread Robert Haas
On Tue, Feb 7, 2012 at 2:39 PM, Jay Levitt jay.lev...@gmail.com wrote:
 Jim Decibel! Nasby wrote:

 I agree that it's probably pretty unusual to index floats.

 FWIW: Cubes and points are floats, right? So would spatial indexes benefit
 from this optimization, or is it only raw floats?

Cubes are not floats, nor are points.

In general, what's being proposed here is to make a separate copy of
quicksort for each of N datatypes, with the comparator function
inlined.  In order for this to benefit multiple types, they'd have to
use the same set of machine instructions to compare values on disk.
So in general you can make this apply to as many datatypes as you
want: it's just that each one will require another copy of the
quicksort code in the binary.  The more complex the comparator is, the
less you'll save, because the savings presumably come largely from
things like saving and restoring registers and pushing/popping stack
frames, and that's really only going to be material if the underlying
comparator is pretty darn cheap.

Actually, to be more precise, the patch is proposing to create TWO
copies of the quicksort code for each of N datatypes, one for the case
where there is but a single sort key and the other for the case where
there are multiple sort keys.  Having a separate code for the single
sort key case saves more because it lets you get rid of a control loop
- you don't have to iterate down the list of keys, because there's
only one.
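
A sketch of that difference, with illustrative names rather than the
patch's actual code: the multiple-key comparator pays for a control
loop on every comparison, while the single-key variant reduces to one
inlinable comparison.

/* multiple keys: must loop over the key array on every comparison */
static int
compare_multikey(const int *a, const int *b, int nkeys)
{
	int		i;

	for (i = 0; i < nkeys; i++)
	{
		if (a[i] < b[i])
			return -1;
		if (a[i] > b[i])
			return 1;
	}
	return 0;
}

/* single key: no loop, no indirection */
static inline int
compare_singlekey(int a, int b)
{
	return (a > b) - (a < b);
}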

I've been skeptical of this patch for a number of reasons, the major
one of which is that most workloads spend only a very small amount of
time doing quicksorts, and most quicksorts are of very small amounts
of data and therefore fast anyway.   It is easy to construct an
artificial test case that does lots and lots of in-memory sorting, but
in real life I think that's not a great part of what people use
databases for.  The time gets spent on things like groveling through
big tables or doing joins, and then maybe we sort the rows at the end,
but that's not where most of the time goes.  It may be rightly pointed
out, of course, that a penny saved is a penny earned: the fact that
most people don't spend much time quicksorting doesn't mean that we
shouldn't try to make quicksorting as fast as possible.  But there are
a couple of additional considerations.

First, the code is, at present, uglier than I would like.  I mean to
spend some time trying to clean that up, but there are 102 patches in
this CommitFest (the number seems to keep slowly growing, despite the
deadline being three weeks gone) and this isn't the only one I care
about, so I haven't quite gotten there yet.  But hopefully soon.
Second, there's a concern about binary bloat: duplicating lots of code
with different comparators inlined generates, well, a lot of machine
code.  Of course, an 0.8% increase in the size of the resulting binary
is very unlikely to produce any measurable performance regression on
its own, but that's not really the issue.  The issue is rather that we
could easily go nuts applying this technique in lots of different
places - or even just in one place to lots of different types.  Let's
say that we find that even for types with very complex comparators, we
can get a 5% speedup on quick-sorting speed using this technique.
Should we, then, apply the technique to every type we have?  Perhaps
the binary would grow by, I don't know, 10%.  Maybe that's still not
enough to start hurting performance (or making packagers complain),
but surely it's getting there, and what happens when we discover that
similar speedups are possible using the same trick in five other
scenarios?  I think it's a bit like going out to dinner and spending
$50.  It's nicer than eating at home, and many people can afford to do
it occasionally, but if you do it every night, it gets expensive (and
possibly fattening).

So we need some principled way of deciding how much inlining is
reasonable, because I am 100% certain this is not going to be the last
time someone discovers that a massive exercise in inlining can yield a
nifty performance benefit in some case or another: index builds and
COPY have already been mentioned on this thread, and I expect that
people will find other cases as well.  I'm not really sure what the
budget is - i.e. how much binary bloat we can afford to add - or how
many cases there are that can benefit, but the first isn't infinite
and the second is more than the first.

Having said all that, I am inclined to commit at least some portion of
this, but I wanted to knock off a few other patches that have been
lingering for a while first.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-07 Thread Bruce Momjian
On Tue, Feb 07, 2012 at 09:38:39PM -0500, Robert Haas wrote:
 So we need some principled way of deciding how much inlining is
 reasonable, because I am 100% certain this is not going to be the last
 time someone discovers that a massive exercise in inlining can yield a
 nifty performance benefit in some case or another: index builds and
 COPY have already been mentioned on this thread, and I expect that
 people will find other cases as well.  I'm not really sure what the
 budget is - i.e. how much binary bloat we can afford to add - or how
 many cases there are that can benefit, but the first isn't infinite
 and the second is more than the first.
 
 Having said all that, I am inclined to commit at least some portion of
 this, but I wanted to knock off a few other patches that have been
 lingering for a while first.

One approach would be to only do a few types now, e.g. integers and
strings, and perhaps leave the others for later. 

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-06 Thread Bruce Momjian
On Fri, Jan 27, 2012 at 09:37:37AM -0500, Robert Haas wrote:
 On Fri, Jan 27, 2012 at 9:27 AM, Peter Geoghegan pe...@2ndquadrant.com 
 wrote:
  Well, I don't think it's all that subjective - it's more the case that
  it is just difficult, or it gets that way as you consider more
  specialisations.
 
 Sure it's subjective.  Two well-meaning people could have different
 opinions without either of them being wrong.  If you do a lot of
 small, in-memory sorts, more of this stuff is going to seem worthwhile
 than if you don't.
 
  As for what types/specialisations may not make the cut, I'm
  increasingly convinced that floats (in the following order: float4,
  float8) should be the first to go. Aside from the fact that we cannot
  use their specialisations for anything like dates and timestamps,
  floats are just way less useful than integers in the context of
  database applications, or at least those that I've been involved with.
  As important as floats are in the broad context of computing, it's
  usually only acceptable to store data in a database as floats within
  scientific applications, and only then when their limitations are
  well-understood and acceptable. I think we've all heard anecdotes at
  one time or another, involving their limitations not being well
  understood.
 
 While we're waiting for anyone else to weigh in with an opinion on the
 right place to draw the line here, do you want to post an updated
 patch with the changes previously discussed?

Well, I think we have to ask not only how many people are using
float4/8, but how many people are sorting or creating indexes on them. 
I think it would be few and perhaps should be eliminated.

Peter Geoghegan obviously has done some serious work in improving
sorting, and worked well with the community process.  He has done enough
analysis that I am hard-pressed to see how we would get similar
improvement using a different method, so I think it comes down to
whether we want the 28% speedup by adding 55k (1%) to the binary.

I think Peter has shown us how to get that, and what it will cost --- we
just need to decide now whether it is worth it.  What I am saying is
there probably isn't a cheaper way to get that speedup, either now or in
the next few years.  (COPY might need similar help for speedups.)

I believe this is a big win and well worth the increased binary size
because the speed up is significant, and because it is of general
usefulness for a wide range of queries.  Either of these would be enough
to justify the additional 1% size, but both make it an easy decision for
me.  

FYI, I believe COPY needs similar optimizations; we have gotten repeated
complaints about its performance and this method of optimization might
also be our only option.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-06 Thread Bruce Momjian
On Mon, Feb 06, 2012 at 04:19:07PM -0500, Bruce Momjian wrote:
 Peter Geoghegan obviously has done some serious work in improving
 sorting, and worked well with the community process.  He has done enough
 analysis that I am hard-pressed to see how we would get similar
 improvement using a different method, so I think it comes down to
 whether we want the 28% speedup by adding 55k (1%) to the binary.
 
 I think Peter has shown us how to get that, and what it will cost --- we
 just need to decide now whether it is worth it.  What I am saying is
 there probably isn't a cheaper way to get that speedup, either now or in
 the next few years.  (COPY might need similar help for speedups.)
 
 I believe this is a big win and well worth the increased binary size
 because the speed up is significant, and because it is of general
 usefulness for a wide range of queries.  Either of these would be enough
 to justify the additional 1% size, but both make it an easy decision for
 me.  
 
 FYI, I believe COPY needs similar optimizations; we have gotten repeated
 complaints about its performance and this method of optmization might
 also be our only option.

Sorry, that was wordy.  What I am saying is that years ago we did
hot-spot optimization for storage and tuple access using macros.  We
have looked for cheap optimizations for sort (and COPY) for years, but
haven't found it.  If we want optimizations in these areas, we are
probably going to need to do the same sort of hot-spot optimization we
did earlier, and Peter's work is an excellent candidate for that.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-06 Thread Peter Geoghegan
On 6 February 2012 21:19, Bruce Momjian br...@momjian.us wrote:
 Peter Geoghegan obviously has done some serious work in improving
 sorting, and worked well with the community process.

Thank you for acknowledging that.

It's unfortunate that C does not support expressing these kinds of
ideas in a more natural way.
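
What C offers instead is token-pasting macros, one expansion per
element type. A toy sketch of the pattern (illustrative only, far
simpler than the patch's template):

#define DEFINE_SORT(TYPE, CMP)					\
static void							\
sort_##TYPE(TYPE *a, int n)					\
{								\
	int		i;					\
								\
	for (i = 1; i < n; i++)					\
	{							\
		TYPE	v = a[i];				\
		int	j = i - 1;				\
								\
		while (j >= 0 && CMP(a[j], v) > 0)		\
		{						\
			a[j + 1] = a[j];			\
			j--;					\
		}						\
		a[j + 1] = v;					\
	}							\
}

#define INT_CMP(x, y)	(((x) > (y)) - ((x) < (y)))

/* expands to a sort_int() with the comparator inlined */
DEFINE_SORT(int, INT_CMP)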

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-06 Thread Bruce Momjian
On Mon, Feb 06, 2012 at 10:49:10PM +, Peter Geoghegan wrote:
 On 6 February 2012 21:19, Bruce Momjian br...@momjian.us wrote:
  Peter Geoghegan obviously has done some serious work in improving
  sorting, and worked well with the community process.
 
 Thank you for acknowledging that.
 
 It's unfortunate that C does not support expressing these kinds of
 ideas in a more natural way.

Yes, it is a problem, and a benefit.  We have avoided C++ because these
types of trade-offs that we are discussing are often done invisibly, so
we can't make the decision ourselves.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-06 Thread Bruce Momjian
On Mon, Feb 06, 2012 at 06:43:04PM -0500, Bruce Momjian wrote:
 On Mon, Feb 06, 2012 at 10:49:10PM +, Peter Geoghegan wrote:
  On 6 February 2012 21:19, Bruce Momjian br...@momjian.us wrote:
   Peter Geoghegan obviously has done some serious work in improving
   sorting, and worked well with the community process.
  
  Thank you for acknowledging that.
  
  It's unfortunate that C does not support expressing these kinds of
  ideas in a more natural way.
 
 Yes, it is a problem, and a benefit.  We have avoided C++ because these
 types of trade-offs that we are discussing are often done invisibly, so
 we can't make the decision ourselves.

Let me add that while it is fine for languages like C++ to make these
decisions for application code automatically, operating systems and
high-performance database developers prefer to make such decisions
explicitly, which is what we are discussing now.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-02 Thread k...@rice.edu
On Wed, Feb 01, 2012 at 04:12:58PM -0600, Jim Nasby wrote:
 On Jan 26, 2012, at 9:32 PM, Robert Haas wrote:
  But if we want to put it on a diet, the first thing I'd probably be
  inclined to lose is the float4 specialization.  Some members of the
  audience will recall that I take dim view of floating point arithmetic
  generally, but I'll concede that there are valid reasons for using
  float8.  I have a harder time coming up with a good reason to use
  float4 - ever, for anything you care about.  So I would be inclined to
  think that if we want to trim this back a bit, maybe that's the one to
  let go.  If we want to be even more aggressive, the next thing I'd
  probably lose is the optimization of multiple sortkey cases, on the
  theory that single sort keys are probably by far the most common
  practical case.
 
 I do find float4 to be useful, though it's possible that my understanding is 
 flawed…
 
 We end up using float to represent ratios in our database; things that 
 really, honest to God do NOT need to be exact.
 
 In most cases, 7 digits of precision (which AFAIK is what you're guaranteed 
 with float4) is plenty, so we use float4 rather than bloat the database 
 (though, since we're on 64bit hardware I guess that distinction is somewhat 
 moot…).
 
 Is there something I'm missing that would make float4 useless as compared to 
 float8?
 --
 Jim C. Nasby, Database Architect   j...@nasby.net
 512.569.9461 (cell) http://jim.nasby.net
 
If the values stored are float4, it would be nice to have that fast-path
sort available too. The cases where I have used float4 values in the past,
I absolutely did not need any of the float8 baggage and in my case, using
the actual float4 comparison operator resulted in a significant time savings
over the normal float8. This could be processor specific, but it would be
worth testing before throwing it out.

Regards,
Ken



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-01 Thread Peter Geoghegan
On 31 January 2012 19:47, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Jan 27, 2012 at 3:33 PM, Peter Geoghegan pe...@2ndquadrant.com 
 wrote:
 Patch is attached. I have not changed the duplicate functions. This is
 because I concluded that it was the lesser of two evils to have to get
 the template to generate both declarations in the header file, and
 definitions in the .c file - that seemed particularly obscure. We're
 never going to have to expose/duplicate any more comparators anyway.
 Do you agree?

 Not really.  You don't really need macros to generate the prototypes;
 you could just write them out longhand.

Well, since that would have the advantage of forcing the client of
those macros to explicitly acknowledge what specialisations they were
getting in order to get a warning-free build, that might be a better
approach (though uncalled static functions are discarded anyway).
However, I don't really see how you expect me to make the
comparator-proper functions within nbtcompare.c inline without some
duplication. How can I have inline functions in that .c file, while
still having those functions callable from other TUs, while avoiding
duplication/moving the functions? I suppose that I could do some trick
with macros, but once again I believe that the cure is worse than the
disease - is it really such a big deal to duplicate a handful of
trivial functions, given that we're never going to have to expand that
number?
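
For what it's worth, the usual shape of that trick is a static inline
definition plus a thin extern wrapper for other translation units,
which costs exactly the trivial duplication described above. A sketch
(hypothetical code, reusing the patch's btint4fastcmp name):

static inline int
btint4fastcmp_inline(Datum x, Datum y, SortSupport ssup)
{
	int32		a = DatumGetInt32(x);
	int32		b = DatumGetInt32(y);

	if (a < b)
		return -1;
	if (a > b)
		return 1;
	return 0;
}

int
btint4fastcmp(Datum x, Datum y, SortSupport ssup)
{
	return btint4fastcmp_inline(x, y, ssup);
}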

 I think there's a mess of naming confusion in here, though, as perhaps
 best illustrated by this macro definition:

 #define TEMPLATE_QSORT_ARG_HEAP(TYPE, COMPAR)                               \
 DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, inlheap,                                \
        SING_ADDITIONAL_CODE, TYPE##inlheapcomparetup_inline)               \
 DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, regheap,                                \
        MULT_ADDITIONAL_CODE(TYPE##regheapAppFunc),                         \
            TYPE##regheapcomparetup_inline)

 The idea here is that when we have only a single sort key, we include
 SING_ADDITIONAL_CODE in the tuple comparison function, whereas when we
 have more than one, we instead include MULT_ADDITIONAL_CODE.  Right
 there, I think we have a naming problem, because abbreviating "single"
 to "sing" and "multiple" to "mult" is less than entirely clear.  For a
 minute or two I was trying to figure out whether our sorting code was
 musically inclined, and I'm a native English speaker. But then we
 switch to another set of terminology completely for the generated
 functions: inlheap for the single-key case, and regheap for the
 multiple-key case.   I find that even more confusing.

I'm happy to clean that up, though I think that when you try and
browbeat C into the role of supporting generic programming, confusing
code is more or less inevitable.

 I think we ought to get rid of this:

 +typedef enum TypeCompar
 +{
 +       TYPE_COMP_OTHER,
 +       TYPE_COMP_INT4,
 +       TYPE_COMP_INT8,
 +       TYPE_COMP_FLOAT4,
 +       TYPE_COMP_FLOAT8,
 +       TYPE_COMP_FULL_SPECIALISATION
 +} TypeCompar;

 Instead, just modify SortSupportData to have two function pointers as
 members, one for the single-key case and another for the multiple-key
 case, and have the sortsupport functions initialize them to the actual
 functions that should be called.  The layer of indirection, AFAICS,
 serves no purpose.

It anticipates a time when the system will have both btree and heap
variants of the specialisations. It also serves to usefully document
the intent of the optimisation - the most direct interface isn't
necessarily the best. That said, I really am looking for the path of
least resistance towards getting this committed at this stage, so if
you still think it's a bad idea, I won't complain if you modify the
patch to have multiple function pointers as you described.

 It's pretty easy to remove a specialisation at any time - just remove
 less than 10 lines of code. It's also pretty difficult to determine,
 with everyone's absolute confidence, where the right balance lies.
 Perhaps the sensible thing to do is to not be so conservative in what
 we initially commit, while clearly acknowledging that we may not have
 the balance right, and that it may have to change. We then have the
 entire beta part of the cycle in which to decide to roll back from
 that position, without any plausible downside. If, on the other hand,
 we conservatively lean towards fewer specialisations in the initial
 commit, no one will complain about the lack of an improvement in
 performance that they never had.

 Eh, really?  Typically when we do something good, the wolves are
 howling at the door to make it work in more cases.

Well, if anyone was complaining about a regression, because the
optimisation represented a net loss for their reasonable use-case,
then clearly we wouldn't be doing well, so we'd have to reconsider our
position, and no sensible wolf is going to argue with that. Obviously

Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-01 Thread Jim Nasby
On Jan 26, 2012, at 9:32 PM, Robert Haas wrote:
 But if we want to put it on a diet, the first thing I'd probably be
 inclined to lose is the float4 specialization.  Some members of the
 audience will recall that I take dim view of floating point arithmetic
 generally, but I'll concede that there are valid reasons for using
 float8.  I have a harder time coming up with a good reason to use
 float4 - ever, for anything you care about.  So I would be inclined to
 think that if we want to trim this back a bit, maybe that's the one to
 let go.  If we want to be even more aggressive, the next thing I'd
 probably lose is the optimization of multiple sortkey cases, on the
 theory that single sort keys are probably by far the most common
 practical case.

I do find float4 to be useful, though it's possible that my understanding is 
flawed…

We end up using float to represent ratios in our database; things that really, 
honest to God do NOT need to be exact.

In most cases, 7 digits of precision (which AFAIK is what you're guaranteed 
with float4) is plenty, so we use float4 rather than bloat the database 
(though, since we're on 64bit hardware I guess that distinction is somewhat 
moot…).

Is there something I'm missing that would make float4 useless as compared to 
float8?
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-01 Thread Alvaro Herrera

Excerpts from Jim Nasby's message of Wed Feb 01 19:12:58 -0300 2012:
 On Jan 26, 2012, at 9:32 PM, Robert Haas wrote:
  But if we want to put it on a diet, the first thing I'd probably be
  inclined to lose is the float4 specialization.  Some members of the
  audience will recall that I take dim view of floating point arithmetic
  generally, but I'll concede that there are valid reasons for using
  float8.  I have a harder time coming up with a good reason to use
  float4 - ever, for anything you care about.  So I would be inclined to
  think that if we want to trim this back a bit, maybe that's the one to
  let go.  If we want to be even more aggressive, the next thing I'd
  probably lose is the optimization of multiple sortkey cases, on the
  theory that single sort keys are probably by far the most common
  practical case.
 
 I do find float4 to be useful, though it's possible that my understanding is 
 flawed…
 
 We end up using float to represent ratios in our database; things that 
 really, honest to God do NOT need to be exact.

But then, do you sort using those ratios?

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-31 Thread Robert Haas
On Fri, Jan 27, 2012 at 3:33 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 Patch is attached. I have not changed the duplicate functions. This is
 because I concluded that it was the lesser of two evils to have to get
 the template to generate both declarations in the header file, and
 definitions in the .c file - that seemed particularly obscure. We're
 never going to have to expose/duplicate any more comparators anyway.
 Do you agree?

Not really.  You don't really need macros to generate the prototypes;
you could just write them out longhand.

I think there's a mess of naming confusion in here, though, as perhaps
best illustrated by this macro definition:

#define TEMPLATE_QSORT_ARG_HEAP(TYPE, COMPAR)                               \
DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, inlheap,                                \
        SING_ADDITIONAL_CODE, TYPE##inlheapcomparetup_inline)               \
DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, regheap,                                \
        MULT_ADDITIONAL_CODE(TYPE##regheapAppFunc),                         \
        TYPE##regheapcomparetup_inline)

The idea here is that when we have only a single sort key, we include
SING_ADDITIONAL_CODE in the tuple comparison function, whereas when we
have more than one, we instead include MULT_ADDITIONAL_CODE.  Right
there, I think we have a naming problem, because abbreviating "single"
to "sing" and "multiple" to "mult" is less than entirely clear.  For a
minute or two I was trying to figure out whether our sorting code was
musically inclined, and I'm a native English speaker.  But then we
switch to another set of terminology completely for the generated
functions: inlheap for the single-key case, and regheap for the
multiple-key case.   I find that even more confusing.

I think we ought to get rid of this:

+typedef enum TypeCompar
+{
+   TYPE_COMP_OTHER,
+   TYPE_COMP_INT4,
+   TYPE_COMP_INT8,
+   TYPE_COMP_FLOAT4,
+   TYPE_COMP_FLOAT8,
+   TYPE_COMP_FULL_SPECIALISATION
+} TypeCompar;

Instead, just modify SortSupportData to have two function pointers as
members, one for the single-key case and another for the multiple-key
case, and have the sortsupport functions initialize them to the actual
functions that should be called.  The layer of indirection, AFAICS,
serves no purpose.
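
A sketch of the struct shape being suggested, with hypothetical member
names (the real SortSupportData has other members, omitted here):

typedef struct SortSupportData
{
	/* ... existing members ... */

	/* filled in directly by the type's sort support function */
	void		(*qsort_single_key) (struct SortTuple *a, size_t n,
					     struct Tuplesortstate *state);
	void		(*qsort_multi_key) (struct SortTuple *a, size_t n,
					    struct Tuplesortstate *state);
} SortSupportData;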

 It's pretty easy to remove a specialisation at any time - just remove
 less than 10 lines of code. It's also pretty difficult to determine,
 with everyone's absolute confidence, where the right balance lies.
 Perhaps the sensible thing to do is to not be so conservative in what
 we initially commit, while clearly acknowledging that we may not have
 the balance right, and that it may have to change. We then have the
 entire beta part of the cycle in which to decide to roll back from
 that position, without any plausible downside. If, on the other hand,
 we conservatively lean towards fewer specialisations in the initial
 commit, no one will complain about the lack of an improvement in
 performance that they never had.

Eh, really?  Typically when we do something good, the wolves are
howling at the door to make it work in more cases.

 I think that possibly the one remaining blocker to tentatively
 committing this with all specialisations intact is that I haven't
 tested this on Windows, as I don't currently have access to a Windows
 development environment. I have set one up before, but it's a huge
 pain. Can anyone help me out?

This doesn't strike me as terribly OS-dependent, unless by that we
mean compiler-dependent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-27 Thread Robert Haas
On Thu, Jan 26, 2012 at 11:36 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 I'm not surprised that you weren't able to measure a performance
 regression from the binary bloat.  Any such regression is bound to be
 very small and probably quite difficult to notice most of the time;
 it's really the cumulative effect of many binary-size-increasing
 changes we're worried about, not each individual one.  Certainly,
 trying to shrink the binary is micro-optimization at its finest, but
 then so is inlining comparators.  I don't think it can be
 realistically argued that increasing the size of the binary
 arbitrarily will never get us in trouble, much like (for a typical
 American family) spending $30 to have dinner at a cheap restaurant
 will never break the budget.  But if you do it every day, it gets
 expensive (and fattening).

 Sure. At the risk of stating the obvious, and of repeating myself, I
 will point out that the true cost of increasing the size of the binary
 is not necessarily linear - it's a complex equation. I hope that this
 doesn't sound flippant, but if some naive person were to look at just
 the increasing binary size of Postgres and its performance in each
 successive release, they might conclude that there was a positive
 correlation between the two (since we didn't add flab to the binary,
 but muscle that pulls its own weight and then some).

I completely agree.  So the point is that, when faced with a patch that
adds an atypically large number of CPU instructions, we ought to ask
ourselves whether those instructions are pulling their weight.  By way
of comparison, the latest posted version of Tom's generalized
index-only paths patch adds 14kB to the resulting executable (on my
Mac).  Yours adds 55kB.  We might ask ourselves whether the benefits
of this patch are four times greater than the benefits of Tom's patch.
 It's pretty hard to compare them directly since they're doing very
different things, and all of this is completely subjective, but I
doubt it.  On the other hand, there is no rule that every byte of code
that gets committed must be made of solid gold, either.  So, once
again: judgement call.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-27 Thread Peter Geoghegan
Uh, obviously I meant causal relationship and not correlation.

On 27 January 2012 13:37, Robert Haas robertmh...@gmail.com wrote:
 I completely agree.  So the point is that, when faced with a patch that
 adds an atypically large number of CPU instructions, we ought to ask
 ourselves whether those instructions are pulling their weight.  By way
 of comparison, the latest posted version of Tom's generalized
 index-only paths patch adds 14kB to the resulting executable (on my
 Mac).  Yours adds 55kB.  We might ask ourselves whether the benefits
 of this patch are four times greater than the benefits of Tom's patch.
  It's pretty hard to compare them directly since they're doing very
 different things, and all of this is completely subjective, but I
 doubt it.  On the other hand, there is no rule that every byte of code
 that gets committed must be made of solid gold, either.  So, once
 again: judgement call.

Well, I don't think it's all that subjective - it's more the case that
it is just difficult, or it gets that way as you consider more
specialisations.

As for what types/specialisations may not make the cut, I'm
increasingly convinced that floats (in the following order: float4,
float8) should be the first to go. Aside from the fact that we cannot
use their specialisations for anything like dates and timestamps,
floats are just way less useful than integers in the context of
database applications, or at least those that I've been involved with.
As important as floats are in the broad context of computing, it's
usually only acceptable to store data in a database as floats within
scientific applications, and only then when their limitations are
well-understood and acceptable. I think we've all heard anecdotes at
one time or another, involving their limitations not being well
understood.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-27 Thread Robert Haas
On Fri, Jan 27, 2012 at 9:27 AM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 Well, I don't think it's all that subjective - it's more the case that
 it is just difficult, or it gets that way as you consider more
 specialisations.

Sure it's subjective.  Two well-meaning people could have different
opinions without either of them being wrong.  If you do a lot of
small, in-memory sorts, more of this stuff is going to seem worthwhile
than if you don't.

 As for what types/specialisations may not make the cut, I'm
 increasingly convinced that floats (in the following order: float4,
 float8) should be the first to go. Aside from the fact that we cannot
 use their specialisations for anything like dates and timestamps,
 floats are just way less useful than integers in the context of
 database applications, or at least those that I've been involved with.
 As important as floats are in the broad context of computing, it's
 usually only acceptable to store data in a database as floats within
 scientific applications, and only then when their limitations are
 well-understood and acceptable. I think we've all heard anecdotes at
 one time or another, involving their limitations not being well
 understood.

While we're waiting for anyone else to weigh in with an opinion on the
right place to draw the line here, do you want to post an updated
patch with the changes previously discussed?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-26 Thread Robert Haas
On Thu, Jan 19, 2012 at 1:47 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 Thoughts?

I generated some random data using this query:

create table nodups (g) as select (g%10)*1000+g/10 from
generate_series(1,10000) g;

And then used pgbench to repeatedly sort it using this query:

select * from nodups order by g offset 10001;

The patch came out about 28% faster than master.  Admittedly, that's
with no client overhead, but still: not bad.

I don't like the API you've designed, though: as you have it,
PrepareSortSupportFromOrderingOp does this after calling the sort
support function:

+   ssup->usable_compar = ResolveComparatorProper(sortFunction);

I think it would be better to simply have the sort support functions
set usable_compar themselves.  That would allow any user-defined
type that happens to have the same binary representation and
comparison rules as one of the types for which we supply a custom
qsort() to still make use of the optimization.
There's no real reason to have a separate switch to decide how to
initialize that field: the sort support trampoline already does that,
and I don't see any reason to introduce a second way of doing the same
thing.
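
A sketch of what that would look like at the sortsupport function
itself (a hypothetical assignment, using the patch's usable_compar
field and TYPE_COMP_INT4 value):

Datum
btint4sortsupport(PG_FUNCTION_ARGS)
{
	SortSupport	ssup = (SortSupport) PG_GETARG_POINTER(0);

	ssup->comparator = btint4fastcmp;
	ssup->usable_compar = TYPE_COMP_INT4;	/* set here, not in a
						 * separate switch */
	PG_RETURN_VOID();
}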

I am also a little unhappy that we have to duplicate the fastcmp
functions from nbtcompare.c in builtins.h.  Wouldn't it make more
sense to declare those functions as inline in nbtcompare.c, and then
call the qsort-generating macro from that file?

There were a couple of comment updates in tuplesort.c that looked
independent from the rest of the patch, so I've committed those
separately.  I also committed your change to downgrade the
belt-and-suspenders check for self-comparison to an assert, with some
rewording of your proposed comment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-26 Thread Peter Geoghegan
On 26 January 2012 19:45, Robert Haas robertmh...@gmail.com wrote:
 The patch came out about 28% faster than master.  Admittedly, that's
 with no client overhead, but still: not bad.

Thanks. There was a 28% reduction in the time it took to execute the
query, but there would have also been a larger reduction in the time
that the backend held all of that locally-allocated memory. That
might also be worth instrumenting directly, by turning on trace_sort
- can you report numbers on that, please?

 I don't like the API you've designed, though: as you have it,
 PrepareSortSupportFromOrderingOp does this after calling the sort
 support function:

 +       ssup->usable_compar = ResolveComparatorProper(sortFunction);

 I think it would be better to simply have the sort support functions
 set usable_compar themselves.  That would allow any user-defined
 type that happens to have the same binary representation and
 comparison rules as one of the types for which we supply a custom
 qsort() to still make use of the optimization.
 There's no real reason to have a separate switch to decide how to
 initialize that field: the sort support trampoline already does that,
 and I don't see any reason to introduce a second way of doing the same
 thing.

Hmm. You're right. I can't believe that that didn't occur to me. In
practice, types that use the SortSupport API are all going to be
façades on scalar types anyway, much like date and timestamp, and of
those a good proportion will surely have the same comparator
representation as the specialisations introduced by this patch. It
might be that virtually all third-party types that end up using the
API can avail of some specialisation.
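
For example, a façade type whose representation is an int32, as date
is, could have its sortsupport function simply install the int4 fast
comparator. A hypothetical sketch:

Datum
date_sortsupport(PG_FUNCTION_ARGS)
{
	SortSupport	ssup = (SortSupport) PG_GETARG_POINTER(0);

	/* DateADT is an int32 under the hood, so int4's comparator applies */
	ssup->comparator = btint4fastcmp;
	PG_RETURN_VOID();
}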

 I am also a little unhappy that we have to duplicate the fastcmp
 functions from nbtcompare.c in builtins.h.  Wouldn't it make more
 sense to declare those functions as inline in nbtcompare.c, and then
 call the qsort-generating macro from that file?

Maybe it would, but since the meta-qsort_arg introduced includes
partial duplicates of code from tuplesort.c, it kind of felt right to
instantiate specialisations there. It may be that doing it in
nbtcompare.c is the best option available to us. Off the top of my
head, I'm pretty sure that that's a good bit less code.

 There were a couple of comment updates in tuplesort.c that looked
 independent from the rest of the patch, so I've committed those
 separately.  I also committed your change to downgrade the
 belt-and-suspenders check for self-comparison to an assert, with some
 rewording of your proposed comment.

That seems reasonable.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-26 Thread Robert Haas
On Thu, Jan 26, 2012 at 4:09 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 On 26 January 2012 19:45, Robert Haas robertmh...@gmail.com wrote:
 The patch came out about 28% faster than master.  Admittedly, that's
 with no client overhead, but still: not bad.

 Thanks. There was a 28% reduction in the time it took to execute the
 query, but there would have also been a larger reduction in the time
 that the backend held all of that locally-allocated memory. That
 might also be worth instrumenting directly, by turning on trace_sort
 - can you report numbers on that, please?

Apparently not.  The sort is too short to register in the trace_sort
output.  I just get:

LOG:  begin tuple sort: nkeys = 1, workMem = 1024, randomAccess = f
LOG:  performsort starting: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  performsort done: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG:  internal sort ended, 853 KB used: CPU 0.00s/0.00u sec elapsed 0.00 sec

 I don't like the API you've designed, though: as you have it,
 PrepareSortSupportFromOrderingOp does this after calling the sort
 support function:

 +       ssup->usable_compar = ResolveComparatorProper(sortFunction);

 I think it would be better to simply have the sort support functions
 set usable_compar themselves.  That would allow any user-defined
 functions that happen to have the same binary representation and
 comparison rules as one of the types for which we supply a custom
 qsort() to still make use of the optimization.
 There's no real reason to have a separate switch to decide how to
 initialize that field: the sort support trampoline already does that,
 and I don't see any reason to introduce a second way of doing the same
 thing.

 Hmm. You're right. I can't believe that that didn't occur to me. In
 practice, types that use the SortSupport API are all going to be
 façades on scalar types anyway, much like date and timestamp, and of
 those a good proportion will surely have the same comparator
 representation as the specialisations introduced by this patch. It
 might be that virtually all third-party types that end up using the
 API can avail of some specialisation.

Possibly.  At a minimum it keeps the door open.

 I am also a little unhappy that we have to duplicate the fastcmp
 functions from nbtcompare.c in builtins.h.  Wouldn't it make more
 sense to declare those functions as inline in nbtcompare.c, and then
 call the qsort-generating macro from that file?

 Maybe it would, but since the meta-qsort_arg introduced by the patch includes
 partial duplicates of code from tuplesort.c, it kind of felt right to
 instantiate specialisations there. It may be that doing it in
 nbtcompare.c is the best option available to us. Off the top of my
 head, I'm pretty sure that that's a good bit less code.

I was hoping so...

 There were a couple of comment updates in tuplesort.c that looked
 independent from the rest of the patch, so I've committed those
 separately.  I also committed your change to downgrade the
 belt-and-suspenders check for self-comparison to an assert, with some
 rewording of your proposed comment.

 That seems reasonable.

Cool.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-26 Thread Peter Geoghegan
Alright, so while I agree with everything you've asked for, the fact
is that there is a controversy in relation to binary bloat, and that's
the blocker here. How can we satisfactorily resolve that, or is that
question adequately addressed by the benchmark that I produced?

What if third party modules could include the template_qsort_arg.h
header, and instantiate their own specialisation? The sort support
function could either set an enum, or set a pointer to a qsort_arg
specialisation, if their comparator was a little different, but still
just a few instructions, as with float-based timestamps (I don't care
enough about them to provide one in core, though). It would also
essentially allow for user-defined sort functions, provided they
fulfilled a basic interface. They may not even have to be
comparison-based. I know that I expressed scepticism about the weird
and wonderful ideas that some people have put forward in that area,
but that's mostly because I don't think that GPU based sorting in a
database is going to be practical.
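
As a sketch of what that might look like for a module author (the
TEMPLATE_QSORT_ARG macro name and its parameters are my assumptions
about the header's interface, purely for illustration):

#include "utils/template_qsort_arg.h"

/* inline comparator for a hypothetical fixed-width pass-by-value type */
static inline int
mytype_fastcmp(Datum a, Datum b, SortSupport ssup)
{
	int32		av = DatumGetInt32(a);
	int32		bv = DatumGetInt32(b);

	return (av < bv) ? -1 : ((av > bv) ? 1 : 0);
}

/* stamps out a specialisation callable as mytype_qsort_arg() */
TEMPLATE_QSORT_ARG(mytype, mytype_fastcmp);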

Why not add this capability to the SortSupport API, given that it
wouldn't be very hard? The user would set the enum too, so it would
appear in explain analyze output as "custom sort" or something like
that.

While a sorting specialisation of our qsort_arg is actually pretty
close to optimal for scalar types and façades thereof, there could be
a type that would benefit from using radix sort, for example, which
isn't even comparison-based, and a user couldn't very well do that
with the existing SortSupport API.
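
To show what I mean by not even comparison-based, here is a
self-contained sketch of an LSD radix sort on uint32 keys - standalone
C for illustration only, nothing resembling what the patch does:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static void
radix_sort_uint32(uint32_t *keys, size_t n)
{
	uint32_t   *scratch = malloc(n * sizeof(uint32_t));
	int			shift;

	if (scratch == NULL)
		return;					/* real code would report OOM */

	/* one pass per byte, least significant first */
	for (shift = 0; shift < 32; shift += 8)
	{
		size_t		count[257] = {0};
		size_t		i;
		int			b;

		/* histogram of the current byte */
		for (i = 0; i < n; i++)
			count[((keys[i] >> shift) & 0xFF) + 1]++;
		/* prefix sums turn counts into bucket start offsets */
		for (b = 0; b < 256; b++)
			count[b + 1] += count[b];
		/* stable scatter into scratch, then copy back */
		for (i = 0; i < n; i++)
			scratch[count[(keys[i] >> shift) & 0xFF]++] = keys[i];
		memcpy(keys, scratch, n * sizeof(uint32_t));
	}
	free(scratch);
}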

I certainly don't care about this capability enough to defend it
against any objections that anyone may have, especially at this late
stage in the cycle. I just think that we might as well have it.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-26 Thread Robert Haas
On Thu, Jan 26, 2012 at 5:10 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 Alright, so while I agree with everything you've asked for, the fact
 is that there is a controversy in relation to binary bloat, and that's
 the blocker here. How can we satisfactorily resolve that, or is that
 question adequately addressed by the benchmark that I produced?

I could go either way on that, honestly.  I took a look today and
found out that on my machine, a postgres 9.0.x executable was about
282kB larger than an 8.4 postgres executable.  9.1 was about 386kB
larger than that, and 9.2 current is 239kB larger still.  So it's not
as if your patch is the first one to enlarge the size of the binary -
it's obviously been growing steadily from release to release for
years.  I found that your patch adds 55kB to the executable size,
which is clearly a lot more than what a typical patch adds, but still
under 1%.  So I wouldn't be shocked and appalled if we decided to
commit this as-is.

But if we want to put it on a diet, the first thing I'd probably be
inclined to lose is the float4 specialization.  Some members of the
audience will recall that I take a dim view of floating point arithmetic
generally, but I'll concede that there are valid reasons for using
float8.  I have a harder time coming up with a good reason to use
float4 - ever, for anything you care about.  So I would be inclined to
think that if we want to trim this back a bit, maybe that's the one to
let go.  If we want to be even more aggressive, the next thing I'd
probably lose is the optimization of multiple sortkey cases, on the
theory that single sort keys are probably by far the most common
practical case.

I'm not surprised that you weren't able to measure a performance
regression from the binary bloat.  Any such regression is bound to be
very small and probably quite difficult to notice most of the time;
it's really the cumulative effect of many binary-size-increasing
changes we're worried about, not each individual one.  Certainly,
trying to shrink the binary is micro-optimization at its finest, but
then so is inlining comparators.  I don't think it can realistically
be argued that increasing the size of the binary arbitrarily will
never get us in trouble, much as (for a typical American family)
spending $30 to have dinner at a cheap restaurant will never break
the budget.  But if you do it every day, it gets
expensive (and fattening).

 What if third party modules could include the template_qsort_arg.h
 header, and instantiate their own specialisation?

Seems reasonable to me.

 The sort support
 function could either set an enum, or set a pointer to a qsort_arg
 specialisation, if their comparator was a little different, but still

The latter seems better.

 just a few instructions, as with float-based timestamps (I don't care
 enough about them to provide one in core, though). It would also
 essentially allow for user-defined sort functions, provided they
 fulfilled a basic interface. They may not even have to be
 comparison-based. I know that I expressed scepticism about the weird
 and wonderful ideas that some people have put forward in that area,
 but that's mostly because I don't think that GPU based sorting in a
 database is going to be practical.

A question for another day.

 Why not add this capability to the SortSupport API, given that it
 wouldn't be very hard? The user would set the enum too, so it would
 appear in explain analyze output as "custom sort" or something like
 that.

I'm unenthused about going to any trouble to change the EXPLAIN format.

 While a sorting specialisation of our qsort_arg is actually pretty
 close to optimal for scalar types and façades thereof, there could be
 a type that would benefit from using radix sort, for example, which
 isn't even comparison-based, and a user couldn't very well do that
 with the existing SortSupport API.

 I certainly don't care about this capability enough to defend it
 against any objections that anyone may have, especially at this late
 stage in the cycle. I just think that we might as well have it.

I don't see any reason not to, assuming it's not a lot of code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-26 Thread Peter Geoghegan
On 27 January 2012 03:32, Robert Haas robertmh...@gmail.com wrote:
 But if we want to put it on a diet, the first thing I'd probably be
 inclined to lose is the float4 specialization.  Some members of the
 audience will recall that I take a dim view of floating point arithmetic
 generally, but I'll concede that there are valid reasons for using
 float8.  I have a harder time coming up with a good reason to use
 float4 - ever, for anything you care about.  So I would be inclined to
 think that if we want to trim this back a bit, maybe that's the one to
 let go.  If we want to be even more aggressive, the next thing I'd
 probably lose is the optimization of multiple sortkey cases, on the
 theory that single sort keys are probably by far the most common
 practical case.

Obviously I don't think that we should let anything go, as the
improvement in performance is so large that we're bound to be ahead -
we only really pay for what we use anyway, and when we use a
specialisation the difference is really quite big, particularly when
you look at sorting in isolation.  If a specialisation is never used,
it is more or less never paid for, so there's no point in worrying
about that. That said, float4 is obviously the weakest link. I'm
inclined to think that float8 is the second weakest though, mostly
because we get both dates and timestamps for free with the integer
specialisations.

 I'm not surprised that you weren't able to measure a performance
 regression from the binary bloat.  Any such regression is bound to be
 very small and probably quite difficult to notice most of the time;
 it's really the cumulative effect of many binary-size-increasing
 changes we're worried about, not each individual one.  Certainly,
 trying to shrink the binary is micro-optimization at its finest, but
 then so is inlining comparators.  I don't think it can realistically
 be argued that increasing the size of the binary arbitrarily will
 never get us in trouble, much as (for a typical American family)
 spending $30 to have dinner at a cheap restaurant will never break
 the budget.  But if you do it every day, it gets
 expensive (and fattening).

Sure. At the risk of stating the obvious, and of repeating myself, I
will point out that the true cost of increasing the size of the binary
is not necessarily linear - it's a complex equation. I hope that this
doesn't sound flippant, but if some naive person were to look at just
the increasing binary size of Postgres and its performance in each
successive release, they might conclude that there was a positive
correlation between the two (since we didn't add flab to the binary,
but muscle that pulls its own weight and then some).

At the continued risk of stating the obvious, CPUs don't just cache
instructions - they cache data too.  If we spend less than half the
time sorting data, which is the level of improvement I was able to
demonstrate against pre-SortSupport Postgres, that will surely very
often have the aggregate effect of ameliorating cache contention
between cores.

 just a few instructions, as with float-based timestamps (I don't care
 enough about them to provide one in core, though). It would also
 essentially allow for user-defined sort functions, provided they
 fulfilled a basic interface. They may not even have to be
 comparison-based. I know that I expressed scepticism about the weird
 and wonderful ideas that some people have put forward in that area,
 but that's mostly because I don't think that GPU based sorting in a
 database is going to be practical.

 A question for another day.

Fair enough.

 I certainly don't care about this capability enough to defend it
 against any objections that anyone may have, especially at this late
 stage in the cycle. I just think that we might as well have it.

 I don't see any reason not to, assuming it's not a lot of code.

Good.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-21 Thread Peter Geoghegan
I decided to take advantage of my ongoing access to a benchmarking
server to see how I could get on with a query with an especially large
sort. Now, that box has 16GB of ram, and some people have that much
memory in their laptops these days, so I was somewhat limited in how
far I could push things.

I eventually decided upon much the same benchmark as I'd used in my
previous e-mail (the query "SELECT * FROM pgbench_accounts ORDER BY
bid;", but primed with random data, while being sure to subsequently
vacuum).

I stress that I have not made any attempt to artificially remove
client overhead. I have, however, used a faster disk (caches were not
warmed in a separate run of pgbench or anything like that for either
of my two most recent benchmarks, so there may have been and indeed
may still be some small bias there), in addition to connecting pgbench
with the -h option. Apparently, a known problem with Linux kernels
using the Completely Fair Scheduler introduced in 2.6.23 is that they
do not schedule the pgbench program very well when it connects to the
database using Unix-domain sockets, as I had been doing until now.
I'm not sure that this had all that much potential to spoil my
results, but I didn't want to take the chance.

Anyway, the results of running this latest benchmark, with 20
successive runs of each large query, were:

Patch:

pghost: localhost pgport:  nclients: 1 nxacts: 20 dbName: pgbench
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
number of transactions per client: 20
number of transactions actually processed: 20/20
tps = 0.030924 (including connections establishing)
tps = 0.030924 (excluding connections establishing)
statement latencies in milliseconds:
31163.957400  SELECT * FROM pgbench_accounts ORDER BY bid;

Master:

pghost: localhost pgport:  nclients: 1 nxacts: 20 dbName: pgbench
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
number of transactions per client: 20
number of transactions actually processed: 20/20
tps = 0.026769 (including connections establishing)
tps = 0.026769 (excluding connections establishing)
statement latencies in milliseconds:
36179.508750  SELECT * FROM pgbench_accounts ORDER BY bid;

That's a larger proportional improvement than reported on Thursday,
and at significantly greater scale.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-09 Thread Josh Berkus

 Obviously, many indexes are unique and thus won't have duplicates at
 all.  But if someone creates an index and doesn't make it unique, odds
 are very high that it has some duplicates.  Not sure how many we
 typically expect to see, but more than zero...

Peter may not, but I personally admin lots of databases which have
indexes on values like "category" or "city" which have 100's or 1000's
of duplicates per value.  I don't think this is uncommon at all.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-09 Thread Peter Geoghegan
On 9 January 2012 19:45, Josh Berkus j...@agliodbs.com wrote:
 Obviously, many indexes are unique and thus won't have duplicates at
 all.  But if someone creates an index and doesn't make it unique, odds
 are very high that it has some duplicates.  Not sure how many we
 typically expect to see, but more than zero...

 Peter may not, but I personally admin lots of databases which have
 indexes on values like "category" or "city" which have 100's or 1000's
 of duplicates per value.  I don't think this is uncommon at all.

Uh, then all the more reason to do what I recommend, I imagine. There
is most definitely a large overhead to creating such indexes, at least
for scalar types. As far as I can tell, Tom's complaint is quite
speculative.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-07 Thread Peter Geoghegan
On 6 January 2012 21:14, Tom Lane t...@sss.pgh.pa.us wrote:
 When there are lots of duplicates of a particular indexed value, the
 existing code will cause an indexscan to search them in physical order,
 whereas if we remove the existing logic it'll be random --- in
 particular, accesses to the same heap page can no longer be expected to
 be clustered.

Isn't it possible to get them in physical order anyway, by reading
them into memory in that order? Efficient quick sort implementations
are not stable, and ours is no exception, but we could perhaps come up
with a cheaper tie-breaker value at that stage, if you're determined
to maintain this behaviour. We have sufficient incentive to, as I
describe below.
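
A minimal sketch of such a tie-breaker (reusing SortTuple's existing
tupindex field to record read order is purely my assumption for
illustration - the real cost would be in maintaining that value):

static int
comparetup_index_btree_stable(const SortTuple *a, const SortTuple *b,
							  Tuplesortstate *state)
{
	int			r = comparetup_index_btree(a, b, state);

	if (r != 0)
		return r;
	/* equal keys: fall back to read order, which for a sequential heap
	 * scan is physical order */
	return (a->tupindex < b->tupindex) ? -1 :
		((a->tupindex > b->tupindex) ? 1 : 0);
}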

 Admittedly, I don't have any numbers quantifying just how useful that
 might be, but on the other hand you've not presented any evidence
 justifying removing the behavior either.  If we believe your position
 that indexes don't generally have lots of duplicates, then the code in
 question will seldom be reached and therefore there would be no
 performance benefit in removing it.

I ran the same btree benchmark on master, but without the cheap
insurance. The results were interesting, to say the least.

The columns indexed were the same columns and data that I've been
using all along. Initially this made sense, as the performance of
multi sort key sorts often largely hinged on being able to get away
with doing one comparison per pair of tuples - with many duplicates, I
could avoid cheating and show something closer to worst case for the
patch.

I didn't think it mattered that indexing the same columns would
produce what happened to be a not so useful index in the real world,
due to having so many duplicates - better to have figures that were
somewhat comparable for btree tuple sorting and heap tuple sorting.

When I ran the same benchmark on a server that differed from master
only in that there was no insurance, it momentarily appeared that
*all* of the gains for btree index creation came from being able to
elide the cheap insurance, but only where it would have to be paid
for a high number of times.

I soon realised that I'd made a blunder: the code (that is, the patch
that I posted most recently) wasn't even using my specialisation for
qsorting, because the SortSupport pointer was null! I did not have
tuplesort_begin_index_btree initialise the SortSupport struct as
tuplesort_begin_heap did, so my earlier benchmark was effectively
meaningless, except that it indicated the benefits of eliding the
cheap insurance alone, if only for that not so compelling case. You
should note that the benefits of not paying for the insurance can be
very significant indeed.

Attached are figures for an identical run of the same btree python
script, but with a version of the patch that actually uses my
specialisations. Granted, these numbers are still partially predicated
on the index in question having a large number of duplicates, but it's
worth noting:

1. The gain from specialisation isn't bad; not as good as the
improvements we saw for heap tuples, but not so bad either, especially
considering that binary bloat should be much less controversial for
btree tuples.

2. The index that results from the tests is still useful; the planner
is perfectly willing to use it rather than performing an in-memory
sort. It will also use it to satisfy a query like "select * from
orderlines where prod_id = 5", albeit via a bitmap index scan. I took
the precaution of increasing default_statistics_target to its maximum
value, as well as performing an ANALYZE on orderlines in advance of
checking this.

A revision to this patch that fixes the bug will follow - I produced
these new numbers from a rough cut of it.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


results_server_btree_w_specializations.ods
Description: application/vnd.oasis.opendocument.spreadsheet



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-06 Thread Robert Haas
On Thu, Jan 5, 2012 at 5:27 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 There is no compiler anywhere that implements "always inline", unless
 you are talking about a macro.  "inline" is a hint and nothing more,
 and if you think you can force it you are mistaken.  So this controversy
 is easily resolved: we do not need any such construct.

That's basically in direct contradiction to the experimental evidence.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-06 Thread Peter Geoghegan
On 5 January 2012 20:23, Robert Haas robertmh...@gmail.com wrote:
 I don't have a problem with the idea of a pg_always_inline, but I'm
 wondering what sort of fallback mechanism you propose.  It seems to me
 that if the performance improvement is conditioned on inlining being done
 and we're not confident that we can force the compiler to inline, we
 ought to fall back to not making the specialization available, rather
 than to making something available that may not work well.  Of course
 if it's only a question of a smaller performance gain, rather than an
 actual loss, then that's not as much of an issue...

pg_always_inline is more or less a neat adjunct to what I've already
done. You can take it or leave it, but it would be irrational to
dismiss it out of hand, because it is demonstrably effective in this
case. Using non-portable things like function attributes is fairly
well precedented - we use __attribute__((format(*) )) frequently. We
don't need a fallback mechanism other than #else #define
pg_always_inline inline #endif. I think these facilities are
available to well over 99% of our users, who use GCC, GCC-compatible
compilers, or MSVC builds.
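
Spelled out, the cascade might look like this (a sketch; the MSVC
spelling __forceinline is per the MSDN page I linked to elsewhere in
this thread):

#if defined(__GNUC__)
#define pg_always_inline inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#define pg_always_inline __forceinline
#else
#define pg_always_inline inline
#endif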

I have basically no sympathy for anyone who uses a compiler that can't
inline functions. Do such people actually exist?

 ISTM that protecting
 against that outside the qsort implementation (but only for index
 sorts) is wrong-headed.

 Assuming you mean that qsort_arg() will protect against this stuff so
 that callers don't need to do it separately, +1.

That's exactly what I mean. Going quadratic in the face of a lot of
duplicates is something that, remarkably, was a problem well into the
1990s, apparently because some guy noticed it one day, though I think
that lots of popular implementations happened to be unaffected anyway.

As you know, all queries tested have lots and lots of duplicates (a
~1.5GB table that contains the same number of distinct elements as a
~10MB table once did), and removing the duplicate protection for
btrees, on top of everything else, sees the qsort do an awful lot
better than HEAD does with the pseudo-protection. As I've said, we
already lack any such protection for heap tuples. We are unaffected
now anyway, in particular, because we do this, as apparently
recommended by both Sedgewick and Knuth:

while (pb <= pc && (r = cmp(pb, a)) <= 0)
{
	if (r == 0)
	{
		swap(pa, pb);
		pa += es;
	}
	pb += es;
}

Note that the partition swap occurs *because* the pivot and element
are equal. You might imagine that this is superfluous. It turns out
that it protects against the duplicates, resulting in sub-partitions
about the same size (though it occurs to me that it does so at the
expense of precluding parallelism). In short, Sedgewick would approve
of our qsort implementation.

As for the "compare a tuple to itself" thing, that's just silly, and
was probably only ever seen in some homespun implementation at some
point a relatively long time ago, but I've asserted against it anyway.

 I can't find any description of how this actually works... obviously,
 we'd like to find a close-to-median element, but how do we actually do
 that cheaply?

Uh, it works whatever way you want it to work. The implementation has
to figure out how to get the median ahead of time.

 I doubt that this would be worthwhile -- the pivot that we pick at the
 toplevel doesn't really matter much; even if it's bad, we can recover
 just fine if the pivot we pick one level down is good, or the level
 after that.  We only really have a problem if we keep systematically
 picking bad pivots every time.  And if we do have that problem,
 improving the toplevel choice of pivot is only going to help modestly,
 because we'll still be O(n^2) on each partition.

 Can I get a straw poll on how much of a problem worst-case performance
 of qsort is believed to be?

 I'd be reluctant to remove any of the protections we currently have,
 because I bet they were all put in to fix problems that people hit in
 the field.  But I don't know that we need more.  Of course,
 consolidating them into qsort() itself rather than its callers
 probably makes sense, as noted above.

I was merely proposing something that would complement the med3 method
and our existing protections.

 In a perfect world, if it were somehow possible to know the perfect
 pivots ahead of time from a histogram or something, we'd have a quick
 sort variant with worst-case performance of O(n log(n)). That, or the
 limited approximation that I've sketched would perhaps be worthwhile,
 even if it was subject to a number of caveats. Wikipedia claims that
 the worst case for quick sort is O(n log(n)) with the refinements
 recommended by Sedgewick's 1978 paper, but this seems like nonsense to
 me - the med3 technique is a good heuristic in 

Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-06 Thread Robert Haas
On Fri, Jan 6, 2012 at 12:10 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 As you know, all queries tested have lots and lots of duplicates (a
 ~1.5GB table that contains the same number of distinct elements as a
 ~10MB table once did), and removing the duplicate protection for
 btrees, on top of everything else, sees the qsort do an awful lot
 better than HEAD does with the pseudo-protection.

Does that also win vs. unpatched master?  If so we may as well pull
that part out and commit it separately.

 I can't find any description of how this actually works... obviously,
 we'd like to find a close-to-median element, but how do we actually do
 that cheaply?

 Uh, it works whatever way you want it to work. The implementation has
 to figure out how to get the median ahead of time.

Oh.  I'd be inclined to think that would be pretty inefficient, unless
we only did it for very large partitions or when we observed that some
less costly strategy was tanking.

 I gather from a quick read that this is supposed to be especially good
 when the data is already somewhat sorted.  Would there be any merit in
 trying to guess when that's likely to be the case (e.g. based on the
 correlation statistic)?   That seems like a stretch, but OTOH a GUC
 doesn't feel great either: what is the poor user to do with a query
 that does 2 sorts, one of which is faster with QS and the other of
 which is faster with Timsort?  Ugh.

 I'd imagined that it might work a bit like default_statistics_target.
 Of course, I was just thinking out loud. Also, we do a very good job
 on *perfectly* pre-sorted input because we check for pre-sorted input.

We occasionally have people who want to do SELECT * FROM foo WHERE a >
100 AND a < 200 ORDER BY a, b where foo has an index on (a) but not
(a, b).  This tends to suck, because we fail to realize that we can
batch the sort.  We either seqscan and filter the table then sort the
results, or else we scan the index on (a) and then ignore the fact
that we have data which is already partially sorted.  It's
particularly annoying if a is mostly unique and needs b only
occasionally to break ties.  It would be nice to improve this, but it
may be more of a planner/executor problem that something we can fix
directly inside the sort implementation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-06 Thread Peter Geoghegan
On 6 January 2012 17:29, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Jan 6, 2012 at 12:10 PM, Peter Geoghegan pe...@2ndquadrant.com 
 wrote:
 As you know, all queries tested have lots and lots of duplicates (a
 ~1.5GB table that contains the same number of distinct elements as a
 ~10MB table once did), and removing the duplicate protection for
 btrees, on top of everything else, sees the qsort do an awful lot
 better than HEAD does with the psuedo-protection.

 Does that also win vs. unpatched master?  If so we may as well pull
 that part out and commit it separately.

I didn't bother isolating that, because it doesn't really make sense
to (not having it is probably only of particular value when doing what
I'm doing anyway, but who knows). Go ahead and commit something to
remove that code (get it in both comparetup_index_btree and
comparetup_index_hash), as well as the tuple1 != tuple2 test now if
you like. It's patently clear that it is unnecessary, and so doesn't
have to be justified as a performance enhancement - it's a simple case
of refactoring for clarity. As I've said, we don't do this for heap
tuples and we've heard no complaints in all that time. It was added in
commit fbac1272b89b547dbaacd78bbe8da68e5493cbda, presumably when
problems with system qsorts came to light.

 I'd imagined that it might work a bit like default_statistics_target.
 Of course, I was just thinking out loud. Also, we do a very good job
 on *perfectly* pre-sorted input because we check for pre-sorted input.

 We occasionally have people who want to do SELECT * FROM foo WHERE a >
 100 AND a < 200 ORDER BY a, b where foo has an index on (a) but not
 (a, b).  This tends to suck, because we fail to realize that we can
 batch the sort.  We either seqscan and filter the table then sort the
 results, or else we scan the index on (a) and then ignore the fact
 that we have data which is already partially sorted.  It's
 particularly annoying if a is mostly unique and needs b only
 occasionally to break ties.  It would be nice to improve this, but it
 may be more of a planner/executor problem than something we can fix
 directly inside the sort implementation.

That sounds like an interesting problem. Something to look into for
9.3, perhaps.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-06 Thread Tom Lane
Peter Geoghegan pe...@2ndquadrant.com writes:
 I didn't bother isolating that, because it doesn't really make sense
 to (not having it is probably only of particular value when doing what
 I'm doing anyway, but who knows). Go ahead and commit something to
 remove that code (get it in both comparetup_index_btree and
 comparetup_index_hash), as well as the tuple1 != tuple2 test now if
 you like. It's patently clear that it is unnecessary, and so doesn't
 have to be justified as a performance enhancement - it's a simple case
 of refactoring for clarity. As I've said, we don't do this for heap
 tuples and we've heard no complaints in all that time. It was added in
 commit fbac1272b89b547dbaacd78bbe8da68e5493cbda, presumably when
 problems with system qsorts came to light.

Actually, I'm going to object to reverting that commit, as I believe
that having equal-keyed index entries in physical table order may offer
some performance benefits at access time.  If you don't like the
comment, we can change that.

regards, tom lane



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-06 Thread Peter Geoghegan
On 6 January 2012 18:45, Tom Lane t...@sss.pgh.pa.us wrote:
 Peter Geoghegan pe...@2ndquadrant.com writes:
 I didn't bother isolating that, because it doesn't really make sense
 to (not having it is probably only of particular value when doing what
 I'm doing anyway, but who knows). Go ahead and commit something to
 remove that code (get it in both comparetup_index_btree and
 comparetup_index_hash), as well as the tuple1 != tuple2 test now if
 you like. It's patently clear that it is unnecessary, and so doesn't
 have to be justified as a performance enhancement - it's a simple case
 of refactoring for clarity. As I've said, we don't do this for heap
 tuples and we've heard no complaints in all that time. It was added in
 commit fbac1272b89b547dbaacd78bbe8da68e5493cbda, presumably when
 problems with system qsorts came to light.

 Actually, I'm going to object to reverting that commit, as I believe
 that having equal-keyed index entries in physical table order may offer
 some performance benefits at access time.  If you don't like the
 comment, we can change that.

Please explain your position. When is this supposed to be useful? Even
if you can present an argument for keeping it, it has to be weighed
against the fact that people don't tend to have much use for indexes
with lots of duplicates anyway. Prior to 2004, we did not do this - it was
justified as insurance against bad qsort implementations only.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-06 Thread Tom Lane
Peter Geoghegan pe...@2ndquadrant.com writes:
 On 6 January 2012 18:45, Tom Lane t...@sss.pgh.pa.us wrote:
 Actually, I'm going to object to reverting that commit, as I believe
 that having equal-keyed index entries in physical table order may offer
 some performance benefits at access time.  If you don't like the
 comment, we can change that.

 Please explain your position. When is this supposed to be useful?

When there are lots of duplicates of a particular indexed value, the
existing code will cause an indexscan to search them in physical order,
whereas if we remove the existing logic it'll be random --- in
particular, accesses to the same heap page can no longer be expected to
be clustered.

Admittedly, I don't have any numbers quantifying just how useful that
might be, but on the other hand you've not presented any evidence
justifying removing the behavior either.  If we believe your position
that indexes don't generally have lots of duplicates, then the code in
question will seldom be reached and therefore there would be no
performance benefit in removing it.

regards, tom lane



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-06 Thread Robert Haas
On Fri, Jan 6, 2012 at 4:14 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Admittedly, I don't have any numbers quantifying just how useful that
 might be, but on the other hand you've not presented any evidence
 justifying removing the behavior either.  If we believe your position
 that indexes don't generally have lots of duplicates, then the code in
 question will seldom be reached and therefore there would be no
 performance benefit in removing it.

Obviously, many indexes are unique and thus won't have duplicates at
all.  But if someone creates an index and doesn't make it unique, odds
are very high that it has some duplicates.  Not sure how many we
typically expect to see, but more than zero...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-05 Thread Robert Haas
On Thu, Dec 29, 2011 at 9:03 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 3. Resolve two anticipated controversies that are, respectively,
 somewhat orthogonal and completely orthogonal to the binary bloat
 controversy. The first (controversy A) is that I have added a new
 piece of infrastructure, pg_always_inline, which, as the name
 suggests, is a portable way of insisting that a function should be
 invariably inlined. Portable libraries like libc have been doing this
 for a long time now, and I actually use the libc macro (that expands
 to __attribute__((always_inline)) ) where possible.

I don't have a problem with the idea of a pg_always_inline, but I'm
wondering what sort of fallback mechanism you propose.  It seems to me
that if the performance improvement is conditioned on inlining being done
and we're not confident that we can force the compiler to inline, we
ought to fall back to not making the specialization available, rather
than to making something available that may not work well.  Of course
if it's only a question of a smaller performance gain, rather than an
actual loss, then that's not as much of an issue...

 The second
 possible point of contention (controversy B) is that I have jettisoned
 various protections against bad qsort implementations that I believe
 are a legacy of when we used the system qsort pre-2006, that can no
 longer be justified. For example, quick sort performing badly in the
 face of lots of duplicates is a well understood problem today
 (partitioning should stop on equal keys), and ISTM that protecting
 against that outside the qsort implementation (but only for index
 sorts) is wrong-headed.

Assuming you mean that qsort_arg() will protect against this stuff so
that callers don't need to do it separately, +1.

 I had another idea when writing this patch that I haven't developed at
 all but I'll share anyway. That is, it might be a good idea to use a
 balanced quicksort:

 http://xlinux.nist.gov/dads//HTML/balancedqsrt.html

I can't find any description of how this actually works... obviously,
we'd like to find a close-to-median element, but how do we actually do
that cheaply?

 It might be possible to get a reasonable approximation of the actual
 median value of a given column with existing statistics, which could
 be hinted to qsort_arg.

I doubt that this would be worthwhile -- the pivot that we pick at the
toplevel doesn't really matter much; even if it's bad, we can recover
just fine if the pivot we pick one level down is good, or the level
after that.  We only really have a problem if we keep systematically
picking bad pivots every time.  And if we do have that problem,
improving the toplevel choice of pivot is only going to help modestly,
because we'll still be O(n^2) on each partition.

 Can I get a straw poll on how much of a problem worst-case performance
 of qsort is believed to be?

I'd be reluctant to remove any of the protections we currently have,
because I bet they were all put in to fix problems that people hit in
the field.  But I don't know that we need more.  Of course,
consolidating them into qsort() itself rather than its callers
probably makes sense, as noted above.

 In a perfect world, if it were somehow possible to know the perfect
 pivots ahead of time from a histogram or something, we'd have a quick
 sort variant with worst-case performance of O(n log(n)). That, or the
 limited approximation that I've sketched would perhaps be worthwhile,
 even if it was subject to a number of caveats. Wikipedia claims that
 the worst case for quick sort is O(n log(n)) with the refinements
 recommended by Sedgewick's 1978 paper, but this seems like nonsense to
 me - the med3 technique is a good heuristic in practice, but it's
 perfectly possible in theory for its sampling to always get things
 completely wrong (this is not an unfamiliar problem).

I agree.  After reading
http://nanxkurniawan.wordpress.com/2010/05/31/quicksort/ I am inclined
to think that things are still O(n lg n) even if we always partition
so that any fixed percentage of the data -- is on one side of the
partition element and the remainder is on the other side.  So for
example if all of our partition elements are greater than 1% of the
elements in their partition and less than the other 99%, we'll still
be O(n lg n), though with a considerably higher constant factor, I
think.  However, if our partition elements are always a fixed NUMBER
of elements from the end - whether it's 1 or 100 - then we'll be
O(n^2), though again the constant factor will depend on how many
elements you knock off each time.  In practice I'm not sure whether an
algorithmic analysis matters much, because in real life the constant
factors are probably pretty important, hence the median-of-medians
approach with n > 40.
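
For reference, the med3 routine under discussion is the classic
median-of-three from Bentley & McIlroy, approximately as it appears in
src/port/qsort_arg.c:

static char *
med3(char *a, char *b, char *c, qsort_arg_comparator cmp, void *arg)
{
	/* pick the median of three sampled elements as the pivot */
	return cmp(a, b, arg) < 0 ?
		(cmp(b, c, arg) < 0 ? b : (cmp(a, c, arg) < 0 ? c : a))
		: (cmp(b, c, arg) > 0 ? b : (cmp(a, c, arg) < 0 ? a : c));
}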

 While I'm thinking out loud, here's another idea: Have a GUC which,
 when enabled (perhaps potentially at various different granularities),
 makes Timsort the in-memory sorting algorithm, as it now 

Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-05 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Dec 29, 2011 at 9:03 PM, Peter Geoghegan pe...@2ndquadrant.com 
 wrote:
 The first (controversy A) is that I have added a new
 piece of infrastructure, pg_always_inline, which, as the name
 suggests, is a portable way of insisting that a function should be
 invariably inlined.

 I don't have a problem with the idea of a pg_always_inline, but I'm
 wondering what sort of fallback mechanism you propose.

There is no compiler anywhere that implements "always inline", unless
you are talking about a macro.  "inline" is a hint and nothing more,
and if you think you can force it you are mistaken.  So this controversy
is easily resolved: we do not need any such construct.

The real question is whether we should accept a patch that is a
performance loss when the compiler fails to inline some reasonably
simple function.  I think that would depend on the actual numbers
involved, so we'd need to see data before making a decision.

 The second
 possible point of contention (controversy B) is that I have jettisoned
 various protections against bad qsort implementations that I believe
 are a legacy of when we used the system qsort pre-2006, that can no
 longer be justified.

No objection to that one, I think, as long as you've verified that our
implementation is in fact okay about these things.

regards, tom lane



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-01-05 Thread Peter Geoghegan
On 5 January 2012 22:27, Tom Lane t...@sss.pgh.pa.us wrote:
 There is no compiler anywhere that implements "always inline", unless
 you are talking about a macro.  "inline" is a hint and nothing more,
 and if you think you can force it you are mistaken.  So this controversy
 is easily resolved: we do not need any such construct.

I'm slightly puzzled by your remarks here. GCC documentation on this
is sparse (although, as I've demonstrated, I can get better numbers
using the always_inline attribute on GCC 4.3), but take a look at
this:

http://msdn.microsoft.com/en-us/library/z8y1yy88.aspx

While it is not strictly true to say that pg_always_inline could
inline every function under every set of circumstances, it's pretty
close to the truth. I do accept that this facility could quite easily
be abused if its use isn't carefully measured. I also accept that
C99/GNU C inline functions are generally just requests to the compiler
that may be ignored (and indeed the compiler may inline without being
asked to).

It's short-sighted to see this as a case of inlining itself making a
big difference; rather, it makes a big difference as an enabling
transformation.

 The real question is whether we should accept a patch that is a
 performance loss when the compiler fails to inline some reasonably
 simple function.  I think that would depend on the actual numbers
 involved, so we'd need to see data before making a decision.

Who said anything about a performance loss? Since the raw improvement
to qsort_arg is so large, it's pretty difficult to imagine a
confluence of circumstances in which this results in a net loss. See
the figures at http://archives.postgresql.org/pgsql-hackers/2011-11/msg01316.php
, for example.

The pg_always_inline idea is relatively recent. It just serves to
provide additional encouragement to the compiler to inline, insofar as
that is possible on the platform in question. The compiler's
cost/benefit analysis cannot possibly appreciate how hot a code path
qsort_arg is, because it has a set of generic heuristics that are
quite fallible, and are very probably quite conservative. We can
appreciate such things though.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2011-12-30 Thread Merlin Moncure
On Thu, Dec 29, 2011 at 8:03 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 * A spreadsheet that shows the results of re-running my earlier heap
 tuple sorting benchmark with this new patch. The improvement in the
 query that orders by 2 columns is all that is pertinent there, when
 considering the value of (1) and the sense in standing still for
 controversy A.

 * A spreadsheet that shows the difference in index creation times,
 generated with the help of the new python script.

very nice.  let me save everyone the effort of opening his
spreadsheets (which by the way both show 'HEAD/unoptimized' --
probably not what you meant): he's showing a consistent ~50% reduction
in running time of sort driven queries -- that's money.

merlin



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2011-12-30 Thread Peter Geoghegan
On 30 December 2011 19:46, Merlin Moncure mmonc...@gmail.com wrote:
 On Thu, Dec 29, 2011 at 8:03 PM, Peter Geoghegan pe...@2ndquadrant.com 
 wrote:
 * A spreadsheet that shows the results of re-running my earlier heap
 tuple sorting benchmark with this new patch. The improvement in the
 query that orders by 2 columns is all that is pertinent there, when
 considering the value of (1) and the sense in standing still for
 controversy A.

 * A spreadsheet that shows the difference in index creation times,
 generated with the help of the new python script.

 very nice.  let me save everyone the effort of opening his
 spreadsheets (which by the way both show 'HEAD/unoptimized' --
 probably not what you meant): he's showing a consistent ~50% reduction
 in running time of sort driven queries -- that's money.

Sorry, I think you may have misinterpreted the results, which is my
fault - I introduced a formatting error. In the case of the btree
spreadsheet, the first query on each sheet should be "create index
test on orderlines (prod_id);", and not "select * from orderlines
order by prod_id". The idea is to compare the results from each set of
binaries across pages of the spreadsheet (note that there are two
tabs). You should not compare anything between the two spreadsheets.
Revised btree results attached. The heap results that I posted do not
have any formatting errors, so they have not been revised.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


results_server_btree_revised.ods
Description: application/vnd.oasis.opendocument.spreadsheet



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2011-12-30 Thread Merlin Moncure
On Fri, Dec 30, 2011 at 2:30 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 On 30 December 2011 19:46, Merlin Moncure mmonc...@gmail.com wrote:
 On Thu, Dec 29, 2011 at 8:03 PM, Peter Geoghegan pe...@2ndquadrant.com 
 wrote:
 * A spreadsheet that shows the results of re-running my earlier heap
 tuple sorting benchmark with this new patch. The improvement in the
 query that orders by 2 columns is all that is pertinent there, when
 considering the value of (1) and the sense in standing still for
 controversy A.

 * A spreadsheet that shows the difference in index creation times,
 generated with the help of the new python script.

 very nice.  let me save everyone the effort of opening his
 spreadsheets (which by the way both show 'HEAD/unoptimized' --
 probably not what you meant): he's showing a consistent ~50% reduction
 in running time of sort driven queries -- that's money.

 Sorry, I think you may have misinterpreted the results, which is my
 fault - I introduced a formatting error. In the case of the btree
 spreadsheet, the first query on each sheet should be create index
 test on orderlines (prod_id);, and not select * from orderlines
 order by prod_id. The idea is to compare the results from each set of
 binaries across pages of the spreadsheet (note that there are two
 tabs). You should not compare anything between the two spreadsheets.
 Revised btree results attached. The heap results that I posted do not
 have any formatting errors, so they have not been revised.

right-- my bad. still, that's 31-37% -- still pretty nice.

merlin
