Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

2014-02-04 Thread Jon Nelson
What - if anything - do I need to do to get this on the commitfest
list for the next commitfest?


Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

2014-02-04 Thread Michael Paquier
On Wed, Feb 5, 2014 at 10:33 AM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 What - if anything - do I need to do to get this on the commitfest
 list for the next commitfest?
The list of instructions is here:
http://wiki.postgresql.org/wiki/Submitting_a_Patch#Patch_submission
Then the next commit fest (#1 for 9.5) will be in June and is here:
https://commitfest.postgresql.org/action/commitfest_view?id=22
Note that you will need an account on postgresql.org to register your patch.
Regards,
-- 
Michael


Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

2014-01-26 Thread Jon Nelson
What are my next steps here?

I believe the concept is sound, the code appears to work and
doesn't crash, and the result does show a performance win in most
cases (sometimes a big win).  It's also incomplete, at least insofar
as it doesn't interface with the cost models at all, etc...

-- 
Jon


Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

2014-01-22 Thread Jeremy Harris

On 22/01/14 03:16, Jon Nelson wrote:

Greetings -hackers:

I have worked up a patch to PostgreSQL which elides tuples during an
external sort. The primary use case is when sorted input is being used
to feed a DISTINCT operation. The idea is to throw out tuples that
compare as identical whenever it's convenient, predicated on the
assumption that even a single I/O is more expensive than some number
of (potentially extra) comparisons.  Obviously, this is where a cost
model comes in, which has not been implemented. This patch is a
work-in-progress.



Dedup-in-sort is also done by my WIP internal merge sort, and
extended (in much the same ways as Jon's) to the external merge.

https://github.com/j47996/pgsql_sorb


I've not done a cost model either, but the dedup capability is
exposed from tuplesort.c to the executor, and downstream Unique
nodes are removed.

I've not yet worked out how to eliminate upstream hashagg nodes,
which testing results suggest would be worthwhile.

--
Cheers,
   Jeremy


Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

2014-01-22 Thread Jeremy Harris

On 22/01/14 03:53, Tom Lane wrote:

Jon Nelson jnelson+pg...@jamponi.net writes:

A rough summary of the patch follows:



- a GUC variable enables or disables this capability
- in nodeAgg.c, eliding duplicate tuples is enabled if the number of
   distinct columns is equal to the number of sort columns (and both are
   greater than zero).
- in createplan.c, eliding duplicate tuples is enabled if we are
   creating a unique plan which involves sorting first
- ditto planner.c
- all of the remaining changes are in tuplesort.c, which consist of:
   + a new macro, DISCARDTUP and a new structure member, discardtup, are
 both defined and operate similar to COMPARETUP, COPYTUP, etc...
   + in puttuple_common, when state is TSS_BUILDRUNS, we *may* simply
 throw out the new tuple if it compares as identical to the tuple at
 the top of the heap. Since we're already performing this comparison,
 this is essentially free.
   + in mergeonerun, we may discard a tuple if it compares as identical
 to the *last written tuple*. This is a comparison that did not take
 place before, so it's not free, but it saves a write I/O.
   + We perform the same logic in dumptuples


[ raised eyebrow ... ]  And what happens if the planner drops the
unique step and then the sort doesn't actually go to disk?


I don't think Jon was suggesting that the planner drop the unique step.
--
Cheers,
   Jeremy



Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

2014-01-22 Thread Tom Lane
Jeremy Harris j...@wizmail.org writes:
 On 22/01/14 03:53, Tom Lane wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 - in createplan.c, eliding duplicate tuples is enabled if we are
 creating a unique plan which involves sorting first

 [ raised eyebrow ... ]  And what happens if the planner drops the
 unique step and then the sort doesn't actually go to disk?

 I don't think Jon was suggesting that the planner drop the unique step.

Hm, OK, maybe I misread what he said there.  Still, if we've told
tuplesort to remove duplicates, why shouldn't we expect it to have
done the job?  Passing the data through a useless Unique step is
not especially cheap.

regards, tom lane


Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

2014-01-22 Thread Jon Nelson
On Wed, Jan 22, 2014 at 3:26 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jeremy Harris j...@wizmail.org writes:
 On 22/01/14 03:53, Tom Lane wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 - in createplan.c, eliding duplicate tuples is enabled if we are
 creating a unique plan which involves sorting first

 [ raised eyebrow ... ]  And what happens if the planner drops the
 unique step and then the sort doesn't actually go to disk?

 I don't think Jon was suggesting that the planner drop the unique step.

 Hm, OK, maybe I misread what he said there.  Still, if we've told
 tuplesort to remove duplicates, why shouldn't we expect it to have
 done the job?  Passing the data through a useless Unique step is
 not especially cheap.

That's correct - I do not propose to drop the unique step. Duplicates
are only dropped if it's convenient to do so. In one case, it's a
zero-cost drop (no extra comparison is made). In most other cases, an
extra comparison is made, typically right before writing a tuple to
tape. If it compares as identical to the previously-written tuple,
it's thrown out instead of being written.

The output of the modified code is still sorted, still *might* (and in
most cases, probably will) contain duplicates, but will (probably)
contain fewer duplicates.
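
To illustrate the second (non-free) case with a tiny standalone C sketch --
purely illustrative, not code from the patch: the run being written out is
already sorted, so remembering the last value written and comparing each
candidate against it trades one comparison for one avoided write.

#include <stdbool.h>
#include <stdio.h>

static void
write_run(const int *run, int n, bool elide_duplicates)
{
    int  last_written = 0;
    bool have_last = false;
    int  writes = 0;
    int  i;

    for (i = 0; i < n; i++)
    {
        /* the extra comparison, traded against one avoided tape write */
        if (elide_duplicates && have_last && run[i] == last_written)
            continue;

        printf("write %d\n", run[i]);   /* stand-in for writing to tape */
        last_written = run[i];
        have_last = true;
        writes++;
    }
    printf("%d of %d tuples written\n", writes, n);
}

int
main(void)
{
    int sorted_run[] = {1, 1, 1, 2, 2, 3};   /* an already-sorted run */

    write_run(sorted_run, 6, true);          /* writes only 1, 2, 3 */
    return 0;
}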

-- 
Jon


[HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

2014-01-21 Thread Jon Nelson
Greetings -hackers:

I have worked up a patch to PostgreSQL which elides tuples during an
external sort. The primary use case is when sorted input is being used
to feed a DISTINCT operation. The idea is to throw out tuples that
compare as identical whenever it's convenient, predicated on the
assumption that even a single I/O is more expensive than some number
of (potentially extra) comparisons.  Obviously, this is where a cost
model comes in, which has not been implemented. This patch is a
work-in-progress.

Being very new to hacking PostgreSQL, I've probably done a bunch of
things wrong, but I've tried to test it thoroughly.

A rough summary of the patch follows:

- a GUC variable enables or disables this capability
- in nodeAgg.c, eliding duplicate tuples is enabled if the number of
  distinct columns is equal to the number of sort columns (and both are
  greater than zero).
- in createplan.c, eliding duplicate tuples is enabled if we are
  creating a unique plan which involves sorting first
- ditto planner.c
- all of the remaining changes are in tuplesort.c, which consist of:
  + a new macro, DISCARDTUP and a new structure member, discardtup, are
both defined and operate similar to COMPARETUP, COPYTUP, etc...
  + in puttuple_common, when state is TSS_BUILDRUNS, we *may* simply
throw out the new tuple if it compares as identical to the tuple at
the top of the heap. Since we're already performing this comparison,
this is essentially free (see the sketch after this list).
  + in mergeonerun, we may discard a tuple if it compares as identical
to the *last written tuple*. This is a comparison that did not take
place before, so it's not free, but it saves a write I/O.
  + We perform the same logic in dumptuples
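
To make the puttuple_common case concrete, here is a tiny standalone C
sketch (illustrative only -- it is not code from the patch, and the names
are made up). The incoming tuple has to be compared against the heap top
anyway to decide which run it belongs to; if that comparison reports
equality and elision is enabled, the tuple can simply be dropped:

#include <stdbool.h>
#include <stdio.h>

typedef enum { KEEP_CURRENT_RUN, KEEP_NEXT_RUN, DISCARD } Decision;

static const char *decision_name[] = {
    "keep (current run)", "keep (next run)", "discard"
};

static Decision
classify_tuple(int new_tuple, int heap_top, bool elide)
{
    /* this comparison is made anyway, so reusing it costs nothing extra */
    int cmp = (new_tuple > heap_top) - (new_tuple < heap_top);

    if (cmp == 0 && elide)
        return DISCARD;
    if (cmp < 0)
        return KEEP_NEXT_RUN;        /* smaller than heap top: next run */
    return KEEP_CURRENT_RUN;
}

int
main(void)
{
    printf("%s\n", decision_name[classify_tuple(5, 5, true)]);  /* discard */
    printf("%s\n", decision_name[classify_tuple(7, 5, true)]);  /* current */
    printf("%s\n", decision_name[classify_tuple(3, 5, true)]);  /* next */
    return 0;
}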

I have tested this patch with a bunch of different data, and the
results are summarized below.

Test 1:
I used a corpus of approximately 1.5TB of textual data, consisting of
7.4 billion input rows, 12.6% of which is unique. This is a 'real-world'
test.

Before the patch: 44.25 hours using 274GB of temporary files
After the patch: 36.44 hours using 125GB of temporary files
Difference: -7.6 hours, 18% faster

Test 2:
Acquire the unique set of strings from a totally random set that
contains no duplicates. This is a 'worst case'.

Input: 700 million rows, 32GB.
Before: 18,796 seconds.
After: 19,358 seconds
Difference: +562 seconds, *slower* by 3%.
In this particular case, the performance varies from faster to slower, and I'm
not sure why. Statistical noise?

Test 3:
Acquire the unique set of integers from a set of 100 million, all
happen to have the value '1'. This is a 'best case'.

Input: 100 million rows, 3.4GB on-disk
Before: 26.1 seconds using 1.3GB of temporary files
After: 8.9 seconds using 8KB of disk
Difference: -17.2 seconds, 65% faster

Test 4:
Similar to test 3, but using 1 billion rows.
Before: 1450 seconds, 13 GB of temporary files
After: 468 seconds, 8 KB of temporary files
Difference: -982 seconds, 68% faster

See below for details on test 4.

Lastly, 'make check' passes without issues (modulo rangefuncs grumping about
changes in pg_settings).

Tests 1 through 3 were performed against PostgreSQL 9.4 HEAD as of
early September 2013.  I have rebased the patch against
PostgreSQL 9.4 HEAD as of 19 January 2014, and re-run tests
2, 3, and 4 with similar results.

Regarding the patch: I've tried to conform to the indenting and style
rules, including using pg_bsd_indent and pgindent, but I've not had
100% success.

The details on test 4:

A simple example, using 1 billion rows all with the value '1' (integer):
This is a fabricated *best case*.

postgres=# set enable_hashagg = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual time=468188.900..468188.901 rows=1 loops=1)
   Output: i
   ->  Sort  (cost=245942793.27..248442793.27 rows=1000000000 width=4) (actual time=468188.897..468188.897 rows=1 loops=1)
         Output: i
         Sort Key: i.i
         Sort Method: external sort  Disk: 8kB
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=34.765..254595.990 rows=1000000000 loops=1)
               Output: i
 Total runtime: 468189.125 ms
(9 rows)

Time: 468189.755 ms
postgres=# set enable_opportunistic_tuple_elide = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual time=1047226.451..1449371.632 rows=1 loops=1)
   Output: i
   ->  

Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

2014-01-21 Thread Tom Lane
Jon Nelson jnelson+pg...@jamponi.net writes:
 A rough summary of the patch follows:

 - a GUC variable enables or disables this capability
 - in nodeAgg.c, eliding duplicate tuples is enabled if the number of
   distinct columns is equal to the number of sort columns (and both are
   greater than zero).
 - in createplan.c, eliding duplicate tuples is enabled if we are
   creating a unique plan which involves sorting first
 - ditto planner.c
 - all of the remaining changes are in tuplesort.c, which consist of:
   + a new macro, DISCARDTUP and a new structure member, discardtup, are
 both defined and operate similar to COMPARETUP, COPYTUP, etc...
   + in puttuple_common, when state is TSS_BUILDRUNS, we *may* simply
 throw out the new tuple if it compares as identical to the tuple at
 the top of the heap. Since we're already performing this comparison,
 this is essentially free.
   + in mergeonerun, we may discard a tuple if it compares as identical
 to the *last written tuple*. This is a comparison that did not take
 place before, so it's not free, but it saves a write I/O.
   + We perform the same logic in dumptuples

[ raised eyebrow ... ]  And what happens if the planner drops the
unique step and then the sort doesn't actually go to disk?

regards, tom lane


Re: [HACKERS] PoC: Duplicate Tuple Elidation during External Sort for DISTINCT

2014-01-21 Thread Jon Nelson
On Tue, Jan 21, 2014 at 9:53 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jon Nelson jnelson+pg...@jamponi.net writes:
 A rough summary of the patch follows:

 - a GUC variable enables or disables this capability
 - in nodeAgg.c, eliding duplicate tuples is enabled if the number of
   distinct columns is equal to the number of sort columns (and both are
   greater than zero).
 - in createplan.c, eliding duplicate tuples is enabled if we are
   creating a unique plan which involves sorting first
 - ditto planner.c
 - all of the remaining changes are in tuplesort.c, which consist of:
   + a new macro, DISCARDTUP and a new structure member, discardtup, are
 both defined and operate similar to COMPARETUP, COPYTUP, etc...
   + in puttuple_common, when state is TSS_BUILDRUNS, we *may* simply
 throw out the new tuple if it compares as identical to the tuple at
 the top of the heap. Since we're already performing this comparison,
 this is essentially free.
   + in mergeonerun, we may discard a tuple if it compares as identical
 to the *last written tuple*. This is a comparison that did not take
 place before, so it's not free, but it saves a write I/O.
   + We perform the same logic in dumptuples

 [ raised eyebrow ... ]  And what happens if the planner drops the
 unique step and then the sort doesn't actually go to disk?

I'm not familiar enough with the code to be able to answer your
question with any sort of authority, but I believe that if the state
TSS_BUILDRUNS is never hit, then basically nothing new happens.
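
A standalone sketch of that guard (illustrative only; the helper and its
names are mine, not the patch's): elision is confined to the TSS_BUILDRUNS
path, so a sort that never goes to disk behaves exactly as it does today,
and the Unique node still removes whatever duplicates reach it.

#include <stdbool.h>
#include <stdio.h>

typedef enum { TSS_INITIAL, TSS_SORTEDINMEM, TSS_BUILDRUNS } SortState;

static bool
may_elide(SortState state, bool elide_enabled, int new_tuple, int heap_top)
{
    if (state != TSS_BUILDRUNS || !elide_enabled)
        return false;               /* in-memory sort: nothing new happens */
    return new_tuple == heap_top;   /* "compares as identical" */
}

int
main(void)
{
    printf("in-memory sort, duplicate: elide? %d\n",
           may_elide(TSS_SORTEDINMEM, true, 1, 1));     /* prints 0 */
    printf("external sort, duplicate:  elide? %d\n",
           may_elide(TSS_BUILDRUNS, true, 1, 1));       /* prints 1 */
    return 0;
}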

-- 
Jon

