Re: RIGHT/FULL OUTER hash joins (was Re: [HACKERS] small table left outer join big table)
On Thu, Dec 30, 2010 at 11:50 PM, Robert Haas robertmh...@gmail.com wrote:
> On Thu, Dec 30, 2010 at 10:45 AM, Tom Lane t...@sss.pgh.pa.us wrote:
>> I had an epiphany about this topic, or actually two of them.
>>
>> 1. Whether or not you think there's a significant performance reason
>> to support hash right joins, there's a functionality reason. The
>> infrastructure for right join could just as easily do full joins.
>> And AFAICS, a hash full join would only require one hashable join
>> clause --- the other FULL JOIN ON conditions could be anything at
>> all. This is unlike the situation for merge join, where all the
>> JOIN ON conditions have to be mergeable or it doesn't work right.
>> So we could greatly reduce the scope of the dreaded "FULL JOIN is
>> only supported with merge-joinable join conditions" error. (Well,
>> okay, it's not *that* dreaded, but people complain about it
>> occasionally.)
>
> Yeah, that would be neat. It might be a lot faster in some cases, too.

Yeah, PostgreSQL should have this great feature. Actually, Oracle 10g already has the right hash join:
http://dbcrusade.blogspot.com/2008/01/oracle-hash-join-right-outer.html

And Oracle 11g has the full hash join:
http://www.dba-oracle.com/oracle11g/oracle_11g_full_hash_join.htm

I haven't checked whether other DBMSs have this feature.

Thanks,
Li Jie
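[Editor's note: the point that right-join infrastructure yields full joins almost for free — one hashable clause plus per-tuple "matched" flags, with the remaining ON conditions evaluated arbitrarily at probe time — can be sketched outside the executor. This is a hypothetical Python toy model, not PostgreSQL's C implementation; all names here are invented for illustration.]

```python
from collections import defaultdict

def hash_full_join(build_rows, probe_rows, build_key, probe_key, extra_cond=None):
    # Build a hash table on the one hashable join key; keep a
    # per-tuple "matched" flag alongside each build-side row.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append([row, False])  # [tuple, matched?]
    out = []
    for probe in probe_rows:
        found = False
        for entry in table.get(probe[probe_key], []):
            build = entry[0]
            # The other FULL JOIN ON conditions need not be hashable:
            # they are simply evaluated here against the candidate pair.
            if extra_cond is None or extra_cond(build, probe):
                out.append((build, probe))
                entry[1] = True
                found = True
        if not found:
            out.append((None, probe))        # probe side unmatched -> null-extend
    # Final pass: emit build-side tuples whose flag was never set.
    for entries in table.values():
        for build, matched in entries:
            if not matched:
                out.append((build, None))    # build side unmatched -> null-extend
    return out
```

The merge-join restriction does not apply because only the lookup key has to hash; any non-hashable residual condition runs as an ordinary filter inside the probe loop.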
Re: [HACKERS] small table left outer join big table
On Wed, Dec 29, 2010 at 3:58 PM, Simon Riggs si...@2ndquadrant.com wrote:
> On Wed, 2010-12-29 at 09:59 -0500, Tom Lane wrote:
>> Robert Haas robertmh...@gmail.com writes:
>>> On Wed, Dec 29, 2010 at 7:34 AM, Simon Riggs si...@2ndquadrant.com wrote:
>>>> It's not a bug, that's the way it currently works. We don't need a
>>>> test case for that.
>>>
>>> Oh, you're right. I missed the fact that it's a left join.
>>
>> The only thing that struck me as curious about it was that the OP
>> didn't get a nestloop-with-inner-indexscan plan. That would be
>> explainable if there was no index on the large table's id column ...
>> but columns named like that usually have indexes.
>>
>> I can't get all *that* excited about complicating hash joins as
>> proposed. The query is still fundamentally going to be slow because
>> you won't get out of having to seqscan the large table. The only way
>> to make it really fast is to not read all of the large table, and
>> nestloop-with-inner-indexscan is the only plan type with a hope of
>> doing that.
>
> Seq scanning the big table isn't bad... we've gone to a lot of trouble
> to make it easy to do this, especially with many users. Maintaining
> many large indexes is definitely bad; all that random I/O is going to
> suck badly.
>
> Seems like an interesting and relatively simple optimisation to me.
>
> Not sure if this is a request for a feature, or a proposal to write
> the optimisation. I hope it's the latter.

Thanks for your comments. Yeah, I'm excited to write code for PostgreSQL, but I'm new here and not yet familiar with the code base or the patch submission process. I will try to learn in the near future. So for the moment, it is a feature request, and I'm looking forward to any pgsql-hackers working on this.

Thanks,
Li Jie
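[Editor's note: Tom's point — that only a nestloop with an inner indexscan avoids reading the whole big table — can be illustrated with a toy Python model. The dict stands in for a btree index on big_table.id; this is an invented sketch, not planner code.]

```python
def nestloop_inner_index(small_rows, big_index, key):
    # For each outer (small-table) row, probe the index once and touch
    # only the matching big-table tuples, instead of scanning them all.
    for s in small_rows:
        matches = big_index.get(s[key], [])
        if matches:
            for b in matches:
                yield (s, b)
        else:
            yield (s, None)   # LEFT JOIN: keep unmatched outer rows
```

With 9 outer rows this touches at most 9 index entries' worth of the big table, which is why the plan can win whenever the index exists — at the ongoing cost of maintaining that index, as Simon notes.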
[HACKERS] small table left outer join big table
Hi,

Please see the following plan:

postgres=# explain select * from small_table left outer join big_table using (id);
                                QUERY PLAN
--------------------------------------------------------------------------
 Hash Left Join  (cost=126408.00..142436.98 rows=371 width=12)
   Hash Cond: (small_table.id = big_table.id)
   ->  Seq Scan on small_table  (cost=0.00..1.09 rows=9 width=8)
   ->  Hash  (cost=59142.00..59142.00 rows=410 width=8)
         ->  Seq Scan on big_table  (cost=0.00..59142.00 rows=410 width=8)
(5 rows)

Here is my puzzle: why not choose the small table to build the hash table? That would avoid multiple batches and thus save significant I/O cost. We could perform this query in two phases:

1) an inner join, using the small table to build the hash table;
2) a pass over the hash table emitting every tuple that was never matched, which can be tracked with an extra flag bit per tuple.

The only compromise is the output order, due to the two separate phases. I'm not sure whether the SQL standard requires a particular order.

Thanks,
Li Jie
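[Editor's note: the two-phase scheme proposed above can be sketched as a toy Python model — build on the small (outer) side with a matched flag per tuple, probe with the big table, then null-extend whatever never matched. Hypothetical illustration only, not executor code.]

```python
from collections import defaultdict

def small_outer_hash_left_join(small_rows, big_rows, key):
    # Phase 1: build the hash table on the *small* (outer) side so it
    # fits in memory and needs no batching, then probe with the big table,
    # emitting inner-join matches and flagging each matched small tuple.
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append([row, False])  # [tuple, matched?]
    out = []
    for big in big_rows:
        for entry in table.get(big[key], []):
            out.append((entry[0], big))
            entry[1] = True
    # Phase 2: emit small-side tuples that never matched, null-extended.
    for entries in table.values():
        for small, matched in entries:
            if not matched:
                out.append((small, None))
    return out
```

Note how the unmatched outer rows come out only after the whole probe finishes — the output-order compromise mentioned above.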
Re: [HACKERS] Why is sorting on two columns so much slower than sorting on one column?
On Thu, Dec 23, 2010 at 10:26 AM, Tom Lane t...@sss.pgh.pa.us wrote:
> Kenneth Marshall k...@rice.edu writes:
>> On Thu, Dec 23, 2010 at 10:42:26PM +0800, Li Jie wrote:
>>> But in the last query that sorts on id, since the query selects all
>>> the columns for output, the actual sorted size is the same, and the
>>> only difference is the comparison cost. The query sorting on two
>>> columns needs to do twice the comparison. Am I right?
>>
>> I think you are right. Sorry for the confusion.
>
> I doubt the cost of comparing two integers is the issue here; rather
> it's more likely one of how many merge passes were needed. You could
> find out instead of just speculating by turning on trace_sort and
> comparing the log outputs.
>
> regards, tom lane

I followed your advice and set trace_sort. Here are the results for the two queries. The log outputs look similar and the two sorts seem to use the same number of tapes, but the times are different. Could you have a look?

postgres=# explain analyze select * from big_wf order by id;
                                QUERY PLAN
--------------------------------------------------------------------------
 Sort  (cost=565525.45..575775.45 rows=410 width=8) (actual time=28102.985..39351.309 rows=410 loops=1)
   Sort Key: id
   Sort Method: external merge  Disk: 72048kB
   ->  Seq Scan on big_wf  (cost=0.00..59142.00 rows=410 width=8) (actual time=9.190..7262.789 rows=410 loops=1)
 Total runtime: 42953.855 ms
(5 rows)

STATEMENT: explain analyze select * from big_wf order by id;
LOG: begin tuple sort: nkeys = 1, workMem = 20480, randomAccess = f
LOG: switching to external sort with 74 tapes: CPU 0.29s/0.28u sec elapsed 0.71 sec
LOG: finished writing run 1 to tape 0: CPU 0.68s/2.12u sec elapsed 3.41 sec
LOG: finished writing run 2 to tape 1: CPU 1.22s/4.24u sec elapsed 6.53 sec
LOG: finished writing run 3 to tape 2: CPU 1.67s/6.38u sec elapsed 9.67 sec
LOG: finished writing run 4 to tape 3: CPU 2.22s/8.61u sec elapsed 12.93 sec
LOG: finished writing run 5 to tape 4: CPU 2.72s/10.80u sec elapsed 16.09 sec
LOG: finished writing run 6 to tape 5: CPU 3.23s/12.86u sec elapsed 19.12 sec
LOG: finished writing run 7 to tape 6: CPU 3.81s/15.02u sec elapsed 23.84 sec
LOG: finished writing run 8 to tape 7: CPU 4.36s/17.13u sec elapsed 27.05 sec
LOG: performsort starting: CPU 4.38s/17.20u sec elapsed 27.15 sec
LOG: finished writing run 9 to tape 8: CPU 4.39s/17.88u sec elapsed 27.97 sec
LOG: finished writing final run 10 to tape 9: CPU 4.39s/17.88u sec elapsed 27.97 sec
LOG: performsort done (except 10-way final merge): CPU 4.39s/17.98u sec elapsed 28.10 sec
LOG: external sort ended, 9006 disk blocks used: CPU 8.01s/27.02u sec elapsed 42.92 sec

postgres=# explain analyze select * from big_wf order by age,id;
 Sort  (cost=565525.45..575775.45 rows=410 width=8) (actual time=43709.851..57048.645 rows=410 loops=1)
   Sort Key: age, id
   Sort Method: external merge  Disk: 72048kB
   ->  Seq Scan on big_wf  (cost=0.00..59142.00 rows=410 width=8) (actual time=10.090..7075.208 rows=410 loops=1)
 Total runtime: 60721.824 ms

STATEMENT: explain analyze select * from big_wf order by age,id;
LOG: begin tuple sort: nkeys = 2, workMem = 20480, randomAccess = f
LOG: switching to external sort with 74 tapes: CPU 0.28s/0.30u sec elapsed 0.67 sec
LOG: finished writing run 1 to tape 0: CPU 0.71s/3.63u sec elapsed 4.90 sec
LOG: finished writing run 2 to tape 1: CPU 1.30s/7.53u sec elapsed 10.30 sec
LOG: finished writing run 3 to tape 2: CPU 1.88s/11.36u sec elapsed 15.39 sec
LOG: finished writing run 4 to tape 3: CPU 2.43s/15.20u sec elapsed 20.51 sec
[HACKERS] Why is sorting on two columns so much slower than sorting on one column?
Hi,

Here is the test table:

postgres=# \d big_wf
    Table "public.big_wf"
 Column |  Type   | Modifiers
--------+---------+-----------
 age    | integer |
 id     | integer |

postgres=# \dt+ big_wf
                    List of relations
 Schema |  Name  | Type  |  Owner   |  Size  | Description
--------+--------+-------+----------+--------+-------------
 public | big_wf | table | workshop | 142 MB |

The first query, sorting on one column:

postgres=# explain analyze select * from big_wf order by age;
                                QUERY PLAN
--------------------------------------------------------------------------
 Sort  (cost=565525.45..575775.45 rows=410 width=8) (actual time=11228.155..16427.149 rows=410 loops=1)
   Sort Key: age
   Sort Method: external sort  Disk: 72112kB
   ->  Seq Scan on big_wf  (cost=0.00..59142.00 rows=410 width=8) (actual time=6.196..4797.620 rows=410 loops=1)
 Total runtime: 19530.452 ms
(5 rows)

The second query, sorting on two columns:

postgres=# explain analyze select * from big_wf order by age,id;
                                QUERY PLAN
--------------------------------------------------------------------------
 Sort  (cost=565525.45..575775.45 rows=410 width=8) (actual time=37544.779..48206.702 rows=410 loops=1)
   Sort Key: age, id
   Sort Method: external merge  Disk: 72048kB
   ->  Seq Scan on big_wf  (cost=0.00..59142.00 rows=410 width=8) (actual time=6.796..5518.663 rows=410 loops=1)
 Total runtime: 51258.000 ms
(5 rows)

The version is 9.0.1 and work_mem is 20MB. One special thing is that the first column (age) has the same value in every tuple, so the second column (id) is always needed for comparison. While the first sort takes only about 6 seconds, the second takes over 30 seconds. Is this slower than expected? Is there any possible optimization?

Thanks,
Li Jie
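[Editor's note: the "twice the comparison" intuition from this thread can be checked in miniature. With a constant first key, every two-key comparison tests the first key, finds it equal, and falls through to the second. The toy Python sketch below counts individual key comparisons for both sort orders on the same shuffled input; it models only comparator cost, not PostgreSQL's external-merge I/O.]

```python
import functools
import random

def make_cmp(keys):
    # Return a comparator over the given tuple positions, plus a counter
    # of how many individual key comparisons it performs.
    counter = {"n": 0}
    def cmp(a, b):
        for k in keys:
            counter["n"] += 1
            if a[k] != b[k]:
                return -1 if a[k] < b[k] else 1
        return 0
    return cmp, counter

random.seed(0)
# age (position 0) is the same in every row, so the ordering is decided
# by id (position 1) alone under both sort keys.
rows = [(7, i) for i in range(20_000)]
random.shuffle(rows)

cmp_id, c1 = make_cmp([1])          # ORDER BY id
cmp_age_id, c2 = make_cmp([0, 1])   # ORDER BY age, id
sorted(rows, key=functools.cmp_to_key(cmp_id))
sorted(rows, key=functools.cmp_to_key(cmp_age_id))
print(c2["n"] / c1["n"])  # 2.0: every comparison falls through to the second key
```

Because both comparators yield identical outcomes on this data, the sort makes the same sequence of calls in each run, and the two-key version does exactly twice the key-comparison work.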
[HACKERS] Why is percent_rank so much slower than rank?
Hi all,

I'm new to window functions. Recently I ran some simple queries but was surprised to find that percent_rank is so much slower than rank. Could anybody tell me why?

The table schema:

test=# \d inventory1
           Table "public.inventory1"
        Column        |  Type   | Modifiers
----------------------+---------+-----------
 inv_date_sk          | integer | not null
 inv_item_sk          | integer | not null
 inv_warehouse_sk     | integer | not null
 inv_quantity_on_hand | integer |

test=# \dt+ inventory1
                        List of relations
 Schema |    Name    | Type  |  Owner   |  Size   | Description
--------+------------+-------+----------+---------+-------------
 public | inventory1 | table | workshop | 8880 kB |

The rank query result:

test=# explain analyze select inv_date_sk,inv_item_sk, rank() over (partition by inv_date_sk order by inv_item_sk) from inventory1;
                                QUERY PLAN
---------------------------------------------------------------------------
 WindowAgg  (cost=19563.99..23343.99 rows=189000 width=8) (actual time=631.947..1361.158 rows=189000 loops=1)
   ->  Sort  (cost=19563.99..20036.49 rows=189000 width=8) (actual time=631.924..771.990 rows=189000 loops=1)
         Sort Key: inv_date_sk, inv_item_sk
         Sort Method: quicksort  Memory: 12218kB
         ->  Seq Scan on inventory1  (cost=0.00..3000.00 rows=189000 width=8) (actual time=0.055..198.948 rows=189000 loops=1)
 Total runtime: 1500.193 ms
(6 rows)

The percent_rank result:

test=# explain analyze select inv_date_sk,inv_item_sk, percent_rank() over (partition by inv_date_sk order by inv_item_sk) from inventory1;
                                QUERY PLAN
---------------------------------------------------------------------------
 WindowAgg  (cost=19563.99..23343.99 rows=189000 width=8) (actual time=766.432..32924.804 rows=189000 loops=1)
   ->  Sort  (cost=19563.99..20036.49 rows=189000 width=8) (actual time=756.320..905.407 rows=189000 loops=1)
         Sort Key: inv_date_sk, inv_item_sk
         Sort Method: quicksort  Memory: 12218kB
         ->  Seq Scan on inventory1  (cost=0.00..3000.00 rows=189000 width=8) (actual time=0.102..224.607 rows=189000 loops=1)
 Total runtime: 33152.188 ms
(6 rows)

One special thing is that all the values of the partition key (inv_date_sk) are the same; that is, there is only one window partition. I find that percent_rank needs to buffer all the tuples to get the total number of rows.
But why is it so expensive? The version is 8.4.4, and I only increased work_mem to 100MB and left the other parameters untouched.

Thanks,
Li Jie
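[Editor's note: the structural difference between the two functions can be shown with a toy Python model of a single window partition. rank() depends only on rows already seen, but percent_rank() is defined as (rank - 1) / (N - 1), where N is the partition size, so the whole partition must be materialized before the first value can be emitted. This sketch illustrates the definition only; it is not the 8.4 WindowAgg code and says nothing about where the 30-second overhead comes from.]

```python
def rank_and_percent_rank(values):
    """Compute rank() and percent_rank() over one window partition."""
    rows = sorted(values)          # the partition, fully materialized
    n = len(rows)                  # percent_rank needs N before emitting anything
    out = []
    rank = 0
    for i, v in enumerate(rows):
        if i == 0 or v != rows[i - 1]:
            rank = i + 1           # rank() leaves gaps after ties
        pr = (rank - 1) / (n - 1) if n > 1 else 0.0
        out.append((v, rank, pr))
    return out
```

For example, rank_and_percent_rank([10, 20, 20, 30]) ranks the tied 20s equally and spreads percent_rank from 0.0 to 1.0 across the partition.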