Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-11-13 Thread Robert Haas
On Fri, Nov 12, 2010 at 10:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Bruce Momjian br...@momjian.us writes:
 FYI, I always wondered if the rare use of mergejoins justified the extra
 planning time of carrying around all those joinpaths.

 They're hardly rare.

They're fairly rare in the sorts of queries I normally issue, but I'd
quibble with the statement on other grounds: IME, we generate far more
nestloop paths than anything else.  The comment in
match_unsorted_outer() says it all:

 * We always generate a nestloop path for each available outer path.
 * In fact we may generate as many as five: one on the cheapest-total-cost
 * inner path, one on the same with materialization, one on the
 * cheapest-startup-cost inner path (if different), one on the
 * cheapest-total inner-indexscan path (if any), and one on the
 * cheapest-startup inner-indexscan path (if different).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-11-12 Thread Bruce Momjian
Tom Lane wrote:
 Leonardo Francalanci m_li...@yahoo.it writes:
  Cases with lots of irrelevant indexes.  Zoltan's example had 4 indexes
  per child table, only one of which was relevant to the query.  In your
  test case there are no irrelevant indexes, which is why the runtime
  didn't change.
 
  Mmh... I must be doing something wrong. It looks to me like it's not just
  the irrelevant indexes: it's the order by that counts.
 
 Ah, I oversimplified a bit: actually, if you don't have an ORDER BY or
 any mergejoinable join clauses, then the possibly_useful_pathkeys test
 in find_usable_indexes figures out that we aren't interested in the sort
 ordering of *any* indexes, so the whole thing gets short-circuited.
 You need at least the possibility of interest in sorted output from an
 indexscan before any of this code runs.

FYI, I always wondered if the rare use of mergejoins justified the extra
planning time of carrying around all those joinpaths.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-11-12 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 FYI, I always wondered if the rare use of mergejoins justified the extra
 planning time of carrying around all those joinpaths.

They're hardly rare.

regards, tom lane



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-11-04 Thread Tom Lane
[ for the archives' sake ]

I wrote:
 I had a thought about how to make get_tabstat_entry() faster without
 adding overhead: what if we just plain remove the search, and always
 assume that a new entry has to be added to the tabstat array?

I spent some time looking into this idea.  It doesn't really work,
because there are places that will break if a transaction has more than
one tabstat entry for the same relation.  The one that seems most
difficult to fix is that pgstat_recv_tabstat() clamps its n_live_tuples
and n_dead_tuples values to be nonnegative after adding in each delta
received from a backend.  That is a good idea because it prevents insane
results if some messages get lost --- but if a transaction's updates get
randomly spread into several tabstat items, the intermediate counts
might get clamped, resulting in a wrong final answer even though nothing
was lost.

I also added some instrumentation printouts and found that in our
regression tests:
* about 10% of get_tabstat_entry() calls find an existing entry
  for the relation OID.  This seems to happen only when a
  relcache entry gets flushed mid-transaction, but that does
  happen, and not so infrequently either.
* about half of the transactions use as many as 20 tabstats,
  and 10% use 50 or more; but it drops off fast after that.
  Almost no transactions use as many as 100 tabstats.
It's not clear that these numbers are representative of typical
database applications, but they're something to start with anyway.

I also did some testing to compare the cost of get_tabstat_entry's
linear search against a dynahash.c table with OID key.  As I suspected,
a hash table would make this code a *lot* slower for small numbers of
tabstat entries: about a factor of 10 slower.  You need upwards of 100
tabstats touched in a transaction before the hash table begins to pay
for itself.  This is largely because dynahash doesn't have any cheap way
to reset a hashtable to empty, so you have to initialize and destroy the
table for each transaction.  By the time you've eaten that overhead,
you've already expended as many cycles as the linear search takes to
handle several dozen entries.

I conclude that if we wanted to do something about this, the most
practical solution would be the one of executing linear searches until
we get to 100+ tabstat entries in a transaction, and then building a
hashtable for subsequent searches.  However, it's exceedingly unclear
that it will ever be worth the effort or code space to do that.

regards, tom lane



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-11-01 Thread Tom Lane
I wrote:
 samples  %symbol name
 447433   47.1553  get_tabstat_entry
 185458   19.5456  find_all_inheritors
 53064 5.5925  SearchCatCache
 33864 3.5690  pg_strtok

 get_tabstat_entry and find_all_inheritors are both obviously O(N^2) in
 the number of tables they have to deal with.  However, the constant
 factors are small enough that you need a heck of a lot of tables
 before they become significant consumers of runtime.  I'm not convinced
 that we should be optimizing for 9000-child-table cases.

 It'd be worth fixing these if we can do it without either introducing a
 lot of complexity, or slowing things down for typical cases that have
 only a few tables.  Offhand not sure about how to do either.

I had a thought about how to make get_tabstat_entry() faster without
adding overhead: what if we just plain remove the search, and always
assume that a new entry has to be added to the tabstat array?

The existing code seems to be designed to make no assumptions about
how it's being used, but that's a bit silly.  We know that the links are
coming from the relcache, which will have only one entry per relation,
and that the relcache is designed to hang onto the links for (at least)
the life of a transaction.  So rather than optimizing for the case where
the relcache fails to remember the tabstat link, maybe we should
optimize for the case where it does remember.

The worst-case consequence AFAICS would be multiple tabstat entries for
the same relation, which seems pretty noncritical anyway.

Thoughts?

regards, tom lane



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-29 Thread Tom Lane
I wrote:
 the right way to make this faster is to refactor things so that we
 don't generate useless equivalence classes in the first place, or
 at least don't keep them around in the planner's lists once we realize
 they're useless.

After a bit of hacking, I propose the attached patch.

 I like Heikki's hack to cut down on searching in make_canonical_pathkey,
 but I think that complicating the data structure searching beyond that
 is just a band-aid.

With the given test case and this patch, we end up with exactly two
canonical pathkeys referencing a single EquivalenceClass.  So as far
as I can tell there's not a lot of point in refining the pathkey
searching.  Now, the EquivalenceClass has got 483 members, which
means that there's still some O(N^2) behavior in
get_eclass_for_sort_expr.  There might be some use in refining the
search for a matching eclass member.  It's not sticking out in
profiling like it did before though.

regards, tom lane

diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index d6402cf911817b1b8c17da91019a1fac19bf051a..5c0786f2fe6dea9a72ad66ba93aa8833ab0e26ba 100644
*** a/src/backend/optimizer/README
--- b/src/backend/optimizer/README
*** sort ordering was important; and so usin
*** 632,640 
  orderings doesn't create any real problem.
  
  
  
- Though Bob Devine bob.dev...@worldnet.att.net was not involved in the 
- coding of our optimizer, he is available to field questions about
- optimizer topics.
  
  -- bjm & tgl
--- 632,670 
  orderings doesn't create any real problem.
  
  
+ Order of processing for EquivalenceClasses and PathKeys
+ -------------------------------------------------------
+ 
+ As alluded to above, there is a specific sequence of phases in the
+ processing of EquivalenceClasses and PathKeys during planning.  During the
+ initial scanning of the query's quals (deconstruct_jointree followed by
+ reconsider_outer_join_clauses), we construct EquivalenceClasses based on
+ mergejoinable clauses found in the quals.  At the end of this process,
+ we know all we can know about equivalence of different variables, so
+ subsequently there will be no further merging of EquivalenceClasses.
+ At that point it is possible to consider the EquivalenceClasses as
+ canonical and build canonical PathKeys that reference them.  Before
+ we reach that point (actually, before entering query_planner at all)
+ we also ensure that we have constructed EquivalenceClasses for all the
+ expressions used in the query's ORDER BY and related clauses.  These
+ classes might or might not get merged together, depending on what we
+ find in the quals.
+ 
+ Because all the EquivalenceClasses are known before we begin path
+ generation, we can use them as a guide to which indexes are of interest:
+ if an index's column is not mentioned in any EquivalenceClass then that
+ index's sort order cannot possibly be helpful for the query.  This allows
+ short-circuiting of much of the processing of create_index_paths() for
+ irrelevant indexes.
+ 
+ There are some cases where planner.c constructs additional
+ EquivalenceClasses and PathKeys after query_planner has completed.
+ In these cases, the extra ECs/PKs are needed to represent sort orders
+ that were not considered during query_planner.  Such situations should be
+ minimized since it is impossible for query_planner to return a plan
+ producing such a sort order, meaning an explicit sort will always be needed.
+ Currently this happens only for queries involving multiple window functions
+ with different orderings, so extra sorts are needed anyway.
  
  
  -- bjm & tgl
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index e44e960b5454d4698ed82e4e857794ffe2a9adf2..c101c272a14b2f1b9d92a54670688df057d84a13 100644
*** a/src/backend/optimizer/path/equivclass.c
--- b/src/backend/optimizer/path/equivclass.c
*** static bool reconsider_full_join_clause(
*** 78,83 
--- 78,87 
   * join.  (This is the reason why we need a failure return.  It's more
   * convenient to check this case here than at the call sites...)
   *
+  * On success return, we have also initialized the clause's left_ec/right_ec
+  * fields to point to the EquivalenceClass built from it.  This saves lookup
+  * effort later.
+  *
   * Note: constructing merged EquivalenceClasses is a standard UNION-FIND
   * problem, for which there exist better data structures than simple lists.
   * If this code ever proves to be a bottleneck then it could be sped up ---
*** process_equivalence(PlannerInfo *root, R
*** 106,111 
--- 110,119 
  			   *em2;
  	ListCell   *lc1;
  
+ 	/* Should not already be marked as having generated an eclass */
+ 	Assert(restrictinfo->left_ec == NULL);
+ 	Assert(restrictinfo->right_ec == NULL);
+ 
  	/* Extract info from given clause */
  	Assert(is_opclause(clause));
  	opno = ((OpExpr *) clause)->opno;
*** 

Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-29 Thread Leonardo Francalanci
 On the other hand, if I use a similar test case to my original one
 (i.e. the tables are much wider) then the query planning takes
 1.42 seconds in 9.1 with this patch instead of about 4.7 seconds
 as we observed it using PostgreSQL 9.0.0. The beginning of the gprof
 output now looks like this:

Hi,

I'm really interested in this patch.

I tried a simple test case:

create table t (a integer, b text);

DO $$DECLARE i int;
BEGIN
FOR i IN 0..9000 LOOP
EXECUTE 'create table t' || i || ' ( CHECK (a > ' || i*10 || '
and a <= ' || (i+1)*10 || ' ) ) INHERITS (t)';
EXECUTE 'create index tidx' || i || ' ON t' || i || ' (a)';
END LOOP;
END$$;

explain select * from t where a > 1060 and a < 1090;

but I don't get any gain from the patch... explain time is still around 250 ms.

Tried with 9000 partitions, time is still 2 secs.


Maybe I've missed completely the patch purpose?


(I tried the test case at

http://archives.postgresql.org/message-id/4cbd9ddc.4040...@cybertec.at

and that, in fact, gets a boost with this patch).



Leonardo



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-29 Thread Leonardo Francalanci
 but I don't get any gain from the patch... explain time is still
 around 250 ms.
 Tried with 9000 partitions, time is still 2 secs.


Small correction: I tried with 3000 partitions (FOR i IN 0..3000 ...)
and got 250ms with both versions, with 9000 partitions 2 secs (again
no gain from the patch)



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-29 Thread Tom Lane
Leonardo Francalanci m_li...@yahoo.it writes:
 I tried a simple test case:

 create table t (a integer, b text);

 DO $$DECLARE i int;
 BEGIN
 FOR i IN 0..9000 LOOP
 EXECUTE 'create table t' || i || ' ( CHECK (a > ' || i*10 || '
 and a <= ' || (i+1)*10 || ' ) ) INHERITS (t)';
 EXECUTE 'create index tidx' || i || ' ON t' || i || ' (a)';
 END LOOP;
 END$$;

 explain select * from t where a > 1060 and a < 1090;

This is going to be dominated by constraint exclusion checking.  There's
basically no fix for that except a more explicit representation of the
partitioning rules.  If the planner has to make 8999 theorem proofs to
remove the 8999 unwanted partitions from the plan, it's gonna take
awhile.

regards, tom lane



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-29 Thread Leonardo Francalanci
 This is going to be dominated by constraint exclusion checking.  There's
 basically no fix for that except a more explicit representation of the
 partitioning rules.

Damn, I knew that was going to be more complicated :)

So in which cases does this patch help? I guess in a multi-index
scenario? childtables.sql is kind of hard to read (I think a FOR loop
would have been more self-explanatory?).



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-29 Thread Tom Lane
Boszormenyi Zoltan z...@cybertec.at writes:
 On the other hand, if I use a similar test case to my original one
 (i.e. the tables are much wider) then the query planning takes
 1.42 seconds in 9.1 with this patch instead of about 4.7 seconds
 as we observed it using PostgreSQL 9.0.0. The beginning of the gprof
 output now looks like this:

   %   cumulative   self  self total
  time   seconds   seconds    calls   s/call   s/call  name
  21.13  0.30 0.30   235091 0.00 0.00  SearchCatCache
   7.04  0.40 0.10  1507206 0.00 0.00  hash_search_with_hash_value
   3.52  0.45 0.05  2308219 0.00 0.00  AllocSetAlloc

Yeah, for me it looks even worse: oprofile shows about 77% of time in
SearchCatCache.  I poked around a little and it seems that probably most
of the time is going into searches of the STATRELATTINH syscache, which
looks like this:

$13 = {id = 41, cc_next = 0x2b43a60, 
  cc_relname = 0x7f6bc6ed2218 "pg_statistic", cc_reloid = 2619, 
  cc_indexoid = 2696, cc_relisshared = 0 '\000', cc_tupdesc = 0x7f6bc6ed11d8, 
  cc_ntup = 68922, cc_nbuckets = 1024, cc_nkeys = 3, cc_key = {1, 2, 3, 0}, 
  ...

Most of those entries are negative cache entries, since we don't have
any actual stats in this toy example.

I think that we probably should be very circumspect about believing that
this example is still a good guide to what to optimize next; in
particular, in a real-world example with real stats, I'm not sure that
the hot spots will still be in the same places.  I'd advise loading up
some real data and doing more profiling.

However, if the hot spot does stay in SearchCatCache, I can't help
noticing that those bucket chains are looking a bit overloaded ---
sixty-plus entries per bucket ain't good.  Maybe it's time to teach
catcache.c how to reorganize its hashtables once the load factor
exceeds a certain level.  Or more drastically, maybe it should lose
its private hashtable logic and use dynahash.c; I'm not sure at the
moment if the private implementation has any important characteristics
dynahash hasn't got.

regards, tom lane



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-29 Thread Tom Lane
I wrote:
 This is going to be dominated by constraint exclusion checking.

Hmm, maybe I spoke too soon.  With 9000 child tables I get a profile
like this:

samples  %symbol name
447433   47.1553  get_tabstat_entry
185458   19.5456  find_all_inheritors
53064 5.5925  SearchCatCache
33864 3.5690  pg_strtok
26301 2.7719  hash_search_with_hash_value
22577 2.3794  AllocSetAlloc
6696  0.7057  MemoryContextAllocZeroAligned
6250  0.6587  expression_tree_walker
5141  0.5418  LockReleaseAll
4779  0.5037  get_relation_info
4506  0.4749  MemoryContextAlloc
4467  0.4708  expression_tree_mutator
4136  0.4359  pgstat_initstats
3914  0.4125  relation_excluded_by_constraints

get_tabstat_entry and find_all_inheritors are both obviously O(N^2) in
the number of tables they have to deal with.  However, the constant
factors are small enough that you need a heck of a lot of tables
before they become significant consumers of runtime.  I'm not convinced
that we should be optimizing for 9000-child-table cases.

It'd be worth fixing these if we can do it without either introducing a
lot of complexity, or slowing things down for typical cases that have
only a few tables.  Offhand not sure about how to do either.

regards, tom lane



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-29 Thread Robert Haas
On Fri, Oct 29, 2010 at 12:53 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Boszormenyi Zoltan z...@cybertec.at writes:
 On the other hand, if I use a similar test case to my original one
 (i.e. the tables are much wider) then the query planning takes
 1.42 seconds in 9.1 with this patch instead of about 4.7 seconds
 as we observed it using PostgreSQL 9.0.0. The beginning of the gprof
 output now looks like this:

   %   cumulative   self              self     total
  time   seconds   seconds    calls   s/call   s/call  name
  21.13      0.30     0.30   235091     0.00     0.00  SearchCatCache
   7.04      0.40     0.10  1507206     0.00     0.00  hash_search_with_hash_value
   3.52      0.45     0.05  2308219     0.00     0.00  AllocSetAlloc

 Yeah, for me it looks even worse: oprofile shows about 77% of time in
 SearchCatCache.  I poked around a little and it seems that probably most
 of the time is going into searches of the STATRELATTINH syscache, which
 looks like this:

 $13 = {id = 41, cc_next = 0x2b43a60,
  cc_relname = 0x7f6bc6ed2218 "pg_statistic", cc_reloid = 2619,
  cc_indexoid = 2696, cc_relisshared = 0 '\000', cc_tupdesc = 0x7f6bc6ed11d8,
  cc_ntup = 68922, cc_nbuckets = 1024, cc_nkeys = 3, cc_key = {1, 2, 3, 0},
  ...

 Most of those entries are negative cache entries, since we don't have
 any actual stats in this toy example.

 I think that we probably should be very circumspect about believing that
 this example is still a good guide to what to optimize next; in
 particular, in a real-world example with real stats, I'm not sure that
 the hot spots will still be in the same places.  I'd advise loading up
 some real data and doing more profiling.

 However, if the hot spot does stay in SearchCatCache, I can't help
 noticing that those bucket chains are looking a bit overloaded ---
 sixty-plus entries per bucket ain't good.  Maybe it's time to teach
 catcache.c how to reorganize its hashtables once the load factor
 exceeds a certain level.  Or more drastically, maybe it should lose
 its private hashtable logic and use dynahash.c; I'm not sure at the
 moment if the private implementation has any important characteristics
 dynahash hasn't got.

I'm not sure what's happening in this particular case, but I seem to
remember poking at a case a while back where we were doing a lot of
repeated statistics lookups for the same columns.  If that's also the
case here and if there is some way to avoid it (hang a pointer to
the stats off the node tree somewhere?) we might be able to cut down
on the number of hash probes, as an alternative to or in addition to
making them faster.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-29 Thread Leonardo Francalanci
 Hmm, maybe I spoke too soon.  With 9000 child tables I get a profile
 like this:


Well, the 9000-table-test-case was meant to check the difference in
performance with/without the patch... I don't see the reason for trying
to optimize such an unrealistic case.

BTW can someone explain to me which are the cases where the
patch actually helps?



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-29 Thread Tom Lane
Leonardo Francalanci m_li...@yahoo.it writes:
 BTW can someone explain to me which are the cases where the
 patch actually helps?

Cases with lots of irrelevant indexes.  Zoltan's example had 4 indexes
per child table, only one of which was relevant to the query.  In your
test case there are no irrelevant indexes, which is why the runtime
didn't change.

regards, tom lane



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-29 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Oct 29, 2010 at 12:53 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 However, if the hot spot does stay in SearchCatCache, I can't help
 noticing that those bucket chains are looking a bit overloaded ---
 sixty-plus entries per bucket ain't good.  Maybe it's time to teach
 catcache.c how to reorganize its hashtables once the load factor
 exceeds a certain level.  Or more drastically, maybe it should lose
 its private hashtable logic and use dynahash.c; I'm not sure at the
 moment if the private implementation has any important characteristics
 dynahash hasn't got.

 I'm not sure what's happening in this particular case, but I seem to
 remember poking at a case a while back where we were doing a lot of
 repeated statistics lookups for the same columns.  If that's also the
  case here and if there is some way to avoid it (hang a pointer to
 the stats off the node tree somewhere?) we might be able to cut down
 on the number of hash probes, as an alternative to or in addition to
 making them faster.

I think there are already layers of caching in the planner to avoid
fetching the same stats entries more than once per query.  The problem
here is that there are so many child tables that even fetching stats
once per table per query starts to add up.  (Also, as I said, I'm
worried that we're being misled by the fact that there are no stats to
fetch --- so we're not seeing the costs of actually doing something with
the stats if they existed.)

regards, tom lane



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-29 Thread Alvaro Herrera
Excerpts from Tom Lane's message of vie oct 29 14:15:55 -0300 2010:
 I wrote:
  This is going to be dominated by constraint exclusion checking.
 
 Hmm, maybe I spoke too soon.  With 9000 child tables I get a profile
 like this:
 
 samples  %symbol name
 447433   47.1553  get_tabstat_entry

Is there a reason for keeping the pgstat info in plain lists?  We could
use dynahash there too, I think.  It would have more palloc overhead
that way, though (hmm, but perhaps that can be fixed by having a smart
alloc function for it, preallocating a bunch of entries instead of one
by one).

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-29 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Tom Lane's message of vie oct 29 14:15:55 -0300 2010:
 samples  %symbol name
 447433   47.1553  get_tabstat_entry

 Is there a reason for keeping the pgstat info in plain lists?

Yeah: anything else loses for small numbers of tables per query, which
is the normal case.  I'd guess you'd need ~100 tables touched in
a single transaction before a hashtable is even worth thinking about.

We could possibly adopt a solution similar to the planner's approach for
joinrels: start with a simple list, and switch over to hashing if the
list gets too long.  But I'm really doubtful that it's worth the code
space.  Even with Zoltan's 500-or-so-table case, this wasn't on the
radar screen.

regards, tom lane



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-29 Thread Leonardo Francalanci
 Cases with lots of irrelevant indexes.  Zoltan's example had 4 indexes
 per child table, only one of which was relevant to the query.  In your
 test case there are no irrelevant indexes, which is why the runtime
 didn't change.



Mmh... I must be doing something wrong. It looks to me like it's not just
the irrelevant indexes: it's the order by that counts. If I remove that,
times are the same with and without the patch:

using the test case:

explain select * from inh_parent
where timestamp1 between '2010-04-06' and '2010-06-25'

this one runs in the same time with the patch; but adding:


order by timestamp2

made the non-patched version run 3 times slower.

My test case:

create table t (a integer, b integer, c integer, d integer, e text);

DO $$DECLARE i int;
BEGIN
FOR i IN 0..2000 LOOP
EXECUTE 'create table t' || i || ' ( CHECK (a > ' || i*10 || '
and a <= ' || (i+1)*10 || ' ) ) INHERITS (t)';
EXECUTE 'create index taidx' || i || ' ON t' || i || ' (a)';
EXECUTE 'create index tbidx' || i || ' ON t' || i || ' (b)';
EXECUTE 'create index tcidx' || i || ' ON t' || i || ' (c)';
EXECUTE 'create index tdidx' || i || ' ON t' || i || ' (d)';
END LOOP;
END$$;


explain select * from t where a > 1060 and a < 109000

this runs in 1.5 secs with and without the patch. But if I add

 order by b 

the non-patched version runs in 10 seconds.

Am I getting it wrong?



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-29 Thread Tom Lane
Leonardo Francalanci m_li...@yahoo.it writes:
 Cases with lots of irrelevant indexes.  Zoltan's example had 4 indexes
 per child table, only one of which was relevant to the query.  In your
 test case there are no irrelevant indexes, which is why the runtime
 didn't change.

 Mmh... I must be doing something wrong. It looks to me like it's not just
 the irrelevant indexes: it's the order by that counts.

Ah, I oversimplified a bit: actually, if you don't have an ORDER BY or
any mergejoinable join clauses, then the possibly_useful_pathkeys test
in find_usable_indexes figures out that we aren't interested in the sort
ordering of *any* indexes, so the whole thing gets short-circuited.
You need at least the possibility of interest in sorted output from an
indexscan before any of this code runs.

regards, tom lane



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-28 Thread Boszormenyi Zoltan
Heikki Linnakangas írta:
 On 26.10.2010 18:34, Boszormenyi Zoltan wrote:
 thank you very much for pointing me to dynahash, here is the
 next version that finally seems to work.

 Two patches are attached, the first is the absolute minimum for
 making it work, this still has the Tree type for canon_pathkeys
 and eq_classes got the same treatment as join_rel_list/join_rel_hash
 has in the current sources: if the list grows larger than 32, a hash
 table
 is created. It seems to be enough to do it for
   get_eclass_for_sort_expr()
 only; the other users of eq_classes aren't bothered by this change.

 That's better, but can't you use dynahash for canon_pathkeys as well?

Here's a purely dynahash solution. It's somewhat slower than
the tree version, 0.45 vs 0.41 seconds in the cached case for the
previously posted test case.

Best regards,
Zoltán Böszörményi

-- 
--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/

diff -durpN postgresql.orig/src/backend/optimizer/path/equivclass.c postgresql.1/src/backend/optimizer/path/equivclass.c
--- postgresql.orig/src/backend/optimizer/path/equivclass.c	2010-10-15 10:31:47.0 +0200
+++ postgresql.1/src/backend/optimizer/path/equivclass.c	2010-10-26 17:01:57.0 +0200
@@ -24,6 +24,7 @@
 #include "optimizer/planmain.h"
 #include "optimizer/prep.h"
 #include "optimizer/var.h"
+#include "utils/hsearch.h"
 #include "utils/lsyscache.h"
 
 
@@ -360,75 +361,103 @@ add_eq_member(EquivalenceClass *ec, Expr
 
 
 /*
- * get_eclass_for_sort_expr
- *	  Given an expression and opfamily info, find an existing equivalence
- *	  class it is a member of; if none, build a new single-member
- *	  EquivalenceClass for it.
- *
- * sortref is the SortGroupRef of the originating SortGroupClause, if any,
- * or zero if not.	(It should never be zero if the expression is volatile!)
- *
- * This can be used safely both before and after EquivalenceClass merging;
- * since it never causes merging it does not invalidate any existing ECs
- * or PathKeys.
- *
- * Note: opfamilies must be chosen consistently with the way
- * process_equivalence() would do; that is, generated from a mergejoinable
- * equality operator.  Else we might fail to detect valid equivalences,
- * generating poor (but not incorrect) plans.
+ * eq_classes_match - matching function for eq_classes_hash in PlannerInfo
  */
-EquivalenceClass *
-get_eclass_for_sort_expr(PlannerInfo *root,
-		 Expr *expr,
-		 Oid expr_datatype,
-		 List *opfamilies,
-		 Index sortref)
+static int
+eq_classes_match(const void *key1, const void *key2, Size keysize)
 {
-	EquivalenceClass *newec;
-	EquivalenceMember *newem;
+	EquivalenceClass *ec1 = (EquivalenceClass *) key1; /* this is in the hashtable */
+	EquivalenceClass *ec2 = (EquivalenceClass *) key2; /* this is the new matched entry */
 	ListCell   *lc1;
-	MemoryContext oldcontext;
+	ListCell   *lc2;
 
 	/*
-	 * Scan through the existing EquivalenceClasses for a match
+	 * Never match to a volatile EC, except when we are looking at another
+	 * reference to the same volatile SortGroupClause.
 	 */
-	foreach(lc1, root-eq_classes)
+	if (ec1-ec_has_volatile 
+		(ec2-ec_sortref == 0 || ec2-ec_sortref != ec1-ec_sortref))
+		return 1;
+
+	if (!equal(ec1-ec_opfamilies, ec2-ec_opfamilies))
+		return 1;
+
+	foreach(lc1, ec1-ec_members)
 	{
-		EquivalenceClass *cur_ec = (EquivalenceClass *) lfirst(lc1);
-		ListCell   *lc2;
+		EquivalenceMember *em1 = (EquivalenceMember *) lfirst(lc1);
 
 		/*
-		 * Never match to a volatile EC, except when we are looking at another
-		 * reference to the same volatile SortGroupClause.
+		 * If below an outer join, don't match constants: they're not as
+		 * constant as they look.
 		 */
-		if (cur_ec-ec_has_volatile 
-			(sortref == 0 || sortref != cur_ec-ec_sortref))
-			continue;
-
-		if (!equal(opfamilies, cur_ec-ec_opfamilies))
+		if (ec1-ec_below_outer_join 
+			em1-em_is_const)
 			continue;
 
-		foreach(lc2, cur_ec-ec_members)
+		foreach(lc2, ec2-ec_members)
 		{
-			EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc2);
-
-			/*
-			 * If below an outer join, don't match constants: they're not as
-			 * constant as they look.
-			 */
-			if (cur_ec-ec_below_outer_join 
-cur_em-em_is_const)
-continue;
+			EquivalenceMember *em2 = (EquivalenceMember *) lfirst(lc2);
 
-			if (expr_datatype == cur_em-em_datatype 
-equal(expr, cur_em-em_expr))
-return cur_ec;	/* Match! */
+			if (em1-em_datatype == em2-em_datatype 
+equal(em1-em_expr, em2-em_expr))
+return 0;
 		}
 	}
 
+	return 1;
+}
+
+
+/*
+ * build_eq_classes_hash
+ *	Build the initial contents of PlannerInfo.eq_classes_hash
+ *	for faster search in PlannerInfo.eq_classes. This is used
+ *	to  make   get_eclass_for_sort_expr()  faster  for  large
+ *	inheritance trees.
+ */
+static void

Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-28 Thread Boszormenyi Zoltan
Boszormenyi Zoltan wrote:
 Heikki Linnakangas wrote:
   
 On 26.10.2010 18:34, Boszormenyi Zoltan wrote:
 
 thank you very much for pointing me to dynahash, here is the
 next version that finally seems to work.

 Two patches are attached, the first is the absolute minimum for
 making it work, this still has the Tree type for canon_pathkeys
 and eq_classes got the same treatment as join_rel_list/join_rel_hash
 has in the current sources: if the list grows larger than 32, a hash
 table
 is created. It seems to be enough to do this for
   get_eclass_for_sort_expr()
 only, the other users of eq_classes aren't bothered by this change.
   
 That's better, but can't you use dynahash for canon_pathkeys as well?
 

 Here's a purely dynahash solution. It's somewhat slower than
 the tree version, 0.45 vs 0.41 seconds in the cached case for the
 previously posted test case.
   

And now in context diff, sorry for my affection towards unified diffs. :-)

 Best regards,
 Zoltán Böszörményi

   


-- 
--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/

diff -dcrpN postgresql.orig/src/backend/optimizer/path/equivclass.c postgresql.1/src/backend/optimizer/path/equivclass.c
*** postgresql.orig/src/backend/optimizer/path/equivclass.c	2010-10-15 10:31:47.0 +0200
--- postgresql.1/src/backend/optimizer/path/equivclass.c	2010-10-26 17:01:57.0 +0200
***************
*** 24,29 ****
--- 24,30 ----
  #include "optimizer/planmain.h"
  #include "optimizer/prep.h"
  #include "optimizer/var.h"
+ #include "utils/hsearch.h"
  #include "utils/lsyscache.h"
  
  
*************** add_eq_member(EquivalenceClass *ec, Expr
*** 360,434 ****
  
  
  /*
!  * get_eclass_for_sort_expr
!  *	  Given an expression and opfamily info, find an existing equivalence
!  *	  class it is a member of; if none, build a new single-member
!  *	  EquivalenceClass for it.
!  *
!  * sortref is the SortGroupRef of the originating SortGroupClause, if any,
!  * or zero if not.	(It should never be zero if the expression is volatile!)
!  *
!  * This can be used safely both before and after EquivalenceClass merging;
!  * since it never causes merging it does not invalidate any existing ECs
!  * or PathKeys.
!  *
!  * Note: opfamilies must be chosen consistently with the way
!  * process_equivalence() would do; that is, generated from a mergejoinable
!  * equality operator.  Else we might fail to detect valid equivalences,
!  * generating poor (but not incorrect) plans.
   */
! EquivalenceClass *
! get_eclass_for_sort_expr(PlannerInfo *root,
! 						 Expr *expr,
! 						 Oid expr_datatype,
! 						 List *opfamilies,
! 						 Index sortref)
  {
! 	EquivalenceClass *newec;
! 	EquivalenceMember *newem;
  	ListCell   *lc1;
! 	MemoryContext oldcontext;
  
  	/*
! 	 * Scan through the existing EquivalenceClasses for a match
  	 */
! 	foreach(lc1, root->eq_classes)
  	{
! 		EquivalenceClass *cur_ec = (EquivalenceClass *) lfirst(lc1);
! 		ListCell   *lc2;
  
  		/*
! 		 * Never match to a volatile EC, except when we are looking at another
! 		 * reference to the same volatile SortGroupClause.
  		 */
! 		if (cur_ec->ec_has_volatile &&
! 			(sortref == 0 || sortref != cur_ec->ec_sortref))
! 			continue;
! 
! 		if (!equal(opfamilies, cur_ec->ec_opfamilies))
  			continue;
  
! 		foreach(lc2, cur_ec->ec_members)
  		{
! 			EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc2);
! 
! 			/*
! 			 * If below an outer join, don't match constants: they're not as
! 			 * constant as they look.
! 			 */
! 			if (cur_ec->ec_below_outer_join &&
! 				cur_em->em_is_const)
! 				continue;
  
! 			if (expr_datatype == cur_em->em_datatype &&
! 				equal(expr, cur_em->em_expr))
! 				return cur_ec;	/* Match! */
  		}
  	}
  
  	/*
- 	 * No match, so build a new single-member EC
- 	 *
  	 * Here, we must be sure that we construct the EC in the right context. We
  	 * can assume, however, that the passed expr is long-lived.
  	 */
--- 361,463 ----
  
  
  /*
!  * eq_classes_match - matching function for eq_classes_hash in PlannerInfo
   */
! static int
! eq_classes_match(const void *key1, const void *key2, Size keysize)
  {
! 	EquivalenceClass *ec1 = (EquivalenceClass *) key1; /* this is in the hashtable */
! 	EquivalenceClass *ec2 = (EquivalenceClass *) key2; /* this is the new matched entry */
  	ListCell   *lc1;
! 	ListCell   *lc2;
  
  	/*
! 	 * Never match to a volatile EC, except when we are looking at another
! 	 * reference to the same volatile SortGroupClause.
  	 */
! 	if (ec1->ec_has_volatile &&
! 		(ec2->ec_sortref == 0 || ec2->ec_sortref != ec1->ec_sortref))
! 		return 1;
! 
! 	if (!equal(ec1->ec_opfamilies, ec2->ec_opfamilies))
! 		return 1;
! 
! 	foreach(lc1, ec1->ec_members)
  	{
! 		EquivalenceMember *em1 = (EquivalenceMember *) lfirst(lc1);
  
  		/*
! 		 * If below an outer join, don't match constants: they're not as
! 		 * 

Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-28 Thread Boszormenyi Zoltan
Boszormenyi Zoltan wrote:
 Boszormenyi Zoltan wrote:
  Heikki Linnakangas wrote:
 
 On 26.10.2010 18:34, Boszormenyi Zoltan wrote:
 
   
 thank you very much for pointing me to dynahash, here is the
 next version that finally seems to work.

 Two patches are attached, the first is the absolute minimum for
 making it work, this still has the Tree type for canon_pathkeys
 and eq_classes got the same treatment as join_rel_list/join_rel_hash
 has in the current sources: if the list grows larger than 32, a hash
 table
 is created. It seems to be enough to do this for
   get_eclass_for_sort_expr()
 only, the other users of eq_classes aren't bothered by this change.
   
 
 That's better, but can't you use dynahash for canon_pathkeys as well?
 
   
 Here's a purely dynahash solution. It's somewhat slower than
 the tree version, 0.45 vs 0.41 seconds in the cached case for the
 previously posted test case.
   
 

 And now in context diff, sorry for my affection towards unified diffs. :-)
   

A little better version: there's no need for the heavy hash_any,
hash_uint32 on the lower 32 bits of pk_eclass is enough. The profiled
runtime is now 0.42 seconds vs. the previous 0.41 seconds for the tree version.

Best regards,
Zoltán Böszörményi

-- 
--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/

diff -dcrpN postgresql.orig/src/backend/optimizer/path/equivclass.c postgresql.1/src/backend/optimizer/path/equivclass.c
*** postgresql.orig/src/backend/optimizer/path/equivclass.c	2010-10-15 10:31:47.0 +0200
--- postgresql.1/src/backend/optimizer/path/equivclass.c	2010-10-28 12:37:47.0 +0200
***************
*** 24,29 ****
--- 24,30 ----
  #include "optimizer/planmain.h"
  #include "optimizer/prep.h"
  #include "optimizer/var.h"
+ #include "utils/hsearch.h"
  #include "utils/lsyscache.h"
  
  
*************** add_eq_member(EquivalenceClass *ec, Expr
*** 360,434 ****
  
  
  /*
!  * get_eclass_for_sort_expr
!  *	  Given an expression and opfamily info, find an existing equivalence
!  *	  class it is a member of; if none, build a new single-member
!  *	  EquivalenceClass for it.
!  *
!  * sortref is the SortGroupRef of the originating SortGroupClause, if any,
!  * or zero if not.	(It should never be zero if the expression is volatile!)
!  *
!  * This can be used safely both before and after EquivalenceClass merging;
!  * since it never causes merging it does not invalidate any existing ECs
!  * or PathKeys.
!  *
!  * Note: opfamilies must be chosen consistently with the way
!  * process_equivalence() would do; that is, generated from a mergejoinable
!  * equality operator.  Else we might fail to detect valid equivalences,
!  * generating poor (but not incorrect) plans.
   */
! EquivalenceClass *
! get_eclass_for_sort_expr(PlannerInfo *root,
! 						 Expr *expr,
! 						 Oid expr_datatype,
! 						 List *opfamilies,
! 						 Index sortref)
  {
! 	EquivalenceClass *newec;
! 	EquivalenceMember *newem;
  	ListCell   *lc1;
! 	MemoryContext oldcontext;
  
  	/*
! 	 * Scan through the existing EquivalenceClasses for a match
  	 */
! 	foreach(lc1, root->eq_classes)
  	{
! 		EquivalenceClass *cur_ec = (EquivalenceClass *) lfirst(lc1);
! 		ListCell   *lc2;
  
  		/*
! 		 * Never match to a volatile EC, except when we are looking at another
! 		 * reference to the same volatile SortGroupClause.
  		 */
! 		if (cur_ec->ec_has_volatile &&
! 			(sortref == 0 || sortref != cur_ec->ec_sortref))
! 			continue;
! 
! 		if (!equal(opfamilies, cur_ec->ec_opfamilies))
  			continue;
  
! 		foreach(lc2, cur_ec->ec_members)
  		{
! 			EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc2);
! 
! 			/*
! 			 * If below an outer join, don't match constants: they're not as
! 			 * constant as they look.
! 			 */
! 			if (cur_ec->ec_below_outer_join &&
! 				cur_em->em_is_const)
! 				continue;
  
! 			if (expr_datatype == cur_em->em_datatype &&
! 				equal(expr, cur_em->em_expr))
! 				return cur_ec;	/* Match! */
  		}
  	}
  
  	/*
- 	 * No match, so build a new single-member EC
- 	 *
  	 * Here, we must be sure that we construct the EC in the right context. We
  	 * can assume, however, that the passed expr is long-lived.
  	 */
--- 361,463 ----
  
  
  /*
!  * eq_classes_match - matching function for eq_classes_hash in PlannerInfo
   */
! static int
! eq_classes_match(const void *key1, const void *key2, Size keysize)
  {
! 	EquivalenceClass *ec1 = (EquivalenceClass *) key1; /* this is in the hashtable */
! 	EquivalenceClass *ec2 = (EquivalenceClass *) key2; /* this is the new matched entry */
  	ListCell   *lc1;
! 	ListCell   *lc2;
  
  	/*
! 	 * Never match to a volatile EC, except when we are looking at another
! 	 * reference to the same volatile SortGroupClause.
  	 */
! 	if (ec1->ec_has_volatile &&
! 	

Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-28 Thread Boszormenyi Zoltan
Boszormenyi Zoltan wrote:
 Boszormenyi Zoltan wrote:
  Boszormenyi Zoltan wrote:
   Heikki Linnakangas wrote:
 On 26.10.2010 18:34, Boszormenyi Zoltan wrote:
 
   
 
 thank you very much for pointing me to dynahash, here is the
 next version that finally seems to work.

 Two patches are attached, the first is the absolute minimum for
 making it work, this still has the Tree type for canon_pathkeys
 and eq_classes got the same treatment as join_rel_list/join_rel_hash
 has in the current sources: if the list grows larger than 32, a hash
 table
 is created. It seems to be enough to do this for
   get_eclass_for_sort_expr()
 only, the other users of eq_classes aren't bothered by this change.
   
 
   
 That's better, but can't you use dynahash for canon_pathkeys as well?
 
   
 
 Here's a purely dynahash solution. It's somewhat slower than
 the tree version, 0.45 vs 0.41 seconds in the cached case for the
 previously posted test case.
   
 
   
 And now in context diff, sorry for my affection towards unified diffs. :-)
   
 

 A little better version, no need for the heavy hash_any, hash_uint32
 on the lower 32 bits on pk_eclass is enough. The profiling runtime
 is now 0.42 seconds vs the previous 0.41 seconds for the tree version.

 Best regards,
 Zoltán Böszörményi
   

Btw, the top entries in the current gprof output are:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 19.05      0.08     0.08      482     0.17     0.29  add_child_rel_equivalences
 11.90      0.13     0.05  1133447     0.00     0.00  bms_is_subset
  9.52      0.17     0.04   331162     0.00     0.00  hash_search_with_hash_value
  7.14      0.20     0.03   548971     0.00     0.00  AllocSetAlloc
  4.76      0.22     0.02     2858     0.01     0.01  get_tabstat_entry
  4.76      0.24     0.02     1136     0.02     0.02  tzload

This means add_child_rel_equivalences() still takes
too much time: the previously posted test case calls this
function 482 times, i.e. it's called for almost every 10th entry
added to eq_classes. The elog() I put into this function says
that at the last call list_length(eq_classes) == 4754.

Best regards,
Zoltán Böszörményi

-- 
--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/




Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-28 Thread Heikki Linnakangas

On 28.10.2010 13:54, Boszormenyi Zoltan wrote:

A little better version, no need for the heavy hash_any, hash_uint32
on the lower 32 bits on pk_eclass is enough. The profiling runtime
is now 0.42 seconds vs the previous 0.41 seconds for the tree version.


Actually, I wonder if we could just have a separate canon_pathkeys list 
for each EquivalenceClass, instead of one big list in PlannerInfo. I'm 
not too familiar with equivalence classes and all that, but the attached 
patch at least passes the regressions. I haven't done any performance 
testing, but I would expect this to be even faster than the hashtable or 
tree implementations, and a lot simpler.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index ee2aeb0..7a12c47 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1595,7 +1595,6 @@ _outPlannerInfo(StringInfo str, PlannerInfo *node)
 	WRITE_NODE_FIELD(init_plans);
 	WRITE_NODE_FIELD(cte_plan_ids);
 	WRITE_NODE_FIELD(eq_classes);
-	WRITE_NODE_FIELD(canon_pathkeys);
 	WRITE_NODE_FIELD(left_join_clauses);
 	WRITE_NODE_FIELD(right_join_clauses);
 	WRITE_NODE_FIELD(full_join_clauses);
@@ -1692,6 +1691,7 @@ _outEquivalenceClass(StringInfo str, EquivalenceClass *node)
 	WRITE_BOOL_FIELD(ec_below_outer_join);
 	WRITE_BOOL_FIELD(ec_broken);
 	WRITE_UINT_FIELD(ec_sortref);
+	WRITE_NODE_FIELD(ec_canon_pathkeys);
 }
 
 static void
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 643d57a..d5e5c42 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -93,9 +93,10 @@ make_canonical_pathkey(PlannerInfo *root,
 	while (eclass->ec_merged)
 		eclass = eclass->ec_merged;
 
-	foreach(lc, root->canon_pathkeys)
+	foreach(lc, eclass->ec_canon_pathkeys)
 	{
 		pk = (PathKey *) lfirst(lc);
+		Assert(eclass == pk->pk_eclass);
 		if (eclass == pk->pk_eclass &&
 			opfamily == pk->pk_opfamily &&
 			strategy == pk->pk_strategy &&
@@ -110,7 +111,7 @@ make_canonical_pathkey(PlannerInfo *root,
 	oldcontext = MemoryContextSwitchTo(root->planner_cxt);
 
 	pk = makePathKey(eclass, opfamily, strategy, nulls_first);
-	root->canon_pathkeys = lappend(root->canon_pathkeys, pk);
+	eclass->ec_canon_pathkeys = lappend(eclass->ec_canon_pathkeys, pk);
 
 	MemoryContextSwitchTo(oldcontext);
 
diff --git a/src/backend/optimizer/plan/planmain.c b/src/backend/optimizer/plan/planmain.c
index fd4c6f5..5f8f817 100644
--- a/src/backend/optimizer/plan/planmain.c
+++ b/src/backend/optimizer/plan/planmain.c
@@ -117,7 +117,10 @@ query_planner(PlannerInfo *root, List *tlist,
 		 * We still are required to canonicalize any pathkeys, in case it's
 		 * something like SELECT 2+2 ORDER BY 1.
 		 */
+/* XXX: Same as below */
+#if 0
 		root->canon_pathkeys = NIL;
+#endif
 		root->query_pathkeys = canonicalize_pathkeys(root,
 													 root->query_pathkeys);
 		root->group_pathkeys = canonicalize_pathkeys(root,
@@ -145,7 +148,13 @@
 	root->join_rel_hash = NULL;
 	root->join_rel_level = NULL;
 	root->join_cur_level = 0;
-	root->canon_pathkeys = NIL;
+/* XXX
+ * Do we need to reset something here? This is just initializing otherwise
+ * uninitialized fields, right?
+ */
+#if 0
+	root->canon_pathkeys = NIL;
+#endif
 	root->left_join_clauses = NIL;
 	root->right_join_clauses = NIL;
 	root->full_join_clauses = NIL;
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 6e3d0f3..c959708 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -160,8 +160,6 @@ typedef struct PlannerInfo
 
 	List	   *eq_classes;		/* list of active EquivalenceClasses */
 
-	List	   *canon_pathkeys; /* list of canonical PathKeys */
-
 	List	   *left_join_clauses;		/* list of RestrictInfos for
 		 * mergejoinable outer join clauses
 		 * w/nonnullable var on left */
@@ -527,6 +525,7 @@ typedef struct EquivalenceClass
 	bool		ec_below_outer_join;	/* equivalence applies below an OJ */
 	bool		ec_broken;		/* failed to generate needed clauses? */
 	Index		ec_sortref;		/* originating sortclause label, or 0 */
+	List	   *ec_canon_pathkeys;
 	struct EquivalenceClass *ec_merged; /* set if merged into another EC */
 } EquivalenceClass;
 



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-28 Thread Tom Lane
Boszormenyi Zoltan z...@cybertec.at writes:
 This means add_child_rel_equivalences() still takes
 too much time, the previously posted test case calls this
 function 482 times, it's called for almost  every 10th entry
 added to eq_classes. The elog() I put into this function says
 that at the last call list_length(eq_classes) == 4754.

That seems like a ridiculously large number of ECs.  What is the
test query again?

regards, tom lane



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-28 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Actually, I wonder if we could just have a separate canon_pathkeys list 
 for each EquivalenceClass, instead of one big list in PlannerInfo. I'm 
 not too familiar with equivalence classes and all that,

Hm.  I don't like getting rid of the main canon_pathkeys list like that.
The whole point of a canonical pathkey is that there is only one, so
it seems like we need a central list.  But it might be sane for each
EC to have an additional, side list of PKs made from it.

regards, tom lane



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-28 Thread Boszormenyi Zoltan
Tom Lane wrote:
 Boszormenyi Zoltan z...@cybertec.at writes:
   
 This means add_child_rel_equivalences() still takes
 too much time, the previously posted test case calls this
 function 482 times, it's called for almost  every 10th entry
 added to eq_classes. The elog() I put into this function says
 that at the last call list_length(eq_classes) == 4754.
 

 That seems like a ridiculously large number of ECs.  What is the
 test query again?

   regards, tom lane
   

The test case is here:
http://archives.postgresql.org/message-id/4cbd9ddc.4040...@cybertec.at

create_table.sql for the main table plus childtables.sql.gz, the EXPLAIN
query is in the message body.

Basically, it's a model for a lot of data covering three months, partitioned
into 4-hour intervals for every day (roughly 90 days x 6 partitions per day,
i.e. around 540 child tables). Imagine the call list handled by a
phone company in a large country.

Best regards,
Zoltán Böszörményi

-- 
--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/




Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-28 Thread Tom Lane
Boszormenyi Zoltan z...@cybertec.at writes:
 Tom Lane wrote:
 That seems like a ridiculously large number of ECs.  What is the
 test query again?

 The test case is here:
 http://archives.postgresql.org/message-id/4cbd9ddc.4040...@cybertec.at

After poking through that a bit, I think that the real issue is in this
division of labor:

    index_pathkeys = build_index_pathkeys(root, index,
                                          ForwardScanDirection);
    useful_pathkeys = truncate_useless_pathkeys(root, rel,
                                                index_pathkeys);

If you trace what is happening here, the index pathkeys that actually
survive the usefulness test all refer to exactly ONE equivalence
class, namely the one arising from the query's order by timestamp2
clause.  All the other pathkeys that get created are immediately
discarded as being irrelevant to the query.  The reason that we end up
with so many equivalence classes is that there is nothing causing the
variables of the different child tables to be recognized as all
sort-equivalent.  Maybe that's a bug in itself, but I would argue that
the right way to make this faster is to refactor things so that we
don't generate useless equivalence classes in the first place, or
at least don't keep them around in the planner's lists once we realize
they're useless.

I like Heikki's hack to cut down on searching in make_canonical_pathkey,
but I think that complicating the data structure searching beyond that
is just a band-aid.  Reasonably-sized queries shouldn't contain very
many equivalence classes: they should only come from equality clauses
or sort conditions that appeared in the query text.  Therefore, there
also shouldn't be all that many distinct pathkeys.

regards, tom lane



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-27 Thread Heikki Linnakangas

On 26.10.2010 18:34, Boszormenyi Zoltan wrote:

thank you very much for pointing me to dynahash, here is the
next version that finally seems to work.

Two patches are attached, the first is the absolute minimum for
making it work, this still has the Tree type for canon_pathkeys
and eq_classes got the same treatment as join_rel_list/join_rel_hash
has in the current sources: if the list grows larger than 32, a hash table
is created. It seems to be enough to do this for
  get_eclass_for_sort_expr()
only, the other users of eq_classes aren't bothered by this change.


That's better, but can't you use dynahash for canon_pathkeys as well?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-19 Thread Boszormenyi Zoltan
Hi,

Boszormenyi Zoltan wrote:
 There is one problem with the patch, it doesn't survive
 make check. One of the regression tests fails the
 Assert(!cur_em->em_is_child);
 line in process_equivalence() in equivclass.c, but I couldn't
 yet find it what causes it. The why is vaguely clear:
 something modifies the ec_members list in the eq_classes'
 tree nodes while the node is in the tree. Because I didn't find
 the offender yet, I couldn't fix it, so I send this patch as is.
 I'll try to fix it if someone doesn't beat me in fixing it. :)
   

I am a little closer to this bug, maybe I even found the cause of it.
I found that process_equivalence() is called from:

path/equivclass.c:
reconsider_outer_join_clause()
reconsider_full_join_clause()
plan/initsplan.c:
distribute_qual_to_rels()

The problem is with the two functions in path/equivclass.c:
process_equivalence() and those two functions all walk
the tree, and the current RBTree code can only deal with
one walk at a time. We need to push/pop the iterator state
to be able to serve more than one walker.

Also, we need to split out the tree modifying part from
process_equivalence() somehow, as the tree walking
also cannot deal with node additions and deletions.

Best regards,
Zoltán Böszörményi

-- 
--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/




Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-10-19 Thread Tom Lane
Boszormenyi Zoltan z...@cybertec.at writes:
 The problem is with the two functions in path/equivclass.c:
 process_equivalence() and those two functions all walk
 the tree, and the current RBTree code can only deal with
 one walk at a time. We need to push/pop the iterator state
 to be able to serve more than one walker.

Good luck with that --- the iteration state is embedded in the rbtree.

 Also, we need to split out the tree modifying part from
 process_equivalence() somehow, as the tree walking
 also cannot deal with node additions and deletions.

That's not happening either.  Maybe you need to think of some other data
structure to use.  Hashtable maybe?  dynahash.c at least has reasonably
well-defined limitations in this area.

regards, tom lane



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-08 Thread Robert Haas
On Tue, Sep 7, 2010 at 2:14 PM, Boszormenyi Zoltan z...@cybertec.at wrote:
 Hi,

 Robert Haas wrote:
 2010/9/3 PostgreSQL - Hans-Jürgen Schönig postg...@cybertec.at:

 i tried this one with 5000 unindexed tables (just one col):

 test=# \timing
 Timing is on.
 test=# prepare x(int4) AS select * from t_data order by id desc;
 PREPARE
 Time: 361.552 ms

 you will see similar or higher runtimes in case of 500 partitions and a 
 handful of indexes.


 I'd like to see (1) a script to reproduce your test environment (as
 Stephen also requested) and (2) gprof or oprofile results.


 attached are the test scripts, create_tables.sql and childtables.sql.
 The following query takes 4.7 seconds according to psql with \timing on:
 EXPLAIN SELECT * FROM qdrs
 WHERE streamstart BETWEEN '2010-04-06' AND '2010-06-25'
 ORDER BY streamhash;

Neat.  Have you checked what effect this has on memory consumption?

Also, don't forget to add it to
https://commitfest.postgresql.org/action/commitfest_view/open

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-08 Thread Hans-Jürgen Schönig
hello ...

no, we have not checked memory consumption.
there is still some stuff left to optimize away - it seems we are getting
close to O(n^2) somewhere.
equal is called really often in our sample case as well:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds     calls   s/call   s/call  name
 18.87      0.80     0.80      4812     0.00     0.00  make_canonical_pathkey
 15.33      1.45     0.65      4811     0.00     0.00  get_eclass_for_sort_expr
 14.15      2.05     0.60   8342410     0.00     0.00  equal
  6.13      2.31     0.26    229172     0.00     0.00  SearchCatCache
  3.66      2.47     0.16   5788835     0.00     0.00  _equalList
  3.07      2.60     0.13   1450043     0.00     0.00  hash_search_with_hash_value
  2.36      2.70     0.10   2272545     0.00     0.00  AllocSetAlloc
  2.12      2.79     0.09    811460     0.00     0.00  hash_any
  1.89      2.87     0.08   3014983     0.00     0.00  list_head
  1.89      2.95     0.08    574526     0.00     0.00  _bt_compare
  1.77      3.02     0.08  11577670     0.00     0.00  list_head
  1.42      3.08     0.06      1136     0.00     0.00  tzload
  0.94      3.12     0.04   2992373     0.00     0.00  AllocSetFreeIndex


look at the number of calls ...
equal is scary ...

make_canonical_pathkey is fixed it seems.
get_eclass_for_sort_expr seems a little more complex to fix.

great you like it ...

regards,

hans



On Sep 8, 2010, at 3:54 PM, Robert Haas wrote:

 On Tue, Sep 7, 2010 at 2:14 PM, Boszormenyi Zoltan z...@cybertec.at wrote:
 Hi,
 
  Robert Haas wrote:
 2010/9/3 PostgreSQL - Hans-Jürgen Schönig postg...@cybertec.at:
 
 i tried this one with 5000 unindexed tables (just one col):
 
 test=# \timing
 Timing is on.
 test=# prepare x(int4) AS select * from t_data order by id desc;
 PREPARE
 Time: 361.552 ms
 
 you will see similar or higher runtimes in case of 500 partitions and a 
 handful of indexes.
 
 
 I'd like to see (1) a script to reproduce your test environment (as
 Stephen also requested) and (2) gprof or oprofile results.
 
 
 attached are the test scripts, create_tables.sql and childtables.sql.
 The following query takes 4.7 seconds according to psql with \timing on:
 EXPLAIN SELECT * FROM qdrs
 WHERE streamstart BETWEEN '2010-04-06' AND '2010-06-25'
 ORDER BY streamhash;
 
 Neat.  Have you checked what effect this has on memory consumption?
 
 Also, don't forget to add it to
 https://commitfest.postgresql.org/action/commitfest_view/open
 
 -- 
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise Postgres Company
 
 


--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-08 Thread Stephen Frost
* Hans-Jürgen Schönig (postg...@cybertec.at) wrote:
 no, we have not checked memory consumption.
 there is still some stuff left to optimize away - it seems we are going close 
 to O(n^2) somewhere.
 equal is called really often in our sample case as well:

Did the mail with the scripts, etc, get hung up due to size or
something..?  I didn't see it on the mailing list nor in the archives..
If so, could you post them somewhere so others could look..?

Thanks,

Stephen




Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-08 Thread Hans-Jürgen Schönig
here is the patch again.
we accidentally attached a wrong test file to the original posting so it grew
too big. we had to revoke it from the moderator (this happens when you code from
8am to 10pm).
here is just the patch - it is nice and small.

you can easily test it by making yourself a nice parent table, many subtables
(hundreds or thousands) and a decent number of indexes per partition.
then run PREPARE with \timing to see what happens.
you should get scary planning times. the more potential indexes and tables, the
scarier it will be.

using this wonderful RB tree the time for this function call goes down to 
basically zero.
i hope this is something which is useful to some folks out there.

many thanks,

hans






canon-pathkeys-as-rbtree-3-ctxdiff.patch
Description: Binary data



On Sep 8, 2010, at 4:18 PM, Stephen Frost wrote:

 * Hans-Jürgen Schönig (postg...@cybertec.at) wrote:
 no, we have not checked memory consumption.
 there is still some stuff left to optimize away - it seems we are going 
 close to O(n^2) somewhere.
 equal is called really often in our sample case as well:
 
 Did the mail with the scripts, etc, get hung up due to size or
 something..?  I didn't see it on the mailing list nor in the archives..
 If so, could you post them somewhere so others could look..?
 
   Thanks,
 
   Stephen


--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-08 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 Neat.  Have you checked what effect this has on memory consumption?
 
 Also, don't forget to add it to
 https://commitfest.postgresql.org/action/commitfest_view/open

Would be good to have the patch updated to be against HEAD before
posting to the commitfest.

Thanks,

Stephen




Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-08 Thread Hans-Jürgen Schönig
On Sep 8, 2010, at 4:57 PM, Stephen Frost wrote:

 * Robert Haas (robertmh...@gmail.com) wrote:
 Neat.  Have you checked what effect this has on memory consumption?
 
 Also, don't forget to add it to
 https://commitfest.postgresql.org/action/commitfest_view/open
 
 Would be good to have the patch updated to be against HEAD before
 posting to the commitfest.
 
   Thanks,
 
   Stephen



we will definitely provide something which is for HEAD.
but, it seems the problem we are looking at is not sufficiently fixed yet.
in our case we shaved off some 18% of planning time or so - looking at the
other top 2 functions i got the feeling that more can be done to reduce this. i
guess we have to attack this as well.

regards,

hans


--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-08 Thread Stephen Frost
* Hans-Jürgen Schönig (postg...@cybertec.at) wrote:
 but, it seems the problem we are looking is not sufficiently fixed yet.
 in our case we shaved off some 18% of planning time or so - looking at the 
 other top 2 functions i got the feeling that more can be done to reduce this. 
 i guess we have to attack this as well.

An 18% improvement is certainly nice, provided it doesn't slow down or
break other things..  I'm looking through the patch now actually and
I'm not really happy with the naming, comments, or some of the code
flow, but I think the concept looks reasonable.

Thanks,

Stephen




Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-08 Thread Robert Haas
2010/9/8 Hans-Jürgen Schönig postg...@cybertec.at:
 On Sep 8, 2010, at 4:57 PM, Stephen Frost wrote:

 * Robert Haas (robertmh...@gmail.com) wrote:
 Neat.  Have you checked what effect this has on memory consumption?

 Also, don't forget to add it to
 https://commitfest.postgresql.org/action/commitfest_view/open

 Would be good to have the patch updated to be against HEAD before
 posting to the commitfest.


 we will definitely provide something which is for HEAD.
 but, it seems the problem we are looking is not sufficiently fixed yet.
 in our case we shaved off some 18% of planning time or so - looking at the 
 other top 2 functions i got the feeling that more can be done to reduce this. 
 i guess we have to attack this as well.

Just remember that four small patches (say) are apt to get committed
faster than one big one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-08 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 2010/9/8 Hans-Jürgen Schönig postg...@cybertec.at:
  but, it seems the problem we are looking is not sufficiently fixed yet.
  in our case we shaved off some 18% of planning time or so - looking at the 
  other top 2 functions i got the feeling that more can be done to reduce 
  this. i guess we have to attack this as well.
 
 Just remember that four small patches (say) are apt to get committed
 faster than one big one.

Indeed, but code like this makes me wonder if this is really working the
way it's supposed to:

+   val1 = (long)pk_left->pk_eclass;
+   val2 = (long)pk_right->pk_eclass;
+ 
+   if (val1 < val2)
+   return -1;
+   else if (val1 > val2)
+   return 1;

Have you compared how big the tree gets to the size of the list it's
supposed to be replacing..?

Thanks,

Stephen




Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-08 Thread Alvaro Herrera
Excerpts from Stephen Frost's message of Wed Sep 08 11:26:55 -0400 2010:
 * Hans-Jürgen Schönig (postg...@cybertec.at) wrote:
  but, it seems the problem we are looking is not sufficiently fixed yet.
  in our case we shaved off some 18% of planning time or so - looking at the 
  other top 2 functions i got the feeling that more can be done to reduce 
  this. i guess we have to attack this as well.
 
 An 18% increase is certainly nice, provided it doesn't slow down or
 break other things..  I'm looking through the patch now actually and
 I'm not really happy with the naming, comments, or some of the code
 flow, but I think the concept looks reasonable.

I don't understand the layering between pg_tree and rbtree.  Why does it
exist at all?  At first I thought this was another implementation of
rbtrees, but then I noticed it sits on top of it.  Is this really
necessary?

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-08 Thread Tom Lane
Hans-Jürgen Schönig postg...@cybertec.at writes:
 here is the patch again.

This patch seems remarkably documentation-free.  What is it trying to
accomplish and what is it doing to the planner data structures?
(Which do have documentation BTW.)  Also, what will it do to runtime
in normal cases where the pathkeys list isn't that long?

regards, tom lane



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-08 Thread Boszormenyi Zoltan
Stephen Frost wrote:
 * Robert Haas (robertmh...@gmail.com) wrote:
   
 2010/9/8 Hans-Jürgen Schönig postg...@cybertec.at:
 
 but, it seems the problem we are looking is not sufficiently fixed yet.
 in our case we shaved off some 18% of planning time or so - looking at the 
 other top 2 functions i got the feeling that more can be done to reduce 
 this. i guess we have to attack this as well.
   
 Just remember that four small patches (say) are apt to get committed
 faster than one big one.
 

 Indeed, but code like this makes me wonder if this is really working the
 way it's supposed to:

 +   val1 = (long)pk_left->pk_eclass;
 +   val2 = (long)pk_right->pk_eclass;
 + 
 +   if (val1 < val2)
 +   return -1;
 +   else if (val1 > val2)
 +   return 1;
   

The original code checked for pointers being equal among
other conditions. This was an (almost) straight conversion
to a comparison function for rbtree. Do you mean casting
the pointer to long? Yes, e.g. on 64-bit Windows it wouldn't
work. Back to plain pointer comparison.

 Have you compared how big the tree gets to the size of the list it's
 supposed to be replacing..?
   

No, but I think it's obvious. Now we have one TreeCell
where we had one ListCell.

Best regards,
Zoltán Böszörményi

-- 
--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/




Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-08 Thread Boszormenyi Zoltan
Alvaro Herrera wrote:
 Excerpts from Stephen Frost's message of Wed Sep 08 11:26:55 -0400 2010:
   
 * Hans-Jürgen Schönig (postg...@cybertec.at) wrote:
 
 but, it seems the problem we are looking is not sufficiently fixed yet.
 in our case we shaved off some 18% of planning time or so - looking at the 
 other top 2 functions i got the feeling that more can be done to reduce 
 this. i guess we have to attack this as well.
   
 An 18% increase is certainly nice, provided it doesn't slow down or
 break other things..  I'm looking through the patch now actually and
 I'm not really happy with the naming, comments, or some of the code
 flow, but I think the concept looks reasonable.
 

 I don't understand the layering between pg_tree and rbtree.  Why does it
 exist at all?  At first I thought this was another implementation of
 rbtrees, but then I noticed it sits on top of it.  Is this really
 necessary?
   

No, if it's acceptable to omit PlannerInfo from outfuncs.c.
Or at least its canon_pathkeys member. Otherwise yes, it's
necessary. We need to store (Node *) values in a quickly searchable way.

This applies to anything else that may need to be converted
from list to tree to decrease planning time. Like ec_members
in EquivalenceClass.

Best regards,
Zoltán Böszörményi

-- 
--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/




Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-08 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 Indeed, but code like this makes me wonder if this is really working the
 way it's supposed to:

 +   val1 = (long)pk_left->pk_eclass;
 +   val2 = (long)pk_right->pk_eclass;

Ugh.  Casting a pointer to long?  We do have portable ways to do what
this is trying to do, but that is not one.  (For example, this is
guaranteed to misbehave on 64-bit Windows.)

Offhand I think PointerGetDatum might be the best way.
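A portable shape for that comparison could look like the sketch below; pathkey_ptr_cmp and the use of uintptr_t are my illustration, not the patch (and the actual code would presumably use Datum via PointerGetDatum, as suggested):

```c
#include <stdint.h>

/* Three-way pointer comparison without casting to long.  long is only
 * 32 bits on LLP64 platforms (64-bit Windows), so that cast truncates
 * there; uintptr_t can hold any object pointer by definition. */
static int
pathkey_ptr_cmp(const void *left, const void *right)
{
    uintptr_t val1 = (uintptr_t) left;
    uintptr_t val2 = (uintptr_t) right;

    if (val1 < val2)
        return -1;
    if (val1 > val2)
        return 1;
    return 0;
}
```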

regards, tom lane



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-08 Thread Tom Lane
Boszormenyi Zoltan z...@cybertec.at writes:
 This applies to anything else that may need to be converted
 from list to tree to decrease planning time. Like ec_members
 in EquivalenceClass.

AFAIR, canonical pathkeys are the *only* thing in the planner where pure
pointer equality is interesting.  So I doubt this hack is of any use for
EquivalenceClass, even if the lists were likely to be long which they
aren't.

regards, tom lane



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-08 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 I'm not really happy with the naming, comments, or some of the code
 flow, but I think the concept looks reasonable.

There seems to be a lot of unnecessary palloc/pfree traffic in this
implementation.  Getting rid of that might help the speed.

regards, tom lane



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-08 Thread Boszormenyi Zoltan
Tom Lane wrote:
 Boszormenyi Zoltan z...@cybertec.at writes:
   
 This applies to anything else that may need to be converted
 from list to tree to decrease planning time. Like ec_members
 in EquivalenceClass.
 

 AFAIR, canonical pathkeys are the *only* thing in the planner where pure
 pointer equality is interesting.  So I doubt this hack is of any use for
 EquivalenceClass, even if the lists were likely to be long which they
 aren't.

   regards, tom lane
   

No, for EquivalenceClass->ec_members, I need to do something
fancier, like implement compare(Node *, Node *) and use that
instead of equal(Node *, Node *)... Something like nodeToString()
on both Node * and strcmp() the resulting strings.

-- 
--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/




Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-08 Thread Boszormenyi Zoltan
Tom Lane wrote:
 Stephen Frost sfr...@snowman.net writes:
   
 I'm not really happy with the naming, comments, or some of the code
 flow, but I think the concept looks reasonable.
 

 There seems to be a lot of unnecessary palloc/pfree traffic in this
 implementation.  Getting rid of that might help the speed.

   regards, tom lane
   

This was a first WIP implementation, I will look into it, thanks.

-- 
--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/




Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-08 Thread Tom Lane
Boszormenyi Zoltan z...@cybertec.at writes:
 Tom Lane wrote:
 AFAIR, canonical pathkeys are the *only* thing in the planner where pure
 pointer equality is interesting.  So I doubt this hack is of any use for
 EquivalenceClass, even if the lists were likely to be long which they
 aren't.

 No, for EquivalenceClass->ec_members, I need to do something
 funnier, like implement compare(Node *, Node *) and use that
 instead of equal(Node *, Node *)... Something like nodeToString()
 on both Node * and strcmp() the resulting strings.

Well, (a) that doesn't work (hint: there are fields in nodes that are
intentionally ignored by equal()), and (b) I still don't believe that
there's an actual bottleneck there.  ECs generally aren't very big.

regards, tom lane



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-08 Thread Boszormenyi Zoltan
Tom Lane wrote:
 Boszormenyi Zoltan z...@cybertec.at writes:
   
 Tom Lane wrote:
 
 AFAIR, canonical pathkeys are the *only* thing in the planner where pure
 pointer equality is interesting.  So I doubt this hack is of any use for
 EquivalenceClass, even if the lists were likely to be long which they
 aren't.
   

   
 No, for EquivalenceClass->ec_members, I need to do something
 funnier, like implement compare(Node *, Node *) and use that
 instead of equal(Node *, Node *)... Something like nodeToString()
 on both Node * and strcmp() the resulting strings.
 

 Well, (a) that doesn't work (hint: there are fields in nodes that are
 intentionally ignored by equal()),

Then this compare() needs to work like equal(), i.e. it can
ignore the same fields that equal() ignores, too.
nodeToString() would need more space anyway, and comparing
non-equal Nodes can return the nonzero result faster.
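To illustrate point (a): equal() deliberately skips some fields (e.g. parse locations), so a tree comparator built to mirror equal() must skip the same fields, which a nodeToString()/strcmp() approach would not. A toy sketch with a made-up node type (DemoVar is not a real PostgreSQL struct):

```c
/* Toy node; 'location' plays the role of the fields that equal()
 * intentionally ignores. */
typedef struct DemoVar
{
    int varno;
    int varattno;
    int location;       /* must be ignored, as equal() ignores it */
} DemoVar;

/* Three-way comparison usable as a tree comparator: orders by the
 * semantically significant fields only. */
static int
demo_var_cmp(const DemoVar *a, const DemoVar *b)
{
    if (a->varno != b->varno)
        return (a->varno < b->varno) ? -1 : 1;
    if (a->varattno != b->varattno)
        return (a->varattno < b->varattno) ? -1 : 1;
    return 0;           /* location intentionally not consulted */
}
```

Two nodes differing only in location then compare as equal, matching equal()'s behavior, while their serialized strings would differ.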

  and (b) I still don't believe that
 there's an actual bottleneck there.  ECs generally aren't very big.
   

Actually, PlannerInfo->eq_classes needs to be a tree somehow;
the comparator function is not yet final in my head.

equal() is called over 8 million times with or without our patch:

without

  %   cumulative   self              self     total
 time   seconds   seconds     calls   s/call   s/call  name
 18.87      0.80     0.80       4812     0.00     0.00  make_canonical_pathkey
 15.33      1.45     0.65       4811     0.00     0.00  get_eclass_for_sort_expr
 14.15      2.05     0.60    8342410     0.00     0.00  equal
  6.13      2.31     0.26     229172     0.00     0.00  SearchCatCache
  3.66      2.47     0.16    5788835     0.00     0.00  _equalList
  3.07      2.60     0.13    1450043     0.00     0.00  hash_search_with_hash_value
  2.36      2.70     0.10    2272545     0.00     0.00  AllocSetAlloc
  2.12      2.79     0.09     811460     0.00     0.00  hash_any
  1.89      2.87     0.08    3014983     0.00     0.00  list_head
  1.89      2.95     0.08     574526     0.00     0.00  _bt_compare
  1.77      3.02     0.08   11577670     0.00     0.00  list_head
  1.42      3.08     0.06       1136     0.00     0.00  tzload
  0.94      3.12     0.04    2992373     0.00     0.00  AllocSetFreeIndex
  0.94      3.16     0.04      91427     0.00     0.00  _bt_checkkeys
...

with

  %   cumulative   self              self     total
 time   seconds   seconds     calls   s/call   s/call  name
 24.51      0.88     0.88       4811     0.00     0.00  get_eclass_for_sort_expr
 14.21      1.39     0.51    8342410     0.00     0.00  equal
  8.22      1.69     0.30    5788835     0.00     0.00  _equalList
  5.29      1.88     0.19     229172     0.00     0.00  SearchCatCache
  2.51      1.97     0.09       1136     0.00     0.00  tzload
  2.23      2.05     0.08    3014983     0.00     0.00  list_head
  2.23      2.13     0.08    2283130     0.00     0.00  AllocSetAlloc
  2.09      2.20     0.08     811547     0.00     0.00  hash_any
  2.09      2.28     0.08   11577670     0.00     0.00  list_head
  1.95      2.35     0.07    1450180     0.00     0.00  hash_search_with_hash_value
  1.39      2.40     0.05     640690     0.00     0.00  _bt_compare
  1.39      2.45     0.05     157944     0.00     0.00  LockAcquireExtended
  1.39      2.50     0.05      11164     0.00     0.00  AllocSetCheck
  1.11      2.54     0.04    3010547     0.00     0.00  AllocSetFreeIndex
  1.11      2.58     0.04     874975     0.00     0.00  AllocSetFree
  1.11      2.62     0.04      66211     0.00     0.00  heap_form_tuple
  0.84      2.65     0.03     888128     0.00     0.00  LWLockRelease
...

The number of calls is the same for equal and _equalList
in both cases, as you can see.

And if you compare the number of AllocSetAlloc calls,
it's actually smaller with our patch, so it's not that the
conversion to Tree caused this.

Best regards,
Zoltán Böszörményi

-- 
--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/




Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-08 Thread Tom Lane
Boszormenyi Zoltan z...@cybertec.at writes:
 equal() is called over 8 million times with or without our patch:

From where, though?  You've provided not a shred of evidence that
searching large ec_member lists is the problem.

Also, did the test case you're using ever make it to the list?

regards, tom lane



[HACKERS] plan time of MASSIVE partitioning ...

2010-09-03 Thread PostgreSQL - Hans-Jürgen Schönig
hello everybody,

we came across an issue which turned out to be more serious than previously 
expected.
imagine a system with, say, 1000 partitions (heavily indexed) or so. the time 
taken by the planner is already fairly heavy in this case.

i tried this one with 5000 unindexed tables (just one col):

test=# \timing
Timing is on.
test=# prepare x(int4) AS select * from t_data order by id desc;
PREPARE
Time: 361.552 ms

you will see similar or higher runtimes in case of 500 partitions and a handful 
of indexes.

does anybody see a potential way to do a shortcut through the planner?
a prepared query is no solution here as constraint exclusion would be dead in 
this case (making the entire thing an even bigger drama).

did anybody think of a solution to this problem?
or more precisely: can there be a solution to this problem?

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de




Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-03 Thread Stephen Frost
* PostgreSQL - Hans-Jürgen Schönig (postg...@cybertec.at) wrote:
 did anybody think of a solution to this problem.
 or more precisely: can there be a solution to this problem?

Please post to the correct list (-performance) and provide information
like PG version, postgresql.conf, the actual table definition, the
resulting query plan, etc, etc...

Thanks,

Stephen




Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-03 Thread PostgreSQL - Hans-Jürgen Schönig

On Sep 3, 2010, at 2:04 PM, Stephen Frost wrote:

 * PostgreSQL - Hans-Jürgen Schönig (postg...@cybertec.at) wrote:
 did anybody think of a solution to this problem.
 or more precisely: can there be a solution to this problem?
 
 Please post to the correct list (-performance) and provide information
 like PG version, postgresql.conf, the actual table definition, the
 resulting query plan, etc, etc...
 
   Thanks,
 
   Stephen



hello stephen,

this seems like more a developer question to me than a -performance one.
it is not related to the table structure at all - it is basically an issue with 
incredibly large inheritance lists.
it applies to postgres 9 and most likely to everything before.
postgresql.conf is not relevant at all at this point.

the plan is pretty fine.
the question is rather: does anybody see a chance to handle such lists more 
efficiently inside postgres?
also, it is not the point whether my data structure is sane or not. it is really 
more generic - namely a shortcut for this case inside the planning process.

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de




Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-03 Thread Stephen Frost
* PostgreSQL - Hans-Jürgen Schönig (postg...@cybertec.at) wrote:
 this seems like more a developer question to me than a pre performance one.
 it is not related to the table structure at all - it is basically an issue 
 with incredibly large inheritance lists.
 it applies to postgres 9 and most likely to everything before.
 postgresql.conf is not relevant at all at this point.

Really?  What's constraint_exclusion set to?  Is GEQO getting used?
What are the GEQO parameters set to?  Do you have any CHECK constraints
on the tables?

If you want someone else to build a test case and start looking into it,
it's best to not make them have to guess at what you've done.

 the plan is pretty fine.
 the question is rather: does anybody see a chance to handle such lists more 
 efficiently inside postgres?
 also, it is not the point if my data structure is sane or not. it is really 
 more generic - namely a shortcut for this case inside the planing process.

Coming up with cases where PG doesn't perform well in a completely
contrived unrealistic environment isn't likely to impress anyone to
do anything.

A small (but complete..) test case which mimics a real world environment
and real world problem would go a lot farther.  I can certainly believe
that people out there have partitions set up with lots of tables and
that it will likely grow- but they probably also have CHECK constraints,
have tweaked what constraint_exclusion is set to, have adjusted their
work_mem and related settings, maybe tweaked some planner GUCs, etc,
etc.

This is an area I'm actually interested in and curious about, I'd rather
work together on it than be told that the questions I'm asking aren't
relevant.

Thanks,

Stephen




Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-03 Thread Robert Haas
2010/9/3 PostgreSQL - Hans-Jürgen Schönig postg...@cybertec.at:
 i tried this one with 5000 unindexed tables (just one col):

 test=# \timing
 Timing is on.
 test=# prepare x(int4) AS select * from t_data order by id desc;
 PREPARE
 Time: 361.552 ms

 you will see similar or higher runtimes in case of 500 partitions and a 
 handful of indexes.

I'd like to see (1) a script to reproduce your test environment (as
Stephen also requested) and (2) gprof or oprofile results.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-03 Thread Tom Lane
PostgreSQL - Hans-Jürgen Schönig postg...@cybertec.at
writes:
 imagine a system with, say, 1000 partitions (heavily indexed) or so. the time 
 taken by the planner is already fairly heavy in this case.

As the fine manual points out, the current scheme for managing
partitioned tables isn't intended to scale past a few dozen partitions.

I think we'll be able to do better when we have an explicit
representation of partitioning, since then the planner won't
have to expend large amounts of effort reverse-engineering knowledge
about how an inheritance tree is partitioned.  Before that happens,
it's not really worth the trouble to worry about such cases.

regards, tom lane



Re: [HACKERS] plan time of MASSIVE partitioning ...

2010-09-03 Thread PostgreSQL - Hans-Jürgen Schönig

On Sep 3, 2010, at 4:40 PM, Tom Lane wrote:

 PostgreSQL - Hans-Jürgen Schönig postg...@cybertec.at 
 writes:
 imagine a system with, say, 1000 partitions (heavily indexed) or so. the 
 time taken by the planner is already fairly heavy in this case.
 
 As the fine manual points out, the current scheme for managing
 partitioned tables isn't intended to scale past a few dozen partitions.
 
 I think we'll be able to do better when we have an explicit
 representation of partitioning, since then the planner won't
 have to expend large amounts of effort reverse-engineering knowledge
 about how an inheritance tree is partitioned.  Before that happens,
 it's not really worth the trouble to worry about such cases.
 
   regards, tom lane
 


thank you ... - the manual is clear here but we wanted to see if there is some 
reasonably low-hanging fruit to get around this.
it is no solution but at least a clear statement ...

many thanks,

hans


--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de

