Re: [PERFORM] How to force Nested Loop plan?

2003-09-01 Thread Tom Lane
Rob Nagler [EMAIL PROTECTED] writes:
> Are there plans for explicit hints to the planner?

Personally, I'm philosophically opposed to planner hints; see previous
discussions in the archives.

regards, tom lane


Re: [PERFORM] How to force Nested Loop plan?

2003-09-01 Thread Tom Lane
Ron Johnson [EMAIL PROTECTED] writes:
> How about (if you don't already do it) ranked (or approximately
> ranked) b-tree indexes, where each node also stores the (approximate)
> count of tuple pointers under it?
> This way, the planner would know whether or how skewed a tree is,
> and (approximately) how many tuples a given WHERE predicate resolves
> to.

Why is that better than our existing implementation of column statistics?
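
The statistics in question live in the pg_stats system view; a minimal
look, with this thread's aa_t standing in for any real table:

    SELECT attname, n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'aa_t';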

regards, tom lane


Re: [PERFORM] How to force Nested Loop plan?

2003-08-30 Thread Tom Lane
Rob Nagler [EMAIL PROTECTED] writes:
> I'm trying to understand how I can get the planner to always do the
> right thing with this query:
>
> SELECT
>   aa_t.min_date_time
> FROM
>   aa_t
>   , bb_t
>   , cc_t
> WHERE bb_t.bb_id = aa_t.bb_id
>   AND aa_t.realm_id = cc_t.realm_id
>   AND aa_t.server_id = 21
> ORDER BY aa_t.min_date_time desc
> LIMIT 1
> OFFSET 674
>
> ->  Index Scan Backward using aa_t20 on aa_t  (cost=0.00..76738.77 rows=3454 width=46) (actual time=0.10..31.30 rows=676 loops=1)
>       Filter: (server_id = 21::numeric)

The reason the planner does not much like this plan is that it's
estimating that quite a lot of rows will have to be hit in min_date_time
order before it finds enough rows with server_id = 21.  Thus the high
cost estimate for the above step.

I suspect that the reason you like this plan is that there's actually
substantial correlation between server_id and min_date_time, such that
the required rows are found quickly.  Before trying to force the planner
into what you consider an optimal plan, you had better ask yourself
whether you can expect that correlation to hold up in the future.
If not, your plan could become pessimal pretty quickly.

I'd suggest creating a double-column index:

create index aa_ti on aa_t(server_id, min_date_time);

and altering the query to read

ORDER BY aa_t.server_id DESC, aa_t.min_date_time DESC

(you need this kluge to make sure the planner recognizes that the new
index matches the ORDER BY request).  Then you should get a plan with
a much smaller cost coefficient for this step.
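
Putting the two together, a sketch of the full revised query (same
tables and offset as in your original post):

    create index aa_ti on aa_t(server_id, min_date_time);

    SELECT aa_t.min_date_time
    FROM aa_t, bb_t, cc_t
    WHERE bb_t.bb_id = aa_t.bb_id
      AND aa_t.realm_id = cc_t.realm_id
      AND aa_t.server_id = 21
    ORDER BY aa_t.server_id DESC, aa_t.min_date_time DESC
    LIMIT 1
    OFFSET 674;

Since server_id is pinned to a single value, the extra leading sort
column cannot change the output order; it is only there so the planner
sees the index as matching the ORDER BY.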

regards, tom lane

PS: does server_id really need to be NUMERIC?  Why not integer, or at
worst bigint?


Re: [PERFORM] How to force Nested Loop plan?

2003-08-30 Thread Rob Nagler
Tom Lane writes:
> The reason the planner does not much like this plan is that it's
> estimating that quite a lot of rows will have to be hit in min_date_time
> order before it finds enough rows with server_id = 21.  Thus the high
> cost estimate for the above step.

Thanks for the speedy and useful reply!  More questions follow. :)

Very interesting.  How does it know "quite a lot"?  Is there something
I can do to get the planner to analyze the data better?
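
Would something along these lines be the right knob?  (I'm guessing at
a useful target value here.)

    ALTER TABLE aa_t ALTER COLUMN server_id SET STATISTICS 100;
    ANALYZE aa_t;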

> I suspect that the reason you like this plan is that there's actually
> substantial correlation between server_id and min_date_time, such that
> the required rows are found quickly.  Before trying to force the planner
> into what you consider an optimal plan, you had better ask yourself
> whether you can expect that correlation to hold up in the future.
> If not, your plan could become pessimal pretty quickly.

The correlation holds.  min_date_time increases over time as records
are inserted.  server_id is uniformly distributed over time.  There's
no randomness.  There is at least one server_id = 21 record for every
value of min_date_time.  21 is a special server_id containing aggregate
(denormalized) data for the other servers.  I thought about putting it
in a separate table, but this would complicate the code as the data is
identical to the non-aggregated case.

Do you have any suggestions for organizing the data/query now that you
know this?

> I'd suggest creating a double-column index:

Thanks.  I'll try this.

I'm a very big fan of declarative programming.  However, there's a
danger in declarative programming when the interpreter isn't smart
enough.  When I add this index, I will slow down inserts (about
20K/day) and increase data size (this is the second largest table in
the database).  Moreover, if the planner is improved, I'd have to fix
my code, delete the index, etc.

Is there a way of giving the planner direct hints as in Oracle?  They
can be ignored when the optimizer is improved, just as register is
ignored by C compilers nowadays.

Adding the extra index and ORDER BY is also not easy in our case.  The
query is dynamically generated.  I can force the query ORDER BY to be
whatever I want, but I would lose the ability to do interesting
things, like the automatic generation of ORDER BY when someone clicks
on a column header in the application.  Indeed there are times when
people want to sort on other columns in the query. I reduced the
problem to the salient details for my post to this board.  What if the
ORDER BY was:

ORDER BY aa_t.server_id DESC, cc_t.name ASC

Would the planner do the right thing?

> PS: does server_id really need to be NUMERIC?  Why not integer, or at
> worst bigint?

It is a NUMERIC(18).  It could be a bigint.  What would be the change
in performance of this query if we changed it to bigint?

BTW, my customer is probably going to be switching to Oracle.  This
particular query has been one of the reasons.  Maybe this change will
help us stay with Postgres.

Thanks,
Rob




Re: [PERFORM] How to force Nested Loop plan?

2003-08-30 Thread Ron Johnson
On Sat, 2003-08-30 at 10:47, Rob Nagler wrote:
> Tom Lane writes:
[snip]
> enough.  When I add this index, I will slow down inserts (about
> 20K/day) and increase data size (this is the second largest table in
[snip]

Since I gather that this is a web-site, can we presume that the
inserts are clumped into an 8-hour range?  20,000/8 = 2,500/hour, which
is 41.67/minute.  If you can't do 0.69 inserts/second, something
is wrong, and it ain't hardware, and it ain't Postgresql...

> > PS: does server_id really need to be NUMERIC?  Why not integer, or at
> > worst bigint?
>
> It is a NUMERIC(18).  It could be a bigint.  What would be the change
> in performance of this query if we changed it to bigint?

http://www.postgresql.org/docs/7.3/static/datatype.html#DATATYPE-INT
http://www.postgresql.org/docs/7.3/static/datatype.html#DATATYPE-NUMERIC-DECIMAL

Scalars are faster than arbitrary precision types.  Small (32 bit)
scalars are faster than big (64 bit) scalars on x86 h/w.
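
An easy way to measure the gap on your own data (scratch tables with
made-up names; load both identically before comparing):

    CREATE TABLE t_numeric (id numeric(18));
    CREATE TABLE t_bigint  (id bigint);
    -- after populating both with the same rows:
    EXPLAIN ANALYZE SELECT count(*) FROM t_numeric WHERE id = 21;
    EXPLAIN ANALYZE SELECT count(*) FROM t_bigint  WHERE id = 21;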

-- 
-
Ron Johnson, Jr. [EMAIL PROTECTED]
Jefferson, LA USA

Adventure is a sign of incompetence
Stephanson, great polar explorer



Re: [PERFORM] How to force Nested Loop plan?

2003-08-30 Thread Rob Nagler
Tom Lane writes:
> That doesn't really tell me anything.  What's the proportion of 21
> records out of the total table?

Currently we have about 15 servers, so roughly 6% of the data is
uniformly distributed with the value 21.

>    create index fooi on foo (min_date_time) where server_id = 21;
>
> This reduces the cost of maintaining the index but also makes it useful
> *only* for id = 21 queries.  On the plus side, you don't need to hack
> the ORDER BY clause to get your queries to use it.  Your choice...

I like that better, thanks.  After testing I found the elbow was at
1610 records with this index, but this clause still yields better
performance at 1654 records:

AND aa_t.server_id IN (21, 0)

This is independent of the existence of the new index.
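
For the record, the partial index I created was the obvious translation
of Tom's suggestion onto aa_t (the index name is mine):

    create index aa_t_min_dt_21 on aa_t (min_date_time)
        where server_id = 21;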

Interestingly, when I drop the aa_t5 index, the elbow goes up to
1729 with the IN (21, 0) query.

You might ask: why do I have an index at all?  That's from my Oracle
experience.  server_id is a foreign key into the server table.  If you
don't create an index on a foreign key, Oracle locks the entire
foreign table when you modify the local table.  With an index, it only
locks a row in the foreign table.  This causes a major bottleneck, but
in this case the server table is static.  Therefore, the index is
superfluous, and since there are only 16 values, the index should be
bitmap index (Oracle speak, sorry, don't know the PG term).  Dropping
the index probably won't change any of the other queries, I think.

Without the aa_t5 index and after the elbow, the Index Scan is
replaced with a Seq Scan, which is just about as fast, but still 50
times slower than before the elbow:

 Limit  (cost=34071.30..34071.31 rows=1 width=84) (actual time=5111.14..5111.15 rows=1 loops=1)
   ->  Sort  (cost=34066.98..34075.61 rows=3454 width=84) (actual time=5108.74..5109.96 rows=1733 loops=1)
         Sort Key: aa_t.min_date_time
         ->  Merge Join  (cost=33801.26..33863.98 rows=3454 width=84) (actual time=4868.62..5020.58 rows=41879 loops=1)
               Merge Cond: ("outer".realm_id = "inner".realm_id)
               ->  Sort  (cost=31.64..32.78 rows=455 width=19) (actual time=3.06..3.38 rows=415 loops=1)
                     Sort Key: cc_t.realm_id
                     ->  Seq Scan on cc_t  (cost=0.00..11.55 rows=455 width=19) (actual time=0.05..0.99 rows=455 loops=1)
               ->  Sort  (cost=33769.63..33778.26 rows=3454 width=65) (actual time=4865.20..4895.28 rows=41879 loops=1)
                     Sort Key: aa_t.realm_id
                     ->  Merge Join  (cost=33296.79..33566.63 rows=3454 width=65) (actual time=4232.52..4541.24 rows=41879 loops=1)
                           Merge Cond: ("outer".bb_id = "inner".bb_id)
                           ->  Sort  (cost=25216.97..25225.60 rows=3454 width=46) (actual time=3213.53..3243.65 rows=41879 loops=1)
                                 Sort Key: aa_t.bb_id
                                 ->  Seq Scan on aa_t  (cost=0.00..25013.97 rows=3454 width=46) (actual time=20.07..2986.11 rows=41879 loops=1)
                                       Filter: (server_id = 21::numeric)
                           ->  Sort  (cost=8079.83..8184.53 rows=41879 width=19) (actual time=1018.95..1049.37 rows=41879 loops=1)
                                 Sort Key: bb_t.bb_id
                                 ->  Seq Scan on bb_t  (cost=0.00..4864.79 rows=41879 width=19) (actual time=0.04..810.88 rows=41879 loops=1)
 Total runtime: 5141.22 msec

What I'm not sure about is why it decides to switch modes so early,
i.e., at about 5% of the table size or less.  It seems that an Index
Scan would give better mileage than a Seq Scan for possibly up to 50%
of the table in this case.  I clearly don't understand the internals,
but the elbow seems rather sharp to me.
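
My guess is the crossover is governed by the planner's cost settings
rather than anything structural; perhaps something like this would push
the elbow out, though I haven't tried it:

    -- defaults are 4.0 and 1000 (8K pages); these values are guesses
    SET random_page_cost = 2.0;
    SET effective_cache_size = 10000;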

> > What if the ORDER BY was:
> > ORDER BY aa_t.server_id DESC, cc_t.name ASC
> > Would the planner do the right thing?
>
> What do you consider "the right thing"?  cc_t.name doesn't seem connected
> to this table at all --- or did I miss something?

Sorry, this is a red herring.  Please ignore.

> If you've been generically using NUMERIC(n) where you could be using
> integer or bigint, then I think you've probably paid a high price
> without knowing it.  I don't know what Oracle's cost tradeoffs are for
> these datatypes, but I can tell you that Postgres's integer types are
> way faster (and more compact) than our NUMERIC.

I'll try to figure out what the price is in our case.  I think Oracle
does a pretty good job on data compression for NUMERIC.  I haven't
dealt with a large Postgres database until this one, so I guess it's
time to learn. :)

We actually have been quite pleased with Postgres's performance
without paying much attention to it before this.  When we first set up
Oracle, we got into all of its parameters pretty heavily.  With
Postgres, we just tried it and it worked.  This is the first query
where we ran out of ideas to try.

BTW, everybody's help on this list is fantastic.  Usually, I can 

Re: [PERFORM] How to force Nested Loop plan?

2003-08-30 Thread Ron Johnson
On Sat, 2003-08-30 at 15:42, Rob Nagler wrote:
[snip]
> We actually have been quite pleased with Postgres's performance
> without paying much attention to it before this.  When we first set up
> Oracle, we got into all of its parameters pretty heavily.  With
> Postgres, we just tried it and it worked.  This is the first query
> where we ran out of ideas to try.

Dumb question: given your out-of-the-box satisfaction, could it be
that postgresql.conf hasn't been tweaked?
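
For instance (7.3 parameter names; the values are purely illustrative,
not recommendations):

    shared_buffers = 4096           # ~32MB; the shipped default is tiny
    sort_mem = 8192                 # KB available per sort
    effective_cache_size = 65536    # 8K pages the kernel likely caches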

-- 
-
Ron Johnson, Jr. [EMAIL PROTECTED]
Jefferson, LA USA

484,246 sq mi are needed for 6 billion people to live, 4 persons 
per lot, in lots that are 60'x150'.
That is ~ California, Texas and Missouri.
Alternatively, France, Spain and The United Kingdom.

