Re: [PERFORM] Performance problem with semi-large tables

2005-01-30 Thread PFC
SELECT * FROM shipment_status WHERE release_code_id = constant ORDER BY  
release_code_id DESC, date DESC LIMIT 100;
I have done this in other queries where sorting by both release code and  
date were important. You are right, it is very fast and I do have this  
index in play.  However, most of the time I retrieve shipments when
their shipment_status all have the same release_code, which makes  
sorting kind of moot :/  I guess that answers your comment below.
	Ah, well in this case, ORDER BY release_code_id DESC seems of course
useless because you only have one release_code_id, but it is in fact
necessary to make the planner realize it can use the index on  
(release_code_id,date) for the ordering. If you just ORDER BY date, the  
planner will not use your index.
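	For illustration, a minimal sketch of the two forms, assuming an index
on (release_code_id, date) and a hypothetical literal id in place of the join:

SELECT * FROM shipment_status WHERE release_code_id = 42
ORDER BY release_code_id DESC, date DESC LIMIT 100;  -- backward index scan

SELECT * FROM shipment_status WHERE release_code_id = 42
ORDER BY date DESC LIMIT 100;  -- planner falls back to a sort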

Thanks again for your response.  I'll try to clarify some metrics that
took me a few days to gather while figuring out the best join order.

By running some count queries on the production database, I noticed  
there were only 8 rows in release_code.  The filtered column is unique,
Let's forget the shipments table for now.
	So you mean there is a unique, one-to-one relation between
release_code_id and filtered_column?
	The planner is not able to determine this ahead of time; and in your
case, it's important that it be unique so the index can be used to
retrieve the rows quickly in (date DESC) order.
	So if you will only join to ONE release_code_id, you can do this:

(SELECT * FROM shipment_status WHERE release_code_id =
(SELECT r.id FROM release_code r WHERE r.filtered_column = '5' LIMIT 1)
ORDER BY release_code_id DESC, date DESC LIMIT 100)
	Which is no longer a join and will get your shipment_status_id's very  
quickly.

so that means the filter ratio is 0.125.  However, the data distribution  
is not normal.  When the filtered column is the constant '5', Postgres  
will join to 54% of the shipment_status rows.  Since shipment_status has  
32,000+ rows, this join is not a very good one to make.
Sure!
The shipment table has 17k rows, but also due to the distribution of  
data, almost every shipment will join to a shipment_status with a  
release_code of '5'.  For your information, this column indicates that a  
shipment has been released, as most shipments will move to this state  
eventually.  The actual join ratio from shipment_status to shipment is  
about 98.5% of the rows in the shipment table, which is still basically  
17k rows.

I was simply curious how to make something like this faster.  You see,  
it's the table size and the bad filters that are really destroying this query
example. I would never make a query to the database like this in  
practice, but I have similar queries that I do make that aren't much  
better (and can't be due to business requirements).

For example, let's add another filter to get all the shipments with  
release code '5' that are 7 days old or newer.

ss.date >= current_date - 7
	It's the order by + limit which makes the query behave badly, and which
forces use of kludges to use the index. If you add another condition like
that, it should be a breeze.
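	For instance, a hedged sketch combining both conditions (the subquery
form is from earlier in this thread):

SELECT * FROM shipment_status
WHERE release_code_id =
    (SELECT r.id FROM release_code r WHERE r.filtered_column = '5' LIMIT 1)
  AND date >= current_date - 7
ORDER BY release_code_id DESC, date DESC LIMIT 100;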

By analyzing the production data, this where clause has a filter ratio  
of 0.08, which is far better than the release_code filter both in ratio  
and in the number of rows that it can avoid joining.  However, if I add
this filter to the original query, Postgres will not act on it first -
and I think it really should before it even touches release_code.
	Well, I think so too.
	What about the subqueries I wrote with the LIMIT inside the subquery? Any
better?
	Normally the planner is able to deconstruct subqueries and change the
order as it sees fit, but if there are LIMITs, I don't know.

However, the planner (using EXPLAIN ANALYZE) will actually pick this  
filter last and will join 17k rows prematurely to release_code.  In this  
example, I'd like to force Postgres to do the date filter first, join to
release_code next, then finally to shipment.
You could use the JOIN keywords to specify the join order yourself.
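For instance, a hedged sketch (explicit JOIN syntax fixes the join order
outright before 7.4; from 7.4 on, setting join_collapse_limit to 1 restores
that behaviour):

SET join_collapse_limit = 1;  -- make the explicit JOIN order binding
SELECT s.*, ss.*
FROM shipment_status ss
JOIN release_code r ON ss.release_code_id = r.id
JOIN shipment s ON s.current_status_id = ss.id
WHERE r.filtered_column = '5'
  AND ss.date >= current_date - 7
ORDER BY ss.date DESC LIMIT 100;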
Another example is filtering by the driver_id, which is a foreign key  
column on the shipment table itself to a driver table.  This has a  
filter ratio of 0.000625 when analyzing the production data.  However,  
PostgreSQL will not act on this filter first either.  The sad part is  
that since drivers are actually distributed more evenly in the database,  
it would filter out the shipment table from 17k to about 10 shipments on  
average.  In most cases, it ends up being more than 10, but not more  
than 60 or 70, which is very good since some drivers don't have any  
shipments (I question why they are even in the database, but that's  
another story).  As you can see, joining to shipment_status at this  
point (using the primary key index from shipment.current_status_id to  
shipment_status.id) should be extremely efficient.  Yet, Postgres's  
planner/optimizer won't make the right call until much later in the plan.

Re: [PERFORM] Performance problem with semi-large tables

2005-01-30 Thread Tom Lane
PFC [EMAIL PROTECTED] writes:
 For example, let's add another filter to get all the shipments with  
 release code '5' that are 7 days old or newer.
 
 ss.date >= current_date - 7

   It's the order by + limit which makes the query behave badly, and which
 forces use of kludges to use the index. If you add another condition like
 that, it should be a breeze.

Actually, that date condition has its own problem, namely that the
compared-to value isn't a constant.  The 8.0 planner is able to realize
that this is a pretty selective condition, but prior releases fall back
on a very pessimistic default estimate.  I'm sure that has something to
do with Ken not being able to get it to use an index on date.
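A hedged sketch of the workaround this implies: compute the cutoff in the
application and send a literal, so that pre-8.0 planners see a constant (the
date below is hypothetical):

SELECT * FROM shipment_status
WHERE date >= '2005-01-23'  -- application-computed current_date - 7
ORDER BY date DESC LIMIT 100;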

regards, tom lane



Re: [PERFORM] Performance problem with semi-large tables

2005-01-29 Thread Ken Egervari



Yes, I'm very well aware of VACUUM and VACUUM
ANALYZE. I've even clustered the date index and so on to ensure faster
performance.

  - Original Message -
  From: David Parker
  To: Ken Egervari ; pgsql-performance@postgresql.org
  Sent: Saturday, January 29, 2005 5:04 PM
  Subject: Re: [PERFORM] Performance problem with semi-large tables
  
  You don't mention if you have run VACUUM or VACUUM 
  ANALYZE lately. That's generally one of the first things that folks will 
  suggest. If you have a lot of updates then VACUUM will clean up dead tuples; 
  if you have a lot of inserts then VACUUM ANALYZE will update statistics so 
  that the planner can make better decisions (as I understand 
  it).
  
  Another data point people will ask for in helping you 
  will be EXPLAIN ANALYZE output from running the queries you think are slowing 
  down.
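  For instance, a minimal sketch using the table names from this thread:

  VACUUM ANALYZE shipment;
  VACUUM ANALYZE shipment_status;
  EXPLAIN ANALYZE SELECT ...;  -- paste the slow query here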
  
  - DAP
  


From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Ken Egervari
Sent: Wednesday, January 26, 2005 9:17 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Performance problem with semi-large tables

Hi everyone.

I'm new to this forum and was wondering if 
anyone would be kind enough to help me out with a pretty severe performance 
issue. I believe the problem to be rather generic, so I'll put it in 
generic terms. Since I'm at home and not at work (but this is really
bugging me), I can't post any specifics. However, I think my
explanation will suffice.

I have 2 tables that are getting large
and will only get larger with time (exponentially, as more users sign on to
the system). Right now, a table called 'shipment' contains about
16,000 rows and 'shipment_status' contains about 32,500 rows. These
aren't massive row counts (I keep reading about tables with millions), but they
will definitely get into 6 digits by next year, and query performance is
quite poor.

Now, from what I can understand about tuning, 
you want to specify good filters, provide good indexes on the driving filter 
as well as any referential keys that are used while joining. This has
helped me solve performance problems many times in the past (for example, 
changing a query speed from 2 seconds to 21 milliseconds). 


However, I am now tuning queries that operate 
on these two tables and the filters aren't very good (the best is a filter 
ratio of 0.125) and the number of rows returned is very large (not taking 
into consideration limits).

For example, consider something like this
query that takes ~1 second to finish:

select s.*, ss.*
from shipment s, shipment_status ss, release_code r
where s.current_status_id = ss.id
  and ss.release_code_id = r.id
  and r.filtered_column = '5'
order by ss.date desc
limit 100;

Release code is just a very small table of 8 
rows by looking at the production data, hence the 0.125 filter ratio. 
However, the data distribution is not normal since the filtered column 
actually pulls out about 54% of the rows in shipment_status when it 
joins. Postgres seems to be doing a sequential scan to pull out all of
these rows. Next, it joins approx 17550 rows to shipment. Since 
this query has a limit, it only returns the first 100, which seems like a 
waste.

Now, for this query, I know I can filter out 
the date instead to speed it up. For example, I can probably search 
for all the shipments in the last 3 days instead of limiting it to 
100. But since this isn't a real production query, I only wanted to 
show it as an example since many times I cannot do a filter by the date (and 
the sort may be date or something else irrelevant).

I'm just stressed out about how I can make queries
like this more efficient, since all I see is a bunch of hash joins and
sequential scans taking all kinds of time.

I guess here are my 2 questions:

1. Should I just beg to change the
requirements so that I can make more specific queries and more screens to
access those?
2. Can you recommend ways so that postgres acts 
on big tables more efficiently? I'm not really interested in this 
specific case (I just made it up). I'm more interested in general 
solutions to this general problem of big table sizes with bad filters and 
where join orders don't seem to help much.

Thank you very much for your
help.

Best Regards,
Ken 
Egervari


Re: [PERFORM] Performance problem with semi-large tables

2005-01-29 Thread Ken Egervari
Well, postgres does what you asked. It will be slow, because you have a 
full table join. LIMIT does not change this because the rows have to be 
sorted first.
I am aware that limit doesn't really affect the execution time all that 
much.  It does speed up ORM though and keeps the rows to a manageable list 
so users don't have to look at thousands, which is good enough for me.   My 
intention here is that the date was supposed to be a good filter.

The date is in shipment_status so you should first get the 
shipment_status.id that you need and later join to shipment. This will 
avoid the big join:

SELECT s.*, ss.* FROM
(SELECT * FROM shipment_status WHERE release_code_id IN
(SELECT r.id FROM release_code r WHERE r.filtered_column = '5')
ORDER BY date DESC LIMIT 100
) as ss, shipment s
WHERE s.current_status_id = ss.id
ORDER BY ss.date DESC LIMIT 100
Is this better ?
This looks like it might be what I want.  It's not that I was not aware of 
the correct join order.  I used Dan Tow's diagram method and learned that 
filtering on date first is the best approach, then release code, then finally
shipment for this particular query.  I just didn't know how to tell 
PostgreSQL how to do this.

So are you suggesting as a general rule then that sub-queries are the way to 
force a specific join order in postgres?  If that is the case, I will do 
this from now on. 



Re: [PERFORM] Performance problem with semi-large tables

2005-01-29 Thread PFC

So are you suggesting as a general rule then that sub-queries are the  
way to force a specific join order in postgres?  If that is the case, I  
will do this from now on.
I'll try to explain a bit better...
Here's your original query:
select s.*, ss.*
from shipment s, shipment_status ss, release_code r
where s.current_status_id = ss.id
   and ss.release_code_id = r.id
   and r.filtered_column = '5'
order by ss.date desc
limit 100;
If you write something like:
SELECT * FROM shipment_status WHERE release_code_id = constant ORDER BY  
release_code_id DESC, date DESC LIMIT 100;

	In this case, if you have an index on (release_code_id, date), the  
planner will use a limited index scan which will yield the rows in index  
order, which will be very fast.
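	A minimal sketch of such an index, assuming the table and column names
used in this thread (the index name is illustrative):

CREATE INDEX shipment_status_release_date_idx
    ON shipment_status (release_code_id, date);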

	However, if you just have an index on date, this won't help you.
	In your case, moreover, you don't use release_code_id = constant, but it  
comes from a join. So there may be several different values for  
release_code_id; thus the planner can't use the optimization, it has to
find the rows with the release_code_id first. And it can't use the index  
on (release_code_id, date) to get the rows in sorted order precisely  
because there could be several different values for the release_code_id.  
And then it has to sort by date.

	I hope this makes it clearer. If you are absolutely sure there is only  
one row in release_code with r.filtered_column = '5', then this means  
release_code_id is a constant and your query could get a huge speedup by  
writing it differently.



Re: [PERFORM] Performance problem with semi-large tables

2005-01-29 Thread Ken Egervari
Thanks again for your response.  I'll try to clarify some metrics that
took me a few days to gather while figuring out the best join order.

By running some count queries on the production database, I noticed there 
were only 8 rows in release_code.  The filtered column is unique, so that 
means the filter ratio is 0.125.  However, the data distribution is not 
normal.  When the filtered column is the constant '5', Postgres will join to 
54% of the shipment_status rows.  Since shipment_status has 32,000+ rows, 
this join is not a very good one to make.

The shipment table has 17k rows, but also due to the distribution of data, 
almost every shipment will join to a shipment_status with a release_code of 
'5'.  For your information, this column indicates that a shipment has been 
released, as most shipments will move to this state eventually.  The 
actual join ratio from shipment_status to shipment is about 98.5% of the 
rows in the shipment table, which is still basically 17k rows.

I was simply curious how to make something like this faster.  You see, it's
the table size and the bad filters that are really destroying this query example.
I would never make a query to the database like this in practice, but I have 
similar queries that I do make that aren't much better (and can't be due to 
business requirements).

For example, let's add another filter to get all the shipments with release 
code '5' that are 7 days old or newer.

   ss.date >= current_date - 7
By analyzing the production data, this where clause has a filter ratio of 
0.08, which is far better than the release_code filter both in ratio and in 
the number of rows that it can avoid joining.  However, if I add this filter
to the original query, Postgres will not act on it first - and I think it
really should before it even touches release_code.  However, the planner 
(using EXPLAIN ANALYZE) will actually pick this filter last and will join 
17k rows prematurely to release_code.  In this example, I'd like to force
postgres to do the date filter first, join to release_code next, then 
finally to shipment.

Another example is filtering by the driver_id, which is a foreign key column 
on the shipment table itself to a driver table.  This has a filter ratio of 
0.000625 when analyzing the production data.  However, PostgreSQL will not 
act on this filter first either.  The sad part is that since drivers are 
actually distributed more evenly in the database, it would filter out the 
shipment table from 17k to about 10 shipments on average.  In most cases, it 
ends up being more than 10, but not more than 60 or 70, which is very good 
since some drivers don't have any shipments (I question why they are even in 
the database, but that's another story).  As you can see, joining to 
shipment_status at this point (using the primary key index from 
shipment.current_status_id to shipment_status.id) should be extremely 
efficient.  Yet, Postgres's planner/optimizer won't make the right call 
until much later in the plan.
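A hedged sketch of forcing that order with the subquery trick from earlier in
the thread (the driver id is hypothetical; OFFSET 0 is a known planner fence
that keeps the subquery from being flattened):

SELECT s.*, ss.*
FROM (SELECT * FROM shipment WHERE driver_id = 42 OFFSET 0) AS s
JOIN shipment_status ss ON ss.id = s.current_status_id
ORDER BY ss.date DESC LIMIT 100;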

SELECT * FROM shipment_status WHERE release_code_id = constant ORDER BY 
release_code_id DESC, date DESC LIMIT 100;

In this case, if you have an index on (release_code_id, date), the 
planner will use a limited index scan which will yield the rows in index 
order, which will be very fast.
I have done this in other queries where sorting by both release code and 
date were important. You are right, it is very fast and I do have this index 
in play.  However, most of the time I retrieve shipments when their
shipment_status all have the same release_code, which makes sorting kind of 
moot :/  I guess that answers your comment below.

However, if you just have an index on date, this won't help you.
In your case, moreover, you don't use release_code_id = constant, but it 
comes from a join. So there may be several different values for 
release_code_id; thus the planner can't use the optimization, it has to
find the rows with the release_code_id first. And it can't use the index 
on (release_code_id, date) to get the rows in sorted order precisely 
because there could be several different values for the release_code_id. 
And then it has to sort by date.
Well, the filtered column is actually unique (but it's not the primary key). 
Should I just make it the primary key?  Can't postgres be equally efficient 
when using other candidate keys as well?  If not, then I will definitely
change the design of my database.  I mostly use synthetic keys to make
Hibernate configuration fairly straightforward and to make it easy so all
of my entities extend from the same base class.
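If it's only the uniqueness that matters, a hedged sketch: a UNIQUE constraint
on the candidate key documents the one-to-one relationship and gives the
planner an index to use, without changing the primary key (the constraint name
is illustrative):

ALTER TABLE release_code
    ADD CONSTRAINT release_code_filtered_column_key UNIQUE (filtered_column);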

I hope this makes it clearer. If you are absolutely sure there is only 
one row in release_code with r.filtered_column = '5', then this means 
release_code_id is a constant and your query could get a huge speedup by 
writing it differently.
You mean by avoiding the filter on number and avoiding the join?  You see, I 
never thought joining to release_code should be so bad since