Re: [PERFORM] pgfouine - commit details?
We are shipping the postgres.log to a remote syslog repository to take the I/O burden off our postgresql server. As such, if we set log_min_duration_statement to 0, this allows us to get more detailed information about our commits using pgfouine...correct? -- Josh

From: Guillaume Smet [mailto:[EMAIL PROTECTED]] Sent: Tue 5/6/2008 7:31 PM To: Josh Cole Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] pgfouine - commit details?

Josh, On Tue, May 6, 2008 at 11:10 PM, Josh Cole <[EMAIL PROTECTED]> wrote:
> We are using pgfouine to try to optimize our database. Is there a way to have pgfouine show examples or break out commits?

I hesitated over this idea but ended up not implementing it. The problem is that you often don't log everything, relying on log_min_duration_statement instead, and thus you don't have all the queries of the transaction in your log file (and you usually don't have the BEGIN; command in the logs). -- Guillaume
Re: [PERFORM] Possible Redundancy/Performance Solution
On Tue, 6 May 2008, Dennis Muhlestein wrote:
> Those are good points. So you'd go ahead and add the pgpool in front (or another redundancy approach), but then use RAID 1, 5, or perhaps 10 on each server?

Right. I don't advise using the fact that you've got some sort of replication going as an excuse to reduce the reliability of individual systems, particularly in the area of disks (unless you're really creating a much larger number of replicas than 2). RAID5 can be problematic compared to other RAID setups for write-heavy workloads with small blocks, and it should be avoided for database use. You can find stories on this subject in the archives here, and some of the papers at http://www.baarf.com/ go over why; "Is RAID 5 Really a Bargain?" is the one I like best.

If you were thinking about 4 or more disks, there are a number of ways to distribute those:
1) RAID1+0 to make one big volume
2) RAID1 for OS/apps/etc, RAID1 for database
3) RAID1 for OS+xlog, RAID1 for database
4) RAID1 for OS+popular tables, RAID1 for rest of database

Exactly which of these splits is best depends on your application and the tradeoffs important to you, but any of these should improve performance and reliability over what you're doing now. I personally tend to create two distinct volumes rather than using any striping here, create a tablespace or three right from the start, and then manage the underlying mapping to disk with symbolic links so I can shift the allocation around. That does require a steady hand and good nerves for when you screw up, so I wouldn't recommend it to everyone. As you get more disks it gets less practical to handle things this way, and it becomes increasingly sensible to just make one big array out of them and stop worrying about it.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
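To make the tablespace-plus-symlink approach concrete, here is a minimal sketch under assumed names: the volumes, directories, and tablespace below are all hypothetical, and the symlink step is done with the server stopped.

-- put a hot tablespace on the second RAID1 volume
CREATE TABLESPACE hot_ts LOCATION '/vol2/pgdata/hot_ts';
ALTER TABLE busy_table SET TABLESPACE hot_ts;

# later, shift the underlying storage without touching the catalogs:
pg_ctl stop
mv /vol2/pgdata/hot_ts /vol1/pgdata/hot_ts
ln -s /vol1/pgdata/hot_ts /vol2/pgdata/hot_ts
pg_ctl start

The point of the indirection is that the database only knows the original path; the symlink lets you rebalance disk allocation later without any catalog changes.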
Re: [PERFORM] Possible Redundancy/Performance Solution
On Tue, May 6, 2008 at 3:39 PM, Dennis Muhlestein <[EMAIL PROTECTED]> wrote:
> Those are good points. So you'd go ahead and add the pgpool in front (or another redundancy approach), but then use RAID 1, 5, or perhaps 10 on each server?

That's what I'd do: specifically, RAID10 for small to medium drive sets used for transactional stuff, and RAID6 for very large reporting databases that are mostly read.
Re: [PERFORM] Seqscan problem
Tom Lane writes:
> Vlad Arkhipov <[EMAIL PROTECTED]> writes:
>> I've just discovered a problem with a quite simple query. It's really confusing me. PostgreSQL 8.3.1, random_page_cost=1.1. All tables were analyzed before the query.
> What have you got effective_cache_size set to? regards, tom lane

1024M
Re: [PERFORM] pgfouine - commit details?
Josh, On Tue, May 6, 2008 at 11:10 PM, Josh Cole <[EMAIL PROTECTED]> wrote:
> We are using pgfouine to try to optimize our database. Is there a way to have pgfouine show examples or break out commits?

I hesitated over this idea but ended up not implementing it. The problem is that you often don't log everything, relying on log_min_duration_statement instead, and thus you don't have all the queries of the transaction in your log file (and you usually don't have the BEGIN; command in the logs). -- Guillaume
Re: [PERFORM] Possible Redundancy/Performance Solution
Greg Smith wrote:
> On Tue, 6 May 2008, Dennis Muhlestein wrote:
> Since disks are by far the most likely thing to fail, I think it would be bad planning to switch to a design that doubles the chance of a disk failure taking out the server just because you're adding some server-level redundancy. Anybody who's been in this business for a while will tell you that seemingly improbable double failures happen, and if I were you I'd want a plan that survives a) a single disk failure on the primary and b) a single disk failure on the secondary at the same time. Let me strengthen that--I don't feel comfortable unless I'm able to survive a single disk failure on the primary and complete loss of the secondary (say by power supply failure), because a double failure that starts that way is a lot more likely than you might think. Especially with how awful hard drives are nowadays.

Those are good points. So you'd go ahead and add the pgpool in front (or another redundancy approach), but then use RAID 1, 5, or perhaps 10 on each server? -Dennis
[PERFORM] pgfouine - commit details?
We are using pgfouine to try to optimize our database. Is there a way to have pgfouine show examples or break out commits?

Queries that took up the most time:

Rank  Total duration  Times executed  Av. duration (s)  Query
1     26m54s          222,305         0.01              COMMIT;

Perhaps we need to tweak what is being logged by postgresql?

log_destination = 'syslog'
logging_collector = on
log_directory = 'pg_log'
log_truncate_on_rotation = on
log_rotation_age = 1d
syslog_facility = 'LOCAL0'
syslog_ident = 'postgres'
client_min_messages = notice
log_min_messages = notice
log_error_verbosity = default
log_min_error_statement = notice
log_min_duration_statement = 2
#silent_mode = off
debug_print_parse = off
debug_print_rewritten = off
debug_print_plan = off
debug_pretty_print = off
log_checkpoints = off
log_connections = off
log_disconnections = off
log_duration = off
log_hostname = off
#log_line_prefix = ''
log_lock_waits = on
log_statement = 'none'
#log_temp_files = -1
#log_timezone = unknown
#track_activities = on
#track_counts = on
#update_process_title = on
#log_parser_stats = off
#log_planner_stats = off
#log_executor_stats = off
#log_statement_stats = off

Regards, Josh
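A note on the settings above: with log_min_duration_statement = 2, only statements slower than 2 ms reach the log, so most of a transaction's statements are never recorded. A sketch of the change the thread converges on, using standard postgresql.conf parameters (whether the extra log volume is acceptable for this setup is an assumption):

log_min_duration_statement = 0   # log every statement with its duration

This form keeps each statement and its duration on a single log line, which is what pgfouine parses; log_statement = 'all' plus log_duration = on would log the same information but split across lines.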
Re: [PERFORM] RAID 10 Benchmark with different I/O schedulers
Greg Smith wrote:
> On Tue, 6 May 2008, Craig James wrote:
>> I only did two runs of each, which took about 24 minutes. Like the first round of tests, the "noise" in the measurements (about 10%) exceeds the differences between the scheduler algorithms, except that "anticipatory" seems to be measurably slower.
> Those are much better results. Any test that says anticipatory is anything other than useless for database system use with a good controller I presume is broken, so that's how I know you're in the right ballpark now but weren't before. In order to actually get some useful data out of the noise that is pgbench, you need a lot more measurements of longer runs. As perspective, the last time I did something in this area, in order to get enough data to get a clear picture I ran tests for 12 hours. I'm hoping to repeat that soon with some more common hardware that gives useful results I can give out.

This data is good enough for what I'm doing. There were reports from non-RAID users that the I/O scheduler could make as much as a 4x difference in performance (which makes sense for non-RAID), but these tests show me that three of the four I/O schedulers are within 10% of each other. Since this matches my intuition of how battery-backed RAID will work, I'm satisfied. If our servers get overloaded to the point where 10% matters, then I need a much more dramatic solution, like faster machines or more machines. Craig
Re: [PERFORM] Possible Redundancy/Performance Solution
On Tue, 6 May 2008, Dennis Muhlestein wrote:
> I was planning on pgpool being the cushion between the raid0 failure probability and my need for redundancy. This way, I get protection against not only disks, but CPUs, memory, network cards, motherboards, etc. Is this not a reasonable approach?

Since disks are by far the most likely thing to fail, I think it would be bad planning to switch to a design that doubles the chance of a disk failure taking out the server just because you're adding some server-level redundancy. Anybody who's been in this business for a while will tell you that seemingly improbable double failures happen, and if I were you I'd want a plan that survives a) a single disk failure on the primary and b) a single disk failure on the secondary at the same time. Let me strengthen that--I don't feel comfortable unless I'm able to survive a single disk failure on the primary and complete loss of the secondary (say by power supply failure), because a double failure that starts that way is a lot more likely than you might think. Especially with how awful hard drives are nowadays.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PERFORM] RAID 10 Benchmark with different I/O schedulers
On Tue, 6 May 2008, Craig James wrote:
> I only did two runs of each, which took about 24 minutes. Like the first round of tests, the "noise" in the measurements (about 10%) exceeds the differences between the scheduler algorithms, except that "anticipatory" seems to be measurably slower.

Those are much better results. Any test that says anticipatory is anything other than useless for database system use with a good controller I presume is broken, so that's how I know you're in the right ballpark now but weren't before. In order to actually get some useful data out of the noise that is pgbench, you need a lot more measurements of longer runs. As perspective, the last time I did something in this area, in order to get enough data to get a clear picture I ran tests for 12 hours. I'm hoping to repeat that soon with some more common hardware that gives useful results I can give out.

> So it still looks like cfq, noop and deadline are more or less equivalent when used with a battery-backed RAID.

I think it's fair to say they're within 10% of one another on raw throughput. The thing you're not measuring here is worst-case latency, and that's where there might be a more interesting difference. Most tests I've seen suggest deadline is the best in that regard, cfq the worst, and where noop fits in depends on the underlying controller. pgbench produces log files with latency measurements if you pass it "-l". Here's a snippet of shell that runs pgbench then looks at the resulting latency results for the worst 5 numbers:

pgbench ... -l &
p=$!
wait $p
mv pgbench_log.${p} pgbench.log
echo Worst latency results:
cat pgbench.log | cut -f 3 -d " " | sort -n | tail -n 5

However, that may not give you much useful info either--in most cases checkpoint issues kind of swamp the worst-case behavior in PostgreSQL, and to quantify I/O schedulers you need to look at more complicated statistics on latency.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PERFORM] Possible Redundancy/Performance Solution
Greg Smith wrote:
> On Tue, 6 May 2008, Dennis Muhlestein wrote:
> RAID0 on two disks makes a disk failure that will wipe out the database twice as likely. If your goal is better reliability, you want some sort of RAID1, which you can do with two disks. That should increase read throughput a bit (not quite double though) while keeping write throughput about the same.

I was planning on pgpool being the cushion between the raid0 failure probability and my need for redundancy. This way, I get protection against not only disks, but CPUs, memory, network cards, motherboards, etc. Is this not a reasonable approach?

> If you added four disks, then you could do a RAID1+0 combination which should substantially outperform your existing setup in every respect while also being more resilient to drive failure.
>> Our applications are mostly read intensive. I don't think that having two databases on one machine, where previously we had just one, would add too much of an impact, especially if we use the load balance feature of pgpool as well as the redundancy feature.
> A lot depends on how much RAM you've got and whether it's enough to keep the cache hit rate fairly high here. A reasonable thing to consider here is doing a round of standard performance tuning on the servers to make sure they're operating efficiently before increasing their load.
>> Can anyone comment on any gotchas or issues we might encounter?
> Getting writes to replicate to multiple instances of the database usefully is where all the really nasty gotchas are in this area. Starting with that part and working your way back toward the front-end pooling from there should crash you into the hard parts early in the process.

Thanks for the tips! Dennis
Re: [PERFORM] RAID 10 Benchmark with different I/O schedulers
Greg Smith wrote:
> On Mon, 5 May 2008, Craig James wrote:
>> pgbench -i -s 20 -U test
> That's way too low to expect you'll see a difference in I/O schedulers. A scale of 20 is giving you a 320MB database, you can fit the whole thing in RAM and almost all of it on your controller cache. What's there to schedule? You're just moving between buffers that are generally large enough to hold most of what they need.

Test repeated with:
- autovacuum enabled
- database destroyed and recreated between runs
- pgbench -i -s 600 ...
- pgbench -c 10 -t 5 -n ...

I/O Sched      AVG   Test1  Test2
------------   ---   -----  -----
cfq            705    695    715
noop           758    769    747
deadline       741    705    775
anticipatory   494    477    511

I only did two runs of each, which took about 24 minutes. Like the first round of tests, the "noise" in the measurements (about 10%) exceeds the differences between the scheduler algorithms, except that "anticipatory" seems to be measurably slower. So it still looks like cfq, noop and deadline are more or less equivalent when used with a battery-backed RAID. Craig
Re: [PERFORM] multiple joins + Order by + LIMIT query performance issue
On Tue, 2008-05-06 at 18:24 +0100, Antoine Baudoux wrote:
> Isn't the planner fooled by the index on the sorting column? If I remove the index the query runs OK.

In your case, for whatever reason, the stats say doing the index scan on the sorted column will give you the results faster. That isn't always the case, and sometimes you can give the same query different where clauses and that same slow index scan will randomly be fast. It's all based on the index distribution and the particular values being fetched. This goes back to what Tom said: if you know a "miss" can result in terrible performance, it's best to just recode the query to avoid the situation.

> This is crazy, so simply by adding a LIMIT to a query, the planning is changed in a very bad way. Does the planner use the LIMIT as a sort of hint?

Yes. That's actually what tells it the index scan can be a "big win." If it scans the index backwards on values returned from some of your joins, it may just have to find 25 rows and then it can immediately stop scanning and just give you the results. In normal cases, this is a massive performance boost when you have an order clause and are expecting a ton of results (say you're getting the first 25 rows of 1 million or something). But if it would be faster to generate the results and *then* sort, and Postgres thinks otherwise, you're pretty much screwed. But that's the long answer. You have like 3 ways to get around this now, so pick one. ;)

-- Shaun Thomas Database Administrator Leapfrog Online 807 Greenwood Street Evanston, IL 60201 Tel. 847-440-8253 Fax. 847-570-5750 www.leapfrogonline.com
Re: [PERFORM] multiple joins + Order by + LIMIT query performance issue
On Tue, 2008-05-06 at 18:59 +0100, Tom Lane wrote:
> Whether the scan is forwards or backwards has nothing to do with it. The planner is using the index ordering to avoid having to do a full-table scan and sort.

Oh, I know that. I just noticed that when this happened to us, more often than not it was a reverse index scan that did it. The thing that annoyed me most was when it happened on an index where, even on a table having 20M rows, the cardinality is < 10 for almost every value of that index. In our case, having a "LIMIT 1" was much worse than just getting back 5 or 10 rows and throwing away everything after the first one.

> but when it's a win it can be a big win, too, so "it's a bug, take it out" is an unhelpful opinion.

That's just it... it *can* be a big win. But when it's a loss, you're index-scanning a 20M+ row table for no reason. We got around it, obviously, but it was a definite surprise when a query that normally runs in 0.5ms randomly and inexplicably runs for 4-120s. This is a disaster for a feed loader chewing through a few ten-thousand entries. But that's just me grousing about not having query hints or being able to tell Postgres to never, ever, ever index-scan certain tables. :)

-- Shaun Thomas Database Administrator Leapfrog Online 807 Greenwood Street Evanston, IL 60201 Tel. 847-440-8253 Fax. 847-570-5750 www.leapfrogonline.com
Re: [PERFORM] multiple joins + Order by + LIMIT query performance issue
Antoine Baudoux wrote:
> Here is the explain analyse for the first query, the other is still running...
>
> explain analyse select * from t_Event event
> inner join t_Service service on event.service_id=service.id
> inner join t_System system on service.system_id=system.id
> inner join t_Interface interface on system.id=interface.system_id
> inner join t_Network network on interface.network_id=network.id
> where (network.customer_id=1)
> order by event.c_date desc limit 25
>
> Limit  (cost=11761.44..11761.45 rows=1 width=976) (actual time=0.047..0.047 rows=0 loops=1)
>   ->  Sort  (cost=11761.44..11761.45 rows=1 width=976) (actual time=0.045..0.045 rows=0 loops=1)
>         Sort Key: event.c_date
>         Sort Method: quicksort  Memory: 17kB
>         ->  Nested Loop  (cost=0.00..11761.43 rows=1 width=976) (actual time=0.024..0.024 rows=0 loops=1)
>               ->  Nested Loop  (cost=0.00..11755.15 rows=1 width=960) (actual time=0.024..0.024 rows=0 loops=1)
>                     ->  Nested Loop  (cost=0.00..191.42 rows=1 width=616) (actual time=0.024..0.024 rows=0 loops=1)
>                           Join Filter: (interface.system_id = service.system_id)
>                           ->  Nested Loop  (cost=0.00..9.29 rows=1 width=576) (actual time=0.023..0.023 rows=0 loops=1)
>                                 ->  Seq Scan on t_network network  (cost=0.00..1.01 rows=1 width=18) (actual time=0.009..0.009 rows=1 loops=1)
>                                       Filter: (customer_id = 1)
>                                 ->  Index Scan using interface_network_id_idx on t_interface interface  (cost=0.00..8.27 rows=1 width=558) (actual time=0.011..0.011 rows=0 loops=1)
>                                       Index Cond: (interface.network_id = network.id)
>                           ->  Seq Scan on t_service service  (cost=0.00..109.28 rows=5828 width=40) (never executed)
>                     ->  Index Scan using event_svc_id_idx on t_event event  (cost=0.00..11516.48 rows=3780 width=344) (never executed)
>                           Index Cond: (event.service_id = service.id)
>               ->  Index Scan using t_system_pkey on t_system system  (cost=0.00..6.27 rows=1 width=16) (never executed)
>                     Index Cond: (system.id = service.system_id)
> Total runtime: 0.362 ms

Are the queries even returning the same results (except for the extra columns coming from t_network)? It looks like in this version, the network-interface join is performed first, which returns zero rows, so the rest of the joins don't need to be performed at all. That's why it's fast.

Which version of PostgreSQL is this, BTW?

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [PERFORM] multiple joins + Order by + LIMIT query performance issue
Shaun Thomas <[EMAIL PROTECTED]> writes:
> I'm not sure what causes this, but the problem with indexes is that they're not necessarily in the order you want unless you also cluster them, so a backwards index scan is almost always the wrong answer.

Whether the scan is forwards or backwards has nothing to do with it. The planner is using the index ordering to avoid having to do a full-table scan and sort. It's essentially betting that it will find 25 (or whatever your LIMIT is) rows that satisfy the other query conditions soon enough in the index scan to make this faster than the full-scan approach. If there are a lot fewer matching rows than it expects, or if the target rows aren't uniformly scattered in the index ordering, then this way can be a loss; but when it's a win it can be a big win, too, so "it's a bug, take it out" is an unhelpful opinion.

If a misestimate of this kind is bugging you enough that you're willing to change the query, I think you can fix it like this:

select ... from foo order by x limit n;
=>
select ... from (select ... from foo order by x) ss limit n;

The subselect will be planned without awareness of the LIMIT, so you should get a plan using a sort rather than one that bets on the LIMIT being reached quickly. regards, tom lane
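Applied to the slow query earlier in this thread, the rewrite would look something like the following sketch, using the table and column names from Antoine's post:

select * from (
    select event.*
    from t_Event event
    inner join t_Service service on event.service_id = service.id
    inner join t_System system on service.system_id = system.id
    inner join t_Interface interface on system.id = interface.system_id
    where interface.network_id = 1
    order by event.c_date desc
) ss
limit 25;

-- Because the inner select is planned with no knowledge of the LIMIT,
-- the planner joins and sorts normally instead of betting that a
-- backward scan of the c_date index will hit 25 matches quickly.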
Re: [PERFORM] need to speed up query
PFC wrote:
>> What is a "period"? Is it a month, or something more "custom"? Can periods overlap?
> No, periods can never overlap. If they did you would be in violation of many tax laws around the world. Plus you would not know how much money you are making or losing.
>> I was wondering if you'd be using the same query to compute how much was gained every month and every week, which would have complicated things. But now it's clear.
> To make this really funky you can have a Fiscal Calendar year start June 15 2008 and end on June 14 2009
>> Don't you just love those guys? Always trying new tricks to make your life more interesting.

That's been around a long time; you can go back a few hundred years.

>> Note that here you are scanning the entire table multiple times; the complexity of this is basically (rows in gltrans)^2, which is something you'd like to avoid.
> For accounting purposes you need to know the Beginning Balances, Debits, Credits, the Difference between Debits and Credits, and the Ending Balance for each account. We have 133 accounts and presently 12 periods defined, so we end up with 1596 rows returned for this query.
>> Alright, I propose a solution which only works when periods don't overlap. It will scan the entire table, but only once, not many times as your current query does.
> So period 1 should for the most part have zero for Beginning Balances for most types of accounts. Period 2's Beginning Balance is Period 1's Ending Balance, Period 3's is Period 2's ending balance, and so on forever.
>> Precisely. So it is not necessary to recompute everything for each period. Use the previous period's ending balance as the current period's starting balance... There are several ways to do this. First, you could use your current query, but only compute the sum of what happened during a period, for each period, and store that in a temporary table. Then you use a plpgsql function, or you do that in your client: you take the rows in chronological order, you sum them as they come, and you get your balances. Use a NUMERIC type, not a FLOAT, to avoid rounding errors. The other solution does the same thing but optimizes the first step like this: INSERT INTO temp_table SELECT period, sum(...) GROUP BY period. To do this you must be able to compute the period from the date and not the other way around. You could store a period_id in your table, or use a function. Another much more efficient solution would be to have a summary table which keeps the summary data for each period, with beginning balance and end balance. This table will only need to be updated when someone finds an old receipt in their pocket or something.

As I posted earlier, the software did do this, but it has so many bugs elsewhere in the code that the summary was allowed to drift out of balance from what was really happening. I spent several weeks trying to get it working and find all the places it went wrong. I gave up and wrote this query, which took a day to write and to balance to the point that I turned it over to the accountant. I redid the front end, and I'm off to the races fixing other critical problems. All I need to do is take Shaun Thomas's code and replace the view this select statement creates.

>> This falls under stupid questions, and I'm just curious: what do other people think makes a query complex?
> I have some rather complex queries which postgres burns in a few milliseconds. You could define complexity as the amount of brain sweat that went into writing that query. You could also define complexity as O(n) or O(n^2) etc.; for instance, your query (as written) is O(n^2), which is something you don't want. I've seen stuff that was O(2^n) or worse, O(n!), in software written by drunk students; in that case getting rid of it is an emergency...

Thanks for your help and ideas. I really appreciate it.
Re: [PERFORM] Possible Redundancy/Performance Solution
On Tue, 6 May 2008, Dennis Muhlestein wrote:
> First, I'd replace our SATA hard drives with a SCSI controller and two SCSI hard drives that run RAID 0 (probably running the OS and logs on the original SATA drive).

RAID0 on two disks makes a disk failure that will wipe out the database twice as likely. If your goal is better reliability, you want some sort of RAID1, which you can do with two disks. That should increase read throughput a bit (not quite double though) while keeping write throughput about the same. If you added four disks, then you could do a RAID1+0 combination which should substantially outperform your existing setup in every respect while also being more resilient to drive failure.

> Our applications are mostly read intensive. I don't think that having two databases on one machine, where previously we had just one, would add too much of an impact, especially if we use the load balance feature of pgpool as well as the redundancy feature.

A lot depends on how much RAM you've got and whether it's enough to keep the cache hit rate fairly high here. A reasonable thing to consider here is doing a round of standard performance tuning on the servers to make sure they're operating efficiently before increasing their load.

> Can anyone comment on any gotchas or issues we might encounter?

Getting writes to replicate to multiple instances of the database usefully is where all the really nasty gotchas are in this area. Starting with that part and working your way back toward the front-end pooling from there should crash you into the hard parts early in the process.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PERFORM] What constitutes a complex query
On Tue, May 6, 2008 at 11:23 AM, Justin <[EMAIL PROTECTED]> wrote:
> Craig James wrote:
>> Justin wrote:
>>> This falls under stupid questions, and I'm just curious: what do other people think makes a query complex?
>> There are two kinds:
>> 1. Hard for Postgres to get the answer.
> this one

Sometimes postgresql makes a bad choice even on simple queries, so it's hard to say which ones postgresql tends to get wrong. Plus the query planner is under constant improvement, thanks to the folks who find poor planner choices and to Tom for making the changes.
Re: [PERFORM] multiple joins + Order by + LIMIT query performance issue
Thanks a lot for your answer; there are some points I didn't understand.

On May 6, 2008, at 6:43 PM, Shaun Thomas wrote:
> The second query says "Awesome! Only one network... I can just search the index of t_event backwards for this small result set!"

Shouldn't it be the opposite? Considering that only a few rows must be "joined" (sorry, but I'm not familiar with DBMS terms) with the t_event table, why not simply look up the corresponding rows in the t_event table using the service_id foreign key, then do the sort? Isn't the planner fooled by the index on the sorting column? If I remove the index the query runs OK.

> But here's the rub... try your query *without* the limit clause, and you may find it's actually faster, because the planner suddenly thinks it will have to scan the whole table, so it chooses an alternate plan (probably back to the nest-loop). Alternatively, take off the order-by clause, and it'll remove the slow backwards index-scan.

You are right: if I remove the order-by clause it doesn't do the backward index scan. And if I remove the limit and keep the order-by clause, the backward index scan is gone too, and the query runs in a few milliseconds!! This is crazy, so simply by adding a LIMIT to a query, the planning is changed in a very bad way. Does the planner use the LIMIT as a sort of hint?

Thank you for your explanations,

Antoine Baudoux
Re: [PERFORM] What constitutes a complex query
Craig James wrote:
> Justin wrote:
>> This falls under stupid questions, and I'm just curious: what do other people think makes a query complex?
> There are two kinds:
> 1. Hard for Postgres to get the answer.

this one

> 2. Hard for a person to comprehend.
> Which do you mean? Craig
Re: [PERFORM] need to speed up query
It worked. It had a couple of missing parts, but it worked and ran in 3.3 seconds. Thanks for this! I need to review the result and balance it against my numbers, as the accountant already went through and balanced some accounts by hand to verify my results.

> You might want to consider a denormalized summary table that contains this information (and maybe more) maintained by a trigger or regularly invoked stored-procedure and then you can select from *that* with much less agony.

I just dumped the summary table because it kept getting out of balance all the time and was missing accounts that did not have transactions in them for a given period. Again, I did not lay out the table or write the old code, which was terrible and did not work correctly. I tried several times to fix the summary table, but too many things allowed it to get out of sync. Keeping the ending and beginning balances correct was too much trouble, and I needed to get numbers we can trust to the accountant.

The developers of the code got credits and debits backwards, so instead of fixing the code they just added code to flip the values on the front end. It's really annoying. At this point, if I could go back seven months, I would not have purchased this software, knowing what I know now. I've had to make all kinds of changes I never intended to make in order to get the stuff to balance and agree. I've spent the last 3 months in code review fixing things that allow accounts to get out of balance, and stopping stupid things from happening, like posting GL transactions into non-existent accounting periods. The list of things I have to fix is getting damn long.
[PERFORM] Possible Redundancy/Performance Solution
Right now, we have a few servers that host our databases. None of them are redundant. Each hosts databases for one or more applications. Things work reasonably well, but I'm worried about the availability of some of the sites. Our hardware is 3-4 years old at this point, and I'm not naive to the possibility of drives, memory, motherboards or whatever failing. I'm toying with the idea of adding a little redundancy and maybe some performance to our setup.

First, I'd replace our SATA hard drives with a SCSI controller and two SCSI hard drives that run RAID 0 (probably running the OS and logs on the original SATA drive). Then I'd run the previous two databases on one cluster of two servers, with pgpool in front (using the redundancy feature of pgpool). Our applications are mostly read intensive. I don't think that having two databases on one machine, where previously we had just one, would add too much of an impact, especially if we use the load balance feature of pgpool as well as the redundancy feature.

Can anyone comment on any gotchas or issues we might encounter? Do you think this strategy has the possibility to accomplish what I'm originally setting out to do? TIA -Dennis
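For reference, the pgpool side of such a setup is driven by a handful of pgpool.conf settings. A minimal sketch is below; the host names are invented, and this assumes pgpool-II's replication and load-balancing modes (check the documentation for the version in use before relying on it):

replication_mode = true       # duplicate writes to every backend
load_balance_mode = true      # spread SELECTs across backends
backend_hostname0 = 'db1'     # hypothetical first node
backend_port0 = 5432
backend_hostname1 = 'db2'     # hypothetical second node
backend_port1 = 5432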
Re: [PERFORM] What constitutes a complex query
On Tue, May 6, 2008 at 9:41 AM, Scott Marlowe <[EMAIL PROTECTED]> wrote:
> I'd say that the use of correlated subqueries qualifies a query as complicated. Joining on non-usual pk-fk stuff. The more you're mashing one set of data against another, and the odder the way you have to do it, the more complex the query becomes.

I would add that data-analysis queries with multiple levels of aggregation can be complicated also. For example, in a table of racer times, find the average time for each team, counting only teams that have more than four members, and produce an ordered list ranking each team by its average time.

-- Regards, Richard Broersma Jr. Visit the Los Angeles PostgreSQL Users Group (LAPUG) http://pugs.postgresql.org/lapug
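As an illustration, that example might be written like this (a sketch only; the racer_times table and its columns are invented for the purpose):

select team_name,
       avg(finish_time) as avg_time
from racer_times
group by team_name
having count(*) > 4          -- only teams with more than four members
order by avg_time;           -- output order gives each team's ranking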
Re: [PERFORM] What constitutes a complex query
On May 6, 2008, at 8:45 AM, Justin wrote:
> This falls under stupid questions, and I'm just curious: what do other people think makes a query complex?

If I know in advance exactly how the planner will plan the query (and am right), it's a simple query. Otherwise it's a complex query. As I get a better feel for the planner, some queries that used to be complex become simple. :) Cheers, Steve
Re: [PERFORM] need to speed up query
On Tue, 2008-05-06 at 03:01 +0100, Justin wrote:
> i've had to write queries to get trial balance values out of the GL transaction table and i'm not happy with their performance

Go ahead and give this a try:

SELECT p.period_id, p.period_start, p.period_end, a.accnt_id,
       a.accnt_number, a.accnt_descrip, p.period_yearperiod_id,
       a.accnt_type,
       SUM(CASE WHEN g.gltrans_date < p.period_start
                THEN g.gltrans_amount ELSE 0.0
           END)::text::money AS beginbalance,
       SUM(CASE WHEN g.gltrans_date < p.period_end
                 AND g.gltrans_date >= p.period_start
                 AND g.gltrans_amount <= 0::numeric
                THEN g.gltrans_amount ELSE 0.0
           END)::text::money AS negative,
       SUM(CASE WHEN g.gltrans_date <= p.period_end
                 AND g.gltrans_date >= p.period_start
                 AND g.gltrans_amount >= 0::numeric
                THEN g.gltrans_amount ELSE 0.0
           END)::text::money AS positive,
       SUM(CASE WHEN g.gltrans_date <= p.period_end
                 AND g.gltrans_date >= p.period_start
                THEN g.gltrans_amount ELSE 0.0
           END)::text::money AS difference,
       SUM(CASE WHEN g.gltrans_date <= p.period_end
                THEN g.gltrans_amount ELSE 0.0
           END)::text::money AS endbalance
  FROM period p
 CROSS JOIN accnt a
  LEFT JOIN gltrans g
         ON (g.gltrans_accnt_id = a.accnt_id AND g.gltrans_posted = true)
 GROUP BY p.period_id, p.period_start, p.period_end, a.accnt_id,
          a.accnt_number, a.accnt_descrip, p.period_yearperiod_id,
          a.accnt_type
 ORDER BY p.period_id, a.accnt_number;

Depending on how the planner saw your old query, it may have forced several different sequence or index scans to get the information from gltrans. One thing all of your subqueries had in common was a join on the account id and listing only posted transactions. It's still a big gulp, but it's only one gulp.

The other thing I did was that I guessed you added the coalesce clause because the subqueries individually could return null rowsets for various groupings, and you wouldn't want that. This left-join solution only lets it add to your various sums if it matches all the conditions; otherwise it falls through the list of cases until nothing matches. If some of your transactions can have null amounts, you might consider turning g.gltrans_amount into COALESCE(g.gltrans_amount, 0.0) instead. Otherwise, this *might* work; without knowing more about your schema, it's only a guess. I'm a little skeptical about the conditionless cross-join, but whatever.

Either way, by looking at this query, it looks like some year-end summary piece, or an at-a-glance idea of your account standings. The problem you're going to have with this is that there's no way to truly optimize it. One way or another, you're going to incur some combination of three sequence scans or three index scans; if those tables get huge, you're in trouble. You might want to consider a denormalized summary table that contains this information (and maybe more), maintained by a trigger or regularly invoked stored procedure, and then you can select from *that* with much less agony. Then there's fact tables, but that's beyond the scope of this email. ;) Good luck!

-- Shaun Thomas Database Administrator Leapfrog Online 807 Greenwood Street Evanston, IL 60201 Tel. 847-440-8253 Fax. 847-570-5750 www.leapfrogonline.com
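For the summary-table idea at the end, the insert side of such a trigger could look roughly like the sketch below. This is only an illustration: the period_summary table and the period_for() date-to-period helper are invented, the summary rows are assumed to already exist, and real code would also have to handle UPDATE, DELETE, and changes to the posted flag:

CREATE OR REPLACE FUNCTION gltrans_summarize() RETURNS trigger AS $$
BEGIN
    IF NEW.gltrans_posted THEN
        -- fold the new transaction into its account/period bucket
        UPDATE period_summary
           SET balance_delta = balance_delta + NEW.gltrans_amount
         WHERE accnt_id  = NEW.gltrans_accnt_id
           AND period_id = period_for(NEW.gltrans_date);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER gltrans_summarize_trg
    AFTER INSERT ON gltrans
    FOR EACH ROW EXECUTE PROCEDURE gltrans_summarize();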
Re: [PERFORM] multiple joins + Order by + LIMIT query performance issue
On Tue, 2008-05-06 at 16:03 +0100, Antoine Baudoux wrote:
> My understanding is that in the first case the sort is done after all the table joins and filtering, but in the second case ALL the rows in t_event are scanned and sorted before the join.

You've actually run into a problem that's bitten us in the ass a couple of times. The problem with your second query is that it's *too* efficient. You'll notice the first plan uses a bevy of nest-loops, which is very risky if the row estimates are not really, really accurate. The planner says "Hey, customer_id=1 could be several rows in the t_network table, but not too many... I better check them one by one." I've sometimes turned off nest-loops to avoid queries that would run several hours due to mis-estimation, but it looks like yours was just fine.

The second query says "Awesome! Only one network... I can just search the index of t_event backwards for this small result set!" But here's the rub... try your query *without* the limit clause, and you may find it's actually faster, because the planner suddenly thinks it will have to scan the whole table, so it chooses an alternate plan (probably back to the nest-loop). Alternatively, take off the order-by clause, and it'll remove the slow backwards index-scan.

I'm not sure what causes this, but the problem with indexes is that they're not necessarily in the order you want unless you also cluster them, so a backwards index scan is almost always the wrong answer. Personally I consider this a bug, and it's been around since at least the 8.1 tree. The only real answer is that you have a fast version of the query, so try and play with it until it acts the way you want.

-- Shaun Thomas Database Administrator Leapfrog Online 807 Greenwood Street Evanston, IL 60201 Tel. 847-440-8253 Fax. 847-570-5750 www.leapfrogonline.com
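For what it's worth, "turning off nest-loops" above refers to the planner's enable_nestloop setting, which can be flipped for a single session; a sketch:

SET enable_nestloop = off;   -- discourage nested-loop joins
-- run the query that was being mis-planned ...
RESET enable_nestloop;       -- restore the default

Note that the enable_* settings only penalize a plan type rather than forbidding it outright, so they are a diagnostic and last-resort tool, not a substitute for fixing the estimates.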
Re: [PERFORM] multiple joins + Order by + LIMIT query performance issue
Here is the explain analyse for the first query, the other is still running...

explain analyse select * from t_Event event
inner join t_Service service on event.service_id=service.id
inner join t_System system on service.system_id=system.id
inner join t_Interface interface on system.id=interface.system_id
inner join t_Network network on interface.network_id=network.id
where (network.customer_id=1)
order by event.c_date desc limit 25

Limit  (cost=11761.44..11761.45 rows=1 width=976) (actual time=0.047..0.047 rows=0 loops=1)
  ->  Sort  (cost=11761.44..11761.45 rows=1 width=976) (actual time=0.045..0.045 rows=0 loops=1)
        Sort Key: event.c_date
        Sort Method: quicksort  Memory: 17kB
        ->  Nested Loop  (cost=0.00..11761.43 rows=1 width=976) (actual time=0.024..0.024 rows=0 loops=1)
              ->  Nested Loop  (cost=0.00..11755.15 rows=1 width=960) (actual time=0.024..0.024 rows=0 loops=1)
                    ->  Nested Loop  (cost=0.00..191.42 rows=1 width=616) (actual time=0.024..0.024 rows=0 loops=1)
                          Join Filter: (interface.system_id = service.system_id)
                          ->  Nested Loop  (cost=0.00..9.29 rows=1 width=576) (actual time=0.023..0.023 rows=0 loops=1)
                                ->  Seq Scan on t_network network  (cost=0.00..1.01 rows=1 width=18) (actual time=0.009..0.009 rows=1 loops=1)
                                      Filter: (customer_id = 1)
                                ->  Index Scan using interface_network_id_idx on t_interface interface  (cost=0.00..8.27 rows=1 width=558) (actual time=0.011..0.011 rows=0 loops=1)
                                      Index Cond: (interface.network_id = network.id)
                          ->  Seq Scan on t_service service  (cost=0.00..109.28 rows=5828 width=40) (never executed)
                    ->  Index Scan using event_svc_id_idx on t_event event  (cost=0.00..11516.48 rows=3780 width=344) (never executed)
                          Index Cond: (event.service_id = service.id)
              ->  Index Scan using t_system_pkey on t_system system  (cost=0.00..6.27 rows=1 width=16) (never executed)
                    Index Cond: (system.id = service.system_id)
Total runtime: 0.362 ms

On May 6, 2008, at 5:38 PM, Guillaume Smet wrote:
> Antoine, On Tue, May 6, 2008 at 5:03 PM, Antoine Baudoux <[EMAIL PROTECTED]> wrote:
>> "Limit (cost=23981.18..23981.18 rows=1 width=977)"
>> " -> Sort (cost=23981.18..23981.18 rows=1 width=977)"
>> "Sort Key: this_.c_date"
> Can you please provide the EXPLAIN ANALYZE output instead of EXPLAIN? Thanks. -- Guillaume
Re: [PERFORM] What constitutes a complex query
On Tue, May 6, 2008 at 9:45 AM, Justin <[EMAIL PROTECTED]> wrote:
> This falls under stupid questions, and I'm just curious: what do other people think makes a query complex?

Well, as mentioned, there are two kinds. Some queries that look big and ugly are actually just shovelling data with no fancy interactions between sets. Some reporting queries are like this; I've written reporting queries that took up many pages but were really simple in nature, and fast on even older pgsql versions (7.2-7.4).

I'd say that the use of correlated subqueries qualifies a query as complicated. Joining on non-usual pk-fk stuff. The more you're mashing one set of data against another, and the odder the way you have to do it, the more complex the query becomes.
Re: [PERFORM] need to speed up query
What is a "period" ? Is it a month, or something more "custom" ? Can periods overlap ? No periods can never overlap. If the periods did you would be in violation of many tax laws around the world. Plus it you would not know how much money you are making or losing. I was wondering if you'd be using the same query to compute how much was gained every month and every week, which would have complicated things. But now it's clear. To make this really funky you can have a Fiscal Calendar year start June 15 2008 and end on June 14 2009 Don't you just love those guys ? Always trying new tricks to make your life more interesting. Note that here you are scanning the entire table multiple times, the complexity of this is basically (rows in gltrans)^2 which is something you'd like to avoid. For accounting purposes you need to know the Beginning Balances, Debits, Credits, Difference between Debits to Credits and the Ending Balance for each account. We have 133 accounts with presently 12 periods defined so we end up 1596 rows returned for this query. Alright, I propose a solution which only works when periods don't overlap. It will scan the entire table, but only once, not many times as your current query does. So period 1 should have for the most part have Zero for Beginning Balances for most types of Accounts. Period 2 is Beginning Balance is Period 1 Ending Balance, Period 3 is Period 2 ending balance so and so on forever. Precisely. So, it is not necessary to recompute everything for each period. Use the previous period's ending balance as the current period's starting balance... There are several ways to do this. First, you could use your current query, but only compute the sum of what happened during a period, for each period, and store that in a temporary table. Then, you use a plpgsql function, or you do that in your client, you take the rows in chronological order, you sum them as they come, and you get your balances. Use a NUMERIC type, not a FLOAT, to avoid rounding errors. The other solution does the same thing but optimizes the first step like this : INSERT INTO temp_table SELECT period, sum(...) GROUP BY period To do this you must be able to compute the period from the date and not the other way around. You could store a period_id in your table, or use a function. Another much more efficient solution would be to have a summary table which keeps the summary data for each period, with beginning balance and end balance. This table will only need to be updated when someone finds an old receipt in their pocket or something. This falls under the stupid question and i'm just curious what other people think what makes a query complex? I have some rather complex queries which postgres burns in a few milliseconds. You could define complexity as the amount of brain sweat that went into writing that query. You could also define complexity as O(n) or O(n^2) etc, for instance your query (as written) is O(n^2) which is something you don't want, I've seen stuff that was O(2^n) or worse, O(n!) in software written by drunk students, in this case getting rid of it is an emergency... -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] What constitutes a complex query
Justin wrote:
> This falls under stupid questions, and I'm just curious: what do other people think makes a query complex?

There are two kinds:
1. Hard for Postgres to get the answer.
2. Hard for a person to comprehend.
Which do you mean? Craig
[PERFORM] What constitutes a complex query
This falls under stupid questions, and I'm just curious: what do other people think makes a query complex?
Re: [PERFORM] multiple joins + Order by + LIMIT query performance issue
Antoine, On Tue, May 6, 2008 at 5:03 PM, Antoine Baudoux <[EMAIL PROTECTED]> wrote: > "Limit (cost=23981.18..23981.18 rows=1 width=977)" > " -> Sort (cost=23981.18..23981.18 rows=1 width=977)" > "Sort Key: this_.c_date" Can you please provide the EXPLAIN ANALYZE output instead of EXPLAIN? Thanks. -- Guillaume
[PERFORM] multiple joins + Order by + LIMIT query performance issue
Hello, I have a query that runs for hours when joining 4 tables but takes milliseconds when joining one MORE table to the query. I have one big table, t_event (8 million rows), and 4 small tables (t_network, t_system, t_service, t_interface, all < 1000 rows). This query takes a few milliseconds:

[code]
select * from t_Event event
inner join t_Service service on event.service_id=service.id
inner join t_System system on service.system_id=system.id
inner join t_Interface interface on system.id=interface.system_id
inner join t_Network network on interface.network_id=network.id
where (network.customer_id=1)
order by event.c_date desc limit 25

"Limit  (cost=23981.18..23981.18 rows=1 width=977)"
"  ->  Sort  (cost=23981.18..23981.18 rows=1 width=977)"
"        Sort Key: this_.c_date"
"        ->  Nested Loop  (cost=0.00..23981.17 rows=1 width=977)"
"              ->  Nested Loop  (cost=0.00..23974.89 rows=1 width=961)"
"                    ->  Nested Loop  (cost=0.00..191.42 rows=1 width=616)"
"                          Join Filter: (service_s3_.system_id = service1_.system_id)"
"                          ->  Nested Loop  (cost=0.00..9.29 rows=1 width=576)"
"                                ->  Seq Scan on t_network service_s4_  (cost=0.00..1.01 rows=1 width=18)"
"                                      Filter: (customer_id = 1)"
"                                ->  Index Scan using interface_network_id_idx on t_interface service_s3_  (cost=0.00..8.27 rows=1 width=558)"
"                                      Index Cond: (service_s3_.network_id = service_s4_.id)"
"                          ->  Seq Scan on t_service service1_  (cost=0.00..109.28 rows=5828 width=40)"
"                    ->  Index Scan using event_svc_id_idx on t_event this_  (cost=0.00..23681.12 rows=8188 width=345)"
"                          Index Cond: (this_.service_id = service1_.id)"
"              ->  Index Scan using t_system_pkey on t_system service_s2_  (cost=0.00..6.27 rows=1 width=16)"
"                    Index Cond: (service_s2_.id = service1_.system_id)"
[/code]

This one takes HOURS, but I'm joining one table LESS:

[code]
select * from t_Event event
inner join t_Service service on event.service_id=service.id
inner join t_System system on service.system_id=system.id
inner join t_Interface interface on system.id=interface.system_id
where (interface.network_id=1)
order by event.c_date desc limit 25

"Limit  (cost=147.79..2123.66 rows=10 width=959)"
"  ->  Nested Loop  (cost=147.79..2601774.46 rows=13167 width=959)"
"        Join Filter: (service1_.id = this_.service_id)"
"        ->  Index Scan Backward using event_date_idx on t_event this_  (cost=0.00..887080.22 rows=8466896 width=345)"
"        ->  Materialize  (cost=147.79..147.88 rows=9 width=614)"
"              ->  Hash Join  (cost=16.56..147.79 rows=9 width=614)"
"                    Hash Cond: (service1_.system_id = service_s2_.id)"
"                    ->  Seq Scan on t_service service1_  (cost=0.00..109.28 rows=5828 width=40)"
"                    ->  Hash  (cost=16.55..16.55 rows=1 width=574)"
"                          ->  Nested Loop  (cost=0.00..16.55 rows=1 width=574)"
"                                ->  Index Scan using interface_network_id_idx on t_interface service_s3_  (cost=0.00..8.27 rows=1 width=558)"
"                                      Index Cond: (network_id = 1)"
"                                ->  Index Scan using t_system_pkey on t_system service_s2_  (cost=0.00..8.27 rows=1 width=16)"
"                                      Index Cond: (service_s2_.id = service_s3_.system_id)"
[/code]

My understanding is that in the first case the sort is done after all the table joins and filtering, but in the second case ALL the rows in t_event are scanned and sorted before the join. There is an index on the sorting column. If I remove this index, the query runs very fast. But I still need this index for other queries, so I must force the planner to do the sort after the join in the second case. How can I do that?

Thanks a lot for your help,

Antoine
Re: [PERFORM] plan difference between set-returning function with ROWS within IN() and a plain join
>> db=# explain analyse
>> select sum(base_total_val)
>> from sales_invoice
>> where id in (select id from si_credit_tree(8057));
> Did you check whether this query even gives the right answer?

You knew the right answer to that already ;)

> I think you forgot the alias foo(id) in the subselect and it's actually reducing to "where id in (id)", ie, TRUE.

Tricky, but completely obvious once pointed out; that's _exactly_ what was happening.

db=# explain analyse
select sum(base_total_val)
from sales_invoice
where id in (select id from si_credit_tree(8057) foo(id));

                                 QUERY PLAN
-----------------------------------------------------------------------------
 Aggregate  (cost=42.79..42.80 rows=1 width=8) (actual time=0.440..0.441 rows=1 loops=1)
   ->  Nested Loop  (cost=1.31..42.77 rows=5 width=8) (actual time=0.346..0.413 rows=5 loops=1)
         ->  HashAggregate  (cost=1.31..1.36 rows=5 width=4) (actual time=0.327..0.335 rows=5 loops=1)
               ->  Function Scan on si_credit_tree foo  (cost=0.00..1.30 rows=5 width=4) (actual time=0.300..0.306 rows=5 loops=1)
         ->  Index Scan using sales_invoice_pkey on sales_invoice  (cost=0.00..8.27 rows=1 width=12) (actual time=0.006..0.008 rows=1 loops=5)
               Index Cond: (sales_invoice.id = foo.id)
 Total runtime: 0.559 ms

Thanks for the replies! -- Best, Frank.
Re: [PERFORM] Seqscan problem
Vlad Arkhipov <[EMAIL PROTECTED]> writes:
> I've just discovered a problem with a quite simple query. It's really
> confusing me.
> Postgresql 8.3.1, random_page_cost=1.1. All tables were analyzed before
> the query.

What have you got effective_cache_size set to?

regards, tom lane

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
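effective_cache_size is only a planner hint about how much of the data the kernel is likely to be caching, and it can be overridden per session to test its effect on plan choice. A quick sketch using the query from this thread (the 4GB figure is just an example; size it to roughly RAM minus shared_buffers):

[code]
-- Check the current planner assumption about OS-level caching:
SHOW effective_cache_size;

-- Override it for this session only, then re-check the plan:
SET effective_cache_size = '4GB';

EXPLAIN ANALYZE
SELECT i.c, d.r
FROM i JOIN d ON d.cr = i.c
WHERE i.dd BETWEEN '2007-08-01' AND '2007-08-30';
[/code]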
Re: [PERFORM] plan difference between set-returning function with ROWS within IN() and a plain join
Frank van Vugt <[EMAIL PROTECTED]> writes:
> db=# explain analyse
> select sum(base_total_val)
> from sales_invoice
> where id in (select id from si_credit_tree(8057));

Did you check whether this query even gives the right answer? The EXPLAIN output shows that 21703 rows of sales_invoice are being selected, which is a whole lot different than the other behavior.

I think you forgot the alias foo(id) in the subselect and it's actually reducing to "where id in (id)", ie, TRUE.

regards, tom lane

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
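The scoping rule is easy to reproduce: without an alias supplying an "id" column, the inner "id" resolves outward to sales_invoice.id, so the predicate is true for every row. A minimal sketch using the tables from this thread (the row counts in the comments are assumptions based on the plans above):

[code]
-- Without the alias, the subselect has no "id" of its own, so "id" binds
-- to the outer sales_invoice.id and the IN() degenerates to "id IN (id)":
select count(*) from sales_invoice
where id in (select id from si_credit_tree(8057));          -- every row

-- With foo(id), "id" is the function's output column, as intended:
select count(*) from sales_invoice
where id in (select id from si_credit_tree(8057) foo(id));  -- 5 rows here
[/code]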
Re: [PERFORM] need to speed up query
PFC wrote:
>> i've had to write queries to get trial balance values out of the GL
>> transaction table and i'm not happy with its performance
>>
>> The table has 76K rows, growing about 1000 rows per working day, so the
>> performance is not that great. It takes about 20 to 30 seconds to get
>> all the records for the table, and when we limit it to a single
>> accounting period it drops down to 2 seconds.
>
> What is a "period" ? Is it a month, or something more "custom" ?
> Can periods overlap ?

No, periods can never overlap. If they did, you would be in violation of many tax laws around the world, and you would not know how much money you are making or losing.

Generally, yes, an accounting period is a normal calendar month, but you can have 13 periods in a normal calendar year: 52 weeks in a year / 4 weeks in a month = 13 periods, or 13 "months", in a fiscal calendar year. This means that if someone is using a 13-period fiscal accounting year, the start and end dates are offset from a normal calendar. To make this really funky, a fiscal calendar year can start June 15 2008 and end on June 14 2009.

http://en.wikipedia.org/wiki/Fiscal_year

>> COALESCE(( SELECT sum(gltrans.gltrans_amount) AS sum
>>     FROM gltrans
>>     WHERE gltrans.gltrans_date < period.period_start
>>       AND gltrans.gltrans_accnt_id = accnt.accnt_id
>>       AND gltrans.gltrans_posted = true), 0.00)::text::money AS beginbalance,
>
> Note that here you are scanning the entire table multiple times; the
> complexity of this is basically (rows in gltrans)^2, which is something
> you'd like to avoid.

For accounting purposes you need to know the beginning balance, debits, credits, the difference between debits and credits, and the ending balance for each account. We have 133 accounts and presently 12 periods defined, so we end up with 1596 rows returned for this query.

Period 1 should, for the most part, have zero beginning balances for most account types. Period 2's beginning balance is period 1's ending balance, period 3's is period 2's ending balance, and so on forever.

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
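If the (rows in gltrans)^2 behaviour of the correlated COALESCE subqueries becomes a problem, the per-period totals and running balances can be computed in a single scan of gltrans. A sketch, assuming window-function support (PostgreSQL 8.4 or later) and guessing the period table's column names from the quoted query (period_start appears there; period_end is an assumption):

[code]
-- One scan of gltrans: per-period totals via GROUP BY, cumulative
-- (ending) balance via a window over the per-account period totals.
SELECT accnt_id,
       period_id,
       (running - period_total)::text::money AS beginbalance,
       period_total::text::money             AS periodtotal,
       running::text::money                  AS endbalance
FROM (
    SELECT g.gltrans_accnt_id    AS accnt_id,
           p.period_id,
           SUM(g.gltrans_amount) AS period_total,
           SUM(SUM(g.gltrans_amount)) OVER (
               PARTITION BY g.gltrans_accnt_id
               ORDER BY p.period_start
           ) AS running  -- running total up to and including this period
    FROM gltrans g
    JOIN period p ON g.gltrans_date >= p.period_start
                 AND g.gltrans_date <  p.period_end
    WHERE g.gltrans_posted
    GROUP BY g.gltrans_accnt_id, p.period_id, p.period_start
) t
ORDER BY accnt_id, period_id;
[/code]

The beginning balance falls out as the running total minus the current period's total, matching the "period 2 begins where period 1 ended" rule described above. Note that accounts with no activity in a period would need a LEFT JOIN from an account-by-period cross join to keep all 1596 rows.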
Re: [PERFORM] RAID 10 Benchmark with different I/O schedulers (was: Performance increase with elevator=deadline)
On May 5, 2008, at 7:33 PM, Craig James wrote:
> I had the opportunity to do more testing on another new server to see
> whether the kernel's I/O scheduling makes any difference. Conclusion:
> On a battery-backed RAID 10 system, the kernel's I/O scheduling
> algorithm has no effect. This makes sense, since a battery-backed cache
> will supersede any I/O rescheduling that the kernel tries to do.

This goes against my real-world experience here.

> pgbench -i -s 20 -U test
> pgbench -c 10 -t 5 -v -U test

You should use a sample size of 2x RAM to get a more realistic number, or try out my pgiosim tool on pgfoundry, which "sort of" simulates an index scan. I posted numbers from that a month or two ago here.

-- Jeff Trout <[EMAIL PROTECTED]> http://www.stuarthamm.net/ http://www.dellsmartexitin.com/ -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
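For scale: -s 20 is only about 300MB of pgbench data, which fits in cache on any recent server, so a run like the one quoted above mostly exercises the cache rather than the disks. A rough sketch of a 2x-RAM run for a hypothetical 8GB machine (the figures are examples only; a pgbench scale unit is roughly 15MB):

[code]
# ~16GB of data on an 8GB box: 16GB / ~15MB per scale unit ~= 1100
pgbench -i -s 1100 -U test

# longer run so the numbers stabilise: 10 clients, 10000 transactions each
pgbench -c 10 -t 10000 -v -U test
[/code]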
Re: [PERFORM] plan difference between set-returning function with ROWS within IN() and a plain join
> > I'm noticing a difference in planning between a join and an in() clause,
> > before trying to create an independent test-case, I'd like to know if
> > there's an obvious reason why this would be happening:
>
> Is the function STABLE ?

Yep. For the record, even changing it to immutable doesn't make a difference in performance here.

-- Best, Frank. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] Seqscan problem
I've just discovered a problem with a quite simple query. It's really confusing me. Postgresql 8.3.1, random_page_cost=1.1. All tables were analyzed before the query.

EXPLAIN ANALYZE
SELECT i.c, d.r
FROM i
  JOIN d ON d.cr = i.c
WHERE i.dd between '2007-08-01' and '2007-08-30'

Hash Join  (cost=2505.42..75200.16 rows=98275 width=16) (actual time=2728.959..23118.632 rows=93159 loops=1)
  Hash Cond: (d.c = i.c)
  ->  Seq Scan on d d  (cost=0.00..61778.75 rows=5081098 width=16) (actual time=0.075..8859.807 rows=5081098 loops=1)
  ->  Hash  (cost=2226.85..2226.85 rows=89862 width=8) (actual time=416.526..416.526 rows=89473 loops=1)
        ->  Index Scan using i_dd on i  (cost=0.00..2226.85 rows=89862 width=8) (actual time=0.078..237.504 rows=89473 loops=1)
              Index Cond: ((dd >= '2007-08-01'::date) AND (dd <= '2007-08-30'::date))
Total runtime: 23246.640 ms

EXPLAIN ANALYZE
SELECT i.*, d.r
FROM i
  JOIN d ON d.c = i.c
WHERE i.dd between '2007-08-01' and '2007-08-30'

Nested Loop  (cost=0.00..114081.69 rows=98275 width=416) (actual time=0.114..1711.256 rows=93159 loops=1)
  ->  Index Scan using i_dd on i  (cost=0.00..2226.85 rows=89862 width=408) (actual time=0.075..207.574 rows=89473 loops=1)
        Index Cond: ((dd >= '2007-08-01'::date) AND (dd <= '2007-08-30'::date))
  ->  Index Scan using d_uniq on d  (cost=0.00..1.24 rows=2 width=16) (actual time=0.007..0.009 rows=1 loops=89473)
        Index Cond: (d.c = i.c)
Total runtime: 1839.228 ms

And this never happens with a LEFT JOIN:

EXPLAIN ANALYZE
SELECT i.c, d.r
FROM i
  LEFT JOIN d ON d.cr = i.c
WHERE i.dd between '2007-08-01' and '2007-08-30'

Nested Loop Left Join  (cost=0.00..114081.69 rows=98275 width=16) (actual time=0.111..1592.225 rows=93159 loops=1)
  ->  Index Scan using i_dd on i  (cost=0.00..2226.85 rows=89862 width=8) (actual time=0.072..210.421 rows=89473 loops=1)
        Index Cond: ((dd >= '2007-08-01'::date) AND (dd <= '2007-08-30'::date))
  ->  Index Scan using d_uniq on d  (cost=0.00..1.24 rows=2 width=16) (actual time=0.007..0.009 rows=1 loops=89473)
        Index Cond: (d.c = i.c)
Total runtime: 1720.185 ms

d_uniq is a unique index on d(r, ...).

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] plan difference between set-returning function with ROWS within IN() and a plain join
On Tue, 06 May 2008 10:21:43 +0200, Frank van Vugt <[EMAIL PROTECTED]> wrote:
> L.S.
>
> I'm noticing a difference in planning between a join and an in() clause,
> before trying to create an independent test-case, I'd like to know if
> there's an obvious reason why this would be happening:

Is the function STABLE ?

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] plan difference between set-returning function with ROWS within IN() and a plain join
L.S.

I'm noticing a difference in planning between a join and an in() clause; before trying to create an independent test-case, I'd like to know if there's an obvious reason why this would be happening:

=> the relatively simple PLPGSQL si_credit_tree() function has 'ROWS 5' in its definition

df=# select version();
                                version
------------------------------------------------------------------------
 PostgreSQL 8.3.1 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.1.2
(1 row)

db=# explain analyse
    select sum(si.base_total_val)
    from sales_invoice si, si_credit_tree(8057) foo(id)
    where si.id = foo.id;
                                    QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=42.73..42.74 rows=1 width=8) (actual time=0.458..0.459 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..42.71 rows=5 width=8) (actual time=0.361..0.429 rows=5 loops=1)
         ->  Function Scan on si_credit_tree foo  (cost=0.00..1.30 rows=5 width=4) (actual time=0.339..0.347 rows=5 loops=1)
         ->  Index Scan using sales_invoice_pkey on sales_invoice si  (cost=0.00..8.27 rows=1 width=12) (actual time=0.006..0.008 rows=1 loops=5)
               Index Cond: (si.id = foo.id)
 Total runtime: 0.562 ms

db=# explain analyse
    select sum(base_total_val)
    from sales_invoice
    where id in (select id from si_credit_tree(8057));
                                    QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=15338.31..15338.32 rows=1 width=8) (actual time=3349.401..3349.402 rows=1 loops=1)
   ->  Seq Scan on sales_invoice  (cost=0.00..15311.19 rows=10846 width=8) (actual time=0.781..3279.046 rows=21703 loops=1)
         Filter: (subplan)
         SubPlan
           ->  Function Scan on si_credit_tree  (cost=0.00..1.30 rows=5 width=0) (actual time=0.146..0.146 rows=1 loops=21703)
 Total runtime: 3349.501 ms

I'd hoped the planner would use the ROWS=5 knowledge a bit better:

db=# explain analyse
    select sum(base_total_val)
    from sales_invoice
    where id in (8057,8058,8059,80500010,80500011);
                                    QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=40.21..40.22 rows=1 width=8) (actual time=0.105..0.106 rows=1 loops=1)
   ->  Bitmap Heap Scan on sales_invoice  (cost=21.29..40.19 rows=5 width=8) (actual time=0.061..0.070 rows=5 loops=1)
         Recheck Cond: (id = ANY ('{8057,8058,8059,80500010,80500011}'::integer[]))
         ->  Bitmap Index Scan on sales_invoice_pkey  (cost=0.00..21.29 rows=5 width=0) (actual time=0.049..0.049 rows=5 loops=1)
               Index Cond: (id = ANY ('{8057,8058,8059,80500010,80500011}'::integer[]))
 Total runtime: 0.201 ms

-- Best, Frank. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
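For reference, the rows=5 estimate in the Function Scan nodes above comes straight from the ROWS clause on the function definition. A hypothetical skeleton (the real body of si_credit_tree is not shown in this thread):

[code]
-- Hypothetical skeleton only; ROWS 5 is what feeds the planner's
-- rows=5 estimate for Function Scan nodes.
CREATE OR REPLACE FUNCTION si_credit_tree(root_id integer)
RETURNS SETOF integer AS $$
BEGIN
    RETURN NEXT root_id;  -- placeholder: a real version would walk the credit tree
    RETURN;
END;
$$ LANGUAGE plpgsql STABLE ROWS 5;
[/code]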