Re: [PERFORM] pgfouine - commit details?
We are shipping the postgres.log to a remote syslog repository to take the I/O burden off our postgresql server. As such, if we set log_min_duration_statement to 0, this allows us to get more detailed information about our commits using pgfouine...correct? -- Josh

From: Guillaume Smet [mailto:[EMAIL PROTECTED]] Sent: Tue 5/6/2008 7:31 PM To: Josh Cole Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] pgfouine - commit details?

Josh, On Tue, May 6, 2008 at 11:10 PM, Josh Cole <[EMAIL PROTECTED]> wrote:
> We are using pgfouine to try to optimize our database. Is there a way to have pgfouine show examples or break out commits?

I hesitated over this idea but ended up not implementing it. The problem is that you often don't log everything, relying on log_min_duration_statement instead, and thus you don't have all the queries of the transaction in your log file (and you usually don't have the BEGIN; command in the logs). -- Guillaume
Re: [PERFORM] Possible Redundancy/Performance Solution
On Tue, 6 May 2008, Dennis Muhlestein wrote:
> Those are good points. So you'd go ahead and add the pgpool in front (or another redundancy approach), but then use RAID 1, 5, or perhaps 10 on each server?

Right. I don't advise using the fact that you've got some sort of replication going as an excuse to reduce the reliability of individual systems, particularly in the area of disks (unless you're really creating a much larger number of replicas than 2). RAID5 can be problematic compared to other RAID setups for write-heavy workloads with small blocks, and it should be avoided for database use. You can find stories on this subject in the archives here, and some of the papers at http://www.baarf.com/ go over why; "Is RAID 5 Really a Bargain?" is the one I like best.

If you were thinking about 4 or more disks, there are a number of ways to distribute those:
1) RAID1+0 to make one big volume
2) RAID1 for OS/apps/etc, RAID1 for database
3) RAID1 for OS+xlog, RAID1 for database
4) RAID1 for OS+popular tables, RAID1 for rest of database

Exactly which of these splits is best depends on your application and the tradeoffs important to you, but any of these should improve performance and reliability over what you're doing now. I personally tend to create two distinct volumes rather than using any striping here, create a tablespace or three right from the start, and then manage the underlying mapping to disk with symbolic links so I can shift the allocation around. That does require a steady hand and good nerves for when you screw up, so I wouldn't recommend it to everyone. As you get more disks it gets less practical to handle things this way, and it becomes increasingly sensible to just make one big array out of them and stop worrying about it.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
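To make the tablespace-plus-symlink approach concrete, here is a minimal sketch under assumed names: the volumes, directories, and tablespace below are all hypothetical, and the symlink step is done with the server stopped.

-- put a hot tablespace on the second RAID1 volume
CREATE TABLESPACE hot_ts LOCATION '/vol2/pgdata/hot_ts';
ALTER TABLE busy_table SET TABLESPACE hot_ts;

# later, shift the underlying storage without touching the catalogs:
pg_ctl stop
mv /vol2/pgdata/hot_ts /vol1/pgdata/hot_ts
ln -s /vol1/pgdata/hot_ts /vol2/pgdata/hot_ts
pg_ctl start

The point of the indirection is that the database only knows the original path; the symlink lets you rebalance disk allocation later without any catalog changes.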
Re: [PERFORM] Possible Redundancy/Performance Solution
On Tue, May 6, 2008 at 3:39 PM, Dennis Muhlestein <[EMAIL PROTECTED]> wrote:
> Those are good points. So you'd go ahead and add the pgpool in front (or another redundancy approach), but then use RAID 1, 5, or perhaps 10 on each server?

That's what I'd do: specifically, RAID10 for small to medium drive sets used for transactional stuff, and RAID6 for very large reporting databases that are mostly read.
Re: [PERFORM] Seqscan problem
Tom Lane writes:
> Vlad Arkhipov <[EMAIL PROTECTED]> writes:
>> I've just discovered a problem with a quite simple query. It's really confusing me. PostgreSQL 8.3.1, random_page_cost=1.1. All tables were analyzed before the query.
> What have you got effective_cache_size set to? regards, tom lane

1024M
Re: [PERFORM] pgfouine - commit details?
Josh, On Tue, May 6, 2008 at 11:10 PM, Josh Cole <[EMAIL PROTECTED]> wrote:
> We are using pgfouine to try to optimize our database. Is there a way to have pgfouine show examples or break out commits?

I hesitated over this idea but ended up not implementing it. The problem is that you often don't log everything, relying on log_min_duration_statement instead, and thus you don't have all the queries of the transaction in your log file (and you usually don't have the BEGIN; command in the logs). -- Guillaume
Re: [PERFORM] Possible Redundancy/Performance Solution
Greg Smith wrote:
> On Tue, 6 May 2008, Dennis Muhlestein wrote:
> Since disks are by far the most likely thing to fail, I think it would be bad planning to switch to a design that doubles the chance of a disk failure taking out the server just because you're adding some server-level redundancy. Anybody who's been in this business for a while will tell you that seemingly improbable double failures happen, and if I were you I'd want a plan that survives a) a single disk failure on the primary and b) a single disk failure on the secondary at the same time. Let me strengthen that--I don't feel comfortable unless I'm able to survive a single disk failure on the primary and complete loss of the secondary (say by power supply failure), because a double failure that starts that way is a lot more likely than you might think. Especially with how awful hard drives are nowadays.

Those are good points. So you'd go ahead and add the pgpool in front (or another redundancy approach), but then use RAID 1, 5, or perhaps 10 on each server? -Dennis
[PERFORM] pgfouine - commit details?
We are using pgfouine to try to optimize our database. Is there a way to have pgfouine show examples or break out commits?

Queries that took up the most time:

Rank  Total duration  Times executed  Av. duration (s)  Query
1     26m54s          222,305         0.01              COMMIT;

Perhaps we need to tweak what is being logged by postgresql?

log_destination = 'syslog'
logging_collector = on
log_directory = 'pg_log'
log_truncate_on_rotation = on
log_rotation_age = 1d
syslog_facility = 'LOCAL0'
syslog_ident = 'postgres'
client_min_messages = notice
log_min_messages = notice
log_error_verbosity = default
log_min_error_statement = notice
log_min_duration_statement = 2
#silent_mode = off
debug_print_parse = off
debug_print_rewritten = off
debug_print_plan = off
debug_pretty_print = off
log_checkpoints = off
log_connections = off
log_disconnections = off
log_duration = off
log_hostname = off
#log_line_prefix = ''
log_lock_waits = on
log_statement = 'none'
#log_temp_files = -1
#log_timezone = unknown
#track_activities = on
#track_counts = on
#update_process_title = on
#log_parser_stats = off
#log_planner_stats = off
#log_executor_stats = off
#log_statement_stats = off

Regards, Josh
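A note on the settings above: with log_min_duration_statement = 2, only statements slower than 2 ms reach the log, so most of a transaction's statements are never recorded. A sketch of the change the thread converges on, using standard postgresql.conf parameters (whether the extra log volume is acceptable for this setup is an assumption):

log_min_duration_statement = 0   # log every statement with its duration

This form keeps each statement and its duration on a single log line, which is what pgfouine parses; log_statement = 'all' plus log_duration = on would log the same information but split across lines.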
Re: [PERFORM] RAID 10 Benchmark with different I/O schedulers
Greg Smith wrote:
> On Tue, 6 May 2008, Craig James wrote:
>> I only did two runs of each, which took about 24 minutes. Like the first round of tests, the "noise" in the measurements (about 10%) exceeds the differences between the scheduler algorithms, except that "anticipatory" seems to be measurably slower.
> Those are much better results. Any test that says anticipatory is anything other than useless for database system use with a good controller I presume is broken, so that's how I know you're in the right ballpark now but weren't before. In order to actually get some useful data out of the noise that is pgbench, you need a lot more measurements of longer runs. As perspective, the last time I did something in this area, in order to get enough data to get a clear picture I ran tests for 12 hours. I'm hoping to repeat that soon with some more common hardware that gives useful results I can give out.

This data is good enough for what I'm doing. There were reports from non-RAID users that the I/O scheduler could make as much as a 4x difference in performance (which makes sense for non-RAID), but these tests show me that three of the four I/O schedulers are within 10% of each other. Since this matches my intuition of how battery-backed RAID will work, I'm satisfied. If our servers get overloaded to the point where 10% matters, then I need a much more dramatic solution, like faster machines or more machines. Craig
Re: [PERFORM] Possible Redundancy/Performance Solution
On Tue, 6 May 2008, Dennis Muhlestein wrote:
> I was planning on pgpool being the cushion between the raid0 failure probability and my need for redundancy. This way, I get protection against not only disks, but CPUs, memory, network cards, motherboards, etc. Is this not a reasonable approach?

Since disks are by far the most likely thing to fail, I think it would be bad planning to switch to a design that doubles the chance of a disk failure taking out the server just because you're adding some server-level redundancy. Anybody who's been in this business for a while will tell you that seemingly improbable double failures happen, and if I were you I'd want a plan that survives a) a single disk failure on the primary and b) a single disk failure on the secondary at the same time. Let me strengthen that--I don't feel comfortable unless I'm able to survive a single disk failure on the primary and complete loss of the secondary (say by power supply failure), because a double failure that starts that way is a lot more likely than you might think. Especially with how awful hard drives are nowadays.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PERFORM] RAID 10 Benchmark with different I/O schedulers
On Tue, 6 May 2008, Craig James wrote:
> I only did two runs of each, which took about 24 minutes. Like the first round of tests, the "noise" in the measurements (about 10%) exceeds the differences between the scheduler algorithms, except that "anticipatory" seems to be measurably slower.

Those are much better results. Any test that says anticipatory is anything other than useless for database system use with a good controller I presume is broken, so that's how I know you're in the right ballpark now but weren't before. In order to actually get some useful data out of the noise that is pgbench, you need a lot more measurements of longer runs. As perspective, the last time I did something in this area, in order to get enough data to get a clear picture I ran tests for 12 hours. I'm hoping to repeat that soon with some more common hardware that gives useful results I can give out.

> So it still looks like cfq, noop and deadline are more or less equivalent when used with a battery-backed RAID.

I think it's fair to say they're within 10% of one another on raw throughput. The thing you're not measuring here is worst-case latency, and that's where there might be a more interesting difference. Most tests I've seen suggest deadline is the best in that regard, cfq the worst, and where noop fits in depends on the underlying controller. pgbench produces log files with latency measurements if you pass it "-l". Here's a snippet of shell that runs pgbench then looks at the resulting latency results for the worst 5 numbers:

pgbench ... -l &
p=$!
wait $p
mv pgbench_log.${p} pgbench.log
echo Worst latency results:
cat pgbench.log | cut -f 3 -d " " | sort -n | tail -n 5

However, that may not give you much useful info either--in most cases checkpoint issues kind of swamp the worst-case behavior in PostgreSQL, and to quantify I/O schedulers you need to look at more complicated statistics on latency.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PERFORM] Possible Redundancy/Performance Solution
Greg Smith wrote:
> On Tue, 6 May 2008, Dennis Muhlestein wrote:
> RAID0 on two disks makes a disk failure that will wipe out the database twice as likely. If your goal is better reliability, you want some sort of RAID1, which you can do with two disks. That should increase read throughput a bit (not quite double though) while keeping write throughput about the same.

I was planning on pgpool being the cushion between the raid0 failure probability and my need for redundancy. This way, I get protection against not only disks, but CPUs, memory, network cards, motherboards, etc. Is this not a reasonable approach?

> If you added four disks, then you could do a RAID1+0 combination which should substantially outperform your existing setup in every respect while also being more resilient to drive failure.
>> Our applications are mostly read intensive. I don't think that having two databases on one machine, where previously we had just one, would add too much of an impact, especially if we use the load balance feature of pgpool as well as the redundancy feature.
> A lot depends on how much RAM you've got and whether it's enough to keep the cache hit rate fairly high here. A reasonable thing to consider here is doing a round of standard performance tuning on the servers to make sure they're operating efficiently before increasing their load.
>> Can anyone comment on any gotchas or issues we might encounter?
> Getting writes to replicate to multiple instances of the database usefully is where all the really nasty gotchas are in this area. Starting with that part and working your way back toward the front-end pooling from there should crash you into the hard parts early in the process.

Thanks for the tips! Dennis
Re: [PERFORM] RAID 10 Benchmark with different I/O schedulers
Greg Smith wrote:
> On Mon, 5 May 2008, Craig James wrote:
>> pgbench -i -s 20 -U test
> That's way too low to expect you'll see a difference in I/O schedulers. A scale of 20 is giving you a 320MB database, you can fit the whole thing in RAM and almost all of it on your controller cache. What's there to schedule? You're just moving between buffers that are generally large enough to hold most of what they need.

Test repeated with:
- autovacuum enabled
- database destroyed and recreated between runs
- pgbench -i -s 600 ...
- pgbench -c 10 -t 5 -n ...

I/O Sched      AVG   Test1  Test2
------------   ---   -----  -----
cfq            705    695    715
noop           758    769    747
deadline       741    705    775
anticipatory   494    477    511

I only did two runs of each, which took about 24 minutes. Like the first round of tests, the "noise" in the measurements (about 10%) exceeds the differences between the scheduler algorithms, except that "anticipatory" seems to be measurably slower. So it still looks like cfq, noop and deadline are more or less equivalent when used with a battery-backed RAID. Craig
Re: [PERFORM] multiple joins + Order by + LIMIT query performance issue
On Tue, 2008-05-06 at 18:24 +0100, Antoine Baudoux wrote:
> Isn't the planner fooled by the index on the sorting column? If I remove the index the query runs OK.

In your case, for whatever reason, the stats say doing the index scan on the sorted column will give you the results faster. That isn't always the case, and sometimes you can give the same query different where clauses and that same slow index scan will randomly be fast. It's all based on the index distribution and the particular values being fetched. This goes back to what Tom said: if you know a "miss" can result in terrible performance, it's best to just recode the query to avoid the situation.

> This is crazy, so simply by adding a LIMIT to a query, the planning is changed in a very bad way. Does the planner use the LIMIT as a sort of hint?

Yes. That's actually what tells it the index scan can be a "big win." If it scans the index backwards on values returned from some of your joins, it may just have to find 25 rows and then it can immediately stop scanning and just give you the results. In normal cases, this is a massive performance boost when you have an order clause and are expecting a ton of results (say you're getting the first 25 rows of 1 million or something). But if it would be faster to generate the results and *then* sort, and Postgres thinks otherwise, you're pretty much screwed. But that's the long answer. You have like 3 ways to get around this now, so pick one. ;)

-- Shaun Thomas Database Administrator Leapfrog Online 807 Greenwood Street Evanston, IL 60201 Tel. 847-440-8253 Fax. 847-570-5750 www.leapfrogonline.com
Re: [PERFORM] multiple joins + Order by + LIMIT query performance issue
On Tue, 2008-05-06 at 18:59 +0100, Tom Lane wrote:
> Whether the scan is forwards or backwards has nothing to do with it. The planner is using the index ordering to avoid having to do a full-table scan and sort.

Oh, I know that. I just noticed that when this happened to us, more often than not it was a reverse index scan that did it. The thing that annoyed me most was when it happened on an index where, even on a table having 20M rows, the cardinality is < 10 for almost every value of that index. In our case, having a "LIMIT 1" was much worse than just getting back 5 or 10 rows and throwing away everything after the first one.

> but when it's a win it can be a big win, too, so "it's a bug, take it out" is an unhelpful opinion.

That's just it... it *can* be a big win. But when it's a loss, you're index-scanning a 20M+ row table for no reason. We got around it, obviously, but it was a definite surprise when a query that normally runs in 0.5ms randomly and inexplicably runs for 4-120s. This is a disaster for a feed loader chewing through a few ten-thousand entries. But that's just me grousing about not having query hints or being able to tell Postgres to never, ever, ever index-scan certain tables. :)

-- Shaun Thomas Database Administrator Leapfrog Online 807 Greenwood Street Evanston, IL 60201 Tel. 847-440-8253 Fax. 847-570-5750 www.leapfrogonline.com
Re: [PERFORM] multiple joins + Order by + LIMIT query performance issue
Antoine Baudoux wrote:
> Here is the explain analyse for the first query, the other is still running...
>
> explain analyse select * from t_Event event
> inner join t_Service service on event.service_id=service.id
> inner join t_System system on service.system_id=system.id
> inner join t_Interface interface on system.id=interface.system_id
> inner join t_Network network on interface.network_id=network.id
> where (network.customer_id=1)
> order by event.c_date desc limit 25
>
> Limit  (cost=11761.44..11761.45 rows=1 width=976) (actual time=0.047..0.047 rows=0 loops=1)
>   ->  Sort  (cost=11761.44..11761.45 rows=1 width=976) (actual time=0.045..0.045 rows=0 loops=1)
>         Sort Key: event.c_date
>         Sort Method: quicksort  Memory: 17kB
>         ->  Nested Loop  (cost=0.00..11761.43 rows=1 width=976) (actual time=0.024..0.024 rows=0 loops=1)
>               ->  Nested Loop  (cost=0.00..11755.15 rows=1 width=960) (actual time=0.024..0.024 rows=0 loops=1)
>                     ->  Nested Loop  (cost=0.00..191.42 rows=1 width=616) (actual time=0.024..0.024 rows=0 loops=1)
>                           Join Filter: (interface.system_id = service.system_id)
>                           ->  Nested Loop  (cost=0.00..9.29 rows=1 width=576) (actual time=0.023..0.023 rows=0 loops=1)
>                                 ->  Seq Scan on t_network network  (cost=0.00..1.01 rows=1 width=18) (actual time=0.009..0.009 rows=1 loops=1)
>                                       Filter: (customer_id = 1)
>                                 ->  Index Scan using interface_network_id_idx on t_interface interface  (cost=0.00..8.27 rows=1 width=558) (actual time=0.011..0.011 rows=0 loops=1)
>                                       Index Cond: (interface.network_id = network.id)
>                           ->  Seq Scan on t_service service  (cost=0.00..109.28 rows=5828 width=40) (never executed)
>                     ->  Index Scan using event_svc_id_idx on t_event event  (cost=0.00..11516.48 rows=3780 width=344) (never executed)
>                           Index Cond: (event.service_id = service.id)
>               ->  Index Scan using t_system_pkey on t_system system  (cost=0.00..6.27 rows=1 width=16) (never executed)
>                     Index Cond: (system.id = service.system_id)
> Total runtime: 0.362 ms

Are the queries even returning the same results (except for the extra columns coming from t_network)? It looks like in this version, the network-interface join is performed first, which returns zero rows, so the rest of the joins don't need to be performed at all. That's why it's fast.

Which version of PostgreSQL is this, BTW?

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [PERFORM] multiple joins + Order by + LIMIT query performance issue
Shaun Thomas <[EMAIL PROTECTED]> writes:
> I'm not sure what causes this, but the problem with indexes is that they're not necessarily in the order you want unless you also cluster them, so a backwards index scan is almost always the wrong answer.

Whether the scan is forwards or backwards has nothing to do with it. The planner is using the index ordering to avoid having to do a full-table scan and sort. It's essentially betting that it will find 25 (or whatever your LIMIT is) rows that satisfy the other query conditions soon enough in the index scan to make this faster than the full-scan approach. If there are a lot fewer matching rows than it expects, or if the target rows aren't uniformly scattered in the index ordering, then this way can be a loss; but when it's a win it can be a big win, too, so "it's a bug, take it out" is an unhelpful opinion.

If a misestimate of this kind is bugging you enough that you're willing to change the query, I think you can fix it like this:

select ... from foo order by x limit n;
=>
select ... from (select ... from foo order by x) ss limit n;

The subselect will be planned without awareness of the LIMIT, so you should get a plan using a sort rather than one that bets on the LIMIT being reached quickly. regards, tom lane
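Applied to the slow query earlier in this thread, the rewrite would look something like the following sketch, using the table and column names from Antoine's post:

select * from (
    select event.*
    from t_Event event
    inner join t_Service service on event.service_id = service.id
    inner join t_System system on service.system_id = system.id
    inner join t_Interface interface on system.id = interface.system_id
    where interface.network_id = 1
    order by event.c_date desc
) ss
limit 25;

-- Because the inner select is planned with no knowledge of the LIMIT,
-- the planner joins and sorts normally instead of betting that a
-- backward scan of the c_date index will hit 25 matches quickly.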
Re: [PERFORM] need to speed up query
PFC wrote:
>> What is a "period"? Is it a month, or something more "custom"? Can periods overlap?
> No, periods can never overlap. If they did you would be in violation of many tax laws around the world. Plus you would not know how much money you are making or losing.
>> I was wondering if you'd be using the same query to compute how much was gained every month and every week, which would have complicated things. But now it's clear.
> To make this really funky you can have a Fiscal Calendar year start June 15 2008 and end on June 14 2009
>> Don't you just love those guys? Always trying new tricks to make your life more interesting.

That's been around a long time; you can go back a few hundred years.

>> Note that here you are scanning the entire table multiple times; the complexity of this is basically (rows in gltrans)^2, which is something you'd like to avoid.
> For accounting purposes you need to know the Beginning Balances, Debits, Credits, the Difference between Debits and Credits, and the Ending Balance for each account. We have 133 accounts and presently 12 periods defined, so we end up with 1596 rows returned for this query.
>> Alright, I propose a solution which only works when periods don't overlap. It will scan the entire table, but only once, not many times as your current query does.
> So period 1 should for the most part have zero for Beginning Balances for most types of accounts. Period 2's Beginning Balance is Period 1's Ending Balance, Period 3's is Period 2's ending balance, and so on forever.
>> Precisely. So it is not necessary to recompute everything for each period. Use the previous period's ending balance as the current period's starting balance... There are several ways to do this. First, you could use your current query, but only compute the sum of what happened during a period, for each period, and store that in a temporary table. Then you use a plpgsql function, or you do that in your client: you take the rows in chronological order, you sum them as they come, and you get your balances. Use a NUMERIC type, not a FLOAT, to avoid rounding errors. The other solution does the same thing but optimizes the first step like this: INSERT INTO temp_table SELECT period, sum(...) GROUP BY period. To do this you must be able to compute the period from the date and not the other way around. You could store a period_id in your table, or use a function. Another much more efficient solution would be to have a summary table which keeps the summary data for each period, with beginning balance and end balance. This table will only need to be updated when someone finds an old receipt in their pocket or something.

As I posted earlier, the software did do this, but it has so many bugs elsewhere in the code that the summary was allowed to drift out of balance from what was really happening. I spent several weeks trying to get it working and find all the places it went wrong. I gave up and wrote this query, which took a day to write and to balance to the point that I turned it over to the accountant. I redid the front end, and I'm off to the races fixing other critical problems. All I need to do is take Shaun Thomas's code and replace the view this select statement creates.

>> This falls under stupid questions, and I'm just curious: what do other people think makes a query complex?
> I have some rather complex queries which postgres burns in a few milliseconds. You could define complexity as the amount of brain sweat that went into writing that query. You could also define complexity as O(n) or O(n^2) etc.; for instance, your query (as written) is O(n^2), which is something you don't want. I've seen stuff that was O(2^n) or worse, O(n!), in software written by drunk students; in that case getting rid of it is an emergency...

Thanks for your help and ideas. I really appreciate it.
Re: [PERFORM] Possible Redundancy/Performance Solution
On Tue, 6 May 2008, Dennis Muhlestein wrote:
> First, I'd replace our SATA hard drives with a SCSI controller and two SCSI hard drives that run RAID 0 (probably running the OS and logs on the original SATA drive).

RAID0 on two disks makes a disk failure that will wipe out the database twice as likely. If your goal is better reliability, you want some sort of RAID1, which you can do with two disks. That should increase read throughput a bit (not quite double though) while keeping write throughput about the same. If you added four disks, then you could do a RAID1+0 combination which should substantially outperform your existing setup in every respect while also being more resilient to drive failure.

> Our applications are mostly read intensive. I don't think that having two databases on one machine, where previously we had just one, would add too much of an impact, especially if we use the load balance feature of pgpool as well as the redundancy feature.

A lot depends on how much RAM you've got and whether it's enough to keep the cache hit rate fairly high here. A reasonable thing to consider here is doing a round of standard performance tuning on the servers to make sure they're operating efficiently before increasing their load.

> Can anyone comment on any gotchas or issues we might encounter?

Getting writes to replicate to multiple instances of the database usefully is where all the really nasty gotchas are in this area. Starting with that part and working your way back toward the front-end pooling from there should crash you into the hard parts early in the process.

-- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [PERFORM] What constitutes a complex query
On Tue, May 6, 2008 at 11:23 AM, Justin <[EMAIL PROTECTED]> wrote:
> Craig James wrote:
>> Justin wrote:
>>> This falls under stupid questions, and I'm just curious: what do other people think makes a query complex?
>> There are two kinds:
>> 1. Hard for Postgres to get the answer.
> this one

Sometimes postgresql makes a bad choice even on simple queries, so it's hard to say which ones postgresql tends to get wrong. Plus the query planner is under constant improvement, thanks to the folks who find poor planner choices and to Tom for making the changes.
Re: [PERFORM] multiple joins + Order by + LIMIT query performance issue
Thanks a lot for your answer; there are some points I didn't understand.

On May 6, 2008, at 6:43 PM, Shaun Thomas wrote:
> The second query says "Awesome! Only one network... I can just search the index of t_event backwards for this small result set!"

Shouldn't it be the opposite? Considering that only a few rows must be "joined" (sorry, but I'm not familiar with DBMS terms) with the t_event table, why not simply look up the corresponding rows in the t_event table using the service_id foreign key, then do the sort? Isn't the planner fooled by the index on the sorting column? If I remove the index the query runs OK.

> But here's the rub... try your query *without* the limit clause, and you may find it's actually faster, because the planner suddenly thinks it will have to scan the whole table, so it chooses an alternate plan (probably back to the nest-loop). Alternatively, take off the order-by clause, and it'll remove the slow backwards index-scan.

You are right: if I remove the order-by clause it doesn't do the backward index scan. And if I remove the limit and keep the order-by clause, the backward index scan is gone too, and the query runs in a few milliseconds!! This is crazy, so simply by adding a LIMIT to a query, the planning is changed in a very bad way. Does the planner use the LIMIT as a sort of hint?

Thank you for your explanations,

Antoine Baudoux
Re: [PERFORM] What constitutes a complex query
Craig James wrote:
> Justin wrote:
>> This falls under stupid questions, and I'm just curious: what do other people think makes a query complex?
> There are two kinds:
> 1. Hard for Postgres to get the answer.

this one

> 2. Hard for a person to comprehend.
> Which do you mean? Craig
Re: [PERFORM] need to speed up query
It worked. It had a couple of missing parts, but it worked and ran in 3.3 seconds. Thanks for this! I need to review the result and balance it against my numbers, as the accountant already went through and balanced some accounts by hand to verify my results.

> You might want to consider a denormalized summary table that contains this information (and maybe more) maintained by a trigger or regularly invoked stored-procedure and then you can select from *that* with much less agony.

I just dumped the summary table because it kept getting out of balance all the time and was missing accounts that did not have transactions in them for a given period. Again, I did not lay out the table or write the old code, which was terrible and did not work correctly. I tried several times to fix the summary table, but too many things allowed it to get out of sync. Keeping the ending and beginning balances correct was too much trouble, and I needed to get numbers we can trust to the accountant.

The developers of the code got credits and debits backwards, so instead of fixing the code they just added code to flip the values on the front end. It's really annoying. At this point, if I could go back seven months, I would not have purchased this software, knowing what I know now. I've had to make all kinds of changes I never intended to make in order to get the stuff to balance and agree. I've spent the last 3 months in code review fixing things that allow accounts to get out of balance, and stopping stupid things from happening, like posting GL transactions into non-existent accounting periods. The list of things I have to fix is getting damn long.
[PERFORM] Possible Redundancy/Performance Solution
Right now, we have a few servers that host our databases. None of them are redundant. Each hosts databases for one or more applications. Things work reasonably well, but I'm worried about the availability of some of the sites. Our hardware is 3-4 years old at this point, and I'm not naive to the possibility of drives, memory, motherboards or whatever failing. I'm toying with the idea of adding a little redundancy and maybe some performance to our setup.

First, I'd replace our SATA hard drives with a SCSI controller and two SCSI hard drives that run RAID 0 (probably running the OS and logs on the original SATA drive). Then I'd run the previous two databases on one cluster of two servers, with pgpool in front (using the redundancy feature of pgpool). Our applications are mostly read intensive. I don't think that having two databases on one machine, where previously we had just one, would add too much of an impact, especially if we use the load balance feature of pgpool as well as the redundancy feature.

Can anyone comment on any gotchas or issues we might encounter? Do you think this strategy has the possibility to accomplish what I'm originally setting out to do? TIA -Dennis
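For reference, the pgpool side of such a setup is driven by a handful of pgpool.conf settings. A minimal sketch is below; the host names are invented, and this assumes pgpool-II's replication and load-balancing modes (check the documentation for the version in use before relying on it):

replication_mode = true       # duplicate writes to every backend
load_balance_mode = true      # spread SELECTs across backends
backend_hostname0 = 'db1'     # hypothetical first node
backend_port0 = 5432
backend_hostname1 = 'db2'     # hypothetical second node
backend_port1 = 5432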
Re: [PERFORM] What constitutes a complex query
On Tue, May 6, 2008 at 9:41 AM, Scott Marlowe <[EMAIL PROTECTED]> wrote:
> I'd say that the use of correlated subqueries qualifies a query as complicated. Joining on non-usual pk-fk stuff. The more you're mashing one set of data against another, and the odder the way you have to do it, the more complex the query becomes.

I would add that data-analysis queries with multiple levels of aggregation can be complicated also. For example, in a table of racer times, find the average time for each team, counting only teams that have more than four members, and produce an ordered list ranking each team by its average time.

-- Regards, Richard Broersma Jr. Visit the Los Angeles PostgreSQL Users Group (LAPUG) http://pugs.postgresql.org/lapug
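As an illustration, that example might be written like this (a sketch only; the racer_times table and its columns are invented for the purpose):

select team_name,
       avg(finish_time) as avg_time
from racer_times
group by team_name
having count(*) > 4          -- only teams with more than four members
order by avg_time;           -- output order gives each team's ranking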
Re: [PERFORM] What constitutes a complex query
On May 6, 2008, at 8:45 AM, Justin wrote:
> This falls under stupid questions, and I'm just curious: what do other people think makes a query complex?

If I know in advance exactly how the planner will plan the query (and am right), it's a simple query. Otherwise it's a complex query. As I get a better feel for the planner, some queries that used to be complex become simple. :) Cheers, Steve
Re: [PERFORM] need to speed up query
On Tue, 2008-05-06 at 03:01 +0100, Justin wrote:
> i've had to write queries to get trial balance values out of the GL transaction table and i'm not happy with their performance

Go ahead and give this a try:

SELECT p.period_id, p.period_start, p.period_end, a.accnt_id,
       a.accnt_number, a.accnt_descrip, p.period_yearperiod_id,
       a.accnt_type,
       SUM(CASE WHEN g.gltrans_date < p.period_start
                THEN g.gltrans_amount ELSE 0.0
           END)::text::money AS beginbalance,
       SUM(CASE WHEN g.gltrans_date < p.period_end
                 AND g.gltrans_date >= p.period_start
                 AND g.gltrans_amount <= 0::numeric
                THEN g.gltrans_amount ELSE 0.0
           END)::text::money AS negative,
       SUM(CASE WHEN g.gltrans_date <= p.period_end
                 AND g.gltrans_date >= p.period_start
                 AND g.gltrans_amount >= 0::numeric
                THEN g.gltrans_amount ELSE 0.0
           END)::text::money AS positive,
       SUM(CASE WHEN g.gltrans_date <= p.period_end
                 AND g.gltrans_date >= p.period_start
                THEN g.gltrans_amount ELSE 0.0
           END)::text::money AS difference,
       SUM(CASE WHEN g.gltrans_date <= p.period_end
                THEN g.gltrans_amount ELSE 0.0
           END)::text::money AS endbalance
  FROM period p
 CROSS JOIN accnt a
  LEFT JOIN gltrans g
         ON (g.gltrans_accnt_id = a.accnt_id AND g.gltrans_posted = true)
 GROUP BY p.period_id, p.period_start, p.period_end, a.accnt_id,
          a.accnt_number, a.accnt_descrip, p.period_yearperiod_id,
          a.accnt_type
 ORDER BY p.period_id, a.accnt_number;

Depending on how the planner saw your old query, it may have forced several different sequence or index scans to get the information from gltrans. One thing all of your subqueries had in common was a join on the account id and listing only posted transactions. It's still a big gulp, but it's only one gulp.

The other thing I did was that I guessed you added the coalesce clause because the subqueries individually could return null rowsets for various groupings, and you wouldn't want that. This left-join solution only lets it add to your various sums if it matches all the conditions; otherwise it falls through the list of cases until nothing matches. If some of your transactions can have null amounts, you might consider turning g.gltrans_amount into COALESCE(g.gltrans_amount, 0.0) instead. Otherwise, this *might* work; without knowing more about your schema, it's only a guess. I'm a little skeptical about the conditionless cross-join, but whatever.

Either way, by looking at this query, it looks like some year-end summary piece, or an at-a-glance idea of your account standings. The problem you're going to have with this is that there's no way to truly optimize it. One way or another, you're going to incur some combination of three sequence scans or three index scans; if those tables get huge, you're in trouble. You might want to consider a denormalized summary table that contains this information (and maybe more), maintained by a trigger or regularly invoked stored procedure, and then you can select from *that* with much less agony. Then there's fact tables, but that's beyond the scope of this email. ;) Good luck!

-- Shaun Thomas Database Administrator Leapfrog Online 807 Greenwood Street Evanston, IL 60201 Tel. 847-440-8253 Fax. 847-570-5750 www.leapfrogonline.com
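For the summary-table idea at the end, the insert side of such a trigger could look roughly like the sketch below. This is only an illustration: the period_summary table and the period_for() date-to-period helper are invented, the summary rows are assumed to already exist, and real code would also have to handle UPDATE, DELETE, and changes to the posted flag:

CREATE OR REPLACE FUNCTION gltrans_summarize() RETURNS trigger AS $$
BEGIN
    IF NEW.gltrans_posted THEN
        -- fold the new transaction into its account/period bucket
        UPDATE period_summary
           SET balance_delta = balance_delta + NEW.gltrans_amount
         WHERE accnt_id  = NEW.gltrans_accnt_id
           AND period_id = period_for(NEW.gltrans_date);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER gltrans_summarize_trg
    AFTER INSERT ON gltrans
    FOR EACH ROW EXECUTE PROCEDURE gltrans_summarize();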
Re: [PERFORM] multiple joins + Order by + LIMIT query performance issue
On Tue, 2008-05-06 at 16:03 +0100, Antoine Baudoux wrote:
> My understanding is that in the first case the sort is done after all the table joins and filtering, but in the second case ALL the rows in t_event are scanned and sorted before the join.

You've actually run into a problem that's bitten us in the ass a couple of times. The problem with your second query is that it's *too* efficient. You'll notice the first plan uses a bevy of nest-loops, which is very risky if the row estimates are not really, really accurate. The planner says "Hey, customer_id=1 could be several rows in the t_network table, but not too many... I better check them one by one." I've sometimes turned off nest-loops to avoid queries that would run several hours due to mis-estimation, but it looks like yours was just fine.

The second query says "Awesome! Only one network... I can just search the index of t_event backwards for this small result set!" But here's the rub... try your query *without* the limit clause, and you may find it's actually faster, because the planner suddenly thinks it will have to scan the whole table, so it chooses an alternate plan (probably back to the nest-loop). Alternatively, take off the order-by clause, and it'll remove the slow backwards index-scan.

I'm not sure what causes this, but the problem with indexes is that they're not necessarily in the order you want unless you also cluster them, so a backwards index scan is almost always the wrong answer. Personally I consider this a bug, and it's been around since at least the 8.1 tree. The only real answer is that you have a fast version of the query, so try and play with it until it acts the way you want.

-- Shaun Thomas Database Administrator Leapfrog Online 807 Greenwood Street Evanston, IL 60201 Tel. 847-440-8253 Fax. 847-570-5750 www.leapfrogonline.com
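For what it's worth, "turning off nest-loops" above refers to the planner's enable_nestloop setting, which can be flipped for a single session; a sketch:

SET enable_nestloop = off;   -- discourage nested-loop joins
-- run the query that was being mis-planned ...
RESET enable_nestloop;       -- restore the default

Note that the enable_* settings only penalize a plan type rather than forbidding it outright, so they are a diagnostic and last-resort tool, not a substitute for fixing the estimates.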
Re: [PERFORM] multiple joins + Order by + LIMIT query performance issue
Here is the explain analyse for the first query, the other is still running...

explain analyse select * from t_Event event
inner join t_Service service on event.service_id=service.id
inner join t_System system on service.system_id=system.id
inner join t_Interface interface on system.id=interface.system_id
inner join t_Network network on interface.network_id=network.id
where (network.customer_id=1)
order by event.c_date desc limit 25

Limit  (cost=11761.44..11761.45 rows=1 width=976) (actual time=0.047..0.047 rows=0 loops=1)
  ->  Sort  (cost=11761.44..11761.45 rows=1 width=976) (actual time=0.045..0.045 rows=0 loops=1)
        Sort Key: event.c_date
        Sort Method: quicksort  Memory: 17kB
        ->  Nested Loop  (cost=0.00..11761.43 rows=1 width=976) (actual time=0.024..0.024 rows=0 loops=1)
              ->  Nested Loop  (cost=0.00..11755.15 rows=1 width=960) (actual time=0.024..0.024 rows=0 loops=1)
                    ->  Nested Loop  (cost=0.00..191.42 rows=1 width=616) (actual time=0.024..0.024 rows=0 loops=1)
                          Join Filter: (interface.system_id = service.system_id)
                          ->  Nested Loop  (cost=0.00..9.29 rows=1 width=576) (actual time=0.023..0.023 rows=0 loops=1)
                                ->  Seq Scan on t_network network  (cost=0.00..1.01 rows=1 width=18) (actual time=0.009..0.009 rows=1 loops=1)
                                      Filter: (customer_id = 1)
                                ->  Index Scan using interface_network_id_idx on t_interface interface  (cost=0.00..8.27 rows=1 width=558) (actual time=0.011..0.011 rows=0 loops=1)
                                      Index Cond: (interface.network_id = network.id)
                          ->  Seq Scan on t_service service  (cost=0.00..109.28 rows=5828 width=40) (never executed)
                    ->  Index Scan using event_svc_id_idx on t_event event  (cost=0.00..11516.48 rows=3780 width=344) (never executed)
                          Index Cond: (event.service_id = service.id)
              ->  Index Scan using t_system_pkey on t_system system  (cost=0.00..6.27 rows=1 width=16) (never executed)
                    Index Cond: (system.id = service.system_id)
Total runtime: 0.362 ms

On May 6, 2008, at 5:38 PM, Guillaume Smet wrote:
> Antoine, On Tue, May 6, 2008 at 5:03 PM, Antoine Baudoux <[EMAIL PROTECTED]> wrote:
>> "Limit (cost=23981.18..23981.18 rows=1 width=977)"
>> " -> Sort (cost=23981.18..23981.18 rows=1 width=977)"
>> "Sort Key: this_.c_date"
> Can you please provide the EXPLAIN ANALYZE output instead of EXPLAIN? Thanks. -- Guillaume
Re: [PERFORM] What constitutes a complex query
On Tue, May 6, 2008 at 9:45 AM, Justin <[EMAIL PROTECTED]> wrote:
> This falls under stupid questions, and I'm just curious: what do other people think makes a query complex?

Well, as mentioned, there are two kinds. Some queries that look big and ugly are actually just shovelling data with no fancy interactions between sets. Some reporting queries are like this; I've written reporting queries that took up many pages but were really simple in nature, and fast on even older pgsql versions (7.2-7.4).

I'd say that the use of correlated subqueries qualifies a query as complicated. Joining on non-usual pk-fk stuff. The more you're mashing one set of data against another, and the odder the way you have to do it, the more complex the query becomes.
Re: [PERFORM] need to speed up query
What is a "period" ? Is it a month, or something more "custom" ? Can periods overlap ? No periods can never overlap. If the periods did you would be in violation of many tax laws around the world. Plus it you would not know how much money you are making or losing. I was wondering if you'd be using the same query to compute how much was gained every month and every week, which would have complicated things. But now it's clear. To make this really funky you can have a Fiscal Calendar year start June 15 2008 and end on June 14 2009 Don't you just love those guys ? Always trying new tricks to make your life more interesting. Note that here you are scanning the entire table multiple times, the complexity of this is basically (rows in gltrans)^2 which is something you'd like to avoid. For accounting purposes you need to know the Beginning Balances, Debits, Credits, Difference between Debits to Credits and the Ending Balance for each account. We have 133 accounts with presently 12 periods defined so we end up 1596 rows returned for this query. Alright, I propose a solution which only works when periods don't overlap. It will scan the entire table, but only once, not many times as your current query does. So period 1 should have for the most part have Zero for Beginning Balances for most types of Accounts. Period 2 is Beginning Balance is Period 1 Ending Balance, Period 3 is Period 2 ending balance so and so on forever. Precisely. So, it is not necessary to recompute everything for each period. Use the previous period's ending balance as the current period's starting balance... There are several ways to do this. First, you could use your current query, but only compute the sum of what happened during a period, for each period, and store that in a temporary table. Then, you use a plpgsql function, or you do that in your client, you take the rows in chronological order, you sum them as they come, and you get your balances. Use a NUMERIC type, not a FLOAT, to avoid rounding errors. The other solution does the same thing but optimizes the first step like this : INSERT INTO temp_table SELECT period, sum(...) GROUP BY period To do this you must be able to compute the period from the date and not the other way around. You could store a period_id in your table, or use a function. Another much more efficient solution would be to have a summary table which keeps the summary data for each period, with beginning balance and end balance. This table will only need to be updated when someone finds an old receipt in their pocket or something. This falls under the stupid question and i'm just curious what other people think what makes a query complex? I have some rather complex queries which postgres burns in a few milliseconds. You could define complexity as the amount of brain sweat that went into writing that query. You could also define complexity as O(n) or O(n^2) etc, for instance your query (as written) is O(n^2) which is something you don't want, I've seen stuff that was O(2^n) or worse, O(n!) in software written by drunk students, in this case getting rid of it is an emergency... -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] What constitutes a complex query
Justin wrote:
> This falls under stupid questions, and I'm just curious: what do other people think makes a query complex?

There are two kinds:
1. Hard for Postgres to get the answer.
2. Hard for a person to comprehend.
Which do you mean? Craig
[PERFORM] What constitutes a complex query
This falls under stupid questions, and I'm just curious: what do other people think makes a query complex?
Re: [PERFORM] multiple joins + Order by + LIMIT query performance issue
Antoine, On Tue, May 6, 2008 at 5:03 PM, Antoine Baudoux <[EMAIL PROTECTED]> wrote: > "Limit (cost=23981.18..23981.18 rows=1 width=977)" > " -> Sort (cost=23981.18..23981.18 rows=1 width=977)" > "Sort Key: this_.c_date" Can you please provide the EXPLAIN ANALYZE output instead of EXPLAIN? Thanks. -- Guillaume
[PERFORM] multiple joins + Order by + LIMIT query performance issue
Hello, I have a query that runs for hours when joining 4 tables but takes milliseconds when joining one MORE table to the query. I have one big table, t_event (8 million rows), and 4 small tables (t_network, t_system, t_service, t_interface, all < 1000 rows). This query takes a few milliseconds:

[code]
select * from t_Event event
inner join t_Service service on event.service_id=service.id
inner join t_System system on service.system_id=system.id
inner join t_Interface interface on system.id=interface.system_id
inner join t_Network network on interface.network_id=network.id
where (network.customer_id=1)
order by event.c_date desc limit 25

"Limit  (cost=23981.18..23981.18 rows=1 width=977)"
"  ->  Sort  (cost=23981.18..23981.18 rows=1 width=977)"
"        Sort Key: this_.c_date"
"        ->  Nested Loop  (cost=0.00..23981.17 rows=1 width=977)"
"              ->  Nested Loop  (cost=0.00..23974.89 rows=1 width=961)"
"                    ->  Nested Loop  (cost=0.00..191.42 rows=1 width=616)"
"                          Join Filter: (service_s3_.system_id = service1_.system_id)"
"                          ->  Nested Loop  (cost=0.00..9.29 rows=1 width=576)"
"                                ->  Seq Scan on t_network service_s4_  (cost=0.00..1.01 rows=1 width=18)"
"                                      Filter: (customer_id = 1)"
"                                ->  Index Scan using interface_network_id_idx on t_interface service_s3_  (cost=0.00..8.27 rows=1 width=558)"
"                                      Index Cond: (service_s3_.network_id = service_s4_.id)"
"                          ->  Seq Scan on t_service service1_  (cost=0.00..109.28 rows=5828 width=40)"
"                    ->  Index Scan using event_svc_id_idx on t_event this_  (cost=0.00..23681.12 rows=8188 width=345)"
"                          Index Cond: (this_.service_id = service1_.id)"
"              ->  Index Scan using t_system_pkey on t_system service_s2_  (cost=0.00..6.27 rows=1 width=16)"
"                    Index Cond: (service_s2_.id = service1_.system_id)"
[/code]

This one takes HOURS, but I'm joining one table LESS:

[code]
select * from t_Event event
inner join t_Service service on event.service_id=service.id
inner join t_System system on service.system_id=system.id
inner join t_Interface interface on system.id=interface.system_id
where (interface.network_id=1)
order by event.c_date desc limit 25

"Limit  (cost=147.79..2123.66 rows=10 width=959)"
"  ->  Nested Loop  (cost=147.79..2601774.46 rows=13167 width=959)"
"        Join Filter: (service1_.id = this_.service_id)"
"        ->  Index Scan Backward using event_date_idx on t_event this_  (cost=0.00..887080.22 rows=8466896 width=345)"
"        ->  Materialize  (cost=147.79..147.88 rows=9 width=614)"
"              ->  Hash Join  (cost=16.56..147.79 rows=9 width=614)"
"                    Hash Cond: (service1_.system_id = service_s2_.id)"
"                    ->  Seq Scan on t_service service1_  (cost=0.00..109.28 rows=5828 width=40)"
"                    ->  Hash  (cost=16.55..16.55 rows=1 width=574)"
"                          ->  Nested Loop  (cost=0.00..16.55 rows=1 width=574)"
"                                ->  Index Scan using interface_network_id_idx on t_interface service_s3_  (cost=0.00..8.27 rows=1 width=558)"
"                                      Index Cond: (network_id = 1)"
"                                ->  Index Scan using t_system_pkey on t_system service_s2_  (cost=0.00..8.27 rows=1 width=16)"
"                                      Index Cond: (service_s2_.id = service_s3_.system_id)"
[/code]

My understanding is that in the first case the sort is done after all the table joins and filtering, but in the second case ALL the rows in t_event are scanned and sorted before the join. There is an index on the sorting column. If I remove this index, the query runs very fast. But I still need this index for other queries, so I must force the planner to do the sort after the join in the second case. How can I do that?

Thanks a lot for your help,

Antoine
Re: [PERFORM] plan difference between set-returning function with ROWS within IN() and a plain join
>> db=# explain analyse
>> select sum(base_total_val)
>> from sales_invoice
>> where id in (select id from si_credit_tree(8057));
> Did you check whether this query even gives the right answer?

You knew the right answer to that already ;)

> I think you forgot the alias foo(id) in the subselect and it's actually reducing to "where id in (id)", ie, TRUE.

Tricky, but completely obvious once pointed out; that's _exactly_ what was happening.

db=# explain analyse
select sum(base_total_val)
from sales_invoice
where id in (select id from si_credit_tree(8057) foo(id));

                                 QUERY PLAN
-----------------------------------------------------------------------------
 Aggregate  (cost=42.79..42.80 rows=1 width=8) (actual time=0.440..0.441 rows=1 loops=1)
   ->  Nested Loop  (cost=1.31..42.77 rows=5 width=8) (actual time=0.346..0.413 rows=5 loops=1)
         ->  HashAggregate  (cost=1.31..1.36 rows=5 width=4) (actual time=0.327..0.335 rows=5 loops=1)
               ->  Function Scan on si_credit_tree foo  (cost=0.00..1.30 rows=5 width=4) (actual time=0.300..0.306 rows=5 loops=1)
         ->  Index Scan using sales_invoice_pkey on sales_invoice  (cost=0.00..8.27 rows=1 width=12) (actual time=0.006..0.008 rows=1 loops=5)
               Index Cond: (sales_invoice.id = foo.id)
 Total runtime: 0.559 ms

Thanks for the replies! -- Best, Frank.
Re: [PERFORM] Seqscan problem
Vlad Arkhipov <[EMAIL PROTECTED]> writes:
> I've just discovered a problem with a quite simple query. It's really
> confusing me.
> Postgresql 8.3.1, random_page_cost=1.1. All tables were analyzed before
> the query.

What have you got effective_cache_size set to?

regards, tom lane

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
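effective_cache_size is only a planner hint about how much of the data the kernel is likely to be caching, and it can be overridden per session to test its effect on plan choice. A quick sketch using the query from this thread (the 4GB figure is just an example; size it to roughly RAM minus shared_buffers):

[code]
-- Check the current planner assumption about OS-level caching:
SHOW effective_cache_size;

-- Override it for this session only, then re-check the plan:
SET effective_cache_size = '4GB';

EXPLAIN ANALYZE
SELECT i.c, d.r
FROM i JOIN d ON d.cr = i.c
WHERE i.dd BETWEEN '2007-08-01' AND '2007-08-30';
[/code]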
Re: [PERFORM] plan difference between set-returning function with ROWS within IN() and a plain join
Frank van Vugt <[EMAIL PROTECTED]> writes:
> db=# explain analyse
> select sum(base_total_val)
> from sales_invoice
> where id in (select id from si_credit_tree(8057));

Did you check whether this query even gives the right answer? The EXPLAIN output shows that 21703 rows of sales_invoice are being selected, which is a whole lot different than the other behavior.

I think you forgot the alias foo(id) in the subselect and it's actually reducing to "where id in (id)", ie, TRUE.

regards, tom lane

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
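The scoping rule is easy to reproduce: without an alias supplying an "id" column, the inner "id" resolves outward to sales_invoice.id, so the predicate is true for every row. A minimal sketch using the tables from this thread (the row counts in the comments are assumptions based on the plans above):

[code]
-- Without the alias, the subselect has no "id" of its own, so "id" binds
-- to the outer sales_invoice.id and the IN() degenerates to "id IN (id)":
select count(*) from sales_invoice
where id in (select id from si_credit_tree(8057));          -- every row

-- With foo(id), "id" is the function's output column, as intended:
select count(*) from sales_invoice
where id in (select id from si_credit_tree(8057) foo(id));  -- 5 rows here
[/code]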
Re: [PERFORM] need to speed up query
PFC wrote:
>> i've had to write queries to get trial balance values out of the GL
>> transaction table and i'm not happy with its performance
>>
>> The table has 76K rows, growing about 1000 rows per working day, so the
>> performance is not that great. It takes about 20 to 30 seconds to get
>> all the records for the table, and when we limit it to a single
>> accounting period it drops down to 2 seconds.
>
> What is a "period" ? Is it a month, or something more "custom" ?
> Can periods overlap ?

No, periods can never overlap. If they did, you would be in violation of many tax laws around the world, and you would not know how much money you are making or losing.

Generally, yes, an accounting period is a normal calendar month, but you can have 13 periods in a normal calendar year: 52 weeks in a year / 4 weeks in a month = 13 periods, or 13 "months", in a fiscal calendar year. This means that if someone is using a 13-period fiscal accounting year, the start and end dates are offset from a normal calendar. To make this really funky, a fiscal calendar year can start June 15 2008 and end on June 14 2009.

http://en.wikipedia.org/wiki/Fiscal_year

>> COALESCE(( SELECT sum(gltrans.gltrans_amount) AS sum
>>     FROM gltrans
>>     WHERE gltrans.gltrans_date < period.period_start
>>       AND gltrans.gltrans_accnt_id = accnt.accnt_id
>>       AND gltrans.gltrans_posted = true), 0.00)::text::money AS beginbalance,
>
> Note that here you are scanning the entire table multiple times; the
> complexity of this is basically (rows in gltrans)^2, which is something
> you'd like to avoid.

For accounting purposes you need to know the beginning balance, debits, credits, the difference between debits and credits, and the ending balance for each account. We have 133 accounts and presently 12 periods defined, so we end up with 1596 rows returned for this query.

Period 1 should, for the most part, have zero beginning balances for most account types. Period 2's beginning balance is period 1's ending balance, period 3's is period 2's ending balance, and so on forever.

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
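If the (rows in gltrans)^2 behaviour of the correlated COALESCE subqueries becomes a problem, the per-period totals and running balances can be computed in a single scan of gltrans. A sketch, assuming window-function support (PostgreSQL 8.4 or later) and guessing the period table's column names from the quoted query (period_start appears there; period_end is an assumption):

[code]
-- One scan of gltrans: per-period totals via GROUP BY, cumulative
-- (ending) balance via a window over the per-account period totals.
SELECT accnt_id,
       period_id,
       (running - period_total)::text::money AS beginbalance,
       period_total::text::money             AS periodtotal,
       running::text::money                  AS endbalance
FROM (
    SELECT g.gltrans_accnt_id    AS accnt_id,
           p.period_id,
           SUM(g.gltrans_amount) AS period_total,
           SUM(SUM(g.gltrans_amount)) OVER (
               PARTITION BY g.gltrans_accnt_id
               ORDER BY p.period_start
           ) AS running  -- running total up to and including this period
    FROM gltrans g
    JOIN period p ON g.gltrans_date >= p.period_start
                 AND g.gltrans_date <  p.period_end
    WHERE g.gltrans_posted
    GROUP BY g.gltrans_accnt_id, p.period_id, p.period_start
) t
ORDER BY accnt_id, period_id;
[/code]

The beginning balance falls out as the running total minus the current period's total, matching the "period 2 begins where period 1 ended" rule described above. Note that accounts with no activity in a period would need a LEFT JOIN from an account-by-period cross join to keep all 1596 rows.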
Re: [PERFORM] RAID 10 Benchmark with different I/O schedulers (was: Performance increase with elevator=deadline)
On May 5, 2008, at 7:33 PM, Craig James wrote:
> I had the opportunity to do more testing on another new server to see
> whether the kernel's I/O scheduling makes any difference. Conclusion:
> On a battery-backed RAID 10 system, the kernel's I/O scheduling
> algorithm has no effect. This makes sense, since a battery-backed cache
> will supersede any I/O rescheduling that the kernel tries to do.

This goes against my real-world experience here.

> pgbench -i -s 20 -U test
> pgbench -c 10 -t 5 -v -U test

You should use a sample size of 2x RAM to get a more realistic number, or try out my pgiosim tool on pgfoundry, which "sort of" simulates an index scan. I posted numbers from that a month or two ago here.

-- Jeff Trout <[EMAIL PROTECTED]> http://www.stuarthamm.net/ http://www.dellsmartexitin.com/ -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
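For scale: -s 20 is only about 300MB of pgbench data, which fits in cache on any recent server, so a run like the one quoted above mostly exercises the cache rather than the disks. A rough sketch of a 2x-RAM run for a hypothetical 8GB machine (the figures are examples only; a pgbench scale unit is roughly 15MB):

[code]
# ~16GB of data on an 8GB box: 16GB / ~15MB per scale unit ~= 1100
pgbench -i -s 1100 -U test

# longer run so the numbers stabilise: 10 clients, 10000 transactions each
pgbench -c 10 -t 10000 -v -U test
[/code]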
Re: [PERFORM] plan difference between set-returning function with ROWS within IN() and a plain join
> > I'm noticing a difference in planning between a join and an in() clause,
> > before trying to create an independent test-case, I'd like to know if
> > there's an obvious reason why this would be happening:
>
> Is the function STABLE ?

Yep. For the record, even changing it to immutable doesn't make a difference in performance here.

-- Best, Frank. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] Seqscan problem
I've just discovered a problem with a quite simple query. It's really confusing me. Postgresql 8.3.1, random_page_cost=1.1. All tables were analyzed before the query.

EXPLAIN ANALYZE
SELECT i.c, d.r
FROM i
  JOIN d ON d.cr = i.c
WHERE i.dd between '2007-08-01' and '2007-08-30'

Hash Join  (cost=2505.42..75200.16 rows=98275 width=16) (actual time=2728.959..23118.632 rows=93159 loops=1)
  Hash Cond: (d.c = i.c)
  ->  Seq Scan on d d  (cost=0.00..61778.75 rows=5081098 width=16) (actual time=0.075..8859.807 rows=5081098 loops=1)
  ->  Hash  (cost=2226.85..2226.85 rows=89862 width=8) (actual time=416.526..416.526 rows=89473 loops=1)
        ->  Index Scan using i_dd on i  (cost=0.00..2226.85 rows=89862 width=8) (actual time=0.078..237.504 rows=89473 loops=1)
              Index Cond: ((dd >= '2007-08-01'::date) AND (dd <= '2007-08-30'::date))
Total runtime: 23246.640 ms

EXPLAIN ANALYZE
SELECT i.*, d.r
FROM i
  JOIN d ON d.c = i.c
WHERE i.dd between '2007-08-01' and '2007-08-30'

Nested Loop  (cost=0.00..114081.69 rows=98275 width=416) (actual time=0.114..1711.256 rows=93159 loops=1)
  ->  Index Scan using i_dd on i  (cost=0.00..2226.85 rows=89862 width=408) (actual time=0.075..207.574 rows=89473 loops=1)
        Index Cond: ((dd >= '2007-08-01'::date) AND (dd <= '2007-08-30'::date))
  ->  Index Scan using d_uniq on d  (cost=0.00..1.24 rows=2 width=16) (actual time=0.007..0.009 rows=1 loops=89473)
        Index Cond: (d.c = i.c)
Total runtime: 1839.228 ms

And this never happens with a LEFT JOIN:

EXPLAIN ANALYZE
SELECT i.c, d.r
FROM i
  LEFT JOIN d ON d.cr = i.c
WHERE i.dd between '2007-08-01' and '2007-08-30'

Nested Loop Left Join  (cost=0.00..114081.69 rows=98275 width=16) (actual time=0.111..1592.225 rows=93159 loops=1)
  ->  Index Scan using i_dd on i  (cost=0.00..2226.85 rows=89862 width=8) (actual time=0.072..210.421 rows=89473 loops=1)
        Index Cond: ((dd >= '2007-08-01'::date) AND (dd <= '2007-08-30'::date))
  ->  Index Scan using d_uniq on d  (cost=0.00..1.24 rows=2 width=16) (actual time=0.007..0.009 rows=1 loops=89473)
        Index Cond: (d.c = i.c)
Total runtime: 1720.185 ms

d_uniq is a unique index on d(r, ...).

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] plan difference between set-returning function with ROWS within IN() and a plain join
On Tue, 06 May 2008 10:21:43 +0200, Frank van Vugt <[EMAIL PROTECTED]> wrote:
> L.S.
>
> I'm noticing a difference in planning between a join and an in() clause,
> before trying to create an independent test-case, I'd like to know if
> there's an obvious reason why this would be happening:

Is the function STABLE ?

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] plan difference between set-returning function with ROWS within IN() and a plain join
L.S.

I'm noticing a difference in planning between a join and an in() clause; before trying to create an independent test-case, I'd like to know if there's an obvious reason why this would be happening:

=> the relatively simple PLPGSQL si_credit_tree() function has 'ROWS 5' in its definition

df=# select version();
                                version
------------------------------------------------------------------------
 PostgreSQL 8.3.1 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.1.2
(1 row)

db=# explain analyse
    select sum(si.base_total_val)
    from sales_invoice si, si_credit_tree(8057) foo(id)
    where si.id = foo.id;
                                    QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=42.73..42.74 rows=1 width=8) (actual time=0.458..0.459 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..42.71 rows=5 width=8) (actual time=0.361..0.429 rows=5 loops=1)
         ->  Function Scan on si_credit_tree foo  (cost=0.00..1.30 rows=5 width=4) (actual time=0.339..0.347 rows=5 loops=1)
         ->  Index Scan using sales_invoice_pkey on sales_invoice si  (cost=0.00..8.27 rows=1 width=12) (actual time=0.006..0.008 rows=1 loops=5)
               Index Cond: (si.id = foo.id)
 Total runtime: 0.562 ms

db=# explain analyse
    select sum(base_total_val)
    from sales_invoice
    where id in (select id from si_credit_tree(8057));
                                    QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=15338.31..15338.32 rows=1 width=8) (actual time=3349.401..3349.402 rows=1 loops=1)
   ->  Seq Scan on sales_invoice  (cost=0.00..15311.19 rows=10846 width=8) (actual time=0.781..3279.046 rows=21703 loops=1)
         Filter: (subplan)
         SubPlan
           ->  Function Scan on si_credit_tree  (cost=0.00..1.30 rows=5 width=0) (actual time=0.146..0.146 rows=1 loops=21703)
 Total runtime: 3349.501 ms

I'd hoped the planner would use the ROWS=5 knowledge a bit better:

db=# explain analyse
    select sum(base_total_val)
    from sales_invoice
    where id in (8057,8058,8059,80500010,80500011);
                                    QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=40.21..40.22 rows=1 width=8) (actual time=0.105..0.106 rows=1 loops=1)
   ->  Bitmap Heap Scan on sales_invoice  (cost=21.29..40.19 rows=5 width=8) (actual time=0.061..0.070 rows=5 loops=1)
         Recheck Cond: (id = ANY ('{8057,8058,8059,80500010,80500011}'::integer[]))
         ->  Bitmap Index Scan on sales_invoice_pkey  (cost=0.00..21.29 rows=5 width=0) (actual time=0.049..0.049 rows=5 loops=1)
               Index Cond: (id = ANY ('{8057,8058,8059,80500010,80500011}'::integer[]))
 Total runtime: 0.201 ms

-- Best, Frank. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
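For reference, the rows=5 estimate in the Function Scan nodes above comes straight from the ROWS clause on the function definition. A hypothetical skeleton (the real body of si_credit_tree is not shown in this thread):

[code]
-- Hypothetical skeleton only; ROWS 5 is what feeds the planner's
-- rows=5 estimate for Function Scan nodes.
CREATE OR REPLACE FUNCTION si_credit_tree(root_id integer)
RETURNS SETOF integer AS $$
BEGIN
    RETURN NEXT root_id;  -- placeholder: a real version would walk the credit tree
    RETURN;
END;
$$ LANGUAGE plpgsql STABLE ROWS 5;
[/code]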