Re: [PERFORM] inserting into brand new database faster than old database
Missner, T. R. wrote:
> Hello, I have been a happy postgresql developer for a few years now. Recently I have discovered a very strange phenomenon in regards to inserting rows. My app inserts millions of records a day, averaging about 30 rows a second. I use autovac to make sure my stats and indexes are up to date. Rarely are rows ever deleted. Each day a brand new set of tables is created and eventually the old tables are dropped. The app calls functions which, based on some simple logic, perform the correct inserts.

Have you profiled where the time goes in a brand new schema versus a degraded database? Is it I/O? Is it CPU? Is the function making the decisions becoming a bottleneck?

> The problem I am seeing is that after a particular database gets kinda old, say a couple of months, performance begins to degrade. Even after creating brand new tables my insert speed is slow in comparison (by a factor of 5 or more) with a brand new schema which has the exact same tables. I am running on an IBM 360 dual-processor Linux server with a 100 GB RAID array spanning 5 SCSI disks. The machine has 1 GB of RAM, of which 500 MB is dedicated to PostgreSQL.
>
> Just to be clear, the question I have is: why would a brand new db schema allow inserts faster than an older schema with brand new tables? Since the tables are empty to start, vacuuming should not be an issue at all. Each schema is identical in every way except the db name and creation date.

You can do a few things:

- Get EXPLAIN ANALYZE output. See the difference between actual and projected timings. The difference is the hint about where the planner is going wrong.
- Is I/O your bottleneck? Are vacuums taking longer and longer? If yes, you could try the vacuum delay patch. If your I/O is saturated for any reason, everything is going to crawl.
- Are your indexes bloat-free? If you are using a pre-7.x release, VACUUM does not clean up indexes; you need to REINDEX.
- Have you vacuumed the complete database?
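The checks suggested above could be run like this (a sketch only; `my_table` stands in for one of the day's tables, since the original post does not name them):

```sql
-- Compare planned vs. actual cost for a representative insert
EXPLAIN ANALYZE INSERT INTO my_table VALUES (1, 'sample');

-- Vacuum the whole database, system catalogs included, with per-table output
VACUUM ANALYZE VERBOSE;

-- On pre-7.x installations, rebuild a bloated index explicitly
REINDEX TABLE my_table;
```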
If the catalogs collect dead space, it could cause degradation too, but that is just a guess. Basically, monitor the slow inserts and try to find out where the time is spent. HTH

Shridhar

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [PERFORM] Terrible performance after deleting/recreating indexes
Bill Chandler wrote:
> Hi,
>
> Using PostgreSQL 7.4.2 on Solaris. I'm trying to improve performance on some queries to my databases, so I wanted to try out various index structures. Since I'm going to be running my performance tests repeatedly, I created some SQL scripts to delete and recreate various index configurations. One of the scripts contains the commands for recreating the 'original' index configuration (i.e. the one I've already got some numbers for). Only thing is, now when I delete and recreate the original indexes and then run the query, I'm finding the performance has gone completely down the tubes compared to what I originally had. A query that used to take 5 minutes to complete now takes hours to complete.
>
> For what it's worth my query looks something like:
>
> select * from tbl_1, tbl_2
> where tbl_1.id = tbl_2.id and
>       tbl_2.name like 'x%y%' and
>       tbl_1.x > 1234567890123
> order by tbl_1.x;
>
> tbl_1 is very big (> 2 million rows)
> tbl_2 is relatively small (7000 or so rows)
> tbl_1.x is a numeric(13)
> tbl_1.id and tbl_2.id are integers
> tbl_2.name is a varchar(64)
>
> I've run 'VACUUM ANALYZE' on both tables involved in the query. I also used 'EXPLAIN' and observed that the query plan is completely changed from what it was originally.

Get an EXPLAIN ANALYZE. That gives actual v/s planned time spent. See what is causing the difference. A discrepancy between planned and actual rows is usually an indication of out-of-date stats.

Which are the indexes on these tables? You should list fields with indexes first in the where clause. Also list the most selective field first, so that it eliminates as many rows as possible in the first scan.

I hope you have read the tuning articles on varlena.com and applied some basic tuning. And post the table schema, hardware config, postgresql config (important ones, of course) and EXPLAIN ANALYZE for the queries. That would be something to start with.
Shridhar
[PERFORM] Odd sorting behaviour
[Apologies if this reaches the list twice -- I sent a copy before subscribing, but it seems to be stuck waiting for listmaster forever, so I subscribed and sent it again.]

Hi,

I'm trying to find out why one of my queries is so slow -- I'm primarily using PostgreSQL 7.2 (Debian stable), but I don't really get much better performance with 7.4 (Debian unstable). My prototype table looks like this:

CREATE TABLE opinions (
    prodid INTEGER NOT NULL,
    uid INTEGER NOT NULL,
    opinion INTEGER NOT NULL,
    PRIMARY KEY ( prodid, uid )
);

In addition, there are separate indexes on prodid and uid. I've run VACUUM ANALYZE before all queries, and they are repeatable. (If anybody needs the data, that could be arranged -- it's not secret or anything :-) ) My query looks like this:

EXPLAIN ANALYZE
SELECT o3.prodid, SUM(o3.opinion*o12.correlation) AS total_correlation
FROM opinions o3
RIGHT JOIN (
    SELECT o2.uid, SUM(o1.opinion*o2.opinion)/SQRT(count(*)+0.0) AS correlation
    FROM opinions o1
    LEFT JOIN opinions o2 ON o1.prodid=o2.prodid
    WHERE o1.uid=1355
    GROUP BY o2.uid
) o12 ON o3.uid=o12.uid
LEFT JOIN (
    SELECT o4.prodid, COUNT(*) AS num_my_comments
    FROM opinions o4
    WHERE o4.uid=1355
    GROUP BY o4.prodid
) nmc ON o3.prodid=nmc.prodid
WHERE nmc.num_my_comments IS NULL AND o3.opinion<>0 AND o12.correlation<>0
GROUP BY o3.prodid
ORDER BY total_correlation DESC;

And it produces the query plan at http://www.samfundet.no/~sesse/queryplan.txt (the lines were a bit too long to include in an e-mail :-) ). Note that the o3.opinion<>0 AND o12.correlation<>0 conditions are an optimization; I can run the query fine without them and it will produce the same results, but it goes slower in both 7.2 and 7.4.

There are a few oddities here:

- The subquery scan o12 phase outputs 1186 rows, yet 83792 are sorted. Where do the other ~82000 rows come from? And why would it take ~100ms to sort the rows at all?
  (In earlier tests, this was _one full second_, but somehow that seems to have improved, yet without really improving the overall query time. shared_buffers is 4096 and sort_mem is 16384, so it should really fit into RAM.)

- Why does it use uid_index for an index scan on the table, when it obviously has no filter on it (since it returns all the rows)? Furthermore, why would this take half a second? (The machine is a 950MHz machine with SCSI disks.)

- Also, the outer sort (the sorting of the 58792 rows from the merge join) is slow. :-)

7.4 isn't really much better: http://www.samfundet.no/~sesse/queryplan74.txt

Note that this is run on a machine with almost twice the speed (in terms of CPU speed, at least). The same oddities are mostly present (such as o12 returning 1186 rows, but 58788 rows being sorted), so I really don't understand what's going on here. Any ideas on how to improve this?

/* Steinar */
--
Homepage: http://www.sesse.net/
Re: [PERFORM] finding a max value
Edoardo Ceccarelli [EMAIL PROTECTED] writes:
> This is the query:
>
> select max(KA) from annuncio
>
> field KA is indexed and is int4; explaining gives:
>
> explain select max(KA) from annuncio;
>                               QUERY PLAN
> -----------------------------------------------------------------------
>  Aggregate  (cost=21173.70..21173.70 rows=1 width=4)
>    ->  Seq Scan on annuncio  (cost=0.00..20326.76 rows=338776 width=4)
> (2 rows)
>
> Wasn't it supposed to do an index scan? It takes about 1 sec to get the result.

This is a known misfeature of max() in postgresql, see...

http://archives.postgresql.org/pgsql-performance/2003-12/msg00283.php

--
# James Antill -- [EMAIL PROTECTED]
  :0:
  * ^From: [EMAIL PROTECTED]
  /dev/null
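The linked message describes the usual workaround for this misfeature in 7.x-era PostgreSQL: rewrite the aggregate as an ORDER BY ... LIMIT query, which the planner can satisfy from the index:

```sql
-- Equivalent to SELECT max(KA) FROM annuncio, but able to use the index on KA.
-- If KA can be NULL, add WHERE KA IS NOT NULL (max() ignores NULLs;
-- ORDER BY ... DESC would otherwise return them first).
SELECT KA FROM annuncio ORDER BY KA DESC LIMIT 1;
```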
Re: [PERFORM] query plan wierdness?
The LIMIT is tricking you. I guess a sequential scan is cheaper than an index scan with the LIMIT 26 found there. Am I wrong?

Greets

--
Guido Barosio
Buenos Aires, Argentina
Re: [PERFORM] query plan wierdness?
> However, this query performs a sequence scan on the table, ignoring the call_idx13 index (the only difference is the addition of the aspid field in the order by clause):
>
> ... 23:59:59.999' order by aspid, openeddatetime desc, callstatus desc, calltype desc, callkey desc limit 26;

You do not have an index which matches the ORDER BY, so PostgreSQL cannot simply scan the index for the data you want. Thus it needs to find all matching rows, order them, etc. Your ORDER BY asks for: aspid ASC, openeddatetime DESC, callstatus DESC, calltype DESC, callkey DESC.

> call_idx13 btree (aspid, openeddatetime, callstatus, calltype, callkey)

This index is: aspid ASC, openeddatetime ASC, callstatus ASC, calltype ASC, callkey ASC. A reverse scan would of course be DESC, DESC, DESC, DESC, DESC -- neither of which matches your requested ORDER BY, and thus it cannot help reduce the rows looked at to 26.

This leaves your WHERE clause to restrict the dataset, and it doesn't do a very good job of it. There are more than 45 rows matching the where clause, which means the sequential scan was probably the right choice (unless you have over 10 million entries in the table).

Since your WHERE clause contains a single aspid, an improvement to the PostgreSQL optimizer would be to ignore that field in the ORDER BY, as order is no longer important when there is only one possible value. If it did ignore aspid, it would use a plan similar to the first one you provided.

You can accomplish the same thing by leaving out "aspid ASC" OR by setting it to "aspid DESC" in the ORDER BY. Leaving it out entirely will be slightly faster, but DESC will also cause PostgreSQL to use index call_idx13.
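Concretely, the rewrite suggested above might look like this (a sketch; the table name and WHERE clause are placeholders, since the original message does not show them in full):

```sql
-- call_idx13 is btree (aspid, openeddatetime, callstatus, calltype, callkey),
-- all ascending. With aspid pinned to one value by the WHERE clause, dropping
-- it from the ORDER BY lets a backward scan of call_idx13 return rows already
-- in the requested order, stopping after 26 rows.
SELECT *
  FROM calls                                   -- hypothetical table name
 WHERE aspid = '...'                           -- single-aspid restriction
 ORDER BY openeddatetime DESC, callstatus DESC, calltype DESC, callkey DESC
 LIMIT 26;
```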
[PERFORM] vacuum_mem
Hi,

I tested the vacuum_mem setting on a 4-CPU, 4 GB RAM machine. I am the only person on that machine.

The table:

 tablename | size_kb | reltuples
-----------+---------+-------------
 big_t     | 2048392 | 7.51515e+06

Case 1:
1. vacuum full big_t;
2. begin; update big_t set email = lpad('a', 255, 'b'); rollback;
3. set vacuum_mem = 655360;  -- 640 MB
4. vacuum big_t;

It takes 1,415,375 ms. Also from top, the max SIZE is 615M while SHARE is always 566M:

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM  TIME COMMAND
 5914 postgres  16   0  615M 615M  566M D     7.5 15.8 21:21 postgres: postgres mydb xxx.xxx.xxx.xxx:34361 VACUUM

Case 2:
1. vacuum full big_t;
2. begin; update big_t set email = lpad('a', 255, 'b'); rollback;
3. set vacuum_mem = 65536;  -- 64 MB
4. vacuum big_t;

It takes 1,297,798 ms. Also from top, the max SIZE is 615M while SHARE is always 566M:

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM  TIME COMMAND
 3613 postgres  15   0  615M 615M  566M D    17.1 15.8  9:04 postgres: postgres mydb xxx.xxx.xxx.xxx:34365 VACUUM

It seems vacuum_mem does not have any performance effect at all. In reality, we vacuum nightly, and I want to find out which vacuum_mem value is best to shorten vacuum time. Any thoughts?

Thanks,
Re: [PERFORM] vacuum_mem
> It seems vacuum_mem does not have performance effect at all.

Wrong conclusion. It implies that your test case takes less than 64 MB of memory to track your removed tuples. I think it takes 8 bytes to track a tuple for vacuuming an index, which means 64 MB should be able to track about 8 million deletions. Since your demonstration had about 7.5 million tuples for removal, it's under the limit.

Try your test again with 32 MB; it should make a single sequential pass on the table, and 2 passes on each index for that table. Either that, or do a few more aborted updates.
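The arithmetic behind that estimate can be checked directly. This sketch assumes the ~8 bytes of bookkeeping per dead tuple cited in the reply (the exact per-tuple cost in 7.x is an assumption here), and uses the reltuples figure from the original post:

```python
# vacuum_mem is specified in KB; at ~8 bytes of tracking space per dead
# tuple, a setting's capacity before extra index passes is:
BYTES_PER_DEAD_TUPLE = 8

def max_tracked_tuples(vacuum_mem_kb):
    """Dead tuples trackable within vacuum_mem in a single pass."""
    return vacuum_mem_kb * 1024 // BYTES_PER_DEAD_TUPLE

DEAD_TUPLES = 7_515_150  # reltuples for big_t in the original post

for setting_kb in (655360, 65536, 32768):  # 640 MB, 64 MB, 32 MB
    capacity = max_tracked_tuples(setting_kb)
    verdict = "single pass" if capacity >= DEAD_TUPLES else "extra index passes"
    print(f"vacuum_mem={setting_kb} KB -> {capacity:,} tuples: {verdict}")
```

Both 640 MB and 64 MB comfortably cover 7.5 million dead tuples, which is why the two timings in the test barely differ; only below ~58 MB would extra index passes kick in.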
Re: [PERFORM] Terrible performance after deleting/recreating indexes
Thanks for the advice. On further review it appears I am only getting this performance degradation when I run the command via a JDBC app. If I do the exact same query from psql, the performance is fine. I've tried both the JDBC2 and JDBC3 jars. Same results.

It definitely seems to correspond to deleting and recreating the indexes, though. The same query thru JDBC worked fine before recreating the indexes. Does that make any sense at all?

thanks

Bill

--- Shridhar Daithankar [EMAIL PROTECTED] wrote:
> Get an explain analyze. That gives actual v/s planned time spent. See
> what is causing the difference.
Re: [PERFORM] Terrible performance after deleting/recreating indexes
That is interesting -- both psql and JDBC merely submit statements for the backend to process, so generally you would expect no difference in execution plan or performance. It might be worth setting log_statement = true in postgresql.conf and checking that you are executing *exactly* the same statement in both JDBC and psql.

regards

Mark

P.S.: let's see the output from EXPLAIN ANALYZE :-)

Bill Chandler wrote:
> Thanks for the advice. On further review it appears I am only getting this performance degradation when I run the command via a JDBC app. If I do the exact same query from psql, the performance is fine. I've tried both the JDBC2 and JDBC3 jars. Same results.
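The suggestion above is a one-line configuration change (7.4 syntax; the server needs a reload or restart for it to take effect):

```
# postgresql.conf: log every statement the backends receive, so the
# statements arriving from JDBC and from psql can be compared verbatim
log_statement = true
```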
[PERFORM] Working on huge RAM based datasets
Hi,

I'm really stuck and I wonder if any of you could help.

I have an application which will be sitting on a quite large database (roughly 8-16 GB). The nature of the application is such that, on a second-by-second basis, the working set of the database is likely to be a substantial portion (e.g. between 50 and 70%) of the data -- just imagine an almost stochastic sampling of the data in each table, and you'll get an idea. Potentially quite smelly.

To start with, I thought: no problem. Just configure a DB server with an obscene amount of RAM (e.g. 64 GB), and configure PG with a shared buffer cache that is big enough to hold every page of data in the database, plus 10% or whatever to allow for a bit of room, ensuring that there is enough RAM in the box so that all the backend processes can do their thing, and all the other services can do their thing, and the swap system on the host remains idle.

Apparently not :(

I've read a number of places now saying that the PG cache has an optimal size, which isn't as big as you can make it without affecting other stuff on the machine. The suggestion is to let Linux take the strain for the lion's share of the caching (using its buffer cache), and just make the PG cache big enough to hold the data it needs for individual queries.

Ignoring for a moment the problem of answering the question 'so how big shall I make the PG cache?', and ignoring the possibility that as the database content changes over the months this answer will need updating from time to time for optimal performance, does anyone have any actual experience with trying to maintain a large, mainly RAM-resident database? What is it about the buffer cache that makes it so unhappy being able to hold everything?
I don't want to be seen as a cache-hit fascist, but isn't it just better if the data is just *there*, available in the postmaster's address space ready for each backend process to access, rather than expecting the Linux cache mechanism, optimised as it may be, to have to do the caching? Is it that the PG cache entries are accessed through a 'not particularly optimal for large numbers of tuples' type of strategy? (Optimal though it might be for more modest numbers.)

And on a more general note, with the advent of 64-bit addressing and rising RAM sizes, won't there, with time, be more and more DB applications that would want to capitalise on the potential speed improvements that come with not having to work hard to get the right bits in the right bit of memory all the time?

And finally, am I worrying too much, and actually this problem is common to all databases?

Thanks for reading,

Andy
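For what it's worth, the "let the OS do the caching" advice being asked about translates into settings roughly like these (illustrative values only, assuming a 64 GB machine, 8 KB pages, and 7.4-era parameter names; they would need benchmarking for a real workload):

```
# postgresql.conf sketch: keep PG's own buffer cache modest, let the OS
# page cache do the heavy lifting, and tell the planner the OS cache is big
shared_buffers = 50000          # 8 KB pages -> ~400 MB
sort_mem = 16384                # KB per sort operation
effective_cache_size = 6000000  # 8 KB pages -> ~48 GB assumed in the OS cache
```

effective_cache_size does not allocate anything; it only informs the planner how much caching it can expect, which nudges it toward index scans on a RAM-resident dataset.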
Re: [PERFORM] Working on huge RAM based datasets
> What is it about the buffer cache that makes it so unhappy being able to hold everything? I don't want to be seen as a cache hit fascist, but isn't it just better if the data is just *there*, available in the postmaster's address space ready for each backend process to access it, rather than expecting the Linux cache mechanism, optimised as it may be, to have to do the caching?

Because the PostgreSQL buffer management algorithms are pitiful compared to Linux's. In 7.5, it's improved with the new ARC algorithm, but still -- the Linux disk buffer cache will be very fast.

Chris