Re: [PERFORM] Queries 15 times slower on 8.1 beta 2 than on 8.0
Jean-Pierre Pelletier wrote: Hi, I've got many queries running much slower on 8.1 beta2 than on 8.0.1. Here is a simplified one that takes 484 ms on 8.1 and 32 ms on 8.0.1: select 0 from Content C left outer join Supplier S on C.SupplierId = S.SupplierId left outer join Price P on C.PriceId = P.PriceId; Any ideas why it's slower? Thanks, Jean-Pierre Pelletier, e-djuster.

You really have to post the results of EXPLAIN ANALYZE, not just EXPLAIN, so that we can tell what the planner is expecting versus what really happened. John =:-
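For reference, a minimal way to produce what is being asked for, using the simplified query from the report (the psql session is just one way to capture it; table and column names are the poster's):

EXPLAIN ANALYZE
SELECT 0
FROM Content C
LEFT OUTER JOIN Supplier S ON C.SupplierId = S.SupplierId
LEFT OUTER JOIN Price P ON C.PriceId = P.PriceId;
-- Run this on both the 8.0.1 and the 8.1 beta2 installs and post the
-- full output of each; the "actual time" and "rows" figures are what
-- show where the plans diverge.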
Re: [PERFORM] Queries 15 times slower on 8.1 beta 2 than on 8.0
Jean-Pierre Pelletier wrote: Here are the explain analyze: What is the explain analyze if you use set enable_seqscan to off? Also, can you post the output of: \d supplier \d price \d content Mostly I just want to see what the indexes are, in the case that you don't want to show us your schema. John =:-
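A sketch of that follow-up, assuming the same simplified query as before; disabling enable_seqscan is only a session-local diagnostic to see which plan the planner is avoiding, not a fix:

SET enable_seqscan TO off;
EXPLAIN ANALYZE
SELECT 0
FROM Content C
LEFT OUTER JOIN Supplier S ON C.SupplierId = S.SupplierId
LEFT OUTER JOIN Price P ON C.PriceId = P.PriceId;
SET enable_seqscan TO on;   -- restore the default afterwards

-- And, from psql, the table and index definitions:
\d supplier
\d price
\d content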
Re: [PERFORM] How many tables is too many tables?
[EMAIL PROTECTED] wrote: I have a database of hundreds of millions of web links (between sites) in Postgres. For each link, we record the url, the referer, and the most recent date the link exists. I'm having some serious performance issues when it comes to writing new data into the database. One machine is simply not going to be able to scale with the quantities of links we hope to store information about and we want to move to some kind of cluster. Because of the quantities of data, it seems to make sense to go for a cluster setup such that in a 4 machine cluster, each machine has a quarter of the data (is this Share nothing, or, Share everything?). To that end, we figured a good first step was to partition the data on one machine into multiple tables defining the logic which would find the appropriate table given a piece of data. Then, we assumed, adding the logic to find the appropriate machine and database in our cluster would only be an incremental upgrade. In a database app, you generally don't win by going to a cluster, because you are almost always bound by your I/O. Which means that a single machine, just with more disks, is going to outperform a group of machines. As Tom mentioned, your schema is not very good. So lets discuss what a better schema would be, and also how you might be able to get decent performance with a cluster. First, 200rows * 400,000 tables = 80M rows. Postgres can handle this in a single table without too much difficulty. It all depends on the selectivity of your indexes, etc. I'm not sure how you are trying to normalize your data, but it sounds like having a url table so that each entry can be a simple integer, rather than the full path, considering that you are likely to have a bunch of repeated information. This makes your main table something like 2 integers, plus the interesting stuff (from url, to url, data). If you are finding you are running into I/O problems, you probably could use this layout to move your indexes off onto their own spindles, and maybe separate the main table from the url tables. What is your hardware? What are you trying to do that you don't think will scale? If you were SELECT bound, then maybe a cluster would help you, because you could off-load the SELECTs onto slave machines, and leave your primary machine available for INSERTs and replication. ... At this point, the primary performance bottleneck is in adding additional data to the database. Our loader program (we load text files of link information) is currently getting about 40 rows a second, which is nowhere near the performance we need to be seeing. In theory, we want to be able to re-write our entire archive of data within on a 1-2 month cycle, so this is a very heavy write application (though we're also constantly generating reports from the data, so its not write only). Are you VACUUMing enough? If you are rewriting all of the data, postgres needs you to clean up afterwards. It is pessimistic, and leaves old rows in their place. Is the total number of tables prohibitively affecting our write speed or is that an IO problem that can only be addressed by better drive partitioning (all data is on one drive, which I've already read is a problem)? Is this approach to data partitioning one which makes any sense for performance, or should we move to a more normal distribution of links across fewer tables which house more rows each? If all data is on a single drive, you are nowhere near needing a cluster to improve your database. What you need is a 14-drive RAID array. 
It's probably cheaper than 4x powerful machines, and will provide you with much better performance. And put all of your tables back into one. John =:- Thanks in advance for your advice. -matt
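To make the normalization suggestion concrete, here is a rough sketch of the layout being described; all names are invented for illustration, and the real tables would carry whatever extra columns the crawler records:

-- One row per distinct URL, referenced by integer id from the link table.
CREATE TABLE url (
    url_id SERIAL PRIMARY KEY,
    url    TEXT NOT NULL UNIQUE
);

-- One big link table instead of 400,000 small ones.
CREATE TABLE link (
    from_url  INTEGER NOT NULL REFERENCES url (url_id),
    to_url    INTEGER NOT NULL REFERENCES url (url_id),
    last_seen DATE    NOT NULL,
    PRIMARY KEY (from_url, to_url)
);

-- Support lookups from the other direction as well. These indexes (and the
-- url table) are what you would consider moving to separate spindles if I/O
-- becomes the bottleneck.
CREATE INDEX link_to_url_idx ON link (to_url);

Loading into this layout with COPY, rather than row-at-a-time INSERTs, is usually the biggest single win for the 40 rows/second problem.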
Re: [PERFORM] RAID Stripe size
bm\mbn wrote: Hi Everyone. The machine is IBM x345 with ServeRAID 6i 128mb cache and 6 SCSI 15k disks. 2 disks are in RAID1 and hold the OS, SWAP and pg_xlog; 4 disks are in RAID10 and hold the Cluster itself. The DB will have two major tables, 1 with 10 million rows and one with 100 million rows. All the activities against these tables will be SELECT.

What type of SELECTs will you be doing? Mostly sequential reads of a bunch of data, or indexed lookups of random pieces?

Currently the stripe size is 8k. I read in many places this is a poor setting. Am I right?

From what I've heard of RAID, if you are doing large sequential transfers, larger stripe sizes (128k, 256k) generally perform better. For postgres, though, when you are writing, having the stripe size be around the same size as your page size (8k) could be advantageous, as when postgres reads a page, it only reads a single stripe. So if it were reading a series of pages, each one would come from a different disk. I may be wrong about that, though. John =:-
Re: [PERFORM] Battery Backed Cache for RAID
Alvaro Herrera wrote: On Wed, Sep 14, 2005 at 11:28:43AM -0700, Jeffrey W. Baker wrote: On Wed, 2005-09-14 at 11:25 -0700, Peter Darley wrote: I'm getting a new server for our database, and I have a quick question about RAID controllers with a battery backed cache. I understand that the cache will allow the cache to be written out if the power fails to the box, which allows it to report a write as committed safely when it's not actually committed. Actually the cache will just hold its contents while the power is out. When the power is restored, the RAID controller will complete the writes to disk. If the battery does not last through the outage, the data is lost. Just curious: how long are the batteries supposed to last? The recent *cheap* version of a ramdisk had battery backup for 16 hours. (Very expensive ramdisks actually have enough battery power to power a small hard-drive to dump the contents into). I'm guessing for a RAID controller, the time would be in the max 1 day range. John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] shared buffers
Martin Nickel wrote: Chris, Would you say that 30,000 pages is a good maximum for a Postgres install? We're running 8.0.3 on 64-bit SUSE on a dual Opteron box with 4G and have shared_buffers set at 120,000. I've moved it up and down (it was 160,000 when I got here) without any measurable performance difference.

What I've read on the mailing list is that usually the sweet spot is actually around 10k pages. 120k seems far too high. I believe that the major fixes to the buffer manager are more in 8.1 rather than 8.0, so you probably are hitting some problems. (The biggest problem was that there were places that required doing a complete scan through shared memory looking for dirty pages, or some such.)

The reason I ask is because I occasionally see large-ish queries take forever (like cancel-after-12-hours forever) and wondered if this could result from shared_buffers being too large. Thanks for your (and anyone else's) help! Martin Nickel

There are lots of possibilities for why these take so long; perhaps you would want to post them, and we can try to help. For instance, if you have a foreign key reference from one table to another, and don't have indexes on both sides, then deleting from the referenced table will cause a sequential scan on the referring table for *each* deleted row (IIRC). John =:-
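As an illustration of that foreign-key point, with made-up table names; it is the referencing column that needs its own index:

-- parent is the referenced table, child holds the foreign key.
CREATE TABLE parent (id INTEGER PRIMARY KEY);
CREATE TABLE child  (parent_id INTEGER REFERENCES parent (id), payload TEXT);

-- Without this index, each row deleted from parent forces a sequential
-- scan of child to check for referencing rows.
CREATE INDEX child_parent_id_idx ON child (parent_id);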
Re: [PERFORM] Prepared statement not using index
Guido Neitzer wrote: Hi. I have an interesting problem with the JDBC drivers. When I use a select like this: SELECT t0.aktiv, t0.id, t0.ist_teilnehmer, t0.nachname, t0.plz, t0.vorname FROM public.dga_dienstleister t0 WHERE t0.plz like ?::varchar(256) ESCAPE '|' withBindings: 1:53111(plz) the existing index on the plz column is not used. When I the same select with a concrete value, the index IS used. I use PostgreSQL 8.0.3 on Mac OS X and the JDBC driver 8.0-312 JDBC 3. After a lot of other things, I tried using a 7.4 driver and with this, the index is used in both cases. Why can this happen? Is there a setting I might have not seen? Something I do wrong? cug I've had this problem in the past. In my case, the issue was that the column I was searching had a mixed blend of possible values. For example, with 1M rows, the number 3 occurred 100 times, but the number 18 occurred 700,000 times. So when I manually did a search for 3, it naturally realized that it could use an index scan, because it had the statistics to say it was very selective. If I manually did a search for 18, it switched to sequential scan, because it was not very selective (both are the correct plans). But if you create a prepared statement, parameterized on this number, postgres has no way of knowing ahead of time, whether you will be asking about 3 or 18, so when the query is prepared, it has to be pessimistic, and avoid worst case behavior, so it choses to always use a sequential scan. The only way I got around this was with writing a plpgsql function which used the EXECUTE syntax to dynamically re-plan part of the query. Hope this makes sense. This may or may not be your problem, without knowing more about you setup. But the symptoms seem similar. John =:- signature.asc Description: OpenPGP digital signature
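A rough sketch of that EXECUTE workaround, using the table from this thread; the function name is invented, and the point is only that the query text is built and planned at call time, so the planner sees the concrete value:

CREATE OR REPLACE FUNCTION find_by_plz(p_plz varchar)
RETURNS SETOF dga_dienstleister AS '
DECLARE
    r dga_dienstleister%ROWTYPE;
BEGIN
    -- Planned per call, so the statistics for the actual value are used.
    FOR r IN EXECUTE ''SELECT * FROM public.dga_dienstleister WHERE plz LIKE ''
                     || quote_literal(p_plz || ''%'') LOOP
        RETURN NEXT r;
    END LOOP;
    RETURN;
END;
' LANGUAGE plpgsql;

-- Usage from JDBC or psql:
-- SELECT * FROM find_by_plz('53111');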
Re: [PERFORM] Advise about how to delete entries
Kevin wrote: Arnau wrote: Hi all, COPY FROM a file with all the ID's to delete, into a temporary table, and do a joined delete to your main table (thus, only one query). I already did this, but I don't have idea about how to do this join, could you give me a hint ;-) ? Thank you very much maybe something like this: DELETE FROM statistics_sasme s LEFT JOIN temp_table t ON (s.statistic_id = t.statistic_id) WHERE t.statistic_id IS NOT NULL Why can't you do: DELETE FROM statistics_sasme s JOIN temp_table t ON (s.statistic_id = t.statistic_id); Or possibly: DELETE FROM statistics_sasme s WHERE s.id IN (SELECT t.statistic_id FROM temp_table t); I'm not sure how delete exactly works with joins, but the IN form should be approximately correct. John =:- signature.asc Description: OpenPGP digital signature
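For what it's worth, the joined form PostgreSQL actually accepts (from 8.1 on) is spelled with USING rather than JOIN; on 8.0 and earlier the IN form is the one to use. A sketch with the names from this thread, assuming statistic_id really is the key on both sides:

-- 8.1 and later:
DELETE FROM statistics_sasme
  USING temp_table t
  WHERE statistics_sasme.statistic_id = t.statistic_id;

-- Works on older releases too:
DELETE FROM statistics_sasme
  WHERE statistic_id IN (SELECT statistic_id FROM temp_table);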
Re: [PERFORM] Caching by Postgres
gokulnathbabu manoharan wrote: Hi all, I like to know the caching policies of Postgresql. What parameter in the postgresql.conf affects the cache size used by the Postgresql? As far as I have searched my knowledge of the parameters are In general, you don't. The OS handles caching based on file usage. So if you are using the files, the OS should cache them. Just like it does with any other program. 1. shared_buffers - Sets the limit on the amount of shared memory used. If I take this is as the cache size then my performance should increase with the increase in the size of shared_buffers. But it seems it is not the case and my performance actually decreases with the increase in the shared_buffers. I have a RAM size of 32 GB. The table which I use more frequently has around 68 million rows. Can I cache this entire table in RAM? There is a portion of this which is used for caching. But I believe before 8.1 there was code that went linearly through all of the shared_buffers and checked for dirty/clean pages. So there was a tradeoff that the bigger you make it, the longer that search goes. So you got diminishing returns, generally around 10k shared buffers. I think it is better in 8.1, but if the OS is going to cache it anyway (since it does), then having a Postgres cache is just wasting memory, and not letting cache as much. So I'm guessing that with 8.1 there would be 2 sweet spots. Low shared_buffers (= 10k), and really high shared buffers (like all of available ram). But because postgres has been tuned for the former I would stick with it (I don't think shared_buffers can go 2GB, but that might just be work_mem/maintenance_work_mem). 2. work_mem - It is the amount of memory used by an operation. My guess is once the operation is complete this is freed and hence has nothing to do with the caching. 3. effective_cache_size - The parameter used by the query planner and has nothing to do with the actual caching. This is important from a planner issue. Because the planner can then expect that the OS is doing its job and caching the tables, so index scans are cheaper than they would be otherwise. John =:- So kindly help me in pointing me to the correct parameter to set. It will be great if you can point me to the docs that explains the implementation of caching in Postgresql which will help me in understanding things much clearly. Thanks in advance. Gokul. signature.asc Description: OpenPGP digital signature
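As a very rough illustration of how those three settings usually end up being divided on a machine like that, the numbers below are placeholders to show the shape of the split, not recommendations:

# postgresql.conf sketch for a 32 GB box running 8.0; values are in the
# units the parameters used at the time (8 kB pages or kB).
shared_buffers = 10000          # ~80 MB, the 10k-page sweet spot discussed above
work_mem = 16384                # 16 MB per sort/hash; can be allocated several times per query
maintenance_work_mem = 262144   # 256 MB for VACUUM, CREATE INDEX, etc.
effective_cache_size = 3000000  # ~24 GB, tells the planner the OS cache is doing the real caching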
Re: [PERFORM] extremly low memory usage
Jeremiah Jahn wrote: On Sat, 2005-08-20 at 21:32 -0500, John A Meinel wrote: Ron wrote: At 02:53 PM 8/20/2005, Jeremiah Jahn wrote: Well, since you can get a read of the RAID at 150MB/s, that means that it is actual I/O speed. It may not be cached in RAM. Perhaps you could try the same test, only using say 1G, which should be cached. [EMAIL PROTECTED] pgsql]# time dd if=/dev/zero of=testfile bs=1024 count=100 100+0 records in 100+0 records out real0m8.885s user0m0.299s sys 0m6.998s [EMAIL PROTECTED] pgsql]# time dd of=/dev/null if=testfile bs=1024 count=100 100+0 records in 100+0 records out real0m1.654s user0m0.232s sys 0m1.415s The write time seems about the same (but you only have 128MB of write cache), but your read jumped up to 620MB/s. So you drives do seem to be giving you 150MB/s. ... I'm actually curious about PCI bus saturation at this point. Old 32-bit 33MHz pci could only push 1Gbit = 100MB/s. Now, I'm guessing that this is a higher performance system. But I'm really surprised that your write speed is that close to your read speed. (100MB/s write, 150MB/s read). The raid array I have is currently set up to use a single channel. But I have dual controllers In the array. And dual external slots on the card. The machine is brand new and has pci-e backplane. Assuming these are U320 15Krpm 147GB HDs, a RAID 10 array of 14 of them doing raw sequential IO like this should be capable of at ~7*75MB/s= 525MB/s using Seagate Cheetah 15K.4's, ~7*79MB/s= 553MB/s BTW I'm using Seagate Cheetah 15K.4's Now, are the numbers that Ron is quoting in megabytes or megabits? I'm guessing he knows what he is talking about, and is doing megabytes. 80MB/s sustained seems rather high for a hard-disk. Though this page: http://www.storagereview.com/articles/200411/20041116ST3146754LW_2.html Does seem to agree with that statement. (Between 56 and 93MB/s) And since U320 is a 320MB/s bus, it doesn't seem like anything there should be saturating. So why the low performance _IF_ the controller setup is high powered enough to keep that kind of IO rate up. This will require a controller or controllers providing dual channel U320 bandwidth externally and quad channel U320 bandwidth internally. IOW, it needs a controller or controllers talking 64b 133MHz PCI-X, reasonably fast DSP/CPU units, and probably a decent sized IO buffer as well. AFAICT, the Dell PERC4 controllers use various flavors of the LSI Logic MegaRAID controllers. What I don't know is which exact one yours is, nor do I know if it (or any of the MegaRAID controllers) are high powered enough. PERC4eDC-PCI Express, 128MB Cache, 2-External Channels Do you know which card it is? Does it look like this one: http://www.lsilogic.com/products/megaraid/megaraid_320_2e.html Judging by the 320 speed, and 2 external controllers, that is my guess. They at least claim a theoretical max of 2GB/s. Which makes you wonder why reading from RAM is only able to get throughput of 600MB/s. Did you run it multiple times? On my windows system, I get just under 550MB/s for what should be cached, copying from /dev/zero to /dev/null I get 2.4GB/s (though that might be a no-op). On a similar linux machine, I'm able to get 1200MB/s for a cached file. (And 3GB/s for a zero=null copy). John =:- Talk to your HW supplier to make sure you have controllers adequate to your HD's. ...and yes, your average access time will be in the 5.5ms - 6ms range when doing a physical seek. Even with RAID, you want to minimize seeks and maximize sequential IO when accessing them. 
Best to not go to HD at all ;-) Well, certainly, if you can get more into RAM, you're always better off. For writing, a battery-backed write cache, and for reading lots of system RAM. I'm not really worried about the writing, it's the reading the reading that needs to be faster. Hope this helps, Ron Peacetree John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Performance pb vs SQLServer.
Qingqing Zhou wrote: Alvaro Herrera [EMAIL PROTECTED] writes Interesting; do they use an overwriting storage manager like Oracle, or a non-overwriting one like Postgres? They call this MVCC RLV(row level versioning). I think they use rollback segment like Oracle (a.k.a version store or tempdb in SQL Server). Some details are explained in their white paper:Database concurrency and row level versioning in SQL Server 2005. Regards, Qingqing I found the paper here: http://www.microsoft.com/technet/prodtechnol/sql/2005/cncrrncy.mspx And it does sound like they are doing it the Oracle way: When a record in a table or index is updated, the new record is stamped with the transaction sequence_number of the transaction that is doing the update. The previous version of the record is stored in the version store, and the new record contains a pointer to the old record in the version store. Old records in the version store may contain pointers to even older versions. All the old versions of a particular record are chained in a linked list, and SQL Server may need to follow several pointers in a list to reach the right version. Version records need to be kept in the version store only as long as there are there are operations that might require them. John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] extremly low memory usage
Jeremiah Jahn wrote: here's an example standard query. I really have to make the first hit go faster. The table is clustered on full_name as well. 'Smith%' took 87 seconds on the first hit. I wonder if I set up my array wrong. I remember seeing something about DMA access versus something else, and chose DMA access. LVM maybe?

It would be nice if you would format your queries to be a little bit easier to read before posting them. However, I believe I am reading it correctly, to say that the index scan on identity is not your slow point. In fact, as near as I can tell, it only takes 52ms to complete. The expensive parts are the 4915 lookups into litigant_details (each one takes approx 4ms for a total of ~20s). And then you do it again on case_data (average 3ms each * 4906 loops = ~15s). So there is no need for preloading your indexes on the identity table. It is definitely not the bottleneck.

So a few design bits, which may help your database. Why is actor_id a text field instead of a number? You could try creating an index on litigant_details (actor_id, court_ori) so that it can do just an index lookup, rather than an index + filter. More importantly, though, the planner seems to think the join of identity to litigant_details will only return 1 row, not 5000. Do you regularly vacuum analyze your tables? Just as a test, try running: set enable_nestloop to off; and then run EXPLAIN ANALYZE again, just to see if it is faster. You probably need to increase some statistics targets, so that the planner can design better plans.

Nested Loop (cost=0.00..20411.83 rows=2 width=173) (actual time=12.891..38317.017 rows=4906 loops=1)
  -> Nested Loop (cost=0.00..20406.48 rows=1 width=159) (actual time=12.826..23232.106 rows=4906 loops=1)
    -> Nested Loop (cost=0.00..20403.18 rows=1 width=138) (actual time=12.751..22885.439 rows=4906 loops=1)
       Join Filter: ((outer.case_id)::text = (inner.case_id)::text)
       -> Index Scan using name_speed on identity (cost=0.00..1042.34 rows=4868 width=82) (actual time=0.142..52.538 rows=4915 loops=1)
          Index Cond: (((full_name)::text >= 'MILLER'::character varying) AND ((full_name)::text < 'MILLES'::character varying))
          Filter: (((court_ori)::text = 'IL081025J'::text) AND ((full_name)::text ~~ 'MILLER%'::text))
       -> Index Scan using lit_actor_speed on litigant_details (cost=0.00..3.96 rows=1 width=81) (actual time=4.631..4.635 rows=1 loops=4915)
          Index Cond: ((outer.actor_id)::text = (litigant_details.actor_id)::text)
          Filter: ('IL081025J'::text = (court_ori)::text)
    -> Seq Scan on court (cost=0.00..3.29 rows=1 width=33) (actual time=0.053..0.062 rows=1 loops=4906)
       Filter: ('IL081025J'::text = (id)::text)
  -> Index Scan using case_speed on case_data (cost=0.00..5.29 rows=3 width=53) (actual time=3.049..3.058 rows=1 loops=4906)
     Index Cond: (('IL081025J'::text = (case_data.court_ori)::text) AND ((case_data.case_id)::text = (outer.case_id)::text))

John =:-
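A sketch of the concrete commands being suggested (the composite index column comes from the filter shown in the plan, the statistics target of 100 is just an example value, and the planner switch is spelled enable_nestloop):

-- Composite index so the litigant_details lookup needs no extra filter step:
CREATE INDEX lit_actor_court_speed ON litigant_details (actor_id, court_ori);

-- Better row estimates for the join columns, then re-gather statistics:
ALTER TABLE identity ALTER COLUMN actor_id SET STATISTICS 100;
ALTER TABLE litigant_details ALTER COLUMN actor_id SET STATISTICS 100;
ANALYZE identity;
ANALYZE litigant_details;

-- Purely as a session-local experiment, to see whether a different join type wins:
SET enable_nestloop TO off;
-- ... rerun the EXPLAIN ANALYZE here ...
SET enable_nestloop TO on;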
Re: [PERFORM] Data Selection Slow From VB 6.0
Mahesh Shinde wrote: Hi ... To connect To the PostgreSQL I am using **PostgreSQL Win32 ODBC and OLEDB client drivers 1.0.0.2** ... Since in the database I have one Major table that Debtor table which is master table and having around 55 lac records. I have set debtorId as a primary key having index on it.I am developing a search screen to search a specific debtor info using this table. When I fire a query to search a debtor id, it took around 5 seconds to return an answer for a query whether entered debtor id is present in the database or not using ODBC. Where as when Explian the query on the database Index Scan using tbmstban_debtorid on tbmstbandetails (cost=0.00..6.01 rows=2 width=143) Index Cond: ((debtorid)::text = '234'::text) Are you checking this from the VB App? Or just going onto the server and running psql? (I'm guessing there is some way to run a flat query using VB. In which case you can just have the query run EXPLAIN ANALYZE, the return value is just the text, one line after another.) What I'm thinking is that it might be a locale/encoding issue. What is the encoding of your database? And what is the default locale and the locale that you are connecting as? Can you give us the EXPLAIN ANALYZE output so that we can see if the planner is doing what it thinks it is? It certainly sounds like either it is always doing a sequential scan, or something else is going on. 5 sec is a really long time for the type of query you are doing. Oh, and can you run the win32 psql client to see if it might be ODBC which is causing the problem? John =:- Query for the search criteria is *select * from tbmstdebtordetails where debtorid ='234'* Where as when I am using a like query to search a record starting with debtor id having a characters then it took around 10-15 sec to return a record set having records. query is *select * from tbmstdebtordetails where debtorid like '234%'* Explain output on the database Index Scan using tbmstban_debtorid on tbmstbandetails (cost=0.00..6.01 rows=2 width=143) Index Cond: ((debtorid)::text = '234%'::text) Thanks regards, Mahesh Shinde -- Codec Communications (I) Pvt. Ltd. PUNE (INDIA) T # 91-20-24221460/70(Ext 43) Desk No. 25143 Email – [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] signature.asc Description: OpenPGP digital signature
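If it helps, the flat statements to run from psql (or to push through the VB connection as ordinary queries) would look something like this, using the table name from the query text above; EXPLAIN ANALYZE returns the plan as plain text rows, one line per node:

EXPLAIN ANALYZE SELECT * FROM tbmstdebtordetails WHERE debtorid = '234';
EXPLAIN ANALYZE SELECT * FROM tbmstdebtordetails WHERE debtorid LIKE '234%';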
Re: [PERFORM] extremly low memory usage
Jeremiah Jahn wrote: I just put together a system with 6GB of ram on a 14 disk raid 10 array. When I run my usual big painful queries, I get very little to no memory usage. My production box (raid 5 4GB ram) hovers at 3.9GB used most of the time. the new devel box sits at around 250MB. I've switched to an 8.0 system on the new devel box, but the .conf really didn't change. Index usage is the same. Something seems wrong and I'm not sure why.

How big is your actual database on disk? And how much of it is actually touched by your queries? It seems that your tough queries might only be exercising a portion of the database. If you really want to make memory usage increase try something like: find . -type f -print0 | xargs -0 cat > /dev/null which should read all the files. After doing that, does the memory usage increase?

any thoughts, -jj-

shared_buffers = 32768 # min 16, at least max_connections*2, 8KB each
work_mem = 2097151 # min 64, size in KB

This seems awfully high. 2GB per sort? This might actually be flushing some of your ram, since it would get allocated and filled, and then freed when finished. Remember, depending on what you are doing, this amount can get allocated more than once per query.

maintenance_work_mem = 819200 # min 1024, size in KB
max_fsm_pages = 8 # min max_fsm_relations*16, 6 bytes each
checkpoint_segments = 30 # in logfile segments, min 1, 16MB each
effective_cache_size = 360 - this is a little out of control, but would it have any real effect?

It should just tell the planner that it is more likely to have buffers in cache, so index scans are slightly cheaper than they would otherwise be.

random_page_cost = 2 # units are one sequential page fetch cost
log_min_duration_statement = 1 # -1 is disabled, in milliseconds.
lc_messages = 'C' # locale for system error message strings
lc_monetary = 'C' # locale for monetary formatting
lc_numeric = 'C' # locale for number formatting
lc_time = 'C' # locale for time formatting

John =:-
Re: [PERFORM] Need for speed
Ulrich Wisser wrote: Hello, one of our services is click counting for on line advertising. We do this by importing Apache log files every five minutes. This results in a lot of insert and delete statements. At the same time our customers shall be able to do on line reporting. What are you deleting? I can see having a lot of updates and inserts, but I'm trying to figure out what the deletes would be. Is it just that you completely refill the table based on the apache log, rather than doing only appending? Or are you deleting old rows? We have a box with Linux Fedora Core 3, Postgres 7.4.2 Intel(R) Pentium(R) 4 CPU 2.40GHz 2 scsi 76GB disks (15.000RPM, 2ms) I did put pg_xlog on another file system on other discs. Still when several users are on line the reporting gets very slow. Queries can take more then 2 min. If it only gets slow when you have multiple clients it sounds like your select speed is the issue, more than conflicting with your insert/deletes. I need some ideas how to improve performance in some orders of magnitude. I already thought of a box with the whole database on a ram disc. So really any idea is welcome. How much ram do you have in the system? It sounds like you only have 1 CPU, so there is a lot you can do to make the box scale. A dual Opteron (possibly a dual motherboard with dual core (but only fill one for now)), with 16GB of ram, and an 8-drive RAID10 system would perform quite a bit faster. How big is your database on disk? Obviously it isn't very large if you are thinking to hold everything in RAM (and only have 76GB of disk storage to put it in anyway). If your machine only has 512M, an easy solution would be to put in a bunch more memory. In general, your hardware is pretty low in overall specs. So if you are willing to throw money at the problem, there is a lot you can do. Alternatively, turn on statement logging, and then post the queries that are slow. This mailing list is pretty good at fixing poor queries. One thing you are probably hitting is a lot of sequential scans on the main table. If you are doing mostly inserting, make sure you are in a transaction, and think about doing a COPY. There is a lot more that can be said, we just need to have more information about what you want. John =:- Ulrich signature.asc Description: OpenPGP digital signature
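To illustrate the last point, the difference is roughly between firing one INSERT per log line, each in its own transaction, and one of these two shapes (table and column names are invented):

-- Batch many inserts into a single transaction...
BEGIN;
INSERT INTO clicks (ad_id, ts, referer) VALUES (1, '2005-08-01 12:00:00', 'http://example.com/a');
INSERT INTO clicks (ad_id, ts, referer) VALUES (2, '2005-08-01 12:00:01', 'http://example.com/b');
-- ... thousands more ...
COMMIT;

-- ...or, better for a bulk load every five minutes, COPY the parsed log in one shot:
COPY clicks (ad_id, ts, referer) FROM '/tmp/clicks.tsv';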
Re: [PERFORM] PG8 Tuning
Alvaro Herrera wrote: On Tue, Aug 16, 2005 at 09:12:31AM -0700, Josh Berkus wrote: However, you are absolutely correct in that it's *relative* advice, not absolute advice. If, for example, you're using a $100,000 EMC SAN as your storage you'll probably be better off giving it everything and letting its controller and cache handle disk allocation etc. On the other hand, if you're dealing with the 5 drives in a single Dell 6650, I've yet to encounter a case where a separate xlog disk did not benefit an OLTP application. I've been asked this a couple of times and I don't know the answer: what happens if you give XLog a single drive (unmirrored single spindle), and that drive dies? So the question really is, should you be giving two disks to XLog? I can propose a simple test. Create a test database. Run postgres, insert a bunch of stuff. Stop postgres. Delete everything in the pg_xlog directory. Start postgres again, what does it do? I suppose to simulate more of a failure mode, you could kill -9 the postmaster (and all children processes) perhaps during an insert, and then delete pg_xlog. But I would like to hear from the postgres folks what they *expect* would happen if you ever lost pg_xlog. What about something like keeping pg_xlog on a ramdisk, and then rsyncing it to a hard-disk every 5 minutes. If you die in the middle, does it just restore back to the 5-minutes ago point, or does it get more thoroughly messed up? For some people, a 5-minute old restore would be okay, as long as you still have transaction safety, so that you can figure out what needs to be restored. John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] choosing RAID level for xlogs
Anjan Dave wrote: Yes, that's true, though, I am a bit confused because the Clariion array document I am reading talks about how the write cache can eliminate the RAID5 Write Penalty for sequential and large IOs...resulting in better sequential write performance than RAID10. anjan Well, if your stripe size is 128k, and you have N disks in the RAID (N must be even and 4 for RAID10). With RAID5 you have a stripe across N-1 disks, and 1 parity entry. With RAID10 you have a stripe across N/2 disks, replicated on the second set. So if the average write size is 128k*N/2, then you will generally be using all of the disks during a write, and you can expect a the maximum scale up of about N/2 for RAID10. If your average write size is 128k*(N-1) then you can again write an entire stripe at a time and even the parity since you already know all of the information you don't have to do any reading. So you can get a maximum speed up of N-1. If you are doing infrequent smallish writes, it can be buffered by the write cache, and isn't disk limited at all. And the controller can write it out when it feels like it. So it should be able to do more buffered all-at-once writes. If you are writing a little bit more often (such that the cache fills up), depending on your write pattern, it is possible that all of the stripes are already in the cache, so again there is little penalty for the parity stripe. I suppose the worst case is if you were writing lots of very small chunks, all over the disk in random order. In which case each write encounters a 2x read penalty for a smart controller, or a Nx read penalty if you are going for more safety than speed. (You can read the original value, and the parity, and re-compute the parity with the new value (2x read penalty), but if there is corruption it would not be detected, so you might want to read all of the stripes in the block, and recompute the parity with the new data (Nx read penalty)). I think the issue for Postgres is that it writes 8k pages, which is quite small relative to the stripe size. So you don't tend to build up big buffers to write out the entire stripe at once. So if you aren't filling up your write buffer, RAID5 can do quite well with bulk loads. I also don't know about the penalties for a read followed immediately by a write. Since you will be writing to the same location, you know that you have to wait for the disk to spin back to the same location. At 10k rpm that is a 6ms wait time. For 7200rpm disks, it is 8.3ms. Just to say that there are some specific extra penalties when you are reading the location that you are going to write right away. Now a really smart controller with lots of data to write could read the whole circle on the disk, and then start writing out the entire circle, and not have any spin delay. But you would have to know the size of the circle, and that depends on what block you are on, and the heads arrangement and everything else. Though since hard-drives also have small caches in them, you could hide some of the spin delay, but not a lot, since you have to leave the head there until you are done writing, so while the current command would finish quickly, the next command couldn't start until the first actually finished. Writing large buffers hides all of these seek/spin based latencies, so you can get really good throughput. But a lot of DB action is small buffers randomly distributed, so you really do need low seek time, of which RAID10 is probably better than RAID5. 
John =:-
Re: [PERFORM] choosing RAID level for xlogs
Anjan Dave wrote: Yes, that's true, though, I am a bit confused because the Clariion array document I am reading talks about how the write cache can eliminate the RAID5 Write Penalty for sequential and large IOs...resulting in better sequential write performance than RAID10. anjan To give a shorter statement after my long one... If you have enough cache that the controller can write out big chunks to the disk at a time, you can get very good sequential RAID5 performance, because the stripe size is large (so it can do a parallel write to all disks). But for small chunk writes, you suffer the penalty of the read before write, and possible multi-disk read (depends on what is in cache). RAID10 generally handles small writes better, and I would guess that 4disks would perform almost identically to 6disks, since you aren't usually writing enough data to span multiple stripes. If your battery-backed cache is big enough that you don't fill it, they probably perform about the same (superfast) since the cache hides the latency of the disks. If you start filling up your cache, RAID5 probably can do better because of the parallelization. But small writes followed by an fsync do favor RAID10 over RAID5. John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] How many views is ok?
Petr Kavan wrote: I have database of company data, and some of them is table of information about employees. I need each employee to have access only to his own row. Postgre cannot do this by system of privileges, because that can give privileges only to whole tables. Possibility is to create a view for each employee that chooses only his data and give employee privileges to this view. But I am not sure if such number of views does not have some performance drawbacks or even if postgre can support it (I expect i can). I would need several tables protected like this and it can result in, say 1000 views in maximum. Because access to DB will go through PHP information system, other possibility to protect data is to let IS connect as more privileged than user really is, but let it retrieve only data for that user. View-approach seems far more clear than this, but im not sure if postgre can handle it without problems. We do a similar thing tying user to per-row permissions. We have 1 view per table, and it works fine. I would recommend that you do something similar. Basically, just make the view: CREATE VIEW just_me SECURITY DEFINER AS SELECT * FROM user_table WHERE username=session_user; REVOKE ALL FROM user_table; GRANT SELECT TO just_me TO PUBLIC; security definer, means that the 'just_me' view will be executed as the user who created the function (superuser). The REVOKE ALL (my syntax might be wrong) prevents users from querying the user tables directly. The 'session_user' makes the view use the name of the actual connected user (because of security definer, plain 'user' is the superuser) This should allow a user to see only their own row in the database. (Whichever rows that have username matching the connected name). Now, this only works if the php front end connects specifically as the given user (our system is setup to do this). If you don't do it this way, you'll always be stuck with the IS layer doing the restriction. Even if you create a view per user, if your PHP layer has the right to look at other tables/views, it doesn't really help. Good luck, John =:- Thanks for any reply :-) --- Petr Kavan Database Development ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq signature.asc Description: OpenPGP digital signature
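Spelled out with syntax that should actually run (a view is always executed with the permissions of its owner, so creating it as the table owner gives the definer-style behaviour described above):

-- As the owner of user_table:
CREATE VIEW just_me AS
    SELECT * FROM user_table WHERE username = session_user;

REVOKE ALL ON user_table FROM PUBLIC;   -- nobody reads the base table directly
GRANT SELECT ON just_me TO PUBLIC;      -- everyone may read the view

As noted, this only restricts anything if each person connects as their own database user; with a shared application login, every session sees the same rows.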
Re: [PERFORM] Performance pb vs SQLServer.
Stéphane COEZ wrote: Hi, I have a performance issue: I run PG (8.0.3) and SQLServer2000 on a Windows2000 Server (P4 1,5Ghz 512Mo). I have a table (3,200,000 rows) and I run this single query: select cod from mytable group by cod. I have an index on cod (char(4) - 88 different values). PG = ~20 sec. SQLServer = 8 sec. The explain is:

HashAggregate (cost=64410.09..64410.09 rows=55 width=8)
  -> Seq Scan on mytable (cost=0.00..56325.27 rows=3233927 width=8)

If I switch to enable_hashagg = false (just for a try...) the planner will choose my index:

Group (cost=0.00..76514.01 rows=55 width=8)
  -> Index Scan using myindex on mytable (cost=0.00..68429.20 rows=3233927 width=8)

but performance will be comparable to the previous test. So with or without using the index I have the same result. Thanks for help. Stéphane COEZ

My guess is that this is part of a larger query. There isn't really much you can do. If you want all 3.2M rows, then you have to wait for them to be pulled in. What you generally can do for performance is to restructure things, so that you *don't* have to touch all 3.2M rows. If you are just trying to determine what the unique entries are for cod, you probably are better off doing some normalization, and keeping a separate table of cod values. I'm guessing the reason your query is faster with SQLServer is because of how postgres handles MVCC. Basically, it still has to fetch the main page to determine if a row exists, while SQL Server doesn't do MVCC, so it can just look things up in the index. You might also try a different query, something like: SELECT DISTINCT cod FROM mytable GROUP BY cod ORDER BY cod; (You may or may not want the ORDER BY or the GROUP BY; try the different combinations.) It might be possible to have the planner realize that all you want is unique rows; just doing a group by doesn't give you that. John =:-
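A concrete version of the "separate table of cod values" idea; keeping it current would be up to the application (a trigger on mytable, or a refresh as part of the load), but reading 88 rows beats scanning 3.2M:

-- One row per distinct code, seeded once from the big table:
CREATE TABLE cod_values (cod char(4) PRIMARY KEY);
INSERT INTO cod_values SELECT DISTINCT cod FROM mytable;

-- The report query then becomes:
SELECT cod FROM cod_values ORDER BY cod;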
Re: [PERFORM] Performance pb vs SQLServer.
Steinar H. Gunderson wrote: On Sun, Aug 14, 2005 at 07:27:38PM -0500, John Arbash Meinel wrote: My guess is that this is part of a larger query. There isn't really much you can do. If you want all 3.2M rows, then you have to wait for them to be pulled in. To me, it looks like he'll get 88 rows, not 3.2M. Surely we must be able to do something better than a full sequential scan in this case? test=# create table foo ( bar char(4) ); CREATE TABLE test=# insert into foo values (''); INSERT 24773320 1 test=# insert into foo values (''); INSERT 24773321 1 test=# insert into foo values (''); INSERT 24773322 1 test=# select * from foo group by bar; bar -- (2 rows) I considered doing some odd magic with generate_series() and subqueries with LIMIT 1, but it was a bit too weird in the end :-) /* Steinar */ I think a plain GROUP BY is not smart enough to detect it doesn't need all rows (since it is generally used because you want to get aggregate values of other columns). I think you would want something like SELECT DISTINCT, possibly with an ORDER BY rather than a GROUP BY (which was my final suggestion). John =:- signature.asc Description: OpenPGP digital signature
[PERFORM] Odd Locking Problem
I'm having an odd case where my system is locking such that if I insert into a table during a transaction, if I start a new connection and transaction, it blocks while trying to do a similar insert until the first transaction is committed or rolled back. The schema is rather complex (currently 157 tables, 200 views), and I still haven't been able to create a small test case. Everything I've tried so far just works. The data is private, but the schema is open source, so I probably could work with someone on it. When I look at the pg_locks table, I seem to be blocked on: SELECT * FROM pg_locks WHERE granted = false; relation | database | transaction | pid | mode | granted --+--+-+---+--+- | | 1525932 | 30175 | ShareLock| f ... Which if I understand correctly, means that the current transaction is intentionally blocking waiting for the other transaction to finish. I'm currently running 8.0.3, but the database was first created under 7.4.? I confirmed this behavior on both systems. Under what circumstances would this occur? To try and outline the situation there is a main object table, which is the root object. It contains a group column which is used for access rights. There is a groupref table, which keeps track of the group rights for each user. (Each user has specific insert,update,select rights per group). The select rights are enforced by views (the tables are not publicly accessible, the views join against the groupref table to check for select permission). Insert and update rights are validated by BEFORE INSERT triggers. Most tables references the object table. Basically it is OO, but doesn't use the postgres inheritance (in our testing postgres inheritance didn't scale well for deep inheritance, and wasn't able to enforce uniqueness anyway.) The views present an OO appearance, and behind the scenes direct table foreign keys maintain referential integrity. I have checked using RAISE NOTICE and the BEFORE INSERT trigger gets all the way to the RETURN statement before things hang, so I haven't figured out what is actually hanging. I have a bzip'd version of the schema and just enough data to be useful available here: http://www.arbash-meinel.com/extras/schema_and_data.sql.bz2 This is the commands to replicate the locking: -- Connect as postgres -- Required before any inserts, so that the TEMP env table is -- created and filled out. select mf_setup_env(); -- Begin a transaction and insert some data BEGIN; INSERT INTO object(vgroup,otype,oname) VALUES ('test',1,'test'); -- Start a new shell, and connect again and do exactly the same thing -- as the above. -- It should hang until you either do END/ROLLBACK in the first -- connection. Thanks for any help, John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Odd Locking Problem
Alvaro Herrera wrote: On Thu, Aug 11, 2005 at 03:36:31PM -0500, John A Meinel wrote: I'm having an odd case where my system is locking such that if I insert into a table during a transaction, if I start a new connection and transaction, it blocks while trying to do a similar insert until the first transaction is committed or rolled back. Are there foreign keys here? I can duplicate the problem easily with them: -- session 1 create table a (a serial primary key); create table b (a int references a); insert into a values (1); begin; insert into b values (1); -- session 2 insert into b values (1); -- hangs Actually, there are but the insert is occurring into table 'a' not table 'b'. 'a' refers to other tables, but these should not be modified. If I commit on session 1, session 2 is unlocked. This is a known problem, solved in 8.1. A workaround for previous releases is to defer FK checks until commit: create table b (a int references a initially deferred); I'll try one of the CVS entries and see if it happens there. Good to hear there has been work done. John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Mostly read performance
Jeffrey Tenny wrote: I have a largely table-append-only application where most transactions are read-intensive and many are read-only. The transactions may span many tables, and in some cases might need to pull 70 MB of data out of a couple of the larger tables. In 7.3, I don't seem to see any file system or other caching that helps with repeated reads of the 70MB of data. Secondary fetches are pretty much as slow as the first fetch. (The 70MB in this example might take place via 2000 calls to a parameterized statement via JDBC). Were there changes after 7.3 w.r.t. caching of data? I read this list and see people saying that 8.0 will use the native file system cache to good effect. Is this true? Is it supposed to work with 7.3? Is there something I need to do to get postgresql to take advatage of large ram systems? Thanks for any advice. Well, first off, the general recommendation is probably that 7.3 is really old, and you should try to upgrade to at least 7.4, though recommended to 8.0. The bigger questions: How much RAM do you have? How busy is your system? 8.0 doesn't really do anything to do make the system cache the data. What kernel are you using? Also, if your tables are small enough, and your RAM is big enough, you might already have everything cached. One way to flush the caches, is to allocate a bunch of memory, and then scan through it. Or maybe mmap a really big file, and access every byte. But if your kernel is smart enough, it could certainly deallocate pages after you stopped accessing them, so I can't say for sure that you can flush the memory cache. Usually, I believe these methods are sufficient. John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Odd Locking Problem
Alvaro Herrera wrote: On Thu, Aug 11, 2005 at 03:36:31PM -0500, John A Meinel wrote: ... This is a known problem, solved in 8.1. A workaround for previous releases is to defer FK checks until commit: So I don't know exactly what the fix was, but I just tested, and my problem is indeed fixed with the latest CVS head. It no longer blocks. create table b (a int references a initially deferred); John =:-
Re: [PERFORM] Speedier count(*)
Dan Harris wrote: I have a web page for my customers that shows them count of records and some min/max date ranges in each table of a database, as this is how we bill them for service. They can log in and check the counts at any time. I'd like for the counts to be as fresh as possible by keeping this dynamic, but I will use a periodic 'snapshot'/cron job if that is the only option to speed this up. I have thought about using the table statistics, but the estimate error is probably unacceptable because of the billing purposes. For some reason, the SQL Server we migrated the app from can return count(*) in a split second on multi-million row tables, even though it is a MUCH slower box hardware-wise, but it's now taking many seconds to run. I have read in the archives the problems MVCC brings into the count(*) dilemma forcing Pg to run a seq scan to get counts. Does SQLServer not use MVCC or have they found another approach for arriving at this number? Compounding all the min/max and counts from other tables and all those queries take about a minute to run. The tables will contain anywhere from 1 million to 40 million rows. I believe SQL Server doesn't use MVCC in the same way. At the very least, it stores some row information in the index, so it can get some info from just an index, without having to go to the actual page (MVCC requires a main page visit to determine visibility.) Depending on how much it impacts performance, you can create an INSERT/UPDATE trigger so that whenever a new entry is added, it automatically updates a statistics table. It would be maintained as you go, rather than periodically like a cron job. I would go Cron if things can be slightly out of date (like 1 hour at least), and you need updates inserts to not be slowed down. Otherwise I think the trigger is nicer, since it doesn't do redundant work, and means everything stays up-to-date. Also, I am using select ... group by ... order by .. limit 1 to get the min/max since I have already been bit by the issue of min() max() being slower. -Dan John =:- signature.asc Description: OpenPGP digital signature
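A minimal sketch of the trigger-maintained counter, assuming a per-table row count is enough (min/max dates could be tracked the same way); big_table stands in for whichever table is being billed:

CREATE TABLE row_counts (table_name text PRIMARY KEY, n bigint NOT NULL);
INSERT INTO row_counts VALUES ('big_table', 0);

-- Only INSERT and DELETE change the count, so UPDATEs need no trigger here.
CREATE OR REPLACE FUNCTION count_rows() RETURNS trigger AS '
BEGIN
    IF TG_OP = ''INSERT'' THEN
        UPDATE row_counts SET n = n + 1 WHERE table_name = TG_RELNAME;
    ELSE
        UPDATE row_counts SET n = n - 1 WHERE table_name = TG_RELNAME;
    END IF;
    RETURN NULL;   -- AFTER trigger, so the return value is ignored
END;
' LANGUAGE plpgsql;

CREATE TRIGGER big_table_count
    AFTER INSERT OR DELETE ON big_table
    FOR EACH ROW EXECUTE PROCEDURE count_rows();

-- The customer page then just reads:
SELECT n FROM row_counts WHERE table_name = 'big_table';

The trade-off is the one mentioned above: every writer now also updates the single counter row, which serializes heavy concurrent loads, so the periodic-snapshot approach stays attractive if inserts must not slow down.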
Re: [PERFORM] Table locking problems?
Dan Harris wrote: On Aug 10, 2005, at 12:49 AM, Steve Poe wrote: Dan, Do you mean you did RAID 1 + 0 (RAID 10) or RAID 0 + 1? Just a clarification, since RAID 0 is still a single-point of failure even if RAID1 is on top of RAID0. Well, you tell me if I stated incorrectly. There are two raid enclosures with 7 drives in each. Each is on its own bus on a dual- channel controller. Each box has a stripe across its drives and the enclosures are mirrors of each other. I understand the controller could be a single point of failure, but I'm not sure I understand your concern about the RAID structure itself. In this configuration, if you have a drive fail on both controllers, the entire RAID dies. Lets label them A1-7, B1-7, because you stripe within a set, if a single one of A dies, and a single one of B dies, you have lost your entire mirror. The correct way of doing it, is to have A1 be a mirror of B1, and then stripe above that. Since you are using 2 7-disk enclosures, I'm not sure how you can do it well, since it is not an even number of disks. Though if you are using software RAID, there should be no problem. The difference is that in this scenario, *all* of the A drives can die, and you haven't lost any data. The only thing you can't lose is a matched pair (eg losing both A1 and B1 will cause complete data loss) I believe the correct notation for this last form is RAID 1 + 0 (RAID10) since you have a set of RAID1 drives, with a RAID0 on-top of them. How many users are connected when your update / delete queries are hanging? Have you done an analyze verbose on those queries? Most of the traffic is from programs we run to do analysis of the data and managing changes. At the time I noticed it this morning, there were 10 connections open to the database. That rarely goes above 20 concurrent. As I said in my other response, I believe that the log will only contain the query at the point the query finishes, so if it never finishes... Have you made changes to the postgresql.conf? kernel.vm settings? IO scheduler? I set shmmax appropriately for my shared_buffers setting, but that's the only kernel tweak. If you're not doing so already, you may consider running sar (iostat) to monitor when the hanging occurs if their is a memory / IO bottleneck somewhere. I will try that. Thanks When you discover that an update is hanging, can you get into the database, and see what locks currently exist? (SELECT * FROM pg_locks) That might help you figure out what is being locked and possibly preventing your updates. It is also possible that your UPDATE query is trying to do something funny (someone just recently was talking about an UPDATE that wanted to do a hash join against 12M rows). Which probably meant that it had to spill to disk, where a merge join would have worked better. John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Table locking problems?
Dan Harris wrote: On Aug 9, 2005, at 3:51 PM, John A Meinel wrote: Dan Harris wrote: On Aug 10, 2005, at 12:49 AM, Steve Poe wrote: Dan, Do you mean you did RAID 1 + 0 (RAID 10) or RAID 0 + 1? Just a clarification, since RAID 0 is still a single-point of failure even if RAID1 is on top of RAID0. Well, you tell me if I stated incorrectly. There are two raid enclosures with 7 drives in each. Each is on its own bus on a dual- channel controller. Each box has a stripe across its drives and the enclosures are mirrors of each other. I understand the controller could be a single point of failure, but I'm not sure I understand your concern about the RAID structure itself. In this configuration, if you have a drive fail on both controllers, the entire RAID dies. Lets label them A1-7, B1-7, because you stripe within a set, if a single one of A dies, and a single one of B dies, you have lost your entire mirror. The correct way of doing it, is to have A1 be a mirror of B1, and then stripe above that. Since you are using 2 7-disk enclosures, I'm not sure how you can do it well, since it is not an even number of disks. Though if you are using software RAID, there should be no problem. The difference is that in this scenario, *all* of the A drives can die, and you haven't lost any data. The only thing you can't lose is a matched pair (eg losing both A1 and B1 will cause complete data loss) I believe the correct notation for this last form is RAID 1 + 0 (RAID10) since you have a set of RAID1 drives, with a RAID0 on-top of them. I have read up on the difference now. I don't understand why it's a single point of failure. Technically any array could be a single point depending on your level of abstraction. In retrospect, I probably should have gone 8 drives in each and used RAID 10 instead for the better fault-tolerance, but it's online now and will require some planning to see if I want to reconfigure that in the future. I wish HP's engineer would have promoted that method instead of 0+1.. I wouldn't say that it is a single point of failure, but I *can* say that it is much more likely to fail. (2 drives rather than on average n drives) If your devices will hold 8 drives, you could simply do 1 8-drive, and one 6-drive. And then do RAID1 with pairs, and RAID0 across the resultant 7 RAID1 sets. I'm really surprised that someone promoted RAID 0+1 over RAID10. I think I've heard that there is a possible slight performance improvement, but really the failure mode makes it a poor tradeoff. John =:- -Dan signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0
Matthew Schumacher wrote: Okay, Here is the status of the SA updates and a question: Michael got SA changed to pass an array of tokens to the proc so right there we gained a ton of performance due to connections and transactions being grouped into one per email instead of one per token. Now I am working on making the proc even faster. Since we have all of the tokens coming in as an array, it should be possible to get this down to just a couple of queries. I have the proc using IN and NOT IN statements to update everything at once from a temp table, but it progressively gets slower because the temp table is growing between vacuums. At this point it's slightly slower than the old update or else insert on every token. I recommend that you drop and re-create the temp table. There is no reason to have it around, considering you delete and re-add everything. That means you never have to vacuum it, since it always only contains the latest rows. What I really want to do is have the token array available as a record so that I can query against it, but not have it take up the resources of a real table. If I could copy from an array into a record then I can even get rid of the loop. Anyone have any thoughts on how to do this? My one question here, is the inspam_count and inham_count *always* the same for all tokens? I would have thought each token has it's own count. Anyway, there are a few lines I would change: CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER, intokenary BYTEA[], inspam_count INTEGER, inham_count INTEGER, inatime INTEGER) RETURNS VOID AS ' DECLARE _token BYTEA; BEGIN -- create the table at the start of the procedure CREATE TEMP TABLE bayes_token_tmp (intoken bytea); -- You might also add primary key if you are going to be adding -- *lots* of entries, but it sounds like you are going to have -- less than 1 page, so it doesn't matter for i in array_lower(intokenary, 1) .. 
array_upper(intokenary, 1) LOOP
    _token := intokenary[i];
    INSERT INTO bayes_token_tmp VALUES (_token);
  END LOOP;

  UPDATE bayes_token
     SET spam_count = greatest_int(spam_count + inspam_count, 0),
         ham_count  = greatest_int(ham_count + inham_count, 0),
         atime      = greatest_int(atime, 1000)
   WHERE id = inuserid
     AND -- (token) IN (SELECT intoken FROM bayes_token_tmp);
         EXISTS (SELECT token FROM bayes_token_tmp
                  WHERE intoken = token LIMIT 1);

  -- I would also avoid your intoken (NOT) IN (SELECT token FROM
  -- bayes_token) There are a few possibilities, but to me
  -- as your bayes_token table becomes big, this will start
  -- to be the slow point
  -- Rather than doing 2 NOT IN queries, it *might* be faster to do
  DELETE FROM bayes_token_tmp
   WHERE NOT EXISTS (SELECT token FROM bayes_token
                      WHERE token = intoken);

  UPDATE bayes_vars
     SET -- token_count = token_count + (SELECT count(intoken) FROM
         -- bayes_token_tmp WHERE intoken NOT IN (SELECT token FROM bayes_token)),
         token_count = token_count + (SELECT count(intoken) FROM bayes_token_tmp),
         -- You don't need the where NOT IN, since we already removed those rows
         newest_token_age = greatest_int(newest_token_age, inatime),
         oldest_token_age = least_int(oldest_token_age, inatime)
   WHERE id = inuserid;

  INSERT INTO bayes_token
  SELECT inuserid, intoken, inspam_count, inham_count, inatime
    FROM bayes_token_tmp
   WHERE (inspam_count > 0 OR inham_count > 0)
  -- AND
  -- (intoken) NOT IN (SELECT token FROM bayes_token);
  -- You don't need either of those lines, again because we already
  -- filtered

  -- delete from bayes_token_tmp;
  -- And rather than deleting all of the entries just
  DROP TABLE bayes_token_tmp;

  RETURN;
END;
' LANGUAGE 'plpgsql';

CREATE OR REPLACE FUNCTION greatest_int (integer, integer)
RETURNS INTEGER IMMUTABLE STRICT
AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;' LANGUAGE SQL;

CREATE OR REPLACE FUNCTION least_int (integer, integer)
RETURNS INTEGER IMMUTABLE STRICT
AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;' LANGUAGE SQL;

So to clarify, here is my finished function:

CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
                                      intokenary BYTEA[],
                                      inspam_count INTEGER,
                                      inham_count INTEGER,
                                      inatime INTEGER)
RETURNS VOID AS '
DECLARE
  _token BYTEA;
BEGIN
  CREATE TEMP TABLE bayes_token_tmp (intoken bytea);
  for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1) LOOP
    _token :=
Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0
Tom Lane wrote: Matthew Schumacher [EMAIL PROTECTED] writes: for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1) LOOP _token := intokenary[i]; INSERT INTO bayes_token_tmp VALUES (_token); END LOOP; UPDATE bayes_token SET spam_count = greatest_int(spam_count + inspam_count, 0), ham_count = greatest_int(ham_count + inham_count , 0), atime = greatest_int(atime, 1000) WHERE id = inuserid AND (token) IN (SELECT intoken FROM bayes_token_tmp); I don't really see why you think that this path is going to lead to better performance than where you were before. Manipulation of the temp table is never going to be free, and IN (sub-select) is always inherently not fast, and NOT IN (sub-select) is always inherently awful. Throwing a pile of simple queries at the problem is not necessarily the wrong way ... especially when you are doing it in plpgsql, because you've already eliminated the overhead of network round trips and repeated planning of the queries. So for an IN (sub-select), does it actually pull all of the rows from the other table, or is the planner smart enough to stop once it finds something? Is IN (sub-select) about the same as EXISTS (sub-select WHERE x=y)? What about NOT IN (sub-select) versus NOT EXISTS (sub-select WHERE x=y) I would guess that the EXISTS/NOT EXISTS would be faster, though it probably would necessitate using a nested loop (at least that seems to be the way the query is written). I did some tests on a database with 800k rows, versus a temp table with 2k rows. I did one sequential test (1-2000, with 66 rows missing), and one sparse test (1-200, 10-100200, 20-200200, ... with 658 rows missing). If found that NOT IN did indeed have to load the whole table. IN was smart enough to do a nested loop. EXISTS and NOT EXISTS did a sequential scan on my temp table, with a SubPlan filter (which looks a whole lot like a Nested Loop). What I found was that IN performed about the same as EXISTS (since they are both effectively doing a nested loop), but that NOT IN took 4,000ms while NOT EXISTS was the same speed as EXISTS at around 166ms. Anyway, so it does seem like NOT IN is not a good choice, but IN seems to be equivalent to EXISTS, and NOT EXISTS is also very fast. Is this generally true, or did I just get lucky on my data? John =:- regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? 
http://www.postgresql.org/docs/faq mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE id IN (SELECT id FROM object_t); QUERY PLAN - Nested Loop IN Join (cost=0.00..9851.68 rows=2140 width=4) (actual time=0.085..183.889 rows=1351 loops=1) - Seq Scan on ids (cost=0.00..31.40 rows=2140 width=4) (actual time=0.014..24.032 rows=2009 loops=1) - Index Scan using object_t_pkey on object_t (cost=0.00..4.58 rows=1 width=4) (actual time=0.071..0.071 rows=1 loops=2009) Index Cond: (outer.id = object_t.id) Total runtime: 184.823 ms (5 rows) Time: 186.931 ms mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE EXISTS (SELECT id FROM object_t o WHERE o.id = ids.id); QUERY PLAN - Seq Scan on ids (cost=0.00..9824.93 rows=1070 width=4) (actual time=0.086..165.053 rows=1351 loops=1) Filter: (subplan) SubPlan - Index Scan using object_t_pkey on object_t o (cost=0.00..4.58 rows=1 width=4) (actual time=0.025..0.025 rows=1 loops=2009) Index Cond: (id = $0) Total runtime: 165.995 ms (6 rows) Time: 167.795 ms mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE id NOT IN (SELECT id FROM object_t); QUERY PLAN --- Seq Scan on ids (cost=36410.51..36447.26 rows=1070 width=4) (actual time=4168.247..4172.080 rows=658 loops=1) Filter: (NOT (hashed subplan)) SubPlan - Seq Scan on object_t (cost=0.00..34381.81 rows=811481 width=4) (actual time=0.044..2464.296 rows=811481 loops=1) Total runtime: 4210.784 ms (5 rows) Time: 4212.276 ms mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE NOT EXISTS (SELECT id FROM object_t o WHERE o.id = ids.id); QUERY PLAN - Seq Scan on ids (cost=0.00..9824.93 rows=1070 width=4) (actual
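For easy comparison, here are the four query shapes from that test collected in one place (same ids and object_t tables as in the plans above; prefix each with EXPLAIN ANALYZE to reproduce the timings):

  -- IN (sub-select): planned as a nested-loop "IN join"
  SELECT id FROM ids WHERE id IN (SELECT id FROM object_t);

  -- EXISTS: correlated sub-plan, effectively the same nested loop
  SELECT id FROM ids WHERE EXISTS
      (SELECT id FROM object_t o WHERE o.id = ids.id);

  -- NOT IN: forces the whole sub-select to be scanned and hashed first
  SELECT id FROM ids WHERE id NOT IN (SELECT id FROM object_t);

  -- NOT EXISTS: stays a correlated index probe per outer row
  SELECT id FROM ids WHERE NOT EXISTS
      (SELECT id FROM object_t o WHERE o.id = ids.id);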
Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0
Matthew Schumacher wrote: Matthew Schumacher wrote: Tom Lane wrote: I don't really see why you think that this path is going to lead to better performance than where you were before. Manipulation of the temp table is never going to be free, and IN (sub-select) is always inherently not fast, and NOT IN (sub-select) is always inherently awful. Throwing a pile of simple queries at the problem is not necessarily the wrong way ... especially when you are doing it in plpgsql, because you've already eliminated the overhead of network round trips and repeated planning of the queries. regards, tom lane The reason why I think this may be faster is because I would avoid running an update on data that needs to be inserted which saves searching though the table for a matching token. Perhaps I should do the insert first, then drop those tokens from the temp table, then do my updates in a loop. I'll have to do some benchmarking... schu Tom, I think your right, whenever I do a NOT IN it does a full table scan against bayes_token and since that table is going to get very big doing the simple query in a loop that uses an index seems a bit faster. John, thanks for your help, it was worth a try, but it looks like the looping is just faster. Here is what I have so far in case anyone else has ideas before I abandon it: Surely this isn't what you have. You have *no* loop here, and you have stuff like: AND (bayes_token_tmp) NOT IN (SELECT token FROM bayes_token); I'm guessing this isn't your last version of the function. As far as putting the CREATE TEMP TABLE inside the function, I think the problem is that the first time it runs, it compiles the function, and when it gets to the UPDATE/INSERT with the temporary table name, at compile time it hard-codes that table id. I tried getting around it by using EXECUTE which worked, but it made the function horribly slow. So I don't recommend it. Anyway, if you want us to evaluate it, you really need to send us the real final function. John =:- CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER, intokenary BYTEA[], inspam_count INTEGER, inham_count INTEGER, inatime INTEGER) RETURNS VOID AS ' DECLARE _token BYTEA; BEGIN UPDATE bayes_token SET spam_count = greatest_int(spam_count + inspam_count, 0), ham_count = greatest_int(ham_count + inham_count , 0), atime = greatest_int(atime, inatime) WHERE id = inuserid AND (token) IN (SELECT bayes_token_tmp FROM bayes_token_tmp(intokenary)); UPDATE bayes_vars SET token_count = token_count + ( SELECT count(bayes_token_tmp) FROM bayes_token_tmp(intokenary) WHERE bayes_token_tmp NOT IN (SELECT token FROM bayes_token)), newest_token_age = greatest_int(newest_token_age, inatime), oldest_token_age = least_int(oldest_token_age, inatime) WHERE id = inuserid; INSERT INTO bayes_token SELECT inuserid, bayes_token_tmp, inspam_count, inham_count, inatime FROM bayes_token_tmp(intokenary) WHERE (inspam_count 0 OR inham_count 0) AND (bayes_token_tmp) NOT IN (SELECT token FROM bayes_token); RETURN; END; ' LANGUAGE 'plpgsql'; CREATE OR REPLACE FUNCTION bayes_token_tmp(intokenary BYTEA[]) RETURNS SETOF bytea AS ' BEGIN for i in array_lower(intokenary, 1) .. 
array_upper(intokenary, 1) LOOP
    return next intokenary[i];
  END LOOP;
  RETURN;
end
' language 'plpgsql';

CREATE OR REPLACE FUNCTION greatest_int (integer, integer)
RETURNS INTEGER IMMUTABLE STRICT
AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;' LANGUAGE SQL;

CREATE OR REPLACE FUNCTION least_int (integer, integer)
RETURNS INTEGER IMMUTABLE STRICT
AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;' LANGUAGE SQL;

---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org signature.asc Description: OpenPGP digital signature
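As a usage note, a call to the finished put_tokens function would look roughly like this (the user id, token values, counts, and atime are made up for illustration):

  SELECT put_tokens(1,
                    ARRAY['token_a'::bytea, 'token_b'::bytea, 'token_c'::bytea],
                    1,       -- inspam_count
                    0,       -- inham_count
                    10000);  -- inatime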
Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0
Matthew Schumacher wrote: John A Meinel wrote: Surely this isn't what you have. You have *no* loop here, and you have stuff like: AND (bayes_token_tmp) NOT IN (SELECT token FROM bayes_token); I'm guessing this isn't your last version of the function. As far as putting the CREATE TEMP TABLE inside the function, I think the problem is that the first time it runs, it compiles the function, and when it gets to the UPDATE/INSERT with the temporary table name, at compile time it hard-codes that table id. I tried getting around it by using EXECUTE which worked, but it made the function horribly slow. So I don't recommend it. Anyway, if you want us to evaluate it, you really need to send us the real final function. John =:- It is the final function. It doesn't need a loop because of the bayes_token_tmp function I added. The array is passed to it and it returns a record set so I can work off of it like it's a table. So the function works the same way it before, but instead of using SELECT intoken from TEMPTABLE, you use SELECT bayes_token_tmp from bayes_token_tmp(intokenary). I think this is more efficient than the create table overhead, especially because the incoming record set won't be to big. Thanks, schu Well, I would at least recommend that you change the WHERE bayes_token_tmp NOT IN (SELECT token FROM bayes_token) with a WHERE NOT EXISTS (SELECT toke FROM bayes_token WHERE token=bayes_token_tmp) You might try experimenting with the differences, but on my system the NOT IN has to do a full sequential scan on bayes_token and load all entries into a list, while NOT EXISTS can do effectively a nested loop. The nested loop requires that there is an index on bayes_token(token), but I'm pretty sure there is anyway. Again, in my testing, it was a difference of 4200ms versus 180ms. (800k rows in my big table, 2k in the temp one) John =:- signature.asc Description: OpenPGP digital signature
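Spelled out against the finished function above, the suggested change to the INSERT's filter inside put_tokens would look something like this (a sketch only: the comparison in the count filter is assumed to be > 0, and the subquery's select list is written as token rather than the "toke" typo above):

  INSERT INTO bayes_token
  SELECT inuserid, bayes_token_tmp, inspam_count, inham_count, inatime
    FROM bayes_token_tmp(intokenary)
   WHERE (inspam_count > 0 OR inham_count > 0)
     -- NOT EXISTS can use the index on bayes_token(token) row by row,
     -- instead of materializing the whole bayes_token table as NOT IN does
     AND NOT EXISTS (SELECT token FROM bayes_token
                      WHERE token = bayes_token_tmp);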
Re: [PERFORM] Is There A Windows Version of Performance Tuning Documents?
Lane Van Ingen wrote: I have in my possession some performance tuning documents authored by Bruce Momjian, Josh Berkus, and others. They give good information on utilities to use (like ipcs, sar, vmstat, etc) to evaluate disk, memory, etc. performance on Unix-based systems. Problem is, I have applications running on Windows 2003, and have worked mostly on Unix before. Was wondering if anyone knows where there might be a Windows performance document that tells what to use / where to look in Windows for some of this data. I am thinking that I may not seeing what I need in perfmon or the Windows task manager. Want to answer questions like: How much memory is being used for disk buffer cache? How to I lock shared memory for PostgreSQL (if possible at all)? How to determine if SWAP (esp. page-in) activity is hurting me? Does Windows use a 'unified buffer cache' or not? How do I determine how much space is required to do most of my sorts in RAM? I don't know of any specific documentation. I would mention the TaskManager as the first place I would look (Ctrl+Shift+Esc, or right click on the task bar). You can customize the columns that it shows in the process view, so you can get an idea if something is paging, how much I/O it is using, etc. I'm sure there are other better tools, but this one is pretty easy to get to, and shows quite a bit. John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Planner incorrectly choosing seq scan over index scan
Meetesh Karia wrote: Hi all, We're using 8.0.3 and we're seeing a problem where the planner is choosing a seq scan and hash join over an index scan. If I set enable_hashjoin to off, then I get the plan I'm expecting and the query runs a lot faster. I've also tried lowering the random page cost (even to 1) but the planner still chooses to use the hash join. Does anyone have any thoughts/suggestions? I saw that there was a thread recently in which the planner wasn't correctly estimating the cost for queries using LIMIT. Is it possible that something similar is happening here (perhaps because of the sort) and that the patch Tom proposed would fix it? Thanks. Here are the various queries and plans: Normal settings ... QUERY PLAN Sort (cost=13430.57..13439.24 rows=3467 width=48) (actual time=1390.000..1390.000 rows=3467 loops=1) Sort Key: c.sourceid, c.targetid - Merge Join (cost=9912.07..13226.72 rows=3467 width=48) (actual time=1344.000..1375.000 rows=3467 loops=1) Merge Cond: (outer.user_id = inner.sourceid) - Index Scan using lte_user_pkey on lte_user s (cost=0.00..16837.71 rows=279395 width=16) (actual time=0.000..95.000 rows=50034 loops=1) This is where the planner is messing up, and mis-estimating the selectivity. It is expecting to get 280k rows, but only needs to get 50k. I assume lte_user is the bigger table, and that candidates617004 has some subset. Has lte_user and candidates617004 been recently ANALYZEd? All estimates, except for the expected number of rows from lte_user seem to be okay. Is user_id the primary key for lte_user? I'm trying to figure out how you can get 50k rows, by searching a primary key, against a 3.5k rows. Is user_id only part of the primary key for lte_user? Can you give us the output of: \d lte_user \d candidates617004 So that we have the description of the tables, and what indexes you have defined? Also, if you could describe the table layouts, that would help. 
John =:- - Sort (cost=9912.07..9920.73 rows=3467 width=40) (actual time=1156.000..1156.000 rows=3467 loops=1) Sort Key: c.sourceid - Hash Join (cost=8710.44..9708.21 rows=3467 width=40) (actual time=1125.000..1156.000 rows=3467 loops=1) Hash Cond: (outer.targetid = inner.user_id) - Seq Scan on candidates617004 c (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..0.000 rows=3467 loops=1) - Hash (cost=8011.95..8011.95 rows=279395 width=16) (actual time=1125.000..1125.000 rows=0 loops=1) - Seq Scan on lte_user t (cost=0.00..8011.95 rows=279395 width=16) (actual time=0.000..670.000 rows=279395 loops=1) Total runtime: 1406.000 ms enable_hashjoin disabled QUERY PLAN Sort (cost=14355.37..14364.03 rows=3467 width=48) (actual time=391.000..391.000 rows=3467 loops=1) Sort Key: c.sourceid, c.targetid - Nested Loop (cost=271.52..14151.51 rows=3467 width=48) (actual time=203.000..359.000 rows=3467 loops=1) - Merge Join (cost=271.52..3490.83 rows=3467 width=40) (actual time=203.000..218.000 rows=3467 loops=1) Merge Cond: (outer.user_id = inner.sourceid) - Index Scan using lte_user_pkey on lte_user s (cost=0.00..16837.71 rows=279395 width=16) (actual time=0.000..126.000 rows=50034 loops=1) - Sort (cost=271.52..280.19 rows=3467 width=32) (actual time=15.000..30.000 rows=3467 loops=1) Sort Key: c.sourceid - Seq Scan on candidates617004 c (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..0.000 rows=3467 loops=1) - Index Scan using lte_user_pkey on lte_user t (cost=0.00..3.03 rows=1 width=16) (actual time=0.031..0.036 rows=1 loops=3467) Index Cond: (outer.targetid = t.user_id) Total runtime: 406.000 ms random_page_cost set to 1.5 -- QUERY PLAN Sort (cost=12702.62..12711.29 rows=3467 width=48) (actual time=1407.000..1407.000 rows=3467 loops=1) Sort Key: c.sourceid, c.targetid - Merge Join (cost=9912.07..12498.77 rows=3467 width=48) (actual time=1391.000..1407.000 rows=3467 loops=1) Merge Cond: (outer.user_id = inner.sourceid) - Index Scan using lte_user_pkey on lte_user s (cost=0.00..12807.34 rows=279395 width=16) (actual time=0.000..46.000 rows=50034 loops=1) - Sort (cost=9912.07..9920.73 rows=3467 width=40) (actual time=1188.000..1188.000 rows=3467 loops=1) Sort Key: c.sourceid - Hash Join (cost=8710.44..9708.21 rows=3467 width=40) (actual time=1157.000..1188.000 rows=3467 loops=1) Hash Cond: (outer.targetid = inner.user_id) - Seq Scan on candidates617004 c (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..15.000 rows=3467 loops=1)
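A quick way to gather what is being asked for above, sketched in psql (the join condition is taken from the plans; run ANALYZE first if the tables have changed recently):

  \d lte_user
  \d candidates617004

  ANALYZE lte_user;
  ANALYZE candidates617004;

  -- sanity-check the join's row estimate in isolation
  EXPLAIN ANALYZE
  SELECT count(*)
    FROM candidates617004 c
    JOIN lte_user s ON s.user_id = c.sourceid;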
Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0
John Arbash Meinel wrote: Matthew Schumacher wrote: All it's doing is trying the update before the insert to get around the problem of not knowing which is needed. With only 2-3 of the queries implemented I'm already back to running about the same speed as the original SA proc that is going to ship with SA 3.1.0. All of the queries are using indexes so at this point I'm pretty convinced that the biggest problem is the sheer number of queries required to run this proc 200 times for each email (once for each token). I don't see anything that could be done to make this much faster on the postgres end, it's looking like the solution is going to involve cutting down the number of queries some how. One thing that is still very puzzling to me is why this runs so much slower when I put the data.sql in a transaction. Obviously transactions are acting different when you call a proc a zillion times vs an insert query. Well, I played with adding a COMMIT;BEGIN; statement to your exact test every 1000 lines. And this is what I got: Just for reference, I also tested this on my old server, which is a dual Celeron 450 with 256M ram. FC4 and Postgres 8.0.3 Unmodified: real54m15.557s user0m24.328s sys 0m14.200s With Transactions every 1000 selects, and vacuum every 5000: real8m36.528s user0m16.585s sys 0m12.569s With Transactions every 1000 selects, and vacuum every 1: real7m50.748s user0m16.183s sys 0m12.489s On this machine vacuum is more expensive, since it doesn't have as much ram. Anyway, on this machine, I see approx 7x improvement. Which I think is probably going to satisfy your spamassassin needs. John =:- PS Looking forward to having a spamassassin that can utilize my favorite db. Right now, I'm not using a db backend because it wasn't worth setting up mysql. Unmodified: real17m53.587s user0m6.204s sys 0m3.556s With BEGIN/COMMIT: real1m53.466s user0m5.203s sys 0m3.211s So I see the potential for improvement almost 10 fold by switching to transactions. I played with the perl script (and re-implemented it in python), and for the same data as the perl script, using COPY instead of INSERT INTO means 5s instead of 33s. I also played around with adding VACUUM ANALYZE every 10 COMMITS, which brings the speed to: real1m41.258s user0m5.394s sys 0m3.212s And doing VACUUM ANALYZE every 5 COMMITS makes it: real1m46.403s user0m5.597s sys 0m3.244s I'm assuming the slowdown is because of the extra time spent vacuuming. Overall performance might still be improving, since you wouldn't actually be inserting all 100k rows at once. ... This is all run on Ubuntu, with postgres 7.4.7, and a completely unchanged postgresql.conf. (But the machine is a dual P4 2.4GHz, with 3GB of RAM). John =:- signature.asc Description: OpenPGP digital signature
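Concretely, the modification described above amounts to interleaving transaction boundaries (and, in the later runs, a periodic VACUUM ANALYZE) with the statements already in data.sql, along these lines:

  BEGIN;
  -- ... the next 1000 statements from data.sql, unchanged ...
  COMMIT;
  VACUUM ANALYZE;   -- every 5000 statements (or every 10 commits) in the tests above
  BEGIN;
  -- ... the following 1000 statements ...
  COMMIT;

Note that VACUUM cannot run inside a transaction block, which is why it sits between the COMMIT and the next BEGIN.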
Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0
Matthew Schumacher wrote: All it's doing is trying the update before the insert to get around the problem of not knowing which is needed. With only 2-3 of the queries implemented I'm already back to running about the same speed as the original SA proc that is going to ship with SA 3.1.0. All of the queries are using indexes so at this point I'm pretty convinced that the biggest problem is the sheer number of queries required to run this proc 200 times for each email (once for each token). I don't see anything that could be done to make this much faster on the postgres end, it's looking like the solution is going to involve cutting down the number of queries some how. One thing that is still very puzzling to me is why this runs so much slower when I put the data.sql in a transaction. Obviously transactions are acting different when you call a proc a zillion times vs an insert query. Well, I played with adding a COMMIT;BEGIN; statement to your exact test every 1000 lines. And this is what I got: Unmodified: real17m53.587s user0m6.204s sys 0m3.556s With BEGIN/COMMIT: real1m53.466s user0m5.203s sys 0m3.211s So I see the potential for improvement almost 10 fold by switching to transactions. I played with the perl script (and re-implemented it in python), and for the same data as the perl script, using COPY instead of INSERT INTO means 5s instead of 33s. I also played around with adding VACUUM ANALYZE every 10 COMMITS, which brings the speed to: real1m41.258s user0m5.394s sys 0m3.212s And doing VACUUM ANALYZE every 5 COMMITS makes it: real1m46.403s user0m5.597s sys 0m3.244s I'm assuming the slowdown is because of the extra time spent vacuuming. Overall performance might still be improving, since you wouldn't actually be inserting all 100k rows at once. Just to complete the reference, the perl version runs as: 10:44:02 -- START 10:44:35 -- AFTER TEMP LOAD : loaded 120596 records 10:44:39 -- AFTER bayes_token INSERT : inserted 49359 new records into bayes_token 10:44:41 -- AFTER bayes_vars UPDATE : updated 1 records 10:46:42 -- AFTER bayes_token UPDATE : updated 47537 records DONE My python version runs as: 00:22:54 -- START 00:23:00 -- AFTER TEMP LOAD : loaded 120596 records 00:23:03 -- AFTER bayes_token INSERT : inserted 49359 new records into bayes_token 00:23:06 -- AFTER bayes_vars UPDATE : updated 1 records 00:25:04 -- AFTER bayes_token UPDATE : updated 47537 records DONE The python is effectively just a port of the perl code (with many lines virtually unchanged), and really the only performance difference is that the initial data load is much faster with a COPY. This is all run on Ubuntu, with postgres 7.4.7, and a completely unchanged postgresql.conf. (But the machine is a dual P4 2.4GHz, with 3GB of RAM). John =:- Anyway, if anyone else has any ideas I'm all ears, but at this point it's looking like raw query speed is needed for this app and while I don't care for mysql as a database, it does have the speed going for it. schu ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0
Matthew Schumacher wrote: Tom Lane wrote: I looked into this a bit. It seems that the problem when you wrap the entire insertion series into one transaction is associated with the fact that the test does so many successive updates of the single row in bayes_vars. (VACUUM VERBOSE at the end of the test shows it cleaning up 49383 dead versions of the one row.) This is bad enough when it's in separate transactions, but when it's in one transaction, none of those dead row versions can be marked fully dead yet --- so for every update of the row, the unique-key check has to visit every dead version to make sure it's dead in the context of the current transaction. This makes the process O(N^2) in the number of updates per transaction. Which is bad enough if you just want to do one transaction per message, but it's intolerable if you try to wrap the whole bulk-load scenario into one transaction. I'm not sure that we can do anything to make this a lot smarter, but in any case, the real problem is to not do quite so many updates of bayes_vars. How constrained are you as to the format of the SQL generated by SpamAssassin? In particular, could you convert the commands generated for a single message into a single statement? I experimented with passing all the tokens for a given message as a single bytea array, as in the attached, and got almost a factor of 4 runtime reduction on your test case. BTW, it's possible that this is all just a startup-transient problem: once the database has been reasonably well populated, one would expect new tokens to be added infrequently, and so the number of updates to bayes_vars ought to drop off. regards, tom lane The spamassassins bayes code calls the _put_token method in the storage module a loop. This means that the storage module isn't called once per message, but once per token. Well, putting everything into a transaction per email might make your pain go away. If you saw the email I just sent, I modified your data.sql file to add a COMMIT;BEGIN every 1000 selects, and I saw a performance jump from 18 minutes down to less than 2 minutes. Heck, on my machine, the advanced perl version takes more than 2 minutes to run. It is actually slower than the data.sql with commit statements. I'll look into modifying it to so that the bayes code passes a hash of tokens to the storage module where they can loop or in the case of the pgsql module pass an array of tokens to a procedure where we loop and use temp tables to make this much more efficient. Well, you could do that. Or you could just have the bayes code issue BEGIN; when it starts processing an email, and a COMMIT; when it finishes. From my testing, you will see an enormous speed improvement. (And you might consider including a fairly frequent VACUUM ANALYZE) I don't have much time this weekend to toss at this, but will be looking at it on Monday. Good luck, John =:- Thanks, schu ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0
Josh Berkus wrote: Dennis, EXCEPTION WHEN unique_violation THEN I seem to remember that catching an exception in a PL/pgSQL procedure was a large performance cost. It'd be better to do UPDATE ... IF NOT FOUND. Actually, he was doing an implicit UPDATE IF NOT FOUND in that he was doing: UPDATE; IF FOUND THEN return; INSERT; EXCEPTION ... So really, the exception should never be triggered. John =:- signature.asc Description: OpenPGP digital signature
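For what it's worth, the shape being described is roughly the following (a minimal sketch; the counters table, its columns, and the function name are invented):

  CREATE OR REPLACE FUNCTION upsert_counter(in_id integer, in_delta integer)
  RETURNS void AS '
  BEGIN
      UPDATE counters SET hits = hits + in_delta WHERE id = in_id;
      IF FOUND THEN
          RETURN;
      END IF;
      BEGIN
          INSERT INTO counters (id, hits) VALUES (in_id, in_delta);
      EXCEPTION WHEN unique_violation THEN
          -- only reached if another session inserted the same key
          -- between our UPDATE and our INSERT
          UPDATE counters SET hits = hits + in_delta WHERE id = in_id;
      END;
      RETURN;
  END;
  ' LANGUAGE 'plpgsql';

In the common case the UPDATE either succeeds or the INSERT goes through cleanly, so the exception handler is only a guard against a concurrent insert and should almost never fire.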
Re: [PERFORM] Two queries are better than one?
Karim Nassar wrote: I ran into a situation today maintaining someone else's code where the sum time running 2 queries seems to be faster than 1. The original code was split into two queries. I thought about joining them, but considering the intelligence of my predecessor, I wanted to test it. The question is, which technique is really faster? Is there some hidden setup cost I don't see with explain analyze? Yes, the time it takes your user code to parse the result, and create the new query. :) It does seem like you are taking an extra 0.1ms for the combined query, but that means you don't have another round trip to the database. So that would mean one less context switch, and you don't need to know what the cli_code is before you can get the cli_name. I would guess the overhead is the time for postgres to parse out the text, place another index query, and then combine the rows. It seems like this shouldn't take 0.1ms, but then again, that isn't very long. Also, did you run it *lots* of times to make sure that this isn't just noise? John =:- Postgres 7.4.7, Redhat AES 3 Each query individually: test= explain analyze test- select * from order WHERE ord_batch='343B' AND ord_id='12-645'; QUERY PLAN Index Scan using order_pkey on order (cost=0.00..6.02 rows=1 width=486) (actual time=0.063..0.066 rows=1 loops=1) Index Cond: ((ord_batch = '343B'::bpchar) AND (ord_id = '12-645'::bpchar)) Total runtime: 0.172 ms (3 rows) test= explain analyze test- select cli_name from client where cli_code='1837'; QUERY PLAN - Index Scan using client_pkey on client (cost=0.00..5.98 rows=2 width=39) (actual time=0.043..0.047 rows=1 loops=1) Index Cond: (cli_code = '1837'::bpchar) Total runtime: 0.112 ms (3 rows) Joined: test= explain analyze test-SELECT cli_name,order.* test- FROM order test- JOIN client ON (ord_client = cli_code) test- WHERE ord_batch='343B' AND ord_id='12-645'; QUERY PLAN -- Nested Loop (cost=0.00..12.00 rows=2 width=525) (actual time=0.120..0.128 rows=1 loops=1) - Index Scan using order_pkey on order (cost=0.00..6.02 rows=1 width=486) (actual time=0.064..0.066 rows=1 loops=1) Index Cond: ((ord_batch = '343B'::bpchar) AND (ord_id = '12-645'::bpchar)) - Index Scan using client_pkey on client (cost=0.00..5.98 rows=1 width=51) (actual time=0.023..0.026 rows=1 loops=1) Index Cond: (outer.ord_client = client.cli_code) Total runtime: 0.328 ms (6 rows) signature.asc Description: OpenPGP digital signature
[PERFORM] Cheap RAM disk?
I saw a review of a relatively inexpensive RAM disk over at anandtech.com, the Gigabyte i-RAM http://www.anandtech.com/storage/showdoc.aspx?i=2480 Basically, it is a PCI card, which takes standard DDR RAM, and has a SATA port on it, so that to the system, it looks like a normal SATA drive. The card costs about $100-150, and you fill it with your own ram, so for a 4GB (max size) disk, it costs around $500. Looking for solid state storage devices, the cheapest I found was around $5k for 2GB. Gigabyte claims that the battery backup can last up to 16h, which seems decent, if not really long (the $5k solution has a built-in harddrive so that if the power goes out, it uses the battery power to copy the ramdisk onto the harddrive for more permanent storage). Anyway, would something like this be reasonable as a drive for storing pg_xlog? With 4GB you could have as many as 256 checkpoint segments. I'm a little leary as it is definitely a version 1.0 product (it is still using an FPGA as the controller, so they were obviously pushing to get the card into production). But it seems like this might be a decent way to improve insert performance, without setting fsync=false. Probably it should see some serious testing (as in power spikes/pulled plugs, etc). I know the article made some claim that if you actually pull out the card it goes into high consumption mode which is somehow greater than if you leave it in the slot with the power off. Which to me seems like a lot of bull, and really means the 16h is only under best-case circumstances. But even 1-2h is sufficient to handle a simple power outage. And if you had a UPS with detection of power failure, you could always sync the ramdisk to a local partition before the power goes out. Though you could do that with a normal in-memory ramdisk (tmpfs) without having to buy the card. Though it does give you up-to an extra 4GB of ram, for machines which have already maxed out their slots. Anyway, I thought I would mention it to the list, to see if anyone else has heard of it, or has any thoughts on the matter. I'm sure there are some people who are using more expensive ram disks, maybe they have some ideas about what this device is missing. (other than costing about 1/10th the price) John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] faster INSERT with possible pre-existing row?
Dan Harris wrote: I am working on a process that will be inserting tens of millions of rows and need this to be as quick as possible. The catch is that for each row I could potentially insert, I need to look and see if the relationship is already there to prevent multiple entries. Currently I am doing a SELECT before doing the INSERT, but I recognize the speed penalty in doing two operations. I wonder if there is some way I can say insert this record, only if it doesn't exist already. To see if it exists, I would need to compare 3 fields instead of just enforcing a primary key. Even if this could be a small increase per record, even a few percent faster compounded over the whole load could be a significant reduction. Thanks for any ideas you might have. -Dan You could insert all of your data into a temporary table, and then do: INSERT INTO final_table SELECT * FROM temp_table WHERE NOT EXISTS (SELECT 1 FROM final_table WHERE final_table.id=temp_table.id AND final_table.path=temp_table.path AND final_table.y=temp_table.y); Or you could load it into the temporary table, and then: DELETE FROM temp_table WHERE EXISTS (SELECT 1 FROM final_table WHERE id...); And then do a plain INSERT INTO. I can't say what the specific performance increases would be, but temp_table could certainly be an actual TEMP table (meaning it only exists during the connection), and you could easily do a COPY into that table to load it up quickly, without having to check any constraints. Just a thought, John =:- signature.asc Description: OpenPGP digital signature
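A slightly fuller sketch of that idea, with the temp table created and bulk-loaded via COPY (all table names, column names, and the file path are placeholders):

  CREATE TEMP TABLE load_tmp (id integer, path text, y integer);

  COPY load_tmp FROM '/tmp/new_rows.copy';   -- or \copy from the client side

  -- insert only the combinations of (id, path, y) not already present;
  -- assumes the incoming batch itself contains no duplicates
  INSERT INTO final_table (id, path, y)
  SELECT id, path, y
    FROM load_tmp
   WHERE NOT EXISTS (SELECT 1
                       FROM final_table f
                      WHERE f.id   = load_tmp.id
                        AND f.path = load_tmp.path
                        AND f.y    = load_tmp.y);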
Re: [PERFORM] Cheap RAM disk?
Luke Lonergan wrote: Yup - interesting and very niche product - it seems like it's only obvious application is for the Postgresql WAL problem :-) Well, you could do it for any journaled system (XFS, JFS, ext3, reiserfs). But yes, it seems specifically designed for a battery backed journal. Though the article reviews it for very different purposes. Though it was a Windows review, and I don't know of any way to make NTFS use a separate device for a journal. (Though I expect it is possible somehow). John =:- The real differentiator is the battery backup part. Otherwise, the filesystem caching is more effective, so put the RAM on the motherboard. - Luke signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Cheap RAM disk?
Alex Turner wrote: Also seems pretty silly to put it on a regular SATA connection, when all that can manage is 150MB/sec. If you made it connection directly to 66/64-bit PCI then it could actualy _use_ the speed of the RAM, not to mention PCI-X. Alex Turner NetEconomist Well, the whole point is to have it look like a normal SATA drive, even to the point that you can boot off of it, without having to load a single driver. Now, you could offer that you could recreate a SATA controller on the card, with a SATA bios, etc. And then you could get the increased speed, and still have bootable functionality. But it is a version 1.0 of a product, and I'm sure they tried to make it as cheap as possible (and within their own capabilities.) John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] faster INSERT with possible pre-existing row?
Matthew Nuzum wrote: On 7/26/05, Dan Harris [EMAIL PROTECTED] wrote: I am working on a process that will be inserting tens of million rows and need this to be as quick as possible. The catch is that for each row I could potentially insert, I need to look and see if the relationship is already there to prevent multiple entries. Currently I am doing a SELECT before doing the INSERT, but I recognize the speed penalty in doing to operations. I wonder if there is some way I can say insert this record, only if it doesn't exist already. To see if it exists, I would need to compare 3 fields instead of just enforcing a primary key. I struggled with this for a while. At first I tried stored procedures and triggers, but it took very long (over 24 hours for my dataset). After several iterations of rewritting it, first into C# then into Python I got the whole process down to under 30 min. My scenario is this: I want to normalize log data. For example, for the IP address in a log entry, I need to look up the unique id of the IP address, or if the IP address is new, insert it and then return the newly created entry. Multiple processes use the data, but only one process, run daily, actually changes it. Because this one process knows that the data is static, it selects the tables into in-memory hash tables (C#) or Dictionaries (Python) and then does the lookups there. It is *super* fast, but it uses a *lot* of ram. ;-) To limit the ram, I wrote a version of the python code that uses gdbm files instead of Dictionaries. This requires a newer version of Python (to allow a gdbm db to work just like a dictionary) but makes life easier in case someone is using my software on a lower end machine. This doubled the time of the lookups from about 15 minutes to 30, bringing the whole process to about 45 minutes. Did you ever try the temp table approach? You could: COPY all records into temp_table, with an empty row for ip_id -- Get any entries which already exist UPDATE temp_table SET ip_id = (SELECT ip_id from ipaddress WHERE add=add) WHERE EXISTS (SELECT ip_id FROM ipaddress WHERE add=add); -- Create new entries INSERT INTO ipaddress(add) SELECT add FROM temp_table WHERE ip_id IS NULL; -- Update the rest UPDATE temp_table SET ip_id = (SELECT ip_id from ipaddress WHERE add=add) WHERE ip_id IS NULL AND EXISTS (SELECT ip_id FROM ipaddress WHERE add=add); This would let the database do all of the updating work in bulk on it's side, rather than you pulling all the data out and doing it locally. An alternative would be something like: CREATE TEMP TABLE new_ids (address text, ip_id int); COPY all potentially new addresses into that table. -- Delete all entries which already exist DELETE FROM new_ids WHERE EXISTS (SELECT ip_id FROM ipaddresses WHERE add=new_ids.address); -- Now create the new entries INSERT INTO ipaddresses(add) SELECT address FROM new_ids; -- At this point you are guaranteed to have all addresses existing in -- the database If you then insert your full data into the final table, only leave the ip_id column as null. Then if you have a partial index where ip_id is NULL, you could use the command: UPDATE final_table SET ip_id = (SELECT ip_id FROM ipaddresses WHERE add=final_table.add) WHERE ip_id IS NULL; You could also do this in a temporary table, before bulk inserting into the final table. I don't know what you have tried, but I know that for Dan, he easily has 36M rows. So I don't think he wants to pull that locally and create a in-memory hash just to insert 100 rows or so. 
Also, for your situation, if you do keep a local cache, you could certainly save the cache between runs, and use a temp table to determine what new ids you need to add to it. Then you wouldn't have to pull the complete set each time. You just pull new values for entries you haven't added yet. John =:- signature.asc Description: OpenPGP digital signature
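A sketch of the temp-table variant discussed above, with the correlated subqueries table-qualified so the comparison is between the two tables rather than a column with itself (names follow the message above; it assumes add is unique in ipaddress):

  -- fill in ids for addresses that already exist
  UPDATE temp_table
     SET ip_id = (SELECT ip_id FROM ipaddress
                   WHERE ipaddress.add = temp_table.add)
   WHERE EXISTS (SELECT 1 FROM ipaddress
                  WHERE ipaddress.add = temp_table.add);

  -- create the genuinely new addresses (DISTINCT guards against
  -- duplicates within the incoming batch)
  INSERT INTO ipaddress (add)
  SELECT DISTINCT add FROM temp_table WHERE ip_id IS NULL;

  -- pick up the ids of the rows just created
  UPDATE temp_table
     SET ip_id = (SELECT ip_id FROM ipaddress
                   WHERE ipaddress.add = temp_table.add)
   WHERE ip_id IS NULL;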
Re: [PERFORM] COPY insert performance
Chris Isaacson wrote: I need COPY via libpqxx to insert millions of rows into two tables. One table has roughly have as many rows and requires half the storage. In production, the largest table will grow by ~30M rows/day. To test the COPY performance I split my transactions into 10,000 rows. I insert roughly 5000 rows into table A for every 10,000 rows into table B. Table A has one unique index: order_main_pk UNIQUE, btree (cl_ord_id) Table B has 1 unique index and 2 non-unique indexes: order_transition_pk UNIQUE, btree (collating_seq) order_transition_ak2 btree (orig_cl_ord_id) order_transition_ak3 btree (exec_id) Do you have any foreign key references? If you are creating a table for the first time (or loading a large fraction of the data), it is common to drop the indexes and foreign keys first, and then insert/copy, and then drop them again. Is InnoDB the backend with referential integrity, and true transaction support? I believe the default backend does not support either (so it is cheating to give you speed, which may be just fine for your needs, especially since you are willing to run fsync=false). I think moving pg_xlog to a dedicated drive (set of drives) could help your performance. As well as increasing checkpoint_segments. I don't know if you gain much by changing the bg_writer settings, if you are streaming everything in at once, you probably want to have it written out right away. My understanding is that bg_writer settings are for the case where you have mixed read and writes going on at the same time, and you want to make sure that the reads have time to execute (ie the writes are not saturating your IO). Also, is any of this tested under load? Having a separate process issue queries while you are loading in data. Traditionally MySQL is faster with a single process inserting/querying for data, but once you have multiple processes hitting it at the same time, it's performance degrades much faster than postgres. You also seem to be giving MySQL 512M of ram to work with, while only giving 2M/200M to postgres. (re)creating indexes uses maintenance_work_mem, but updating indexes could easily use work_mem. You may be RAM starved. John =:- My testing environment is as follows: -Postgresql 8.0.1 -libpqxx 2.5.0 -Linux 2.6.11.4-21.7-smp x86_64 -Dual Opteron 246 -System disk (postgres data resides on this SCSI disk) - Seagate (ST373453LC) - 15K, 73 GB (http://www.seagate.com/cda/products/discsales/marketing/detail/0,1081,549,00.html) -2nd logical disk - 10K, 36GB IBM SCSI (IC35L036UCDY10-0) - WAL reside on this disk -NO RAID *PostgreSQL* Here are the results of copying in 10M rows as fast as possible: (10K/transaction) Total Time:1129.556 s Rows/sec: 9899.922 Transaction1.2s225 Transaction1.5s 77 Transaction2.0s 4 Max Transaction 2.325s **MySQL** **I ran a similar test with MySQL 4.1.10a (InnoDB) which produced these results: (I used MySQL's INSERT INTO x VALUES (1,2,3)(4,5,6)(...,...,...) syntax) (10K/transaction) Total Time: 860.000 s Rows/sec:11627.91 Transaction1.2s 0 Transaction1.5s 0 Transaction2.0s 0 Max Transaction 1.175s Considering the configurations shown below, can anyone offer advice to close the 15% gap and the much worse variability I'm experiencing. 
Thanks My *postgresql.conf* has the following non-default values: # - # PostgreSQL configuration file # - listen_addresses = '*' # what IP interface(s) to listen on; max_connections = 100 #--- # RESOURCE USAGE (except WAL) #--- shared_buffers = 65536 # min 16, at least max_connections*2, 8KB each work_mem = 2048 # min 64, size in KB maintenance_work_mem = 204800 # min 1024, size in KB max_fsm_pages = 225 # min max_fsm_relations*16, 6 bytes each bgwriter_delay = 200 # 10-1 milliseconds between rounds bgwriter_percent = 10 # 0-100% of dirty buffers in each round bgwriter_maxpages = 1000 # 0-1000 buffers max per round #--- # WRITE AHEAD LOG #--- fsync = false # turns forced synchronization on or off wal_buffers = 64 # min 4, 8KB each checkpoint_segments = 40 # in logfile segments, min 1, 16MB each checkpoint_timeout = 600 # range 30-3600, in seconds #--- # QUERY TUNING #--- effective_cache_size = 65536 # typically 8KB each random_page_cost = 2 # units are one sequential page fetch cost #--- # ERROR
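For the initial-load case, the drop-indexes-then-COPY pattern mentioned above would look something like the following sketch (the table name order_transition is only inferred from the index names, and the COPY file path is a placeholder):

  -- drop the non-unique secondary indexes; keep the unique index that
  -- enforces correctness during the load
  DROP INDEX order_transition_ak2;
  DROP INDEX order_transition_ak3;

  COPY order_transition FROM '/path/to/order_transition.copy';

  -- rebuild the secondary indexes and refresh statistics afterwards
  CREATE INDEX order_transition_ak2 ON order_transition (orig_cl_ord_id);
  CREATE INDEX order_transition_ak3 ON order_transition (exec_id);
  ANALYZE order_transition;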
Re: [PERFORM] Vacuum Full Analyze taking so long
Tomeh, Husam wrote: Nothing was running except the job. The server did not look stressed out looking at top and vmstat. We have seen slower query performance when performing load tests, so I run the re-index on all application indexes and then issue a full vacuum. I ran the same thing on a staging server and it took less than 12 hours. Is there a possibility the DB pages are corrupted. Is there a command to verify that. (In Oracle, there's a dbverify command that checks for corruption on the data files level). The other question I have. What would be the proper approach to rebuild indexes. I re-indexes and then run vacuum/analyze. Should I not use the re-index approach, and instead, drop the indexes, vacuum the tables, and then create the indexes, then run analyze on tables and indexes?? I *think* if you are planning on dropping the indexes anyway, just drop them, VACUUM ANALYZE, and then recreate them, I don't think you have to re-analyze after you have recreated them. John =:- Thanks, signature.asc Description: OpenPGP digital signature
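The drop/vacuum/recreate sequence being suggested, spelled out for a single index (names are placeholders):

  DROP INDEX my_table_col_idx;
  VACUUM ANALYZE my_table;           -- plain vacuum; no exclusive lock needed
  CREATE INDEX my_table_col_idx ON my_table (col);
  -- no extra ANALYZE is needed just because the index was recreated:
  -- for plain column indexes the planner statistics are kept per column,
  -- and CREATE INDEX builds the index from scratch anyway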
Re: [PERFORM] Optimizer seems to be way off, why?
Dirk Lutzebäck wrote: Richard Huxton wrote: Dirk Lutzebäck wrote: Hi, I do not under stand the following explain output (pgsql 8.0.3): explain analyze select b.e from b, d where b.r=516081780 and b.c=513652057 and b.e=d.e; QUERY PLAN Nested Loop (cost=0.00..1220.09 rows=1 width=4) (actual time=0.213..2926.845 rows=324503 loops=1) - Index Scan using b_index on b (cost=0.00..1199.12 rows=1 width=4) (actual time=0.104..17.418 rows=3293 loops=1) Index Cond: (r = 516081780::oid) Filter: (c = 513652057::oid) - Index Scan using d_e_index on d (cost=0.00..19.22 rows=140 width=4) (actual time=0.009..0.380 rows=99 loops=3293) Index Cond: (outer.e = d.e) Total runtime: 3638.783 ms (7 rows) Why is the rows estimate for b_index and the nested loop 1? It is actually 3293 and 324503. I'm guessing (and that's all it is) that b.r and b.c have a higher correlation than the planner is expecting. That is, it expects the b.c=... to reduce the number of matching rows much more than it is. Try a query just on WHERE b.r=516081780 and see if it gets the estimate right for that. If it's a common query, it might be worth an index on (r,c) -- Richard Huxton Archonet Ltd Thanks Richard, dropping the join for b.c now gives better estimates (it also uses a different index now) although not accurate (off by factor 10). This query is embedded in a larger query which now got a 1000 times speed up (!) because I can drop b.c because it is redundant. Well, part of the problem is that the poorly estimated row is not 'b.e' but 'b.r', it expects to only find one row that matches, and instead finds 3293 rows. Now, that *could* be because it mis-estimates the selectivity of b.r b.c. It actually estimated the join with d approximately correctly. (It thought that for each row it would find 140, and it averaged 99). Though, why can't the planner see this correlation? I think somebody said the planner does not know about multiple column correlations, does it? The planner does not maintain cross-column statistics, so you are correct. I believe it assumes distributions are independent. So that if r=R is 10% selective, and c= is 20% selective, the total selectivity of r= AND c= is 2%. I could be wrong on this, but I think it is approximately correct. Now if you created the index on b(r,c), then it would have a much better idea of how selective that would be. At the very least, it could index on (r,c) rather than indexing on (r) and filtering by (c). Also, if you have very skewed data (where you have 1 value 100k times, and 50 values only 10times each), the planner can overestimate the low values, and underestimate the high one. (It uses random sampling, so it kind of depends where the entries are.) Have you tried increasing the statistics on b.r and or b.c? Do you have an index on b.c or just b.r? To see what the planner thinks, you might try: EXPLAIN ANALYZE select count(*) from b where r=516081780; That would tell you how selective the planner thinks the r= is. Regards, Dirk John =:- signature.asc Description: OpenPGP digital signature
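In concrete terms, the suggestions above look like this (the index name is made up; the column names and the constant come from the query being discussed):

  -- a combined index lets the planner use both conditions directly
  CREATE INDEX b_r_c_idx ON b (r, c);

  -- more detailed per-column statistics can improve the row estimates
  ALTER TABLE b ALTER COLUMN r SET STATISTICS 100;
  ALTER TABLE b ALTER COLUMN c SET STATISTICS 100;
  ANALYZE b;

  -- and this shows how selective the planner now thinks r= alone is
  EXPLAIN ANALYZE SELECT count(*) FROM b WHERE r = 516081780;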
Re: [PERFORM] Looking for tips
Oliver Crosby wrote: Hi, I'm running Postgres 7.4.6 on a dedicated server with about 1.5gigs of ram. Running scripts locally, it takes about 1.5x longer than mysql, and the load on the server is only about 21%. I upped the sort_mem to 8192 (kB), and shared_buffers and effective_cache_size to 65536 (512MB), but neither the timing nor the server load have changed at all. FYI, I'm going to be working on data sets in the order of GB. I think I've gone about as far as I can with google.. can anybody give me some advice on how to improve the raw performance before I start looking at code changes? Thanks in advance. First, try to post in plain-text rather than html, it is easier to read. :) Second, if you can determine what queries are running slow, post the result of EXPLAIN ANALYZE on them, and we can try to help you tune them/postgres to better effect. Just a blanket question like this is hard to answer. Your new shared_buffers are probably *way* too high. They should be at most around 10% of ram. Since this is a dedicated server effective_cache_size should be probably ~75% of ram, or close to 1.2GB. There are quite a few things that you can tweak, so the more information you can give, the more we can help. For instance, if you are loading a lot of data into a table, if possible, you want to use COPY not INSERT. If you have a lot of indexes and are loading a significant portion, it is sometimes faster to drop the indexes, COPY the data in, and then rebuild the indexes. For tables with a lot of inserts/updates, you need to watch out for foreign key constraints. (Generally, you will want an index on both sides of the foreign key. One is required, the other is recommended for faster update/deletes). John =:- signature.asc Description: OpenPGP digital signature
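For the foreign-key point at the end, a minimal sketch (table and column names are invented): the referenced side must already be a primary key or unique, but the index on the referencing column is not created automatically and is the one that speeds up updates and deletes on the parent.

  CREATE TABLE parent (
      id   integer PRIMARY KEY,
      name text
  );

  CREATE TABLE child (
      id        serial PRIMARY KEY,
      parent_id integer REFERENCES parent (id),
      payload   text
  );

  -- the recommended extra index on the referencing side
  CREATE INDEX child_parent_id_idx ON child (parent_id);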
Re: [PERFORM] slow joining very large table to smaller ones
Dan Harris wrote: On Jul 14, 2005, at 10:12 PM, John A Meinel wrote: My biggest question is why the planner thinks the Nested Loop would be so expensive. Have you tuned any of the parameters? It seems like something is out of whack. (cpu_tuple_cost, random_page_cost, etc...) here's some of my postgresql.conf. Feel free to blast me if I did something idiotic here. shared_buffers = 5 effective_cache_size = 1348000 random_page_cost = 3 work_mem = 512000 Unless you are the only person connecting to this database, your work_mem is very high. And if you haven't modified maintenance_work_mem it is probably very low. work_mem might be causing postgres to think it can fit all of a merge into ram, making it faster, I can't say for sure. max_fsm_pages = 8 This seems high, but it depends how many updates/deletes you get in-between vacuums. It may not be too excessive. VACUUM [FULL] VERBOSE replies with how many free pages are left, if you didn't use that already for tuning. Though it should be tuned based on a steady state situation. Not a one time test. log_min_duration_statement = 6 fsync = true ( not sure if I'm daring enough to run without this ) wal_buffers = 1000 checkpoint_segments = 64 checkpoint_timeout = 3000 These seem fine to me. Can you include the output of EXPLAIN SELECT both with and without SET join_collapse_limit? Since your tables have grown, I can't compare the estimated number of rows, and costs very well. EXPLAIN without ANALYZE is fine, since I am wondering what the planner is thinking things cost. John =:- # FOR PG_AUTOVACUUM --# stats_command_string = true stats_row_level = true signature.asc Description: OpenPGP digital signature
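Since work_mem is allocated per sort or hash operation in each backend, a common pattern is to keep the server-wide value modest and raise it only for the session (or query) that needs it; maintenance_work_mem is the one VACUUM and CREATE INDEX use. A sketch (values are in KB here, matching the 8.0-era config above):

  SET work_mem = 512000;              -- only this session gets the big sort memory
  SET maintenance_work_mem = 524288;  -- helps VACUUM / CREATE INDEX in this session
  SHOW work_mem;
  RESET work_mem;                     -- back to the postgresql.conf value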
Re: [PERFORM] slow joining very large table to smaller ones
Dan Harris wrote: I'm trying to improve the speed of this query: explain select recordtext from eventactivity inner join ( select incidentid from k_r where id = 94 ) a using ( incidentid ) inner join ( select incidentid from k_b where id = 107 ) b using ( incidentid ); You might try giving it a little bit more freedom with: EXPLAIN ANALYZE SELECT recordtext FROM eventactivity, k_r, k_b WHERE eventactivity.incidentid = k_r.incidentid AND eventactivity.incidentid = k_b.incidentid AND k_r.id = 94 AND k_b.id = 107 -- AND k_r.incidentid = k_b.incidentid ; I'm pretty sure that would give identical results, just let the planner have a little bit more freedom about how it does it. Also the last line is commented out, because I think it is redundant. You might also try: EXPLAIN ANALYZE SELECT recordtext FROM eventactivity JOIN k_r USING (incidentid) JOIN k_b USING (incidentid) WHERE k_r.id = 94 AND k_b.id = 107 ; Also, if possible give us the EXPLAIN ANALYZE so that we know if the planner is making accurate estimates. (You might send an EXPLAIN while waiting for the EXPLAIN ANALYZE to finish) You can also try disabling merge joins, and see how that changes things. QUERY PLAN -- Merge Join (cost=2747.29..4249364.96 rows=11968693 width=35) Merge Cond: ((outer.incidentid)::text = inner.?column2?) - Merge Join (cost=1349.56..4230052.73 rows=4413563 width=117) Merge Cond: ((outer.incidentid)::text = inner.?column2?) - Index Scan using eventactivity1 on eventactivity (cost=0.00..4051200.28 rows=44519781 width=49) - Sort (cost=1349.56..1350.85 rows=517 width=68) Sort Key: (k_b.incidentid)::text - Index Scan using k_b_idx on k_b (cost=0.00..1326.26 rows=517 width=68) Index Cond: (id = 107) - Sort (cost=1397.73..1399.09 rows=542 width=68) Sort Key: (k_r.incidentid)::text - Index Scan using k_r_idx on k_r (cost=0.00..1373.12 rows=542 width=68) Index Cond: (id = 94) (13 rows) There are many millions of rows in eventactivity. There are a few ten-thousand rows in k_r and k_b. There is an index on 'incidentid' in all three tables. There should only be less than 100 rows matched in k_r and k_b total. That part on its own is very very fast. But, it should have those 100 or so incidentids extracted in under a second and then go into eventactivity AFTER doing that. At least, that's my intention to make this fast. Well, postgres is estimating around 500 rows each, is that way off? Try just doing: EXPLAIN ANALYZE SELECT incidentid FROM k_b WHERE id = 107; EXPLAIN ANALYZE SELECT incidentid FROM k_r WHERE id = 94; And see if postgres estimates the number of rows properly. I assume you have recently VACUUM ANALYZEd, which means you might need to update the statistics target (ALTER TABLE k_b ALTER COLUMN incidientid SET STATISTICS 100) default is IIRC 10, ranges from 1-1000, higher is more accurate, but makes ANALYZE slower. Right now, it looks like pg is trying to sort the entire eventactivity table for the merge join which is taking several minutes to do. Can I rephrase this so that it does the searching through k_r and k_b FIRST and then go into eventactivity using the index on incidentid? It seems like that shouldn't be too hard to make fast but my SQL query skills are only average. To me, it looks like it is doing an index scan (on k_b.id) through k_b first, sorting the results by incidentid, then merge joining that with eventactivity. I'm guessing you actually want it to merge k_b and k_r to get extra selectivity before joining against eventactivity. I think my alternate forms would let postgres realize this. 
But if not, you could try: EXPLAIN ANALYZE SELECT recordtext FROM eventactivity JOIN (SELECT incidentid FROM k_r JOIN k_b USING (incidentid) WHERE k_r.id = 94 AND k_b.id = 107) USING (incidentid); I don't know how selective your keys are, but one of these queries should probably structure it better for the planner. It depends a lot on how selective your query is. If you have 100M rows, the above query looks like it expects k_r to restrict it to 44M rows, and k_r + k_b down to 11M rows, which really should be a seq scan ( 10% of the rows = seq scan). But if you are saying the selectivity is mis-estimated it could be different. John =:- Thanks -Dan ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq signature.asc Description: OpenPGP digital signature
Re: [PERFORM] slow joining very large table to smaller ones
Dan Harris wrote: On Jul 14, 2005, at 9:42 AM, John A Meinel wrote: ... Did you try doing this to see how good the planners selectivity estimates are? Well, postgres is estimating around 500 rows each, is that way off? Try just doing: EXPLAIN ANALYZE SELECT incidentid FROM k_b WHERE id = 107; EXPLAIN ANALYZE SELECT incidentid FROM k_r WHERE id = 94; These should be fast queries. John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] slow joining very large table to smaller ones
Dan Harris wrote: On Jul 14, 2005, at 9:42 AM, John A Meinel wrote: You might try giving it a little bit more freedom with: EXPLAIN ANALYZE SELECT recordtext FROM eventactivity, k_r, k_b WHERE eventactivity.incidentid = k_r.incidentid AND eventactivity.incidentid = k_b.incidentid AND k_r.id = 94 AND k_b.id = 107 -- AND k_r.incidentid = k_b.incidentid ; I'm pretty sure that would give identical results, just let the planner have a little bit more freedom about how it does it. Also the last line is commented out, because I think it is redundant. Ok, I tried this one. My ssh keeps getting cut off by a router somewhere between me and the server due to inactivity timeouts, so all I know is that both the select and explain analyze are taking over an hour to run. Here's the explain select for that one, since that's the best I can get. explain select recordtext from eventactivity,k_r,k_b where eventactivity.incidentid = k_r.incidentid and eventactivity.incidentid = k_b.incidentid and k_r.id = 94 and k_b.id = 107; QUERY PLAN -- Merge Join (cost=9624.61..4679590.52 rows=151009549 width=35) Merge Cond: ((outer.incidentid)::text = inner.?column2?) - Merge Join (cost=4766.92..4547684.26 rows=16072733 width=117) Merge Cond: ((outer.incidentid)::text = inner.?column2?) - Index Scan using eventactivity1 on eventactivity (cost=0.00..4186753.16 rows=46029271 width=49) - Sort (cost=4766.92..4771.47 rows=1821 width=68) Sort Key: (k_b.incidentid)::text - Index Scan using k_b_idx on k_b (cost=0.00..4668.31 rows=1821 width=68) Index Cond: (id = 107) - Sort (cost=4857.69..4862.39 rows=1879 width=68) Sort Key: (k_r.incidentid)::text - Index Scan using k_r_idx on k_r (cost=0.00..4755.52 rows=1879 width=68) Index Cond: (id = 94) (13 rows) If anything, the estimations have gotten worse. As now it thinks there will be 1800 rows returned each, whereas you were thinking it would be more around 100. Since you didn't say, you did VACUUM ANALYZE recently, right? ... You can also try disabling merge joins, and see how that changes things. Are there any negative sideaffects of doing this? If the planner is estimating things correctly, you want to give it the most flexibility of plans to pick from, because sometimes a merge join is faster (postgres doesn't pick things because it wants to go slower). The only reason for the disable flags is that sometimes the planner doesn't estimate correctly. Usually disabling a method is not the final solution, but a way to try out different methods, and see what happens to the results. Using: SET enable_mergejoin TO off; You can disable it just for the current session (not for the entire database). Which is the recommended way if you have a query that postgres is messing up on. (Usually it is correct elsewhere). Well, postgres is estimating around 500 rows each, is that way off? Try just doing: EXPLAIN ANALYZE SELECT incidentid FROM k_b WHERE id = 107; EXPLAIN ANALYZE SELECT incidentid FROM k_r WHERE id = 94; Once again, do this and post the results. We might just need to tweak your settings so that it estimates the number of rows correctly, and we don't need to do anything else. And see if postgres estimates the number of rows properly. I assume you have recently VACUUM ANALYZEd, which means you might need to update the statistics target (ALTER TABLE k_b ALTER COLUMN incidientid SET STATISTICS 100) default is IIRC 10, ranges from 1-1000, higher is more accurate, but makes ANALYZE slower. ... 
EXPLAIN ANALYZE SELECT recordtext FROM eventactivity JOIN (SELECT incidentid FROM k_r JOIN k_b USING (incidentid) WHERE k_r.id = 94 AND k_b.id = 107) USING (incidentid); This one looks like the same plan as the others: explain select recordtext from eventactivity join ( select incidentid from k_r join k_b using (incidentid) where k_r.id = 94 and k_b.id = 107 ) a using (incidentid ); Well, the planner is powerful enough to flatten nested selects. To make it less intelligent you can do: SET join_collapse_limit 1; or SET join_collapse_limit 0; Which should tell postgres to not try and get tricky with your query. Again, *usually* the planner knows better than you do. So again just do it to see what you get. The problem is that if you are only using EXPLAIN SELECT, you will probably get something which *looks* worse. Because if it looked better, the planner would have used it. That is why you really need the EXPLAIN ANALYZE, so that you can see where the planner is incorrect in it's estimates. QUERY PLAN
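Putting the suggestions above together, a sketch of the diagnostic sequence (the statistics target of 100 is only an illustration, and the SETs affect the current session only):

ALTER TABLE k_b ALTER COLUMN incidentid SET STATISTICS 100;
ALTER TABLE k_r ALTER COLUMN incidentid SET STATISTICS 100;
ANALYZE k_b;
ANALYZE k_r;

-- re-check the row estimates for the selective lookups
EXPLAIN ANALYZE SELECT incidentid FROM k_b WHERE id = 107;
EXPLAIN ANALYZE SELECT incidentid FROM k_r WHERE id = 94;

-- then, per session, see whether a different join method or join order wins
SET enable_mergejoin TO off;
SET join_collapse_limit = 1;
EXPLAIN ANALYZE
SELECT recordtext
FROM eventactivity, k_r, k_b
WHERE eventactivity.incidentid = k_r.incidentid
  AND eventactivity.incidentid = k_b.incidentid
  AND k_r.id = 94 AND k_b.id = 107;
RESET enable_mergejoin;
RESET join_collapse_limit;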
Re: [PERFORM] slow joining very large table to smaller ones
Tom Lane wrote: John A Meinel [EMAIL PROTECTED] writes: What I don't understand is that the planner is actually estimating that joining against the new table is going to *increase* the number of returned rows. It evidently thinks that incidentid in the k_r table is pretty nonunique. We really need to look at the statistics data to see what's going on. regards, tom lane Okay, sure. What about doing this, then: EXPLAIN ANALYZE SELECT recordtext FROM eventactivity JOIN (SELECT DISTINCT incidentid FROM k_r JOIN k_b USING (incidentid) WHERE k_r.id = ?? AND k_b.id = ??) USING (incidentid) ; Since I assume that eventactivity is the only table with recordtext, and that you don't get any columns from k_r and k_b, meaning it would be pointless to get duplicate incidentids. I may be misunderstanding what the query is trying to do, but depending on what is in k_r and k_b, is it possible to use a UNIQUE INDEX rather than just an index on incidentid? There is also the possibility of EXPLAIN ANALYZE SELECT recordtext FROM eventactivtity JOIN (SELECT incidentid FROM k_r WHERE k_r.id = ?? UNION SELECT incidentid FROM k_b WHERE k_b.id = ??) USING (incidentid) ; But both of these would mean that you don't actually want columns from k_r or k_b, just a unique list of incident ids. But first, I agree, we should make sure the pg_stats values are reasonable. John =:- signature.asc Description: OpenPGP digital signature
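Before rewriting anything, a quick way to eyeball the pg_stats values mentioned above (a sketch; incidentid is the join column in question):

SELECT tablename, attname, null_frac, n_distinct, correlation
FROM pg_stats
WHERE tablename IN ('eventactivity', 'k_r', 'k_b')
  AND attname = 'incidentid';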
Re: [PERFORM] lots of updates on small table
Alison Winters wrote: Hi, Our application requires a number of processes to select and update rows from a very small (10 rows) Postgres table on a regular and frequent basis. These processes often run for weeks at a time, but over the space of a few days we find that updates start getting painfully slow. We are running a full vacuum/analyze and reindex on the table every day, Full vacuum, eh? I wonder if what you really need is very frequent non-full vacuum. Say, once in 15 minutes (exact rate depending on dead tuple rate.) Is there a difference between vacuum and vacuum full? Currently we have a cron job going every hour that does: VACUUM FULL VERBOSE ANALYZE plc_fldio REINDEX TABLE plc_fldio VACUUM FULL exclusively locks the table (so that nothing else can happen) and the compacts it as much as it can. You almost definitely want to only VACUUM every 15min, maybe VACUUM FULL 1/day. VACUUM FULL is more for when you haven't been VACUUMing often enough. Or have major changes to your table. Basically VACUUM marks rows as empty and available for reuse, VACUUM FULL removes empty space (but requires a full lock, because it is moving rows around). If anything, I would estimate that VACUUM FULL would be hurting your performance. But it may happen fast enough not to matter. The most recent output was this: INFO: --Relation public.plc_fldio-- INFO: Pages 1221: Changed 3, reaped 256, Empty 0, New 0; Tup 108137: Vac 4176, Keep/VTL 108133/108133, UnUsed 19, MinLen 84, MaxLen 84; Re-using: Free/Avail. Space 445176/371836; EndEmpty/Avail. Pages 0/256. CPU 0.04s/0.14u sec elapsed 0.18 sec. INFO: Index plcpage_idx: Pages 315; Tuples 108137: Deleted 4176. CPU 0.03s/0.04u sec elapsed 0.14 sec. INFO: Rel plc_fldio: Pages: 1221 -- 1221; Tuple(s) moved: 0. CPU 0.03s/0.04u sec elapsed 0.36 sec. INFO: Analyzing public.plc_fldio VACUUM REINDEX We'll up it to every 15 minutes, but i don't know if that'll help because even with the current vacuuming the updates are still getting slower and slower over the course of several days. What really puzzles me is why restarting the processes fixes it. Does PostgreSQL keep some kind of backlog of transactions all for one database connection? Isn't it normal to have processes that keep a single database connection open for days at a time? I believe that certain locks are grabbed per session. Or at least there is some command that you can run, which you don't want to run in a maintained connection. (It might be VACUUM FULL, I don't remember which one it is). But the fact that your application works at all seems to be that it isn't acquiring any locks. I know VACUUM cannot clean up any rows that are visible in one of the transactions, I don't know if this includes active connections or not. Regarding the question another poster asked: all the transactions are very short. The table is essentially a database replacement for a shared memory segment - it contains a few rows of byte values that are constantly updated byte-at-a-time to communicate data between different industrial control processes. Thanks for the thoughts everyone, Alison Is it possible to have some sort of timer that would recognize it has been connected for too long, drop the database connection, and reconnect? I don't know that it would solve anything, but it would be something you could try. John =:- signature.asc Description: OpenPGP digital signature
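A sketch of the lighter maintenance schedule being suggested (the intervals are illustrative; tune them to your dead-tuple rate):

-- every 15 minutes or so:
VACUUM ANALYZE plc_fldio;
-- at most once a day, in a quiet window, if you keep it at all:
VACUUM FULL plc_fldio;
REINDEX TABLE plc_fldio;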
Re: [PERFORM] slow joining very large table to smaller ones
Dan Harris wrote: On Jul 14, 2005, at 7:15 PM, John A Meinel wrote: Is the distribution of your rows uneven? Meaning do you have more rows with a later id than an earlier one? There are definitely some id's that will have many times more than the others. If I group and count them, the top 10 are fairly dominant in the table. That usually skews the estimates. Since the estimate is more of an average (unless the statistics are higher). Hmm.. How to do it permanantly? Well you could always issue set join_collapse set 1; select * from But obviously that isn't what you prefer. :) I think there are things you can do to make merge join more expensive than a nested loop, but I'm not sure what they are. Maybe someone else has some ideas to encourage this behavior for future work? Setting it on a per-connection basis is doable, but would add some burden to us in code. My biggest question is why the planner things the Nested Loop would be so expensive. Have you tuned any of the parameters? It seems like something is out of whack. (cpu_tuple_cost, random_page_cost, etc...) What I really don't understand is that the estimates dropped as well. The actual number of estimate rows drops to 3k instead of 1M. The real question is why does the planner think it will be so expensive? select count(*) from k_b join k_r using (incidentid) where k_b.id=107 and k_r.id=94; count --- 373 Well, this says that they are indeed much more selective. Each one has 1k rows, but together you end up with only 400. Is this a bad thing? Is this not selective enough to make it much faster? Yes, being more selective is what makes it faster. But the planner doesn't seem to notice it properly. Overall, I'm much happier now after seeing the new plan come about, if I can find a way to make that join_collapse behavior permanent, I can certainly live with these numbers. I'm sure there are pieces to tune, but I've reached my limits of parameters to tweak :) Thanks again for your continued efforts. -Dan John =:- signature.asc Description: OpenPGP digital signature
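If the per-session setting does turn out to be the fix, it can be made sticky without touching application code, assuming your server version supports per-database or per-role settings (the names below are placeholders):

ALTER DATABASE yourdb SET join_collapse_limit = 1;
ALTER USER appuser SET join_collapse_limit = 1;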
Re: [PERFORM] join and query planner
Dario Pudlo wrote: (first at all, sorry for my english) Hi. - Does left join restrict the order in which the planner must join tables? I've read about join, but i'm not sure about left join... - If so: Can I avoid this behavior? I mean, make the planner resolve the query, using statistics (uniqueness, data distribution) rather than join order. My query looks like: SELECT ... FROM a, b, LEFT JOIN c ON (c.key = a.key) LEFT JOIN d on (d.key=a.key) WHERE (a.key = b.key) AND (b.column = 100) b.column has a lot better selectivity, but planner insist on resolve first c.key = a.key. Of course, I could rewrite something like: SELECT ... FROM (SELECT ... FROM a,b LEFT JOIN d on (d.key=a.key) WHERE (b.column = 100) ) as aa LEFT JOIN c ON (c.key = aa.key) but this is query is constructed by an application with a multicolumn filter. It's dynamic. It means that a user could choose to look for c.column = 1000. And also, combinations of filters. So, I need the planner to choose the best plan... Probably forcing the other join earlier could help: SELECT ... FROM a JOIN b ON (a.key = b.key) LEFT JOIN c ON (c.key = a.key) ... I think the problem is that postgresql can't break JOIN syntax very easily. But you can make the JOIN earlier. John =:- I've already change statistics, I clustered tables with cluster, ran vacuum analyze, changed work_mem, shared_buffers... Greetings. TIA. ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Sorting on longer key is faster ?
Chris Travers wrote: John A Meinel wrote: jobapply wrote: The 2 queries are almost same, but ORDER BY x||t is FASTER than ORDER BY x.. How can that be possible? Btw: x and x||t are same ordered phoeniks= explain analyze SELECT * FROM test WHERE i20 ORDER BY x || t; QUERY PLAN What types are x and t, I have the feeling x || t is actually a boolean, so it is only a True/False sort, while ORDER BY x has to do some sort of string comparison (which might actually be a locale depended comparison, and strcoll can be very slow on some locales) Am I reading this that wrong? I would think that x || t would mean concatenate x and t. Sorry, I think you are right. I was getting my operators mixed up. This is interesting. I never through of writing a multicolumn sort this way I'm also surprised that the sort is faster with a merge operation. Are you using UNICODE as the database format? I'm just wondering if it is doing something funny like casting it to an easier to sort type. Best Wishes, Chris Travers Metatron Technology Consulting PS Don't forget to Reply All so that your messages go back to the list. signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Sorting on longer key is faster ?
jobapply wrote: The 2 queries are almost same, but ORDER BY x||t is FASTER than ORDER BY x.. How can that be possible? Btw: x and x||t are same ordered phoeniks= explain analyze SELECT * FROM test WHERE i20 ORDER BY x || t; QUERY PLAN I also thought of another possibility. Are there a lot of similar entries in X? Meaning that the same value is repeated over and over? It is possible that the sort code has a weakness when sorting equal values. For instance, if it was doing a Hash aggregation, you would have the same hash repeated. (It isn't I'm just mentioning a case where it might affect something). If it is creating a tree representation, it might cause some sort of pathological worst-case behavior, where all entries keep adding to the same side of the tree, rather than being more balanced. I don't know the internals of postgresql sorting, but just some ideas. John =:- signature.asc Description: OpenPGP digital signature
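One quick way to check the "lots of equal values" theory above (a sketch against the test table from the question):

SELECT x, count(*) AS dups
FROM test
GROUP BY x
ORDER BY count(*) DESC
LIMIT 10;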
Re: [PERFORM] CURSOR slowes down a WHERE clause 100 times?
Niccolo Rigacci wrote: Hi to all, I have a performace problem with the following query: BEGIN; DECLARE mycursor BINARY CURSOR FOR SELECT toponimo, wpt FROM wpt_comuni_view WHERE ( wpt setSRID('BOX3D(4.83 36, 20.16 47.5)'::BOX3D, 4326) ); FETCH ALL IN mycursor; END; I get the results in about 108 seconds (8060 rows). If I issue the SELECT alone (without the CURSOR) I get the same results in less than 1 second. The wpt_comuni_view is a VIEW of a 3 tables JOIN, and the wpt field is a PostGIS geometry column. The is the PostGIS overlaps operator. If I CURSOR SELECT from a temp table instead of the JOIN VIEW the query time 1 second. If I omit the WHERE clause the CURSOR fetches results in 1 second. Can the CURSOR on JOIN affects so heavly the WHERE clause? I suspect that - with the CURSOR - a sequential scan is performed on the entire data set for each fetched record... Any idea? What does it say if you do EXPLAIN ANALYZE SELECT... both with and without the cursor? It may not say much for the cursor, but I think you can explain analyze the fetch statements. It is my understanding that Cursors generally favor using an slow-startup style plan, which usually means using an index, because it expects that you won't actually want all of the data. A seqscan is not always slower, especially if you need to go through most of the data. Without an explain analyze it's hard to say what the planner is thinking and doing. This is the definition of the VIEW: CREATE VIEW wpt_comuni_view AS SELECT istat_wpt.oid, istat_wpt.id, istat_wpt.toponimo, istat_comuni.residenti, istat_wpt.wpt FROM istat_comuni JOIN istat_comuni2wpt USING (idprovincia, idcomune) JOIN istat_wpt ON (idwpt = id); Thank you for any hint. You might also try comparing your CURSOR to a prepared statement. There are a few rare cases where preparing is worse than issuing the query directly, depending on your data layout. John =:- signature.asc Description: OpenPGP digital signature
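A sketch of the prepared-statement comparison suggested above (this assumes EXPLAIN ... EXECUTE is available on your version; the geometry parameter mirrors the bounding box from the original query):

PREPARE wpt_box (geometry) AS
  SELECT toponimo, wpt FROM wpt_comuni_view WHERE wpt && $1;
EXPLAIN ANALYZE EXECUTE wpt_box(setSRID('BOX3D(4.83 36, 20.16 47.5)'::BOX3D, 4326));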
Re: [PERFORM] planner picking more expensive plan
Sam Mason wrote: Hi, I've just been referred here after a conversion on IRC and everybody seemed to think I've stumbled upon some strangeness. The planner (in PG version 8.0.2) is choosing what it thinks is a more expensive plan. I've got a table of animals (about 3M rows) and their movements (about 16M rows), and I'm trying to execute this query: SELECT a.birthlocnid, m.locnid FROM animals a LEFT JOIN movements m ON (a.animalid = m.animalid AND m.mtypeid=0) LIMIT 10; Why are you using LIMIT without having an ORDER BY? What are actually trying to get out of this query? Is it just trying to determine where the 'home' locations are? It just seems like this query isn't very useful. As it doesn't restrict by animal id, and it just gets 10 randomly selected animals where m.mtypeid=0. And why a LEFT JOIN instead of a normal join? Anyway, the general constraints you are applying seem kind of confusing. What happens if you change the plan to: SELECT a.birthlocnid, m.locnid FROM animals a LEFT JOIN movements m ON (a.animalid = m.animalid AND m.mtypeid=0) ORDER BY a.animalid LIMIT 10; I would guess that this would help the planner realize it should try to use an index, since it can realize that it wants only a few rows by a.animalid in order. Though I also recognize that you aren't returning a.animalid so you don't really know which animals you are returning. I get the feeling you are trying to ask something like do animals stay at their birth location, or at least how are animals moving around. I don't know what m.typeid = 0 means, but I'm guessing it is something like where their home is. Anyway, I would say you need to put a little bit more restriction in, so the planner can figure out how to get only 10 rows. John =:- If I have work_mem set to something small (1000) it uses this plan: QUERY PLAN Limit (cost=0.00..202.52 rows=10 width=8) (actual time=0.221..0.600 rows=10 loops=1) - Merge Left Join (cost=0.00..6628.30 rows=3302780 width=8) (actual time=0.211..0.576 rows=10 loops=1) Merge Cond: (outer.animalid = inner.animalid) - Index Scan using animals_pkey on animals a (cost=0.00..10198983.91 rows=3302780 width=8) (actual time=0.112..0.276 rows=10 loops=1) - Index Scan using movement_animal on movements m (cost=0.00..56642740.73 rows=3107737 width=8) (actual time=0.088..0.235 rows=10 loops=1) Filter: (mtypeid = 0) Total runtime: 0.413 ms But if I increase work_mem to 1 it uses this plan: QUERY PLAN Limit (cost=565969.42..566141.09 rows=10 width=8) (actual time=27769.047..27769.246 rows=10 loops=1) - Merge Right Join (cost=565969.42..57264070.77 rows=3302780 width=8) (actual time=27769.043..27769.228 rows=10 loops=1) Merge Cond: (outer.animalid = inner.animalid) - Index Scan using movement_animal on movements m (cost=0.00..56642740.73 rows=3107737 width=8) (actual time=0.022..0.154 rows=10 loops=1) Filter: (mtypeid = 0) - Sort (cost=565969.42..574226.37 rows=3302780 width=8) (actual time=27768.991..27769.001 rows=10 loops=1) Sort Key: a.animalid - Seq Scan on animals a (cost=0.00..77086.80 rows=3302780 width=8) (actual time=0.039..5620.651 rows=3303418 loops=1) Total runtime: 27851.097 ms I've tried playing with the statistics as people suggested on IRC but to no effect. There was some discussion about why it would be doing this, but nothing obvious came out of it. SHOW ALL output is at the end of this mail but it should be pretty standard apart from: shared_buffers = 1 work_mem = 8192 max_connections = 100 effective_cache_size = 1 Hope that's enough information to be useful. Thanks. 
Sam
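Since the plan flips with work_mem, one way to pin the comparison down is to toggle it for the session only (a sketch using the ORDER BY form suggested above):

SET work_mem = 1000;
EXPLAIN ANALYZE
SELECT a.birthlocnid, m.locnid
FROM animals a
LEFT JOIN movements m ON (a.animalid = m.animalid AND m.mtypeid = 0)
ORDER BY a.animalid
LIMIT 10;

SET work_mem = 10000;
-- run the same EXPLAIN ANALYZE again and compare the two plans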
Re: [PERFORM] Planner constants for RAM resident databases
Emil Briggs wrote: I'm working with an application where the database is entirely resident in RAM (the server is a quad opteron with 16GBytes of memory). It's a web application and handles a high volume of queries. The planner seems to be generating poor plans for some of our queries which I can fix by raising cpu_tuple_cost. I have seen some other comments in the archives saying that this is a bad idea but is that necessarily the case when the database is entirely resident in RAM? Emil Generally, the key knob to twiddle when everything fits in RAM is random_page_cost. If you truly have everything in RAM you could set it almost to 1. 1 means that it costs exactly the same to go randomly through the data then it does to go sequential. I would guess that even in RAM it is faster to go sequential (since you still have to page and deal with L1/L2/L3 cache, etc). But the default random_page_cost of 4 is probably too high for you. John =:- signature.asc Description: OpenPGP digital signature
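A safe way to try this is per session first (a sketch); only move it into postgresql.conf once the plans look right:

SET random_page_cost = 1.5;
-- re-run EXPLAIN ANALYZE on the queries that were getting poor plans
RESET random_page_cost;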
Re: [PERFORM] Planner constants for RAM resident databases
Emil Briggs wrote: I just mentioned random_page_cost, but you should also tune effective_cache_size, since that is effectively most of your RAM. It depends what else is going on in the system, but setting it as high as say 12-14GB is probably reasonable if it is a dedicated machine. With random_page_cost 1.5-2, and higher effective_cache_size, you should be doing pretty well. John =:- I tried playing around with these and they had no effect. It seems the only thing that makes a difference is cpu_tuple_cost. I'm surprised. I know cpu_tuple_cost can effect it as well, but usually the recommended way to get indexed scans is the above two parameters. When you do explain analyze of a query that you have difficulties with, how are the planner's estimates. Are the estimated number of rows about equal to the actual number of rows? If the planner is mis-estimating, there is a whole different set of tuning to do to help it estimate correctly. John =:- PS Use reply-all so that your comments go to the list. signature.asc Description: OpenPGP digital signature
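For reference, in 8.0 effective_cache_size is expressed in 8 kB pages, so a postgresql.conf sketch along the lines above might read (values illustrative):

random_page_cost = 1.5
effective_cache_size = 1572864     # roughly 12 GB in 8 kB pages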
Re: [PERFORM] Too slow querying a table of 15 million records
Tobias Brox wrote: [EMAIL PROTECTED] - Tue at 08:33:58PM +0200] I use FreeBSD 4.11 with PostGreSQL 7.3.8. (...) database= explain select date_trunc('hour', time),count(*) as total from test where p1=53 and time now() - interval '24 hours' group by date_trunc order by date_trunc ; I haven't looked through all your email yet, but this phenomena have been up at the list a couple of times. Try replacing now() - interval '24 hours' with a fixed time stamp, and see if it helps. pg7 will plan the query without knowledge of what now() - interval '24 hours' will compute to. This should be fixed in pg8. The grandparent was a mailing list double send. Notice the date is 1 week ago. It has already been answered (though your answer is still correct). John =:- signature.asc Description: OpenPGP digital signature
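The workaround being described, sketched with a literal timestamp (the date shown is only an example; the application would substitute "now minus 24 hours" itself):

SELECT date_trunc('hour', time), count(*) AS total
FROM test
WHERE p1 = 53
  AND time > '2005-07-18 20:00:00'
GROUP BY date_trunc('hour', time)
ORDER BY date_trunc('hour', time);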
Re: [PERFORM] tricky query
Merlin Moncure wrote: I need a fast way (sql only preferred) to solve the following problem: I need the smallest integer that is greater than zero that is not in the column of a table. In other words, if an 'id' column has values 1,2,3,4,6 and 7, I need a query that returns the value of 5. I've already worked out a query using generate_series (not scalable) and pl/pgsql. An SQL only solution would be preferred, am I missing something obvious? Merlin Not so bad. Try something like this: SELECT min(id+1) as id_new FROM table WHERE (id+1) NOT IN (SELECT id FROM table); Now, this requires probably a sequential scan, but I'm not sure how you can get around that. Maybe if you got trickier and did some ordering and limits. The above seems to give the right answer, though. I don't know how big you want to scale to. You might try something like: SELECT id+1 as id_new FROM t WHERE (id+1) NOT IN (SELECT id FROM t) ORDER BY id LIMIT 1; John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] tricky query
Merlin Moncure wrote: Not so bad. Try something like this: SELECT min(id+1) as id_new FROM table WHERE (id+1) NOT IN (SELECT id FROM table); Now, this requires probably a sequential scan, but I'm not sure how you can get around that. Maybe if you got trickier and did some ordering and limits. The above seems to give the right answer, though. it does, but it is still faster than generate_series(), which requires both a seqscan and a materialization of the function. I don't know how big you want to scale to. big. :) merlin See my follow up post, which enables an index scan. On my system with 90k rows, it takes no apparent time. (0.000ms) John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] tricky query
John A Meinel wrote: Merlin Moncure wrote: I need a fast way (sql only preferred) to solve the following problem: I need the smallest integer that is greater than zero that is not in the column of a table. In other words, if an 'id' column has values 1,2,3,4,6 and 7, I need a query that returns the value of 5. I've already worked out a query using generate_series (not scalable) and pl/pgsql. An SQL only solution would be preferred, am I missing something obvious? Merlin Not so bad. Try something like this: SELECT min(id+1) as id_new FROM table WHERE (id+1) NOT IN (SELECT id FROM table); Now, this requires probably a sequential scan, but I'm not sure how you can get around that. Maybe if you got trickier and did some ordering and limits. The above seems to give the right answer, though. I don't know how big you want to scale to. You might try something like: SELECT id+1 as id_new FROM t WHERE (id+1) NOT IN (SELECT id FROM t) ORDER BY id LIMIT 1; John =:- Well, I was able to improve it to using appropriate index scans. Here is the query: SELECT t1.id+1 as id_new FROM id_test t1 WHERE NOT EXISTS (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1) ORDER BY t1.id LIMIT 1; I created a test table which has 90k randomly inserted rows. And this is what EXPLAIN ANALYZE says: QUERY PLAN Limit (cost=0.00..12.10 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=1) - Index Scan using id_test_pkey on id_test t1 (cost=0.00..544423.27 rows=45000 width=4) (actual time=0.000..0.000 rows=1 loops=1) Filter: (NOT (subplan)) SubPlan - Index Scan using id_test_pkey on id_test t2 (cost=0.00..6.01 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=15) Index Cond: (id = ($0 + 1)) Total runtime: 0.000 ms (7 rows) The only thing I have is a primary key index on id_test(id); John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] tricky query
Merlin Moncure wrote: John Meinel wrote: See my follow up post, which enables an index scan. On my system with 90k rows, it takes no apparent time. (0.000ms) John =:- Confirmed. Hats off to you, the above some really wicked querying. IIRC I posted the same question several months ago with no response and had given up on it. I think your solution (smallest X1 not in X) is a good candidate for general bits, so I'm passing this to varlena for review :) SELECT t1.id+1 as id_new FROM id_test t1 WHERE NOT EXISTS (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1) ORDER BY t1.id LIMIT 1; Merlin Just be aware that as your table fills it's holes, this query gets slower and slower. I've been doing some testing. And it starts at 0.00 when the first entry is something like 3, but when you start getting to 16k it starts taking more like 200 ms. So it kind of depends how your table fills (and empties I suppose). The earlier query was slower overall (since it took 460ms to read in the whole table). I filled up the table such that 63713 is the first empty space, and it takes 969ms to run. So actually if your table is mostly full, the first form is better. But if you are going to have 100k rows, with basically random distribution of empties, then the NOT EXISTS works quite well. Just be aware of the tradeoff. I'm pretty sure the WHERE NOT EXISTS will always use a looping structure, and go through the index in order. John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] tricky query
Merlin Moncure wrote: On Tue, Jun 28, 2005 at 12:02:09 -0400, Merlin Moncure [EMAIL PROTECTED] wrote: Confirmed. Hats off to you, the above some really wicked querying. IIRC I posted the same question several months ago with no response and had given up on it. I think your solution (smallest X1 not in X) is a good candidate for general bits, so I'm passing this to varlena for review :) SELECT t1.id+1 as id_new FROM id_test t1 WHERE NOT EXISTS (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1) ORDER BY t1.id LIMIT 1; You need to rework this to check to see if row '1' is missing. The above returns the start of the first gap after the first row that isn't missing. Correct. In fact, I left out a detail in my original request in that I had a starting value (easily supplied with where clause)...so what I was really looking for was a query which started at a supplied value and looped forwards looking for an empty slot. John's supplied query is a drop in replacement for a plpgsql routine which does exactly this. The main problem with the generate_series approach is that there is no convenient way to determine a supplied upper bound. Also, in some corner cases of my problem domain the performance was not good. Merlin Actually, if you already have a lower bound, then you can change it to: SELECT t1.id+1 as id_new FROM id_test t1 WHERE t1.id id_min AND NOT EXISTS (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1) ORDER BY t1.id LIMIT 1; This would actually really help performance if you have a large table and then empty entries start late. On my system, where the first entry is 64k, doing where id 6 speeds it up back to 80ms instead of 1000ms. John =:- signature.asc Description: OpenPGP digital signature
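Folding in both refinements from this thread, a sketch that also handles the case where the starting value itself is free (the literal 42 stands in for the supplied lower bound):

SELECT COALESCE(
  (SELECT 42 WHERE NOT EXISTS (SELECT 1 FROM id_test WHERE id = 42)),
  (SELECT t1.id + 1
   FROM id_test t1
   WHERE t1.id >= 42
     AND NOT EXISTS (SELECT 1 FROM id_test t2 WHERE t2.id = t1.id + 1)
   ORDER BY t1.id
   LIMIT 1)
) AS id_new;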
Re: [PERFORM] max_connections / shared_buffers / effective_cache_size
Puddle wrote: Hello, I'm a Sun Solaris sys admin for a start-up company. I've got the UNIX background, but now I'm having to learn PostgreSQL to support it on our servers :) Server Background: Solaris 10 x86 PostgreSQL 8.0.3 Dell PowerEdge 2650 w/4gb ram. This is running JBoss/Apache as well (I KNOW the bad juju of running it all on one box, but it's all we have currently for this project). I'm dedicating 1gb for PostgreSQL alone. So, far I LOVE it compared to MySQL it's solid. The only things I'm kind of confused about (and I've been searching for answers on lot of good perf docs, but not too clear to me) are the following: 1.) shared_buffers I see lot of reference to making this the size of available ram (for the DB). However, I also read to make it the size of pgdata directory. I notice when I load postgres each daemon is using the amount of shared memory (shared_buffers). Our current dataset (pgdata) is 85mb in size. So, I'm curious should this size reflect the pgdata or the 'actual' memory given? I currently have this at 128mb You generally want shared_buffers to be no more than 10% of available ram. Postgres expects the OS to do it's own caching. 128M/4G = 3% seems reasonable to me. I would certainly never set it to 100% of ram. 2.) effective_cache_size - from what I read this is the 'total' allowed memory for postgresql to use correct? So, if I am willing to allow 1GB of memory should I make this 1GB? This is the effective amount of caching between the actual postgres buffers, and the OS buffers. If you are dedicating this machine to postgres, I would set it to something like 3.5G. If it is a mixed machine, then you have to think about it. This does not change how postgres uses RAM, it changes how postgres estimates whether an Index scan will be cheaper than a Sequential scan, based on the likelihood that the data you want will already be cached in Ram. If you dataset is only 85MB, and you don't think it will grow, you really don't have to worry about this much. You have a very small database. 3.) max_connections, been trying to figure 'how' to determine this #. I've read this is buffer_size+500k per a connection. ie. 128mb(buffer) + 500kb = 128.5mb per connection? Max connections is just how many concurrent connections you want to allow. If you can get away with lower, do so. Mostly this is to prevent connections * work_mem to get bigger than your real working memory and causing you to swap. I was curious about 'sort_mem' I can't find reference of it in the 8.0.3 documentation, has it been removed? sort_mem changed to work_mem in 8.0, same thing with vacuum_mem - maintenance_work_mem. work_mem and max_stack_depth set to 4096 maintenance_work_mem set to 64mb Depends how much space you want to give per connection. 4M is pretty small for a machine with 4G of RAM, but if your DB is only 85M it might be plenty. work_mem is how much memory a sort/hash/etc will use before it spills to disk. So look at your queries. If you tend to sort most of your 85M db in a single query, you might want to make it a little bit more. But if all of your queries are very selective, 4M could be plenty. I would make maintenance_work_mem more like 512M. It is only used for CREATE INDEX, VACUUM, etc. Things that are not generally done by more than one process at a time. And it's nice for them to have plenty of room to run fast. Thanks for any help on this. I'm sure bombardment of newbies gets old :) -William Good luck, John =:- signature.asc Description: OpenPGP digital signature
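Pulling the numbers above into a postgresql.conf sketch for that 4 GB box (8.0 units: shared_buffers and effective_cache_size are in 8 kB pages, work_mem and maintenance_work_mem are in kB; all values are illustrative starting points):

shared_buffers = 16384           # 128 MB, roughly 3% of RAM
effective_cache_size = 262144    # ~2 GB; adjust for what JBoss/Apache leave free
work_mem = 4096                  # 4 MB per sort/hash step
maintenance_work_mem = 524288    # 512 MB for VACUUM, CREATE INDEX, etc.
max_connections = 100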
Re: [PERFORM] Speed with offset clause
Yves Vindevogel wrote: Hi again all, My queries are now optimised. They all use the indexes like they should. However, there's still a slight problem when I issue the offset clause. We have a table that contains 600.000 records We display them by 25 in the webpage. So, when I want the last page, which is: 600k / 25 = page 24000 - 1 = 23999, I issue the offset of 23999 * 25 This take a long time to run, about 5-10 seconds whereas offset below 100 take less than a second. Can I speed this up ? Met vriendelijke groeten, Bien à vous, Kind regards, *Yves Vindevogel* *Implements* Postgres has the optimization that it will plan a query, and once it reaches the limit, it can stop even though there is more data available. The problem you are having is that it has to go through offset rows first, before it can apply the limit. If you can, (as mentioned in the other post), try to refine your index so that you can reverse it for the second half of the data. This is probably tricky, as you may not know how many rows you have (or the amount might be changing). A potentially better thing, is if you have an index you are using, you could use a subselect so that the only portion that needs to have 60k rows is a single column. Maybe an example: Instead of saying: SELECT * FROM table1, table2 WHERE table1.id = table2.id ORDER BY table1.date OFFSET x LIMIT 25; You could do: SELECT * FROM (SELECT id FROM table1 OFFSET x LIMIT 25) as subselect JOIN table1 ON subselect.id = table1.id , table2 WHERE table1.id = table2.id; That means that the culling process is done on only a few rows of one table, and the rest of the real merging work is done on only a few rows. It really depends on you query, though, as what rows you are sorting on has a big influence on how well this will work. John =:- signature.asc Description: OpenPGP digital signature
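A slightly fuller sketch of that subselect idea, assuming the ordering column lives in table1 and is indexed (offset 599975 is page 23999 of 25 rows; the table and column names come from the example above):

SELECT t1.*, t2.*
FROM (SELECT id FROM table1 ORDER BY date OFFSET 599975 LIMIT 25) AS page
JOIN table1 t1 ON t1.id = page.id
JOIN table2 t2 ON t2.id = t1.id
ORDER BY t1.date;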
Re: [PERFORM] Limit clause not using index
Yves Vindevogel wrote: Hi, I have a very simple query on a big table. When I issue a limit and/or offset clause, the query is not using the index. Can anyone explain me this ? You didn't give enough information. What does you index look like that you are expecting it to use? Generally, you want to have matching columns. So you would want CREATE INDEX blah ON tblprintjobs(loginuser, desceventdate, desceventtime); Next, you should post EXPLAIN ANALYZE instead of regular explain, so we can have an idea if the planner is actually making correct estimations. John =:- rvponp=# explain select * from tblprintjobs order by loginuser, desceventdate, desceventtime offset 25 limit 25 ; QUERY PLAN --- Limit (cost=349860.62..349860.68 rows=25 width=206) - Sort (cost=349860.56..351416.15 rows=622236 width=206) Sort Key: loginuser, desceventdate, desceventtime - Seq Scan on tblprintjobs (cost=0.00..25589.36 rows=622236 width=206) (4 rows) rvponp=# explain select * from tblprintjobs order by loginuser, desceventdate, desceventtime ; QUERY PLAN - Sort (cost=349860.56..351416.15 rows=622236 width=206) Sort Key: loginuser, desceventdate, desceventtime - Seq Scan on tblprintjobs (cost=0.00..25589.36 rows=622236 width=206) (3 rows) Met vriendelijke groeten, Bien à vous, Kind regards, *Yves Vindevogel* *Implements* signature.asc Description: OpenPGP digital signature
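A sketch of the matching index and the re-test (the index name is illustrative):

CREATE INDEX tblprintjobs_user_date_time_idx
  ON tblprintjobs (loginuser, desceventdate, desceventtime);
ANALYZE tblprintjobs;
EXPLAIN ANALYZE
SELECT * FROM tblprintjobs
ORDER BY loginuser, desceventdate, desceventtime
OFFSET 25 LIMIT 25;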
Re: [PERFORM] Do Views execute underlying query everytime ??
Amit V Shah wrote: After I sent out this email, I found this article from google http://jonathangardner.net/PostgreSQL/materialized_views/matviews.html Looks like we can control as to when the views refresh... I am still kind of confused, and would appreciate help !! The create/drop table does sound a solution that can work, but the thing is I want to get manual intervention out, and besides, my work flow is very complex so this might not be an option for me :-( Thanks, Amit Just to make it clear, a view is not the same as a materialized view. A view is just a set of rules to the planner so that it can simplify interactions with the database. A materialized view is a query which has been saved into a table. To set it up properly, really depends on what your needs are. 1. How much time can elapse between an update to the system, and an update to the materialized views? 2. How many updates / (sec, min, hour, month) do you expect. Is insert performance critical, or secondary. For instance, if you get a lot of updates, but you can have a 1 hour lag between the time a new row is inserted and the view is updated, you can just create a cron job that runs every hour to regenerate the materialized view. If you don't get many updates, but you need them to show up right away, then you can add triggers to the affected tables, such that inserting/updating to a specific table causes an update to the materialized view. There are quite a few potential tradeoffs. Rather than doing a materialized view, you could just improve your filters. If you are doing a query to show people the results, you generally have some sort of upper bound on how much data you can display. Humans don't like reading more than 100 or 1000 rows. So create your normal query, and just take on a LIMIT 100 at the end. If you structure your query properly, and have appropriate indexes, you should be able to make the LIMIT count, and allow you to save a lot of overhead of generating rows that you don't use. I would probably start by posting the queries you are currently using, along with an EXPLAIN ANALYZE, and a description of what you actually need from the query. Then this list can be quite helpful in restructuring your query to make it faster. John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Limit clause not using index
Yves Vindevogel wrote: rvponp=# explain analyze select * from tblPrintjobs order by loginuser, desceventdate, desceventtime ; QUERY PLAN Sort (cost=345699.06..347256.49 rows=622972 width=203) (actual time=259438.952..268885.586 rows=622972 loops=1) Sort Key: loginuser, desceventdate, desceventtime - Seq Scan on tblprintjobs (cost=0.00..25596.72 rows=622972 width=203) (actual time=21.155..8713.810 rows=622972 loops=1) Total runtime: 271583.422 ms (4 rows) Can you post it with the limit? I realize the query takes a long time, but that is the more important query to look at. Also, just as a test, if you can, try dropping most of the indexes except for the important one. It might be that the planner is having a hard time because there are too many permutations to try. I believe if you drop the indexes inside a transaction, they will still be there for other queries, and if you rollback instead of commit, you won't lose anything. BEGIN; DROP INDEX ... EXPLAIN ANALYZE SELECT *... ROLLBACK; John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] slow growing table
Jone C wrote: On second thought... Does a VACUUM FULL help? If so, you might want to increase your FSM settings. Thank you for the reply, sorry for delay I was on holiday. I tried that it had no effect. I benchmarked 2x before, peformed VACUUM FULL on the table in question post inserts, then benchmarked 2x after. Same results... Should I try your suggestion on deleting the indexes? This table needs to be accessible for reads at all times however though... thank you kindly I believe dropping an index inside a transaction is only visible to that transaction. (Can someone back me up on this?) Which means if you did: BEGIN; DROP INDEX index in question; CREATE INDEX same index ON same stuff; COMMIT; The only problem is that if you are using a unique or primary key index, a foreign key which is referencing that index would have to be dropped and re-created as well. So you could have a pretty major cascade effect. A better thing to do if your table only has one (or at least only a few) indexes, would be to CLUSTER, which is effectively a VACUUM FULL + a REINDEX (plus sorting the rows so that they are in index order). It holds a full lock on the table, and takes a while, but when you are done, things are cleaned up quite a bit. You might also try just a REINDEX on the indexes in question, but this also holds a full lock on the table. (My DROP + CREATE might also as well, I'm not really sure, I just think of it as a way to recreate without losing it for other transactions) John =:- signature.asc Description: OpenPGP digital signature
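For reference, the CLUSTER alternative mentioned above looks like this in the 8.0-era syntax (the names are placeholders for the table in question and its most-used index):

CLUSTER some_index_name ON some_table;
ANALYZE some_table;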
Re: [PERFORM] Another question on indexes (drop and recreate)
Yves Vindevogel wrote: Hi, I have another question regarding indexes. I have a table with a lot of indexes on it. Those are needed to perform my searches. Once a day, a bunch of records is inserted in my table. Say, my table has 1.000.000 records and I add 10.000 records (1% new) What would be faster. 1) Dropping my indexes and recreating them after the inserts 2) Just inserting it and have PG manage the indexes Met vriendelijke groeten, Bien à vous, Kind regards, *Yves Vindevogel* *Implements* I'm guessing for 1% new that (2) would be faster. John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Another question on indexes (drop and recreate)
Yves Vindevogel wrote: And, after let's say a week, would that index still be optimal or would it be a good idea to drop it in the weekend and recreate it. It depends a little bit on the postgres version you are using. If you are only ever adding to the table, and you are not updating it or deleting from it, I think the index is always optimal. Once you start deleting from it there are a few cases where older versions would not properly re-use the empty entries, requiring a REINDEX. (Deleting low numbers and always adding high numbers was one of the cases) However, I believe that as long as you vacuum often enough, so that the system knows where the unused entries are, you don't ever have to drop and re-create the index. John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Another question on indexes (drop and recreate)
Yves Vindevogel wrote: I only add records, and most of the values are random, except the columns for dates. I doubt that you would need to recreate indexes. That really only needs to be done in pathological cases, most of which have been fixed in the latest postgres. If you are only inserting (never updating or deleting), the index can never bloat, since you are only adding new stuff. (You cannot get dead items to bloat your index if you never delete anything.) John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] autovacuum suggestions for 500,000,000+ row tables?
Alex Stapleton wrote: On 20 Jun 2005, at 15:59, Jacques Caron wrote: ... ANALYZE is not a very expensive operation, however VACUUM can definitely be a big strain and take a long time on big tables, depending on your setup. I've found that partitioning tables (at the application level) can be quite helpful if you manage to keep each partition to a reasonable size (under or close to available memory), especially if the partitioning scheme is somehow time- related. YMMV. Jacques. That's not currently an option as it would require a pretty large amount of work to implement. I think we will have to keep that in mind though. Remember, you can fake it with a low-level set of tables, and then wrap them into a UNION ALL view. So you get something like: CREATE VIEW orig_table AS SELECT * FROM table_2005_04 UNION ALL SELECT * FROM table_2005_05 UNION ALL SELECT * FROM table_2005_06 ... ; Then at least your individual operations are fast. As you insert, you can create a rule that on insert into orig_table do instead ... insert into table_2005_07 (or whatever the current table is). It takes a little bit of maintenance on the DB admin's part, since every month they have to create a new table, and then update all of the views and triggers. But it is pretty straightforward. If you are doing append-only inserting, then you have the nice feature that only the last table is ever modified, which means that the older tables don't really need to be vacuumed or analyzed. And even if you have to have each table modified as you go, you still can break up a VACUUM into only doing one of the sub tables at a time. I don't know you db schema, but I thought I would mention that true partitioning isn't implemented yet, you can still get something very similar with views, triggers and rules. John =:- signature.asc Description: OpenPGP digital signature
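A sketch of the insert rule described above (the partition name is illustrative and has to be repointed each month at the current table):

CREATE RULE orig_table_insert AS
ON INSERT TO orig_table
DO INSTEAD
  INSERT INTO table_2005_07 VALUES (NEW.*);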
Re: [PERFORM] How does the transaction buffer work?
Veikko Mkinen wrote: Hey, How does Postgres (8.0.x) buffer changes to a database within a transaction? I need to insert/update more than a thousand rows (mayde even more than 1 rows, ~100 bytes/row) in a table but the changes must not be visible to other users/transactions before every row is updated. One way of doing this that I thought of was start a transaction, delete everything and then just dump new data in (copy perhaps). The old data would be usable to other transactions until I commit my insert. This would be the fastest way, but how much memory would this use? Will this cause performance issues on a heavily loaded server with too little memory even to begin with :) Postgres does indeed keep track of who can see what. Such that changes won't be seen until a final commit. If you are trying to insert bulk data, definitely consider COPY. But UPDATE should also be invisible until the commit. So if you are only changing data, there really isn't any reason to do a DELETE and INSERT. Especially since you'll have problems with foreign keys at the DELETE stage. John =:- -veikko signature.asc Description: OpenPGP digital signature
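The visibility point above in its simplest form (a sketch; the table and file are placeholders): other sessions keep seeing the old rows until the COMMIT.

BEGIN;
COPY items FROM '/tmp/new_items.copy';  -- loaded rows are not yet visible elsewhere
COMMIT;                                 -- everything appears to other sessions at once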
Re: [PERFORM] Index ot being used
Kevin Grittner wrote: It sure would be nice if the optimizer would consider that it had the leeway to add any column which was restricted to a single value to any point in the ORDER BY clause. Without that, the application programmer has to know what indexes are on the table, rather than being able to just worry about the set of data they want. Obviously, if a column can have only one value in the result set, adding to any point in the ORDER BY can't change anything but performance. That sure sounds like something which should fall within the scope of an optimizer. It really should be a DBA function to add or drop indexes to tune the performance of queries, without requiring application programmers to modify the queries for every DBA adjustment. (When you have a database with over 350 tables and thousands of queries, you really begin to appreciate the importance of this.) I agree that having a smarter optimizer, which can recognize when an index can be used for ORDER BY would be useful. I don't know if there are specific reasons why not, other than just not being implemented yet. It might be tricky to get it correct (for instance, how do you know which columns can be added, which ones will be constant) Perhaps you could just potentially add the WHERE items if they have an equality constraint with a constant. But I'm guessing there are more cases than that where the optimization could be performed. Also, the more options you give the planner, the longer it takes on average to plan any single query. Yes, it is beneficial for this use case, but does that balance out slowing down all the other queries by a tiny bit. I'm guessing the optimization wasn't as important as some of the others that have been done, so it hasn't been implemented yet. John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] PostgreSQL using the wrong Index
Alex Stapleton wrote: Oh, we are running 7.4.2 btw. And our random_page_cost = 1 Which is only correct if your entire db fits into memory. Also, try updating to a later 7.4 version if at all possible. On 13 Jun 2005, at 14:02, Alex Stapleton wrote: We have two index's like so l1_historical=# \d N_intra_time_idx Index N_intra_time_idx Column |Type +- time | timestamp without time zone btree Just so you are aware, writing this as: We have an index on N_intra(time) and one on N_Intra(symbol, time) is a lot more succinct. l1_historical=# \d N_intra_pkey Index N_intra_pkey Column |Type +- symbol | text time | timestamp without time zone unique btree (primary key) and on queries like this select * from N_intra where symbol='SOMETHING WHICH DOESNT EXIST' order by time desc limit 1; PostgreSQL takes a very long time to complete, as it effectively scans the entire table, backwards. And the table is huge, about 450 million rows. (btw, there are no triggers or any other exciting things like that on our tables in this db.) but on things where the symbol does exist in the table, it's more or less fine, and nice and fast. What happens if you do: SELECT * FROM N_intra WHERE symbol='doesnt exist' ORDER BY symbol, time DESC LIMIT 1; Yes, symbol is constant, but it frequently helps the planner realize it can use an index scan if you include all terms in the index in the ORDER BY clause. Whilst the option the planner has taken might be faster most of the time, the worst case scenario is unacceptable for obvious reasons. I've googled for trying to force the use of a specific index, but can't find anything relevant. Does anyone have any suggestions on getting it to use an index which hopefully will have better worst case performance? Try the above first. You could also create a new index on symbol CREATE INDEX N_intra_symbol_idx ON N_intra(symbol); Then the WHERE clause should use the symbol index, which means it can know quickly that an entry doesn't exist. I'm not sure how many entries you have per symbol, though, so this might cause problems in the ORDER BY time portion. I'm guessing what you really want is to just do the ORDER BY symbol, time. John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] faster search
Clark Slater wrote: Hi- Would someone please enlighten me as to why I'm not seeing a faster execution time on the simple scenario below? there are 412,485 rows in the table and the query matches on 132,528 rows, taking almost a minute to execute. vaccuum analyze was just run. Well, if you are matching 130k out of 400k rows, then a sequential scan is certainly prefered to an index scan. And then you have to sort those 130k rows by partnumber. This *might* be spilling to disk depending on what your workmem/sortmem is set to. I would also say that what you would really want is some way to get the whole thing from an index. And I think the way to do that is: CREATE INDEX test_partnum_listid_typeid_idx ON test(partnumber, productlistid, typeid); VACUUM ANALYZE test; EXPLAIN ANALYZE SELECT * FROM test WHERE productlistid=3 AND typeid=9 ORDER BY partnumber, productlistid, typeid LIMIT 15 ; The trick is that you have to match the order by exactly with the index, so the planner realizes it can do an indexed lookup to get the information. You could also just create an index on partnumber, and see how that affects your original query. I think the planner could use an index lookup on partnumber to get the ordering correct. But it will have to do filtering after the fact based on productlistid and typeid. With my extended index, I think the planner can be smarter and lookup all 3 by the index. Thanks! Clark Good luck, John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Postgresql on an AMD64 machine
Neil Conway wrote: Tom Arthurs wrote: Yes, shared buffers in postgres are not used for caching Shared buffers in Postgres _are_ used for caching, they just form a secondary cache on top of the kernel's IO cache. Postgres does IO through the filesystem, which is then cached by the kernel. Increasing shared_buffers means that less memory is available for the kernel to cache IO -- increasing shared_buffers has been shown to be a net performance loss beyond a certain point. Still, there is value in shared_buffers as it means we can avoid a read() system call for hot pages. We can also do better buffer replacement in the PG shared buffer than the kernel can do (e.g. treating IO caused by VACUUM specially). As I recall, one of the performance problems with a large shared_buffers is that there are some commands which require looking at *all* of the shared buffer space. So the larger it gets, the longer those functions take. My biggest challenge with solaris/sparc is trying to reduce context switching. It would be interesting to see if this is improved with current sources, as Tom's bufmgr rewrite should have hopefully have reduced this problem. These might be what was fixed with Tom's rewrite. I don't really know. John =:- -Neil ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Performance nightmare with dspam (urgent) (resolved)
Casey Allen Shobe wrote: On Wednesday 01 June 2005 20:19, Casey Allen Shobe wrote: ... Long-term, whenever we hit the I/O limit again, it looks like we really don't have much of a solution except to throw more hardware (mainly lots of disks in RAID0's) at the problem. :( Fortunately, with the above two changes I/O usage on the PG data disk is a quarter of what it was, so theoretically we should be able to quadruple the number of users on current hardware. Be very careful in this situation. If any disks in a RAID0 fails, the entire raid is lost. You *really* want a RAID10. It takes more drives, but then if anything dies you don't lose everything. If you are running RAID0 and you *really* want performance, and aren't concerned about safety (at all), you could also set fsync=false. That should also speed things up. But you are really risking corruption/data loss on your system. Our plan forward is to increase the number of disks in the two redundant mail servers, so that each has a single ultra320 disk for O/S and pg_xlog, and a 3-disk RAID0 for the data. This should triple our current capacity. I don't know if you can do it, but it would be nice to see this be 1 RAID1 for OS, 1 RAID10 for pg_xlog, and another RAID10 for data. That is the recommended performance layout. It takes quite a few drives (minimum of 10). But it means your data is safe, and your performance should be very good. The general opinion of the way dspam uses the database among people I've talked to on #postgresql is not very good, but of course the dspam folk blame PostgreSQL and say to use MySQL if you want reasonable performance. Makes it real fun to be a DSpam+PostgreSQL user when limits are reached, since everyone denies responsibility. Fortunately, PostgreSQL people are pretty helpful even if they think the client software sucks. :) I can't say how dspam uses the database. But they certainly could make assumptions about how certain actions are done by the db, which are not quite true with postgres. (For instance MySQL can use an index to return information, because Postgres supports transactions, it cannot, because even though a row is in the index, it may not be visible to the current transaction.) They also might be doing stuff like select max(row) instead of select row ORDER BY row DESC LIMIT 1. In postgres the former will be a sequential scan, the latter will be an index scan. Though I wonder about select max(row) ORDER BY row DESC LIMIT 1. to me, that should still return the right answer, but I'm not sure. Cheers, Good luck, John =:- signature.asc Description: OpenPGP digital signature
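The rewrite alluded to above, spelled out (a sketch; the table and column names are hypothetical, and the IS NOT NULL keeps the two forms equivalent when the column contains NULLs):

SELECT max(id) FROM messages;                  -- planned as a sequential scan on 8.0
SELECT id FROM messages
WHERE id IS NOT NULL
ORDER BY id DESC LIMIT 1;                      -- can use an index on messages(id)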
Re: [PERFORM] Performance nightmare with dspam (urgent) (resolved)
Michael Stone wrote: On Mon, Jun 06, 2005 at 10:08:23AM -0500, John A Meinel wrote: I don't know if you can do it, but it would be nice to see this be 1 RAID1 for OS, 1 RAID10 for pg_xlog, That's probably overkill--it's a relatively small sequential-write partition with really small writes; I don't see how pg_xlog would benefit from raid10 as opposed to raid1. Mike Stone pg_xlog benefits from being super fast. Because it has to be fully synced before the rest of the data can be committed. Yes they are small, but if you can make it fast, you eliminate that overhead. It also benefits from having it's own spindle, because you eliminate the seek time. (Since it is always appending) Anyway, my point is that pg_xlog isn't necessarily tiny. Many people seem to set it as high as 100-200, and each one is 16MB. But one other thing to consider is to make pg_xlog on a battery backed ramdisk. Because it really *can* use the extra speed. I can't say that a ramdisk is more cost effective than faster db disks. But if you aren't using many checkpoint_segments, it seems like you could get a 1GB ramdisk, and probably have a pretty good performance boost. (I have not tested this personally, though). Since he is using the default settings (mostly) for dspam, he could probably get away with something like a 256MB ramdisk. The only prices I could find with a few minutes of googleing was: http://www.cenatek.com/store/category.cfm?Category=15 Which is $1.6k for 2GB. But there is also a product that is being developed, which claims $60 for the PCI card, you supply the memory. It has 4 DDR slots http://www.engadget.com/entry/1234000227045399/ And you can get a 128MB SDRAM ECC module for around $22 http://www.newegg.com/Product/Product.asp?Item=N82E16820998004 So that would put the total cost of a 512MB battery backed ramdisk at $60 + 4*22 = $150. That certainly seems less than what you would pay for the same speed in hard-drives. Unfortunately the Giga-byte iRam seems to just be in the demo stage. But if they aren't lying in the press releases, it would certainly be something to keep an eye on. John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Postgresql and Software RAID/LVM
Marty Scholes wrote: Has anyone ran Postgres with software RAID or LVM on a production box? What have been your experience? Yes, we have run for a couple years Pg with software LVM (mirroring) against two hardware RAID5 arrays. We host a production Sun box that runs 24/7. My experience: * Software RAID (other than mirroring) is a disaster waiting to happen. If the metadata for the RAID set gives out for any reason (CMOS scrambles, card dies, power spike, etc.) then you are hosed beyond belief. In most cases it is almost impossible to recover. With mirroring, however, you can always boot and operate on a single mirror, pretending that no LVM/RAID is underway. In other words, each mirror is a fully functional copy of the data which will operate your server. Isn't this actually more of a problem for the meta-data to give out in a hardware situation? I mean, if the card you are using dies, you can't just get another one. With software raid, because the meta-data is on the drives, you can pull it out of that machine, and put it into any machine that has a controller which can read the drives, and a similar kernel, and you are back up and running. * Hardware RAID5 is a terrific way to boost performance via write caching and spreading I/O across multiple spindles. Each of our external arrays operates 14 drives (12 data, 1 parity and 1 hot spare). While RAID5 protects against single spindle failure, it will not hedge against multiple failures in a short time period, SCSI contoller failure, SCSI cable problems or even wholesale failure of the RAID controller. All of these things happen in a 24/7 operation. Using software RAID1 against the hardware RAID5 arrays hedges against any single failure. No, it hedges against *more* than one failure. But you can also do a RAID1 over a RAID5 in software. But if you are honestly willing to create a full RAID1, just create a RAID1 over RAID0. The performance is much better. And since you have a full RAID1, as long as both drives of a pairing don't give out, you can lose half of your drives. If you want the space, but you feel that RAID5 isn't redundant enough, go to RAID6, which uses 2 parity locations, each with a different method of storing parity, so not only is it more redundant, you have a better chance of finding problems. * Software mirroring gives you tremendous ability to change the system while it is running, by taking offline the mirror you wish to change and then synchronizing it after the change. That certainly is a nice ability. But remember that LVM also has the idea of snapshoting a running system. I don't know the exact details, just that there is a way to have some processes see the filesystem as it existed at an exact point in time. Which is also a great way to handle backups. On a fully operational production server, we have: * restriped the RAID5 array * replaced all RAID5 media with higher capacity drives * upgraded RAID5 controller * moved all data from an old RAID5 array to a newer one * replaced host SCSI controller * uncabled and physically moved storage to a different part of data center Again, all of this has taken place (over the years) while our machine was fully operational. So you are saying that you were able to replace the RAID controller without turning off the machine? I realize there does exist hot-swappable PCI cards, but I think you are overstating what you mean by fully operational. For instance, it's not like you can access your data while it is being physically moved. I do think you had some nice hardware. 
But I know you can do all of this in software as well. It is usually a price/performance tradeoff. You spend quite a bit to get a hardware RAID card that can keep up with a modern CPU. I know we have an FC raid box at work which has a full 512MB of cache on it, but it wasn't that much cheaper than buying a dedicated server. John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] postgresql-8.0.1 performance tuning
Martin Fandel wrote: Hi @ all, I'm trying to tune my postgresql-db but I don't know if the values are set right. I use the following environment for the postgres-db: # Hardware cpu: 2x P4 3Ghz ram: 1024MB DDR 266Mhz partitions: /dev/sda3 23G 9,6G 13G 44% / /dev/sda1 11G 156M 9,9G 2% /var /dev/sdb1 69G 13G 57G 19% /var/lib/pgsql /dev/sda is in raid 1 (2x 35GB / 1upm / sca) /dev/sdb is in raid 10 (4x 35GB / 1upm / sca) # /Hardware You probably want to put the pg_xlog directory onto /dev/sda rather than having it in /dev/sdb. Having it separate from the data usually boosts performance a lot. I believe you can just mv it to a different directory, and then recreate it as a symlink. (Stop the database first :) # Config /etc/sysctl.conf: kernel.shmall = 786432000 kernel.shmmax = 786432000 Not really sure about these two. /etc/fstab: /dev/sdb1 /var/lib/pgsql reiserfs acl,user_xattr,noatime,data=writeback 1 2 Seems decent. /var/lib/pgsql/data/postgresql.conf superuser_reserved_connections = 2 shared_buffers = 3000 work_mem= 131072 maintenance_work_mem= 131072 These both seem pretty large. But it depends on how many concurrent connections doing sorting/hashing/etc you expect. If you are only expecting 1 connection, these are probably fine. Otherwise with 1GB of RAM I would probably make work_mem more like 4096/8192. Remember, running out of work_mem means postgres will spill to disk, slowing that query. Running out of RAM causes the system to swap, making everything slow. max_stack_depth = 2048 max_fsm_pages = 2 max_fsm_relations = 1000 max_files_per_process = 1000 vacuum_cost_delay = 10 vacuum_cost_page_hit= 1 vacuum_cost_page_miss = 10 vacuum_cost_page_dirty = 20 vacuum_cost_limit = 200 bgwriter_delay = 200 bgwriter_percent= 1 bgwriter_maxpages = 100 fsync = true wal_sync_method = fsync wal_buffers = 64 commit_delay= 0 commit_siblings = 5 checkpoint_segments = 256 checkpoint_timeout = 900 checkpoint_warning = 30 effective_cache_size= 1 random_page_cost= 4 cpu_tuple_cost = 0.01 cpu_index_tuple_cost= 0.001 cpu_operator_cost = 0.0025 geqo= true geqo_threshold = 12 geqo_effort = 5 geqo_pool_size = 0 geqo_generations= 0 geqo_selection_bias = 2.0 deadlock_timeout= 1000 max_locks_per_transaction = 64 # /Config # Transactions we have about 115-300 transactions/min in about 65 tables. # /Transactions I'm really new at using postgres, so I need some experience to set these parameters in the postgresql- and the system-config. I can't find standard calculations for this. :/ The postgresql-documentation doesn't help me to set the best values for this. The database must be highly available. I configured rsync to sync the complete /var/lib/pgsql-directory to my hot-standby. On the hot-standby I will make the dumps of the database to improve the performance of the master-db. I don't think a plain rsync of a running database is completely valid. Probably you should look more into Slony. http://slony.info It is a single-master asynchronous replication system. I believe it is pretty easy to set up, and does what you really want. In my tests the synchronization works fine. I synchronised the whole directory and restarted the database of the hot-standby. While restarting, postgresql replayed the old (not archived) WALs and the database of my hot-standby was consistent. Is this solution recommended? Or must I use archived WALs with real system-snapshots? best regards, Martin Fandel John =:- signature.asc Description: OpenPGP digital signature
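One follow-up on the work_mem point: rather than leaving a very large value in postgresql.conf, you can keep the global setting modest and raise it only for the session (or the single big query) that actually needs it. A minimal sketch, assuming an 8.0-style setup; the table name and the 64MB figure are made-up examples, not from Martin's config:

SHOW work_mem;          -- current value, in kB on 8.0
SET work_mem = 65536;   -- raise to 64MB for this session only
SELECT some_key, count(*) FROM some_big_table GROUP BY some_key;  -- the one big sort/hash
RESET work_mem;         -- drop back to the postgresql.conf value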
Re: [PERFORM] slow queries, possibly disk io
Josh Close wrote: I have some queries that have significantly slowed down in the last couple days. It's gone from 10 seconds to over 2 mins. The cpu has never gone over 35% in the server's lifetime, but the load average is over 8.0 right now. I'm assuming this is probably due to disk io. I need some help setting up postgres so that it doesn't need to go to disk. I think the shared_buffers and effective_cache_size values are the ones I need to look at. Would setting shmmax and shmall to 90% or so of available mem and putting a lot for postgres be helpful? Setting shared buffers above something like 10-30% of memory is counterproductive. Effective cache size says this: Sets the planner's assumption about the effective size of the disk cache (that is, the portion of the kernel's disk cache that will be used for PostgreSQL data files). Does that mean the total available ram? Or what's left over from shared_buffers? I've tried different things and not much has been working. Is there a good way to ensure that most of the tables accessed in postgres will be cached in mem and not have to go to disk? If I'm joining a lot of tables, should the sort_mem be set high also? Do shared_buffers, effective_cache_size, and sort_mem all use different mem? Or are they separate? Increasing sort_mem can help with various activities, but increasing it too much can cause you to swap, which kills performance. The caution is that you will likely use at least 1 sort_mem per connection, and can likely use more than one if the query is complicated. effective_cache_size changes how Postgres plans queries, but given the same query plan, it doesn't change performance at all. I've looked for information and haven't found any useful pages about this. Any help would be greatly appreciated. Thanks. -Josh John =:- signature.asc Description: OpenPGP digital signature
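To make the effective_cache_size advice concrete: it is only a planner hint, expressed in 8 kB pages, and a common rule of thumb is to set it to roughly the free-plus-cached memory the OS reports. A rough sketch; the 90000-page figure (about 700MB) and the table name are hypothetical:

SET effective_cache_size = 90000;  -- planner hint only, assumes ~700MB of OS cache
EXPLAIN SELECT * FROM some_table WHERE some_col = 42;  -- see whether index plans now look cheaper
RESET effective_cache_size;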
Re: [PERFORM] Inner join on two OR conditions dont use index
Jocelyn Turcotte wrote: Hi all, I don't know if this is normal, but if yes I would like to know why, and how I could do it another way other than using unions. The only thing that *might* work is if you used an index on both keys. So if you did: CREATE INDEX rt_edge_start_end_node ON rt_edge(start_node_id,end_node_id); The reason is that in an OR construct, you have to check both for being true. So in the general case where you don't know the correlation between the columns, you have to check all of the entries, because even if you know the status of one side of the OR, you don't know the other. Another possibility would be to try this index: CREATE INDEX rt_edge_start_or_end ON rt_edge(start_node_id OR end_node_id); I'm not sure how smart the planner can be, though. John =:- signature.asc Description: OpenPGP digital signature
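Since the original question mentioned unions, the usual workaround on these versions is exactly that: two single-column indexes plus a UNION, so each branch can use its own index. A sketch under the assumption that the query filters on a single node id; the literal 123 and the SELECT list are hypothetical, since the original query isn't shown:

CREATE INDEX rt_edge_start_node_idx ON rt_edge (start_node_id);
CREATE INDEX rt_edge_end_node_idx ON rt_edge (end_node_id);

SELECT * FROM rt_edge WHERE start_node_id = 123
UNION
SELECT * FROM rt_edge WHERE end_node_id = 123;
-- UNION removes edges that match on both columns; use UNION ALL if duplicates are acceptable.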
Re: [PERFORM] Select performance vs. mssql
Michael Stone wrote: On Tue, May 24, 2005 at 04:35:14PM -0700, Josh Berkus wrote: Pretty much. There has been discussion about allowing index-only access to frozen tables, i.e. archive partitions. But it all sort of hinges on someone implementing it and testing Is there any way to expose the planner estimate? For some purposes it's enough to just give a rough ballpark (e.g., a Google-esque "results 1-10 of approximately 1000") so a user knows whether it's worth even starting to page through. Mike Stone Well, you could always do: EXPLAIN SELECT ... And then parse out the rows= in the first line. John =:- signature.asc Description: OpenPGP digital signature
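To illustrate the rows= idea: the estimate sits in the top line of plain EXPLAIN output, so nothing has to be executed. The table, index, and numbers here are invented for the example:

EXPLAIN SELECT * FROM messages WHERE topic_id = 42;
-- Index Scan using messages_topic_idx on messages  (cost=0.00..318.42 rows=1072 width=96)
-- The rows=1072 figure is the planner's guess, usually close enough for an
-- "approximately 1000 results" display without running the query.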
Re: [PERFORM] Can anyone explain this: duplicate dbs.
SpaceBallOne wrote: Wondering if someone could explain a peculiarity for me: We have a database which takes 1000ms to perform a certain query on. If I pg_dump that database then create a new database (e.g. tempdb) and upload the dump file (thus making a duplicate) then the same query only takes 190ms !! Vacuum, vacuum analyse, and vacuum full analyse do not seem to have an impact on these times. Can anyone explain why this may be occurring and how I might be able to keep the original database running at the same speed as tempdb? Thanks in advance, Dave. What version of postgres? There are a few possibilities. If you are having a lot of updates to the table, you can get index bloat. And vacuum doesn't fix indexes. You have to REINDEX to do that. Though REINDEX has the same lock that VACUUM FULL has, so you need to be a little careful with it. Probably better is to do CLUSTER, as it does a REINDEX and a sort, so your table ends up nicer when you are done. Also, older versions of postgres had a worse time with index bloat. One thing that caused a lot of problems is a table that you insert into over time, so that all the values are incrementing. If you are deleting older entries, that area won't be re-used because they fall at the back end. I believe this has been fixed in newer versions. By the way, I think doing: CREATE DATABASE tempdb WITH TEMPLATE = originaldb; is a much faster way of doing dump and load. I *think* it would recreate indexes, etc. If it just does a copy it may not show the dump/restore improvement. John =:- signature.asc Description: OpenPGP digital signature
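For reference, the maintenance commands mentioned above look like this; the table and index names are hypothetical since the schema wasn't posted:

REINDEX TABLE orders;           -- rebuilds bloated indexes, but takes an exclusive lock
CLUSTER orders_pkey ON orders;  -- 8.0 syntax: rewrites the table in index order and rebuilds the index
ANALYZE orders;                 -- refresh the planner statistics after the rewrite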
Re: [PERFORM] Can anyone explain this: duplicate dbs.
SpaceBallOne wrote: What version of postgres? 8.0.2 ... but I think I've seen this before on 7.3 ... There are a few possibilities. If you are having a lot of updates to the table, you can get index bloat. And vacuum doesn't fix indexes. You have to REINDEX to do that. Though REINDEX has the same lock that VACUUM FULL has, so you need to be a little careful with it. Probably better is to do CLUSTER, as it does a REINDEX and a sort, so your table ends up nicer when you are done. Thanks, will try those next time this problem crops up (I just deleted / recreated the database to speed things up for its users in the office ... probably should have held off to see if I could find a solution first!). Yes, the database / table-in-question does have a lot of updates, deletes, and new rows (relatively speaking for a small business). Would CLUSTER / REINDEX still have an effect if our queries were done via sequential scan? This is an old database (as in built by me when I was just starting to learn unix / postgres) so the database design is pretty horrible (little normalisation, no indexes). Well, my first recommendation is to put in some indexes. :) They are relatively easy to set up and can drastically improve select performance. What version of postgres are you using? What does it say at the end of VACUUM FULL ANALYZE VERBOSE? That should tell you how many free pages were reclaimed and how big your free space map should be. If you only did 1 VACUUM FULL, you might try another, as it sounds like your tables aren't properly filled. I'm pretty sure vacuum only removes empty pages/marks locations for the free space map so they can be re-used, while vacuum full will move entries around to create free pages. It sounds like it didn't do it properly. But even so, CLUSTER is still your friend, as it allows you to presort the rows in your tables. Have taken Chris's advice onboard too and set up cron to do a vacuumdb hourly instead of my weekly vacuum. Cheers, Dave. John =:- signature.asc Description: OpenPGP digital signature
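A minimal example of the "put in some indexes" advice, using hypothetical table and column names since the schema wasn't posted; the idea is simply to index the columns your WHERE clauses and joins actually use:

CREATE INDEX customer_surname_idx ON customer (surname);
CREATE INDEX job_customer_id_idx ON job (customer_id);
ANALYZE;  -- refresh planner statistics afterwards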
Re: [PERFORM] seqential vs random io
David Parker wrote: I just got a question from one of our QA guys who is configuring a RAID 10 disk that is destined to hold a postgresql database. The disk configuration procedure is asking him if he wants to optimize for sequential or random access. My first thought is that random is what we would want, but then I started wondering if it's not that simple, and my knowledge of stuff at the hardware level is, well, limited. If it were your QA guy, what would you tell him? - DAP Random. Sequential is always pretty fast; it is random that hurts. The only time I would say sequential is if you were planning on streaming large files (like ISO images) with low load. But for a DB, even a sequential scan will probably not be that much data. At least, that's my 2c. John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Optimize complex join to use where condition before
Greg Stark wrote: Sebastian Hennebrueder [EMAIL PROTECTED] writes: User-Agent: Mozilla Thunderbird 1.0 (Windows/20041206) ... Nested Loop (cost=1349.13..1435.29 rows=1 width=2541) (actual time=1640.000..3687.000 rows=62 loops=1) Join Filter: (inner.fid = outer.faufgaben_id) - Index Scan using idx_tauf_mit_mitid on taufgaben_mitarbeiter am (cost=0.00..80.65 rows=35 width=4) (actual time=0.000..0.000 rows=765 loops=1) Is it really Mozilla Thunderbird that's causing this new craptastic mangling of plans in people's mails? I was assuming it was some new idea of how to mess up people's mail coming out of Exchange or Lotus or some other such corporate messaging software that only handled SMTP mail as an afterthought. This is, uh, disappointing. Are you talking about the quotes, or just the fact that it is wrapped? I don't know where the quotes came from, but in Thunderbird if you are writing in text mode (not HTML) it defaults to wrapping the text at something like 78 characters. That includes copy/paste text. If you want it to *not* wrap, it turns out that "Paste as quotation" will not wrap, but then you have to remove the '>' from the beginning of every line. In HTML mode, it also defaults to wrapping, but if you switch to PREFORMAT text first, it doesn't wrap. At least, those are the tricks that I've found. Safest bet is to just use an attachment, though. John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Optimize complex join to use where condition before
Sebastian Hennebrueder wrote: I found a solution to improve my query. I do not know why, but the statistics for all columns had been 0. I changed this to 10 for index columns and to 20 for all foreign key columns, and to 100 for foreign key columns. I set the random page cost to 2 and now the query runs as expected. Many thanks for all of the posts in my and in other threads which helped a lot. Sebastian I think 0 = use default. But still, changing to 20 and 100 probably fixes your problems. John =:- signature.asc Description: OpenPGP digital signature
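For anyone finding this thread later, the per-column statistics target Sebastian is talking about is changed like this (the table and column names here are hypothetical, since his schema isn't shown). Also, if I'm reading the docs right, in pg_attribute a target of -1 (not 0) is the "use the default" marker, while 0 disables statistics collection for that column:

ALTER TABLE some_table ALTER COLUMN some_fk_column SET STATISTICS 100;
ANALYZE some_table;  -- the new target only takes effect at the next ANALYZE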
Re: [PERFORM] tuning Postgres for large data import (using Copy from)
Marc Mamin wrote: Hello, I'm not an expert, but I'll give some suggestions. I'd like to tune Postgres for large data import (using Copy from). I believe that COPY FROM file is supposed to be faster than COPY FROM STDIN, but file must be available to the backend process. If you can do it, you should think about it, as it eliminates the communication between the client and the backend. here are a few steps already done: 1) use 3 different disks for: -1: source data -2: index tablespaces -3: data tablespaces Make sure pg_xlog is on its own filesystem. It contains the write-ahead-log, and putting it by itself keeps the number of seeks down. If you are constrained, I think pg_xlog is more important than moving the index tablespaces. 2) define all foreign keys as initially deferred 3) tune some parameters: max_connections =20 shared_buffers =3 work_mem = 8192 maintenance_work_mem = 32768 checkpoint_segments = 12 (I also modified the kernel accordingly) Don't forget to increase your free space map if you are going to be doing deletes frequently. 4) run VACUUM regularly The server runs RedHat and has 1GB RAM In the production (which may run on a better server), I plan to: - import a few million rows per day, - keep up to ca. 100 million rows in the db - delete older data I've seen a few postings on hash/btree indexes, which say that hash indexes do not work very well in Postgres; currently, I only use btree indexes. Could I gain performance while using hash indexes as well? I doubt it. How does Postgres handle concurrent copy from on: same table / different tables? I think it is better with different tables. If using the same table, and there are indexes, it has to grab a lock for updating the index, which causes contention between 2 processes writing to the same table. I'd be glad of any further suggestions on how to further increase performance. Since you are deleting data often, and copying often, I might recommend using a partition scheme with a view to bind everything together. That way you can just drop the old table rather than doing a delete. I don't know how this would affect foreign key references. But basically you can create a new table, and do a copy without having any indexes, then build the indexes, analyze, update the view. And when deleting you can update the view, and drop the old table. Something like this: CREATE TABLE table_2005_05_11 AS (blah); COPY FROM ... ; CREATE INDEX blah ON table_2005_05_11(blah); CREATE OR REPLACE VIEW table AS SELECT * FROM table_2005_05_10 UNION ALL SELECT * FROM table_2005_05_11; VACUUM ANALYZE table_2005_05_11; ... John =:- Marc signature.asc Description: OpenPGP digital signature
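Spelling out the partition-plus-view idea a bit more, with entirely hypothetical table, column, and file names (the real schema isn't shown above):

-- Load a new day's data into its own table, with no indexes yet:
CREATE TABLE measurements_2005_05_12 (sensor_id integer, reading numeric, taken timestamp);
COPY measurements_2005_05_12 FROM '/tmp/measurements_2005_05_12.txt';

-- Build the index and statistics only after the bulk load:
CREATE INDEX measurements_2005_05_12_sensor_idx ON measurements_2005_05_12 (sensor_id);
VACUUM ANALYZE measurements_2005_05_12;

-- Swap the new day into the combined view and drop the oldest day:
CREATE OR REPLACE VIEW measurements AS
    SELECT * FROM measurements_2005_05_11
    UNION ALL
    SELECT * FROM measurements_2005_05_12;
DROP TABLE measurements_2005_05_10;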
Re: [PERFORM] Partitioning / Clustering
Alex Turner wrote: Ok - my common sense alarm is going off here... There are only 6.446 billion people worldwide. 100 Billion page views would require every person in the world to view 18 pages of yahoo every day. Not very likely. http://www.internetworldstats.com/stats.htm suggests that there are around 1 billion people actually on the internet. That means each and every person on the internet has to view 100 pages per day of yahoo. Pretty unlikely IMHO. I for one don't even use Yahoo ;) 100 million page views per day suggests that 1 in 100 people on the internet each viewed 10 pages of a site. That's a pretty high percentage if you ask me. In general I think your point is valid. Just remember that it probably also matters how you count page views. Because technically images are a separate page (and this thread did discuss serving up images). So if there are 20 graphics on a specific page, that is 20 server hits just for that one page. I could easily see an image-heavy site getting 100 hits / page. Which starts meaning that if 1M users hit 10 pages, then you get 1M*10*100 = 1G. I still think 100G views on a single website is a lot, but 100M is certainly possible. John =:- signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Recommendations for set statistics
Sebastian Hennebrueder wrote: Hello, I could not find any recommendations for the level of set statistics and what a specific level does actually mean. What is the difference between 1, 50 and 100? What is recommended for a table or column? Default I believe is 10. The higher the number, the more statistics are kept, with a maximum of 1000. The default is a little bit low for columns used in foreign keys, though frequently it is okay. When problems start, try setting them to 100 or 200. Higher is more accurate, but takes longer to compute, *and* takes longer when planning the optimal query method. It can be worth it, though. John =:- signature.asc Description: OpenPGP digital signature
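As a concrete illustration of what the statistics level buys you: after raising the target and re-running ANALYZE, pg_stats shows more most-common values and histogram buckets for the column, which is what the planner uses for its row estimates. The table and column names below are hypothetical:

ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 100;
ANALYZE orders;
SELECT attname, n_distinct, most_common_vals FROM pg_stats WHERE tablename = 'orders' AND attname = 'customer_id';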
Re: [PERFORM] Optimizer wrongly picks Nested Loop Left Join
Edin Kadribasic wrote: Hi, I have a query that is giving the optimizer (and me) a great headache. When it's in a good mood the optimizer chooses Hash Left Join and the query executes in 13ms or so, but sometimes (more and more often) it chooses Nested Loop Left Join and the execution time goes up to 2-30sec. The query: SELECT COUNT(DISTINCT a.tid) FROM axp_temp_order_match a LEFT OUTER JOIN ( SELECT ol.tid, ds.orid FROM axp_dayschedule ds JOIN axp_order_line ol ON ol.olid = ds.olid JOIN axp_order o ON ds.orid = o.orid WHERE o.status = 100 AND ds.day between '2005-05-12' and '2005-05-12' AND ds.used = '1' ) b ON (a.tid = b.tid) WHERE b.tid IS NULL AND a.sid = 16072; Unfortunately, because Hash Join doesn't report the number of rows (rows=0 always), it's hard to tell how good the estimator is. But I *can* say that the NestLoop estimation is way off. Good plan: Aggregate (cost=221.93..221.93 rows=1 width=4) (actual time=34.262..34.266 rows=1 loops=1) - Hash Left Join (cost=9.07..220.86 rows=426 width=4) (actual time=34.237..34.237 rows=0 loops=1) Hash Cond: (outer.tid = inner.tid) Filter: (inner.tid IS NULL) - Index Scan using axp_temp_order_match_idx1 on axp_temp_order_match a (cost=0.00..209.65 rows=426 width=4) (actual time=0.277..0.512 rows=6 loops=1) Index Cond: (sid = 16072) - Hash (cost=9.07..9.07 rows=1 width=4) (actual time=32.777..32.777 rows=0 loops=1) - Nested Loop (cost=0.00..9.07 rows=1 width=4) (actual time=0.208..31.563 rows=284 loops=1) - Nested Loop (cost=0.00..6.05 rows=1 width=4) (actual time=0.178..20.684 rows=552 loops=1) - Index Scan using axp_dayschedule_day_idx on axp_dayschedule ds (cost=0.00..3.02 rows=1 width=8) (actual time=0.036..3.973 rows=610 loops=1) Index Cond: ((day >= '2005-05-12'::date) AND (day <= '2005-05-12'::date)) Filter: (used = B'1'::bit) - Index Scan using axp_order_orid_key on axp_order o (cost=0.00..3.02 rows=1 width=4) (actual time=0.009..0.013 rows=1 loops=610) Index Cond: (outer.orid = o.orid) Filter: (status = 100) - Index Scan using axp_order_line_pk on axp_order_line ol (cost=0.00..3.01 rows=1 width=8) (actual time=0.006..0.008 rows=1 loops=552) Index Cond: (ol.olid = outer.olid) Total runtime: 34.581 ms Bad plan (same query, different values): Aggregate (cost=11.54..11.54 rows=1 width=4) (actual time=11969.281..11969.285 rows=1 loops=1) - Nested Loop Left Join (cost=0.00..11.53 rows=1 width=4) (actual time=25.730..11967.180 rows=338 loops=1) See here, it thinks it will only have to do 1 nestloop, which would be quite fast, but it has to do 338. Join Filter: (outer.tid = inner.tid) Filter: (inner.tid IS NULL) - Index Scan using axp_temp_order_match_idx1 on axp_temp_order_match a (cost=0.00..2.45 rows=1 width=4) (actual time=0.027..2.980 rows=471 loops=1) Index Cond: (sid = 16092) - Nested Loop (cost=0.00..9.07 rows=1 width=4) (actual time=0.088..24.350 rows=285 loops=471) Same thing here. - Nested Loop (cost=0.00..6.04 rows=1 width=8) (actual time=0.067..15.649 rows=317 loops=471) And here. - Index Scan using axp_dayschedule_day_idx on axp_dayschedule ds (cost=0.00..3.02 rows=1 width=8) (actual time=0.015..3.557 rows=606 loops=471) This estimate is way off too, but it is off in both plans. Index Cond: ((day >= '2005-05-13'::date) AND (day <= '2005-05-13'::date)) Filter: (used = B'1'::bit) - Index Scan using axp_order_line_pk on axp_order_line ol (cost=0.00..3.01 rows=1 width=8) (actual time=0.006..0.008 rows=1 loops=285426) This is probably what is killing you. It is doing a single lookup 285k times. The above plan only does it 552 times. 
Index Cond: (ol.olid = outer.olid) - Index Scan using axp_order_orid_key on axp_order o (cost=0.00..3.02 rows=1 width=4) (actual time=0.009..0.013 rows=1 loops=149307) Index Cond: (outer.orid = o.orid) Filter: (status = 100) Total runtime: 11969.443 ms Please note that sometimes when I get a bad plan in the logfile, I just re-run the query and the optimizer chooses the more efficient one. Sometimes it does not. Your work_mem is quite high relative to your total RAM; hopefully you don't have many allowed concurrent connections. But that is a side point. I assume the tables are freshly VACUUM ANALYZEd. Have you tried altering the statistics for the columns? One of them to look at is axp_dayschedule(day). That one seems to be consistently incorrect. Perhaps because a between with the same
Re: [PERFORM] full outer performance problem
Kim Bisgaard wrote: Hi, I'm having problems with the query optimizer and FULL OUTER JOIN on PostgreSQL 7.4. I cannot get it to use my indexes with full outer joins. I might be naive, but I think that it should be possible? I have two BIG tables (virtually identical) with 3 NOT NULL columns Station_id, TimeObs, Temp_, with indexes on (Station_id, TimeObs) and valid ANALYSE (set statistics=100). I want to join the two tables with a FULL OUTER JOIN. When I specify the query as: select temp_max_60min,temp_dry_at_2m from station s natural join temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs) where s.wmo_id=6065 and timeobs='2004-1-1 0:0:0' and '2004-1-1 0:0:0' between s.startdate and s.enddate; I get the correct results, BUT LOUSY performance, and the following explain: Nested Loop Left Join (cost=5.84..163484.08 rows=1349 width=12) (actual time=66146.815..119005.381 rows=1 loops=1) Filter: (COALESCE(outer.timeobs, inner.timeobs) = '2004-01-01 00:00:00'::timestamp without time zone) - Hash Join (cost=5.84..155420.24 rows=1349 width=16) (actual time=8644.449..110836.038 rows=109826 loops=1) Well, the estimate here is quite a bit off. It thinks you will be getting 1349 rows (which is probably why it picked a nested loop plan), but then it is getting 109826 rows. I'm guessing it is misunderstanding the selectivity of the timeobs column. Hash Cond: (outer.station_id = inner.station_id) - Seq Scan on temp_dry_at_2m a (cost=0.00..120615.94 rows=6956994 width=16) (actual time=0.024..104548.515 rows=6956994 loops=1) - Hash (cost=5.84..5.84 rows=1 width=4) (actual time=0.114..0.114 rows=0 loops=1) - Index Scan using wmo_idx on station (cost=0.00..5.84 rows=1 width=4) (actual time=0.105..0.108 rows=1 loops=1) Index Cond: ((wmo_id = 6065) AND ('2004-01-01 00:00:00'::timestamp without time zone >= startdate) AND ('2004-01-01 00:00:00'::timestamp without time zone <= enddate)) - Index Scan using temp_max_60min_idx on temp_max_60min b (cost=0.00..5.96 rows=1 width=20) (actual time=0.071..0.071 rows=0 loops=109826) Index Cond: ((outer.station_id = b.station_id) AND (outer.timeobs = b.timeobs)) Total runtime: 119005.499 ms (11 rows) I think the bigger problem is that a full outer join says grab all rows, even if they are null. What about this query: SELECT temp_max_60min, temp_dry_at_2m FROM (station s LEFT JOIN temp_dry_at_2m a USING (station_id, timeobs) LEFT JOIN temp_max_60min b USING (station_id, timeobs)) WHERE s.wmo_id=6065 AND timeobs='2004-1-1 0:0:0' AND '2004-1-1 0:0:0' BETWEEN s.startdate AND s.enddate; After that, you should probably have a multi-column index on (station_id, timeobs), which lets postgres use just that index for the lookup, rather than using an index and then a filter. (Looking at your next query you might already have that index). If I change the query to (and thus negate the full outer join): This is the same query, I think you messed up your copy and paste. 
select temp_max_60min,temp_dry_at_2m from station s natural join temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs) where s.wmo_id=6065 and a.timeobs='2004-1-1 0:0:0' and b.timeobs='2004-1-1 0:0:0' and '2004-1-1 0:0:0' between s.startdate and s.enddate; I get wrong results (in the case where one of the records is missing in one of the tables), BUT GOOD performance, and this query plan: Nested Loop (cost=0.00..17.83 rows=1 width=12) (actual time=79.221..79.236 rows=1 loops=1) - Nested Loop (cost=0.00..11.82 rows=1 width=24) (actual time=65.517..65.526 rows=1 loops=1) - Index Scan using wmo_idx on station (cost=0.00..5.83 rows=1 width=4) (actual time=0.022..0.026 rows=1 loops=1) Index Cond: ((wmo_id = 6065) AND ('2004-01-01 00:00:00'::timestamp without time zone >= startdate) AND ('2004-01-01 00:00:00'::timestamp without time zone <= enddate)) - Index Scan using temp_max_60min_idx on temp_max_60min b (cost=0.00..5.97 rows=1 width=20) (actual time=65.483..65.486 rows=1 loops=1) Index Cond: ((outer.station_id = b.station_id) AND (b.timeobs = '2004-01-01 00:00:00'::timestamp without time zone)) - Index Scan using temp_dry_at_2m_idx on temp_dry_at_2m a (cost=0.00..6.00 rows=1 width=16) (actual time=13.694..13.698 rows=1 loops=1) Index Cond: ((outer.station_id = a.station_id) AND (a.timeobs = '2004-01-01 00:00:00'::timestamp without time zone)) Total runtime: 79.340 ms (9 rows) If further info like EXPLAIN VERBOSE is useful please say so and I will provide it. Thanks in advance! Kim Bisgaard. I still feel like you will have a problem with an outer join in this circumstance, because it will have to scan all of both tables. I think what you are wanting is "give me everything where station_id = X, and there is a row in either a or b." I think my LEFT JOIN example does that, but I also think there would be a