Re: [PERFORM] Optimization idea
2010/4/26 Vlad Arkhipov arhi...@dc.baikal.ru:
On Thu, Apr 22, 2010 at 10:37 PM, Vlad Arkhipov arhi...@dc.baikal.ru wrote:
I don't think this is just an issue with statistics, because the same problem arises when I try executing a query like this:

I'm not sure how you think this proves that it isn't a problem with statistics, but I think what you should be focusing on here, looking back to your original email, is that the plans that are actually much faster have almost as much estimated cost as the slower one. Since all your data is probably fully cached, at a first cut, I might try setting random_page_cost and seq_page_cost to 0.005 or so, and adjusting effective_cache_size to something appropriate.

...Robert

Ok. I thought it was quite obvious because of these two queries. I can't understand why the estimated row count is 40040 in the first plan.

In the first query, the planner doesn't use the information of the 2,3,4. It just does: "I'll bet I'll have 2 rows in t1" (I think it should say 3, but it doesn't). So it divides the estimated number of rows in the t2 table by 5 (distinct values) and multiplies by 2 (rows): 40040.

In the second query the planner uses a different behavior: it expands the value of t1.t to t2.t for each join relation and finds a cheaper plan (than the one using a seqscan on t2).

We are here in a corner-case situation involving the n_distinct statistics on the column, one where we might be able to improve the planner's decision. I believe I have already read something on this topic on -hackers...
test=# explain analyze select * from t2 join t1 on t1.t = t2.t where t1.t in (2,3,4);
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1.09..2319.87 rows=40040 width=32) (actual time=0.050..356.269 rows=400 loops=1)
   Hash Cond: (t2.t = t1.t)
   ->  Seq Scan on t2  (cost=0.00..1543.00 rows=100100 width=16) (actual time=0.013..176.087 rows=100100 loops=1)
   ->  Hash  (cost=1.07..1.07 rows=2 width=16) (actual time=0.023..0.023 rows=3 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1kB
         ->  Seq Scan on t1  (cost=0.00..1.07 rows=2 width=16) (actual time=0.006..0.014 rows=3 loops=1)
               Filter: (t = ANY ('{2,3,4}'::bigint[]))
 Total runtime: 356.971 ms
(8 rows)

test=# explain analyze select * from t2 join t1 on t1.t = t2.t where t1.t = 2
       union all select * from t2 join t1 on t1.t = t2.t where t1.t = 3
       union all select * from t2 join t1 on t1.t = t2.t where t1.t = 4;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..112.42 rows=407 width=32) (actual time=0.048..3.487 rows=400 loops=1)
   ->  Nested Loop  (cost=0.00..47.51 rows=197 width=32) (actual time=0.045..1.061 rows=200 loops=1)
         ->  Seq Scan on t1  (cost=0.00..1.06 rows=1 width=16) (actual time=0.011..0.014 rows=1 loops=1)
               Filter: (t = 2)
         ->  Index Scan using t_idx on t2  (cost=0.00..44.48 rows=197 width=16) (actual time=0.026..0.382 rows=200 loops=1)
               Index Cond: (pg_temp_2.t2.t = 2)
   ->  Nested Loop  (cost=0.00..32.67 rows=117 width=32) (actual time=0.019..0.599 rows=100 loops=1)
         ->  Seq Scan on t1  (cost=0.00..1.06 rows=1 width=16) (actual time=0.003..0.006 rows=1 loops=1)
               Filter: (t = 3)
         ->  Index Scan using t_idx on t2  (cost=0.00..30.43 rows=117 width=16) (actual time=0.010..0.211 rows=100 loops=1)
               Index Cond: (pg_temp_2.t2.t = 3)
   ->  Nested Loop  (cost=0.00..28.17 rows=93 width=32) (actual time=0.017..0.534 rows=100 loops=1)
         ->  Seq Scan on t1  (cost=0.00..1.06 rows=1 width=16) (actual time=0.005..0.008 rows=1 loops=1)
               Filter: (t = 4)
         ->  Index Scan using t_idx on t2  (cost=0.00..26.18 rows=93 width=16) (actual time=0.007..0.187 rows=100 loops=1)
               Index Cond: (pg_temp_2.t2.t = 4)
 Total runtime: 4.190 ms
(17 rows)

-- 
Cédric Villemain

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
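The 40040 estimate Cédric walks through can be checked against the numbers in the first plan above. This is a sketch of the arithmetic only, not of the actual selectivity code:

```python
# Arithmetic behind the rows=40040 estimate in the Hash Join node above.
# t2 is estimated at 100100 rows spread over 5 distinct values of t, and
# the planner believes the IN (2,3,4) filter keeps 2 rows of t1 (actually 3).
t2_rows = 100100
t2_distinct = 5
t1_rows_estimate = 2

rows_per_value = t2_rows / t2_distinct            # expected t2 matches per t1 row
join_estimate = int(rows_per_value * t1_rows_estimate)
print(join_estimate)                              # 40040, the Hash Join row estimate
```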
Re: [PERFORM] autovacuum strategy / parameters
On Apr 22, 2:55 pm, robertmh...@gmail.com (Robert Haas) wrote:
On Wed, Apr 21, 2010 at 11:06 AM, Rick richard.bran...@ca.com wrote:
I have a DB with small and large tables that can go up to 15G. For performance benefits, it appears that analyze has much less cost than vacuum, but the same benefits?

Err, no. ANALYZE gathers statistics for the query planner; VACUUM clears out old, dead tuples so that space can be reused by the database system.

I can't find any clear recommendations for frequencies and am considering these parameters:

autovacuum_vacuum_threshold = 5
autovacuum_analyze_threshold = 1
autovacuum_vacuum_scale_factor = 0.01
autovacuum_analyze_scale_factor = 0.005

It appears this will result in table analyzes occurring around 10,000 to 85,000 dead tuples and vacuums occurring around 50,000 to 200,000, depending on the table sizes. Can anyone comment on whether this is the right strategy and targets to use?

I'm not that familiar with tuning these parameters, but increasing the default thresholds a thousand-fold doesn't seem like a good idea. Small tables will never get vacuumed or analyzed at all.

...Robert

The problem is with the autovacuum formula: in a loop, autovacuum checks to see if the number of dead tuples exceeds ((number of live tuples * autovacuum_vacuum_scale_factor) + autovacuum_vacuum_threshold), and if so, it runs VACUUM. If not, it sleeps. It works the same way for ANALYZE.

So, in a large table, the scale_factor is the dominant term. In a small table, the threshold is the dominant term. But both are taken into account. The default values are set for small tables; it is not being run for large tables.

The question boils down to this: exactly what is the max number of dead tuples that should be allowed to accumulate before running ANALYZE?
Since vacuum just recovers space, that doesn't seem to be nearly as critical for performance?

-Rick
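The trigger condition Rick describes can be written out explicitly. A minimal sketch, assuming the stock defaults of 50 for the vacuum threshold and 0.2 for the scale factor:

```python
def autovacuum_triggers(n_live: int, n_dead: int,
                        threshold: int = 50,
                        scale_factor: float = 0.2) -> bool:
    """VACUUM fires once dead tuples exceed threshold + scale_factor * live tuples."""
    return n_dead > threshold + scale_factor * n_live

# Small table: the flat threshold dominates (trigger point is 50 + 200 = 250).
print(autovacuum_triggers(n_live=1_000, n_dead=100))             # False
# Large table: the scale factor dominates (trigger point is ~2,000,050).
print(autovacuum_triggers(n_live=10_000_000, n_dead=100_000))    # False
print(autovacuum_triggers(n_live=10_000_000, n_dead=3_000_000))  # True
```

Lowering scale_factor (as Álvaro suggests below) moves the large-table trigger point down without affecting small tables much, since their trigger is dominated by the flat threshold.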
[PERFORM] Planner issue on sorting joining of two tables with limit
Hello, everybody!

I'm using PostgreSQL 8.4.3, compiled by Visual C++ build 1400, 32-bit, on Windows XP SP3. I use the following data model to reproduce the issue.

CREATE TABLE test1
(
  id integer NOT NULL,
  value double precision,
  CONSTRAINT test1_pkey PRIMARY KEY (id)
);
CREATE INDEX test1_value_idx ON test1(value);

CREATE TABLE test2
(
  id integer NOT NULL,
  id1 integer NOT NULL REFERENCES test1 (id),
  value double precision,
  CONSTRAINT test2_pkey PRIMARY KEY (id)
);
CREATE INDEX test2_id1_value_idx ON test2(id1, value);

The following statements generate 200 rows of test data for the test1 table and 100 rows for the test2 table.

INSERT INTO test1 (id, value) (SELECT x, random() FROM generate_series(1,200) x);
INSERT INTO test2 (id, id1, value) (SELECT x, (random()*200)::int + 1, random() FROM generate_series(1,100) x);

The following statement returns the top 10 rows from the join of the two tables, ordered by test1.value and test2.value.

SELECT t1.value AS value1, t2.value AS value2
FROM test1 t1 JOIN test2 t2 ON t2.id1 = t1.id
ORDER BY t1.value, t2.value
LIMIT 10

       value1        |        value2
---------------------+----------------------
 0.00562104489654303 | 0.00039379671216011
 0.00562104489654303 | 0.000658359378576279
 0.00562104489654303 | 0.000668979249894619
 0.00562104489654303 | 0.000768951140344143
 0.00562104489654303 | 0.00121330376714468
 0.00562104489654303 | 0.00122168939560652
 0.00562104489654303 | 0.00124016962945461
 0.00562104489654303 | 0.00134057039394975
 0.00562104489654303 | 0.00169069319963455
 0.00562104489654303 | 0.00171623658388853
(10 rows)

The statement's plan doesn't use the indexes, so the statement is slow.
 Limit  (cost=50614.88..50614.91 rows=10 width=16) (actual time=8388.331..8388.382 rows=10 loops=1)
   ->  Sort  (cost=50614.88..53102.45 rows=995025 width=16) (actual time=8388.324..8388.340 rows=10 loops=1)
         Sort Key: t1.value, t2.value
         Sort Method:  top-N heapsort  Memory: 17kB
         ->  Hash Join  (cost=6.50..29112.75 rows=995025 width=16) (actual time=0.982..6290.516 rows=997461 loops=1)
               Hash Cond: (t2.id1 = t1.id)
               ->  Seq Scan on test2 t2  (cost=0.00..15406.00 rows=100 width=12) (actual time=0.088..2047.910 rows=100 loops=1)
               ->  Hash  (cost=4.00..4.00 rows=200 width=12) (actual time=0.870..0.870 rows=200 loops=1)
                     ->  Seq Scan on test1 t1  (cost=0.00..4.00 rows=200 width=12) (actual time=0.010..0.428 rows=200 loops=1)
 Total runtime: 8388.473 ms

I can remove the ordering by test2.value:

SELECT t1.value AS value1, t2.value AS value2
FROM test1 t1 JOIN test2 t2 ON t2.id1 = t1.id
ORDER BY t1.value
LIMIT 10

Then the result is the same.

       value1        |        value2
---------------------+----------------------
 0.00562104489654303 | 0.00039379671216011
 0.00562104489654303 | 0.000658359378576279
 0.00562104489654303 | 0.000668979249894619
 0.00562104489654303 | 0.000768951140344143
 0.00562104489654303 | 0.00121330376714468
 0.00562104489654303 | 0.00122168939560652
 0.00562104489654303 | 0.00124016962945461
 0.00562104489654303 | 0.00134057039394975
 0.00562104489654303 | 0.00169069319963455
 0.00562104489654303 | 0.00171623658388853
(10 rows)

The statement's plan uses the indexes and the statement runs fast. This plan is exactly what I need.
 Limit  (cost=0.00..0.62 rows=10 width=16) (actual time=0.049..0.148 rows=10 loops=1)
   ->  Nested Loop  (cost=0.00..62109.86 rows=995025 width=16) (actual time=0.044..0.107 rows=10 loops=1)
         ->  Index Scan using test1_value_idx on test1 t1  (cost=0.00..19.19 rows=200 width=12) (actual time=0.017..0.017 rows=1 loops=1)
         ->  Index Scan using test2_id1_value_idx on test2 t2  (cost=0.00..248.27 rows=4975 width=12) (actual time=0.013..0.042 rows=10 loops=1)
               Index Cond: (t2.id1 = t1.id)
 Total runtime: 0.224 ms

So the PostgreSQL planner can produce the plan I need, but it doesn't produce it when I specify a particular second ordering column. Is there any way to make the planner produce the desired plan when a second ordering column is specified?

With best regards,
Korotkov Alexander.
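The gap between the two plans can be mimicked in a few lines. This is a sketch over made-up in-memory tables, not PostgreSQL internals: ordering by t1.value alone lets the nested loop stop after ten joined rows, while adding t2.value forces the whole join to be materialized and sorted.

```python
import heapq
import itertools
import random

# Made-up stand-ins for test1/test2; the real tables live in PostgreSQL.
random.seed(1)
test1 = [(i, random.random()) for i in range(1, 201)]         # (id, value)
test2 = [(i, random.randint(1, 200), random.random())         # (id, id1, value)
         for i in range(1, 10_001)]

# "Indexes": test1 ordered by value; test2 values grouped by id1, ascending.
test1_by_value = sorted(test1, key=lambda r: r[1])
test2_by_id1 = {}
for _, id1, v in sorted(test2, key=lambda r: r[2]):
    test2_by_id1.setdefault(id1, []).append(v)

def stream_join():
    # Nested loop in t1.value order, as in the fast plan above.
    for t1_id, v1 in test1_by_value:
        for v2 in test2_by_id1.get(t1_id, []):
            yield (v1, v2)

# ORDER BY t1.value LIMIT 10: stop after ten joined rows.
fast = list(itertools.islice(stream_join(), 10))

# ORDER BY t1.value, t2.value LIMIT 10: consume the whole join,
# then top-N heapsort, as in the slow plan.
slow = heapq.nsmallest(10, stream_join())

# Because every t1.value here is distinct and each id1 group is already
# sorted by value, the streamed prefix happens to match -- the "lucky
# coincidence" Tom describes below; duplicate t1.value values would break it.
print(fast == slow)   # True
```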
Re: [PERFORM] Planner issue on sorting joining of two tables with limit
Korotkov Alexander aekorot...@gmail.com writes:
So PostgreSQL planner can produce the plan I need but it doesn't produce this plan when I specify particular second ordering column.

Well, no, because that plan wouldn't produce the specified ordering; or at least it would be a lucky coincidence if it did. It's only sorting on t1.value.

So is there any way to make planner produce desired plan when particular second ordering column is specified?

Not when the ordering columns come from two different tables. (If they were in the same table then scanning a two-column index could produce the demanded sort order.) I don't see any way to satisfy this query without an explicit sort step, which means it has to look at the whole join output.

If you're willing to make assumptions like "the required 10 rows will be within the first 100 t1.value rows", then you could nest an ORDER BY t1.value LIMIT 100 query inside something that did an ORDER BY with both columns. But this fails if you have too many duplicate t1.value values, and your test case suggests that you might have a lot of them. In any case it would stop being fast if you make the inner LIMIT very large.

regards, tom lane
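Tom's workaround, and the way duplicate t1.value values defeat it, can be sketched outside SQL. The rows here are hypothetical, and the inner limit of 100 follows his example:

```python
# Joined (value1, value2) rows; 150 of them share the smallest value1,
# deliberately more than the inner limit of 100, and stored with the
# largest value2 first so the "wrong" ties survive the inner query.
rows = [(0.5, (149 - i) / 1000.0) for i in range(150)]
rows += [(0.7, i / 1000.0) for i in range(50)]

def two_stage(rows, inner_limit=100, outer_limit=10):
    # Inner query: ORDER BY value1 LIMIT inner_limit.  SQL makes no promise
    # about which ties survive; a stable sort models one possible outcome.
    candidates = sorted(rows, key=lambda r: r[0])[:inner_limit]
    # Outer query: ORDER BY value1, value2 LIMIT outer_limit.
    return sorted(candidates)[:outer_limit]

correct = sorted(rows)[:10]
print(two_stage(rows) == correct)                   # False: needed rows were cut off
print(two_stage(rows, inner_limit=150) == correct)  # True: a big enough inner LIMIT
                                                    # works, but stops being fast
```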
Re: [PERFORM] autovacuum strategy / parameters
Rick wrote:
So, in a large table, the scale_factor is the dominant term. In a small table, the threshold is the dominant term. But both are taken into account.

Correct.

The default values are set for small tables; it is not being run for large tables.

So decrease the scale factor and leave the threshold alone.

-- 
Alvaro Herrera   http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
[PERFORM] tmpfs and postgres memory
I have a 16G box and tmpfs is configured to use 8G. Is a lot of memory being wasted that could be used for Postgres? (I am not seeing any performance issues, but I am not clear on how Linux uses tmpfs and how Postgres would be affected by the reduction in memory.)

Sriram
Re: [PERFORM] tmpfs and postgres memory
On Mon, Apr 26, 2010 at 5:24 PM, Anj Adu fotogra...@gmail.com wrote:
I have a 16G box and tmpfs is configured to use 8G. Is a lot of memory being wasted that can be used for Postgres? (I am not seeing any performance issues, but I am not clear how Linux uses the tmpfs and how Postgres would be affected by the reduction in memory)

As on Solaris, tmpfs is allocated from swap, and swap is both memory and disk, so there is no guarantee that when you're using it you will get the fast memory-based file system you're looking for. What you may be wanting is ramfs. Unlike tmpfs, it is 100% memory. Another difference: though you may mount a ramfs file system specifying a size, no real size is enforced. If you have 2GB of memory and attempt to copy 2GB of files to a ramfs mount point, the system will do it until all the space, i.e. memory, is gone.

Both tmpfs and ramfs come with a price: at the flick of a switch, a loss of power, or any other cause that resets or reboots the system, all data is lost. That doesn't necessarily mean you can't use a memory-based file system; it just limits its applications. Personally, I'd find a way to tell PostgreSQL about the memory before toying with tmpfs or ramfs, but I'm sure our applications are different.

-Greg
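For reference, the two mount flavors Greg contrasts look like this. This is a sketch only: the mount points are made up, and both commands need root:

```shell
# tmpfs: size-capped, but pages may be written out to swap under memory pressure
mount -t tmpfs -o size=2g tmpfs /mnt/pg_scratch

# ramfs: 100% RAM and never swapped; a size option is not enforced,
# so it can grow until memory is exhausted
mount -t ramfs ramfs /mnt/pg_ramscratch
```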