Re: [HACKERS] Odd out of memory problem.
On Thu, Mar 29, 2012 at 7:38 PM, Peter Eisentraut pete...@gmx.net wrote:
> On tis, 2012-03-27 at 00:53 +0100, Greg Stark wrote:
>> Hm. So my original plan was dependent on adding the state-merge
>> function we've talked about in the past. Not all aggregate functions
>> necessarily can support such a function, but I think all or nearly all
>> the built-in aggregates can. Certainly min, max, count, sum, avg,
>> stddev, and array_agg can, which are most of what people do. That
>> would be a function which can take two state variables and produce a
>> new state variable.
>
> This information could also be useful to have in PL/Proxy (or similar
> FDWs) to be able to integrate aggregate computation into the language.
> Currently, you always have to do the state merging yourself.

I don't know exactly how PL/Proxy or pgpool accomplish the multi-phase
aggregate, but the proposal above is in theory a state-merge function, so
it doesn't apply to general aggregate results that have already passed
through the final function. Of course, for aggregates that don't have
final functions, it is fine to call the state-merge function on the
results.

Thanks,
--
Hitoshi Harada

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
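As an illustration of the state-merge idea (a hypothetical Python sketch,
not PostgreSQL code): for avg(), the transition state can be modeled as a
(count, sum) pair, and merging two partial states is element-wise
addition, so partitions aggregated separately can be combined without
re-scanning the input.

```python
# Hypothetical illustration of a state-merge function for avg().
# The transition state is a (count, sum) pair; merging two partial
# states is element-wise addition, so partial aggregates computed on
# separate partitions can be combined without re-scanning the input.

def avg_transition(state, value):
    count, total = state
    return (count + 1, total + value)

def avg_merge(state_a, state_b):
    # Takes two state variables, produces a new state variable.
    return (state_a[0] + state_b[0], state_a[1] + state_b[1])

def avg_final(state):
    count, total = state
    return total / count if count else None

# Aggregate two partitions independently, then merge.
part1 = (0, 0)
for v in [1, 2, 3]:
    part1 = avg_transition(part1, v)
part2 = (0, 0)
for v in [4, 5]:
    part2 = avg_transition(part2, v)

merged = avg_merge(part1, part2)  # (5, 15)
print(avg_final(merged))          # 3.0
```

A real system would need a merge function per aggregate's state
representation, which is the part not every aggregate can support.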
Re: [HACKERS] Odd out of memory problem.
On fre, 2012-03-30 at 22:59 -0700, Hitoshi Harada wrote:
> I don't know exactly how PL/Proxy or pgpool accomplish the multi-phase
> aggregate,

They don't.

> but the proposal above is in theory a state-merge function, so it
> doesn't apply to general aggregate results that have already passed
> through the final function. Of course, for aggregates that don't have
> final functions, it is fine to call the state-merge function on the
> results.

You're right, it's not quite the same thing. But perhaps it could be
kept in mind if someone wants to develop things in this area.
Re: [HACKERS] Odd out of memory problem.
On tis, 2012-03-27 at 00:53 +0100, Greg Stark wrote:
> Hm. So my original plan was dependent on adding the state-merge
> function we've talked about in the past. Not all aggregate functions
> necessarily can support such a function, but I think all or nearly all
> the built-in aggregates can. Certainly min, max, count, sum, avg,
> stddev, and array_agg can, which are most of what people do. That
> would be a function which can take two state variables and produce a
> new state variable.

This information could also be useful to have in PL/Proxy (or similar
FDWs) to be able to integrate aggregate computation into the language.
Currently, you always have to do the state merging yourself.
Re: [HACKERS] Odd out of memory problem.
On Mon, Mar 26, 2012 at 5:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Greg Stark st...@mit.edu writes:
>> On Mon, Mar 26, 2012 at 6:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
>>> Could you give us a brain dump on the sketch? I've never seen how to
>>> do it without unreasonable overhead.
>>
>> Hm. So my original plan was dependent on adding the state-merge
>> function we've talked about in the past. Not all aggregate functions
>> necessarily can support such a function, but I think all or nearly all
>> the built-in aggregates can. Certainly min, max, count, sum, avg,
>> stddev, and array_agg can, which are most of what people do. That
>> would be a function which can take two state variables and produce a
>> new state variable.
>
> I'd rather not invent new requirements for aggregate implementations if
> we can avoid it. However, now that I've started thinking about it
> further, I think you could solve it with less complexity by cheating in
> various ways. For example, if you limit the hash size to 1/2 of
> work_mem, then when you reach that limit you could just stuff any tuple
> that doesn't match a hash entry into a tuplesort with 1/2 of work_mem
> and do the regular level-break logic on the output of that.
>
> Or just start dumping such tuples into a tuplestore, while continuing
> to process tuples that match the hashagg entries that are already in
> existence. Once the input is exhausted, read out the hashagg entries we
> have, flush the hashagg table, and start reading from the tuplestore.
> Repeat as needed.
>
> I like this idea because the only thing you give up is predictability
> of the order of output of aggregated entries, which is something that a
> hashagg isn't guaranteeing anyway. In particular, we would still have a
> guarantee that any one aggregate evaluation processes the matching
> tuples in arrival order, which is critical for some aggregates.
>
> The main problem I can see is that if we start to flush after work_mem
> is X% full, we're essentially hoping that the state values for the
> existing aggregates won't grow by more than 1-X%, which is safe for
> many common aggregates but fails for some like array_agg(). Ultimately,
> for ones like that, it'd probably be best to never consider hashing at
> all. I guess we could invent an "unsafe for hash aggregation" flag for
> aggregates that have unbounded state-size requirements.

According to what I've learned in the last couple of months, array_agg
is not ready for fallbacks like dumping to a tuplestore. Even state
merging is not possible for it. The problem is that the executor doesn't
know how to serialize/deserialize the internal-type transition value. So
in one implementation, the existence of a merge function is a flag for
whether to switch back to sort grouping rather than hash aggregation,
and array_agg is one such aggregate function. That said, if you invent a
new flag to note that an aggregate is not dump-ready, it'd be worth
adding a state-merge function to the aggregate infrastructure anyway.

So a way I can imagine that needs neither a state-merge function nor
dumping to a tuplestore would be to sort the hash table contents and the
rest of the input so that we can switch to sort grouping. Since we have
a hash table, we can certainly sort its contents in memory, and we can
put to disk everything that comes later than the fallback and read it
after the scan finishes. Now that we have sorted state values and sorted
input, we can continue the rest of the work.

Anyway, the memory usage problem is not limited to array_agg and hash
aggregation. Even if you can tell that the hash table exceeds X% of
work_mem, how can you tell that other operators, such as Sort, are not
using the rest of the memory? One approach I could see to avoid this is
assigning an arbitrary amount of memory out of work_mem to each operator
and accounting for it locally, but this approach is going to skew
occasionally and is not perfect, either.

Thanks,
--
Hitoshi Harada
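The fallback Tom sketches above (hash-aggregate until a size limit; keep
aggregating tuples whose group is already in the table; spill tuples
with new keys; then repeat over the spilled input) can be illustrated
with a small, hypothetical Python model. The dict stands in for the
hashagg table and the list for the tuplestore:

```python
# Hypothetical sketch (not PostgreSQL code) of the spill-and-repeat
# fallback: hash-aggregate until the table reaches a size limit; after
# that, tuples whose key is already in the table are still aggregated,
# while tuples with new keys are spilled to a "tuplestore" (here just a
# list). When the input is exhausted, emit the current groups, empty the
# table, and repeat over the spilled tuples.

def hash_agg_with_spill(tuples, max_groups, transition, init):
    results = {}
    pending = list(tuples)
    while pending:
        table = {}
        spill = []
        for key, value in pending:
            if key in table:
                table[key] = transition(table[key], value)
            elif len(table) < max_groups:
                table[key] = transition(init, value)
            else:
                spill.append((key, value))  # new key, no room: spill it
        results.update(table)               # emit this batch of groups
        pending = spill                     # repeat over spilled tuples
    return results

rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("d", 6)]
out = hash_agg_with_spill(rows, max_groups=2,
                          transition=lambda s, v: s + v, init=0)
print(out)  # {'a': 4, 'b': 7, 'c': 4, 'd': 6}
```

Note how each group's input values are still seen in arrival order, the
property Tom points out is critical for some aggregates; only the order
in which finished groups are emitted changes.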
Re: [HACKERS] Odd out of memory problem.
On Tue, Mar 27, 2012 at 3:22 AM, Hitoshi Harada umi.tan...@gmail.com wrote:
> According to what I've learned in the last couple of months, array_agg
> is not ready for fallbacks like dumping to a tuplestore. Even state
> merging is not possible for it. The problem is that the executor
> doesn't know how to serialize/deserialize the internal-type transition
> value. So in one implementation, the existence of a merge function is a
> flag for whether to switch back to sort grouping rather than hash
> aggregation, and array_agg is one such aggregate function. That said,
> if you invent a new flag to note that an aggregate is not dump-ready,
> it'd be worth adding a state-merge function to the aggregate
> infrastructure anyway.
>
> So a way I can imagine that needs neither a state-merge function nor
> dumping to a tuplestore would be to sort the hash table contents and
> the rest of the input so that we can switch to sort grouping. Since we
> have a hash table, we can certainly sort its contents in memory, and we
> can put to disk everything that comes later than the fallback and read
> it after the scan finishes. Now that we have sorted state values and
> sorted input, we can continue the rest of the work.

It's a little bit tricky to make this work - you have to get all of the
values out of the hash table you've built and stick them into a
Tuplesort object - but I think it can be made to work, and it seems more
elegant than anything else proposed so far.

I also agree with you and with Greg Stark that it would be good to
invent a state-merge function. Although it wouldn't apply to every case,
it would make some very common cases a lot more efficient, both in run
time and in memory.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
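The sort-based fallback discussed above can also be modeled with a
small, hypothetical Python sketch: sort the (key, state) entries from
the abandoned hash table, sort the remaining raw input by key, then
merge the two sorted streams with ordinary group-break processing.

```python
# Hypothetical sketch (not PostgreSQL code) of the sort-based fallback:
# when the hash table fills up, sort its (key, state) entries, sort the
# remaining input by key, and finish with sorted-group ("GroupAggregate")
# processing that merges pre-aggregated states with leftover raw tuples.
from heapq import merge
from itertools import groupby

def fallback_to_sort(hash_table, remaining, transition, init=0):
    # Pre-aggregated states from the abandoned hash aggregate, by key.
    states = sorted(hash_table.items())
    # In PostgreSQL the remaining input would be spilled to disk and
    # sorted (a Tuplesort); here we just sort it in memory.
    rest = sorted(remaining)
    # Merge the two sorted streams; the 0/1 tag makes a group's
    # pre-aggregated state sort ahead of its raw values.
    stream = merge(((k, 0, s) for k, s in states),
                   ((k, 1, v) for k, v in rest))
    out = []
    for key, group in groupby(stream, key=lambda t: t[0]):
        state = init
        for _, tag, v in group:
            state = v if tag == 0 else transition(state, v)
        out.append((key, state))
    return out

# Groups a and c were partially aggregated (sums) before the fallback;
# raw tuples for b and a arrived afterwards.
print(fallback_to_sort({"a": 4, "c": 1}, [("b", 2), ("a", 3)],
                       lambda s, v: s + v))
# [('a', 7), ('b', 2), ('c', 1)]
```

Note the caveat from the thread: this only works for aggregates whose
transition state can be written out and re-read, which array_agg's
internal-type state cannot.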
Re: [HACKERS] Odd out of memory problem.
On 03/26/2012 01:54 PM, Andrew Dunstan wrote:
> On 03/26/2012 01:34 PM, Tom Lane wrote:
>> Andrew Dunstan and...@dunslane.net writes:
>>> On 03/26/2012 01:06 PM, Heikki Linnakangas wrote:
>>>> Is it possible this job is inserting and then updating (or deleting)
>>>> the row it just inserted and doing a large number of such
>>>> insert/update operations all within the same transaction? Or perhaps
>>>> it's updating the same row over and over again?
>>> It's all in a single transaction. In fact the solution I'm currently
>>> testing, which seems to be working, involves breaking it up into
>>> batches of a few thousand LOs restored per batch.
>> Hm. The test case is just a straight pg_restore of lots and lots of
>> LOs? What pg_dump version was the dump made with?
> 8.4.8, same as the target. We get the same issue whether we restore
> direct to the database from pg_restore or via a text dump.

Following this up, the workaround of making small batches of LOs did
solve that memory issue. Here's what I did:

   pg_restore --list full_export.dmp | grep BLOB > bloblist
   pg_restore --use-list=bloblist full_export.dmp | \
     perl -n -e '$n++ if /lo_open/;' \
          -e 'print "end; begin;\n" if (/lo_open/ && ($n % 1000 == 0));' \
          -e 'print;' | \
     psql -t -q -v ON_ERROR_STOP dbname=adtest > /dev/null

That's a fairly ugly hack to have to use.

cheers

andrew
Re: [HACKERS] Odd out of memory problem.
hello,

does the problem show up on 2% of all problems after 2 weeks or so? we
had a similar problem on UNIX as well. it even materialized on 100
identical boxes (on 2% of them). it pops up randomly and never stops …

i checked some code paths. some of those messages are direct output via
stderr (not even elog). unfortunately i did not manage to find a box
where i could use GDB to attack the problem. it was 8.4.8 as well. do
you see a certain workload which would make the problem reproducible?

regards,

hans

On Mar 26, 2012, at 5:03 PM, Andrew Dunstan wrote:
> I'm not sure if this is a bug, but I have been wrestling with this
> problem for a client. Platform is Windows Server 2003 64 bit,
> PostgreSQL 8.4.8, 4Gb RAM, running on an Amazon VM. Shared buffers:
> 512Mb, work_mem: 25Mb. There are only a handful of connections to the
> database, and no other activity. We are seeing the error shown below.
> The table in question has two columns (Oid, int) and roughly 43m rows.
> The only other thing remarkable about the settings is that
> effective_cache_size is set to 5Gb, which is clearly too high, but
> surely that shouldn't cause a memory error. I'm really perplexed as to
> why this fairly simple query should cause an out of memory error:
>
>    select loid, max(pageno) from ldata group by loid order by 2 desc limit 10;
>
> I can't see what I might be missing.
cheers

andrew

TopMemoryContext: 49816 total in 6 blocks; 5384 free (7 chunks); 44432 used
  TopTransactionContext: 8192 total in 1 blocks; 7696 free (0 chunks); 496 used
  Record information cache: 8192 total in 1 blocks; 1800 free (0 chunks); 6392 used
  Type information cache: 8192 total in 1 blocks; 1800 free (0 chunks); 6392 used
  Operator class cache: 8192 total in 1 blocks; 3848 free (0 chunks); 4344 used
  Operator lookup cache: 24576 total in 2 blocks; 14072 free (6 chunks); 10504 used
  MessageContext: 40960 total in 3 blocks; 29920 free (6 chunks); 11040 used
  smgr relation table: 8192 total in 1 blocks; 2816 free (0 chunks); 5376 used
  TransactionAbortContext: 32768 total in 1 blocks; 32752 free (0 chunks); 16 used
  Portal hash: 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
  PortalMemory: 8192 total in 1 blocks; 8040 free (0 chunks); 152 used
    PortalHeapMemory: 1024 total in 1 blocks; 920 free (0 chunks); 104 used
      ExecutorState: 8192 total in 1 blocks; 2144 free (1 chunks); 6048 used
        TupleSort: 40984 total in 3 blocks; 24208 free (10 chunks); 16776 used
        ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
        AggContext: 864018432 total in 127 blocks; 3400 free (110 chunks); 864015032 used
          TupleHashTable: 619175960 total in 95 blocks; 821528 free (331 chunks); 618354432 used
        ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
        ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
        ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
  Relcache by OID: 8192 total in 1 blocks; 3376 free (0 chunks); 4816 used
  CacheMemoryContext: 667696 total in 20 blocks; 169960 free (2 chunks); 497736 used
    pg_shdepend_reference_index: 1024 total in 1 blocks; 240 free (0 chunks); 784 used
    pg_depend_depender_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    pg_depend_reference_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    pg_largeobject_loid_pn_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
    pg_database_datname_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
    pg_index_indrelid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
    pg_opclass_am_name_nsp_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    pg_foreign_data_wrapper_name_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
    pg_enum_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
    pg_class_relname_nsp_index: 1024 total in 1 blocks; 240 free (0 chunks); 784 used
    pg_foreign_server_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
    pg_statistic_relid_att_index: 1024 total in 1 blocks; 240 free (0 chunks); 784 used
    pg_cast_source_target_index: 1024 total in 1 blocks; 240 free (0 chunks); 784 used
    pg_language_name_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
    pg_authid_oid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
    pg_amop_fam_strat_index: 1024 total in 1 blocks; 88 free (0 chunks); 936 used
    pg_index_indexrelid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
    pg_ts_template_tmplname_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
    pg_ts_config_map_index: 1024 total
Re: [HACKERS] Odd out of memory problem.
Andrew Dunstan and...@dunslane.net writes:
> I'm really perplexed as to why this fairly simple query should cause an
> out of memory error:
> select loid, max(pageno) from ldata group by loid order by 2 desc limit 10;

Looks like the group by/aggregate step is eating lots of memory:

> AggContext: 864018432 total in 127 blocks; 3400 free (110 chunks); 864015032 used
> TupleHashTable: 619175960 total in 95 blocks; 821528 free (331 chunks); 618354432 used

A guess is that there are a huge number of distinct values of loid but
the planner fails to realize that and tries to use a hash aggregation.
Could we see EXPLAIN output for this query?

regards, tom lane
Re: [HACKERS] Odd out of memory problem.
On 03/26/2012 11:18 AM, Tom Lane wrote:
> Andrew Dunstan and...@dunslane.net writes:
>> I'm really perplexed as to why this fairly simple query should cause
>> an out of memory error:
>> select loid, max(pageno) from ldata group by loid order by 2 desc limit 10;
> Looks like the group by/aggregate step is eating lots of memory:
>> AggContext: 864018432 total in 127 blocks; 3400 free (110 chunks); 864015032 used
>> TupleHashTable: 619175960 total in 95 blocks; 821528 free (331 chunks); 618354432 used
> A guess is that there are a huge number of distinct values of loid but
> the planner fails to realize that and tries to use a hash aggregation.
> Could we see EXPLAIN output for this query?

Currently it shows:

 Limit  (cost=19443025.87..19443025.89 rows=10 width=8)
   ->  Sort  (cost=19443025.87..19446451.29 rows=1370168 width=8)
         Sort Key: (max(pageno))
         ->  GroupAggregate  (cost=18537785.99..19413417.03 rows=1370168 width=8)
               ->  Sort  (cost=18537785.99..18823953.97 rows=114467192 width=8)
                     Sort Key: loid
                     ->  Seq Scan on ldata  (cost=0.00..1651163.92 rows=114467192 width=8)

The table might have been analysed since I ran the query, though.

To answer Hans' question, we have seen the problem in other contexts. We
first noticed this problem in a failure to restore large objects when
running pg_restore. The database has 43,998,486 LOs on 114,467,137
pages. The largest of these is 2160 pages. We're currently running a
test to see if we can successfully restore LOs by doing them in smaller
batches rather than in a single transaction. However, this one seemed
even odder than the LO problem.

cheers

andrew
Re: [HACKERS] Odd out of memory problem.
Andrew Dunstan and...@dunslane.net writes:
> On 03/26/2012 11:18 AM, Tom Lane wrote:
>> Could we see EXPLAIN output for this query?

> Currently it shows:
>  Limit  (cost=19443025.87..19443025.89 rows=10 width=8)
>    ->  Sort  (cost=19443025.87..19446451.29 rows=1370168 width=8)
>          Sort Key: (max(pageno))
>          ->  GroupAggregate  (cost=18537785.99..19413417.03 rows=1370168 width=8)
>                ->  Sort  (cost=18537785.99..18823953.97 rows=114467192 width=8)
>                      Sort Key: loid
>                      ->  Seq Scan on ldata  (cost=0.00..1651163.92 rows=114467192 width=8)
> The table might have been analysed since I ran the query, though.

That plan should not create a tuple hash table, so I think it's almost
certain that the plan changed. It might be interesting to remove the
pg_statistic rows for the table and then see what plan you get.

> To answer Hans' question, we have seen the problem in other contexts.
> We first noticed this problem in a failure to restore large objects
> when running pg_restore.

[ scratches head... ] I don't understand how or why pg_restore would be
executing such a query.

regards, tom lane
Re: [HACKERS] Odd out of memory problem.
On Mon, Mar 26, 2012 at 4:03 PM, Andrew Dunstan and...@dunslane.net wrote:
> TupleHashTable: 619175960 total in 95 blocks; 821528 free (331 chunks); 618354432 used

I think the plan you showed isn't the plan that's running out of memory.
I think it's running out of memory because it's using a Hash Aggregate
and underestimating the number of distinct elements. I would set
enable_hashagg=false for this query (not for the whole server; hash
aggregates are quite useful in general).

This is a long-standing problem with hash aggregates. The problem is
that we don't have a clever way to fall back to a non-hash aggregate if
the original estimated memory usage turns out to be way off. Just
spilling to disk the way hash joins do would be horrendously expensive,
we think. If we had an approach to doing so it would be very useful,
because simply running out of memory is kind of rude.

--
greg
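The scale of the misestimate here can be put on a back of an envelope
(the numbers below are assumptions for illustration, not measurements):
the planner's default of 200 distinct groups versus the roughly 44
million distinct loids reported earlier in the thread, with an assumed
per-group cost for a hash entry plus transition state.

```python
# Back-of-envelope (assumed numbers, not measured ones) for the failure
# mode described above: the planner's default estimate of 200 distinct
# groups versus ~44 million actual distinct loids. bytes_per_group is an
# assumed per-entry cost for a hash table entry plus transition state.

def hashagg_memory(n_groups, bytes_per_group=40):
    return n_groups * bytes_per_group

planned = hashagg_memory(200)         # what the planner budgeted for
actual = hashagg_memory(44_000_000)   # what the data actually contained

print(f"planned: {planned / 1024:.1f} kB")
print(f"actual:  {actual / 2**20:.0f} MB")
```

The planned table is a few kB; the actual one is on the order of the
~1.4 GB AggContext plus TupleHashTable seen in the memory dump, far past
any plausible work_mem.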
Re: [HACKERS] Odd out of memory problem.
On 03/26/2012 12:11 PM, Tom Lane wrote:
> Andrew Dunstan and...@dunslane.net writes:
>> On 03/26/2012 11:18 AM, Tom Lane wrote:
>>> Could we see EXPLAIN output for this query?
>> Currently it shows:
>>  Limit  (cost=19443025.87..19443025.89 rows=10 width=8)
>>    ->  Sort  (cost=19443025.87..19446451.29 rows=1370168 width=8)
>>          Sort Key: (max(pageno))
>>          ->  GroupAggregate  (cost=18537785.99..19413417.03 rows=1370168 width=8)
>>                ->  Sort  (cost=18537785.99..18823953.97 rows=114467192 width=8)
>>                      Sort Key: loid
>>                      ->  Seq Scan on ldata  (cost=0.00..1651163.92 rows=114467192 width=8)
>> The table might have been analysed since I ran the query, though.
> That plan should not create a tuple hash table, so I think it's almost
> certain that the plan changed. It might be interesting to remove the
> pg_statistic rows for the table and then see what plan you get.

Yeah, that gets us:

 Limit  (cost=2223492.78..2223492.81 rows=10 width=8)
   ->  Sort  (cost=2223492.78..2223493.28 rows=200 width=8)
         Sort Key: (max(pageno))
         ->  HashAggregate  (cost=2223485.96..2223488.46 rows=200 width=8)
               ->  Seq Scan on ldata  (cost=0.00..1651154.64 rows=114466264 width=8)

>> To answer Hans' question, we have seen the problem in other contexts.
>> We first noticed this problem in a failure to restore large objects
>> when running pg_restore.
> [ scratches head... ] I don't understand how or why pg_restore would be
> executing such a query.

It's not. I was explaining that we have seen memory failures in *other*
contexts, not just this query. The restore fails after many hours on a
call to lo_write().

cheers

andrew
Re: [HACKERS] Odd out of memory problem.
On 03/26/2012 12:20 PM, Greg Stark wrote:
> On Mon, Mar 26, 2012 at 4:03 PM, Andrew Dunstan and...@dunslane.net wrote:
>> TupleHashTable: 619175960 total in 95 blocks; 821528 free (331 chunks); 618354432 used
> I think the plan you showed isn't the plan that's running out of
> memory. I think it's running out of memory because it's using a Hash
> Aggregate and underestimating the number of distinct elements. I would
> set enable_hashagg=false for this query (not for the whole server; hash
> aggregates are quite useful in general).
>
> This is a long-standing problem with hash aggregates. The problem is
> that we don't have a clever way to fall back to a non-hash aggregate if
> the original estimated memory usage turns out to be way off. Just
> spilling to disk the way hash joins do would be horrendously expensive,
> we think. If we had an approach to doing so it would be very useful,
> because simply running out of memory is kind of rude.

Ugh. Ok, thanks for the explanation. So it looks like I'm running into
two separate memory problems.
Here's a trace from the other one.

cheers

andrew

TopMemoryContext: 49816 total in 6 blocks; 6840 free (9 chunks); 42976 used
  Operator class cache: 8192 total in 1 blocks; 3848 free (0 chunks); 4344 used
  Filesystem: 8192 total in 1 blocks; 7872 free (0 chunks); 320 used
  TopTransactionContext: 213901336 total in 10 blocks; 1796344 free (25 chunks); 212104992 used
    Combo CIDs: 755490840 total in 100 blocks; 5161072 free (381 chunks); 750329768 used
  MessageContext: 73728 total in 4 blocks; 44920 free (15 chunks); 28808 used
  smgr relation table: 8192 total in 1 blocks; 2816 free (0 chunks); 5376 used
  TransactionAbortContext: 32768 total in 1 blocks; 32752 free (0 chunks); 16 used
  Portal hash: 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
  PortalMemory: 8192 total in 1 blocks; 8040 free (0 chunks); 152 used
    PortalHeapMemory: 1024 total in 1 blocks; 920 free (0 chunks); 104 used
      ExecutorState: 8192 total in 1 blocks; 5880 free (1 chunks); 2312 used
        ExprContext: 24576 total in 2 blocks; 13664 free (8 chunks); 10912 used
  Relcache by OID: 8192 total in 1 blocks; 3376 free (0 chunks); 4816 used
  CacheMemoryContext: 405552 total in 19 blocks; 14296 free (5 chunks); 391256 used
    pg_largeobject_loid_pn_index: 1024 total in 1 blocks; 240 free (0 chunks); 784 used
    pg_index_indrelid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
    pg_opclass_am_name_nsp_index: 1024 total in 1 blocks; 192 free (0 chunks); 832 used
    pg_foreign_data_wrapper_name_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
    pg_enum_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
    pg_class_relname_nsp_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
    pg_foreign_server_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
    pg_statistic_relid_att_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
    pg_cast_source_target_index: 1024 total in 1 blocks; 240 free (0 chunks); 784 used
    pg_language_name_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
    pg_authid_oid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
    pg_amop_fam_strat_index: 1024 total in 1 blocks; 88 free (0 chunks); 936 used
    pg_index_indexrelid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
    pg_ts_template_tmplname_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
    pg_ts_config_map_index: 1024 total in 1 blocks; 192 free (0 chunks); 832 used
    pg_opclass_oid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
    pg_foreign_data_wrapper_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
    pg_auth_members_member_role_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
    pg_ts_dict_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
    pg_conversion_default_index: 1024 total in 1 blocks; 128 free (0 chunks); 896 used
    pg_operator_oprname_l_r_n_index: 1024 total in 1 blocks; 128 free (0 chunks); 896 used
    pg_trigger_tgrelid_tgname_index: 1024 total in 1 blocks; 240 free (0 chunks); 784 used
    pg_enum_typid_label_index: 1024 total in 1 blocks; 280 free (0 chunks); 744 used
    pg_ts_config_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
    pg_user_mapping_oid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
    pg_opfamily_am_name_nsp_index: 1024 total in 1 blocks; 192 free (0 chunks); 832 used
    pg_type_oid_index: 1024 total in 1 blocks; 304 free (0 chunks); 720 used
    pg_aggregate_fnoid_index: 1024 total in 1 blocks; 344 free (0 chunks); 680 used
Re: [HACKERS] Odd out of memory problem.
Andrew Dunstan and...@dunslane.net writes:
> On 03/26/2012 12:11 PM, Tom Lane wrote:
>> That plan should not create a tuple hash table, so I think it's almost
>> certain that the plan changed. It might be interesting to remove the
>> pg_statistic rows for the table and then see what plan you get.

> Yeah, that gets us:
>  Limit  (cost=2223492.78..2223492.81 rows=10 width=8)
>    ->  Sort  (cost=2223492.78..2223493.28 rows=200 width=8)
>          Sort Key: (max(pageno))
>          ->  HashAggregate  (cost=2223485.96..2223488.46 rows=200 width=8)
>                ->  Seq Scan on ldata  (cost=0.00..1651154.64 rows=114466264 width=8)

Hm. This illustrates that it's not too prudent to rely on a default
numdistinct estimate to decide that a hash aggregation is safe :-(. We
had probably better tweak the cost estimation rules to not trust that.
Maybe, if we have a default estimate, we should take the worst-case
estimate that the column might be unique? That could still burn us if
the rowcount estimate was horribly wrong, but those are not nearly as
shaky as numdistinct estimates ...

>> [ scratches head... ] I don't understand how or why pg_restore would
>> be executing such a query.

> It's not. I was explaining that we have seen memory failures in *other*
> contexts, not just this query. The restore fails after many hours on a
> call to lo_write().

Seems probably unrelated then. Have you got a memory-usage dump for that
case?

regards, tom lane
Re: [HACKERS] Odd out of memory problem.
On Mon, Mar 26, 2012 at 5:41 PM, Andrew Dunstan and...@dunslane.net wrote:
> Combo CIDs: 755490840 total in 100 blocks; 5161072 free (381 chunks); 750329768 used

I think you'll have to catch Heikki's attention to get a good answer to
this. Is it possible this job is inserting and then updating (or
deleting) the row it just inserted and doing a large number of such
insert/update operations all within the same transaction? Or perhaps
it's updating the same row over and over again?

--
greg
Re: [HACKERS] Odd out of memory problem.
On Mon, Mar 26, 2012 at 5:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Hm. This illustrates that it's not too prudent to rely on a default
> numdistinct estimate to decide that a hash aggregation is safe :-(. We
> had probably better tweak the cost estimation rules to not trust that.
> Maybe, if we have a default estimate, we should take the worst-case
> estimate that the column might be unique? That could still burn us if
> the rowcount estimate was horribly wrong, but those are not nearly as
> shaky as numdistinct estimates ...

The selectivity API should include some way of indicating the accuracy
of the answer, as well as the answer itself. That way we could respond
better in a wide range of circumstances.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Odd out of memory problem.
On 26.03.2012 19:51, Greg Stark wrote:
> On Mon, Mar 26, 2012 at 5:41 PM, Andrew Dunstan and...@dunslane.net wrote:
>> Combo CIDs: 755490840 total in 100 blocks; 5161072 free (381 chunks); 750329768 used
> I think you'll have to catch Heikki's attention to get a good answer to
> this. Is it possible this job is inserting and then updating (or
> deleting) the row it just inserted and doing a large number of such
> insert/update operations all within the same transaction? Or perhaps
> it's updating the same row over and over again?

.. and all that in different subtransactions.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Odd out of memory problem.
On Mon, Mar 26, 2012 at 5:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Hm. This illustrates that it's not too prudent to rely on a default
> numdistinct estimate to decide that a hash aggregation is safe :-(. We
> had probably better tweak the cost estimation rules to not trust that.
> Maybe, if we have a default estimate, we should take the worst-case
> estimate that the column might be unique? That could still burn us if
> the rowcount estimate was horribly wrong, but those are not nearly as
> shaky as numdistinct estimates ...

Perhaps we should have two work_mem settings -- one for the target to
aim for and one for a hard(er) limit that we should ensure the worst
case falls under?

I have a sketch for how to handle spilling hash aggregates to disk in my
head. I'm not sure if it's worth the amount of complexity it would
require, but I'll poke around a bit and see if it works out well.

--
greg
Re: [HACKERS] Odd out of memory problem.
On 26.03.2012 19:59, Heikki Linnakangas wrote:
> On 26.03.2012 19:51, Greg Stark wrote:
>> On Mon, Mar 26, 2012 at 5:41 PM, Andrew Dunstan and...@dunslane.net wrote:
>>> Combo CIDs: 755490840 total in 100 blocks; 5161072 free (381 chunks); 750329768 used
>> I think you'll have to catch Heikki's attention to get a good answer
>> to this. Is it possible this job is inserting and then updating (or
>> deleting) the row it just inserted and doing a large number of such
>> insert/update operations all within the same transaction? Or perhaps
>> it's updating the same row over and over again?
> .. and all that in different subtransactions.

sorry, scratch that, they don't need to be in different subtransactions.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
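Why "Combo CIDs" grows without bound in this workload can be illustrated
with a hypothetical Python model (a simplification, not the PostgreSQL
implementation): when a transaction updates or deletes a row it inserted
earlier in the same transaction, the tuple must carry both the inserting
command id (cmin) and the deleting one (cmax), so the backend maps each
distinct (cmin, cmax) pair to a combo id in an in-memory table that is
only released at transaction end.

```python
# Hypothetical model of combo-CID accumulation. Each distinct
# (cmin, cmax) pair seen within one transaction gets an entry that
# survives until commit or abort.

combo_table = {}  # (cmin, cmax) -> combo id, lives until transaction end

def get_combo_cid(cmin, cmax):
    key = (cmin, cmax)
    if key not in combo_table:
        combo_table[key] = len(combo_table)  # never freed mid-transaction
    return combo_table[key]

# A restore-style workload: each object is inserted at one command and
# touched again at the next, all in one transaction, so every pair is new.
for cid in range(0, 100_000, 2):
    get_combo_cid(cid, cid + 1)

print(len(combo_table))  # 50000 distinct pairs, all retained
```

This is consistent with the batching workaround in the thread: committing
every few thousand large objects resets the table, bounding its size.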
Re: [HACKERS] Odd out of memory problem.
Greg Stark st...@mit.edu writes: I have a sketch for how to handle spilling hash aggregates to disk in my head. I'm not sure if it's worth the amount of complexity it would require but I'll poke around a bit and see if it works out well. It'd be awfully nice if those could spill to disk. I think that currently that's the only plan type where a misestimate can lead to hard failure rather than just slower-than-you'd-like. Which is not nice considering that the estimates are necessarily just estimates. Could you give us a brain dump on the sketch? I've never seen how to do it without unreasonable overhead. regards, tom lane
Re: [HACKERS] Odd out of memory problem.
On 03/26/2012 01:06 PM, Heikki Linnakangas wrote: On 26.03.2012 19:59, Heikki Linnakangas wrote: On 26.03.2012 19:51, Greg Stark wrote: On Mon, Mar 26, 2012 at 5:41 PM, Andrew Dunstan and...@dunslane.net wrote: Combo CIDs: 755490840 total in 100 blocks; 5161072 free (381 chunks); 750329768 used I think you'll have to catch Heikki's attention to get a good answer to this. Is it possible this job is inserting and then updating (or deleting) the row it just inserted, and doing a large number of such insert/update operations all within the same transaction? Or perhaps it's updating the same row over and over again? ... and all that in different subtransactions. Sorry, scratch that, they don't need to be in different subtransactions. It's all in a single transaction. In fact the solution I'm currently testing, which seems to be working, involves breaking it up into batches of a few thousand LOs restored per batch. cheers andrew
Re: [HACKERS] Odd out of memory problem.
Andrew Dunstan and...@dunslane.net writes: On 03/26/2012 01:06 PM, Heikki Linnakangas wrote: Is it possible this job is inserting and then updating (or deleting) the row it just inserted, and doing a large number of such insert/update operations all within the same transaction? Or perhaps it's updating the same row over and over again? It's all in a single transaction. In fact the solution I'm currently testing, which seems to be working, involves breaking it up into batches of a few thousand LOs restored per batch. Hm. The test case is just a straight pg_restore of lots and lots of LOs? What pg_dump version was the dump made with? regards, tom lane
Re: [HACKERS] Odd out of memory problem.
On 03/26/2012 01:34 PM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 03/26/2012 01:06 PM, Heikki Linnakangas wrote: Is it possible this job is inserting and then updating (or deleting) the row it just inserted, and doing a large number of such insert/update operations all within the same transaction? Or perhaps it's updating the same row over and over again? It's all in a single transaction. In fact the solution I'm currently testing, which seems to be working, involves breaking it up into batches of a few thousand LOs restored per batch. Hm. The test case is just a straight pg_restore of lots and lots of LOs? What pg_dump version was the dump made with? 8.4.8, same as the target. We get the same issue whether we restore directly to the database from pg_restore or via a text dump. cheers andrew
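The batching workaround Andrew describes can be sketched roughly as follows. This is a hypothetical driver script, not his actual solution: it assumes the large objects are available as files listed one path per line in `lo_files.txt`, and the batch size of 1000 is arbitrary. The point is simply that committing every batch keeps any single transaction from accumulating millions of combo CIDs.

```shell
# Hypothetical sketch: import LOs 1000 at a time, one transaction per
# batch, so the combo CID cache is released at each COMMIT.
split -l 1000 lo_files.txt batch_
for f in batch_*; do
  {
    echo "BEGIN;"
    # turn each path into: SELECT lo_import('path');
    sed "s/.*/SELECT lo_import('&');/" "$f"
    echo "COMMIT;"
  } | psql -d mydb
done
```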
Re: [HACKERS] Odd out of memory problem.
Andrew Dunstan and...@dunslane.net writes: On 03/26/2012 01:34 PM, Tom Lane wrote: Hm. The test case is just a straight pg_restore of lots and lots of LOs? What pg_dump version was the dump made with? 8.4.8, same as the target. We get the same issue whether we restore direct to the database from pg_restore or via a text dump. I believe I see the issue: when creating/loading LOs, we first do a lo_create (which in 8.4 makes a page-zero tuple in pg_largeobject containing zero bytes of data) and then lo_write, which will do a heap_update to overwrite that tuple with data. This happens at the next command in the same transaction, so the original tuple has to receive a combo CID. Net result: we accumulate one new combo CID per large object loaded in the same transaction. You can reproduce this without any pg_dump involvement at all, using something like

create table mylos (id oid);
insert into mylos select lo_import('/tmp/junk') from generate_series(1,100);

The problem is gone in 9.0 and up because now we use a pg_largeobject_metadata entry, instead of a pg_largeobject row, to flag the existence of an empty large object. I don't see any very practical backend fix for the problem in 8.x. regards, tom lane
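The growth Tom describes is easy to model: a combo CID is assigned for each distinct (cmin, cmax) pair a transaction produces, and each lo_create/lo_write pair happens at fresh command IDs, so every large object contributes a pair never seen before. A toy model (this is an illustration of the growth, not the real combocid.c logic):

```python
def simulate_lo_restore(n_large_objects):
    """Toy model: one combo CID per distinct (cmin, cmax) pair.

    Each large object's tuple is inserted by lo_create at one command ID
    and updated by lo_write at the next, so its (cmin, cmax) pair is new
    every time -- the combo CID cache grows by one entry per object.
    """
    combo_cids = {}      # (cmin, cmax) -> combo CID
    command_id = 0
    for _ in range(n_large_objects):
        cmin = command_id       # lo_create inserts the tuple here
        cmax = command_id + 1   # lo_write updates it at the next command
        pair = (cmin, cmax)
        if pair not in combo_cids:
            combo_cids[pair] = len(combo_cids)
        command_id += 2
    return len(combo_cids)

print(simulate_lo_restore(100000))  # 100000: one combo CID per LO
```

Restoring in separate transactions resets the cache between batches, which is exactly why Andrew's batching workaround keeps memory bounded.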
Re: [HACKERS] Odd out of memory problem.
On Mon, Mar 26, 2012 at 6:15 PM, Tom Lane t...@sss.pgh.pa.us wrote: Greg Stark st...@mit.edu writes: I have a sketch for how to handle spilling hash aggregates to disk in my head. I'm not sure if it's worth the amount of complexity it would require but I'll poke around a bit and see if it works out well. It'd be awfully nice if those could spill to disk. I think that currently that's the only plan type where a misestimate can lead to hard failure rather than just slower-than-you'd-like. Which is not nice considering that the estimates are necessarily just estimates. Could you give us a brain dump on the sketch? I've never seen how to do it without unreasonable overhead. Hm. So my original plan was dependent on adding the state-merge function we've talked about in the past. Not all aggregate functions can necessarily support such a function, but I think all or nearly all the builtin aggregates can. Certainly min, max, count, sum, avg, stddev, and array_agg can, which cover most of what people do. That would be a function which can take two state variables and produce a new state variable. If you have this then you can sort and spill the hash table, start a new hash table, and keep going. When you're done you merge the hash tables using something like a heap merge, but applying the state-merge function. However, now that I've started thinking about it further, I think you could solve it with less complexity by cheating in various ways. For example, if you limit the hash size to 1/2 of work_mem, then when you reach that limit you could just stuff any tuple that doesn't match a hash entry into a tuplesort with the other 1/2 of work_mem and do the regular level-break logic on the output of that. Or, if you can count on tuplesort to be a stable sort (and I think it might be already), then you could just spill the sorted hash entries and then switch to doing a tape sort for the rest. When you finish merging you can read them back and do level-break logic like normal.
If there is a partially computed state it will be the first one in its equal group. If you can't count on tuplesort to be stable, you could dump the partially computed states to a different tape and do a merge between it and the sorted output of the main data set. The holy grail of this kind of merging of the two algorithms would be something that keeps the hash table going and maintains an LRU of hash entries. When it grows too large it would spill the oldest partial state, then sort those states and merge them (possibly in a single step). That might be too complex to pull its weight, given that some of the above sketches are probably simple enough. -- greg
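Under Greg's stated assumption that each aggregate has a state-merge function, the spill-and-merge scheme can be sketched like this. All names here are hypothetical; a real executor would spill sorted runs to tapes on disk rather than keep them in Python lists, and the memory limit would be bytes of work_mem rather than an entry count:

```python
import heapq
from itertools import groupby

def hash_agg_with_spill(tuples, transfn, mergefn, init, max_entries):
    """Sketch: fill a hash table; when it would exceed max_entries, sort
    and spill its partial states as a run and start a fresh table. At the
    end, merge the sorted runs, combining partial states for equal keys
    with the state-merge function."""
    runs, table = [], {}
    for key, value in tuples:
        if key not in table and len(table) >= max_entries:
            runs.append(sorted(table.items()))  # spill one sorted run
            table = {}
        table[key] = transfn(table.get(key, init), value)
    runs.append(sorted(table.items()))
    # Heap-merge the runs; equal keys arrive adjacent and get state-merged.
    merged = heapq.merge(*runs, key=lambda kv: kv[0])
    return {k: _merge_group(g, mergefn)
            for k, g in groupby(merged, key=lambda kv: kv[0])}

def _merge_group(group, mergefn):
    states = [state for _, state in group]
    out = states[0]
    for s in states[1:]:
        out = mergefn(out, s)   # the hypothesized state-merge function
    return out

# Example: sum(). transfn folds in an input value; mergefn adds partial sums.
result = hash_agg_with_spill(
    [(i % 10, 1) for i in range(1000)],
    transfn=lambda s, v: s + v, mergefn=lambda a, b: a + b,
    init=0, max_entries=4)
print(result[0])  # 100
```

Note that mergefn here only ever combines transition states, never finalized values, which matches the point made later in the thread that the merge must happen before any final function runs.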
Re: [HACKERS] Odd out of memory problem.
Greg Stark st...@mit.edu writes: On Mon, Mar 26, 2012 at 6:15 PM, Tom Lane t...@sss.pgh.pa.us wrote: Could you give us a brain dump on the sketch? I've never seen how to do it without unreasonable overhead. Hm. So my original plan was dependent on adding the state-merge function we've talked about in the past. Not all aggregate functions can necessarily support such a function, but I think all or nearly all the builtin aggregates can. Certainly min, max, count, sum, avg, stddev, and array_agg can, which cover most of what people do. That would be a function which can take two state variables and produce a new state variable. I'd rather not invent new requirements for aggregate implementations if we can avoid it. However now that I've started thinking about it further I think you could solve it with less complexity by cheating in various ways. For example, if you limit the hash size to 1/2 of work_mem, then when you reach that limit you could just stuff any tuple that doesn't match a hash entry into a tuplesort with the other 1/2 of work_mem and do the regular level-break logic on the output of that. Or just start dumping such tuples into a tuplestore, while continuing to process tuples that match the hashagg entries that are already in existence. Once the input is exhausted, read out the hashagg entries we have, flush the hashagg table, and start reading from the tuplestore. Repeat as needed. I like this idea because the only thing you give up is predictability of the order of output of aggregated entries, which is something that a hashagg isn't guaranteeing anyway. In particular, we would still have a guarantee that any one aggregate evaluation processes the matching tuples in arrival order, which is critical for some aggregates.
The main problem I can see is that if we start to flush after work_mem is X% full, we're essentially hoping that the state values for the existing aggregates won't grow by more than the remaining (100-X)%, which is safe for many common aggregates but fails for some like array_agg(). Ultimately, for ones like that, it'd probably be best to never consider hashing at all. I guess we could invent an "unsafe for hash aggregation" flag for aggregates that have unbounded state-size requirements. regards, tom lane
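Tom's variant needs no merge machinery at all: once the hash table is full, stop admitting new groups, divert tuples for unseen groups to a tuplestore, and make repeated passes. A hedged sketch under the same simplifications as before (hypothetical names; a Python list stands in for the disk-backed tuplestore, and an entry count stands in for work_mem):

```python
def hash_agg_multipass(tuples, transfn, init, max_entries):
    """Sketch of the tuplestore scheme: when the table hits its limit,
    tuples for groups not already present go to an overflow store.
    After the input is exhausted, emit the table's entries, reset it,
    and reprocess the overflow -- repeating until nothing is left.
    Each group still sees its tuples in arrival order."""
    results = {}
    pending = list(tuples)
    while pending:
        table, overflow = {}, []
        for key, value in pending:
            if key in table:
                table[key] = transfn(table[key], value)
            elif len(table) < max_entries:
                table[key] = transfn(init, value)
            else:
                overflow.append((key, value))  # stand-in for a tuplestore
        results.update(table)
        pending = overflow
    return results

# Example: count(*) per key; 5 keys appearing 20 times each, table
# limited to 2 groups, so three passes are needed.
counts = hash_agg_multipass(
    [(i % 5, 1) for i in range(100)],
    transfn=lambda s, v: s + v, init=0, max_entries=2)
print(counts)  # every key ends up with count 20
```

The trade-off Tom names is visible here: output order depends on which pass a group was admitted in, but within a group the arrival order of its tuples is preserved.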