Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On 12.11.2012 01:25, Tom Lane wrote:
> Worse than that, GIST WAL replay seems to be a total disaster from this
> standpoint, and I'm not convinced that it's even fixable without
> incompatible WAL changes. There are several cases where index insertion
> operations generate more than one WAL record while holding exclusive
> lock on buffer(s). If the lock continuity is actually necessary to
> prevent concurrent queries from seeing inconsistent index states, then
> we have a big problem, because WAL replay isn't designed to hold any
> buffer lock for more than one WAL action.

Hmm. After the rewrite of that logic I did in 9.1 (commit 9de3aa65f0), the GiST index is always in a consistent state. Until the downlink for a newly split page has been inserted, its left sibling is flagged so that a search knows to follow the right-link to find it. In normal operation, the lock continuity is needed to prevent another backend from seeing the incomplete split and trying to fix it, but in hot standby, we never try to fix incomplete splits anyway. So I think we're good on >= 9.1.

The 9.0 code is broken, however. In 9.0, when a child page is split, the parent and new children are kept locked until the downlinks are inserted/updated. If a concurrent scan comes along and sees that incomplete state, it will miss tuples on the new right siblings. We rely on a rm_cleanup operation at the end of WAL replay to fix that situation, if the downlink insertion record is not there. I don't see any easy way to fix that, unfortunately. Perhaps we could backpatch the 9.1 rewrite, now that it's gotten some real-world testing, but it was a big change, so I don't feel very comfortable doing that.

> Also, gistRedoPageDeleteRecord seems to be dead code? I don't see
> anything that emits XLOG_GIST_PAGE_DELETE.

Yep. It was used in VACUUM FULL, but has been dead since VACUUM FULL was removed.

- Heikki
Re: [HACKERS] Enabling Checksums
Jeff,

On 11/10/2012 12:08 AM, Jeff Davis wrote:
> The bit indicating that a checksum is present may be lost due to
> corruption.

Hm.. I see. Sorry if that has been discussed before, but can't we do without that bit at all? It adds a checksum switch to each page, where we just agreed we don't even want a per-database switch.

Can we simply write a progress indicator to pg_control or someplace saying that all pages up to X of relation Y are supposed to have valid checksums? That would mean having to re-calculate the checksums on pages that got dirtied before VACUUM came along to migrate them to having a checksum, but that seems acceptable. VACUUM could even detect that case and wouldn't have to re-write the page with the same contents.

I realize this doesn't support Jesper's use case of wanting to have the checksums only for newly dirtied pages. However, I'd argue that prolonging the migration to spread the load would allow even big shops to go through this without much of an impact on performance.

Regards

Markus Wanner
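For illustration, such a watermark could be as small as a per-relation block number. Everything below is an invented sketch, not an actual PostgreSQL structure:

/*
 * Hypothetical per-relation migration watermark: pages before
 * checksummed_upto are guaranteed to carry valid checksums; pages at or
 * beyond it may not have been migrated yet. Field and type names here
 * are illustrative assumptions only.
 */
typedef struct ChecksumProgress
{
	Oid			relid;				/* relation being migrated */
	BlockNumber	checksummed_upto;	/* first block not yet guaranteed */
} ChecksumProgress;

/* Checksum verification would then be conditional on the watermark. */
static bool
page_must_have_checksum(const ChecksumProgress *cp, BlockNumber blkno)
{
	return blkno < cp->checksummed_upto;
}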
Re: [HACKERS] Enabling Checksums
On 11/12/2012 05:55 AM, Greg Smith wrote:
> Adding an initdb option to start out with everything checksummed seems
> an uncontroversial good first thing to have available.

+1

So the following discussion really is for a future patch extending on that initial checksum support.

> One of the really common cases I was expecting here is that conversions
> are done by kicking off a slow background VACUUM CHECKSUM job that
> might run in pieces. I was thinking of an approach like this:
> -Initialize a last_checked_block value for each table
> -Loop:
> --Grab the next block after the last checked one
> --When on the last block of the relation, grab an exclusive lock to
>   protect against race conditions with extension
> --If it's marked as checksummed and the checksum matches, skip it
> ---Otherwise, add a checksum and write it out
> --When that succeeds, update last_checked_block
> --If that was the last block, save some state saying the whole table
>   is checksummed

Perfect, thanks. That's the rough idea I had in mind as well, written out in detail and catching the extension case. With that logic, there is at least a forward-moving pointer that removes the uncertainty around whether pages have been updated or not. It will keep going usefully if interrupted, too. (A sketch of this loop in code follows below.)

> One obvious way this can fail is if:
> 1) A late page in the relation is updated and a checksummed page written
> 2) The page is corrupted such that the "is this checksummed?" bits are
>    not consistent anymore, along with other damage to it
> 3) The conversion process gets to this page eventually
> 4) The corruption of (2) isn't detected

IMO this just outlines how limited the use of the "is this checksummed" bit in the page itself is. It just doesn't catch all cases. Is it worth having that bit at all, given your block-wise approach above? It really only serves to catch corruptions to *newly* dirtied pages *during* the migration phase that *keep* that single bit set. Everything else is covered by the last_checked_block variable. Sounds narrow enough to be negligible. Then again, it's just a single bit per page...

The only guarantee I see that we can give for online upgrades is that after a VACUUM CHECKSUM sweep is done, and every page is known to both have a valid checksum on it and have its checksum bits set, *then* any page that doesn't have both set bits and a matching checksum is garbage. From that point in time on, we'd theoretically better use that bit as an additional checksum bit rather than requiring it to be set at all times. Really just theoretically, I'm certainly not advocating a 33-bit checksum :-)

Regards

Markus Wanner
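Spelled out as code, the loop could look roughly like this. Of the names used here, only RelationGetNumberOfBlocks() and the relation-extension lock calls are real PostgreSQL APIs; the get/set_last_checked_block and checksum helpers are invented for illustration:

/*
 * Illustrative sketch of the background conversion loop described
 * above; not an actual patch.
 */
void
vacuum_checksum_relation(Relation rel)
{
	BlockNumber blkno = get_last_checked_block(rel);	/* resume point */
	BlockNumber nblocks = RelationGetNumberOfBlocks(rel);

	for (; blkno < nblocks; blkno++)
	{
		bool		last = (blkno == nblocks - 1);

		if (last)	/* guard against a concurrent extension */
			LockRelationForExtension(rel, ExclusiveLock);

		if (!block_has_valid_checksum(rel, blkno))
			checksum_and_write_block(rel, blkno);

		set_last_checked_block(rel, blkno + 1);	/* forward-moving pointer */

		if (last)
		{
			mark_relation_fully_checksummed(rel);
			UnlockRelationForExtension(rel, ExclusiveLock);
		}
	}
}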
Re: [HACKERS] Enabling Checksums
On 11/12/2012 04:44 PM, Markus Wanner wrote:
> Jeff,
>
> On 11/10/2012 12:08 AM, Jeff Davis wrote:
>> The bit indicating that a checksum is present may be lost due to
>> corruption.
>
> Hm.. I see. Sorry if that has been discussed before, but can't we do
> without that bit at all? It adds a checksum switch to each page, where
> we just agreed we don't even want a per-database switch.
>
> Can we simply write a progress indicator to pg_control or someplace
> saying that all pages up to X of relation Y are supposed to have valid
> checksums?

That'll make it hard for VACUUM, hint-bit setting, etc. to opportunistically checksum pages whenever they're doing a page write anyway.

Is it absurd to suggest using another bitmap, like the FSM or visibility map, to store information on page checksumming while checksumming is enabled but incomplete? As a much smaller file, the bitmap could itself be very quickly generated in one pass when checksumming is enabled, with its starting state showing no pages having checksums. It could perhaps itself have page checksums, since presumably the persistent maps like the FSM and visibility map will support them? Some way to ensure the checksum map is valid would be needed.

-- Craig Ringer
Re: [HACKERS] Reduce palloc's in numeric operations.
Thanks for the comments.

> Have to be careful to really not modify the operands. numeric_out() and
> numeric_out_sci() are wrong; they call get_str_from_var(), which
> modifies the argument. Same with numeric_expr(): it passes the argument
> to numericvar_to_double_no_overflow(), which passes it to
> get_str_from_var(). numericvar_to_int8() also modifies its argument, so
> all the functions that use that, directly or indirectly, must make a
> copy.

Hmm. I wasn't careful enough to catch those... I overlooked that get_str_from_var() and numericvar_to_int8() call round_var(), which rewrites the operand. I reverted numeric_out(), numeric_int8() and numeric_int2().

Although, I couldn't find where the operand would be modified in get_str_from_var_sci(). The lines where var appears in get_str_from_var_sci() are listed below.

| 2:get_str_from_var_sci(NumericVar *var, int rscale)
| 21:	if (var->ndigits > 0)
| 23:		exponent = (var->weight + 1) * DEC_DIGITS;
| 29:		exponent -= DEC_DIGITS - (int) log10(var->digits[0]);
| 59:	div_var(var, &denominator, &significand, rscale, true);

The only suspect at this level is div_var(), so do the same thing for var1 in div_var():

| 2:div_var(NumericVar *var1, NumericVar *var2, NumericVar *result,
| 20:	int var1ndigits = var1->ndigits;
| 35:	if (var1ndigits == 0)
| 47:	if (var1->sign == var2->sign)
| 51:	res_weight = var1->weight - var2->weight;
| 68:	div_ndigits = Max(div_ndigits, var1ndigits);
| 83:	memcpy(dividend + 1, var1->digits, var1ndigits * sizeof(NumericDigit));
| 132:	for (i = var1ndigits; i >= 0; i--)

No line seems to modify var1 as far as I can see, so I've left numeric_out_sci() modified in this patch.

Well, I found some other bugs in numeric_stddev_internal(): vN was erroneously freed, and vsumX2 is used as a work variable. They are fixed in this new patch.

> Perhaps get_str_from_var(), and the other functions that currently
> scribble on the arguments, should be modified to not do so. They could
> easily make a copy of the argument within the function. Then the
> callers could safely use set_var_from_num_nocopy(). The performance
> would be the same, you would have the same number of pallocs, but you
> would get rid of the surprising argument-modifying behavior of those
> functions.

I agree with that. const qualifiers on parameters would enforce this mechanically. I'll try that for the next version of this patch.

> SELECT SUM(col) FROM numtest;
>
> The execution time of that query fell from about 5300 ms to 4300 ms,
> ie. about 20%.

Wow, it seems more promising than I expected. Thanks.
regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center, Nippon Telegraph and Telephone Corporation
Phone: 03-5860-5115 / Fax: 03-5463-5490

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 68c1f1d..fcff325 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -367,6 +367,7 @@ static void zero_var(NumericVar *var);
 static const char *set_var_from_str(const char *str, const char *cp,
 					NumericVar *dest);
 static void set_var_from_num(Numeric value, NumericVar *dest);
+static void set_var_from_num_nocopy(Numeric num, NumericVar *dest);
 static void set_var_from_var(NumericVar *value, NumericVar *dest);
 static char *get_str_from_var(NumericVar *var, int dscale);
 static char *get_str_from_var_sci(NumericVar *var, int rscale);
@@ -617,11 +618,10 @@ numeric_out_sci(Numeric num, int scale)
 		return pstrdup("NaN");
 
 	init_var(&x);
-	set_var_from_num(num, &x);
+	set_var_from_num_nocopy(num, &x);
 
 	str = get_str_from_var_sci(&x, scale);
 
-	free_var(&x);
 	return str;
 }
@@ -696,7 +696,7 @@ numeric_send(PG_FUNCTION_ARGS)
 	int			i;
 
 	init_var(&x);
-	set_var_from_num(num, &x);
+	set_var_from_num_nocopy(num, &x);
 
 	pq_begintypsend(&buf);
@@ -707,8 +707,6 @@ numeric_send(PG_FUNCTION_ARGS)
 	for (i = 0; i < x.ndigits; i++)
 		pq_sendint(&buf, x.digits[i], sizeof(NumericDigit));
 
-	free_var(&x);
-
 	PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
 }
@@ -1577,15 +1575,13 @@ numeric_add(PG_FUNCTION_ARGS)
 	init_var(&arg2);
 	init_var(&result);
 
-	set_var_from_num(num1, &arg1);
-	set_var_from_num(num2, &arg2);
+	set_var_from_num_nocopy(num1, &arg1);
+	set_var_from_num_nocopy(num2, &arg2);
 
 	add_var(&arg1, &arg2, &result);
 
 	res = make_result(&result);
 
-	free_var(&arg1);
-	free_var(&arg2);
 	free_var(&result);
 
 	PG_RETURN_NUMERIC(res);
@@ -1620,15 +1616,13 @@ numeric_sub(PG_FUNCTION_ARGS)
 	init_var(&arg2);
 	init_var(&result);
 
-	set_var_from_num(num1, &arg1);
-	set_var_from_num(num2, &arg2);
+	set_var_from_num_nocopy(num1, &arg1);
+	set_var_from_num_nocopy(num2, &arg2);
 
 	sub_var(&arg1, &arg2, &result);
 
 	res = make_result(&result);
 
-	free_var(&arg1);
-	free_var(&arg2);
 	free_var(&result);
 
 	PG_RETURN_NUMERIC(res);
@@ -1667,15 +1661,13 @@ numeric_mul(PG_FUNCTION_ARGS)
 	init_var(&arg2);
 	init_var(&result);
 
-	set_var_from_num(num1, &arg1);
-	set_var_from_num(num2, &arg2);
+	set_var_from_num_nocopy(num1, &arg1);
+	set_var_from_num_nocopy(num2, &arg2);
 
 	mul_var(&arg1, &arg2, &result, arg1.dscale +
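The nocopy variant at the heart of the patch amounts to pointing the NumericVar at the Numeric's own digit buffer instead of palloc'ing a copy. A sketch of what it can look like, using the NUMERIC_* accessor macros from numeric.c (treat this as an illustration; the attached patch above is authoritative, and later PostgreSQL releases implement the same idea as init_var_from_num()):

static void
set_var_from_num_nocopy(Numeric num, NumericVar *dest)
{
	dest->ndigits = NUMERIC_NDIGITS(num);
	dest->weight = NUMERIC_WEIGHT(num);
	dest->sign = NUMERIC_SIGN(num);
	dest->dscale = NUMERIC_DSCALE(num);
	dest->digits = NUMERIC_DIGITS(num);	/* shared with num, not copied */
	dest->buf = NULL;					/* nothing palloc'd, so no free_var() */
}

Because the digits array is shared, the caller must not pass dest to anything that scribbles on its operand (round_var() via get_str_from_var(), for instance), which is exactly the hazard discussed in this thread.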
Re: [HACKERS] Identity projection
Hello,

This is a new version of the identity projection patch.

Reverted projectionInfo and ExecBuildProjectionInfo. Identity projection is recognized directly in ExecGroup, ExecResult, and ExecWindowAgg. nodeAgg is reverted because I couldn't make it sane..

The following is the result of the performance test posted before, in order to show the source of the gain.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

At Fri, 05 Oct 2012 16:04:16 +0900, Kyotaro HORIGUCHI wrote in 20121005.160416.256387378.horiguchi.kyot...@lab.ntt.co.jp:
> Although I said as following, the gain seems a bit larger... I'll
> recheck the testing conditions...

I inspected this more precisely on two aspects, magnifying the effect of this patch by putting 300 columns into the table.

First, EXPLAIN ANALYZE says the difference caused by this patch is only in the actual time of the Result node.

orig$ psql -c 'explain analyze select * from parenta'
                                  QUERY PLAN
------------------------------------------------------------------------------
 Result  (cost=0.00..176667.00 rows=101 width=1202) (actual time=0.013..*2406.792* rows=100 loops=1)
   ->  Append  (cost=0.00..176667.00 rows=101 width=1202) (actual time=0.011..412.749 rows=100 loops=1)
         ->  Seq Scan on parenta  (cost=0.00..0.00 rows=1 width=1228) (actual time=0.001..0.001 rows=0 loops=1)
         ->  Seq Scan on childa000 parenta  (cost=0.00..176667.00 rows=100 width=1202) (actual time=0.009..334.633 rows=100 loops=1)
 Total runtime: 2446.474 ms
(5 rows)

patched$ psql -c 'explain analyze select * from parenta'
                                  QUERY PLAN
------------------------------------------------------------------------------
 Result  (cost=0.00..176667.00 rows=101 width=1202) (actual time=0.011..*507.239* rows=100 loops=1)
   ->  Append  (cost=0.00..176667.00 rows=101 width=1202) (actual time=0.011..419.420 rows=100 loops=1)
         ->  Seq Scan on parenta  (cost=0.00..0.00 rows=1 width=1228) (actual time=0.000..0.000 rows=0 loops=1)
         ->  Seq Scan on childa000 parenta  (cost=0.00..176667.00 rows=100 width=1202) (actual time=0.010..335.721 rows=100 loops=1)
 Total runtime: 545.879 ms
(5 rows)

Second, the results of configure --enable-profiling show that the execution time of ExecProject changes greatly. This is consistent with what EXPLAIN showed.

orig:
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 60.29      1.26     1.26      105     0.00     0.00  slot_deform_tuple
!30.14      1.89     0.63      100     0.00     0.00  ExecProject
  3.35      1.96     0.07      304     0.00     0.00  ExecProcNode
  0.96      1.98     0.02      102     0.00     0.00  ExecScan
  0.96      2.00     0.02   166379     0.00     0.00  TerminateBufferIO
  0.48      2.01     0.01      304     0.00     0.00  InstrStartNode
  0.48      2.02     0.01      304     0.00     0.00  InstrStopNode
! 0.48      2.03     0.01      101     0.00     0.00  ExecResult
  0.48      2.04     0.01   830718     0.00     0.00  LWLockAcquire
  0.48      2.05     0.01   506834     0.00     0.00  hash_search_with_hash_value
  0.48      2.06     0.01   341656     0.00     0.00  LockBuffer
  0.48      2.07     0.01   168383     0.00     0.00  ReadBuffer_common
  0.48      2.08     0.01        4     0.00     0.00  InstrEndLoop
  0.48      2.09     0.01                             ExecCleanTargetListLength
  0.00      2.09     0.00      205     0.00     0.00  MemoryContextReset

patched:
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 23.08      0.03     0.03      304     0.00     0.00  ExecProcNode
 15.38      0.05     0.02      102     0.00     0.00  heapgettup_pagemode
 15.38      0.07     0.02   830718     0.00     0.00  LWLockAcquire
  7.69      0.08     0.01      205     0.00     0.00  MemoryContextReset
  7.69      0.09     0.01      102     0.00     0.00  ExecScan
  7.69      0.10     0.01      100     0.00     0.00  ExecStoreTuple
  7.69      0.11     0.01   841135     0.00     0.00  LWLockRelease
  7.69      0.12     0.01   168383     0.00     0.00  ReadBuffer_common
  7.69      0.13     0.01   168383     0.00     0.00  UnpinBuffer
  0.00      0.13     0.00      304     0.00     0.00  InstrStartNode
...
! 0.00      0.13     0.00      101     0.00     0.00  ExecResult
! 0.00      0.13     0.00      100     0.00     0.00  ExecProject

==
diff --git a/src/backend/executor/nodeGroup.c
Re: [HACKERS] Enabling Checksums
On 11/12/2012 10:44 AM, Craig Ringer wrote:
> That'll make it hard for VACUUM, hint-bit setting, etc to
> opportunistically checksum pages whenever they're doing a page write
> anyway.

It *is* a hard problem, yes. And the single bit doesn't really solve it. So I'm arguing against opportunistic checksumming in general. Who needs that anyway?

> Is it absurd to suggest using another bitmap, like the FSM or
> visibility map, to store information on page checksumming while
> checksumming is enabled but incomplete?

Not absurd. But arguably inefficient, because that bitmap may well become a bottleneck itself. Plus there's the problem of making sure those pages are safe against corruption, so you'd need to checksum the checksum bitmap... doesn't sound like a nice solution to me. This has certainly been discussed before.

Regards

Markus Wanner
Re: [HACKERS] why can't plpgsql return a row-expression?
Hi,

I have tried to solve this issue. Please see the attached patch. With this patch, any expression is allowed in the return statement. For any invalid expression, an error is generated without doing any special handling. When a row expression is used in the return statement, the resultant tuple will have the rowtype in a single column that needs to be extracted. Hence I have handled that case in exec_stmt_return().

Any comments/suggestions?

Regards,
--Asif

On Mon, Oct 8, 2012 at 9:53 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Pavel Stehule pavel.steh...@gmail.com writes:
>> 2012/10/8 Tom Lane t...@sss.pgh.pa.us:
>>> Laziness, probably. Feel free to have at it.
>> I wrote a patch some years ago. It was rejected for performance
>> reasons, because every row had to be cast to the result type.
> I don't recall that patch in any detail, but it's not apparent to me
> that an extra cast step *must* be required to implement this. In the
> cases that are supported now, surely we could notice that no additional
> work is required.
>
> It's also worth commenting that if we were to switch the storage of
> composite-type plpgsql variables to HeapTuple, as has been suggested
> here for other reasons, the performance tradeoffs in this area would
> likely change completely anyway.
>
> regards, tom lane

return_rowtype.patch
Description: Binary data
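To make concrete what the patch enables, a function like the following would be allowed to return a row expression directly, instead of first assigning it to a composite variable. The table and function names here are made up for illustration, and the behavior reflects the proposal under discussion, not released PL/pgSQL:

CREATE TABLE point2d (x int, y int);

CREATE FUNCTION make_point(a int, b int) RETURNS point2d AS $$
BEGIN
    RETURN ROW(a, b);   -- a row expression in RETURN: the subject of the patch
END;
$$ LANGUAGE plpgsql;

SELECT * FROM make_point(1, 2);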
Re: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL
Greg Smith g...@2ndquadrant.com writes:
> Regardless, exactly how to prevent two concurrent processes from
> writing the same file feels like the last step in the small roadmap for
> what this feature needs.

"Write a temp file and use rename(2) to move it into place" is the standard solution for that.

regards, tom lane
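A minimal sketch of that pattern, assuming POSIX semantics (rename(2) atomically replaces the destination within one filesystem); error handling is abbreviated and the temp-file naming scheme is just an example:

#include <stdio.h>
#include <unistd.h>

/*
 * Write the new contents to a temp file, flush and fsync it, then
 * rename() it over the destination. Readers either see the old file or
 * the complete new one, never a partial write.
 */
static int
replace_file_atomically(const char *path, const char *contents)
{
	char	tmppath[1024];
	FILE   *fp;

	snprintf(tmppath, sizeof(tmppath), "%s.tmp.%d", path, (int) getpid());

	if ((fp = fopen(tmppath, "w")) == NULL)
		return -1;
	if (fputs(contents, fp) == EOF ||
		fflush(fp) == EOF ||
		fsync(fileno(fp)) != 0 ||
		fclose(fp) == EOF)
	{
		unlink(tmppath);
		return -1;
	}
	return rename(tmppath, path);	/* atomic on POSIX filesystems */
}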
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On Sun, Nov 11, 2012 at 6:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> I already pointed out the inconsistency in heap_xlog_freeze about
> whether a cleanup lock is needed. As is, this patch uses a cleanup
> lock, but I suspect that a regular lock is sufficient --- comments?

Well, according to storage/buffer/README:

> 1. To scan a page for tuples, one must hold a pin and either shared or
> exclusive content lock. To examine the commit status (XIDs and status
> bits) of a tuple in a shared buffer, one must likewise hold a pin and
> either shared or exclusive lock.

That does indeed make it sound like an x-lock is enough.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
Robert Haas robertmh...@gmail.com writes:
> On Sun, Nov 11, 2012 at 6:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
>> I already pointed out the inconsistency in heap_xlog_freeze about
>> whether a cleanup lock is needed. As is, this patch uses a cleanup
>> lock, but I suspect that a regular lock is sufficient --- comments?
> Well, according to storage/buffer/README:
> 1. To scan a page for tuples, one must hold a pin and either shared or
> exclusive content lock. To examine the commit status (XIDs and status
> bits) of a tuple in a shared buffer, one must likewise hold a pin and
> either shared or exclusive lock.
> That does indeed make it sound like an x-lock is enough.

Yeah. AFAICS, the only possible downside is that somebody might be using the tuple (while holding just a buffer pin), and that its xmin might change while that's happening. So for instance a nestloop join might read out different xmin values for the same row while the join proceeds. But that could happen anyway if a different plan had been chosen (viz, this table on the inside not the outside of the nestloop). The xmin update ought to be physically atomic, so you shouldn't be able to get a garbage result, just the old value or the new.

The cleanup lock is needed for cases where we'd like to remove or move a tuple, but that's not required here.

regards, tom lane
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
Heikki Linnakangas hlinnakan...@vmware.com writes:
> On 12.11.2012 01:25, Tom Lane wrote:
>> Worse than that, GIST WAL replay seems to be a total disaster from
>> this standpoint, and I'm not convinced that it's even fixable without
>> incompatible WAL changes.
> Hmm. After the rewrite of that logic I did in 9.1 (commit 9de3aa65f0),
> the GiST index is always in a consistent state. Before the downlink for
> a newly split page has been inserted, its left sibling is flagged so
> that a search knows to follow the right-link to find it. In normal
> operation, the lock continuity is needed to prevent another backend
> from seeing the incomplete split and trying to fix it, but in hot
> standby, we never try to fix incomplete splits anyway. So I think we're
> good on >= 9.1.

Okay. I'll update the patch to make sure that the individual WAL replay functions hold all locks, but not worry about holding locks across actions.

> The 9.0 code is broken, however. In 9.0, when a child page is split,
> the parent and new children are kept locked until the downlinks are
> inserted/updated. If a concurrent scan comes along and sees that
> incomplete state, it will miss tuples on the new right siblings. We
> rely on a rm_cleanup operation at the end of WAL replay to fix that
> situation, if the downlink insertion record is not there. I don't see
> any easy way to fix that, unfortunately. Perhaps we could backpatch the
> 9.1 rewrite, now that it's gotten some real-world testing, but it was a
> big change so I don't feel very comfortable doing that.

Me either. Given the lack of field complaints, I think we're better advised to just leave it unfixed in 9.0. It'd not be a step forward if we broke something trying to make this work.

>> Also, gistRedoPageDeleteRecord seems to be dead code? I don't see
>> anything that emits XLOG_GIST_PAGE_DELETE.
> Yep. It was used in VACUUM FULL, but has been dead since VACUUM FULL
> was removed.

Okay. I see we bumped XLOG_PAGE_MAGIC in 9.0, so there's no longer any way that 9.0 or later versions could see this WAL record type. I'll delete that replay function rather than bothering to fix it.

regards, tom lane
Re: [HACKERS] [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown
On Fri, Nov 9, 2012 at 3:03 PM, Amit Kapila amit.kap...@huawei.com wrote:
> On Thursday, November 08, 2012 10:42 PM Fujii Masao wrote:
>> On Thu, Nov 8, 2012 at 5:53 PM, Amit Kapila amit.kap...@huawei.com wrote:
>>> On Thursday, November 08, 2012 2:04 PM Heikki Linnakangas wrote:
>>>> On 19.10.2012 14:42, Amit kapila wrote:
>>>>> On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:
>>>>>> Before implementing the timeout parameter, I think that it's
>>>>>> better to change both the pg_basebackup background process and
>>>>>> pg_receivexlog so that they send back the reply message
>>>>>> immediately when they receive a keepalive message requesting a
>>>>>> reply. Currently, they always ignore such keepalive messages, so
>>>>>> the status interval parameter (-s) in them must always be set to a
>>>>>> value less than the replication timeout. We can avoid this
>>>>>> troublesome parameter setting by introducing the same logic of
>>>>>> walreceiver into both the pg_basebackup background process and
>>>>>> pg_receivexlog.
>>>>> Please find the patch attached to address the modification
>>>>> mentioned by you (send immediate reply for keepalive). Both
>>>>> basebackup and pg_receivexlog use the same function
>>>>> ReceiveXLogStream, so a single change will address the issue for
>>>>> both.
>>>> Thanks, committed this one after shuffling it around the changes I
>>>> committed yesterday. I also updated the docs to not claim that the
>>>> -s option is required to avoid timeout disconnects anymore.
>>> Thank you. However, I think the issue will still not be completely
>>> solved. pg_basebackup/pg_receivexlog can still take a long time to
>>> detect a network break, as they don't have a timeout concept. To do
>>> that, I have sent one proposal, which is mentioned at the end of this
>>> mail chain:
>>> http://archives.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382853BBED@szxeml509-mbs
>>> Do you think there is any need to introduce such a mechanism in
>>> pg_basebackup/pg_receivexlog?
>> Are you planning to introduce the timeout mechanism in the
>> pg_basebackup main process? Or the background process? It's useful to
>> implement both.
> By background process, you mean ReceiveXlogStream?

For both.

> I think for the background process, it can be done in a way similar to
> what we have done for walreceiver.

Yes.

> But I have some doubts about how to do it for the main process: logic
> similar to walreceiver's cannot be used in case the network goes down
> while getting other database files from the server. The reason is that
> to receive the data files, PQgetCopyData() is called in synchronous
> mode, so it keeps waiting for an infinite time until it gets some
> data. In order to solve this issue, I can think of the following
> options:
>
> 1. Making this call also asynchronous (but not sure about the impact
> of this).

+1. Walreceiver already calls PQgetCopyData() asynchronously. ISTM you can solve the issue in a similar way to walreceiver's.

> 2. In function pqWait, instead of passing the hard-coded value -1
> (i.e. infinite wait), we can pass some finite time. This time can be
> received as a command line argument from the respective utility and
> set in the PGconn structure. In order to have the timeout value in
> PGconn, we can:
> a. Add a new parameter in PGconn to indicate the receive timeout.
> b. Use the existing parameter connect_timeout for the receive timeout
> also, but this may lead to confusion.
>
> 3. Any other better option?
>
> Apart from the above issue, there is a possibility that if the network
> goes down during connect time, it might hang, because connect_timeout
> by default will be NULL and connectDBComplete will start waiting
> infinitely for the connection to become successful. So shall we have a
> separate command line argument for this also, or any other way as you
> suggest?
Yes, I think that we should add something like a --conninfo option to pg_basebackup and pg_receivexlog. We can easily set not only connect_timeout but also sslmode, application_name, ... by using such an option accepting a conninfo string.

>> BTW, IIRC the walsender has no timeout mechanism during sending backup
>> data to pg_basebackup. So it's also useful to implement the timeout
>> mechanism for the walsender during backup.
> Yes, it's useful, but for the walsender the main problem is that it
> uses a blocking send call to send the data. I have tried using
> tcp_keepalive settings, but the send call doesn't come out in case of
> a network break. The only way I could get it out is to change the
> corresponding file /proc/sys/net/ipv4/tcp_retries2 by using the command
>     echo 8 > /proc/sys/net/ipv4/tcp_retries2
> As per the recommendation, its value should be at least 8 (equivalent
> to 100 sec). Do you have any idea how it can be achieved?

What about using pq_putmessage_noblock()?

Regards,

--
Fujii Masao
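For reference, a sketch of the asynchronous PQgetCopyData() pattern that walreceiver uses and that the main process could adopt. This is simplified: process_message() is a hypothetical consumer, and real code must also distinguish the -1 (COPY done) and -2 (error) return values more carefully:

#include <sys/select.h>
#include "libpq-fe.h"

extern void process_message(const char *buf, int len);	/* hypothetical */

/*
 * Receive COPY data with a timeout: in async mode, PQgetCopyData()
 * returns 0 when nothing is buffered, and we then wait on the socket
 * with a deadline instead of blocking forever.
 */
static bool
copy_stream_receive(PGconn *conn, int timeout_sec)
{
	char   *buf;
	int		len;

	for (;;)
	{
		len = PQgetCopyData(conn, &buf, 1 /* async */);
		if (len > 0)
		{
			process_message(buf, len);
			PQfreemem(buf);
		}
		else if (len == 0)
		{
			fd_set			input_mask;
			struct timeval	tv = {timeout_sec, 0};

			FD_ZERO(&input_mask);
			FD_SET(PQsocket(conn), &input_mask);
			if (select(PQsocket(conn) + 1, &input_mask, NULL, NULL, &tv) == 0)
				return false;		/* timeout: connection presumed dead */
			if (!PQconsumeInput(conn))
				return false;		/* connection broken */
		}
		else
			return (len == -1);		/* -1 = COPY finished, -2 = error */
	}
}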
Re: [HACKERS] Enabling Checksums
Greg Smith wrote:
> On 11/11/12 2:56 PM, Jeff Davis wrote:
>> We could have a separate utility, pg_checksums, that can alter the
>> state and/or do an offline verification. And initdb would take an
>> option that would start everything out fully protected with checksums.
> Adding an initdb option to start out with everything checksummed seems
> an uncontroversial good first thing to have available.

+1

> Won't a pg_checksums program just grow until it looks like a limited
> version of vacuum though? It's going to iterate over most of the table;
> it needs the same cost controls as autovacuum (and to respect the load
> of concurrent autovacuum work) to keep I/O under control; and those
> cost control values might change if there's a SIGHUP to reload
> parameters. It looks so much like vacuum that I think there needs to be
> a really compelling reason to split it into something new. Why can't
> this be yet another autovacuum worker that does its thing?

I agree that much of what it's going to do is pretty much the same as vacuum, but vacuum does so many other things that I think it should be kept separate. Sure, we can make it be invoked from autovacuum in the background according to some (yet to be devised) scheduling heuristics. But I don't see that it needs to share any vacuum code.

A couple of thoughts about autovacuum: it's important to figure out whether checksumming can run concurrently with vacuuming the same table; if not, which one defers to the other in case of lock conflict. Also, can checksumming be ignored by concurrent transactions when computing Xmin? (I don't see any reason not to ...)

> One of the really common cases I was expecting here is that conversions
> are done by kicking off a slow background VACUUM CHECKSUM job that
> might run in pieces. I was thinking of an approach like this:
> -Initialize a last_checked_block value for each table
> -Loop:
> --Grab the next block after the last checked one
> --When on the last block of the relation, grab an exclusive lock to
>   protect against race conditions with extension

Note that we have a separate lock type for relation extension, so we can use that to avoid a conflict here.

> --If it's marked as checksummed and the checksum matches, skip it
> ---Otherwise, add a checksum and write it out
> --When that succeeds, update last_checked_block
> --If that was the last block, save some state saying the whole table
>   is checksummed

"Some state" can be a pg_class field that's updated via heap_inplace_update.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol
On Fri, Sep 14, 2012 at 6:42 AM, Amit kapila amit.kap...@huawei.com wrote:
> On Tuesday, September 11, 2012 7:07 PM Amit Kapila wrote:
>> On Monday, September 10, 2012 8:20 PM Amit Kapila wrote:
>>> On Sunday, September 09, 2012 1:37 PM Amit Kapila wrote:
>>>> On Friday, September 07, 2012 11:19 PM Tom Lane wrote:
>>>>> Heikki Linnakangas hlinn...@iki.fi writes:
>>>>>> Would socketpair(2) be simpler?
>>>>> I've not done anything yet about the potential security issues
>>>>> associated with untrusted libpq connection strings. I think this
>>>>> is still at the proof-of-concept stage; in particular, it's
>>>>> probably time to see if we can make it work on Windows before we
>>>>> worry more about that.
>>>> I have started working on this patch to make it work on Windows.
>>>> The 3 main things to make it work are:
>>> The patch, which contains the Windows implementation as well, is
>>> attached with this mail. It contains changes related to the following:
>>> 1. waitpid
>>> 2. socketpair
>>> 3. fork-exec
>>> The following is still left:
>>> 1. Error handling in all paths
> The modified version 2 contains error handling in all paths.

I didn't see that this patch was added to a commitfest -- should it have been? I very much like Tom's proposed starting point for this feature as a replacement for --single. Hate to see this die on the vine. Would some testing on Windows be what's needed to get the ball rolling?

merlin
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
Andres Freund and...@2ndquadrant.com writes:
> On 2012-11-10 16:24:22 -0500, Tom Lane wrote:
>> If any of this stuff were getting used by external modules, changing
>> it would be problematic ... but fortunately, it isn't, because we lack
>> support for plug-in rmgrs. So I'm not worried about back-patching the
>> change, and would prefer to keep the 9.x branches in sync.
> XLR_BKP_BLOCK_* might be used by things like pg_lesslog and it's surely
> used by xlogdump. Not sure if either are worth that much attention, but
> it seems worth noticing that such a change will break stuff.

Hm. Okay, how about we leave the old macros in place in the back branches?

regards, tom lane
Re: [HACKERS] Fix errcontext() function
On 10.11.2012 11:46, Chen Huajun wrote:
>> Unfortunately not all compilers support varargs macros. I bumped into
>> this in February, see
>> http://archives.postgresql.org/message-id/4f3b72e0.8040...@enterprisedb.com.
>> My last attempt to fix this was at
>> http://archives.postgresql.org/pgsql-hackers/2012-04/msg00812.php.
>> That patch is probably good to go, I just got busy with other things
>> and forgot about it back then. Can you take a look at that patch and
>> see if I missed anything, please?
> I think you are right, although the number of changed places is a
> little bit large. Thanks for your answer!

Ok, I've committed this patch now; it will be fixed in 9.3. Thanks for reminding me about this.

- Heikki
[HACKERS] Version 4.9 of PostgreSQL buildfarm client released
Version 4.9 of the PostgreSQL buildfarm client has been released and is available at
https://github.com/downloads/PGBuildFarm/client-code/build-farm-4_9.tgz

Changes since version 4.8:

. adjust git status checking to allow for messages from recent git releases
. add a module to test FileTextArray_FDW as an example of checking an extension
. a few small bug fixes and tweaks

You will need this release if you are using, or are likely to use, a fairly recent git version, to avoid getting spurious git-Dirty failures.

Enjoy

cheers

andrew
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On 2012-11-12 10:19:09 -0500, Tom Lane wrote:
> Andres Freund and...@2ndquadrant.com writes:
>> On 2012-11-10 16:24:22 -0500, Tom Lane wrote:
>>> If any of this stuff were getting used by external modules, changing
>>> it would be problematic ... but fortunately, it isn't, because we
>>> lack support for plug-in rmgrs. So I'm not worried about
>>> back-patching the change, and would prefer to keep the 9.x branches
>>> in sync.
>> XLR_BKP_BLOCK_* might be used by things like pg_lesslog and it's
>> surely used by xlogdump. Not sure if either are worth that much
>> attention, but it seems worth noticing that such a change will break
>> stuff.
> Hm. Okay, how about we leave the old macros in place in the back
> branches?

Sounds good to me. The RestoreBkpBlocks change seems unproblematic to me. If anything, it's good that it has been renamed.

Thanks,

Andres
Re: [HACKERS] Fix errcontext() function
Heikki Linnakangas wrote:
> On 10.11.2012 11:46, Chen Huajun wrote:
>> I think you are right, although the number of changed places is a
>> little bit large. Thanks for your answer!
> Ok, I've committed this patch now, it will be fixed in 9.3. Thanks for
> reminding me about this.

Hopefully you noticed that contrib is broken.

--
Álvaro Herrera
[HACKERS] Dumping an Extension's Script
Hi,

Please find attached to this email an RFC patch implementing the basics of the pg_dump --extension-script option.

After much discussion around the concept of an inline extension, we decided last year that a good first step would be pg_dump support for an extension's script. The approach I've been using here is to dump the script from the current catalog dependencies, which means that a sequence of CREATE EXTENSION followed by a number of ALTER EXTENSION … UPDATE … commands will be consolidated into a single CREATE EXTENSION command in the dump, much the same as with CREATE TABLE then ALTER TABLE … ADD COLUMN and the like.

Currently the option behavior is the following, which looks sane to me and is open for discussion: the dump's schema always includes the CREATE EXTENSION commands you need. The extensions listed in the -X option (which you can use more than once) will get dumped with their current member objects in a script, inline.

To try the attached patch, you could do the following:

  createdb foo
  psql -c "create extension hstore" -d foo
  pg_dump -X hstore -f /tmp/foo.sql foo
  createdb bar
  psql -1 -f /tmp/foo.sql -d bar

To be able to restore the dump, I've been adding some basic support to the CREATE EXTENSION command so that it will find the data it needs from the SQL command rather than the control file. Note that the extension control file only contains information about how to install an extension from a script file on disk. That's something we don't need at all when installing the extension from a dump, using either pg_restore or psql. We have some exceptions to that principle, namely: requires (sets the search_path) and relocatable (found in the catalogs, needs to survive dump/restore).

Given positive feedback on that way to attack the problem, the TODO list includes:

 - document the new pg_dump --extension-script switch
 - add support for ALTER EXTENSION … WITH $$ script here $$;

The ALTER EXTENSION support is optional as far as pg_dump support goes; it would be good to have it to make the user interface complete.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

*** a/src/backend/commands/extension.c
--- b/src/backend/commands/extension.c
*************** typedef struct ExtensionControlFile
*** 75,80 ****
--- 75,81 ----
  	bool		superuser;		/* must be superuser to install? */
  	int			encoding;		/* encoding of the script file, or -1 */
  	List	   *requires;		/* names of prerequisite extensions */
+ 	char	   *script;
  } ExtensionControlFile;
  
  /*
*************** parse_extension_control_file(ExtensionControlFile *control,
*** 577,586 ****
  }
  
  /*
!  * Read the primary control file for the specified extension.
   */
  static ExtensionControlFile *
! read_extension_control_file(const char *extname)
  {
  	ExtensionControlFile *control;
--- 578,587 ----
  }
  
  /*
!  * Create an ExtensionControlFile with default values.
   */
  static ExtensionControlFile *
! default_extension_control_file(const char *extname)
  {
  	ExtensionControlFile *control;
*************** read_extension_control_file(const char *extname)
*** 593,598 ****
--- 594,610 ----
  	control->superuser = true;
  	control->encoding = -1;
  
+ 	return control;
+ }
+ 
+ /*
+  * Read the primary control file for the specified extension.
+  */
+ static ExtensionControlFile *
+ read_extension_control_file(const char *extname)
+ {
+ 	ExtensionControlFile *control = default_extension_control_file(extname);
+ 
  	/*
  	 * Parse the primary control file.
  	 */
*************** execute_extension_script(Oid extensionOid, ExtensionControlFile *control,
*** 858,866 ****
  	CurrentExtensionObject = extensionOid;
  	PG_TRY();
  	{
!
! 		char *c_sql = read_extension_script_file(control, filename);
  		Datum		t_sql;
  
  		/* We use various functions that want to operate on text datums */
  		t_sql = CStringGetTextDatum(c_sql);
--- 870,883 ----
  	CurrentExtensionObject = extensionOid;
  	PG_TRY();
  	{
! 		char *c_sql;
  		Datum		t_sql;
  
+ 		if (control->script)
+ 			c_sql = control->script;
+ 		else
+ 			c_sql = read_extension_script_file(control, filename);
+ 
  		/* We use various functions that want to operate on text datums */
  		t_sql = CStringGetTextDatum(c_sql);
***************
*** 1178,1183 ****
--- 1195,1203 ----
  void
  CreateExtension(CreateExtensionStmt *stmt)
  {
  	DefElem    *d_schema = NULL;
+ 	DefElem    *d_script = NULL;
+ 	DefElem    *d_requires = NULL;
+ 	DefElem    *d_relocatable = NULL;
  	DefElem    *d_new_version = NULL;
  	DefElem    *d_old_version = NULL;
  	char	   *schemaName;
*************** CreateExtension(CreateExtensionStmt *stmt)
*** 1229,1248 ****
  				 errmsg("nested CREATE EXTENSION is not supported")));
  
  	/*
! 	 * Read the primary control file. Note we assume that it does not contain
! 	 * any non-ASCII data, so there is no need to worry about encoding at this
! 	 * point.
! 	 */
! 	pcontrol =
Re: [HACKERS] Further pg_upgrade analysis for many tables
On Sat, Nov 10, 2012 at 05:59:54PM -0500, Bruce Momjian wrote:
> On Sat, Nov 10, 2012 at 02:45:54PM -0800, Jeff Janes wrote:
>> On Sat, Nov 10, 2012 at 9:15 AM, Bruce Momjian br...@momjian.us wrote:
>>> On Fri, Nov 9, 2012 at 04:23:40PM -0800, Jeff Janes wrote:
>>>> On Fri, Nov 9, 2012 at 3:06 PM, Bruce Momjian br...@momjian.us wrote:
>>>>> Again, using SERIAL?
>>>> Yep. Odd why yours is so much slower. You didn't build git head
>>>> under --enable-cassert, did you?
>>> Yikes, you got me! I have not done performance testing in so long, I
>>> had forgotten I changed my defaults. New numbers to follow. Sorry.
>> Any chance you can do an oprofile or gprof of head's pg_dump dumping
>> out of head's server? That really should be a lot faster (since commit
>> eeb6f37d89fc60c6449ca12ef9e) than dumping out of a 9.2 server. If it
>> is not for you, I don't see how to figure it out without a profile of
>> the slow system.
> Yes, coming.

OK, here are my results. Again, apologies for posting non-linear results based on assert builds:

                 ------- 9.2 -------     ------- 9.3 -------
                 normal      bin-up      normal      bin-up
             dump  rest   dump   rest   dump  rest   dump  rest   pg_upgrade
     1       0.12  0.06   0.12   0.06   0.11  0.07   0.11  0.07       11.11
  1000       7.22  2.40   4.74   2.78   2.20  2.43   4.04  2.86       19.60
  2000       5.67  5.10   8.82   5.57   4.50  4.97   8.07  5.69       30.55
  4000      13.34 11.13  25.16  12.52   8.95 11.24  16.75 12.16       60.70
  8000      29.12 25.98  59.60  28.08  16.68 24.02  30.63 27.08      123.05
 16000      87.36 53.16 189.38  62.72  31.38 55.37  61.55 62.66      365.71

You can see the non-linear dump at 16k in 9.2, and the almost-linear in 9.3. :-)

pg_upgrade shows non-linear behavior, but that is probably because of the non-linear behavior of 9.2, and because of the two non-linear loops that Ants found, which I will address in a separate email.

Thanks for the feedback.

--
Bruce Momjian br...@momjian.us    http://momjian.us
EnterpriseDB                      http://enterprisedb.com

+ It's impossible for everything to be true. +
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
On Fri, Nov 9, 2012 at 3:31 PM, Simon Riggs si...@2ndquadrant.com wrote:
> So what we're talking about here is a new mode for COPY, that when
> requested will pre-freeze tuples when loading into a newly
> created/truncated table. If the table isn't newly created/truncated
> then we'll just ignore it and continue. I see no need to throw an
> error, since that will just cause annoying usability issues.

Actually, why not just have it work always? If people want to load frozen tuples into a table that's not newly created/truncated, why not let them? Sure, there could be MVCC violations, but as long as the behavior is opt-in, who cares? I think it'd be useful to a lot of people.

If we want to reduce (not eliminate) the potential MVCC issues, which I think would be a good idea, we could take AccessExclusiveLock on the table when COPY (FREEZE) is used. Someone using an old snapshot but accessing the table for the first time after AEL is released could still see MVCC anomalies, but at least it would rule out things changing in mid-query, which is the case that I think would be most problematic.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
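For concreteness, the usage under discussion would look something like this; the FREEZE option syntax here is the proposal being debated, not a committed feature at the time of this thread:

BEGIN;
TRUNCATE TABLE big_table;   -- new relfilenode belonging to this transaction
COPY big_table FROM '/tmp/big_table.csv' WITH (FORMAT csv, FREEZE);
COMMIT;                     -- tuples were written already frozen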
Re: [HACKERS] Further pg_upgrade analysis for many tables
On Fri, Nov 9, 2012 at 12:50:34AM -0500, Tom Lane wrote:
> Jeff Janes jeff.ja...@gmail.com writes:
>> Are you sure the server you are dumping out of is head?
> I experimented a bit with dumping/restoring 16000 tables matching
> Bruce's test case (ie, one serial column apiece). The pg_dump profile
> seems fairly flat, without any easy optimization targets. But restoring
> the dump script shows a rather interesting backend profile:
>
> samples  %        image name  symbol name
> 30861    39.6289  postgres    AtEOXact_RelationCache
>  9911    12.7268  postgres    hash_seq_search
>  2682     3.4440  postgres    init_sequence
>  2218     2.8482  postgres    _bt_compare
>  2120     2.7223  postgres    hash_search_with_hash_value
>  1976     2.5374  postgres    XLogInsert
>  1429     1.8350  postgres    CatalogCacheIdInvalidate
>  1282     1.6462  postgres    LWLockAcquire
>   973     1.2494  postgres    LWLockRelease
>   702     0.9014  postgres    hash_any
>
> The hash_seq_search time is probably mostly associated with
> AtEOXact_RelationCache, which is run during transaction commit and
> scans the relcache hashtable looking for tables created in the current
> transaction. So that's about 50% of the runtime going into that one
> activity.

Thanks for finding this. What is odd is that I am not seeing non-linear restores at 16k in git head, so I am confused how something that consumes ~50% of backend time could still perform linearly. Would this consume 50% at lower table counts? I agree we should do something, even if this is a rare case, because 50% is a large percentage.

> There are at least three ways we could whack that mole:
>
> * Run the psql script in --single-transaction mode, as I was mumbling
> about the other day. If we were doing AtEOXact_RelationCache only once,
> rather than once per CREATE TABLE statement, it wouldn't be a problem.
> Easy but has only a narrow scope of applicability.
>
> * Keep a separate list (or data structure of your choice) so that
> relcache entries created in the current xact could be found directly
> rather than having to scan the whole relcache. That'd add complexity
> though, and could perhaps be a net loss for cases where the relcache
> isn't so bloated.

I like this one. Could we do it only when the cache gets to be above a certain size, to avoid any penalty?

--
Bruce Momjian br...@momjian.us    http://momjian.us
EnterpriseDB                      http://enterprisedb.com

+ It's impossible for everything to be true. +
Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages
On Tue, Jul 31, 2012 at 8:09 AM, Amit kapila amit.kap...@huawei.com wrote:
> Based on the discussion and suggestions in this mail chain, following
> features can be implemented:
>
> 1. To compute the value of max LSN in data pages based on user input
> whether he wants it for an individual file, a particular directory or
> whole database.
>
> 2a. To search the available WAL files for the latest checkpoint record
> and prints the value.
>
> 2b. To search the available WAL files for the latest checkpoint record
> and recreates a pg_control file pointing at that checkpoint.
>
> I have kept both options to address different kind of corruption
> scenarios.

I think I can see all of those things being potentially useful. There are a couple of pending patches that will revise the WAL format slightly; not sure how much those are likely to interfere with any development you might do on (2) in the meantime.

> Based on above conclusion, I have prepared a patch which implements
> Option-1.

I wonder if we shouldn't make this a separate utility, rather than something that is part of pg_resetxlog. Anyone have a thought on that topic?

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
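For illustration, the core of option 1 is a scan over each data file that remembers the largest page LSN seen. A simplified sketch using the backend's page macros, assuming the uint64 XLogRecPtr of 9.3-era code, and ignoring segment handling, torn pages, and header validation that the real patch needs:

#include <stdio.h>
#include "postgres.h"
#include "storage/bufpage.h"

/* Return the maximum page LSN found in one relation segment file. */
static XLogRecPtr
max_lsn_of_file(const char *path)
{
	char		page[BLCKSZ];
	XLogRecPtr	max_lsn = 0;	/* InvalidXLogRecPtr */
	FILE	   *fp = fopen(path, "rb");

	if (fp == NULL)
		return max_lsn;

	while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
	{
		XLogRecPtr	lsn = PageGetLSN((Page) page);

		if (lsn > max_lsn)
			max_lsn = lsn;
	}
	fclose(fp);
	return max_lsn;
}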
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
Robert Haas robertmh...@gmail.com writes:
> On Fri, Nov 9, 2012 at 3:31 PM, Simon Riggs si...@2ndquadrant.com wrote:
>> So what we're talking about here is a new mode for COPY, that when
>> requested will pre-freeze tuples when loading into a newly
>> created/truncated table. If the table isn't newly created/truncated
>> then we'll just ignore it and continue. I see no need to throw an
>> error, since that will just cause annoying usability issues.
> Actually, why not just have it work always? If people want to load
> frozen tuples into a table that's not newly created/truncated, why not
> let them? Sure, there could be MVCC violations, but as long as the
> behavior is opt-in, who cares? I think it'd be useful to a lot of
> people.

I thought about that too, but there's a big problem. It wouldn't be just MVCC that would be broken, but transactional integrity: if the COPY fails partway through, the already-loaded rows still look valid. The new-file requirement provides a way to roll them back. I'm willing to have an option that compromises MVCC semantics transiently, but giving up transactional integrity seems a bit much.

regards, tom lane
Re: [HACKERS] Fwd: question on foreign key lock
On Thu, Nov 8, 2012 at 3:45 AM, Filip Rembiałkowski filip.rembialkow...@gmail.com wrote:
> maybe this is a better group for this question?
> I can't see why creating a foreign key on table A referencing table B
> generates an AccessExclusiveLock on B. It seems (to a layman :-) ) that
> only writes to B should be blocked.
> I'm really interested if this is either an expected effect, an open
> TODO item, or suboptimal behavior of postgres.

This comment explains it:

	/*
	 * Grab an exclusive lock on the pk table, so that someone doesn't delete
	 * rows out from under us. (Although a lesser lock would do for that
	 * purpose, we'll need exclusive lock anyway to add triggers to the pk
	 * table; trying to start with a lesser lock will just create a risk of
	 * deadlock.)
	 */
	pkrel = heap_openrv(fkconstraint->pktable, AccessExclusiveLock);

Concurrent DDL is something that's been discussed in detail on this list in the past; unfortunately, there are some tricky race conditions around the shared invalidation queue and SnapshotNow that make it hard to implement properly. I'm hoping to have some time to work on this at some point, but it hasn't happened yet.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
On Mon, Nov 12, 2012 at 11:20 AM, Tom Lane t...@sss.pgh.pa.us wrote:
> Robert Haas robertmh...@gmail.com writes:
>> On Fri, Nov 9, 2012 at 3:31 PM, Simon Riggs si...@2ndquadrant.com wrote:
>>> So what we're talking about here is a new mode for COPY, that when
>>> requested will pre-freeze tuples when loading into a newly
>>> created/truncated table. If the table isn't newly created/truncated
>>> then we'll just ignore it and continue. I see no need to throw an
>>> error, since that will just cause annoying usability issues.
>> Actually, why not just have it work always? If people want to load
>> frozen tuples into a table that's not newly created/truncated, why not
>> let them? Sure, there could be MVCC violations, but as long as the
>> behavior is opt-in, who cares? I think it'd be useful to a lot of
>> people.
> I thought about that too, but there's a big problem. It wouldn't be
> just MVCC that would be broken, but transactional integrity: if the
> COPY fails partway through, the already-loaded rows still look valid.
> The new-file requirement provides a way to roll them back. I'm willing
> to have an option that compromises MVCC semantics transiently, but
> giving up transactional integrity seems a bit much.

Hmm, good point. There might be some way around that, but figuring it out is probably material for a separate patch. But I guess that raises the question - should COPY (FREEZE) silently ignore the option for not-new relfilenodes, or should it error out? Simon proposed the former, but I'm wondering if the latter would be better.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Further pg_upgrade analysis for many tables
Bruce Momjian br...@momjian.us writes:
> On Fri, Nov 9, 2012 at 12:50:34AM -0500, Tom Lane wrote:
>> The hash_seq_search time is probably mostly associated with
>> AtEOXact_RelationCache, which is run during transaction commit and
>> scans the relcache hashtable looking for tables created in the current
>> transaction. So that's about 50% of the runtime going into that one
>> activity.
> Thanks for finding this. What is odd is that I am not seeing non-linear
> restores at 16k in git head, so I am confused how something that
> consumes ~50% of backend time could still perform linearly. Would this
> consume 50% at lower table counts?

No, the cost from that is O(N^2), though with a pretty small multiplier. 16K tables is evidently where the cost reaches the point of being significant --- if you went up from there, you'd probably start to notice an overall O(N^2) behavior.

regards, tom lane
[HACKERS] Doc patch: Document names of automatically created constraints and indexes
Hi,

Attached is a patch (against head) which documents the name of the constraint and index created when using PRIMARY KEY with CREATE TABLE, and the name of the index created when using UNIQUE. I haven't read all the docs recently, but I don't believe this is presently documented.

It's unclear to me that this is the right approach, but perhaps this will start a discussion that finds the right approach. The big problem I see is that these names are somewhat of an internal implementation detail, while at the same time being something that the user might have to concern themselves with. First, the constraint and index names are in the namespace used by the user, so there is potential for collision with user-defined constraints and indexes. Second, the only way (I know of) to remove primary-key-ness is to drop the primary key constraint, by name.

This led me right into another thought: it would be nice to have ALTER TABLE be able to drop the primary key constraint. (Then the user would not need to know the name of the constraint related to primary-key-ness.) However, it is probably more useful to be able to drop the constraint (and attendant foreign key meta-information) separately from the unique index associated with the primary key, if for no other reason than that index re-creation can be expensive and missing indexes make bad things happen.

This patch is the improvement I could come up with. Someone else can decide to commit or reject; I don't believe I can contribute much more on this at this time.

Regards,

Karl k...@meme.com
Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 445ca40..da87574 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -486,6 +486,14 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI
       primary key constraint defined for the table.  (Otherwise it
       would just be the same constraint listed twice.)
      </para>
+
+     <para>
+      The name of the index created when <literal>UNIQUE</literal> is
+      used is <literal><replaceable
+      class="PARAMETER">tablename</replaceable>_<replaceable
+      class="PARAMETER">colname</replaceable>_key</literal>.
+     </para>
+
     </listitem>
    </varlistentry>
@@ -514,6 +522,13 @@
       different from other sets of columns named by any unique
       constraint defined for the same table.
      </para>
+
+     <para>
+      The name of the primary key constraint, and the name of the
+      index which enforces uniqueness, is <literal><replaceable
+      class="PARAMETER">tablename</replaceable>_pkey</literal>.
+     </para>
+
     </listitem>
    </varlistentry>
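For concreteness, this is what the automatically chosen names look like in the common case (ignoring the truncation and collision corner cases raised in the reply below):

CREATE TABLE invoice (
    id     integer PRIMARY KEY,  -- constraint and index named "invoice_pkey"
    number text UNIQUE           -- index named "invoice_number_key"
);

-- Removing primary-key-ness currently means knowing the generated name:
ALTER TABLE invoice DROP CONSTRAINT invoice_pkey;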
Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages
Robert Haas escribió: On Tue, Jul 31, 2012 at 8:09 AM, Amit kapila amit.kap...@huawei.com wrote: I think I can see all of those things being potentially useful. There are a couple of pending patches that will revise the WAL format slightly; not sure how much those are likely to interfere with any development you might do on (2) in the meantime. Based on above conclusion, I have prepared a patch which implements Option-1 I wonder if we shouldn't make this a separate utility, rather than something that is part of pg_resetxlog. Anyone have a thought on that topic? That thought did cross my mind too. -- -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix errcontext() function
On 12.11.2012 17:52, Alvaro Herrera wrote: Hopefully you noticed that contrib is broken. Oops.. fixed. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Mon, 2012-11-12 at 09:44 +0100, Markus Wanner wrote: Can we simply write a progress indicator to pg_control or someplace saying that all pages up to X of relation Y are supposed to have valid checksums? pg_control would not be the right place for that structure. It's intended to be fixed-size (it's just a serialized C structure) and it should be smaller than a sector so that it doesn't suffer from torn pages. Not a bad approach overall, but requires some kind of new structure. And that increases the risk that it doesn't make 9.3. Right now, I'm honestly just trying to get the simplest approach that doesn't restrict these kinds of ideas if we want to do them later. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
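For the sake of discussion, the kind of small, torn-page-safe record such a structure would need might look like this (purely hypothetical names; nothing like this exists today):

    /*
     * Hypothetical progress record: fixed-size and well under one
     * 512-byte sector, for the same torn-page reason as pg_control.
     */
    typedef struct ChecksumProgressData
    {
        Oid         relid;          /* relation being converted */
        BlockNumber last_checked;   /* pages up to here carry checksums */
        uint32      crc;            /* guards this record itself */
    } ChecksumProgressData;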
Re: [HACKERS] Doc patch: Document names of automatically created constraints and indexes
Karl O. Pinc k...@meme.com writes: Attached is a patch (against head) which documents the name of the constraint and index created when using PRIMARY KEY with CREATE TABLE, and the name of the index created when using UNIQUE. I haven't read all the docs recently but I don't believe this is presently documented. This is not actually correct: it ignores the corner cases where the name would be overlength or conflict with something else. Personally I don't think this should be documented, as it's an implementation detail that we've changed in the past and may change again. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further pg_upgrade analysis for many tables
On Fri, Nov 9, 2012 at 12:50 AM, Tom Lane t...@sss.pgh.pa.us wrote: Jeff Janes jeff.ja...@gmail.com writes: Are you sure the server you are dumping out of is head? I experimented a bit with dumping/restoring 16000 tables matching Bruce's test case (ie, one serial column apiece). The pg_dump profile seems fairly flat, without any easy optimization targets. But restoring the dump script shows a rather interesting backend profile:

samples   %        image name  symbol name
30861     39.6289  postgres    AtEOXact_RelationCache
 9911     12.7268  postgres    hash_seq_search
 2682      3.4440  postgres    init_sequence
 2218      2.8482  postgres    _bt_compare
 2120      2.7223  postgres    hash_search_with_hash_value
 1976      2.5374  postgres    XLogInsert
 1429      1.8350  postgres    CatalogCacheIdInvalidate
 1282      1.6462  postgres    LWLockAcquire
  973      1.2494  postgres    LWLockRelease
  702      0.9014  postgres    hash_any

The hash_seq_search time is probably mostly associated with AtEOXact_RelationCache, which is run during transaction commit and scans the relcache hashtable looking for tables created in the current transaction. So that's about 50% of the runtime going into that one activity. There are at least three ways we could whack that mole:

* Run the psql script in --single-transaction mode, as I was mumbling about the other day. If we were doing AtEOXact_RelationCache only once, rather than once per CREATE TABLE statement, it wouldn't be a problem. Easy but has only a narrow scope of applicability.

* Keep a separate list (or data structure of your choice) so that relcache entries created in the current xact could be found directly rather than having to scan the whole relcache. That'd add complexity though, and could perhaps be a net loss for cases where the relcache isn't so bloated.

* Limit the size of the relcache (eg by aging out not-recently-referenced entries) so that we aren't incurring O(N^2) costs for scripts touching N tables. Again, this adds complexity and could be counterproductive in some scenarios.

Although there may be some workloads that access very large numbers of tables repeatedly, I bet that's not typical. Rather, I bet that a session which accesses 10,000 tables is most likely to access them just once each - and right now we don't handle that case very well; this is not the first complaint about big relcaches causing problems. On the flip side, we don't want workloads that exceed some baked-in cache size to fall off a cliff. So I think we should be looking for a solution that doesn't put a hard limit on the size of the relcache, but does provide at least some latitude to get rid of old entries. So maybe something like this. Add a flag to each relcache entry indicating whether or not it has been used. After adding 1024 entries to the relcache, scan all the entries: clear the flag if it's set, flush the entry if it's already clear. This allows the size of the relcache to grow without bound, but only if we're continuing to access the old tables in between adding new ones to the mix. As an additional safeguard, we could count the number of toplevel SQL commands that have been executed and require that a flush not be performed more often than, say, every 64 toplevel SQL commands. That way, if a single operation on an inheritance parent with many children sucks a lot of stuff into the relcache, we'll avoid cleaning it out too quickly. Maybe this is all too ad-hoc, but I feel like we don't need to overengineer this.
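A minimal sketch of that flag sweep, just to make the shape concrete (hypothetical throughout: today's relcache entries carry no such flag, and rd_recently_used / RelationCacheEvict are invented names):

    /*
     * Hypothetical clock-style sweep over the relcache, run after every
     * 1024 insertions.  Entries get one "second chance" before eviction.
     */
    static void
    RelationCacheSweep(void)
    {
        HASH_SEQ_STATUS status;
        RelIdCacheEnt *idhentry;

        hash_seq_init(&status, RelationIdCache);
        while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
        {
            Relation    rel = idhentry->reldesc;

            if (rel->rd_recently_used)          /* hypothetical flag */
                rel->rd_recently_used = false;  /* second chance */
            else
                RelationCacheEvict(rel);        /* hypothetical flush */
        }
    }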
The existing system is fine in 99% of the cases, so we really only need to find a way to detect the really egregious case where we are doing a neverending series of one-time table accesses and apply a very light tap to avoid the pain point in that case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further pg_upgrade analysis for many tables
On Sat, Nov 10, 2012 at 12:41:44PM -0500, Bruce Momjian wrote: On Sat, Nov 10, 2012 at 07:17:34PM +0200, Ants Aasma wrote: On Sat, Nov 10, 2012 at 7:10 PM, Bruce Momjian br...@momjian.us wrote: I am confused why you see a loop. transfer_all_new_dbs() does a merge-join of old/new database names, then calls gen_db_file_maps(), which loops over the relations and calls create_rel_filename_map(), which adds to the map via array indexing. I don't see any file loops in there --- can you be more specific? Sorry, I was too tired when posting that. I actually meant transfer_single_new_db(). More specifically the profile clearly showed that most of the time was spent in the two loops starting on lines 193 and 228. Wow, you are right on target. I was so focused on making logical lookups linear that I did not consider file system vm/fsm and file extension lookups. Let me think a little and I will report back. Thanks. OK, I have had some time to think about this. What the current code does is, for each database, get a directory listing to know about any vm, fsm, and 1gig extents that exist in the directory. It caches the directory listing and does full array scans looking for matches. If the tablespace changes, it creates a new directory cache and throws away the old one. This code certainly needs improvement! I can think of two solutions. The first would be to scan the database directory, and any tablespaces used by the database, sort it, then allow binary search of the directory listing looking for file prefixes that match the current relation. The second approach would be to simply try to copy the fsm, vm, and extent files, and ignore any ENOENT errors. This allows code simplification. The downside is that it doesn't pull all files with matching prefixes --- it requires pg_upgrade to _know_ what suffixes might exist in that directory. Second, it assumes there can be no number gaps in the file extent numbering (is that safe?). I need recommendations on which direction to pursue; this would only be for 9.3. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
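To illustrate the second approach, a rough sketch (copy_file() stands in for pg_upgrade's copy/link primitive; the suffix list and error handling are illustrative only, not the final pg_upgrade code):

    /*
     * Illustrative sketch: probe segments until ENOENT instead of
     * scanning the directory; assumes no gaps in segment numbering.
     */
    static void
    transfer_all_segments(const char *old_file, const char *new_file)
    {
        const char *suffixes[] = {"", "_fsm", "_vm"};
        int         i;

        for (i = 0; i < 3; i++)
        {
            int         segno;

            for (segno = 0;; segno++)
            {
                char        src[MAXPGPATH];
                char        dst[MAXPGPATH];

                if (segno == 0)
                {
                    snprintf(src, sizeof(src), "%s%s", old_file, suffixes[i]);
                    snprintf(dst, sizeof(dst), "%s%s", new_file, suffixes[i]);
                }
                else
                {
                    snprintf(src, sizeof(src), "%s%s.%d", old_file, suffixes[i], segno);
                    snprintf(dst, sizeof(dst), "%s%s.%d", new_file, suffixes[i], segno);
                }

                if (copy_file(src, dst) < 0)
                {
                    if (errno == ENOENT)
                        break;      /* ran past the last segment */
                    pg_log(PG_FATAL, "error copying \"%s\" to \"%s\": %s\n",
                           src, dst, strerror(errno));
                }
            }
        }
    }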
Re: [HACKERS] Doc patch: Document names of automatically created constraints and indexes
On 11/12/2012 10:40:00 AM, Tom Lane wrote: Karl O. Pinc k...@meme.com writes: Attached is a patch (against head) which documents the name of the constraint and index created when using PRIMARY KEY with CREATE TABLE, and the name of the index created when using UNIQUE. Personally I don't think this should be documented, as it's an implementation detail that we've changed in the past and may change again. Ok. Could ALTER TABLE use an option to drop the primary key constraint? I needed to do that, found it was not obvious, and this led me to try to improve things. Karl k...@meme.com Free Software: You don't pay back, you pay forward. -- Robert A. Heinlein -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Suggestion for --truncate-tables to pg_restore
Hi, Attached is version 4. Version 3 no longer built against head. On 10/16/2012 09:48:06 PM, Karl O. Pinc wrote: On 09/23/2012 08:52:07 PM, Karl O. Pinc wrote: On 09/23/2012 12:24:27 AM, Karl O. Pinc wrote: On 09/23/2012 12:19:07 AM, Karl O. Pinc wrote: On 09/21/2012 10:54:05 AM, Karl O. Pinc wrote: On 09/20/2012 12:24:49 PM, Karl O. Pinc wrote: I've had problems using pg_restore --data-only when restoring individual schemas (which contain data which has had bad things done to it). --clean does not work well because of dependent objects in other schemas. Since there wasn't much more to do I've gone ahead and written the patch. Works for me. Against git master. Passes regression tests, but there's no regression tests for pg_restore so this does not say much. Since there's no regression tests I've not written one. Since this is a real patch for application I've given it a new name (it's not a v2). Truncate done right before COPY, since that's what the parallel restores do. Karl k...@meme.com Free Software: You don't pay back, you pay forward. -- Robert A. Heinlein --quoted attachment-- Karl k...@meme.com Free Software: You don't pay back, you pay forward. -- Robert A. Heinlein

diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
index 1b9db6b..23b0d16 100644
--- a/doc/src/sgml/ref/pg_restore.sgml
+++ b/doc/src/sgml/ref/pg_restore.sgml
@@ -546,6 +546,26 @@
    </varlistentry>

    <varlistentry>
+    <term><option>--truncate-tables</option></term>
+    <listitem>
+     <para>
+      This option is only relevant when performing a data-only
+      restore.  It instructs <application>pg_restore</application>
+      to execute commands to truncate the target tables while the
+      data is reloaded.  Use this when restoring tables or schemas
+      and <option>--clean</option> cannot be used because dependent
+      objects would be destroyed.
+     </para>
+
+     <para>
+      The <option>--disable-triggers</option> option will almost
+      always need to be used in conjunction with this option to
+      disable check constraints on foreign keys.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
     <term><option>--use-set-session-authorization</option></term>
     <listitem>
      <para>
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 3b49395..0aaf1d3 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -101,6 +101,8 @@ typedef struct _restoreOptions
 	int			noTablespace;	/* Don't issue tablespace-related commands */
 	int			disable_triggers;	/* disable triggers during data-only
 									 * restore */
+	int			truncate_tables;	/* truncate tables during data-only
+									 * restore */
 	int			use_setsessauth;	/* Use SET SESSION AUTHORIZATION commands
 									 * instead of OWNER TO */
 	int			no_security_labels; /* Skip security label entries */
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 1fead28..c7fdc79 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -309,6 +309,11 @@ RestoreArchive(Archive *AHX)
 	if (ropt->createDB && ropt->single_txn)
 		exit_horribly(modulename, "-C and -1 are incompatible options\n");

+	/* When the schema is dropped and re-created then no point
+	 * truncating tables. */
+	if (ropt->dropSchema && ropt->truncate_tables)
+		exit_horribly(modulename, "-c and --truncate-tables are incompatible options\n");
+
 	/*
 	 * If we're going to do parallel restore, there are some restrictions.
 	 */
@@ -403,6 +408,10 @@ RestoreArchive(Archive *AHX)
 		}
 	}

+	/* Truncate tables only when restoring data. */
+	if (!ropt->dataOnly && ropt->truncate_tables)
+		exit_horribly(modulename, "--truncate-tables requires the --data-only option\n");
+
 	/*
 	 * Setup the output file if necessary.
 	 */
@@ -562,6 +571,7 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te,
 	int			retval = 0;
 	teReqs		reqs;
 	bool		defnDumped;
+	bool		truncate;

 	AH->currentTE = te;
@@ -696,14 +706,21 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te,
 			 * server, so no need to see if we should issue BEGIN.
 			 */
 			StartTransaction(AH);
+			truncate = 1;
+		}
+		else
+			/* Truncate the table when asked to. */
+			truncate = ropt->truncate_tables;

+		if (truncate)
 		{
 			/*
 			 * If the server version is >= 8.4, make sure we issue
 			 * TRUNCATE with ONLY so that child tables are not
-			 * wiped.
+			 * wiped.  If we don't know the server version
+			 * then err on the side of safety.
 			 */
 			ahprintf(AH, "TRUNCATE TABLE %s%s;\n\n",
-
Re: [HACKERS] Enabling Checksums
On Sun, 2012-11-11 at 23:55 -0500, Greg Smith wrote: Adding an initdb option to start out with everything checksummed seems an uncontroversial good first thing to have available. OK, so here's my proposal for a first patch (changes from Simon's patch):

* Add a flag to the postgres executable indicating that it should use checksums on everything. This would only be valid if bootstrap mode is also specified.
* Add a multi-state checksums flag in pg_control, that would have three states: OFF, ENABLING, and ON. It would only be set to ON during bootstrap, and in this first patch, it would not be possible to set ENABLING.
* Remove GUC and use this checksums flag everywhere.
* Use the TLI field rather than the version field of the page header.
* Incorporate page number into checksum calculation (already done).

Does this satisfy the requirements for a first step? Does it interfere with potential future work? Won't a pg_checksums program just grow until it looks like a limited version of vacuum though? We can dig into the details of that later, but I don't think it's useless, even if we do have per-table (or better) checksums. For instance, it would be useful to verify backups offline. I think it's a legitimate concern that we might reinvent some VACUUM machinery. Ideally, we'd get better online migration tools for checksums (perhaps using VACUUM) fast enough that nobody will bother introducing that kind of bloat into pg_checksums. I think it's useful to step back for a minute and consider the larger uncertainty an existing relation has, which amplifies just how ugly this situation is. The best guarantee I think online checksumming can offer is to tell the user "after transaction id X, all new data in relation R is known to be checksummed". It's slightly better than that. It's more like: we can tell you if any of your data gets corrupted after transaction X. If old data is corrupted before transaction X, then there's nothing we can do. But if it's corrupted after transaction X (even if it's old data), the checksums should catch it. Unless you do this at initdb time, any conversion case is going to have the possibility that a page is corrupted before you get to it--whether you're adding the checksum as part of a "let's add them while we're writing anyway" page update or the conversion tool is hitting it. Good point. That's why I don't think anyone will find online conversion really useful until they've done a full sweep updating the old pages. I don't entirely agree. A lot of times, you just want to know whether your disk is changing your data out from under you. Maybe you miss some cases and maybe not all of your data is protected, but just knowing which disks need to be replaced, and which RAID controllers not to buy again, is quite valuable. And the more data you get checksummed the faster you'll find out. One of the really common cases I was expecting here is that conversions are done by kicking off a slow background VACUUM CHECKSUM job that might run in pieces. Right now I'm focused on the initial patch and other fairly immediate goals, so I won't address this now. But I don't want to cut off the conversation, either. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
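The tri-state flag could be as small as this (a sketch only; the names are mine, not from the patch):

    /* Sketch of the proposed pg_control state; names are illustrative. */
    typedef enum ChecksumState
    {
        CHECKSUMS_OFF = 0,      /* no page is expected to carry a checksum */
        CHECKSUMS_ENABLING,     /* new writes checksummed; old pages mixed */
        CHECKSUMS_ON            /* every page must verify */
    } ChecksumState;

    static bool
    should_verify_page_checksum(ChecksumState state)
    {
        /* while ENABLING, unconverted pages can't be verified yet */
        return state == CHECKSUMS_ON;
    }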
Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages
Noah Misch wrote: On Sun, Nov 11, 2012 at 02:18:11AM -0300, Alvaro Herrera wrote: I will move all the open patches to CF3, unless someone beats me to it. I probably won't be able to get anything done tomorrow, so if somebody has a boring Sunday I would appreciate the help. Likewise. Many thanks. I have closed the CF (just 3 days before the next one starts, which is somewhat depressing). -- -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pg_upgrade speed for many tables
On Mon, Nov 5, 2012 at 12:08 PM, Bruce Momjian br...@momjian.us wrote: Magnus reported that a customer with a million tables was finding pg_upgrade slow. I had never considered many table to be a problem, but decided to test it. I created a database with 2k tables like this: CREATE TABLE test1990 (x SERIAL); Running the git version of pg_upgrade on that took 203 seconds. Using synchronous_commit=off dropped the time to 78 seconds. This was tested on magnetic disks with a write-through cache. (No change on an SSD with a super-capacitor.) I don't see anything unsafe about having pg_upgrade use synchronous_commit=off. I could set it just for the pg_dump reload, but it seems safe to just use it always. We don't write to the old cluster, and if pg_upgrade fails, you have to re-initdb the new cluster anyway. Patch attached. I think it should be applied to 9.2 as well. Is turning off synchronous_commit enough? What about turning off fsync? When I'm doing a pg_upgrade with thousands of tables, the shutdown checkpoint after restoring the dump to the new cluster takes a very long time, as the writer drains its operation table by opening and individually fsync-ing thousands of files. This takes about 40 ms per file, which I assume is a combination of slow lap-top disk drive, and a strange deal with ext4 which makes fsyncing a recently created file very slow. But even with faster hdd, this would still be a problem if it works the same way, with every file needing 4 rotations to be fsynced and this happens in serial. Worse, the shutdown only waits for the default of 60 seconds for the shutdown to take place before it throws an error and the entire pg_upgrade gives up. It seems to me that either the -t setting should be increased, or it should be an option to pg_upgrade. My workaround was to invoke a system-wide sync a couple of seconds after the 'pg_ctl stop' is initiated. Flushing the files wholesale seems to work to make the checkpoint writer rapidly find it has nothing to do when it tries to flush them retail. Anyway, the reason I think turning fsync off might be reasonable is that as soon as the new cluster is shut down, pg_upgrade starts overwriting most of those just-fsynced file with other files from the old cluster, and AFAICT makes no effort to fsync them. So until there is a system-wide sync after the pg_upgrade finishes, your new cluster is already in mortal danger anyway. Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pg_upgrade speed for many tables
On Mon, 2012-11-12 at 10:29 -0800, Jeff Janes wrote: When I'm doing a pg_upgrade with thousands of tables, the shutdown checkpoint after restoring the dump to the new cluster takes a very long time, as the writer drains its operation table by opening and individually fsync-ing thousands of files. This reminds me of the fix I did for initdb to sync the files. I think we do need to make sure they are sync'd, because ext4 can keep buffers around for quite a long time without cleaning them. I ended up using sync_file_range(..., SYNC_FILE_RANGE_WRITE) on linux, and posix_fadvise(..., POSIX_FADV_DONTNEED) on everything else, and that made subsequent fsyncs more efficient. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
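The initdb fix being referred to looks roughly like this (a sketch after initdb's pre-sync pass; error handling trimmed): hint writeback on every file first, then make the fsync pass, which then finds little left to do.

    #include <fcntl.h>
    #include <unistd.h>

    /*
     * Sketch: start kernel writeback on a file without waiting for it,
     * so a later fsync() is cheap.  Errors are ignored in this pass.
     */
    static void
    pre_sync_fname(const char *fname)
    {
        int         fd = open(fname, O_RDONLY);

        if (fd < 0)
            return;
    #if defined(HAVE_SYNC_FILE_RANGE)
        sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);   /* Linux */
    #elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);       /* elsewhere */
    #endif
        close(fd);
    }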
Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol
On 10 September 2012 17:50, Tom Lane t...@sss.pgh.pa.us wrote: The point of the proposal that I am making is to have a simple, low-maintenance solution for people who need a single-application database. A compromise somewhere in the middle isn't likely to be an improvement for anybody. For instance, if you want to have additional connections, you open up a whole collection of communication and authentication issues, which potential users of a single-application database don't want to cope with. So the proposal is to implement a database that can't ever have 2 or more connections. And so any data it stores cannot ever be accessed by another connection, so all forms of replication are excluded and all maintenance actions force the database to be unavailable for a period of time. Those two things are barriers of the most major kind to anybody working in an enterprise with connected data and devices. The only people that want this are people with a very short term view of the purpose of their applications, and disregard for the value and permanence of the data stored. They may not want to cope with those issues *now* but they will later and won't thank us for implementing it in a way that means it can never be achieved. To be honest, I can't believe I'm reading this. And worse, it's on our Don't Want list, and nobody has said stop. It's almost impossible to purchase a CPU these days that doesn't have multiple cores, so the whole single-process architecture is just dead. Yes, we want Postgres installed everywhere, but this isn't the way to achieve that. I agree we should allow a PostgreSQL installation to work for a single user, but I don't see that requires other changes. This idea will cause endless bugs, thinkos and severely waste our time. So without a much better justification, I don't think we should do this. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY
On 12 November 2012 16:22, Robert Haas robertmh...@gmail.com wrote: But I guess that raises the question - should COPY (FREEZE) silently ignore the option for not-new relfilenodes, or should it error out? Simon proposed the former, but I'm wondering if the latter would be better. It's got some complex pre-conditions, so having scripts fail because you mis-specified FREEZE would be annoying. The option indicates I accept the potential MVCC violation, not it will always freeze. If there is a better name... -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On 11 November 2012 23:24, Tom Lane t...@sss.pgh.pa.us wrote: Practically all WAL record types that touch multiple pages have some bug of this type. In addition to btree_xlog_split, I found that heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs, spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold locks as required to make their updates safe for concurrent queries. (I'm not totally sure about ginRedoDeletePage, but the original action definitely locks the pages simultaneously, and it's not clear that it's safe not to.) Most of these are okay in cases without any full-page images, but could fail if the wrong subset of the pages-to-be-touched were processed by RestoreBkpBlocks. Some had bugs even without that :-( Hmm, not good. Thanks for spotting. Do these changes do anything to actions that occur across multiple records? I assume not and think those are OK, agreed? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
On 12 November 2012 14:51, Tom Lane t...@sss.pgh.pa.us wrote: The 9.0 code is broken, however. In 9.0, when a child page is split, the parent and new children are kept locked until the downlinks are inserted/updated. If a concurrent scan comes along and sees that incomplete state, it will miss tuples on the new right siblings. We rely on a rm_cleanup operation at the end of WAL replay to fix that situation, if the downlink insertion record is not there. I don't see any easy way to fix that, unfortunately. Perhaps we could backpatch the 9.1 rewrite, now that it's gotten some real-world testing, but it was a big change so I don't feel very comfortable doing that. Me either. Given the lack of field complaints, I think we're better advised to just leave it unfixed in 9.0. It'd not be a step forward if we broke something trying to make this work. Agreed. Most people running 9.0 with GIST indexes have already upgraded. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
Jeff, On 11/12/2012 06:52 PM, Jeff Davis wrote: OK, so here's my proposal for a first patch (changes from Simon's patch): * Add a flag to the postgres executable indicating that it should use checksums on everything. This would only be valid if bootstrap mode is also specified. * Add a multi-state checksums flag in pg_control, that would have three states: OFF, ENABLING, and ON. It would only be set to ON during bootstrap, and in this first patch, it would not be possible to set ENABLING. * Remove GUC and use this checksums flag everywhere. * Use the TLI field rather than the version field of the page header. * Incorporate page number into checksum calculation (already done). Does this satisfy the requirements for a first step? Does it interfere with potential future work? As described before in this thread, I think we might be able to do without the "has checksum" bit, as yet another simplification. But I don't object to adding it, either. It's slightly better than that. It's more like: we can tell you if any of your data gets corrupted after transaction X. If old data is corrupted before transaction X, then there's nothing we can do. But if it's corrupted after transaction X (even if it's old data), the checksums should catch it. I (mis?)read that as Greg referring to the intermediate (enabling) state, where pages with old data may or may not have a checksum, yet. So I think it was an argument against staying in that state any longer than necessary. Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
Simon Riggs si...@2ndquadrant.com writes: On 11 November 2012 23:24, Tom Lane t...@sss.pgh.pa.us wrote: Practically all WAL record types that touch multiple pages have some bug of this type. In addition to btree_xlog_split, I found that heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs, spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold locks as required to make their updates safe for concurrent queries. (I'm not totally sure about ginRedoDeletePage, but the original action definitely locks the pages simultaneously, and it's not clear that it's safe not to.) Most of these are okay in cases without any full-page images, but could fail if the wrong subset of the pages-to-be-touched were processed by RestoreBkpBlocks. Some had bugs even without that :-( Hmm, not good. Thanks for spotting. Do these changes do anything to actions that occur across multiple records? I assume not and think those are OK, agreed? Right, we were and still are assuming that any state that exists between WAL records is consistent and safe to expose to hot-standby queries. The important thing here is that these WAL replay functions were failing to ensure that their own actions appear atomic to onlookers. This is basically hangover from pre-hot-standby coding conventions, when no such concern existed. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay
Here's an updated patch that fixes the GIST replay functions as well as the other minor issues that were mentioned. Barring objections, I'll set about back-patching this as far as 9.0. One thing that could use verification is my fix for gistRedoPageSplitRecord. AFAICS, the first page listed in the WAL record is always the original page, and the ones following it are pages that were split off from it, and can (as yet) only be reached by following right-links from the original page. As such, it should be okay to release locks on the non-first pages as soon as we've written them. We have to hold lock on the original page though to avoid letting readers follow dangling right-links. Also, the update of NSN/FOLLOW_RIGHT on the child page (if any) has to be done atomically with all this, so that has to be done before releasing the original-page lock as well. Does that sound right? regards, tom lane Attachment: restore-backup-blocks-individually-3.patch.gz -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further pg_upgrade analysis for many tables
On Mon, Nov 12, 2012 at 12:09:08PM -0500, Bruce Momjian wrote: OK, I have had some time to think about this. What the current code does is, for each database, get a directory listing to know about any vm, fsm, and 1gig extents that exist in the directory. It caches the directory listing and does full array scans looking for matches. If the tablespace changes, it creates a new directory cache and throws away the old one. This code certainly needs improvement! I can think of two solutions. The first would be to scan the database directory, and any tablespaces used by the database, sort it, then allow binary search of the directory listing looking for file prefixes that match the current relation. The second approach would be to simply try to copy the fsm, vm, and extent files, and ignore any ENOENT errors. This allows code simplification. The downside is that it doesn't pull all files with matching prefixes --- it requires pg_upgrade to _know_ what suffixes might exist in that directory. Second, it assumes there can be no number gaps in the file extent numbering (is that safe?). I need recommendations on which direction to pursue; this would only be for 9.3. I went with the second idea, patch attached. Here are the times:

                 ------- 9.2 --------   ------- 9.3 --------     pg_upgrade
                 normal      bin-up     normal      bin-up
        tables  dump  rest  dump  rest  dump  rest  dump  rest     git   patch
             1  0.12  0.06  0.12  0.06  0.11  0.07  0.11  0.07   11.11   11.02
          1000  7.22  2.40  4.74  2.78  2.20  2.43  4.04  2.86   19.60   19.25
          2000  5.67  5.10  8.82  5.57  4.50  4.97  8.07  5.69   30.55   26.67
          4000 13.34 11.13 25.16 12.52  8.95 11.24 16.75 12.16   60.70   52.31
          8000 29.12 25.98 59.60 28.08 16.68 24.02 30.63 27.08  123.05  102.78
         16000 87.36 53.16 189.38 62.72 31.38 55.37 61.55 62.66  365.71  286.00

You can see a significant speedup with those loops removed. The 16k case is improved, but still not linear. The 16k dump/restore scale looks fine, so it must be something in pg_upgrade, or in the kernel. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +

diff --git a/contrib/pg_upgrade/relfilenode.c b/contrib/pg_upgrade/relfilenode.c
new file mode 100644
index 33a867f..0a49a23
*** a/contrib/pg_upgrade/relfilenode.c
--- b/contrib/pg_upgrade/relfilenode.c
***************
*** 17,25 ****
  static void transfer_single_new_db(pageCnvCtx *pageConverter,
  					   FileNameMap *maps, int size);
! static void transfer_relfile(pageCnvCtx *pageConverter,
! 				 const char *fromfile, const char *tofile,
! 				 const char *nspname, const char *relname);

  /*
--- 17,24 ----
  static void transfer_single_new_db(pageCnvCtx *pageConverter,
  					   FileNameMap *maps, int size);
! static int	transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
! 				 const char *suffix);

  /*
*************** static void
*** 131,185 ****
  transfer_single_new_db(pageCnvCtx *pageConverter,
  					   FileNameMap *maps, int size)
  {
- 	char		old_dir[MAXPGPATH];
- 	char		file_pattern[MAXPGPATH];
- 	char	  **namelist = NULL;
- 	int			numFiles = 0;
  	int			mapnum;
! 	int			fileno;
! 	bool		vm_crashsafe_change = false;
!
! 	old_dir[0] = '\0';
!
  	/* Do not copy non-crashsafe vm files for binaries that assume crashsafety */
  	if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_CRASHSAFE_CAT_VER &&
  		new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
! 		vm_crashsafe_change = true;

  	for (mapnum = 0; mapnum < size; mapnum++)
  	{
! 		char		old_file[MAXPGPATH];
! 		char		new_file[MAXPGPATH];
!
! 		/* Changed tablespaces?  Need a new directory scan? */
! 		if (strcmp(maps[mapnum].old_dir, old_dir) != 0)
! 		{
! 			if (numFiles > 0)
! 			{
! 				for (fileno = 0; fileno < numFiles; fileno++)
! 					pg_free(namelist[fileno]);
! 				pg_free(namelist);
! 			}
!
! 			snprintf(old_dir, sizeof(old_dir), "%s", maps[mapnum].old_dir);
! 			numFiles = load_directory(old_dir, &namelist);
! 		}
!
! 		/* Copying files might take some time, so give feedback. */
!
! 		snprintf(old_file, sizeof(old_file), "%s/%u", maps[mapnum].old_dir,
! 				 maps[mapnum].old_relfilenode);
! 		snprintf(new_file, sizeof(new_file), "%s/%u", maps[mapnum].new_dir,
! 				 maps[mapnum].new_relfilenode);
! 		pg_log(PG_REPORT, OVERWRITE_MESSAGE, old_file);
!
! 		/*
! 		 * Copy/link the relation's primary file (segment 0 of main fork)
! 		 * to the new cluster
! 		 */
! 		unlink(new_file);
! 		transfer_relfile(pageConverter, old_file, new_file,
! 						 maps[mapnum].nspname, maps[mapnum].relname);

  		/* fsm/vm files added in PG 8.4 */
  		if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
--- 130,148 ----
  transfer_single_new_db(pageCnvCtx
Re: [HACKERS] Further pg_upgrade analysis for many tables
On Mon, Nov 12, 2012 at 03:59:27PM -0500, Bruce Momjian wrote: The second approach would be to simply try to copy the fsm, vm, and extent files, and ignore any ENOENT errors. This allows code simplification. The downside is that it doesn't pull all files with matching prefixes --- it requires pg_upgrade to _know_ what suffixes might exist in that directory. Second, it assumes there can be no number gaps in the file extent numbering (is that safe?). I need recommendations on which direction to pursue; this would only be for 9.3. I went with the second idea, patch attached. Here are the times:

                 ------- 9.2 --------   ------- 9.3 --------     pg_upgrade
                 normal      bin-up     normal      bin-up
        tables  dump  rest  dump  rest  dump  rest  dump  rest     git   patch
             1  0.12  0.06  0.12  0.06  0.11  0.07  0.11  0.07   11.11   11.02
          1000  7.22  2.40  4.74  2.78  2.20  2.43  4.04  2.86   19.60   19.25
          2000  5.67  5.10  8.82  5.57  4.50  4.97  8.07  5.69   30.55   26.67
          4000 13.34 11.13 25.16 12.52  8.95 11.24 16.75 12.16   60.70   52.31
          8000 29.12 25.98 59.60 28.08 16.68 24.02 30.63 27.08  123.05  102.78
         16000 87.36 53.16 189.38 62.72 31.38 55.37 61.55 62.66  365.71  286.00

You can see a significant speedup with those loops removed. The 16k case is improved, but still not linear. The 16k dump/restore scale looks fine, so it must be something in pg_upgrade, or in the kernel. It is possible that the poor 16k pg_upgrade value is caused by the poor 9.2 binary-upgrade number (189.38). Perhaps I need to hack up pg_upgrade to allow a 9.3 to 9.3 upgrade to test this. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further pg_upgrade analysis for many tables
Bruce Momjian escribió: --- 17,24 static void transfer_single_new_db(pageCnvCtx *pageConverter, FileNameMap *maps, int size); ! static int transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map, ! const char *suffix); Uh, does this code assume that forks other than the main one are not split in segments? I think that's a bug, is it not? -- -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further pg_upgrade analysis for many tables
Bruce Momjian escribió: It is possible that the poor 16k pg_upgrade value is caused by the poor 9.2 binary-upgrade number (189.38). Perhaps I need to hack up pg_upgrade to allow a 9.3 to 9.3 upgrade to test this. Hmm? This already works, since make check uses it, right? -- -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol
On Mon, Nov 12, 2012 at 1:21 PM, Simon Riggs si...@2ndquadrant.com wrote: On 10 September 2012 17:50, Tom Lane t...@sss.pgh.pa.us wrote: The point of the proposal that I am making is to have a simple, low-maintenance solution for people who need a single-application database. A compromise somewhere in the middle isn't likely to be an improvement for anybody. For instance, if you want to have additional connections, you open up a whole collection of communication and authentication issues, which potential users of a single-application database don't want to cope with. So the proposal is to implement a database that can't ever have 2 or more connections. And so any data it stores cannot ever be accessed by another connection, so all forms of replication are excluded and all maintenance actions force the database to be unavailable for a period of time. Those two things are barriers of the most major kind to anybody working in an enterprise with connected data and devices. The only people that want this are people with a very short term view of the purpose of their applications, and disregard for the value and permanence of the data stored. They may not want to cope with those issues *now* but they will later and won't thank us for implementing it in a way that means it can never be achieved. To be honest, I can't believe I'm reading this. And worse, it's on our Don't Want list, and nobody has said stop. It's almost impossible to purchase a CPU these days that doesn't have multiple cores, so the whole single-process architecture is just dead. Yes, we want Postgres installed everywhere, but this isn't the way to achieve that. I agree we should allow a PostgreSQL installation to work for a single user, but I don't see that requires other changes. This idea will cause endless bugs, thinkos and severely waste our time. So without a much better justification, I don't think we should do this. I couldn't disagree more. The patch is small, logical, and fixes an awful problem, namely that --single mode is basically unusable. As to your wider point (namely, that you can't connect to it, therefore it's bad), it has got to be refuted by numerous competing solutions in the market such as http://www.firebirdsql.org/manual/fbmetasecur-embedded.html, and many others. While it's not as common as it used to be, now and then a piece of software needs to distribute an application as part of a boxed product. Postgres is horrible at this and doesn't have to be; imagine how much easier the lives of poker tracker would be (for *some* of its users) with an integrated standalone mode: google 'poker tracker postgresql' and take a good long look at problems people face in this scenario. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further pg_upgrade analysis for many tables
On Mon, Nov 12, 2012 at 06:14:59PM -0300, Alvaro Herrera wrote: Bruce Momjian escribió: --- 17,24 ---- static void transfer_single_new_db(pageCnvCtx *pageConverter, FileNameMap *maps, int size); ! static int transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map, ! const char *suffix); Uh, does this code assume that forks other than the main one are not split in segments? I think that's a bug, is it not? Oh, yeah, I must have fixed this long ago. It only fails if you use tablespaces:

    if (os_info.num_tablespaces > 0 &&
        strcmp(old_cluster.tablespace_suffix, new_cluster.tablespace_suffix) == 0)
        pg_log(PG_FATAL,
               "Cannot upgrade to/from the same system catalog version when\n"
               "using tablespaces.\n");

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further pg_upgrade analysis for many tables
On Mon, Nov 12, 2012 at 12:09:08PM -0500, Bruce Momjian wrote: The second approach would be to simply try to copy the fsm, vm, and extent files, and ignore any ENOENT errors. This allows code simplification. The downside is that it doesn't pull all files with matching prefixes --- it requires pg_upgrade to _know_ what suffixes might exist in that directory. Second, it assumes there can be no number gaps in the file extent numbering (is that safe?). Seems our code does the same kind of segment number looping I was suggesting for pg_upgrade, so I think I am safe:

    /*
     * Note that because we loop until getting ENOENT, we will correctly
     * remove all inactive segments as well as active ones.
     */
    for (segno = 1;; segno++)
    {
        sprintf(segpath, "%s.%u", path, segno);
        if (unlink(segpath) < 0)
        {
            /* ENOENT is expected after the last segment... */
            if (errno != ENOENT)
                ereport(WARNING,
                        (errcode_for_file_access(),
                         errmsg("could not remove file \"%s\": %m", segpath)));
            break;
        }
    }

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol
On 12 November 2012 21:26, Merlin Moncure mmonc...@gmail.com wrote: I couldn't disagree more. The patch is small, logical, and fixes an awful problem, namely that --single mode is basically unusable. As to your wider point (namely, that you can't connect to it, therefore it's bad), it has got to be refuted by numerous competing solutions in the market such as http://www.firebirdsql.org/manual/fbmetasecur-embedded.html, and many others. Small is not an argument in favour, just a statement of ease, like jumping off a cliff. i.e. lemmings. While it's not as common as it used to be, now and then a piece of software needs to distribute an application as part of a boxed product. Postgres is horrible at this and doesn't have to be; imagine how much easier the lives of poker tracker would be (for *some* of its users) with an integrated standalone mode: google 'poker tracker postgresql' and take a good long look at problems people face in this scenario. I get the installability thang, very very much, I just don't see the single process thing as the only solution. At very least an open minded analysis of the actual problem and ways of solving it is called for, not just reach for a close to hand solution. I don't ever want to hear someone reject a patch cos it would mess up poker tracker. The point is it complicates the code, introduces restrictions into what is possible and is just more inertia onto development. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further pg_upgrade analysis for many tables
On Fri, Nov 9, 2012 at 10:50 AM, Jeff Janes jeff.ja...@gmail.com wrote: On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane t...@sss.pgh.pa.us wrote: Jeff Janes jeff.ja...@gmail.com writes: Are you sure the server you are dumping out of is head? I experimented a bit with dumping/restoring 16000 tables matching Bruce's test case (ie, one serial column apiece). The pg_dump profile seems fairly flat, without any easy optimization targets. But restoring the dump script shows a rather interesting backend profile:

samples  %        image name  symbol name
30861    39.6289  postgres    AtEOXact_RelationCache
 9911    12.7268  postgres    hash_seq_search
...

There are at least three ways we could whack that mole: * Run the psql script in --single-transaction mode, as I was mumbling about the other day. If we were doing AtEOXact_RelationCache only once, rather than once per CREATE TABLE statement, it wouldn't be a problem. Easy but has only a narrow scope of applicability. That is effective when loading into 9.3 (assuming you make max_locks_per_transaction large enough). But when loading into pre-9.3 releases, using --single-transaction will evoke the quadratic behavior in the resource owner/lock table and make things worse rather than better. Using --single-transaction gets around the AtEOXact_RelationCache quadratic, but it activates another quadratic behavior, this one in get_tabstat_entry. That is a good trade-off because that one has a lower constant, but it is still going to bite. Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
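For reference, the get_tabstat_entry() behavior being described is, schematically, a linear scan per lookup (a simplified sketch; the array and helper names here are illustrative, not the actual pgstat.c identifiers):

    /*
     * Simplified shape of the per-backend table-stats lookup: touching
     * N distinct tables in one transaction does 1 + 2 + ... + N
     * comparisons, i.e. O(N^2) work in total.
     */
    static PgStat_TableStatus *
    get_tabstat_entry_sketch(Oid rel_id)
    {
        int         i;

        for (i = 0; i < n_tabstat_entries; i++)     /* linear scan */
        {
            if (tabstat_entries[i].t_id == rel_id)
                return &tabstat_entries[i];
        }
        return append_new_tabstat_entry(rel_id);    /* not found: add one */
    }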
[HACKERS] vacuumlo - use a cursor
vacuumlo is rather simpleminded about dealing with the list of LOs to be removed - it just fetches them as a straight resultset. For one of our clients this resulted in an out of memory condition. The attached patch tries to remedy that by using a cursor instead. If this is wanted I will add it to the next commitfest. The actual changes are very small - most of the patch is indentation changes due to the introduction of an extra loop. cheers andrew

*** a/contrib/vacuumlo/vacuumlo.c
--- b/contrib/vacuumlo/vacuumlo.c
***************
*** 290,362 **** vacuumlo(const char *database, const struct _param * param)
  	PQclear(res);

  	buf[0] = '\0';
! 	strcat(buf, "SELECT lo FROM vacuum_l");
! 	res = PQexec(conn, buf);
! 	if (PQresultStatus(res) != PGRES_TUPLES_OK)
! 	{
! 		fprintf(stderr, "Failed to read temp table:\n");
! 		fprintf(stderr, "%s", PQerrorMessage(conn));
! 		PQclear(res);
  		PQfinish(conn);
  		return -1;
! 	}

- 	matched = PQntuples(res);
  	deleted = 0;
! 	for (i = 0; i < matched; i++)
  	{
! 		Oid			lo = atooid(PQgetvalue(res, i, 0));

! 		if (param->verbose)
  		{
! 			fprintf(stdout, "\rRemoving lo %6u   ", lo);
! 			fflush(stdout);
  		}
! 		if (param->dry_run == 0)
  		{
! 			if (lo_unlink(conn, lo) < 0)
  			{
! 				fprintf(stderr, "\nFailed to remove lo %u: ", lo);
! 				fprintf(stderr, "%s", PQerrorMessage(conn));
! 				if (PQtransactionStatus(conn) == PQTRANS_INERROR)
  				{
! 					success = false;
! 					break;
  				}
  			}
  			else
  				deleted++;
! 		}
! 		else
! 			deleted++;
! 		if (param->transaction_limit > 0 &&
! 			(deleted % param->transaction_limit) == 0)
! 		{
! 			res2 = PQexec(conn, "commit");
! 			if (PQresultStatus(res2) != PGRES_COMMAND_OK)
  			{
! 				fprintf(stderr, "Failed to commit transaction:\n");
! 				fprintf(stderr, "%s", PQerrorMessage(conn));
  				PQclear(res2);
! 				PQclear(res);
! 				PQfinish(conn);
! 				return -1;
! 			}
! 			PQclear(res2);
! 			res2 = PQexec(conn, "begin");
! 			if (PQresultStatus(res2) != PGRES_COMMAND_OK)
! 			{
! 				fprintf(stderr, "Failed to start transaction:\n");
! 				fprintf(stderr, "%s", PQerrorMessage(conn));
  				PQclear(res2);
- 				PQclear(res);
- 				PQfinish(conn);
- 				return -1;
  			}
- 			PQclear(res2);
  		}
  	}
  	PQclear(res);

  	/*
--- 290,389 ----
  	PQclear(res);

  	buf[0] = '\0';
! 	strcat(buf,
! 		   "DECLARE myportal CURSOR WITH HOLD FOR SELECT lo FROM vacuum_l");
! 	res = PQexec(conn, buf);
! 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
! 	{
! 		fprintf(stderr, "DECLARE CURSOR failed: %s", PQerrorMessage(conn));
! 		PQclear(res);
  		PQfinish(conn);
  		return -1;
! 	}
! 	PQclear(res);
!
! 	snprintf(buf, BUFSIZE, "FETCH FORWARD " INT64_FORMAT " IN myportal",
! 			 param->transaction_limit > 0 ? param->transaction_limit : 1000);

  	deleted = 0;
!
! 	while (1)
  	{
! 		res = PQexec(conn, buf);
! 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
! 		{
! 			fprintf(stderr, "Failed to read temp table:\n");
! 			fprintf(stderr, "%s", PQerrorMessage(conn));
! 			PQclear(res);
! 			PQfinish(conn);
! 			return -1;
! 		}
!
! 		matched = PQntuples(res);
!
! 		if (matched <= 0)
  		{
! 			/* at end of resultset */
! 			break;
  		}
!
! 		for (i = 0; i < matched; i++)
  		{
! 			Oid			lo = atooid(PQgetvalue(res, i, 0));
!
! 			if (param->verbose)
! 			{
! 				fprintf(stdout, "\rRemoving lo %6u   ", lo);
! 				fflush(stdout);
! 			}
!
! 			if (param->dry_run == 0)
  			{
! 				if (lo_unlink(conn, lo) < 0)
  				{
! 					fprintf(stderr, "\nFailed to remove lo %u: ", lo);
! 					fprintf(stderr, "%s", PQerrorMessage(conn));
! 					if (PQtransactionStatus(conn) == PQTRANS_INERROR)
! 					{
! 						success = false;
! 						break;
! 					}
  				}
+ 				else
+ 					deleted++;
  			}
  			else
  				deleted++;
!
! 			if (param->transaction_limit > 0 &&
! 				(deleted % param->transaction_limit) == 0)
  			{
! 				res2 = PQexec(conn, "commit");
! 				if (PQresultStatus(res2) != PGRES_COMMAND_OK)
! 				{
! 					fprintf(stderr, "Failed to commit transaction:\n");
! 					fprintf(stderr, "%s", PQerrorMessage(conn));
! 					PQclear(res2);
! 					PQclear(res);
! 					PQfinish(conn);
! 					return -1;
! 				}
  				PQclear(res2);
! 				res2 = PQexec(conn, "begin");
! 				if (PQresultStatus(res2) != PGRES_COMMAND_OK)
! 				{
! 					fprintf(stderr, "Failed to start transaction:\n");
! 					fprintf(stderr, "%s", PQerrorMessage(conn));
! 					PQclear(res2);
! 					PQclear(res);
! 					PQfinish(conn);
! 					return -1;
! 				}
  				PQclear(res2);
  			}
  		}
  	}
+ 	PQclear(res);

  	/*

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further pg_upgrade analysis for many tables
On 12 November 2012 16:51, Robert Haas robertmh...@gmail.com wrote: Although there may be some workloads that access very large numbers of tables repeatedly, I bet that's not typical. Transactions with large numbers of DDL statements are typical at upgrade (application or database release level) and the execution time of those is critical to availability. I'm guessing you mean large numbers of tables and accessing each one multiple times? Rather, I bet that a session which accesses 10,000 tables is most likely to access them just once each - and right now we don't handle that case very well; this is not the first complaint about big relcaches causing problems. pg_restore frequently accesses tables more than once as it runs, but not more than a dozen times each, counting all types of DDL. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further pg_upgrade analysis for many tables
On Mon, Nov 12, 2012 at 5:17 PM, Simon Riggs si...@2ndquadrant.com wrote: On 12 November 2012 16:51, Robert Haas robertmh...@gmail.com wrote: Although there may be some workloads that access very large numbers of tables repeatedly, I bet that's not typical. Transactions with large numbers of DDL statements are typical at upgrade (application or database release level) and the execution time of those is critical to availability. I'm guessing you mean large numbers of tables and accessing each one multiple times? Yes, that is what I meant. Rather, I bet that a session which accesses 10,000 tables is most likely to access them just once each - and right now we don't handle that case very well; this is not the first complaint about big relcaches causing problems. pg_restore frequently accesses tables more than once as it runs, but not more than a dozen times each, counting all types of DDL. Hmm... yeah. Some of those accesses are probably one right after another so any cache-flushing behavior would be fine; but index creations for example might happen quite a bit later in the file, IIRC. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On 11/12/12 4:44 AM, Craig Ringer wrote: Is it absurd to suggest using another bitmap, like the FSM or visibility map, to store information on page checksumming while checksumming is enabled but incomplete? I spent some time thinking about that last week. One problem with it is that the bitmap structure itself has the same issues as every other write here--how do we know it's going to disk accurately? The "put checksum-on bits on the page" idea and the "put checksum-on bits in a map" idea have the same fundamental issue. Things might get out of sync in the same way; you've just moved the potentially suspicious write to a new place. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Mon, 2012-11-12 at 20:44 +0100, Markus Wanner wrote: As described before in this thread, I think we might be able to do without the "has checksum" bit, as yet another simplification. But I don't object to adding it, either. I see. For a first patch, I guess that's OK. Might as well make it as simple as possible. We probably need to decide what to do there before 9.3 is released though. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On 11/12/12 3:44 AM, Markus Wanner wrote: Sorry if that has been discussed before, but can't we do without that bit at all? It adds a checksum switch to each page, where we just agreed we don't even want a per-database switch. Once you accept that eventually there need to be online conversion tools, there needs to be some easy way to distinguish which pages have been processed for several potential implementations. The options seem to be adding some bits just for that or bumping the page format. I would like to just bump the format, but that has a pile of its own issues to cross. Rather not make that a requirement for this month's requirements. Can we simply write a progress indicator to pg_control or someplace saying that all pages up to X of relation Y are supposed to have valid checksums? All of the table-based checksum enabling ideas seem destined to add metadata to pg_class or something related to it for this purpose. While I think everyone agrees that this is a secondary priority to getting basic cluster-level checksums going right now, I'd like to have at least a prototype for that before 9.3 development ends. I realize this doesn't support Jesper's use case of wanting to have the checksums only for newly dirtied pages. However, I'd argue that prolonging the migration to spread the load would allow even big shops to go through this without much of an impact on performance. I'm thinking of this in some ways like the way creation of a new (but not yet valid) foreign key works. Once that's active, new activity is immediately protected moving forward. And eventually there's this cleanup step needed, one that you can inch forward over a few days. The main upper limit on load spreading here is that the conversion program may need to grab a snapshot. In that case the conversion taking too long will be a problem, as it blocks other vacuum activity past that point. This is why I think any good solution to this problem needs to incorporate restartable conversion. We were just getting complaints recently about how losing a CREATE INDEX CONCURRENTLY session can cause the whole process to end and need to be started over. The way autovacuum runs right now it can be stopped and restarted later, with only a small loss of duplicated work in many common cases. If it's possible to maintain that property for the checksum conversion, that would be very helpful to larger sites. It doesn't matter if adding checksums to the old data takes a week if you throttle the load down, so long as you're not forced to hold an open snapshot the whole time. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
Jeff, OK, so here's my proposal for a first patch (changes from Simon's patch):

* Add a flag to the postgres executable indicating that it should use checksums on everything. This would only be valid if bootstrap mode is also specified.
* Add a multi-state checksums flag in pg_control that would have three states: OFF, ENABLING, and ON. It would only be set to ON during bootstrap, and in this first patch, it would not be possible to set ENABLING.
* Remove the GUC and use this checksums flag everywhere.
* Use the TLI field rather than the version field of the page header.
* Incorporate the page number into the checksum calculation (already done).

Does this satisfy the requirements for a first step? Does it interfere with potential future work?

So the idea of this implementation is that checksums are something you set at initdb time, and if you want checksums on an existing database, it's a migration process (e.g. dump and reload)? I think that's valid as a first cut at this. We'll need interruptible VACUUM CHECKSUM later, but we don't have to have it for the first version of the feature.

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
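To make the multi-state flag concrete, here is a minimal sketch of how the three states could drive the write and verify paths. The names are illustrative, not from the actual patch, and the ENABLING semantics shown (write but don't verify) are one plausible interpretation of the proposal:

/* Hypothetical sketch of the tri-state checksum flag in pg_control. */
typedef enum ChecksumState
{
    CHECKSUMS_OFF,      /* no pages are expected to carry checksums */
    CHECKSUMS_ENABLING, /* write checksums, but don't verify on read yet */
    CHECKSUMS_ON        /* write checksums and verify on every read */
} ChecksumState;

static inline int
checksum_write_needed(ChecksumState state)
{
    return state != CHECKSUMS_OFF;
}

static inline int
checksum_verify_needed(ChecksumState state)
{
    /* verification only once every page is known to be checksummed */
    return state == CHECKSUMS_ON;
}

The first patch would then only ever flip the flag from CHECKSUMS_OFF to CHECKSUMS_ON during bootstrap; the ENABLING state is reserved for a later online-conversion patch.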
[HACKERS] Index only scans wiki page
https://wiki.postgresql.org/wiki/Index-only_scans

This page is seriously out of date. It suggests they are not yet implemented, but only being talked about. Would someone who knows a lot about the subject (which is why I'm sending this to hackers, not web) like to take a whack at updating this? Otherwise, I'd like to just replace the whole page with a discussion of the limitations and trade-offs involved in using them as they currently exist.

Thanks, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index only scans wiki page
On 13 November 2012 01:03, Jeff Janes jeff.ja...@gmail.com wrote: https://wiki.postgresql.org/wiki/Index-only_scans This page is seriously out of date. It suggests they are not yet implemented, but only being talked about. Attached is a piece I wrote on the feature. That might form the basis of a new wiki page. Feel free to incorporate this material as you see fit. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services index_only_scans.rst Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doc patch to See Also: CREATE TABLE AS in CREATE TABLE docs
On Sun, 2012-09-23 at 21:28 -0500, Karl O. Pinc wrote: Patch to add CREATE TABLE AS to the See Also: section of the CREATE TABLE docs. committed -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] support for LDAP URLs
Here is a patch to support RFC 2255 LDAP URLs in pg_hba.conf. So, instead of, say

host ... ldap ldapserver=ldap.example.net ldapbasedn="dc=example, dc=net" ldapsearchattribute=uid

you could write

host ... ldap ldapurl="ldap://ldap.example.net/dc=example,dc=net?uid?sub"

Apache and probably other software uses the same format, and it's easier to have a common format for all such configuration instead of having to translate the information provided by the LDAP admin into each software's particular configuration spellings.

I'm using the OpenLDAP-provided URL parsing routine, which means this wouldn't be supported on Windows. But we already support different authentication settings on different platforms, so this didn't seem such a big problem.

diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index d053fce..ba6523b 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -1486,6 +1486,39 @@ <title>LDAP Authentication</title>
       </para>
      </listitem>
     </varlistentry>
+    <varlistentry>
+     <term><literal>ldapurl</literal></term>
+     <listitem>
+      <para>
+       You can write most of the LDAP options alternatively using an RFC 2255
+       LDAP URL.  The format is
+<synopsis>
+ldap://[<replaceable>user</replaceable>[:<replaceable>password</replaceable>]@]<replaceable>host</replaceable>[:<replaceable>port</replaceable>]/<replaceable>basedn</replaceable>[?[<replaceable>attribute</replaceable>][?[<replaceable>scope</replaceable>]]]
+</synopsis>
+       <replaceable>scope</replaceable> must be one
+       of <literal>base</literal>, <literal>one</literal>, <literal>sub</literal>,
+       typically the latter.  Only one attribute is used, and some other
+       components of standard LDAP URLs such as filters and extensions are
+       not supported.
+      </para>
+
+      <para>
+       If you want encrypted LDAP connections, you have to use
+       the <literal>ldaptls</literal> option in addition
+       to <literal>ldapurl</literal>.  The <literal>ldaps</literal> URL
+       scheme (direct SSL connection) is not supported.
+      </para>
+
+      <para>
+       Some other software that supports authentication against LDAP uses the
+       same URL format, so it will be easier to share the configuration.
+      </para>
+
+      <para>
+       LDAP URLs are currently only supported with OpenLDAP, not on Windows.
+      </para>
+     </listitem>
+    </varlistentry>
    </variablelist>
   </para>
@@ -1520,6 +1553,13 @@ <title>LDAP Authentication</title>
    If that second connection succeeds, the database access is granted.
   </para>
+  <para>
+   Here is the same search+bind configuration written as a URL:
+<programlisting>
+host ... ldap ldapurl="ldap://ldap.example.net/dc=example,dc=net?uid?sub"
+</programlisting>
+  </para>
+
   <tip>
    <para>
     Since LDAP often uses commas and spaces to separate the different
diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index ca470e1..cc1140d 100644
--- a/src/backend/libpq/auth.c
+++ b/src/backend/libpq/auth.c
@@ -2209,7 +2209,7 @@ static int pam_passwd_conv_proc(int num_msg, const struct pam_message ** msg,
     r = ldap_search_s(ldap,
                       port->hba->ldapbasedn,
-                      LDAP_SCOPE_SUBTREE,
+                      port->hba->ldapscope,
                       filter,
                       attributes,
                       0,
diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index 7502e82..ba630a7 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -37,6 +37,16 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
+#ifdef USE_LDAP
+#ifndef WIN32
+/* We use a deprecated function to keep the codepath the same as win32. */
+#define LDAP_DEPRECATED 1
+#include <ldap.h>
+#else
+#include <winldap.h>
+#endif
+#endif
+
 #define atooid(x)  ((Oid) strtoul((x), NULL, 10))
 #define atoxid(x)  ((TransactionId) strtoul((x), NULL, 10))
@@ -1336,7 +1346,7 @@ static bool parse_hba_auth_opt(char *name, char *val, HbaLine *hbaline,
     {
         ereport(LOG,
                 (errcode(ERRCODE_CONFIG_FILE_ERROR),
-                 errmsg("cannot use ldapbasedn, ldapbinddn, ldapbindpasswd, or ldapsearchattribute together with ldapprefix"),
+                 errmsg("cannot use ldapbasedn, ldapbinddn, ldapbindpasswd, ldapsearchattribute, or ldapurl together with ldapprefix"),
                  errcontext("line %d of configuration file \"%s\"", line_num, HbaFileName)));
         return NULL;
@@ -1378,6 +1388,8 @@ static bool parse_hba_auth_opt(char *name, char *val, HbaLine *hbaline,
                     int line_num)
 {
+    hbaline->ldapscope = LDAP_SCOPE_SUBTREE;
+
     if (strcmp(name, "map") == 0)
     {
         if (hbaline->auth_method != uaIdent
@@ -1437,6 +1449,54 @@ static bool parse_hba_auth_opt(char *name, char *val, HbaLine *hbaline,
         REQUIRE_AUTH_OPTION(uaPAM, "pamservice", "pam");
         hbaline->pamservice = pstrdup(val);
     }
+    else if (strcmp(name, "ldapurl") == 0)
+    {
+        LDAPURLDesc *urldata;
+        int         rc;
+
+        REQUIRE_AUTH_OPTION(uaLDAP, "ldapurl", "ldap");
+
+#ifdef
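For context, the OpenLDAP parsing routine the patch builds on can be exercised in a stand-alone program like the sketch below (the URL value is the example from above; compile with -lldap; error handling abbreviated):

/* Minimal demonstration of OpenLDAP's URL parser. */
#include <stdio.h>
#include <ldap.h>

int
main(void)
{
    const char *url = "ldap://ldap.example.net/dc=example,dc=net?uid?sub";
    LDAPURLDesc *desc;

    if (ldap_url_parse(url, &desc) != LDAP_URL_SUCCESS)
    {
        fprintf(stderr, "could not parse LDAP URL\n");
        return 1;
    }

    printf("host: %s  port: %d  basedn: %s  scope: %d\n",
           desc->lud_host, desc->lud_port, desc->lud_dn, desc->lud_scope);
    if (desc->lud_attrs && desc->lud_attrs[0])
        printf("search attribute: %s\n", desc->lud_attrs[0]);

    ldap_free_urldesc(desc);
    return 0;
}

The parser fills in host, port, base DN, attribute list, and scope in one call, which is what lets the patch map a single ldapurl option onto the existing ldapserver/ldapbasedn/ldapsearchattribute fields.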
Re: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL
On Monday, November 12, 2012 12:07 PM Greg Smith wrote: On 11/9/12 11:59 PM, Amit kapila wrote: Please let me know if there are any objections or problems in above method of implementation, else I can go ahead to prepare the patch for the coming CF. It may be the case that the locking scheme Robert described is the best approach here. It seems kind of heavy to me though. I suspect that some more thinking about it might come up with something better.

Yes, we should evaluate multiple options to do this and then choose the best among them. I am ready to work on evaluating other ways to accomplish this feature. Is the above opinion only about locking, or also about the way to write the changed values to a file, as mentioned in point 1 upthread? (Point 1: while writing the .auto file, it will always assume that the .auto file contains all config parameters. As this .auto file has a fixed format and fixed record size, it can write a given record directly at its particular position.) My thinking was that if we decide the format and size of each configuration record are fixed, a record can be written directly without doing anything for it in memory.

Regardless, exactly how to prevent two concurrent processes from writing the same file feels like the last step in the small roadmap for what this feature needs. If you wanted to work on it more, I'd suggest breaking it into chunks in this order:

1) Change to add scanning a .conf directory in the default configuration using include-dir. This is a quick fix. I predict most of the headaches around it will end up being for packagers rather than the core code to deal with. You could submit this as a small thing to be evaluated on its own. How it's done is going to be controversial. Might as well get that fighting focused against a sample implementation as soon as possible.

As per my understanding:
a. during initdb, a new conf directory can be created, along with the .auto file in it.
b. include_dir at the end of postgresql.conf includes the directory created in the above step.
c. server start/SIGHUP will take care of the above include_dir.

2) Get familiar with navigating the GUC data and figuring out what, exactly, needs to be written out. This should include something that navigates like things appear after a RESET, ignoring per-user or per-session changes when figuring out what goes there. It seems inevitable that some amount of validating against the source information--what pg_settings labels source, sourcefile, and sourceline--will be needed. An example is the suggestion Magnus made for confirming that the include-dir is still active before writing something there.

Okay. For this, I shall explain in the next mail how to do it by navigating the GUCs.

3) Add the function to write a new file out. Work out some test cases for that to confirm the logic and error checking in the previous step all works. I'd next submit what you've got for (2) and (3) to review at this point, before complicating things further with the locking parts.

Okay.

4) Make the file write atomic and able to work when multiple users try it at once. You have to reach here successfully before the trivia around file locking comes into play. I wouldn't even bother aiming for that part in a first patch. It's obviously a solvable problem in a number of ways. You need a rock solid way to figure out what to write there before that solution is useful though.

With Regards, Amit Kapila.
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
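For step (4) above, the usual building block is to write a temporary file and rename() it into place; a minimal sketch, with illustrative file names (not from any patch), follows:

/* Atomic config-file rewrite: write a temp file, fsync, rename over
 * the target.  rename() is atomic on POSIX filesystems, so readers see
 * either the old or the new file, never a partial one. */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

static int
write_auto_conf(const char *path, const char *contents)
{
    char    tmppath[1024];
    int     fd;
    FILE   *fp;

    snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);

    fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    fp = fdopen(fd, "w");
    if (fp == NULL)
    {
        close(fd);
        return -1;
    }
    /* flush stdio buffers, then force the data to disk before renaming */
    if (fputs(contents, fp) == EOF || fflush(fp) == EOF || fsync(fd) != 0)
    {
        fclose(fp);
        return -1;
    }
    if (fclose(fp) == EOF)
        return -1;

    return rename(tmppath, path);   /* atomic replacement */
}

This takes care of torn writes; serializing concurrent writers is a separate problem (a lock of some kind), which is exactly the part Greg suggests deferring past the first patch.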
[HACKERS] Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown
On Monday, November 12, 2012 8:23 PM Fujii Masao wrote: On Fri, Nov 9, 2012 at 3:03 PM, Amit Kapila amit.kap...@huawei.com wrote: On Thursday, November 08, 2012 10:42 PM Fujii Masao wrote: On Thu, Nov 8, 2012 at 5:53 PM, Amit Kapila amit.kap...@huawei.com wrote: On Thursday, November 08, 2012 2:04 PM Heikki Linnakangas wrote: On 19.10.2012 14:42, Amit kapila wrote: On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote: Before implementing the timeout parameter, I think that it's better to change both pg_basebackup background process and pg_receivexlog so that

BTW, IIRC the walsender has no timeout mechanism during sending backup data to pg_basebackup. So it's also useful to implement the timeout mechanism for the walsender during backup.

Yes, it's useful, but for the walsender the main problem is that it uses a blocking send call to send the data. I have tried using tcp_keepalive settings, but the send call doesn't come out in case of a network break. The only way I could get it out is to change the corresponding setting in /proc/sys/net/ipv4/tcp_retries2 by using the command

echo 8 > /proc/sys/net/ipv4/tcp_retries2

As per the recommendation, its value should be at least 8 (equivalent to 100 sec). Do you have any idea how it can be achieved?

What about using pq_putmessage_noblock()?

I will try this, but do you know why the code uses blocking mode to send files in the first place? I am asking because I am a little worried about breaking whatever design assumptions were made when file sending was made blocking.

With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
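For reference, the per-socket knobs in this discussion look like the sketch below on Linux. Note that SO_KEEPALIVE and the TCP_KEEP* options only probe idle connections, which is why they cannot unblock a send() stuck on a dead peer; TCP_USER_TIMEOUT (available since Linux 2.6.37) bounds how long sent data may remain unacknowledged, so it does bound a blocking send. Values are illustrative; error checks omitted:

/* Linux-specific socket timeout settings; not portable as written. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

static void
set_socket_timeouts(int fd)
{
    int          on = 1;
    int          idle = 60;       /* seconds of idleness before first probe */
    int          interval = 10;   /* seconds between keepalive probes */
    int          count = 5;       /* failed probes before declaring death */
    unsigned int user_timeout = 100000;  /* ms unacked data may linger */

    setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count));
#ifdef TCP_USER_TIMEOUT
    /* this is the one that actually bounds a blocking send() */
    setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &user_timeout,
               sizeof(user_timeout));
#endif
}

The per-socket TCP_USER_TIMEOUT avoids touching the system-wide tcp_retries2 knob, which affects every TCP connection on the host.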
Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages
On Monday, November 12, 2012 9:56 PM Alvaro Herrera wrote: Robert Haas escribió: On Tue, Jul 31, 2012 at 8:09 AM, Amit kapila amit.kap...@huawei.com wrote: I think I can see all of those things being potentially useful. There are a couple of pending patches that will revise the WAL format slightly; not sure how much those are likely to interfere with any development you might do on (2) in the meantime. Based on above conclusion, I have prepared a patch which implements Option-1 I wonder if we shouldn't make this a separate utility, rather than something that is part of pg_resetxlog. Anyone have a thought on that topic?

That thought did cross my mind too. One of the reasons for keeping it with pg_resetxlog is that this was proposed as a solution for scenarios where a user's database has become corrupt and he wants to start it. To do that, he can find the max LSN and set it using pg_resetxlog, which will avoid further corruption of the database after it is started. If we keep it a separate utility, the user needs to first run this utility to find the max LSN and then use pg_resetxlog to achieve the same. I don't see a big problem in that, but it would be better if there were other use cases for it. It might be useful for other purposes as well that I am not able to think of. Do you have any particular reasons for wanting it as a separate utility?

With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
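The scanning half of such a tool boils down to reading every page of every relation file and remembering the largest LSN seen in the page headers. A stand-alone sketch, simplified and not from the patch (it assumes the pre-9.3 two-field pd_lsn layout at the start of each page and a file written by the same-endian host):

/* Scan one relation segment file for the largest page LSN. */
#include <stdio.h>
#include <stdint.h>

#define BLCKSZ 8192

int
main(int argc, char **argv)
{
    FILE           *f;
    unsigned char   page[BLCKSZ];
    uint64_t        max_lsn = 0;

    if (argc < 2 || (f = fopen(argv[1], "rb")) == NULL)
        return 1;
    while (fread(page, 1, BLCKSZ, f) == BLCKSZ)
    {
        /* pd_lsn: 4-byte xlogid followed by 4-byte xrecoff */
        uint32_t xlogid  = *(uint32_t *) page;
        uint32_t xrecoff = *(uint32_t *) (page + 4);
        uint64_t lsn = ((uint64_t) xlogid << 32) | xrecoff;

        if (lsn > max_lsn)
            max_lsn = lsn;
    }
    fclose(f);
    printf("max LSN: %X/%X\n",
           (unsigned) (max_lsn >> 32), (unsigned) max_lsn);
    return 0;
}

Whether bundled into pg_resetxlog or shipped separately, the scan itself is the same; the packaging question is only about how the result gets fed back in.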
[HACKERS] Memory leaks in record_out and record_send
I looked into the problem complained of here: http://archives.postgresql.org/pgsql-general/2012-11/msg00279.php which turns out to have nothing to do with joins and everything to do with the fact that record_out() leaks memory like mad. It leaks both the strings returned by the per-column output functions and any column values that have to be detoasted. You can easily reproduce this with an example like

create table leak (f1 int, f2 text);
insert into leak select x, 'foo' from generate_series(1,100) x;
select leak from leak;

The attached patch against HEAD fixes this, as well as a similar leakage in record_send(). The added code is lifted directly from printtup() so it's not adding any new assumptions to the system. I wonder though if we ought to think about running output functions in a short-lived memory context instead of the executor's main context. We've considered that before, I think, and it's always been the path of least resistance to fix the output functions instead --- but there will always be another leak I'm afraid. OTOH I can't see trying to back-patch a solution like that. If we want to fix this in the back branches (and note the complaint linked above is against 8.3), I think we have to do it as attached. Thoughts?

regards, tom lane

diff --git a/src/backend/utils/adt/rowtypes.c b/src/backend/utils/adt/rowtypes.c
index 13e574d4e82b4412f7ab32e02398d1671d381097..d4ed7d0ca06ac1a0dee6b4963989a5b41713fb3b 100644
*** a/src/backend/utils/adt/rowtypes.c
--- b/src/backend/utils/adt/rowtypes.c
*************** typedef struct ColumnIOData
*** 32,37 ****
--- 32,38 ----
      Oid         column_type;
      Oid         typiofunc;
      Oid         typioparam;
+     bool        typisvarlena;
      FmgrInfo    proc;
  } ColumnIOData;
*************** record_out(PG_FUNCTION_ARGS)
*** 364,369 ****
--- 365,371 ----
  {
      ColumnIOData *column_info = &my_extra->columns[i];
      Oid         column_type = tupdesc->attrs[i]->atttypid;
+     Datum       attr;
      char       *value;
      char       *tmp;
      bool        nq;
*************** record_out(PG_FUNCTION_ARGS)
*** 387,403 ****
           */
          if (column_info->column_type != column_type)
          {
-             bool        typIsVarlena;
- 
              getTypeOutputInfo(column_type,
                                &column_info->typiofunc,
!                               &typIsVarlena);
              fmgr_info_cxt(column_info->typiofunc, &column_info->proc,
                            fcinfo->flinfo->fn_mcxt);
              column_info->column_type = column_type;
          }
!         value = OutputFunctionCall(&column_info->proc, values[i]);
  
          /* Detect whether we need double quotes for this value */
          nq = (value[0] == '\0');    /* force quotes for empty string */
--- 389,412 ----
           */
          if (column_info->column_type != column_type)
          {
              getTypeOutputInfo(column_type,
                                &column_info->typiofunc,
!                               &column_info->typisvarlena);
              fmgr_info_cxt(column_info->typiofunc, &column_info->proc,
                            fcinfo->flinfo->fn_mcxt);
              column_info->column_type = column_type;
          }
!         /*
!          * If we have a toasted datum, forcibly detoast it here to avoid
!          * memory leakage inside the type's output routine.
!          */
!         if (column_info->typisvarlena)
!             attr = PointerGetDatum(PG_DETOAST_DATUM(values[i]));
!         else
!             attr = values[i];
! 
!         value = OutputFunctionCall(&column_info->proc, attr);
  
          /* Detect whether we need double quotes for this value */
          nq = (value[0] == '\0');    /* force quotes for empty string */
*************** record_out(PG_FUNCTION_ARGS)
*** 416,432 ****
          /* And emit the string */
          if (nq)
!             appendStringInfoChar(&buf, '"');
          for (tmp = value; *tmp; tmp++)
          {
              char        ch = *tmp;
  
              if (ch == '"' || ch == '\\')
!                 appendStringInfoChar(&buf, ch);
!             appendStringInfoChar(&buf, ch);
          }
          if (nq)
!             appendStringInfoChar(&buf, '"');
      }
  
      appendStringInfoChar(&buf, ')');
--- 425,447 ----
          /* And emit the string */
          if (nq)
!             appendStringInfoCharMacro(&buf, '"');
          for (tmp = value; *tmp; tmp++)
          {
              char        ch = *tmp;
  
              if (ch == '"' || ch == '\\')
!                 appendStringInfoCharMacro(&buf, ch);
!             appendStringInfoCharMacro(&buf, ch);
          }
          if (nq)
!             appendStringInfoCharMacro(&buf, '"');
! 
!         pfree(value);
! 
!         /* Clean up detoasted copy, if any */
!         if (DatumGetPointer(attr) != DatumGetPointer(values[i]))
!             pfree(DatumGetPointer(attr));
      }
  
      appendStringInfoChar(&buf, ')');
*************** record_send(PG_FUNCTION_ARGS)
*** 714,719 ****
--- 729,735 ----
  {
      ColumnIOData *column_info = &my_extra->columns[i];
      Oid         column_type = tupdesc->attrs[i]->atttypid;
+     Datum       attr;
      bytea      *outputbytes;
  
      /* Ignore dropped columns in datatype */
*************** record_send(PG_FUNCTION_ARGS)
*** 734,756 ****
           */
          if (column_info->column_type != column_type)
          {
-             bool        typIsVarlena;
- 
              getTypeBinaryOutputInfo(column_type,
                                      &column_info->typiofunc,
!                                     &typIsVarlena);
              fmgr_info_cxt(column_info->typiofunc, &column_info->proc,
                            fcinfo->flinfo->fn_mcxt);
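For comparison, the short-lived-context alternative floated above would look roughly like the fragment below inside the per-column loop, using the backend memory-context API of that era. This is illustrative only, not the committed fix; it assumes the surrounding record_out() variables and elides the quoting logic:

/* Run the output function in a throwaway context and reset it wholesale,
 * instead of hunting down individual leaks.  Backend-only fragment. */
MemoryContext tmpcxt;
MemoryContext oldcxt;

tmpcxt = AllocSetContextCreate(CurrentMemoryContext,
                               "record_out workspace",
                               ALLOCSET_DEFAULT_MINSIZE,
                               ALLOCSET_DEFAULT_INITSIZE,
                               ALLOCSET_DEFAULT_MAXSIZE);

/* ... per column ... */
oldcxt = MemoryContextSwitchTo(tmpcxt);
value = OutputFunctionCall(&column_info->proc, values[i]);
MemoryContextSwitchTo(oldcxt);

appendStringInfoString(&buf, value);    /* copy out before resetting */
MemoryContextReset(tmpcxt);     /* frees value and any detoasted copy */

/* ... after the loop ... */
MemoryContextDelete(tmpcxt);

The appeal is that any allocation an output function makes, detoasted copies included, dies in the reset; the drawback, as noted, is that it changes the memory-context discipline for every output function, which is hard to justify in back branches.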
Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol
On Monday, November 12, 2012 8:31 PM Merlin Moncure wrote: On Fri, Sep 14, 2012 at 6:42 AM, Amit kapila amit.kap...@huawei.com wrote: On Tuesday, September 11, 2012 7:07 PM Amit Kapila wrote: On Monday, September 10, 2012 8:20 PM Amit Kapila wrote: On Sunday, September 09, 2012 1:37 PM Amit Kapila wrote: On Friday, September 07, 2012 11:19 PM Tom Lane wrote: Heikki Linnakangas hlinn...@iki.fi writes: Would socketpair(2) be simpler? I have started working on this patch to make it work on Windows. The 3 main things to make it work are: The patch which contains the Windows implementation as well is attached with this mail. It contains changes related to the following:
1. waitpid
2. socketpair
3. fork-exec
The following is still left: 1. Error handling in all paths The modified version 2 contains error handling in all paths.

I didn't see that this patch was added to a commitfest -- should it have been? I very much like Tom's proposed starting point for this feature as a replacement for --single. Hate to see this die on the vine. Would some testing on windows be what's needed to get the ball rolling?

After the Windows implementation, I also did first-level tests to make sure it works. I think Tom is the right person to comment on how this patch should move forward. I am not sure why he didn't provide feedback or proceed to complete it. Is it due to time, or did he foresee some design or use-case problem? If it's due to time, then I think that can be worked out.

With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
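For readers following along, the POSIX primitives under discussion combine like the sketch below: socketpair() gives the parent and child each one end of a connected stream, fork() creates the child, and waitpid() reaps it. A trivial echo stands in for the FE/BE protocol; the actual patch additionally emulates all three pieces on Windows:

/* Minimal socketpair-plus-fork pattern; error checks abbreviated. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/wait.h>

int
main(void)
{
    int     sv[2];
    pid_t   pid;
    char    buf[64];
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return 1;

    pid = fork();
    if (pid == 0)
    {
        /* child: would exec the backend with sv[1] as its "client" */
        close(sv[0]);
        n = read(sv[1], buf, sizeof(buf));
        if (n > 0)
            write(sv[1], buf, n);   /* echo in place of the real protocol */
        _exit(0);
    }

    /* parent: talk to the child over sv[0], then reap it */
    close(sv[1]);
    write(sv[0], "ping", 4);
    n = read(sv[0], buf, sizeof(buf));
    if (n > 0)
        printf("got %.*s\n", (int) n, buf);
    waitpid(pid, NULL, 0);
    return 0;
}

The attraction for a standalone mode is that the child runs the ordinary FE/BE protocol over sv[1] exactly as it would over a TCP socket, so almost no backend code needs to care about the difference.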
Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol
On Tuesday, November 13, 2012 3:11 AM Simon Riggs wrote: On 12 November 2012 21:26, Merlin Moncure mmonc...@gmail.com wrote: I couldn't disagree more. The patch is small, logical, and fixes an awful problem, namely that --single mode is basically unusable. As to your wider point (namely, that you can't connect to it, therefore it's bad), it has got to be refuted by numerous competing solutions in the market such as http://www.firebirdsql.org/manual/fbmetasecur-embedded.html, and many others. As far as I remember even MySQL provides such a mode.

Small is not an argument in favour, just a statement of ease, like jumping off a cliff. i.e. lemmings.

While it's not as common as it used to be, now and then a piece of software needs to distribute an application as part of a boxed product. Postgres is horrible at this and doesn't have to be; imagine how much easier the lives of poker tracker would be (for *some* of its users) with an integrated standalone mode: google 'poker tracker postgresql' and take a good long look at problems people face in this scenario.

I get the installability thang, very very much, I just don't see the single process thing as the only solution. At very least an open minded analysis of the actual problem and ways of solving it is called for, not just reach for a close to hand solution.

Another use case where I have seen this required is telecom billing apps. In the telecom applications where this solution works, other maintenance connections are needed as well. Some of the reasons for its use are performance and lower maintenance overhead; the data requirements are also not very high. So even if this solution doesn't meet all the requirements of a single-process solution (and I don't think it was written to address them all), can't we think of it as a first version and then, based on requirements, extend it to have other capabilities:
a. a mechanism for other background processes (autovacuum, checkpoint, ...);
b. more needs to be thought of.

With Regards, Amit Kapila. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers