Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Heikki Linnakangas

On 12.11.2012 01:25, Tom Lane wrote:

Worse than that, GIST WAL replay seems to be a total disaster from this
standpoint, and I'm not convinced that it's even fixable without
incompatible WAL changes.  There are several cases where index
insertion operations generate more than one WAL record while holding
exclusive lock on buffer(s).  If the lock continuity is actually
necessary to prevent concurrent queries from seeing inconsistent index
states, then we have a big problem, because WAL replay isn't designed to
hold any buffer lock for more than one WAL action.


Hmm. After the rewrite of that logic I did in 9.1 (commit 9de3aa65f0), 
the GiST index is always in a consistent state. Until the downlink for 
a newly split page has been inserted, its left sibling is flagged so 
that a search knows to follow the right-link to find it. In normal 
operation, the lock continuity is needed to prevent another backend from 
seeing the incomplete split and trying to fix it, but in hot standby, we 
never try to fix incomplete splits anyway.
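
(To illustrate the mechanism: the scan-side handling boils down to something
like the sketch below. This is a simplification from memory, not the actual
9.1 code; GistFollowRight() and the rightlink field are the names I recall
from the tree, and the helper at the end is hypothetical.)

    page = BufferGetPage(buffer);
    if (GistFollowRight(page))
    {
        /*
         * The page was split but the downlink for the new right sibling is
         * not in the parent yet, so the scan also has to visit the right
         * sibling through the right-link.
         */
        BlockNumber rightblkno = GistPageGetOpaque(page)->rightlink;

        remember_block_to_scan(scan, rightblkno);   /* hypothetical helper */
    }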


So I think we're good on >= 9.1. The 9.0 code is broken, however. In 
9.0, when a child page is split, the parent and new children are kept 
locked until the downlinks are inserted/updated. If a concurrent scan 
comes along and sees that incomplete state, it will miss tuples on the 
new right siblings. We rely on a rm_cleanup operation at the end of WAL 
replay to fix that situation, if the downlink insertion record is not 
there. I don't see any easy way to fix that, unfortunately. Perhaps we 
could backpatch the 9.1 rewrite, now that it's gotten some real-world 
testing, but it was a big change so I don't feel very comfortable doing 
that.



Also, gistRedoPageDeleteRecord seems to be dead code?  I don't see
anything that emits XLOG_GIST_PAGE_DELETE.


Yep. It was used in VACUUM FULL, but has been dead since VACUUM FULL was 
removed.


- Heikki




Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Markus Wanner
Jeff,

On 11/10/2012 12:08 AM, Jeff Davis wrote:
 The bit indicating that a checksum is present may be lost due to
 corruption.

Hm.. I see.

Sorry if that has been discussed before, but can't we do without that
bit at all? It adds a checksum switch to each page, where we just agreed
we don't even want a per-database switch.

Can we simply write a progress indicator to pg_control or someplace
saying that all pages up to X of relation Y are supposed to have valid
checksums?

That would mean having to re-calculate the checksums on pages that got
dirtied before VACUUM came along to migrate them to having a checksum,
but that seems acceptable. VACUUM could even detect that case and
wouldn't have to re-write it with the same contents.
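
To make the idea concrete (all of these names are invented, just to show the
shape of the check):

    typedef struct ChecksumProgress
    {
        Oid         relid;          /* relation Y */
        BlockNumber upto_block;     /* all pages up to X are checksummed */
    } ChecksumProgress;

    /* Would a missing or bad checksum on this page be an error? */
    static bool
    page_must_have_checksum(const ChecksumProgress *cp,
                            Oid relid, BlockNumber blkno)
    {
        return cp->relid == relid && blkno <= cp->upto_block;
    }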

I realize this doesn't support Jesper's use case of wanting to have the
checksums only for newly dirtied pages. However, I'd argue that
prolonging the migration to spread the load would allow even big shops
to go through this without much of an impact on performance.

Regards

Markus Wanner




Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Markus Wanner
On 11/12/2012 05:55 AM, Greg Smith wrote:
 Adding an initdb option to start out with everything checksummed seems
 an uncontroversial good first thing to have available.

+1

So the following discussion really is about a future patch extending
that initial checksum support.

 One of the really common cases I was expecting here is that conversions
 are done by kicking off a slow background VACUUM CHECKSUM job that might
 run in pieces.  I was thinking of an approach like this:
 
 -Initialize a last_checked_block value for each table
 -Loop:
 --Grab the next block after the last checked one
 --When on the last block of the relation, grab an exclusive lock to
 protect against race conditions with extension
 --If it's marked as checksummed and the checksum matches, skip it
 ---Otherwise, add a checksum and write it out
 --When that succeeds, update last_checked_block
 --If that was the last block, save some state saying the whole table is
 checksummed

Perfect, thanks. That's the rough idea I had in mind as well, written
out in detail and catching the extension case.
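
Roughly, in code form (invented helper names, and leaving out the WAL
logging, cost-based delays and the extension-lock handling you mention):

    BlockNumber blkno = get_last_checked_block(rel);        /* invented */
    BlockNumber nblocks = RelationGetNumberOfBlocks(rel);

    for (; blkno < nblocks; blkno++)
    {
        Buffer      buf = ReadBuffer(rel, blkno);
        Page        page;

        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        page = BufferGetPage(buf);
        if (!page_is_checksummed(page) || !checksum_matches(page)) /* invented */
        {
            set_page_checksum(page);                         /* invented */
            MarkBufferDirty(buf);
        }
        UnlockReleaseBuffer(buf);

        save_last_checked_block(rel, blkno);                 /* invented */
    }
    mark_table_fully_checksummed(rel);                       /* invented */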

 With that logic, there is at least a forward moving pointer that removes
 the uncertainty around whether pages have been updated or not.  It will
 keep going usefully if interrupted too.  One obvious way this can
 fail is if:
 
 1) A late page in the relation is updated and a checksummed page written
 2) The page is corrupted such that the "is this checksummed?" bits are
 not consistent anymore, along with other damage to it
 3) The conversion process gets to this page eventually
 4) The corruption of (2) isn't detected

IMO this just outlines how limited the use of the "is this checksummed"
bit in the page itself is. It just doesn't catch all cases. Is it worth
having that bit at all, given your block-wise approach above?

It really only serves to catch corruptions to *newly* dirtied pages
*during* the migration phase that *keep* that single bit set. Everything
else is covered by the last_checked_block variable. Sounds narrow enough
to be negligible. Then again, it's just a single bit per page...

 The only guarantee I see that we can give for online upgrades is that
 after a VACUUM CHECKSUM sweep is done, and every page is known to both
 have a valid checksum on it and have its checksum bits set, *then* any
 page that doesn't have both set bits and a matching checksum is garbage.

From that point in time on, we'd theoretically better use that bit as an
additional checksum bit rather than requiring it to be set at all times.
Really just theoretically, I'm certainly not advocating a 33 bit
checksum :-)

Regards

Markus Wanner




Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Craig Ringer
On 11/12/2012 04:44 PM, Markus Wanner wrote:
 Jeff,

 On 11/10/2012 12:08 AM, Jeff Davis wrote:
 The bit indicating that a checksum is present may be lost due to
 corruption.
 Hm.. I see.

 Sorry if that has been discussed before, but can't we do without that
 bit at all? It adds a checksum switch to each page, where we just agreed
 we don't even want a per-database switch.

 Can we simply write a progress indicator to pg_control or someplace
 saying that all pages up to X of relation Y are supposed to have valid
 checksums?
That'll make it hard for VACUUM, hint-bit setting, etc to
opportunistically checksum pages whenever they're doing a page write anyway.

Is it absurd to suggest using another bitmap, like the FSM or visibility
map, to store information on page checksumming while checksumming is
enabled but incomplete? As a much smaller file the bitmap could itself
be very quickly generated in one pass when checksumming is enabled, with
its starting state showing no pages having checksums.

It could perhaps itself have page checksums, since presumably the persistent
maps like the FSM and visibility map will support them. Some way to
ensure the checksum map is valid would be needed.
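
The lookup side would be trivial; something like this (one bit per heap
block, visibility-map style, names invented):

    /* Is block blkno of this relation already covered by a checksum? */
    static bool
    block_is_checksummed(const unsigned char *checksum_map, BlockNumber blkno)
    {
        return (checksum_map[blkno / 8] & (1 << (blkno % 8))) != 0;
    }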

--
Craig Ringer




Re: [HACKERS] Reduce palloc's in numeric operations.

2012-11-12 Thread Kyotaro HORIGUCHI
Thanks for comments,

 Have to be careful to really not modify the
 operands. numeric_out() and numeric_out_sci() are wrong; they
 call get_str_from_var(), which modifies the argument. Same with
 numeric_expr(): it passes the argument to
 numericvar_to_double_no_overflow(), which passes it to
 get_str_from_var(). numericvar_to_int8() also modifies its
 argument, so all the functions that use that, directly or
 indirectly, must make a copy.

Mmm. I wasn't careful enough to pick those up...

I overlooked that get_str_from_var() and numericvar_to_int8() call
round_var(), which rewrites the operand. I reverted numeric_out(),
numeric_int8(), and numeric_int2().

Although, I couldn't find where in get_str_from_var_sci() the
operand would be modified.

The lines where var appears in get_str_from_var_sci() are listed
below.

|  2:get_str_from_var_sci(NumericVar *var, int rscale)
| 21:  if (var->ndigits > 0)
| 23:    exponent = (var->weight + 1) * DEC_DIGITS;
| 29:    exponent -= DEC_DIGITS - (int) log10(var->digits[0]);
| 59:  div_var(var, &denominator, &significand, rscale, true);

The only suspect at this level is div_var, so I did the same check
for var1 in div_var.

|   2:div_var(NumericVar *var1, NumericVar *var2, NumericVar *result,
|  20:  int  var1ndigits = var1->ndigits;
|  35:  if (var1ndigits == 0)
|  47:  if (var1->sign == var2->sign)
|  51:  res_weight = var1->weight - var2->weight;
|  68:  div_ndigits = Max(div_ndigits, var1ndigits);
|  83:  memcpy(dividend + 1, var1->digits, var1ndigits * sizeof(NumericDigit));
| 132:  for (i = var1ndigits; i >= 0; i--)

No line seems to modify var1 as far as I can see, so I've left
numeric_out_sci() modified in this patch.


Well, I found some other bugs in numeric_stddev_internal:
vN was erroneously freed, and vsumX2 is used as a work variable.
They are fixed in this new patch.

 Perhaps get_str_from_var(), and the other functions that
 currently scribble on the arguments, should be modified to not
 do so. They could easily make a copy of the argument within the
 function. Then the callers could safely use
 set_var_from_num_nocopy(). The performance would be the same,
 you would have the same number of pallocs, but you would get
 rid of the surprising argument-modifying behavior of those
 functions.

I agree with that. const qualifiers on the parameters would enforce this
mechanically. I'll try this for the next version of this patch.
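
For reference, the nocopy variant is essentially just this (simplified
sketch; the NUMERIC_* accessor macros are the ones already used in
numeric.c):

    /*
     * Point the NumericVar at the digits stored in the Numeric datum
     * instead of palloc'ing a copy.  buf = NULL tells free_var() there is
     * nothing to pfree, and the caller must treat the var as read-only,
     * which is exactly why the argument-modifying functions above matter.
     */
    static void
    set_var_from_num_nocopy(Numeric num, NumericVar *dest)
    {
        dest->ndigits = NUMERIC_NDIGITS(num);
        dest->weight = NUMERIC_WEIGHT(num);
        dest->sign = NUMERIC_SIGN(num);
        dest->dscale = NUMERIC_DSCALE(num);
        dest->digits = NUMERIC_DIGITS(num);
        dest->buf = NULL;
    }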


 SELECT SUM(col) FROM numtest;
 
 The execution time of that query fell from about 5300 ms to 4300 ms, ie. 
 about 20%.

Wow, it seems more promising than I expected. Thanks.

regards,

-- 
Kyotaro Horiguchi

NTT Open Source Software Center, Nippon Telegraph and Telephone Corporation
Phone: 03-5860-5115 / Fax: 03-5463-5490
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 68c1f1d..fcff325 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -367,6 +367,7 @@ static void zero_var(NumericVar *var);
 static const char *set_var_from_str(const char *str, const char *cp,
  NumericVar *dest);
 static void set_var_from_num(Numeric value, NumericVar *dest);
+static void set_var_from_num_nocopy(Numeric num, NumericVar *dest);
 static void set_var_from_var(NumericVar *value, NumericVar *dest);
 static char *get_str_from_var(NumericVar *var, int dscale);
 static char *get_str_from_var_sci(NumericVar *var, int rscale);
@@ -617,11 +618,10 @@ numeric_out_sci(Numeric num, int scale)
 		return pstrdup("NaN");
 
 	init_var(&x);
-	set_var_from_num(num, &x);
+	set_var_from_num_nocopy(num, &x);
 
 	str = get_str_from_var_sci(&x, scale);
 
-	free_var(&x);
 	return str;
 }
 
@@ -696,7 +696,7 @@ numeric_send(PG_FUNCTION_ARGS)
 	int			i;
 
 	init_var(&x);
-	set_var_from_num(num, &x);
+	set_var_from_num_nocopy(num, &x);
 
 	pq_begintypsend(&buf);
 
@@ -707,8 +707,6 @@ numeric_send(PG_FUNCTION_ARGS)
 	for (i = 0; i < x.ndigits; i++)
 		pq_sendint(&buf, x.digits[i], sizeof(NumericDigit));
 
-	free_var(&x);
-
 	PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
 }
 
@@ -1577,15 +1575,13 @@ numeric_add(PG_FUNCTION_ARGS)
 	init_var(&arg2);
 	init_var(&result);
 
-	set_var_from_num(num1, &arg1);
-	set_var_from_num(num2, &arg2);
+	set_var_from_num_nocopy(num1, &arg1);
+	set_var_from_num_nocopy(num2, &arg2);
 
 	add_var(&arg1, &arg2, &result);
 
 	res = make_result(&result);
 
-	free_var(&arg1);
-	free_var(&arg2);
 	free_var(&result);
 
 	PG_RETURN_NUMERIC(res);
@@ -1620,15 +1616,13 @@ numeric_sub(PG_FUNCTION_ARGS)
 	init_var(&arg2);
 	init_var(&result);
 
-	set_var_from_num(num1, &arg1);
-	set_var_from_num(num2, &arg2);
+	set_var_from_num_nocopy(num1, &arg1);
+	set_var_from_num_nocopy(num2, &arg2);
 
 	sub_var(&arg1, &arg2, &result);
 
 	res = make_result(&result);
 
-	free_var(&arg1);
-	free_var(&arg2);
 	free_var(&result);
 
 	PG_RETURN_NUMERIC(res);
@@ -1667,15 +1661,13 @@ numeric_mul(PG_FUNCTION_ARGS)
 	init_var(&arg2);
 	init_var(&result);
 
-	set_var_from_num(num1, &arg1);
-	set_var_from_num(num2, &arg2);
+	set_var_from_num_nocopy(num1, &arg1);
+	set_var_from_num_nocopy(num2, &arg2);
 
 	mul_var(&arg1, &arg2, &result, arg1.dscale + 

Re: [HACKERS] Identity projection

2012-11-12 Thread Kyotaro HORIGUCHI
Hello, this is the new version of the identity projection patch.

I reverted the changes to projectionInfo and ExecBuildProjectionInfo.
Identity projection is now recognized directly in ExecGroup, ExecResult,
and ExecWindowAgg. nodeAgg is reverted because I couldn't make it
sane.
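
The core of the recognition is a check along these lines (a simplified
sketch, not the exact patch code):

    /*
     * A targetlist is an identity projection when every entry is a plain
     * Var referencing the corresponding input column, in order, with no
     * resjunk entries and nothing computed.
     */
    static bool
    tlist_is_identity(List *targetlist)
    {
        AttrNumber  attno = 1;
        ListCell   *lc;

        foreach(lc, targetlist)
        {
            TargetEntry *tle = (TargetEntry *) lfirst(lc);

            if (tle->resjunk || !IsA(tle->expr, Var) ||
                ((Var *) tle->expr)->varattno != attno)
                return false;
            attno++;
        }
        return true;
    }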

The following is the result of performance test posted before in
order to show the source of the gain.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



At Fri, 05 Oct 2012 16:04:16 +0900, Kyotaro HORIGUCHI wrote in 
20121005.160416.256387378.horiguchi.kyot...@lab.ntt.co.jp
  Although I said the following, the gain seems a bit larger... I'll
  recheck the testing conditions...
 
 I inspected this more precisely in two respects, magnifying the
 effect of this patch by putting 300 columns into the table.
 
 
 First, explain analyze says the difference caused by this patch
 is only in the actual time of the Result node.
 
 orig$ psql -c 'explain analyze select * from parenta'
   QUERY PLAN
 --
  Result  (cost=0.00..176667.00 rows=101 width=1202)
  (actual time=0.013.. *2406.792* rows=100 loops=1)
->  Append  (cost=0.00..176667.00 rows=101 width=1202)
(actual time=0.011..412.749 rows=100 loops=1)
  ->  Seq Scan on parenta  (cost=0.00..0.00 rows=1 width=1228)
   (actual time=0.001..0.001 rows=0 loops=1)
  ->  Seq Scan on childa000 parenta
 (cost=0.00..176667.00 rows=100 width=1202)
 (actual time=0.009..334.633 rows=100 loops=1)
  Total runtime: 2446.474 ms
 (5 rows)
 
 patched$ psql -c 'explain analyze select * from parenta'
   QUERY PLAN
 --
  Result  (cost=0.00..176667.00 rows=101 width=1202)
  (actual time=0.011.. *507.239* rows=100 loops=1)
->  Append  (cost=0.00..176667.00 rows=101 width=1202)
(actual time=0.011..419.420 rows=100 loops=1)
  ->  Seq Scan on parenta  (cost=0.00..0.00 rows=1 width=1228)
   (actual time=0.000..0.000 rows=0 loops=1)
  ->  Seq Scan on childa000 parenta
 (cost=0.00..176667.00 rows=100 width=1202)
 (actual time=0.010..335.721 rows=100 loops=1)
  Total runtime: 545.879 ms
 (5 rows)
 
 
 Second, the results of configure --enable-profiling show that
 the execution time of ExecProject changes greatly. This is consistent
 with what explain showed.
 
 orig:
%   cumulative   self  self total   
   time   seconds   secondscalls   s/call   s/call  name
   60.29  1.26 1.26  105 0.00 0.00  slot_deform_tuple
 ! 30.14  1.89 0.63  100 0.00 0.00  ExecProject
3.35  1.96 0.07  304 0.00 0.00  ExecProcNode
0.96  1.98 0.02  102 0.00 0.00  ExecScan
0.96  2.00 0.02   166379 0.00 0.00  TerminateBufferIO
0.48  2.01 0.01  304 0.00 0.00  InstrStartNode
0.48  2.02 0.01  304 0.00 0.00  InstrStopNode
 !  0.48  2.03 0.01  101 0.00 0.00  ExecResult
0.48  2.04 0.01   830718 0.00 0.00  LWLockAcquire
    0.48  2.05 0.01   506834 0.00 0.00  hash_search_with_hash_value
    0.48  2.06 0.01   341656 0.00 0.00  LockBuffer
    0.48  2.07 0.01   168383 0.00 0.00  ReadBuffer_common
    0.48  2.08 0.01        4 0.00 0.00  InstrEndLoop
    0.48  2.09 0.01                     ExecCleanTargetListLength
0.00  2.09 0.00  205 0.00 0.00  MemoryContextReset
 
 patched:
 %   cumulative   self  self total   
time   seconds   secondscalls  ms/call  ms/call  name
23.08  0.03 0.03  304 0.00 0.00  ExecProcNode
15.38  0.05 0.02  102 0.00 0.00  heapgettup_pagemode
15.38  0.07 0.02   830718 0.00 0.00  LWLockAcquire
 7.69  0.08 0.01  205 0.00 0.00  MemoryContextReset
 7.69  0.09 0.01  102 0.00 0.00  ExecScan
 7.69  0.10 0.01  100 0.00 0.00  ExecStoreTuple
 7.69  0.11 0.01   841135 0.00 0.00  LWLockRelease
 7.69  0.12 0.01   168383 0.00 0.00  ReadBuffer_common
 7.69  0.13 0.01   168383 0.00 0.00  UnpinBuffer
 0.00  0.13 0.00  304 0.00 0.00  InstrStartNode
  ...
 !   0.00  0.13 0.00  101 0.00 0.00  ExecResult
 !   0.00  0.13 0.00  100 0.00 0.00  ExecProject
==
diff --git a/src/backend/executor/nodeGroup.c 

Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Markus Wanner
On 11/12/2012 10:44 AM, Craig Ringer wrote:
 That'll make it hard for VACUUM, hint-bit setting, etc to
 opportunistically checksum pages whenever they're doing a page write anyway.

It *is* a hard problem, yes. And the single bit doesn't really solve it.
So I'm arguing against opportunistically checksumming in general. Who
needs that anyway?

 Is it absurd to suggest using another bitmap, like the FSM or visibility
 map, to store information on page checksumming while checksumming is
 enabled but incomplete?

Not absurd. But arguably inefficient, because that bitmap may well
become a bottleneck itself. Plus there's the problem of making sure
those pages are safe against corruptions, so you'd need to checksum the
checksum bitmap... doesn't sound like a nice solution to me.

This has certainly been discussed before.

Regards

Markus Wanner




Re: [HACKERS] why can't plpgsql return a row-expression?

2012-11-12 Thread Asif Rehman
Hi,

I have tried to solve this issue. Please see the attached patch.

With this patch, any expression is allowed in the return statement. For any
invalid expression an error is generated without doing any special handling.
When a row expression is used in the return statement, the resultant
tuple will have the rowtype in a single column that needs to be extracted.
Hence I have handled that case in exec_stmt_return().

any comments/suggestions?

Regards,
--Asif

On Mon, Oct 8, 2012 at 9:53 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Pavel Stehule pavel.steh...@gmail.com writes:
  2012/10/8 Tom Lane t...@sss.pgh.pa.us:
  Laziness, probably.  Feel free to have at it.

  I wrote a patch some years ago. It was rejected for performance reasons
  - because every row had to be cast to the resulting type.

 I don't recall that patch in any detail, but it's not apparent to me
 that an extra cast step *must* be required to implement this.  In the
 cases that are supported now, surely we could notice that no additional
 work is required.

 It's also worth commenting that if we were to switch the storage of
 composite-type plpgsql variables to HeapTuple, as has been suggested
 here for other reasons, the performance tradeoffs in this area would
 likely change completely anyway.

 regards, tom lane





return_rowtype.patch
Description: Binary data



Re: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL

2012-11-12 Thread Tom Lane
Greg Smith g...@2ndquadrant.com writes:
 Regardless, exactly how to prevent two concurrent processes from writing 
 the same file feels like the last step in the small roadmap for what 
 this feature needs.

Write a temp file and use rename(2) to move it into place is the
standard solution for that.
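
In code form, roughly (sketch only; error checks and the fsync of the
containing directory are left out, and conf_path / newcontents stand in for
the real target path and buffer):

    char    tmpname[MAXPGPATH];
    int     fd;

    snprintf(tmpname, sizeof(tmpname), "%s.tmp", conf_path);
    fd = open(tmpname, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    write(fd, newcontents, len);
    fsync(fd);                  /* make the new contents durable first */
    close(fd);
    rename(tmpname, conf_path); /* atomically replace the old file */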

regards, tom lane




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Robert Haas
On Sun, Nov 11, 2012 at 6:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I already pointed out the inconsistency in heap_xlog_freeze about
 whether a cleanup lock is needed.  As is, this patch uses a cleanup
 lock, but I suspect that a regular lock is sufficient --- comments?

Well, according to storage/buffer/README:

1. To scan a page for tuples, one must hold a pin and either shared or
exclusive content lock.  To examine the commit status (XIDs and status bits)
of a tuple in a shared buffer, one must likewise hold a pin and either shared
or exclusive lock.

That does indeed make it sound like an x-lock is enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sun, Nov 11, 2012 at 6:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I already pointed out the inconsistency in heap_xlog_freeze about
 whether a cleanup lock is needed.  As is, this patch uses a cleanup
 lock, but I suspect that a regular lock is sufficient --- comments?

 Well, according to storage/buffer/README:

 1. To scan a page for tuples, one must hold a pin and either shared or
 exclusive content lock.  To examine the commit status (XIDs and status bits)
 of a tuple in a shared buffer, one must likewise hold a pin and either shared
 or exclusive lock.

 That does indeed make it sound like an x-lock is enough.

Yeah.  AFAICS, the only possible downside is that somebody might be
using the tuple (while holding just a buffer pin), and that its xmin
might change while that's happening.  So for instance a nestloop join
might read out different xmin values for the same row while the join
proceeds.  But that could happen anyway if a different plan had been
chosen (viz, this table on the inside not the outside of the nestloop).
The xmin update ought to be physically atomic, so you shouldn't be able
to get a garbage result, just the old value or the new.

The cleanup lock is needed for cases where we'd like to remove or move a
tuple, but that's not required here.

regards, tom lane




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 12.11.2012 01:25, Tom Lane wrote:
 Worse than that, GIST WAL replay seems to be a total disaster from this
 standpoint, and I'm not convinced that it's even fixable without
 incompatible WAL changes.

 Hmm. After the rewrite of that logic I did in 9.1 (commit 9de3aa65f0), 
 the GiST index is always in a consistent state. Before the downlink for 
 a newly split page has been inserted yet, its left sibling is flagged so 
 that a search knows to follow the right-link to find it. In normal 
 operation, the lock continuity is needed to prevent another backend from 
 seeing the incomplete split and trying to fix it, but in hot standby, we 
 never try to fix incomplete splits anyway.

 So I think we're good on >= 9.1.

Okay.  I'll update the patch to make sure that the individual WAL replay
functions hold all locks, but not worry about holding locks across
actions.

 The 9.0 code is broken, however. In 
 9.0, when a child page is split, the parent and new children are kept 
 locked until the downlinks are inserted/updated. If a concurrent scan 
 comes along and sees that incomplete state, it will miss tuples on the 
 new right siblings. We rely on a rm_cleanup operation at the end of WAL 
 replay to fix that situation, if the downlink insertion record is not 
 there. I don't see any easy way to fix that, unfortunately. Perhaps we 
 could backpatch the 9.1 rewrite, now that it's gotten some real-world 
 testing, but it was a big change so I don't feel very comfortable doing 
 that.

Me either.  Given the lack of field complaints, I think we're better
advised to just leave it unfixed in 9.0.  It'd not be a step forward
if we broke something trying to make this work.

 Also, gistRedoPageDeleteRecord seems to be dead code?  I don't see
 anything that emits XLOG_GIST_PAGE_DELETE.

 Yep. It was used in VACUUM FULL, but has been dead since VACUUM FULL was 
 removed.

Okay.  I see we bumped XLOG_PAGE_MAGIC in 9.0, so there's no longer
any way that 9.0 or later versions could see this WAL record type.
I'll delete that replay function rather than bothering to fix it.

regards, tom lane




Re: [HACKERS] [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

2012-11-12 Thread Fujii Masao
On Fri, Nov 9, 2012 at 3:03 PM, Amit Kapila amit.kap...@huawei.com wrote:
 On Thursday, November 08, 2012 10:42 PM Fujii Masao wrote:
 On Thu, Nov 8, 2012 at 5:53 PM, Amit Kapila amit.kap...@huawei.com
 wrote:
  On Thursday, November 08, 2012 2:04 PM Heikki Linnakangas wrote:
  On 19.10.2012 14:42, Amit kapila wrote:
   On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:
   Before implementing the timeout parameter, I think that it's
 better
  to change
   both pg_basebackup background process and pg_receivexlog so that
 they
   send back the reply message immediately when they receive the
  keepalive
   message requesting the reply. Currently, they always ignore such
  keepalive
   message, so status interval parameter (-s) in them always must be
 set
  to
   the value less than replication timeout. We can avoid this
  troublesome
   parameter setting by introducing the same logic of walreceiver
 into
  both
   pg_basebackup background process and pg_receivexlog.
  
   Please find the patch attached to address the modification
 mentioned
  by you (send immediate reply for keepalive).
   Both basebackup and pg_receivexlog uses the same function
  ReceiveXLogStream, so single change for both will address the issue.
 
  Thanks, committed this one after shuffling it around the changes I
  committed yesterday. I also updated the docs to not claim that -s
 option
  is required to avoid timeout disconnects anymore.
 
  Thank you.
  However I think still the issue will not be completely solved.
  pg_basebackup/pg_receivexlog can still take long time to
  detect network break as they don't have timeout concept. To do that I
 have
  sent one proposal which is mentioned at end of mail chain:
  http://archives.postgresql.org/message-
 id/6C0B27F7206C9E4CA54AE035729E9C3828
  53BBED@szxeml509-mbs
 
  Do you think there is any need to introduce such mechanism in
  pg_basebackup/pg_receivexlog?

 Are you planning to introduce the timeout mechanism in pg_basebackup
 main process? Or background process? It's useful to implement both.

 By background process, you mean ReceiveXlogStream?
 For both.

 I think for background process, it can be done in a way similar to what we
 have done for walreceiver.

Yes.

 But I have some doubts about how to do it for the main process:

 Logic similar to walreceiver's cannot be used in case the network goes down
 while receiving the other database files from the server.
 The reason is that, to receive the data files, PQgetCopyData() is
 called in synchronous mode, so it keeps waiting indefinitely until it
 gets some data.
 In order to solve this issue, I can think of the following options:
 1. Making this call also asynchronous (but not sure about the impact of this).

+1

Walreceiver already calls PQgetCopyData() asynchronously. ISTM you can
solve the issue in a similar way to walreceiver's.
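
Something like the following, in libpq terms (sketch; error handling
trimmed, and timeout_ms / process_message are placeholders for whatever the
utility needs to do):

    for (;;)
    {
        char       *copybuf = NULL;
        int         len = PQgetCopyData(conn, &copybuf, 1); /* async mode */

        if (len > 0)
        {
            process_message(copybuf, len);
            PQfreemem(copybuf);
            continue;
        }
        if (len == -1 || len == -2)
            break;              /* COPY ended, or connection error */

        /* len == 0: nothing available yet; wait for data or the timeout */
        {
            int             sock = PQsocket(conn);
            fd_set          input_mask;
            struct timeval  timeout;

            timeout.tv_sec = timeout_ms / 1000;
            timeout.tv_usec = (timeout_ms % 1000) * 1000;
            FD_ZERO(&input_mask);
            FD_SET(sock, &input_mask);

            if (select(sock + 1, &input_mask, NULL, NULL, &timeout) == 0)
                break;          /* timed out: give up on the connection */
            if (!PQconsumeInput(conn))
                break;          /* connection broken */
        }
    }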

 2. In the function pqWait, instead of passing the hard-coded value -1 (i.e.
 infinite wait), we can pass some finite time. This time can be received as a
 command line argument from the respective utility and set in the PGconn
 structure. In order to have a timeout value in PGconn, we could:
 a. Add a new parameter in PGconn to indicate the receive timeout.
 b. Use the existing parameter connect_timeout for the receive timeout
 as well, but this may lead to confusion.
 3. Any other better option?

 Apart from the above issue, there is a possibility that if the network goes
 down during connect time, then it might hang, because connect_timeout by
 default will be NULL and connectDBComplete will start waiting infinitely for
 the connection to become successful.
 So shall we have a separate command line argument for this as well, or any
 other way you suggest?

Yes, I think that we should add something like --conninfo option to
pg_basebackup
and pg_receivexlog. We can easily set not only connect_timeout but also sslmode,
application_name, ... by using such option accepting conninfo string.

 BTW, IIRC the walsender has no timeout mechanism during sending
 backup data to pg_basebackup. So it's also useful to implement the
 timeout mechanism for the walsender during backup.

 Yes, it's useful, but for walsender the main problem is that it uses a
 blocking send call to send the data.
 I have tried using tcp_keepalive settings, but the send call doesn't come
 out in case of a network break.
 The only way I could get it out is to change the corresponding file
 /proc/sys/net/ipv4/tcp_retries2 by using the command
 echo 8 > /proc/sys/net/ipv4/tcp_retries2
 As per the recommendation, its value should be at least 8 (equivalent to 100
 sec).

 Do you have any idea, how it can be achieved?

What about using pq_putmessage_noblock()?

Regards,

-- 
Fujii Masao




Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Alvaro Herrera
Greg Smith wrote:
 On 11/11/12 2:56 PM, Jeff Davis wrote:
 We could have a separate utility, pg_checksums, that can
 alter the state and/or do an offline verification. And initdb would take
 an option that would start everything out fully protected with
 checksums.
 
 Adding an initdb option to start out with everything checksummed
 seems an uncontroversial good first thing to have available.

+1

 Won't a pg_checksums program just grow until it looks like a limited
 version of vacuum though?  It's going to iterate over most of the
 table; it needs the same cost controls as autovacuum (and to respect
 the load of concurrent autovacuum work) to keep I/O under control;
 and those cost control values might change if there's a SIGHUP to
 reload parameters. It looks so much like vacuum that I think there
 needs to be a really compelling reason to split it into something
 new.  Why can't this be yet another autovacuum worker that does its
 thing?

I agree that many of the things it's gonna do are going to be pretty
much the same as vacuum, but vacuum does so many other things that I
think it should be kept separate.  Sure, we can make it be invoked from
autovacuum in background according to some (yet to be devised)
scheduling heuristics.  But I don't see that it needs to share any
vacuum code.

A couple of thoughts about autovacuum: it's important to figure out
whether checksumming can run concurrently with vacuuming the same table;
if not, which one defers to the other in case of lock conflict.  Also,
can checksumming be ignored by concurrent transactions when computing
Xmin (I don't see any reason not to ...)

 One of the really common cases I was expecting here is that
 conversions are done by kicking off a slow background VACUUM
 CHECKSUM job that might run in pieces.  I was thinking of an
 approach like this:
 
 -Initialize a last_checked_block value for each table
 -Loop:
 --Grab the next block after the last checked one
 --When on the last block of the relation, grab an exclusive lock to
 protect against race conditions with extension

Note that we have a separate lock type for relation extension, so we can
use that to avoid a conflict here.

 --If it's marked as checksummed and the checksum matches, skip it
 ---Otherwise, add a checksum and write it out
 --When that succeeds, update last_checked_block
 --If that was the last block, save some state saying the whole table
 is checksummed

"Some state" can be a pg_class field that's updated via
heap_inplace_update.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol

2012-11-12 Thread Merlin Moncure
On Fri, Sep 14, 2012 at 6:42 AM, Amit kapila amit.kap...@huawei.com wrote:
 On Tuesday, September 11, 2012 7:07 PM Amit Kapila wrote:
 On Monday, September 10, 2012 8:20 PM Amit Kapila wrote:
 On Sunday, September 09, 2012 1:37 PM Amit Kapila wrote:
 On Friday, September 07, 2012 11:19 PM Tom Lane wrote:
 Heikki Linnakangas hlinn...@iki.fi writes:
 Would socketpair(2) be simpler?



I've not done anything yet about the potential security issues
associated with untrusted libpq connection strings.  I think this
is still at the proof-of-concept stage; in particular, it's probably
 time to see if we can make it work on Windows before we worry more
about that.

 I have started working on this patch to make it work on Windows. The 3
 main things to make it work are:

The patch which contains Windows implementation as well is attached with this 
mail. It contains changes related to the following:
1. waitpid
2. socketpair
3. fork-exec
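
For reference, the POSIX side of the socketpair + fork-exec pattern is small
(sketch only; the real work in the patch is making the equivalent happen on
Windows):

    int     sv[2];
    pid_t   pid;

    socketpair(AF_UNIX, SOCK_STREAM, 0, sv);    /* anonymous connected pair */
    pid = fork();
    if (pid == 0)
    {
        /* child: becomes the standalone backend, talks over sv[1] */
        close(sv[0]);
        dup2(sv[1], 0);
        dup2(sv[1], 1);
        /* exec() the backend here */
    }
    else
    {
        /* parent: libpq uses sv[0] as the connection socket */
        close(sv[1]);
    }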

The following is still left:
 1. Error handling in all paths

 The modified version 2 contains error handling in all paths.

I didn't see that this patch was added to a commitfest  -- should it
have been?  I very much like Tom's proposed starting point for this
feature as a replacement for --single.  Hate to see this die on the
vine.  Would some testing on Windows be what's needed to get the ball
rolling?

merlin




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2012-11-10 16:24:22 -0500, Tom Lane wrote:
 If any of this stuff were getting used by external modules, changing it
 would be problematic ... but fortunately, it isn't, because we lack
 support for plug-in rmgrs.  So I'm not worried about back-patching the
 change, and would prefer to keep the 9.x branches in sync.

 XLR_BKP_BLOCK_* might be used by things like pg_lesslog and its surely
 used by xlogdump. Not sure if either are worth that much attention, but
 it seems worth noticing that such a change will break stuff.

Hm.  Okay, how about we leave the old macros in place in the back
branches?

regards, tom lane




Re: [HACKERS] Fix errcontext() function

2012-11-12 Thread Heikki Linnakangas

On 10.11.2012 11:46, Chen Huajun wrote:

Unfortunately not all compilers support varargs macros. I bumped into
this in February, see
http://archives.postgresql.org/message-id/4f3b72e0.8040...@enterprisedb.com.
My last attempt to fix this was
at http://archives.postgresql.org/pgsql-hackers/2012-04/msg00812.php.
That patch is probably good to go, I just got busy with other things
and forgot about it back then. Can you take a look at that
patch and see if I missed anything, please?


I think you are right, although the number of changed places is a little
bit large.
Thanks for your answer!


Ok, I've committed this patch now, it will be fixed in 9.3. Thanks for 
reminding me about this.


- Heikki




[HACKERS] Version 4.9 of PostgreSQL buildfarm client released

2012-11-12 Thread Andrew Dunstan
Version 4.9 of the PostgreSQL buildfarm client has been released and is 
available at 
https://github.com/downloads/PGBuildFarm/client-code/build-farm-4_9.tgz


Changes since version 4.8:

. adjust git status checking to allow for messages from recent git releases
. add a module to test FileTextArray_FDW as an example of checking an 
extension

. a few small bug fixes and tweaks


You will need this release if you are using or likely to use a fairly 
recent git version to avoid getting spurious git-Dirty failures.


Enjoy

cheers

andrew




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Andres Freund
On 2012-11-12 10:19:09 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2012-11-10 16:24:22 -0500, Tom Lane wrote:
  If any of this stuff were getting used by external modules, changing it
  would be problematic ... but fortunately, it isn't, because we lack
  support for plug-in rmgrs.  So I'm not worried about back-patching the
  change, and would prefer to keep the 9.x branches in sync.

  XLR_BKP_BLOCK_* might be used by things like pg_lesslog and its surely
  used by xlogdump. Not sure if either are worth that much attention, but
  it seems worth noticing that such a change will break stuff.

 Hm.  Okay, how about we leave the old macros in place in the back
 branches?

Sounds good to me. The RestoreBkpBlocks change seems unproblematic to
me. If anything its good that it has been renamed.

Thanks,

Andres




Re: [HACKERS] Fix errcontext() function

2012-11-12 Thread Alvaro Herrera
Heikki Linnakangas wrote:
 On 10.11.2012 11:46, Chen Huajun wrote:
 Unfortunately not all compilers support varargs macros. I bumped into
 this in February, see
 http://archives.postgresql.org/message-id/4f3b72e0.8040...@enterprisedb.com.
 My last attempt to fix this was
 at http://archives.postgresql.org/pgsql-hackers/2012-04/msg00812.php.
 That patch is probably good to go, I just got busy with other things
 and forgot about it back then. Can you take a look at that
 patch and see if I missed anything, please?
 
 I think you are right,although the number of changed place is a a little
 bit large.
 Thanks for your answer!
 
 Ok, I've committed this patch now, it will be fixed in 9.3. Thanks
 for reminding me about this.

Hopefully you noticed that contrib is broken.

-- 




[HACKERS] Dumping an Extension's Script

2012-11-12 Thread Dimitri Fontaine
Hi,

Please find attached to this email an RFC patch implementing the basics
of the pg_dump --extension-script option. After much discussion around
the concept of an inline extension, we decided last year that a good
first step would be pg_dump support for an extension's script.

The approach I've been using here is to dump the script from the catalog
current dependencies, which mean that a sequence of CREATE EXTENSION
followed by a number of ALTER EXTENSION … UPDATE … will be consolidated
into a single CREATE EXTENSION command in the dump, much the same as
with CREATE TABLE then ALTER TABLE … ADD COLUMN and the like.

Currently the option's behavior is the following, which looks sane to me
and is open for discussion: the dump's schema always includes the CREATE
EXTENSION commands you need. The extensions listed in the -X option
(which you can use more than once) will get dumped with their current
member objects in a script, inline.

To try the attached patch, you could do as following:

createdb foo
psql -c "create extension hstore" -d foo
pg_dump -X hstore -f /tmp/foo.sql foo

createdb bar
psql -1 -f /tmp/foo.sql -d bar

To be able to restore the dump, I've been adding some basic support to
the CREATE EXTENSION command so that it will find the data it needs from
the SQL command rather than the control file.

Note that the extension control file only contains information about how
to install an extension from a script file on disk. That's something we
don't need at all when installing the extension from a dump, using
either pg_restore or psql. We have some exceptions to that principle,
namely: requires (sets the search_path) and relocatable (found in the
catalogs, needs to survive dump/restore).

Given positive feedback on that way to attack the problem, the TODO list
includes:

 - document the new pg_dump --extension-script switch
 - add support for ALTER EXTENSION … WITH $$ script here $$;

The ALTER EXTENSION support is optional as far as pg_dump support goes,
it would be good to have it to make the User Interface complete.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

*** a/src/backend/commands/extension.c
--- b/src/backend/commands/extension.c
***
*** 75,80  typedef struct ExtensionControlFile
--- 75,81 
  	bool		superuser;		/* must be superuser to install? */
  	int			encoding;		/* encoding of the script file, or -1 */
  	List	   *requires;		/* names of prerequisite extensions */
+ 	char   *script;
  } ExtensionControlFile;
  
  /*
***
*** 577,586  parse_extension_control_file(ExtensionControlFile *control,
  }
  
  /*
!  * Read the primary control file for the specified extension.
   */
  static ExtensionControlFile *
! read_extension_control_file(const char *extname)
  {
  	ExtensionControlFile *control;
  
--- 578,587 
  }
  
  /*
!  * Create an ExtensionControlFile with default values.
   */
  static ExtensionControlFile *
! default_extension_control_file(const char *extname)
  {
  	ExtensionControlFile *control;
  
***
*** 593,598  read_extension_control_file(const char *extname)
--- 594,610 
 	control->superuser = true;
 	control->encoding = -1;
  
+ 	return control;
+ }
+ 
+ /*
+  * Read the primary control file for the specified extension.
+  */
+ static ExtensionControlFile *
+ read_extension_control_file(const char *extname)
+ {
+ 	ExtensionControlFile *control = default_extension_control_file(extname);
+ 
  	/*
  	 * Parse the primary control file.
  	 */
***
*** 858,866  execute_extension_script(Oid extensionOid, ExtensionControlFile *control,
  	CurrentExtensionObject = extensionOid;
  	PG_TRY();
  	{
! 		char	   *c_sql = read_extension_script_file(control, filename);
  		Datum		t_sql;
  
  		/* We use various functions that want to operate on text datums */
  		t_sql = CStringGetTextDatum(c_sql);
  
--- 870,883 
  	CurrentExtensionObject = extensionOid;
  	PG_TRY();
  	{
! 		char	   *c_sql;
  		Datum		t_sql;
  
+ 		if (control->script)
+ 			c_sql = control->script;
+ 		else
+ 			c_sql = read_extension_script_file(control, filename);
+ 
  		/* We use various functions that want to operate on text datums */
  		t_sql = CStringGetTextDatum(c_sql);
  
***
*** 1178,1183  void
--- 1195,1203 
  CreateExtension(CreateExtensionStmt *stmt)
  {
  	DefElem*d_schema = NULL;
+ 	DefElem*d_script = NULL;
+ 	DefElem*d_requires = NULL;
+ 	DefElem*d_relocatable = NULL;
  	DefElem*d_new_version = NULL;
  	DefElem*d_old_version = NULL;
  	char	   *schemaName;
***
*** 1229,1248  CreateExtension(CreateExtensionStmt *stmt)
   errmsg("nested CREATE EXTENSION is not supported")));
  
  	/*
! 	 * Read the primary control file.  Note we assume that it does not contain
! 	 * any non-ASCII data, so there is no need to worry about encoding at this
! 	 * point.
! 	 */
! 	pcontrol = 

Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Bruce Momjian
On Sat, Nov 10, 2012 at 05:59:54PM -0500, Bruce Momjian wrote:
 On Sat, Nov 10, 2012 at 02:45:54PM -0800, Jeff Janes wrote:
  On Sat, Nov 10, 2012 at 9:15 AM, Bruce Momjian br...@momjian.us wrote:
   On Fri, Nov  9, 2012 at 04:23:40PM -0800, Jeff Janes wrote:
   On Fri, Nov 9, 2012 at 3:06 PM, Bruce Momjian br...@momjian.us wrote:
   
Again, using SERIAL?
  
   Yep.
  
   Odd why yours is so much after.
  
  You didn't build git head under --enable-cassert, did you?
 
 Yikes, you got me!  I have not done performance testing in so long, I
 had forgotten I changed my defaults.  New numbers to follow.  Sorry.
 
  Any chance you can do a oprofile or gprof of head's pg_dump dumping
  out of head's server?  That really should be a lot faster (since
  commit eeb6f37d89fc60c6449ca12ef9e) than dumping out of 9.2 server.
  If it is not for you, I don't see how to figure it out without a
  profile of the slow system.
 
 Yes, coming.

OK, here are my results.  Again, apologies for posting non-linear
results based on assert builds:

 --  9.2 --   9.3 
 -- normal -- -- bin-up --   -- normal -- -- bin-up --
 dump   rest   dump   rest   dump   rest   dump   rest  pg_upgrade
  1  0.12   0.06   0.12   0.06   0.11   0.07   0.11   0.07  11.11
   1000  7.22   2.40   4.74   2.78   2.20   2.43   4.04   2.86  19.60
   2000  5.67   5.10   8.82   5.57   4.50   4.97   8.07   5.69  30.55
   4000 13.34  11.13  25.16  12.52   8.95  11.24  16.75  12.16  60.70
   8000 29.12  25.98  59.60  28.08  16.68  24.02  30.63  27.08 123.05
  16000 87.36  53.16 189.38  62.72  31.38  55.37  61.55  62.66 365.71


You can see the non-linear dump at 16k in 9.2, and the almost-linear in
9.3.  :-)

pg_upgrade shows non-linear, but that is probably because of the
non-linear behavior of 9.2, and because of the two non-linear loops that
Ants found, that I will address in a separate email.

Thanks for the feedback.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-12 Thread Robert Haas
On Fri, Nov 9, 2012 at 3:31 PM, Simon Riggs si...@2ndquadrant.com wrote:
 So what we're talking about here is a new mode for COPY, that when
 requested will pre-freeze tuples when loading into a newly
 created/truncated table. If the table isn't newly created/truncated
 then we'll just ignore it and continue. I see no need to throw an
 error, since that will just cause annoying usability issues.

Actually, why not just have it work always?  If people want to load
frozen tuples into a table that's not newly created/truncated, why not
let them?  Sure, there could be MVCC violations, but as long as the
behavior is opt-in, who cares?  I think it'd be useful to a lot of
people.

If we want to reduce (not eliminate) the potential MVCC issues, which
I think would be a good idea, we could take AccessExclusiveLock on the
table when COPY (FREEZE) is used.  Someone using an old snapshot but
accessing the table for the first time after AEL is released could
still see MVCC anomalies, but at least it would rule out things
changing in mid-query, which is the case that I think would be most
problematic.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Bruce Momjian
On Fri, Nov  9, 2012 at 12:50:34AM -0500, Tom Lane wrote:
 Jeff Janes jeff.ja...@gmail.com writes:
  Are sure the server you are dumping out of is head?
 
 I experimented a bit with dumping/restoring 16000 tables matching
 Bruce's test case (ie, one serial column apiece).  The pg_dump profile
 seems fairly flat, without any easy optimization targets.  But
 restoring the dump script shows a rather interesting backend profile:
 
samples  %        image name   symbol name
30861    39.6289  postgres AtEOXact_RelationCache
 9911 12.7268  postgres hash_seq_search
 2682  3.4440  postgres init_sequence
 2218  2.8482  postgres _bt_compare
 2120  2.7223  postgres hash_search_with_hash_value
 1976  2.5374  postgres XLogInsert
 1429  1.8350  postgres CatalogCacheIdInvalidate
 1282  1.6462  postgres LWLockAcquire
 973   1.2494  postgres LWLockRelease
 702   0.9014  postgres hash_any
 
 The hash_seq_search time is probably mostly associated with
 AtEOXact_RelationCache, which is run during transaction commit and scans
 the relcache hashtable looking for tables created in the current
 transaction.  So that's about 50% of the runtime going into that one
 activity.

Thanks for finding this.  What is odd is that I am not seeing non-linear
restores at 16k in git head, so I am confused how something that
consumes ~50% of backend time could still perform linearly.  Would this
consume 50% at lower table counts?

I agree we should do something, even if this is a rare case, because 50%
is a large percentage.

 There are at least three ways we could whack that mole:
 
 * Run the psql script in --single-transaction mode, as I was mumbling
 about the other day.  If we were doing AtEOXact_RelationCache only once,
 rather than once per CREATE TABLE statement, it wouldn't be a problem.
 Easy but has only a narrow scope of applicability.
 
 * Keep a separate list (or data structure of your choice) so that
 relcache entries created in the current xact could be found directly
 rather than having to scan the whole relcache.  That'd add complexity
 though, and could perhaps be a net loss for cases where the relcache
 isn't so bloated.

I like this one.  Could we do it only when the cache gets to be above a
certain size, to avoid any penalty?
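
Something like this, perhaps (sketch, invented names): remember the entries
created in the current transaction in a small fixed array, and only fall
back to the full hash_seq_search() when that array overflows:

    #define MAX_EOXACT_LIST 32

    static Oid  eoxact_list[MAX_EOXACT_LIST];
    static int  eoxact_list_len = 0;
    static bool eoxact_list_overflowed = false;

    static void
    remember_created_relcache_entry(Oid relid)
    {
        if (eoxact_list_len < MAX_EOXACT_LIST)
            eoxact_list[eoxact_list_len++] = relid;
        else
            eoxact_list_overflowed = true;
    }

    /*
     * At transaction end: if the list never overflowed, look up just those
     * OIDs in the relcache hash; otherwise do today's full scan.
     */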

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages

2012-11-12 Thread Robert Haas
On Tue, Jul 31, 2012 at 8:09 AM, Amit kapila amit.kap...@huawei.com wrote:
 Based on the discussion and suggestions in this mail chain, following
 features can be implemented:

 1. To compute the value of max LSN in data pages based on user input
 whether he wants it for an individual

   file,  a particular directory or whole database.

 2a. To search the available WAL files for the latest checkpoint record
 and prints the value.
 2b. To search the available WAL files for the latest checkpoint record
 and recreates a pg_control file pointing at

 that checkpoint.

 I have kept both options to address different kind of corruption
 scenarios.

 I think I can see all of those things being potentially useful.  There
 are a couple of pending patches that will revise the WAL format
 slightly; not sure how much those are likely to interfere with any
 development you might do on (2) in the meantime.

 Based on above conclusion, I have prepared a patch which implements Option-1

I wonder if we shouldn't make this a separate utility, rather than
something that is part of pg_resetxlog.  Anyone have a thought on that
topic?
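
(For context, the guts of option 1 are just a scan of every data-page
header; roughly, ignoring endianness and error handling, and assuming the
standard 8K block size and that pd_lsn is the first field of the page
header:)

    #define BLCKSZ 8192

    static uint64
    max_lsn_of_segment(FILE *fp)
    {
        unsigned char   page[BLCKSZ];
        uint64          max_lsn = 0;

        while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
        {
            uint32      xlogid = *(uint32 *) (page + 0);
            uint32      xrecoff = *(uint32 *) (page + 4);
            uint64      lsn = ((uint64) xlogid << 32) | xrecoff;

            if (lsn > max_lsn)
                max_lsn = lsn;
        }
        return max_lsn;
    }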

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-12 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Nov 9, 2012 at 3:31 PM, Simon Riggs si...@2ndquadrant.com wrote:
 So what we're talking about here is a new mode for COPY, that when
 requested will pre-freeze tuples when loading into a newly
 created/truncated table. If the table isn't newly created/truncated
 then we'll just ignore it and continue. I see no need to throw an
 error, since that will just cause annoying usability issues.

 Actually, why not just have it work always?  If people want to load
 frozen tuples into a table that's not newly created/truncated, why not
 let them?  Sure, there could be MVCC violations, but as long as the
 behavior is opt-in, who cares?  I think it'd be useful to a lot of
 people.

I thought about that too, but there's a big problem.  It wouldn't be
just MVCC that would be broken, but transactional integrity: if the
COPY fails partway through, the already-loaded rows still look valid.
The new-file requirement provides a way to roll them back.

I'm willing to have an option that compromises MVCC semantics
transiently, but giving up transactional integrity seems a bit much.

regards, tom lane




Re: [HACKERS] Fwd: question on foreign key lock

2012-11-12 Thread Robert Haas
On Thu, Nov 8, 2012 at 3:45 AM, Filip Rembiałkowski
filip.rembialkow...@gmail.com wrote:
 maybe this is a better group for this question?

 I can't see why creating foreign key on table A referencing table B,
 generates an AccessExclusiveLock on B.
 It seems (to a layman :-) ) that only writes to B should be blocked.

 I'm really interested if this is either expected effect or any open TODO
 item or suboptimal behavior of postgres.

This comment explains it:

/*
 * Grab an exclusive lock on the pk table, so that someone doesn't delete
 * rows out from under us. (Although a lesser lock would do for that
 * purpose, we'll need exclusive lock anyway to add triggers to the pk
 * table; trying to start with a lesser lock will just create a risk of
 * deadlock.)
 */
pkrel = heap_openrv(fkconstraint->pktable, AccessExclusiveLock);

Concurrent DDL is something that's been discussed in detail on this
list in the past; unfortunately, there are some tricky race conditions
around the shared invalidation queue and SnapshotNow that make it hard to
implement properly.  I'm hoping to have some time to work on this at
some point, but it hasn't happened yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-12 Thread Robert Haas
On Mon, Nov 12, 2012 at 11:20 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Fri, Nov 9, 2012 at 3:31 PM, Simon Riggs si...@2ndquadrant.com wrote:
 So what we're talking about here is a new mode for COPY, that when
 requested will pre-freeze tuples when loading into a newly
 created/truncated table. If the table isn't newly created/truncated
 then we'll just ignore it and continue. I see no need to throw an
 error, since that will just cause annoying usability issues.

 Actually, why not just have it work always?  If people want to load
 frozen tuples into a table that's not newly created/truncated, why not
 let them?  Sure, there could be MVCC violations, but as long as the
 behavior is opt-in, who cares?  I think it'd be useful to a lot of
 people.

 I thought about that too, but there's a big problem.  It wouldn't be
 just MVCC that would be broken, but transactional integrity: if the
 COPY fails partway through, the already-loaded rows still look valid.
 The new-file requirement provides a way to roll them back.

 I'm willing to have an option that compromises MVCC semantics
 transiently, but giving up transactional integrity seems a bit much.

Hmm, good point.  There might be some way around that, but figuring it
out is probably material for a separate patch.

But I guess that raises the question - should COPY (FREEZE) silently
ignore the option for not-new relfilenodes, or should it error out?
Simon proposed the former, but I'm wondering if the latter would be
better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 On Fri, Nov  9, 2012 at 12:50:34AM -0500, Tom Lane wrote:
 The hash_seq_search time is probably mostly associated with
 AtEOXact_RelationCache, which is run during transaction commit and scans
 the relcache hashtable looking for tables created in the current
 transaction.  So that's about 50% of the runtime going into that one
 activity.

 Thanks for finding this.  What is odd is that I am not seeing non-linear
 restores at 16k in git head, so I am confused how something that
 consumes ~50% of backend time could still perform linearly.  Would this
 consume 50% at lower table counts?

No, the cost from that is O(N^2), though with a pretty small multiplier.
16K tables is evidently where the cost reaches the point of being
significant --- if you went up from there, you'd probably start to
notice an overall O(N^2) behavior.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Doc patch: Document names of automatically created constraints and indexes

2012-11-12 Thread Karl O. Pinc
Hi,

Attached is a patch (against head) which documents the
name of the constraint and index created when using
PRIMARY KEY with CREATE TABLE, and the name of the
index created when using UNIQUE.  I haven't read all
the docs recently but I don't believe this is presently
documented.

It's unclear to me that this is the right approach
but perhaps this will start a discussion that finds
the right approach.

The big problem I see is that these are somewhat of
an implementation internal while at the same
time being something that the user might have to
concern themselves with.  

First, the constraint and index
names are in the namespace used by the user so there
is potential for collision with user-defined
constraints and indexes.

Second, the only way (I know of) to remove primary-key-ness
is to drop the primary key constraint, by name.

This led me right into another thought:
It would be nice to have ALTER TABLE be able to
drop the primary key constraint.  (Then the
user would not need to know the name of the
constraint related to primary-key-ness.)
However, it is probably more useful to be able
to drop the constraint (and attendant foreign
key meta-information) separately from the unique
index associated with the primary key, if for no
other reason than index re-creation can be expensive
and missing indexes make bad things happen.
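
For illustration, the default names as they come out today look like this
(subject to the truncation and collision corner cases discussed later in
the thread; table and column names here are hypothetical):

    CREATE TABLE items (id int PRIMARY KEY, sku text UNIQUE);

    \d items
    -- Indexes:
    --     "items_pkey" PRIMARY KEY, btree (id)
    --     "items_sku_key" UNIQUE, btree (sku)

    ALTER TABLE items DROP CONSTRAINT items_pkey;  -- drop primary-key-ness by name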

This patch is the improvement I could come up
with.  Someone else can decide to commit or reject,
I don't believe I can contribute much more on this at this
time.

Regards,

Karl k...@meme.com
Free Software:  You don't pay back, you pay forward.
 -- Robert A. Heinlein

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 445ca40..da87574 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -486,6 +486,14 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI
   primary key constraint defined for the table.  (Otherwise it
   would just be the same constraint listed twice.)
  /para
+
+ <para>
+  The name of the index created when <literal>UNIQUE</literal> is
+  used is <literal><replaceable
+  class="PARAMETER">tablename</replaceable>_<replaceable
+  class="PARAMETER">colname</replaceable>_key</literal>.
+ </para>
+
 </listitem>
</varlistentry>
 
@@ -514,6 +522,13 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI
   different from other sets of columns named by any unique
   constraint defined for the same table.
  /para
+
+ <para>
+   The name of the primary key constraint, and the name of the
+   index which enforces uniqueness, is <literal><replaceable
+   class="PARAMETER">tablename</replaceable>_pkey</literal>.
+ </para>
+
 </listitem>
</varlistentry>
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages

2012-11-12 Thread Alvaro Herrera
Robert Haas escribió:
 On Tue, Jul 31, 2012 at 8:09 AM, Amit kapila amit.kap...@huawei.com wrote:

  I think I can see all of those things being potentially useful.  There
  are a couple of pending patches that will revise the WAL format
  slightly; not sure how much those are likely to interfere with any
  development you might do on (2) in the meantime.
 
  Based on above conclusion, I have prepared a patch which implements Option-1
 
 I wonder if we shouldn't make this a separate utility, rather than
 something that is part of pg_resetxlog.  Anyone have a thought on that
 topic?

That thought did cross my mind too.

-- 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix errcontext() function

2012-11-12 Thread Heikki Linnakangas

On 12.11.2012 17:52, Alvaro Herrera wrote:

Hopefully you noticed that contrib is broken.


Oops.. fixed.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Jeff Davis
On Mon, 2012-11-12 at 09:44 +0100, Markus Wanner wrote:
 Can we simply write a progress indicator to pg_control or someplace
 saying that all pages up to X of relation Y are supposed to have valid
 checksums?

pg_control would not be the right place for that structure. It's
intended to be fixed-size (it's just a serialized C structure) and it
should be smaller than a sector so that it doesn't suffer from torn
pages.

Not a bad approach overall, but requires some kind of new structure. And
that increases the risk that it doesn't make 9.3.

Right now, I'm honestly just trying to get the simplest approach that
doesn't restrict these kinds of ideas if we want to do them later.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doc patch: Document names of automatically created constraints and indexes

2012-11-12 Thread Tom Lane
Karl O. Pinc k...@meme.com writes:
 Attached is a patch (against head) which documents the
 name of the constraint and index created when using
 PRIMARY KEY with CREATE TABLE, and the name of the
 index created when using UNIQUE.  I haven't read all
 the docs recently but I don't believe this is presently
 documented.

This is not actually correct: it ignores the corner cases where the name
would be overlength or conflict with something else.

Personally I don't think this should be documented, as it's an
implementation detail that we've changed in the past and may change
again.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Robert Haas
On Fri, Nov 9, 2012 at 12:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jeff Janes jeff.ja...@gmail.com writes:
 Are sure the server you are dumping out of is head?

 I experimented a bit with dumping/restoring 16000 tables matching
 Bruce's test case (ie, one serial column apiece).  The pg_dump profile
 seems fairly flat, without any easy optimization targets.  But
 restoring the dump script shows a rather interesting backend profile:

 samples  %        image name   symbol name
 30861    39.6289  postgres AtEOXact_RelationCache
 9911 12.7268  postgres hash_seq_search
 2682  3.4440  postgres init_sequence
 2218  2.8482  postgres _bt_compare
 2120  2.7223  postgres hash_search_with_hash_value
 1976  2.5374  postgres XLogInsert
 1429  1.8350  postgres CatalogCacheIdInvalidate
 1282  1.6462  postgres LWLockAcquire
 973   1.2494  postgres LWLockRelease
 702   0.9014  postgres hash_any

 The hash_seq_search time is probably mostly associated with
 AtEOXact_RelationCache, which is run during transaction commit and scans
 the relcache hashtable looking for tables created in the current
 transaction.  So that's about 50% of the runtime going into that one
 activity.

 There are at least three ways we could whack that mole:

 * Run the psql script in --single-transaction mode, as I was mumbling
 about the other day.  If we were doing AtEOXact_RelationCache only once,
 rather than once per CREATE TABLE statement, it wouldn't be a problem.
 Easy but has only a narrow scope of applicability.

 * Keep a separate list (or data structure of your choice) so that
 relcache entries created in the current xact could be found directly
 rather than having to scan the whole relcache.  That'd add complexity
 though, and could perhaps be a net loss for cases where the relcache
 isn't so bloated.

 * Limit the size of the relcache (eg by aging out
 not-recently-referenced entries) so that we aren't incurring O(N^2)
 costs for scripts touching N tables.  Again, this adds complexity and
 could be counterproductive in some scenarios.

Although there may be some workloads that access very large numbers of
tables repeatedly, I bet that's not typical.  Rather, I bet that a
session which accesses 10,000 tables is most likely to access them
just once each - and right now we don't handle that case very well;
this is not the first complaint about big relcaches causing problems.
On the flip side, we don't want workloads that exceed some baked-in
cache size to fall off a cliff.  So I think we should be looking for a
solution that doesn't put a hard limit on the size of the relcache,
but does provide at least some latitude to get rid of old entries.

So maybe something like this.  Add a flag to each relcache entry
indicating whether or not it has been used.  After adding 1024 entries
to the relcache, scan all the entries: clear the flag if it's set,
flush the entry if it's already clear.  This allows the size of the
relcache to grow without bound, but only if we're continuing to access
the old tables in between adding new ones to the mix.  As an
additional safeguard, we could count the number of toplevel SQL
commands that have been executed and require that a flush not be
performed more often than, say, every 64 toplevel SQL commands.  That
way, if a single operation on an inheritance parent with many children
sucks a lot of stuff into the relcache, we'll avoid cleaning it out
too quickly.

Maybe this is all too ad-hoc, but I feel like we don't need to
overengineer this.  The existing system is fine in 99% of the cases,
so we really only need to find a way to detect the really egregious
case where we are doing a neverending series of one-time table
accesses and apply a very light tap to avoid the pain point in that
case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Bruce Momjian
On Sat, Nov 10, 2012 at 12:41:44PM -0500, Bruce Momjian wrote:
 On Sat, Nov 10, 2012 at 07:17:34PM +0200, Ants Aasma wrote:
  On Sat, Nov 10, 2012 at 7:10 PM, Bruce Momjian br...@momjian.us wrote:
   I am confused why you see a loop.  transfer_all_new_dbs() does a
   merge-join of old/new database names, then calls gen_db_file_maps(),
   which loops over the relations and calls create_rel_filename_map(),
   which adds to the map via array indexing.   I don't see any file loops
   in there --- can you be more specific?
  
  Sorry, I was too tired when posting that. I actually meant
  transfer_single_new_db(). More specifically the profile clearly showed
  that most of the time was spent in the two loops starting on lines 193
  and 228.
 
 Wow, you are right on target.  I was so focused on making logical
 lookups linear that I did not consider file system vm/fsm and file
 extension lookups.  Let me think a little and I will report back. 
 Thanks.

OK, I have had some time to think about this.  What the current code
does is, for each database, get a directory listing to know about any
vm, fsm, and 1gig extents that exist in the directory.  It caches the
directory listing and does full array scans looking for matches.  If the
tablespace changes, it creates a new directory cache and throws away the
old one.  This code certainly needs improvement!

I can think of two solutions.  The first would be to scan the database
directory, and any tablespaces used by the database, sort it, then allow
binary search of the directory listing looking for file prefixes that
match the current relation.

The second approach would be to simply try to copy the fsm, vm, and
extent files, and ignore any ENOEXIST errors.  This allows code
simplification.  The downside is that it doesn't pull all files with
matching prefixes --- it requires pg_upgrade to _know_ what suffixes
might exist in that directory.  Second, it assumes there can be no
number gaps in the file extent numbering (is that safe?).

I need recommendations on which direction to persue;  this would only be
for 9.3.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doc patch: Document names of automatically created constraints and indexes

2012-11-12 Thread Karl O. Pinc
On 11/12/2012 10:40:00 AM, Tom Lane wrote:
 Karl O. Pinc k...@meme.com writes:
  Attached is a patch (against head) which documents the
  name of the constraint and index created when using
  PRIMARY KEY with CREATE TABLE, and the name of the
  index created when using UNIQUE. 

 Personally I don't think this should be documented, as it's an
 implementation detail that we've changed in the past and may change
 again.

Ok.  

Could ALTER TABLE use an option to drop the
primary key constraint?  I needed to do that,
found it was not obvious, and this lead me to 
try to improve things.


Karl k...@meme.com
Free Software:  You don't pay back, you pay forward.
 -- Robert A. Heinlein



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Suggestion for --truncate-tables to pg_restore

2012-11-12 Thread Karl O. Pinc
Hi,

Attached is version 4.  Version 3 no longer
built against head.

On 10/16/2012 09:48:06 PM, Karl O. Pinc wrote:
 
 On 09/23/2012 08:52:07 PM, Karl O. Pinc wrote:
 
  On 09/23/2012 12:24:27 AM, Karl O. Pinc wrote:
 
   On 09/23/2012 12:19:07 AM, Karl O. Pinc wrote:
On 09/21/2012 10:54:05 AM, Karl O. Pinc wrote:
 On 09/20/2012 12:24:49 PM, Karl O. Pinc wrote:
 
  I've had problems using pg_restore --data-only when
  restoring individual schemas (which contain data which
  has had bad things done to it).  --clean does not work
  well because of dependent objects in other schemas.

Since there wasn't much more to do I've gone ahead
and written the patch.  Works for me.

Against git master.
Passes regression tests, but there's no regression
tests for pg_restore so this does not say much.
Since there's no regression tests I've not written one.

Since this is a real patch for application I've given
it a new name (it's not a v2).

Truncate done right before COPY, since that's what
the parallel restores do.
 
 
 Karl k...@meme.com
 Free Software:  You don't pay back, you pay forward.
  -- Robert A. Heinlein
 
 

--quoted attachment--
 
 -- 
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers
 




Karl k...@meme.com
Free Software:  You don't pay back, you pay forward.
 -- Robert A. Heinlein

diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
index 1b9db6b..23b0d16 100644
--- a/doc/src/sgml/ref/pg_restore.sgml
+++ b/doc/src/sgml/ref/pg_restore.sgml
@@ -546,6 +546,26 @@
  </varlistentry>

  <varlistentry>
+  <term><option>--truncate-tables</option></term>
+  <listitem>
+   <para>
+    This option is only relevant when performing a data-only
+    restore.  It instructs <application>pg_restore</application>
+    to execute commands to truncate the target tables while the
+    data is reloaded.  Use this when restoring tables or schemas
+    and <option>--clean</option> cannot be used because dependent
+    objects would be destroyed.
+   </para>
+
+   <para>
+    The <option>--disable-triggers</option> option will almost
+    always need to be used in conjunction with this option to
+    disable check constraints on foreign keys.
+   </para>
+  </listitem>
+ </varlistentry>
+
+ <varlistentry>
   <term><option>--use-set-session-authorization</option></term>
   <listitem>
    <para>
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 3b49395..0aaf1d3 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -101,6 +101,8 @@ typedef struct _restoreOptions
 	int			noTablespace;	/* Don't issue tablespace-related commands */
 	int			disable_triggers;		/* disable triggers during data-only
 		 * restore */
+	int			truncate_tables;		/* truncate tables during data-only
+		 * restore */
 	int			use_setsessauth;/* Use SET SESSION AUTHORIZATION commands
  * instead of OWNER TO */
 	int			no_security_labels;		/* Skip security label entries */
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 1fead28..c7fdc79 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -309,6 +309,11 @@ RestoreArchive(Archive *AHX)
 	if (ropt->createDB && ropt->single_txn)
 		exit_horribly(modulename, "-C and -1 are incompatible options\n");
 
+	/* When the schema is dropped and re-created then no point
+	 * truncating tables. */
+	if (ropt->dropSchema && ropt->truncate_tables)
+		exit_horribly(modulename, "-c and --truncate-tables are incompatible options\n");
+
 	/*
 	 * If we're going to do parallel restore, there are some restrictions.
 	 */
@@ -403,6 +408,10 @@ RestoreArchive(Archive *AHX)
 		}
 	}
 
+	/* Truncate tables only when restoring data. */
+	if (!ropt->dataOnly && ropt->truncate_tables)
+		exit_horribly(modulename, "--truncate-tables requires the --data-only option\n");
+
 	/*
 	 * Setup the output file if necessary.
 	 */
@@ -562,6 +571,7 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te,
 	int			retval = 0;
 	teReqs		reqs;
 	bool		defnDumped;
+	bool		truncate;
 
 	AH->currentTE = te;
 
@@ -696,14 +706,21 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te,
 		 * server, so no need to see if we should issue BEGIN.
 		 */
 		StartTransaction(AH);
+		truncate = 1;
+	} else
+		/* Truncate the table when asked to. */
+		truncate = ropt->truncate_tables;
 
+	if (truncate) {
 		/*
 		 * If the server version is >= 8.4, make sure we issue
 		 * TRUNCATE with ONLY so that child tables are not
-		 * wiped.
+		 * wiped.  If we don't know the server version
+		 * then err on the side of safety.
 		 */
 		ahprintf(AH, "TRUNCATE TABLE %s%s;\n\n",
- 

Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Jeff Davis
On Sun, 2012-11-11 at 23:55 -0500, Greg Smith wrote:
 Adding an initdb option to start out with everything checksummed seems 
 an uncontroversial good first thing to have available.

OK, so here's my proposal for a first patch (changes from Simon's
patch):

  * Add a flag to the postgres executable indicating that it should use
checksums on everything. This would only be valid if bootstrap mode is
also specified.
  * Add a multi-state checksums flag in pg_control, that would have
three states: OFF, ENABLING, and ON. It would only be set to ON during
bootstrap, and in this first patch, it would not be possible to set
ENABLING.
  * Remove GUC and use this checksums flag everywhere.
  * Use the TLI field rather than the version field of the page header.
  * Incorporate page number into checksum calculation (already done).
  
Does this satisfy the requirements for a first step? Does it interfere
with potential future work?

 Won't a pg_checksums program just grow until it looks like a limited 
 version of vacuum though?

We can dig into the details of that later, but I don't think it's
useless, even if we do have per-table (or better) checksums. For
instance, it would be useful to verify backups offline.

I think it's a legitimate concern that we might reinvent some VACUUM
machinery. Ideally, we'd get better online migration tools for checksums
(perhaps using VACUUM) fast enough that nobody will bother introducing
that kind of bloat into pg_checksums.

 I think it's useful to step back for a minute and consider the larger 
 uncertainty an existing relation has, which amplifies just how ugly this 
 situation is.  The best guarantee I think online checksumming can offer 
 is to tell the user after transaction id X, all new data in relation R 
 is known to be checksummed.

It's slightly better than that. It's more like: we can tell you if any
of your data gets corrupted after transaction X. If old data is
corrupted before transaction X, then there's nothing we can do. But if
it's corrupted after transaction X (even if it's old data), the
checksums should catch it.

 Unless you do this at initdb time, any 
 conversion case is going to have the possibility that a page is 
 corrupted before you get to it--whether you're adding the checksum as 
 part of a let's add them while we're writing anyway page update or the 
 conversion tool is hitting it.

Good point.

 That's why I don't think anyone will find online conversion really 
 useful until they've done a full sweep updating the old pages.

I don't entirely agree. A lot of times, you just want to know whether
your disk is changing your data out from under you. Maybe you miss some
cases and maybe not all of your data is protected, but just knowing
which disks need to be replaced, and which RAID controllers not to buy
again, is quite valuable. And the more data you get checksummed the
faster you'll find out.

 One of the really common cases I was expecting here is that conversions 
 are done by kicking off a slow background VACUUM CHECKSUM job that might 
 run in pieces.

Right now I'm focused on the initial patch and other fairly immediate
goals, so I won't address this now. But I don't want to cut off the
conversation, either.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages

2012-11-12 Thread Alvaro Herrera
Noah Misch wrote:
 On Sun, Nov 11, 2012 at 02:18:11AM -0300, Alvaro Herrera wrote:

  I will
  move all the open patches to CF3, unless someone beats me to it.  I
  probably won't be able to get anything done tomorrow, so if somebody has
  a boring Sunday I would appreciate the help.
 
 Likewise.

Many thanks.  I have closed the CF (just 3 days before the next one
starts, which is somewhat depressing).

-- 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pg_upgrade speed for many tables

2012-11-12 Thread Jeff Janes
On Mon, Nov 5, 2012 at 12:08 PM, Bruce Momjian br...@momjian.us wrote:
 Magnus reported that a customer with a million tables was finding
 pg_upgrade slow.  I had never considered many table to be a problem, but
 decided to test it.  I created a database with 2k tables like this:

 CREATE TABLE test1990 (x SERIAL);

 Running the git version of pg_upgrade on that took 203 seconds.  Using
 synchronous_commit=off dropped the time to 78 seconds.  This was tested
 on magnetic disks with a write-through cache.  (No change on an SSD with
 a super-capacitor.)

 I don't see anything unsafe about having pg_upgrade use
 synchronous_commit=off.  I could set it just for the pg_dump reload, but
 it seems safe to just use it always.  We don't write to the old cluster,
 and if pg_upgrade fails, you have to re-initdb the new cluster anyway.

 Patch attached.  I think it should be applied to 9.2 as well.
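
(For reference, a schema like the quoted test case can be generated with a
DO block along these lines; the loop bound and table-name pattern are just
illustrative:

    DO $$
    BEGIN
        FOR i IN 1..2000 LOOP
            EXECUTE format('CREATE TABLE test%s (x SERIAL)', i);
        END LOOP;
    END
    $$;
)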

Is turning off synchronous_commit enough?  What about turning off fsync?

When I'm doing a pg_upgrade with thousands of tables, the shutdown
checkpoint after restoring the dump to the new cluster takes a very
long time, as the writer drains its operation table by opening and
individually fsync-ing thousands of files.  This takes about 40 ms per
file, which I assume is a combination of slow lap-top disk drive, and
a strange deal with ext4 which makes fsyncing a recently created file
very slow.   But even with faster hdd, this would still be a problem
if it works the same way, with every file needing 4 rotations to be
fsynced and this happens in serial.

Worse, the shutdown only waits for the default of 60 seconds for the
shutdown to take place before it throws an error and the entire
pg_upgrade gives up.  It seems to me that either the -t setting should
be increased, or should be an option to pg_upgrade.

My work around was to invoke a system-wide sync a couple seconds
after the 'pg_ctl stop' is initiated.  Flushing the files wholesale
seems to work to make the checkpoint writer rapidly find it has
nothing to do when it tries to flush them retail.

Anyway, the reason I think turning fsync off might be reasonable is
that as soon as the new cluster is shut down, pg_upgrade starts
overwriting most of those just-fsynced file with other files from the
old cluster, and AFAICT makes no effort to fsync them.  So until there
is a system-wide sync after the pg_upgrade finishes, your new cluster
is already in mortal danger anyway.

Cheers,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pg_upgrade speed for many tables

2012-11-12 Thread Jeff Davis
On Mon, 2012-11-12 at 10:29 -0800, Jeff Janes wrote:
 When I'm doing a pg_upgrade with thousands of tables, the shutdown
 checkpoint after restoring the dump to the new cluster takes a very
 long time, as the writer drains its operation table by opening and
 individually fsync-ing thousands of files.

This reminds me of the fix I did for initdb to sync the files. I think
we do need to make sure they are sync'd, because ext4 can keep buffers
around for quite a long time without cleaning them.

I ended up using sync_file_range(..., SYNC_FILE_RANGE_WRITE) on linux,
and posix_fadvise(..., POSIX_FADV_DONTNEED) on everything else, and that
made subsequent fsyncs more efficient.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol

2012-11-12 Thread Simon Riggs
On 10 September 2012 17:50, Tom Lane t...@sss.pgh.pa.us wrote:

 The point of the proposal that I am making is to have a simple,
 low-maintenance solution for people who need a single-application
 database.  A compromise somewhere in the middle isn't likely to be an
 improvement for anybody.  For instance, if you want to have additional
 connections, you open up a whole collection of communication and
 authentication issues, which potential users of a single-application
 database don't want to cope with.

So the proposal is to implement a database that can't ever have 2 or
more connections. And so any data it stores cannot ever be accessed by
another connection, so all forms of replication are excluded and all
maintenance actions force the database to be unavailable for a period
of time. Those two things are barriers of the most major kind to
anybody working in an enterprise with connected data and devices. The
only people that want this are people with a very short term view of
the purpose of their applications, and disregard for the value and
permanence of the data stored. They may not want to cope with those
issues *now* but they will later and won't thank us for implementing
it in a way that means it can never be achieved.

To be honest, I can't believe I'm reading this.

And worse, it's on our Don't Want list, and nobody has said stop.

It's almost impossible to purchase a CPU these days that doesn't have
multiple cores, so the whole single-process architecture is just dead.
Yes, we want Postgres installed everywhere, but this isn't the way to
achieve that.

I agree we should allow a PostgreSQL installation to work for a single
user, but I don't see that requires other changes. This idea will
cause endless bugs, thinkos and severely waste our time. So without a
much better justification, I don't think we should do this.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-12 Thread Simon Riggs
On 12 November 2012 16:22, Robert Haas robertmh...@gmail.com wrote:

 But I guess that raises the question - should COPY (FREEZE) silently
 ignore the option for not-new relfilenodes, or should it error out?
 Simon proposed the former, but I'm wondering if the latter would be
 better.

It's got some complex pre-conditions, so having scripts fail because
you mis-specified FREEZE would be annoying.

The option indicates I accept the potential MVCC violation, not it
will always freeze.

If there is a better name...

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Simon Riggs
On 11 November 2012 23:24, Tom Lane t...@sss.pgh.pa.us wrote:

 Practically all WAL record types that touch multiple pages have some
 bug of this type.  In addition to btree_xlog_split, I found that
 heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
 spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
 locks as required to make their updates safe for concurrent queries.
 (I'm not totally sure about ginRedoDeletePage, but the original action
 definitely locks the pages simultaneously, and it's not clear that it's
 safe not to.)  Most of these are okay in cases without any full-page
 images, but could fail if the wrong subset of the pages-to-be-touched
 were processed by RestoreBkpBlocks.  Some had bugs even without that :-(

Hmm, not good. Thanks for spotting.

Do these changes do anything to actions that occur across multiple
records? I assume not and think those are OK, agreed?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Simon Riggs
On 12 November 2012 14:51, Tom Lane t...@sss.pgh.pa.us wrote:

 The 9.0 code is broken, however. In
 9.0, when a child page is split, the parent and new children are kept
 locked until the downlinks are inserted/updated. If a concurrent scan
 comes along and sees that incomplete state, it will miss tuples on the
 new right siblings. We rely on a rm_cleanup operation at the end of WAL
 replay to fix that situation, if the downlink insertion record is not
 there. I don't see any easy way to fix that, unfortunately. Perhaps we
 could backpatch the 9.1 rewrite, now that it's gotten some real-world
 testing, but it was a big change so I don't feel very comfortable doing
 that.

 Me either.  Given the lack of field complaints, I think we're better
 advised to just leave it unfixed in 9.0.  It'd not be a step forward
 if we broke something trying to make this work.

Agreed. Most people running 9.0 with GIST indexes have already upgraded.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Markus Wanner
Jeff,

On 11/12/2012 06:52 PM, Jeff Davis wrote:
 OK, so here's my proposal for a first patch (changes from Simon's
 patch):
 
   * Add a flag to the postgres executable indicating that it should use
 checksums on everything. This would only be valid if bootstrap mode is
 also specified.
   * Add a multi-state checksums flag in pg_control, that would have
 three states: OFF, ENABLING, and ON. It would only be set to ON during
 bootstrap, and in this first patch, it would not be possible to set
 ENABLING.
   * Remove GUC and use this checksums flag everywhere.
   * Use the TLI field rather than the version field of the page header.
   * Incorporate page number into checksum calculation (already done).
   
 Does this satisfy the requirements for a first step? Does it interfere
 with potential future work?

As described before in this thread, I think we might be able to do
without the "has checksum" bit, as yet another simplification. But I
don't object to adding it, either.

 It's slightly better than that. It's more like: we can tell you if any
 of your data gets corrupted after transaction X. If old data is
 corrupted before transaction X, then there's nothing we can do. But if
 it's corrupted after transaction X (even if it's old data), the
 checksums should catch it.

I (mis?)read that as Greg referring to the intermediate (enabling)
state, where pages with old data may or may not have a checksum, yet. So
I think it was an argument against staying in that state any longer than
necessary.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 11 November 2012 23:24, Tom Lane t...@sss.pgh.pa.us wrote:
 Practically all WAL record types that touch multiple pages have some
 bug of this type.  In addition to btree_xlog_split, I found that
 heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
 spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
 locks as required to make their updates safe for concurrent queries.
 (I'm not totally sure about ginRedoDeletePage, but the original action
 definitely locks the pages simultaneously, and it's not clear that it's
 safe not to.)  Most of these are okay in cases without any full-page
 images, but could fail if the wrong subset of the pages-to-be-touched
 were processed by RestoreBkpBlocks.  Some had bugs even without that :-(

 Hmm, not good. Thanks for spotting.

 Do these changes do anything to actions that occur across multiple
 records? I assume not and think those are OK, agreed?

Right, we were and still are assuming that any state that exists between
WAL records is consistent and safe to expose to hot-standby queries.
The important thing here is that these WAL replay functions were failing
to ensure that their own actions appear atomic to onlookers.  This is
basically a hangover from pre-hot-standby coding conventions, when no such
concern existed.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Tom Lane
Here's an updated patch that fixes the GIST replay functions as well as
the other minor issues that were mentioned.  Barring objections, I'll
set about back-patching this as far as 9.0.

One thing that could use verification is my fix for
gistRedoPageSplitRecord.  AFAICS, the first page listed in the WAL
record is always the original page, and the ones following it are
pages that were split off from it, and can (as yet) only be reached by
following right-links from the original page.  As such, it should be
okay to release locks on the non-first pages as soon as we've written
them.  We have to hold lock on the original page though to avoid letting
readers follow dangling right-links.  Also, the update of
NSN/FOLLOW_RIGHT on the child page (if any) has to be done atomically
with all this, so that has to be done before releasing the original-page
lock as well.  Does that sound right?

regards, tom lane



restore-backup-blocks-individually-3.patch.gz

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Bruce Momjian
On Mon, Nov 12, 2012 at 12:09:08PM -0500, Bruce Momjian wrote:
 OK, I have had some time to think about this.  What the current code
 does is, for each database, get a directory listing to know about any
 vm, fsm, and 1gig extents that exist in the directory.  It caches the
 directory listing and does full array scans looking for matches.  If the
 tablespace changes, it creates a new directory cache and throws away the
 old one.  This code certainly needs improvement!
 
 I can think of two solutions.  The first would be to scan the database
 directory, and any tablespaces used by the database, sort it, then allow
 binary search of the directory listing looking for file prefixes that
 match the current relation.
 
 The second approach would be to simply try to copy the fsm, vm, and
 extent files, and ignore any ENOEXIST errors.  This allows code
 simplification.  The downside is that it doesn't pull all files with
 matching prefixes --- it requires pg_upgrade to _know_ what suffixes
 might exist in that directory.  Second, it assumes there can be no
 number gaps in the file extent numbering (is that safe?).
 
 I need recommendations on which direction to persue;  this would only be
 for 9.3.

I went with the second idea, patch attached.  Here are the times:

 --  9.2 --   9.3 
 -- normal -- -- bin-up --   -- normal -- -- bin-up --  pg_upgrade
 dump   rest   dump   rest   dump   rest   dump   rest   git   patch
  1  0.12   0.06   0.12   0.06   0.11   0.07   0.11   0.07  11.11  11.02
   1000  7.22   2.40   4.74   2.78   2.20   2.43   4.04   2.86  19.60  19.25
   2000  5.67   5.10   8.82   5.57   4.50   4.97   8.07   5.69  30.55  26.67
   4000 13.34  11.13  25.16  12.52   8.95  11.24  16.75  12.16  60.70  52.31
   8000 29.12  25.98  59.60  28.08  16.68  24.02  30.63  27.08 123.05 102.78
  16000 87.36  53.16 189.38  62.72  31.38  55.37  61.55  62.66 365.71 286.00

You can see a significant speedup with those loops removed.  The 16k
case is improved, but still not linear.  The 16k dump/restore scale
looks fine, so it must be something in pg_upgrade, or in the kernel.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/contrib/pg_upgrade/relfilenode.c b/contrib/pg_upgrade/relfilenode.c
new file mode 100644
index 33a867f..0a49a23
*** a/contrib/pg_upgrade/relfilenode.c
--- b/contrib/pg_upgrade/relfilenode.c
***
*** 17,25 
  
  static void transfer_single_new_db(pageCnvCtx *pageConverter,
  	   FileNameMap *maps, int size);
! static void transfer_relfile(pageCnvCtx *pageConverter,
!  const char *fromfile, const char *tofile,
!  const char *nspname, const char *relname);
  
  
  /*
--- 17,24 
  
  static void transfer_single_new_db(pageCnvCtx *pageConverter,
  	   FileNameMap *maps, int size);
! static int transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
! 			 const char *suffix);
  
  
  /*
*** static void
*** 131,185 
  transfer_single_new_db(pageCnvCtx *pageConverter,
  	   FileNameMap *maps, int size)
  {
- 	char		old_dir[MAXPGPATH];
- 	char		file_pattern[MAXPGPATH];
- 	char		**namelist = NULL;
- 	int			numFiles = 0;
  	int			mapnum;
! 	int			fileno;
! 	bool		vm_crashsafe_change = false;
! 
! 	old_dir[0] = '\0';
! 
  	/* Do not copy non-crashsafe vm files for binaries that assume crashsafety */
  	if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_CRASHSAFE_CAT_VER &&
  		new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
! 		vm_crashsafe_change = true;
  
  	for (mapnum = 0; mapnum < size; mapnum++)
  	{
! 		char		old_file[MAXPGPATH];
! 		char		new_file[MAXPGPATH];
! 
! 		/* Changed tablespaces?  Need a new directory scan? */
! 		if (strcmp(maps[mapnum].old_dir, old_dir) != 0)
! 		{
! 			if (numFiles > 0)
! 			{
! for (fileno = 0; fileno < numFiles; fileno++)
! 	pg_free(namelist[fileno]);
! pg_free(namelist);
! 			}
! 
! 			snprintf(old_dir, sizeof(old_dir), "%s", maps[mapnum].old_dir);
! 			numFiles = load_directory(old_dir, &namelist);
! 		}
! 
! 		/* Copying files might take some time, so give feedback. */
! 
! 		snprintf(old_file, sizeof(old_file), "%s/%u", maps[mapnum].old_dir,
!  maps[mapnum].old_relfilenode);
! 		snprintf(new_file, sizeof(new_file), "%s/%u", maps[mapnum].new_dir,
!  maps[mapnum].new_relfilenode);
! 		pg_log(PG_REPORT, OVERWRITE_MESSAGE, old_file);
! 
! 		/*
! 		 * Copy/link the relation's primary file (segment 0 of main fork)
! 		 * to the new cluster
! 		 */
! 		unlink(new_file);
! 		transfer_relfile(pageConverter, old_file, new_file,
! 		 maps[mapnum].nspname, maps[mapnum].relname);
  
  		/* fsm/vm files added in PG 8.4 */
 		if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
--- 130,148 
  transfer_single_new_db(pageCnvCtx 

Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Bruce Momjian
On Mon, Nov 12, 2012 at 03:59:27PM -0500, Bruce Momjian wrote:
  The second approach would be to simply try to copy the fsm, vm, and
  extent files, and ignore any ENOEXIST errors.  This allows code
  simplification.  The downside is that it doesn't pull all files with
  matching prefixes --- it requires pg_upgrade to _know_ what suffixes
  might exist in that directory.  Second, it assumes there can be no
  number gaps in the file extent numbering (is that safe?).
  
  I need recommendations on which direction to persue;  this would only be
  for 9.3.
 
 I went with the second idea, patch attached.  Here are the times:
 
  --  9.2 --   9.3 
  -- normal -- -- bin-up --   -- normal -- -- bin-up --  pg_upgrade
  dump   rest   dump   rest   dump   rest   dump   rest   git   patch
   1  0.12   0.06   0.12   0.06   0.11   0.07   0.11   0.07  11.11  11.02
    1000  7.22   2.40   4.74   2.78   2.20   2.43   4.04   2.86  19.60  19.25
    2000  5.67   5.10   8.82   5.57   4.50   4.97   8.07   5.69  30.55  26.67
    4000 13.34  11.13  25.16  12.52   8.95  11.24  16.75  12.16  60.70  52.31
    8000 29.12  25.98  59.60  28.08  16.68  24.02  30.63  27.08 123.05 102.78
   16000 87.36  53.16 189.38  62.72  31.38  55.37  61.55  62.66 365.71 286.00
 
 You can see a significant speedup with those loops removed.  The 16k
 case is improved, but still not linear.  The 16k dump/restore scale
 looks fine, so it must be something in pg_upgrade, or in the kernel.

It is possible that the poor 16k pg_upgrade value is caused by the poor
9.2 binary-upgrade number (189.38).  Perhaps I need to hack up
pg_upgrade to allow a 9.3 to 9.3 upgrade to test this.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Alvaro Herrera
Bruce Momjian escribió:

 --- 17,24 
   
   static void transfer_single_new_db(pageCnvCtx *pageConverter,
  FileNameMap *maps, int size);
 ! static int transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
 !  const char *suffix);

Uh, does this code assume that forks other than the main one are not
split in segments?  I think that's a bug, is it not?

-- 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Alvaro Herrera
Bruce Momjian escribió:

 It is possible that the poor 16k pg_upgrade value is caused by the poor
 9.2 binary-upgrade number (189.38).  Perhaps I need to hack up
 pg_upgrade to allow a 9.3 to 9.3 upgrade to test this.

Hmm?  This already works, since make check uses it, right?

-- 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol

2012-11-12 Thread Merlin Moncure
On Mon, Nov 12, 2012 at 1:21 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On 10 September 2012 17:50, Tom Lane t...@sss.pgh.pa.us wrote:

 The point of the proposal that I am making is to have a simple,
 low-maintenance solution for people who need a single-application
 database.  A compromise somewhere in the middle isn't likely to be an
 improvement for anybody.  For instance, if you want to have additional
 connections, you open up a whole collection of communication and
 authentication issues, which potential users of a single-application
 database don't want to cope with.

 So the proposal is to implement a database that can't ever have 2 or
 more connections. And so any data it stores cannot ever be accessed by
 another connection, so all forms of replication are excluded and all
 maintenance actions force the database to be unavailable for a period
 of time. Those two things are barriers of the most major kind to
 anybody working in an enterprise with connected data and devices. The
 only people that want this are people with a very short term view of
 the purpose of their applications, and disregard for the value and
 permanence of the data stored. They may not want to cope with those
 issues *now* but they will later and won't thank us for implementing
 it in a way that means it can never be achieved.

 To be honest, I can't believe I'm reading this.

 And worse, it's on our Don't Want list, and nobody has said stop.

 It's almost impossible to purchase a CPU these days that doesn't have
 multiple cores, so the whole single-process architecture is just dead.
 Yes, we want Postgres installed everywhere, but this isn't the way to
 achieve that.

 I agree we should allow a PostgreSQL installation to work for a single
 user, but I don't see that requires other changes. This idea will
 cause endless bugs, thinkos and severely waste our time. So without a
 much better justification, I don't think we should do this.

I couldn't disagree more.  The patch is small, logical, and fixes an
awful problem, namely that --single mode is basically unusable.  As to
your wider point (namely, that you can't connect to it, therefore it's
bad), it has got to be refuted by numerous competing solutions in the
market such as http://www.firebirdsql.org/manual/fbmetasecur-embedded.html,
and many others.

While it's not as common as it used to be, now and then a piece of
software needs to distribute an application as part of a boxed
product.  Postgres is horrible at this and doesn't have to be; imagine
how much easier the lives of poker tracker would be (for *some* of its
users) with an integrated standalone mode: google 'poker tracker
postgresql' and take a good long look at problems people face in this
scenario.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Bruce Momjian
On Mon, Nov 12, 2012 at 06:14:59PM -0300, Alvaro Herrera wrote:
 Bruce Momjian escribió:
 
  --- 17,24 

static void transfer_single_new_db(pageCnvCtx *pageConverter,
 FileNameMap *maps, int size);
  ! static int transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
  !const char *suffix);
 
 Uh, does this code assume that forks other than the main one are not
 split in segments?  I think that's a bug, is it not?

Oh, yeah, I must have fixed this long ago.  It only fails if you use
tablespaces:

if (os_info.num_tablespaces > 0 &&
    strcmp(old_cluster.tablespace_suffix, new_cluster.tablespace_suffix) == 0)
    pg_log(PG_FATAL,
           "Cannot upgrade to/from the same system catalog version when\n"
           "using tablespaces.\n");

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Bruce Momjian
On Mon, Nov 12, 2012 at 12:09:08PM -0500, Bruce Momjian wrote:
 The second approach would be to simply try to copy the fsm, vm, and
 extent files, and ignore any ENOEXIST errors.  This allows code
 simplification.  The downside is that it doesn't pull all files with
 matching prefixes --- it requires pg_upgrade to _know_ what suffixes
 might exist in that directory.  Second, it assumes there can be no
 number gaps in the file extent numbering (is that safe?).

Seems our code does the same kind of segment number looping I was
suggesting for pg_upgrade, so I think I am safe:

/*
 * Note that because we loop until getting ENOENT, we will correctly
 * remove all inactive segments as well as active ones.
 */
for (segno = 1;; segno++)
{
sprintf(segpath, "%s.%u", path, segno);
if (unlink(segpath) < 0)
{
/* ENOENT is expected after the last segment... */
if (errno != ENOENT)
ereport(WARNING,
(errcode_for_file_access(),
   errmsg("could not remove file \"%s\": %m", segpath)));
break;
}
}

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol

2012-11-12 Thread Simon Riggs
On 12 November 2012 21:26, Merlin Moncure mmonc...@gmail.com wrote:

 I couldn't disagree more.  The patch is small, logical, and fixes an
 awful problem, namely that --single mode is basically unusable.  As to
 your wider point (namely, that you can't connect to it, therefore it's
 bad), it has got to be refuted by numerous competing solutions in the
 market such as http://www.firebirdsql.org/manual/fbmetasecur-embedded.html,
 and many others.

Small is not an argument in favour, just a statement of ease, like
jumping off a cliff. i.e. lemmings.

 While it's not as common as it used to be, now and then a piece of
 software needs to distribute an application as part of a boxed
 product.  Postgres is horrible at this and doesn't have to be; imagine
 how much easier the lives of poker tracker would be (for *some* of its
 users) with an integrated standalone mode: google 'poker tracker
 postgresql' and take a good long look at problems people face in this
 scenario.

I get the installability thang, very very much, I just don't see the
single process thing as the only solution. At very least an open
minded analysis of the actual problem and ways of solving it is called
for, not just reach for a close to hand solution.

I don't ever want to hear someone reject a patch cos it would mess up
poker tracker. The point is it complicates the code, introduces
restrictions into what is possible and is just more inertia onto
development.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Jeff Janes
On Fri, Nov 9, 2012 at 10:50 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jeff Janes jeff.ja...@gmail.com writes:
 Are sure the server you are dumping out of is head?

 I experimented a bit with dumping/restoring 16000 tables matching
 Bruce's test case (ie, one serial column apiece).  The pg_dump profile
 seems fairly flat, without any easy optimization targets.  But
 restoring the dump script shows a rather interesting backend profile:

 samples  %        image name   symbol name
 30861    39.6289  postgres AtEOXact_RelationCache
 9911 12.7268  postgres hash_seq_search
 ...

 There are at least three ways we could whack that mole:

 * Run the psql script in --single-transaction mode, as I was mumbling
 about the other day.  If we were doing AtEOXact_RelationCache only once,
 rather than once per CREATE TABLE statement, it wouldn't be a problem.
 Easy but has only a narrow scope of applicability.

 That is effective when loading into 9.3 (assuming you make
 max_locks_per_transaction large enough).  But when loading into < 9.3,
 using --single-transaction will evoke the quadratic behavior in the
 resource owner/lock table and make things worse rather than better.

Using --single-transaction gets around the AtEOXact_RelationCache
quadratic, but it activates another quadratic behavior, this one in
get_tabstat_entry.   That is a good trade-off because that one has a
lower constant, but it is still going to bite.

Cheers,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] vacuumlo - use a cursor

2012-11-12 Thread Andrew Dunstan
vacuumlo is rather simpleminded about dealing with the list of LOs to be 
removed - it just fetches them as a straight resultset. For one of my 
clients this resulted in an out of memory condition. The attached patch 
tries to remedy that by using a cursor instead. If this is wanted I will 
add it to the next commitfest. The actual changes are very small - most 
of the patch is indentation changes due to the introduction of an extra 
loop.
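
At the SQL level, the patched loop boils down to something like this
(portal name and default batch size are taken from the patch; vacuum_l is
the temp table vacuumlo already builds):

    DECLARE myportal CURSOR WITH HOLD FOR SELECT lo FROM vacuum_l;
    FETCH FORWARD 1000 IN myportal;
    -- repeated until it returns zero rows; each returned OID is then
    -- removed with lo_unlink(), committing every transaction_limit removals

so only one batch of OIDs is held in client memory at a time.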


cheers

andrew
*** a/contrib/vacuumlo/vacuumlo.c
--- b/contrib/vacuumlo/vacuumlo.c
***
*** 290,362  vacuumlo(const char *database, const struct _param * param)
  	PQclear(res);
  
  	buf[0] = '\0';
! 	strcat(buf, "SELECT lo FROM vacuum_l");
! 	res = PQexec(conn, buf);
! 	if (PQresultStatus(res) != PGRES_TUPLES_OK)
! 	{
! 		fprintf(stderr, "Failed to read temp table:\n");
! 		fprintf(stderr, "%s", PQerrorMessage(conn));
! 		PQclear(res);
  		PQfinish(conn);
  		return -1;
! 	}
  
- 	matched = PQntuples(res);
  	deleted = 0;
! 	for (i = 0; i < matched; i++)
  	{
! 		Oid			lo = atooid(PQgetvalue(res, i, 0));
  
! 		if (param->verbose)
  		{
! 			fprintf(stdout, "\rRemoving lo %6u   ", lo);
! 			fflush(stdout);
  		}
  
! 		if (param->dry_run == 0)
  		{
! 			if (lo_unlink(conn, lo) < 0)
  			{
! fprintf(stderr, "\nFailed to remove lo %u: ", lo);
! fprintf(stderr, "%s", PQerrorMessage(conn));
! if (PQtransactionStatus(conn) == PQTRANS_INERROR)
  {
! 	success = false;
! 	break;
  }
  			}
  			else
  deleted++;
! 		}
! 		else
! 			deleted++;
! 		if (param->transaction_limit > 0 &&
! 			(deleted % param->transaction_limit) == 0)
! 		{
! 			res2 = PQexec(conn, "commit");
! 			if (PQresultStatus(res2) != PGRES_COMMAND_OK)
  			{
! fprintf(stderr, "Failed to commit transaction:\n");
! fprintf(stderr, "%s", PQerrorMessage(conn));
  PQclear(res2);
! PQclear(res);
! PQfinish(conn);
! return -1;
! 			}
! 			PQclear(res2);
! 			res2 = PQexec(conn, "begin");
! 			if (PQresultStatus(res2) != PGRES_COMMAND_OK)
! 			{
! fprintf(stderr, "Failed to start transaction:\n");
! fprintf(stderr, "%s", PQerrorMessage(conn));
  PQclear(res2);
- PQclear(res);
- PQfinish(conn);
- return -1;
  			}
- 			PQclear(res2);
  		}
  	}
  	PQclear(res);
  
  	/*
--- 290,389 
  	PQclear(res);
  
  	buf[0] = '\0';
! 	strcat(buf, 
! 		   "DECLARE myportal CURSOR WITH HOLD FOR SELECT lo FROM vacuum_l");
! res = PQexec(conn, buf);
! if (PQresultStatus(res) != PGRES_COMMAND_OK)
! {
! fprintf(stderr, "DECLARE CURSOR failed: %s", PQerrorMessage(conn));
! PQclear(res);
  		PQfinish(conn);
  		return -1;
! }
! PQclear(res);
! 
! 	snprintf(buf, BUFSIZE, "FETCH FORWARD " INT64_FORMAT " IN myportal", 
! 			 param->transaction_limit > 0 ? param->transaction_limit : 1000);
  
  	deleted = 0;
! 
! 	while (1)
  	{
! 		res = PQexec(conn, buf);
! 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
! 		{
! 			fprintf(stderr, "Failed to read temp table:\n");
! 			fprintf(stderr, "%s", PQerrorMessage(conn));
! 			PQclear(res);
! 			PQfinish(conn);
! 			return -1;
! 		}
  
! 		matched = PQntuples(res);
! 
! 		if (matched <= 0)
  		{
! 			/* at end of resultset */
! 			break;
  		}
  
! 		for (i = 0; i < matched; i++)
  		{
! 			Oid			lo = atooid(PQgetvalue(res, i, 0));
! 			
! 			if (param->verbose)
! 			{
! fprintf(stdout, "\rRemoving lo %6u   ", lo);
! fflush(stdout);
! 			}
! 			
! 			if (param->dry_run == 0)
  			{
! if (lo_unlink(conn, lo) < 0)
  {
! 	fprintf(stderr, "\nFailed to remove lo %u: ", lo);
! 	fprintf(stderr, "%s", PQerrorMessage(conn));
! 	if (PQtransactionStatus(conn) == PQTRANS_INERROR)
! 	{
! 		success = false;
! 		break;
! 	}
  }
+ else
+ 	deleted++;
  			}
  			else
  deleted++;
! 
! 			if (param->transaction_limit > 0 &&
! (deleted % param->transaction_limit) == 0)
  			{
! res2 = PQexec(conn, "commit");
! if (PQresultStatus(res2) != PGRES_COMMAND_OK)
! {
! 	fprintf(stderr, "Failed to commit transaction:\n");
! 	fprintf(stderr, "%s", PQerrorMessage(conn));
! 	PQclear(res2);
! 	PQclear(res);
! 	PQfinish(conn);
! 	return -1;
! }
  PQclear(res2);
! res2 = PQexec(conn, "begin");
! if (PQresultStatus(res2) != PGRES_COMMAND_OK)
! {
! 	fprintf(stderr, "Failed to start transaction:\n");
! 	fprintf(stderr, "%s", PQerrorMessage(conn));
! 	PQclear(res2);
! 	PQclear(res);
! 	PQfinish(conn);
! 	return -1;
! }
  PQclear(res2);
  			}
  		}
  	}
+ 
  	PQclear(res);
  
  	/*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Simon Riggs
On 12 November 2012 16:51, Robert Haas robertmh...@gmail.com wrote:

 Although there may be some workloads that access very large numbers of
 tables repeatedly, I bet that's not typical.

Transactions with large numbers of DDL statements are typical at
upgrade (application or database release level) and the execution time
of those is critical to availability.

I'm guessing you mean large numbers of tables and accessing each one
multiple times?

 Rather, I bet that a
 session which accesses 10,000 tables is most likely to access them
 just once each - and right now we don't handle that case very well;
 this is not the first complaint about big relcaches causing problems.

pg_restore frequently accesses tables more than once as it runs, but
not more than a dozen times each, counting all types of DDL.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Robert Haas
On Mon, Nov 12, 2012 at 5:17 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On 12 November 2012 16:51, Robert Haas robertmh...@gmail.com wrote:

 Although there may be some workloads that access very large numbers of
 tables repeatedly, I bet that's not typical.

 Transactions with large numbers of DDL statements are typical at
 upgrade (application or database release level) and the execution time
 of those is critical to availability.

 I'm guessing you mean large numbers of tables and accessing each one
 multiple times?

Yes, that is what I meant.

 Rather, I bet that a
 session which accesses 10,000 tables is most likely to access them
 just once each - and right now we don't handle that case very well;
 this is not the first complaint about big relcaches causing problems.

 pg_restore frequently accesses tables more than once as it runs, but
 not more than a dozen times each, counting all types of DDL.

Hmm... yeah.  Some of those accesses are probably one right after
another so any cache-flushing behavior would be fine; but index
creations for example might happen quite a bit later in the file,
IIRC.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Greg Smith

On 11/12/12 4:44 AM, Craig Ringer wrote:


Is it absurd to suggest using another bitmap, like the FSM or visibility
map, to store information on page checksumming while checksumming is
enabled but incomplete?


I spent some time thinking about that last week.  One problem with it is 
that the bitmap structure itself has the same issues as every other 
write here--how do we know it's going to disk accurately?  The "put 
'checksum on' bits on the page" idea and "put checksum on bits in a map" 
have the same fundamental issue.  Things might get out of sync in the 
same way, you've just moved the potentially suspicious write to a new place.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Jeff Davis
On Mon, 2012-11-12 at 20:44 +0100, Markus Wanner wrote:
 As described before in this thread, I think we might be able to do
 without the "has checksum" bit, as yet another simplification. But I
 don't object to adding it, either.

I see. For a first patch, I guess that's OK. Might as well make it as
simple as possible. We probably need to decide what to do there before
9.3 is released though.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Greg Smith

On 11/12/12 3:44 AM, Markus Wanner wrote:

Sorry if that has been discussed before, but can't we do without that
bit at all? It adds a checksum switch to each page, where we just agreed
we don't event want a per-database switch.


Once you accept that eventually there need to be online conversion 
tools, there needs to be some easy way to distinguish which pages have 
been processed for several potential implementations.  The options seem 
to be adding some bits just for that or bumping the page format.  I 
would like to just bump the format, but that has a pile of its own 
issues to cross.  I'd rather not make that a prerequisite for this 
month's work.



Can we simply write a progress indicator to pg_control or someplace
saying that all pages up to X of relation Y are supposed to have valid
checksums?


All of the table-based checksum enabling ideas seem destined to add 
metadata to pg_class or something related to it for this purpose.  While 
I think everyone agrees that this is a secondary priority to getting 
basic cluster-level checksums going right now, I'd like to have at least 
a prototype for that before 9.3 development ends.  All of the



I realize this doesn't support Jesper's use case of wanting to have the
checksums only for newly dirtied pages. However, I'd argue that
prolonging the migration to spread the load would allow even big shops
to go through this without much of an impact on performance.


I'm thinking of this in some ways like the way creation of a new (but 
not yet valid) foreign key works.  Once that's active, new activity is 
immediately protected moving forward.  And eventually there's this 
cleanup step needed, one that you can inch forward over a few days.


The main upper limit on load spreading here is that the conversion 
program may need to grab a snapshot.  In that case the conversion taking 
too long will be a problem, as it blocks other vacuum activity past that 
point.   This is why I think any good solution to this problem needs to 
incorporate restartable conversion.  We were just getting complaints 
recently about how losing a CREATE INDEX CONCURRENTLY session can cause 
the whole process to end and need to be started over.  The way 
autovacuum runs right now it can be stopped and restarted later, with 
only a small loss of duplicated work in many common cases.  If it's 
possible to maintain that property for the checksum conversion, that 
would be very helpful to larger sites.  It doesn't matter if adding 
checksums to the old data takes a week if you throttle the load down, so 
long as you're not forced to hold an open snapshot the whole time.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Josh Berkus
Jeff,

 OK, so here's my proposal for a first patch (changes from Simon's
 patch):
 
   * Add a flag to the postgres executable indicating that it should use
 checksums on everything. This would only be valid if bootstrap mode is
 also specified.
   * Add a multi-state checksums flag in pg_control, that would have
 three states: OFF, ENABLING, and ON. It would only be set to ON during
 bootstrap, and in this first patch, it would not be possible to set
 ENABLING.
   * Remove GUC and use this checksums flag everywhere.
   * Use the TLI field rather than the version field of the page header.
   * Incorporate page number into checksum calculation (already done).
   
 Does this satisfy the requirements for a first step? Does it interfere
 with potential future work?

So the idea of this implementation is that checksums are something you
set at initdb time, and if you want checksums on an existing database,
it's a migration process (e.g. dump and reload)?

I think that's valid as a first cut at this.

We'll need interruptible VACUUM CHECKSUM later, but we don't have to
have it for the first version of the feature.
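
Purely as illustration of the proposal quoted above (a hypothetical sketch -- 
these names and this layout are made up, not the real pg_control contents), 
the tri-state flag amounts to something like:

#include <stdbool.h>

typedef enum ChecksumState
{
	CHECKSUMS_OFF,				/* never written, never verified */
	CHECKSUMS_ENABLING,			/* stamped on writes, not yet verified on reads */
	CHECKSUMS_ON				/* stamped and verified everywhere */
} ChecksumState;

/* in this sketch, the value would be read from pg_control at startup */
static ChecksumState cluster_checksum_state = CHECKSUMS_OFF;

static inline bool
checksum_write_enabled(void)
{
	return cluster_checksum_state != CHECKSUMS_OFF;
}

static inline bool
checksum_verify_enabled(void)
{
	return cluster_checksum_state == CHECKSUMS_ON;
}

The ENABLING state is what an interruptible conversion would later run under: 
new writes are already protected, while verification stays off until every 
page has been visited.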

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Index only scans wiki page

2012-11-12 Thread Jeff Janes
https://wiki.postgresql.org/wiki/Index-only_scans

This page is seriously out of date.  It suggests they are not yet
implemented, but only being talked about.

Would someone who knows a lot about the subject (which is why I'm
sending this hackers, not web)  like to take a whack at updating this?
 Otherwise, I'd like to just replace the whole page with a discussion
of the limitations and trade-offs involved in using them as they
currently exist.

Thanks,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index only scans wiki page

2012-11-12 Thread Peter Geoghegan
On 13 November 2012 01:03, Jeff Janes jeff.ja...@gmail.com wrote:
 https://wiki.postgresql.org/wiki/Index-only_scans

 This page is seriously out of date.  It suggests they are not yet
 implemented, but only being talked about.

Attached is a piece I wrote on the feature. That might form the basis
of a new wiki page. Feel free to incorporate this material as you see
fit.

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


index_only_scans.rst
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doc patch to See Also: CREATE TABLE AS in CREATE TABLE docs

2012-11-12 Thread Peter Eisentraut
On Sun, 2012-09-23 at 21:28 -0500, Karl O. Pinc wrote:
 Patch to add CREATE TABLE AS to the See Also: section
 of the CREATE TABLE docs.

committed



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] support for LDAP URLs

2012-11-12 Thread Peter Eisentraut
Here is a patch to support RFC 2255 LDAP URLs in pg_hba.conf.  So,
instead of, say

host ... ldap ldapserver=ldap.example.net ldapbasedn="dc=example, dc=net" 
ldapsearchattribute=uid

you could write

host ... ldap ldapurl="ldap://ldap.example.net/dc=example,dc=net?uid?sub"

Apache and probably other software uses the same format, and it's easier
to have a common format for all such configuration instead of having to
translate the information provided by the LDAP admin into each
software's particular configuration spellings.

I'm using the OpenLDAP-provided URL parsing routine, which means this
wouldn't be supported on Windows.  But we already support different
authentication settings on different platforms, so this didn't seem such
a big problem.
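
To make the mapping concrete, here is a small standalone sketch (mine, for 
illustration only, not code from the patch) of what OpenLDAP's ldap_url_parse() 
returns for a URL of this shape; the URL string is just the example from above, 
and it builds with -lldap:

#include <stdio.h>
#include <ldap.h>

int
main(void)
{
	const char *url = "ldap://ldap.example.net/dc=example,dc=net?uid?sub";
	LDAPURLDesc *ludp;

	if (ldap_url_parse(url, &ludp) != LDAP_URL_SUCCESS)
	{
		fprintf(stderr, "could not parse LDAP URL\n");
		return 1;
	}

	printf("host:   %s\n", ludp->lud_host);		/* ldap.example.net */
	printf("basedn: %s\n", ludp->lud_dn);		/* dc=example,dc=net */
	printf("attr:   %s\n", ludp->lud_attrs ? ludp->lud_attrs[0] : "(none)");	/* uid */
	printf("scope:  %s\n", ludp->lud_scope == LDAP_SCOPE_SUBTREE ? "sub" : "base/one");

	ldap_free_urldesc(ludp);
	return 0;
}
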
diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index d053fce..ba6523b 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -1486,6 +1486,39 @@ <title>LDAP Authentication</title>
 </para>
 </listitem>
   </varlistentry>
+  <varlistentry>
+   <term><literal>ldapurl</literal></term>
+   <listitem>
+    <para>
+     You can write most of the LDAP options alternatively using an RFC 2255
+     LDAP URL.  The format is
+<synopsis>
+ldap://[<replaceable>user</replaceable>[:<replaceable>password</replaceable>]@]<replaceable>host</replaceable>[:<replaceable>port</replaceable>]/<replaceable>basedn</replaceable>[?[<replaceable>attribute</replaceable>][?[<replaceable>scope</replaceable>]]]
+</synopsis>
+     <replaceable>scope</replaceable> must be one
+     of <literal>base</literal>, <literal>one</literal>, <literal>sub</literal>,
+     typically the latter.  Only one attribute is used, and some other
+     components of standard LDAP URLs such as filters and extensions are
+     not supported.
+    </para>
+
+    <para>
+     If you want encrypted LDAP connections, you have to use
+     the <literal>ldaptls</literal> option in addition
+     to <literal>ldapurl</literal>.  The <literal>ldaps</literal> URL
+     scheme (direct SSL connection) is not supported.
+    </para>
+
+    <para>
+     Some other software that supports authentication against LDAP uses the
+     same URL format, so it will be easier to share the configuration.
+    </para>
+
+    <para>
+     LDAP URLs are currently only supported with OpenLDAP, not on Windows.
+    </para>
+   </listitem>
+  </varlistentry>
 </variablelist>
   </para>
 
@@ -1520,6 +1553,13 @@ <title>LDAP Authentication</title>
 If that second connection succeeds, the database access is granted.
   </para>
 
+   <para>
+    Here is the same search+bind configuration written as a URL:
+<programlisting>
+host ... ldap ldapurl="ldap://ldap.example.net/dc=example,dc=net?uid?sub"
+</programlisting>
+   </para>
+
   <tip>
    <para>
     Since LDAP often uses commas and spaces to separate the different
diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index ca470e1..cc1140d 100644
--- a/src/backend/libpq/auth.c
+++ b/src/backend/libpq/auth.c
@@ -2209,7 +2209,7 @@ static int pam_passwd_conv_proc(int num_msg, const struct pam_message ** msg,
 
 		r = ldap_search_s(ldap,
 		  port->hba->ldapbasedn,
-		  LDAP_SCOPE_SUBTREE,
+		  port->hba->ldapscope,
 		  filter,
 		  attributes,
 		  0,
diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index 7502e82..ba630a7 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -37,6 +37,16 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 
+#ifdef USE_LDAP
+#ifndef WIN32
+/* We use a deprecated function to keep the codepath the same as win32. */
+#define LDAP_DEPRECATED 1
 #include <ldap.h>
+#else
 #include <winldap.h>
+#endif
+#endif
+
 
 #define atooid(x)  ((Oid) strtoul((x), NULL, 10))
 #define atoxid(x)  ((TransactionId) strtoul((x), NULL, 10))
@@ -1336,7 +1346,7 @@ static bool parse_hba_auth_opt(char *name, char *val, HbaLine *hbaline,
 			{
 ereport(LOG,
 		(errcode(ERRCODE_CONFIG_FILE_ERROR),
-		 errmsg("cannot use ldapbasedn, ldapbinddn, ldapbindpasswd, or ldapsearchattribute together with ldapprefix"),
+		 errmsg("cannot use ldapbasedn, ldapbinddn, ldapbindpasswd, ldapsearchattribute, or ldapurl together with ldapprefix"),
 		 errcontext("line %d of configuration file \"%s\"",
 	line_num, HbaFileName)));
 return NULL;
@@ -1378,6 +1388,8 @@ static bool parse_hba_auth_opt(char *name, char *val, HbaLine *hbaline,
 static bool
 parse_hba_auth_opt(char *name, char *val, HbaLine *hbaline, int line_num)
 {
+	hbaline->ldapscope = LDAP_SCOPE_SUBTREE;
+
 	if (strcmp(name, "map") == 0)
 	{
 		if (hbaline->auth_method != uaIdent &&
@@ -1437,6 +1449,54 @@ static bool parse_hba_auth_opt(char *name, char *val, HbaLine *hbaline,
 		REQUIRE_AUTH_OPTION(uaPAM, "pamservice", "pam");
 		hbaline-pamservice = pstrdup(val);
 	}
+	else if (strcmp(name, "ldapurl") == 0)
+	{
+		LDAPURLDesc *urldata;
+		int rc;
+
+		REQUIRE_AUTH_OPTION(uaLDAP, "ldapurl", "ldap");
+
+#ifdef 

Re: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL

2012-11-12 Thread Amit kapila
On Monday, November 12, 2012 12:07 PM Greg Smith wrote:
On 11/9/12 11:59 PM, Amit kapila wrote:

 Please let me know if there are any objections or problems in above method 
 of implementation,
 else I can go ahead to prepare the patch for the coming CF.

 It may be the case that the locking scheme Robert described is the best
 approach here.  It seems kind of heavy to me though.  I suspect that
 some more thinking about it might come up with something better.

Yes, we should evaluate multiple options to do this and then choose the best 
among them. 
I am ready to work on evaluating other ways to accomplish this feature.

Is the above opinion only about locking, or also about the way of writing the 
changed values to a file, as mentioned in point-1 in the mail chain upthread?
(Point-1: While writing the .auto file, it will always assume that the .auto file 
contains
 all config parameters.
   Now as this .auto file is of fixed format and fixed record size, it can
 directly write a given record to its particular position.)
My thinking was that if we can decide that the format and size of each 
configuration entry is fixed, a value can be written directly without doing 
anything for it in memory.
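
To illustrate that idea (a toy sketch only -- the record size and record format 
here are made up, not what the patch would use): with fixed-size slots, one 
changed setting can be rewritten in place by seeking to its slot.

#include <stdio.h>
#include <string.h>

#define AUTO_RECORD_SIZE 128	/* illustrative fixed record size */

static int
write_auto_record(FILE *fp, int slot, const char *name, const char *value)
{
	char		rec[AUTO_RECORD_SIZE];
	size_t		len;

	/* build a fixed-size, space-padded "name = 'value'" record */
	memset(rec, ' ', sizeof(rec));
	len = (size_t) snprintf(rec, sizeof(rec), "%s = '%s'", name, value);
	if (len < sizeof(rec))
		rec[len] = ' ';			/* replace the terminating NUL with padding */
	rec[AUTO_RECORD_SIZE - 1] = '\n';

	/* seek straight to this setting's slot and overwrite it in place */
	if (fseek(fp, (long) slot * AUTO_RECORD_SIZE, SEEK_SET) != 0)
		return -1;
	if (fwrite(rec, 1, sizeof(rec), fp) != sizeof(rec))
		return -1;
	return 0;
}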


 Regardless, exactly how to prevent two concurrent processes from writing
 the same file feels like the last step in the small roadmap for what
 this feature needs.  If you wanted to work on it more, I'd suggest
 breaking it into chunks in this order:

 1) Change to add scanning a .conf directory in the default configuration
 using include-dir.  This is a quick fix.  I predict most of the
 headaches around it will end up being for packagers rather than the core
 code to deal with.

 You could submit this as a small thing to be evaluated on its own.  How
 it's done is going to be controversial.  Might as well get that fighting
 focused against a sample implementation as soon as possible.

As per my understanding, 
a. during initdb, a new conf directory can be created and a .auto file created 
in it.
b. use include_dir at the end of postgresql.conf to include the directory 
created in the above step.
c. server start/SIGHUP will take care of the above include_dir.


 2) Get familiar with navigating the GUC data and figuring out what,
 exactly, needs to be written out.  This should include something that
 navigates like things appear after a RESET, ignoring per-user or
 per-session changes when figuring out what goes there.  It seems
 inevitable that some amount of validating against the source
 information--what pg_settings labels source, sourcefile, and sourceline
 will be needed.  An example is the suggestion Magnus made for confirming
 that the include-dir is still active before writing something there.

Okay, what I will do for this is explain in the next mail the way to do it by 
navigating the GUCs.

 3) Add the function to write a new file out.  Work out some test cases
 for that to confirm the logic and error checking in the previous step
 all works.

 I'd next submit what you've got for (2) and (3) to review at this point,
 before complicating things further with the locking parts.

Okay. 

 4) Make the file write atomic and able to work when multiple users try
 it at once.  You have to reach here successfully before the trivial
 around file locking comes into play.  I wouldn't even bother aiming for
 that part in a first patch.   It's obviously a solvable problem in a
 number of ways.  You need a rock solid way to figure out what to write
 there before that solution is useful though.



With Regards,
Amit Kapila.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

2012-11-12 Thread Amit kapila
On Monday, November 12, 2012 8:23 PM Fujii Masao wrote:
On Fri, Nov 9, 2012 at 3:03 PM, Amit Kapila amit.kap...@huawei.com wrote:
 On Thursday, November 08, 2012 10:42 PM Fujii Masao wrote:
 On Thu, Nov 8, 2012 at 5:53 PM, Amit Kapila amit.kap...@huawei.com
 wrote:
  On Thursday, November 08, 2012 2:04 PM Heikki Linnakangas wrote:
  On 19.10.2012 14:42, Amit kapila wrote:
   On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:
   Before implementing the timeout parameter, I think that it's
 better
  to change
   both pg_basebackup background process and pg_receivexlog so that

 BTW, IIRC the walsender has no timeout mechanism during sending
 backup data to pg_basebackup. So it's also useful to implement the
 timeout mechanism for the walsender during backup.

 Yes, it's useful, but for walsender the main problem is that it uses a blocking
 send call to send the data.
 I have tried using tcp_keepalive settings, but the send call doesn't come out
 in case of a network break.
 The only way I could get it out is:
 change the corresponding file /proc/sys/net/ipv4/tcp_retries2 by using
 the command
 echo 8 > /proc/sys/net/ipv4/tcp_retries2
 As per the recommendation, its value should be at least 8 (equivalent to 100
 sec)

 Do you have any idea, how it can be achieved?

 What about using pq_putmessage_noblock()?

I will try this, but do you know why the code uses blocking mode to send files 
in the first place?
I am asking because I am a little worried about breaking any design decision 
that originally made the sending of files blocking. 
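
As a generic POSIX illustration of the problem (this is not walsender code, 
just a sketch of why a plain blocking send() can hang after a network break and 
how a socket-level send timeout bounds it):

#include <stdio.h>
#include <stddef.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>

static int
send_with_timeout(int sock, const void *buf, size_t len, int timeout_sec)
{
	struct timeval tv;

	tv.tv_sec = timeout_sec;
	tv.tv_usec = 0;

	/* bound how long a single blocking send() may wait for buffer space */
	if (setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) < 0)
		return -1;

	if (send(sock, buf, len, 0) < 0)
	{
		if (errno == EAGAIN || errno == EWOULDBLOCK)
			fprintf(stderr, "send timed out; peer may be unreachable\n");
		return -1;
	}
	return 0;
}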

With Regards,
Amit Kapila.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages

2012-11-12 Thread Amit kapila
On Monday, November 12, 2012 9:56 PM Alvaro Herrera wrote:
Robert Haas escribió:
 On Tue, Jul 31, 2012 at 8:09 AM, Amit kapila amit.kap...@huawei.com wrote:

  I think I can see all of those things being potentially useful.  There
  are a couple of pending patches that will revise the WAL format
  slightly; not sure how much those are likely to interfere with any
  development you might do on (2) in the meantime.
 
  Based on above conclusion, I have prepared a patch which implements 
  Option-1

 I wonder if we shouldn't make this a separate utility, rather than
 something that is part of pg_resetxlog.  Anyone have a thought on that
 topic?

 That thought did cross my mind too.

One of the reasons for keeping it with pg_resetxlog is that this was proposed 
as a solution for scenarios where the user's db has become corrupt and he now
wants to start it. To do so he can find the max LSN and set the same using 
pg_resetxlog, which will avoid further corruption of the database after it has 
started.
If we keep it as a separate utility then the user needs to first run this utility to 
find the max LSN and then use pg_resetxlog to achieve the same. I don't see a big 
problem in that,
but maybe it would be better justified if there were other use cases for it.


However, it might also have other uses that I am not able to think of. 

Do you have any particular reason for making it a separate utility?

With Regards,
Amit Kapila.






-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Memory leaks in record_out and record_send

2012-11-12 Thread Tom Lane
I looked into the problem complained of here:
http://archives.postgresql.org/pgsql-general/2012-11/msg00279.php
which turns out to have nothing to do with joins and everything to do
with the fact that record_out() leaks memory like mad.  It leaks both
the strings returned by the per-column output functions and any column
values that have to be detoasted.  You can easily reproduce this with
an example like

create table leak (f1 int, f2 text);

insert into leak select x, 'foo' from generate_series(1,100) x;

select leak from leak;

The attached patch against HEAD fixes this, as well as a similar
leakage in record_send().  The added code is lifted directly from
printtup() so it's not adding any new assumptions to the system.

I wonder though if we ought to think about running output functions in
a short-lived memory context instead of the executor's main context.
We've considered that before, I think, and it's always been the path
of least resistance to fix the output functions instead --- but there
will always be another leak I'm afraid.
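
For the record, a rough sketch of that alternative (an assumption about the 
approach, not part of the attached patch): run each output function in a 
scratch context and destroy it afterwards, so whatever the function leaks is 
reclaimed in one sweep.

#include "postgres.h"
#include "fmgr.h"
#include "lib/stringinfo.h"
#include "utils/memutils.h"

static void
append_output_in_scratch_context(StringInfo buf, FmgrInfo *outfn, Datum value)
{
	MemoryContext scratch;
	MemoryContext oldcxt;
	char	   *str;

	scratch = AllocSetContextCreate(CurrentMemoryContext,
									"output function scratch",
									ALLOCSET_DEFAULT_MINSIZE,
									ALLOCSET_DEFAULT_INITSIZE,
									ALLOCSET_DEFAULT_MAXSIZE);
	oldcxt = MemoryContextSwitchTo(scratch);

	/* any strings or detoasted copies leaked here land in "scratch" */
	str = OutputFunctionCall(outfn, value);

	/* buf was created in the caller's context; appending keeps it there */
	appendStringInfoString(buf, str);

	MemoryContextSwitchTo(oldcxt);
	MemoryContextDelete(scratch);	/* reclaims everything in one sweep */
}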

OTOH I can't see trying to back-patch a solution like that.   If we want
to fix this in the back branches (and note the complaint linked above is
against 8.3), I think we have to do it as attached.

Thoughts?

regards, tom lane

diff --git a/src/backend/utils/adt/rowtypes.c b/src/backend/utils/adt/rowtypes.c
index 13e574d4e82b4412f7ab32e02398d1671d381097..d4ed7d0ca06ac1a0dee6b4963989a5b41713fb3b 100644
*** a/src/backend/utils/adt/rowtypes.c
--- b/src/backend/utils/adt/rowtypes.c
*** typedef struct ColumnIOData
*** 32,37 
--- 32,38 
  	Oid			column_type;
  	Oid			typiofunc;
  	Oid			typioparam;
+ 	bool		typisvarlena;
  	FmgrInfo	proc;
  } ColumnIOData;
  
*** record_out(PG_FUNCTION_ARGS)
*** 364,369 
--- 365,371 
  	{
  		ColumnIOData *column_info = &my_extra->columns[i];
  		Oid			column_type = tupdesc->attrs[i]->atttypid;
+ 		Datum		attr;
  		char	   *value;
  		char	   *tmp;
  		bool		nq;
*** record_out(PG_FUNCTION_ARGS)
*** 387,403 
  		 */
  		if (column_info->column_type != column_type)
  		{
- 			bool		typIsVarlena;
- 
  			getTypeOutputInfo(column_type,
  			  column_info->typiofunc,
! 			  &typIsVarlena);
  			fmgr_info_cxt(column_info->typiofunc, &column_info->proc,
  		  fcinfo->flinfo->fn_mcxt);
  			column_info->column_type = column_type;
  		}
  
! 		value = OutputFunctionCall(&column_info->proc, values[i]);
  
  		/* Detect whether we need double quotes for this value */
  		nq = (value[0] == '\0');	/* force quotes for empty string */
--- 389,412 
  		 */
  		if (column_info->column_type != column_type)
  		{
  			getTypeOutputInfo(column_type,
  			  column_info->typiofunc,
! 			  &column_info->typisvarlena);
  			fmgr_info_cxt(column_info->typiofunc, &column_info->proc,
  		  fcinfo->flinfo->fn_mcxt);
  			column_info->column_type = column_type;
  		}
  
! 		/*
! 		 * If we have a toasted datum, forcibly detoast it here to avoid
! 		 * memory leakage inside the type's output routine.
! 		 */
! 		if (column_info->typisvarlena)
! 			attr = PointerGetDatum(PG_DETOAST_DATUM(values[i]));
! 		else
! 			attr = values[i];
! 
! 		value = OutputFunctionCall(&column_info->proc, attr);
  
  		/* Detect whether we need double quotes for this value */
  		nq = (value[0] == '\0');	/* force quotes for empty string */
*** record_out(PG_FUNCTION_ARGS)
*** 416,432 
  
  		/* And emit the string */
  		if (nq)
! 			appendStringInfoChar(buf, '"');
  		for (tmp = value; *tmp; tmp++)
  		{
  			char		ch = *tmp;
  
  			if (ch == '"' || ch == '\\')
! appendStringInfoChar(buf, ch);
! 			appendStringInfoChar(buf, ch);
  		}
  		if (nq)
! 			appendStringInfoChar(buf, '"');
  	}
  
  	appendStringInfoChar(buf, ')');
--- 425,447 
  
  		/* And emit the string */
  		if (nq)
! 			appendStringInfoCharMacro(buf, '"');
  		for (tmp = value; *tmp; tmp++)
  		{
  			char		ch = *tmp;
  
  			if (ch == '"' || ch == '\\')
! appendStringInfoCharMacro(buf, ch);
! 			appendStringInfoCharMacro(buf, ch);
  		}
  		if (nq)
! 			appendStringInfoCharMacro(buf, '"');
! 
! 		pfree(value);
! 
! 		/* Clean up detoasted copy, if any */
! 		if (DatumGetPointer(attr) != DatumGetPointer(values[i]))
! 			pfree(DatumGetPointer(attr));
  	}
  
  	appendStringInfoChar(buf, ')');
*** record_send(PG_FUNCTION_ARGS)
*** 714,719 
--- 729,735 
  	{
  		ColumnIOData *column_info = &my_extra->columns[i];
  		Oid			column_type = tupdesc->attrs[i]->atttypid;
+ 		Datum		attr;
  		bytea	   *outputbytes;
  
  		/* Ignore dropped columns in datatype */
*** record_send(PG_FUNCTION_ARGS)
*** 734,756 
  		 */
  		if (column_info->column_type != column_type)
  		{
- 			bool		typIsVarlena;
- 
  			getTypeBinaryOutputInfo(column_type,
  	column_info->typiofunc,
! 	&typIsVarlena);
  			fmgr_info_cxt(column_info->typiofunc, &column_info->proc,
  		  fcinfo->flinfo->fn_mcxt);
  			

Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol

2012-11-12 Thread Amit kapila
On Monday, November 12, 2012 8:31 PM Merlin Moncure wrote:
On Fri, Sep 14, 2012 at 6:42 AM, Amit kapila amit.kap...@huawei.com wrote:
 On Tuesday, September 11, 2012 7:07 PM Amit Kapila wrote:
 On Monday, September 10, 2012 8:20 PM Amit Kapila wrote:
 On Sunday, September 09, 2012 1:37 PM Amit Kapila wrote:
 On Friday, September 07, 2012 11:19 PM Tom Lane wrote:
 Heikki Linnakangas hlinn...@iki.fi writes:
  Would socketpair(2) be simpler?

 I have started working on this patch to make it work on Windows. The 3
 main things to make it work are:

The patch which contains Windows implementation as well is attached with 
this mail. It contains changes related to following
1. waitpid
2. socketpair
3. fork-exec

The following is still left:
 1. Error handling in all paths

 The modified version 2 contains error handling in all paths.

 I didn't see that this patch was added to a commitfest  -- should it
 have been?  I very much like Tom's proposed starting point for this
 feature as a replacement for --single.  Hate to see this die on the
 vine.  Would some testing on windows be what's needed to get the ball
 rolling?

After the Windows implementation, I also did first-level tests to make sure 
it works. 
I think Tom is the right person to comment on how to see this patch move 
forward.
I am not sure why he didn't provide feedback or proceed to complete it.
Is it due to time, or might he have foreseen some design or use-case problem? If 
it's due to time then I think he can be persuaded.

With Regards,
Amit Kapila. 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol

2012-11-12 Thread Amit kapila
On Tuesday, November 13, 2012 3:11 AM Simon Riggs wrote:
On 12 November 2012 21:26, Merlin Moncure mmonc...@gmail.com wrote:

 I couldn't disagree more.  The patch is small, logical, and fixes an
 awful problem, namely that --single mode is basically unusable.  As to
 your wider point (namely, that you can't connect to it, therefore it's
 bad), it has got to be refuted by numerous competing solutions in the
 market such as http://www.firebirdsql.org/manual/fbmetasecur-embedded.html,
 and many others.

As far as I remember even MySQL provides such a mode.

 Small is not an argument in favour, just a statement of ease, like
jumping off a cliff. i.e. lemmings.

 While it's not as common as it used to be, now and then a piece of
 software needs to distribute an application as part of a boxed
 product.  Postgres is horrible at this and doesn't have to be; imagine
 how much easier the lives of poker tracker would be (for *some* of its
 users) with an integrated standalone mode: google 'poker tracker
 postgresql' and take a good long look at problems people face in this
 scenario.

I get the installability thang, very very much, I just don't see the
single process thing as the only solution. At very least an open
minded analysis of the actual problem and ways of solving it is called
for, not just reach for a close to hand solution.

Another use case where I have seen this required is in telecom billing apps.
The telecom applications where this solution works need other maintenance 
connections as well.
Some of the reasons for its use are performance and lower maintenance overhead, 
and their data requirements are 
also not very high.
So even if this solution doesn't meet all the requirements of a single-process 
solution (and I don't think it is written to address them all), can't we think 
of it as a first version and then, based on requirements, extend it to have other 
capabilities:
a. a mechanism for other background processes (autovacuum, checkpoint, 
..).
b. more needs to be thought of..


With Regards,
Amit Kapila.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers