Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Heikki Linnakangas

On 12.11.2012 01:25, Tom Lane wrote:

Worse than that, GIST WAL replay seems to be a total disaster from this
standpoint, and I'm not convinced that it's even fixable without
incompatible WAL changes.  There are several cases where index
insertion operations generate more than one WAL record while holding
exclusive lock on buffer(s).  If the lock continuity is actually
necessary to prevent concurrent queries from seeing inconsistent index
states, then we have a big problem, because WAL replay isn't designed to
hold any buffer lock for more than one WAL action.


Hmm. After the rewrite of that logic I did in 9.1 (commit 9de3aa65f0), 
the GiST index is always in a consistent state. Until the downlink for 
a newly split page has been inserted, its left sibling is flagged so 
that a search knows to follow the right-link to find it. In normal 
operation, the lock continuity is needed to prevent another backend from 
seeing the incomplete split and trying to fix it, but in hot standby, we 
never try to fix incomplete splits anyway.
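
(To illustrate the mechanism: the scan-side handling boils down to something
like the sketch below. This is a simplification from memory, not the actual
9.1 code; GistFollowRight() and the rightlink field are the names I recall
from the tree, and the helper at the end is hypothetical.)

    page = BufferGetPage(buffer);
    if (GistFollowRight(page))
    {
        /*
         * The page was split but the downlink for the new right sibling is
         * not in the parent yet, so the scan also has to visit the right
         * sibling through the right-link.
         */
        BlockNumber rightblkno = GistPageGetOpaque(page)->rightlink;

        remember_block_to_scan(scan, rightblkno);   /* hypothetical helper */
    }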


So I think we're good on >= 9.1. The 9.0 code is broken, however. In 
9.0, when a child page is split, the parent and new children are kept 
locked until the downlinks are inserted/updated. If a concurrent scan 
comes along and sees that incomplete state, it will miss tuples on the 
new right siblings. We rely on a rm_cleanup operation at the end of WAL 
replay to fix that situation, if the downlink insertion record is not 
there. I don't see any easy way to fix that, unfortunately. Perhaps we 
could backpatch the 9.1 rewrite, now that it's gotten some real-world 
testing, but it was a big change so I don't feel very comfortable doing 
that.



Also, gistRedoPageDeleteRecord seems to be dead code?  I don't see
anything that emits XLOG_GIST_PAGE_DELETE.


Yep. It was used in VACUUM FULL, but has been dead since VACUUM FULL was 
removed.


- Heikki




Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Markus Wanner
Jeff,

On 11/10/2012 12:08 AM, Jeff Davis wrote:
 The bit indicating that a checksum is present may be lost due to
 corruption.

Hm.. I see.

Sorry if that has been discussed before, but can't we do without that
bit at all? It adds a checksum switch to each page, where we just agreed
we don't even want a per-database switch.

Can we simply write a progress indicator to pg_control or someplace
saying that all pages up to X of relation Y are supposed to have valid
checksums?

That would mean having to re-calculate the checksums on pages that got
dirtied before VACUUM came along to migrate them to having a checksum,
but that seems acceptable. VACUUM could even detect that case and
wouldn't have to re-write it with the same contents.
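
To make the idea concrete (all of these names are invented, just to show the
shape of the check):

    typedef struct ChecksumProgress
    {
        Oid         relid;          /* relation Y */
        BlockNumber upto_block;     /* all pages up to X are checksummed */
    } ChecksumProgress;

    /* Would a missing or bad checksum on this page be an error? */
    static bool
    page_must_have_checksum(const ChecksumProgress *cp,
                            Oid relid, BlockNumber blkno)
    {
        return cp->relid == relid && blkno <= cp->upto_block;
    }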

I realize this doesn't support Jesper's use case of wanting to have the
checksums only for newly dirtied pages. However, I'd argue that
prolonging the migration to spread the load would allow even big shops
to go through this without much of an impact on performance.

Regards

Markus Wanner




Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Markus Wanner
On 11/12/2012 05:55 AM, Greg Smith wrote:
 Adding an initdb option to start out with everything checksummed seems
 an uncontroversial good first thing to have available.

+1

So the following discussion really is about a future patch extending
that initial checksum support.

 One of the really common cases I was expecting here is that conversions
 are done by kicking off a slow background VACUUM CHECKSUM job that might
 run in pieces.  I was thinking of an approach like this:
 
 -Initialize a last_checked_block value for each table
 -Loop:
 --Grab the next block after the last checked one
 --When on the last block of the relation, grab an exclusive lock to
 protect against race conditions with extension
 --If it's marked as checksummed and the checksum matches, skip it
 ---Otherwise, add a checksum and write it out
 --When that succeeds, update last_checked_block
 --If that was the last block, save some state saying the whole table is
 checksummed

Perfect, thanks. That's the rough idea I had in mind as well, written
out in detail and catching the extension case.
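
Roughly, in code form (invented helper names, and leaving out the WAL
logging, cost-based delays and the extension-lock handling you mention):

    BlockNumber blkno = get_last_checked_block(rel);        /* invented */
    BlockNumber nblocks = RelationGetNumberOfBlocks(rel);

    for (; blkno < nblocks; blkno++)
    {
        Buffer      buf = ReadBuffer(rel, blkno);
        Page        page;

        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        page = BufferGetPage(buf);
        if (!page_is_checksummed(page) || !checksum_matches(page)) /* invented */
        {
            set_page_checksum(page);                         /* invented */
            MarkBufferDirty(buf);
        }
        UnlockReleaseBuffer(buf);

        save_last_checked_block(rel, blkno);                 /* invented */
    }
    mark_table_fully_checksummed(rel);                       /* invented */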

 With that logic, there is at least a forward moving pointer that removes
 the uncertainty around whether pages have been updated or not.  It will
 keep going usefully if interrupted too.  One obvious way this can
 fail is if:
 
 1) A late page in the relation is updated and a checksummed page written
 2) The page is corrupted such that the "is this checksummed?" bits are
 not consistent anymore, along with other damage to it
 3) The conversion process gets to this page eventually
 4) The corruption of (2) isn't detected

IMO this just outlines how limited the use of the "is this checksummed"
bit in the page itself is. It just doesn't catch all cases. Is it worth
having that bit at all, given your block-wise approach above?

It really only serves to catch corruptions to *newly* dirtied pages
*during* the migration phase that *keep* that single bit set. Everything
else is covered by the last_checked_block variable. Sounds narrow enough
to be negligible. Then again, it's just a single bit per page...

 The only guarantee I see that we can give for online upgrades is that
 after a VACUUM CHECKSUM sweep is done, and every page is known to both
 have a valid checksum on it and have its checksum bits set, *then* any
 page that doesn't have both set bits and a matching checksum is garbage.

From that point in time on, we'd theoretically better use that bit as an
additional checksum bit rather than requiring it to be set at all times.
Really just theoretically, I'm certainly not advocating a 33 bit
checksum :-)

Regards

Markus Wanner




Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Craig Ringer
On 11/12/2012 04:44 PM, Markus Wanner wrote:
 Jeff,

 On 11/10/2012 12:08 AM, Jeff Davis wrote:
 The bit indicating that a checksum is present may be lost due to
 corruption.
 Hm.. I see.

 Sorry if that has been discussed before, but can't we do without that
 bit at all? It adds a checksum switch to each page, where we just agreed
 we don't even want a per-database switch.

 Can we simply write a progress indicator to pg_control or someplace
 saying that all pages up to X of relation Y are supposed to have valid
 checksums?
That'll make it hard for VACUUM, hint-bit setting, etc to
opportunistically checksum pages whenever they're doing a page write anyway.

Is it absurd to suggest using another bitmap, like the FSM or visibility
map, to store information on page checksumming while checksumming is
enabled but incomplete? As a much smaller file the bitmap could itself
be very quickly generated in one pass when checksumming is enabled, with
its starting state showing no pages having checksums.

It could perhaps itself have page checksums, since presumably the persistent
maps like the FSM and visibility map will support them. Some way to
ensure the checksum map is valid would be needed.
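
The lookup side would be trivial; something like this (one bit per heap
block, visibility-map style, names invented):

    /* Is block blkno of this relation already covered by a checksum? */
    static bool
    block_is_checksummed(const unsigned char *checksum_map, BlockNumber blkno)
    {
        return (checksum_map[blkno / 8] & (1 << (blkno % 8))) != 0;
    }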

--
Craig Ringer




Re: [HACKERS] Reduce palloc's in numeric operations.

2012-11-12 Thread Kyotaro HORIGUCHI
Thanks for comments,

 Have to be careful to really not modify the
 operands. numeric_out() and numeric_out_sci() are wrong; they
 call get_str_from_var(), which modifies the argument. Same with
 numeric_expr(): it passes the argument to
 numericvar_to_double_no_overflow(), which passes it to
 get_str_from_var(). numericvar_to_int8() also modifies its
 argument, so all the functions that use that, directly or
 indirectly, must make a copy.

Mmm. I wasn't careful enough to pick those up...

I overlooked that get_str_from_var() and numericvar_to_int8() call
round_var(), which rewrites the operand. I reverted numeric_out(),
numeric_int8(), and numeric_int2().

Although, I couldn't find where in get_str_from_var_sci() the
operand would be modified.

The lines where var appears in get_str_from_var_sci() are listed
below.

|  2:get_str_from_var_sci(NumericVar *var, int rscale)
| 21:  if (var->ndigits > 0)
| 23:    exponent = (var->weight + 1) * DEC_DIGITS;
| 29:    exponent -= DEC_DIGITS - (int) log10(var->digits[0]);
| 59:  div_var(var, &denominator, &significand, rscale, true);

The only suspect at this level is div_var, so I did the same check
for var1 in div_var.

|   2:div_var(NumericVar *var1, NumericVar *var2, NumericVar *result,
|  20:  int  var1ndigits = var1->ndigits;
|  35:  if (var1ndigits == 0)
|  47:  if (var1->sign == var2->sign)
|  51:  res_weight = var1->weight - var2->weight;
|  68:  div_ndigits = Max(div_ndigits, var1ndigits);
|  83:  memcpy(dividend + 1, var1->digits, var1ndigits * sizeof(NumericDigit));
| 132:  for (i = var1ndigits; i >= 0; i--)

No line seems to modify var1 as far as I can see, so I've left
numeric_out_sci() modified in this patch.


Well, I found some other bugs in numeric_stddev_internal:
vN was erroneously freed, and vsumX2 is used as a work variable.
They are fixed in this new patch.

 Perhaps get_str_from_var(), and the other functions that
 currently scribble on the arguments, should be modified to not
 do so. They could easily make a copy of the argument within the
 function. Then the callers could safely use
 set_var_from_num_nocopy(). The performance would be the same,
 you would have the same number of pallocs, but you would get
 rid of the surprising argument-modifying behavior of those
 functions.

I agree with that. const qualifiers on the parameters would enforce this
mechanically. I'll try this for the next version of this patch.
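
For reference, the nocopy variant is essentially just this (simplified
sketch; the NUMERIC_* accessor macros are the ones already used in
numeric.c):

    /*
     * Point the NumericVar at the digits stored in the Numeric datum
     * instead of palloc'ing a copy.  buf = NULL tells free_var() there is
     * nothing to pfree, and the caller must treat the var as read-only,
     * which is exactly why the argument-modifying functions above matter.
     */
    static void
    set_var_from_num_nocopy(Numeric num, NumericVar *dest)
    {
        dest->ndigits = NUMERIC_NDIGITS(num);
        dest->weight = NUMERIC_WEIGHT(num);
        dest->sign = NUMERIC_SIGN(num);
        dest->dscale = NUMERIC_DSCALE(num);
        dest->digits = NUMERIC_DIGITS(num);
        dest->buf = NULL;
    }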


 SELECT SUM(col) FROM numtest;
 
 The execution time of that query fell from about 5300 ms to 4300 ms, ie. 
 about 20%.

Wow, it seems more promising than I expected. Thanks.

regards,

-- 
Kyotaro Horiguchi

NTT Open Source Software Center, Nippon Telegraph and Telephone Corporation
Phone: 03-5860-5115 / Fax: 03-5463-5490
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 68c1f1d..fcff325 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -367,6 +367,7 @@ static void zero_var(NumericVar *var);
 static const char *set_var_from_str(const char *str, const char *cp,
  NumericVar *dest);
 static void set_var_from_num(Numeric value, NumericVar *dest);
+static void set_var_from_num_nocopy(Numeric num, NumericVar *dest);
 static void set_var_from_var(NumericVar *value, NumericVar *dest);
 static char *get_str_from_var(NumericVar *var, int dscale);
 static char *get_str_from_var_sci(NumericVar *var, int rscale);
@@ -617,11 +618,10 @@ numeric_out_sci(Numeric num, int scale)
 		return pstrdup("NaN");
 
 	init_var(&x);
-	set_var_from_num(num, &x);
+	set_var_from_num_nocopy(num, &x);
 
 	str = get_str_from_var_sci(&x, scale);
 
-	free_var(&x);
 	return str;
 }
 
@@ -696,7 +696,7 @@ numeric_send(PG_FUNCTION_ARGS)
 	int			i;
 
 	init_var(&x);
-	set_var_from_num(num, &x);
+	set_var_from_num_nocopy(num, &x);
 
 	pq_begintypsend(&buf);
 
@@ -707,8 +707,6 @@ numeric_send(PG_FUNCTION_ARGS)
 	for (i = 0; i < x.ndigits; i++)
 		pq_sendint(&buf, x.digits[i], sizeof(NumericDigit));
 
-	free_var(&x);
-
 	PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
 }
 
@@ -1577,15 +1575,13 @@ numeric_add(PG_FUNCTION_ARGS)
 	init_var(&arg2);
 	init_var(&result);
 
-	set_var_from_num(num1, &arg1);
-	set_var_from_num(num2, &arg2);
+	set_var_from_num_nocopy(num1, &arg1);
+	set_var_from_num_nocopy(num2, &arg2);
 
 	add_var(&arg1, &arg2, &result);
 
 	res = make_result(&result);
 
-	free_var(&arg1);
-	free_var(&arg2);
 	free_var(&result);
 
 	PG_RETURN_NUMERIC(res);
@@ -1620,15 +1616,13 @@ numeric_sub(PG_FUNCTION_ARGS)
 	init_var(&arg2);
 	init_var(&result);
 
-	set_var_from_num(num1, &arg1);
-	set_var_from_num(num2, &arg2);
+	set_var_from_num_nocopy(num1, &arg1);
+	set_var_from_num_nocopy(num2, &arg2);
 
 	sub_var(&arg1, &arg2, &result);
 
 	res = make_result(&result);
 
-	free_var(&arg1);
-	free_var(&arg2);
 	free_var(&result);
 
 	PG_RETURN_NUMERIC(res);
@@ -1667,15 +1661,13 @@ numeric_mul(PG_FUNCTION_ARGS)
 	init_var(&arg2);
 	init_var(&result);
 
-	set_var_from_num(num1, &arg1);
-	set_var_from_num(num2, &arg2);
+	set_var_from_num_nocopy(num1, &arg1);
+	set_var_from_num_nocopy(num2, &arg2);
 
 	mul_var(&arg1, &arg2, &result, arg1.dscale + 

Re: [HACKERS] Identity projection

2012-11-12 Thread Kyotaro HORIGUCHI
Hello, this is the new version of the identity projection patch.

I reverted the changes to projectionInfo and ExecBuildProjectionInfo.
Identity projection is now recognized directly in ExecGroup, ExecResult,
and ExecWindowAgg. nodeAgg is reverted because I couldn't make it
sane.
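
The core of the recognition is a check along these lines (a simplified
sketch, not the exact patch code):

    /*
     * A targetlist is an identity projection when every entry is a plain
     * Var referencing the corresponding input column, in order, with no
     * resjunk entries and nothing computed.
     */
    static bool
    tlist_is_identity(List *targetlist)
    {
        AttrNumber  attno = 1;
        ListCell   *lc;

        foreach(lc, targetlist)
        {
            TargetEntry *tle = (TargetEntry *) lfirst(lc);

            if (tle->resjunk || !IsA(tle->expr, Var) ||
                ((Var *) tle->expr)->varattno != attno)
                return false;
            attno++;
        }
        return true;
    }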

The following is the result of performance test posted before in
order to show the source of the gain.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



At Fri, 05 Oct 2012 16:04:16 +0900, Kyotaro HORIGUCHI wrote in 
20121005.160416.256387378.horiguchi.kyot...@lab.ntt.co.jp
  Although I said the following, the gain seems a bit larger... I'll
  recheck the testing conditions...
 
 I inspected this more precisely in two respects, magnifying the
 effect of this patch by putting 300 columns into the table.
 
 
 First, explain analyze says the difference caused by this patch
 is only in the actual time of the Result node.
 
 orig$ psql -c 'explain analyze select * from parenta'
   QUERY PLAN
 --
  Result  (cost=0.00..176667.00 rows=101 width=1202)
  (actual time=0.013.. *2406.792* rows=100 loops=1)
->  Append  (cost=0.00..176667.00 rows=101 width=1202)
(actual time=0.011..412.749 rows=100 loops=1)
  ->  Seq Scan on parenta  (cost=0.00..0.00 rows=1 width=1228)
   (actual time=0.001..0.001 rows=0 loops=1)
  ->  Seq Scan on childa000 parenta
 (cost=0.00..176667.00 rows=100 width=1202)
 (actual time=0.009..334.633 rows=100 loops=1)
  Total runtime: 2446.474 ms
 (5 rows)
 
 patched$ psql -c 'explain analyze select * from parenta'
   QUERY PLAN
 --
  Result  (cost=0.00..176667.00 rows=101 width=1202)
  (actual time=0.011.. *507.239* rows=100 loops=1)
->  Append  (cost=0.00..176667.00 rows=101 width=1202)
(actual time=0.011..419.420 rows=100 loops=1)
  ->  Seq Scan on parenta  (cost=0.00..0.00 rows=1 width=1228)
   (actual time=0.000..0.000 rows=0 loops=1)
  ->  Seq Scan on childa000 parenta
 (cost=0.00..176667.00 rows=100 width=1202)
 (actual time=0.010..335.721 rows=100 loops=1)
  Total runtime: 545.879 ms
 (5 rows)
 
 
 Second, the results of configure --enable-profiling show that
 the execution time of ExecProject changes greatly. This is consistent
 with what explain showed.
 
 orig:
%   cumulative   self  self total   
   time   seconds   secondscalls   s/call   s/call  name
   60.29  1.26 1.26  105 0.00 0.00  slot_deform_tuple
 ! 30.14  1.89 0.63  100 0.00 0.00  ExecProject
3.35  1.96 0.07  304 0.00 0.00  ExecProcNode
0.96  1.98 0.02  102 0.00 0.00  ExecScan
0.96  2.00 0.02   166379 0.00 0.00  TerminateBufferIO
0.48  2.01 0.01  304 0.00 0.00  InstrStartNode
0.48  2.02 0.01  304 0.00 0.00  InstrStopNode
 !  0.48  2.03 0.01  101 0.00 0.00  ExecResult
0.48  2.04 0.01   830718 0.00 0.00  LWLockAcquire
    0.48  2.05 0.01   506834 0.00 0.00  hash_search_with_hash_value
    0.48  2.06 0.01   341656 0.00 0.00  LockBuffer
    0.48  2.07 0.01   168383 0.00 0.00  ReadBuffer_common
    0.48  2.08 0.01        4 0.00 0.00  InstrEndLoop
    0.48  2.09 0.01                     ExecCleanTargetListLength
0.00  2.09 0.00  205 0.00 0.00  MemoryContextReset
 
 patched:
 %   cumulative   self  self total   
time   seconds   secondscalls  ms/call  ms/call  name
23.08  0.03 0.03  304 0.00 0.00  ExecProcNode
15.38  0.05 0.02  102 0.00 0.00  heapgettup_pagemode
15.38  0.07 0.02   830718 0.00 0.00  LWLockAcquire
 7.69  0.08 0.01  205 0.00 0.00  MemoryContextReset
 7.69  0.09 0.01  102 0.00 0.00  ExecScan
 7.69  0.10 0.01  100 0.00 0.00  ExecStoreTuple
 7.69  0.11 0.01   841135 0.00 0.00  LWLockRelease
 7.69  0.12 0.01   168383 0.00 0.00  ReadBuffer_common
 7.69  0.13 0.01   168383 0.00 0.00  UnpinBuffer
 0.00  0.13 0.00  304 0.00 0.00  InstrStartNode
  ...
 !   0.00  0.13 0.00  101 0.00 0.00  ExecResult
 !   0.00  0.13 0.00  100 0.00 0.00  ExecProject
==
diff --git a/src/backend/executor/nodeGroup.c 

Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Markus Wanner
On 11/12/2012 10:44 AM, Craig Ringer wrote:
 That'll make it hard for VACUUM, hint-bit setting, etc to
 opportunistically checksum pages whenever they're doing a page write anyway.

It *is* a hard problem, yes. And the single bit doesn't really solve it.
So I'm arguing against opportunistically checksumming in general. Who
needs that anyway?

 Is it absurd to suggest using another bitmap, like the FSM or visibility
 map, to store information on page checksumming while checksumming is
 enabled but incomplete?

Not absurd. But arguably inefficient, because that bitmap may well
become a bottleneck itself. Plus there's the problem of making sure
those pages are safe against corruptions, so you'd need to checksum the
checksum bitmap... doesn't sound like a nice solution to me.

This has certainly been discussed before.

Regards

Markus Wanner




Re: [HACKERS] why can't plpgsql return a row-expression?

2012-11-12 Thread Asif Rehman
Hi,

I have tried to solve this issue. Please see the attached patch.

With this patch, any expression is allowed in the return statement. For any
invalid expression an error is generated without doing any special handling.
When a row expression is used in the return statement, the resultant
tuple will have the rowtype in a single column that needs to be extracted.
Hence I have handled that case in exec_stmt_return().

any comments/suggestions?

Regards,
--Asif

On Mon, Oct 8, 2012 at 9:53 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Pavel Stehule pavel.steh...@gmail.com writes:
  2012/10/8 Tom Lane t...@sss.pgh.pa.us:
  Laziness, probably.  Feel free to have at it.

  I wrote a patch some years ago. It was rejected for performance reasons
  - because every row had to be cast to the resulting type.

 I don't recall that patch in any detail, but it's not apparent to me
 that an extra cast step *must* be required to implement this.  In the
 cases that are supported now, surely we could notice that no additional
 work is required.

 It's also worth commenting that if we were to switch the storage of
 composite-type plpgsql variables to HeapTuple, as has been suggested
 here for other reasons, the performance tradeoffs in this area would
 likely change completely anyway.

 regards, tom lane





return_rowtype.patch
Description: Binary data



Re: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL

2012-11-12 Thread Tom Lane
Greg Smith g...@2ndquadrant.com writes:
 Regardless, exactly how to prevent two concurrent processes from writing 
 the same file feels like the last step in the small roadmap for what 
 this feature needs.

Write a temp file and use rename(2) to move it into place is the
standard solution for that.
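
In code form, roughly (sketch only; error checks and the fsync of the
containing directory are left out, and conf_path / newcontents stand in for
the real target path and buffer):

    char    tmpname[MAXPGPATH];
    int     fd;

    snprintf(tmpname, sizeof(tmpname), "%s.tmp", conf_path);
    fd = open(tmpname, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    write(fd, newcontents, len);
    fsync(fd);                  /* make the new contents durable first */
    close(fd);
    rename(tmpname, conf_path); /* atomically replace the old file */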

regards, tom lane




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Robert Haas
On Sun, Nov 11, 2012 at 6:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I already pointed out the inconsistency in heap_xlog_freeze about
 whether a cleanup lock is needed.  As is, this patch uses a cleanup
 lock, but I suspect that a regular lock is sufficient --- comments?

Well, according to storage/buffer/README:

1. To scan a page for tuples, one must hold a pin and either shared or
exclusive content lock.  To examine the commit status (XIDs and status bits)
of a tuple in a shared buffer, one must likewise hold a pin and either shared
or exclusive lock.

That does indeed make it sound like an x-lock is enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sun, Nov 11, 2012 at 6:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I already pointed out the inconsistency in heap_xlog_freeze about
 whether a cleanup lock is needed.  As is, this patch uses a cleanup
 lock, but I suspect that a regular lock is sufficient --- comments?

 Well, according to storage/buffer/README:

 1. To scan a page for tuples, one must hold a pin and either shared or
 exclusive content lock.  To examine the commit status (XIDs and status bits)
 of a tuple in a shared buffer, one must likewise hold a pin and either shared
 or exclusive lock.

 That does indeed make it sound like an x-lock is enough.

Yeah.  AFAICS, the only possible downside is that somebody might be
using the tuple (while holding just a buffer pin), and that its xmin
might change while that's happening.  So for instance a nestloop join
might read out different xmin values for the same row while the join
proceeds.  But that could happen anyway if a different plan had been
chosen (viz, this table on the inside not the outside of the nestloop).
The xmin update ought to be physically atomic, so you shouldn't be able
to get a garbage result, just the old value or the new.

The cleanup lock is needed for cases where we'd like to remove or move a
tuple, but that's not required here.

regards, tom lane




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 12.11.2012 01:25, Tom Lane wrote:
 Worse than that, GIST WAL replay seems to be a total disaster from this
 standpoint, and I'm not convinced that it's even fixable without
 incompatible WAL changes.

 Hmm. After the rewrite of that logic I did in 9.1 (commit 9de3aa65f0), 
 the GiST index is always in a consistent state. Before the downlink for 
 a newly split page has been inserted yet, its left sibling is flagged so 
 that a search knows to follow the right-link to find it. In normal 
 operation, the lock continuity is needed to prevent another backend from 
 seeing the incomplete split and trying to fix it, but in hot standby, we 
 never try to fix incomplete splits anyway.

 So I think we're good on >= 9.1.

Okay.  I'll update the patch to make sure that the individual WAL replay
functions hold all locks, but not worry about holding locks across
actions.

 The 9.0 code is broken, however. In 
 9.0, when a child page is split, the parent and new children are kept 
 locked until the downlinks are inserted/updated. If a concurrent scan 
 comes along and sees that incomplete state, it will miss tuples on the 
 new right siblings. We rely on a rm_cleanup operation at the end of WAL 
 replay to fix that situation, if the downlink insertion record is not 
 there. I don't see any easy way to fix that, unfortunately. Perhaps we 
 could backpatch the 9.1 rewrite, now that it's gotten some real-world 
 testing, but it was a big change so I don't feel very comfortable doing 
 that.

Me either.  Given the lack of field complaints, I think we're better
advised to just leave it unfixed in 9.0.  It'd not be a step forward
if we broke something trying to make this work.

 Also, gistRedoPageDeleteRecord seems to be dead code?  I don't see
 anything that emits XLOG_GIST_PAGE_DELETE.

 Yep. It was used in VACUUM FULL, but has been dead since VACUUM FULL was 
 removed.

Okay.  I see we bumped XLOG_PAGE_MAGIC in 9.0, so there's no longer
any way that 9.0 or later versions could see this WAL record type.
I'll delete that replay function rather than bothering to fix it.

regards, tom lane




Re: [HACKERS] [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

2012-11-12 Thread Fujii Masao
On Fri, Nov 9, 2012 at 3:03 PM, Amit Kapila amit.kap...@huawei.com wrote:
 On Thursday, November 08, 2012 10:42 PM Fujii Masao wrote:
 On Thu, Nov 8, 2012 at 5:53 PM, Amit Kapila amit.kap...@huawei.com
 wrote:
  On Thursday, November 08, 2012 2:04 PM Heikki Linnakangas wrote:
  On 19.10.2012 14:42, Amit kapila wrote:
   On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:
   Before implementing the timeout parameter, I think that it's
 better
  to change
   both pg_basebackup background process and pg_receivexlog so that
 they
   send back the reply message immediately when they receive the
  keepalive
   message requesting the reply. Currently, they always ignore such
  keepalive
   message, so status interval parameter (-s) in them always must be
 set
  to
   the value less than replication timeout. We can avoid this
  troublesome
   parameter setting by introducing the same logic of walreceiver
 into
  both
   pg_basebackup background process and pg_receivexlog.
  
   Please find the patch attached to address the modification
 mentioned
  by you (send immediate reply for keepalive).
   Both basebackup and pg_receivexlog uses the same function
  ReceiveXLogStream, so single change for both will address the issue.
 
  Thanks, committed this one after shuffling it around the changes I
  committed yesterday. I also updated the docs to not claim that -s
 option
  is required to avoid timeout disconnects anymore.
 
  Thank you.
  However I think still the issue will not be completely solved.
  pg_basebackup/pg_receivexlog can still take long time to
  detect network break as they don't have timeout concept. To do that I
 have
  sent one proposal which is mentioned at end of mail chain:
  http://archives.postgresql.org/message-
 id/6C0B27F7206C9E4CA54AE035729E9C3828
  53BBED@szxeml509-mbs
 
  Do you think there is any need to introduce such mechanism in
  pg_basebackup/pg_receivexlog?

 Are you planning to introduce the timeout mechanism in pg_basebackup
 main process? Or background process? It's useful to implement both.

 By background process, you mean ReceiveXlogStream?
 For both.

 I think for background process, it can be done in a way similar to what we
 have done for walreceiver.

Yes.

 But I have some doubts about how to do it for the main process:

 Logic similar to walreceiver's cannot be used in case the network goes down
 while receiving the other database files from the server.
 The reason is that, to receive the data files, PQgetCopyData() is
 called in synchronous mode, so it keeps waiting indefinitely until it
 gets some data.
 In order to solve this issue, I can think of the following options:
 1. Making this call also asynchronous (but not sure about the impact of this).

+1

Walreceiver already calls PQgetCopyData() asynchronously. ISTM you can
solve the issue in a similar way to walreceiver's.
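
Something like the following, in libpq terms (sketch; error handling
trimmed, and timeout_ms / process_message are placeholders for whatever the
utility needs to do):

    for (;;)
    {
        char       *copybuf = NULL;
        int         len = PQgetCopyData(conn, &copybuf, 1); /* async mode */

        if (len > 0)
        {
            process_message(copybuf, len);
            PQfreemem(copybuf);
            continue;
        }
        if (len == -1 || len == -2)
            break;              /* COPY ended, or connection error */

        /* len == 0: nothing available yet; wait for data or the timeout */
        {
            int             sock = PQsocket(conn);
            fd_set          input_mask;
            struct timeval  timeout;

            timeout.tv_sec = timeout_ms / 1000;
            timeout.tv_usec = (timeout_ms % 1000) * 1000;
            FD_ZERO(&input_mask);
            FD_SET(sock, &input_mask);

            if (select(sock + 1, &input_mask, NULL, NULL, &timeout) == 0)
                break;          /* timed out: give up on the connection */
            if (!PQconsumeInput(conn))
                break;          /* connection broken */
        }
    }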

 2. In the function pqWait, instead of passing the hard-coded value -1 (i.e.
 infinite wait), we can pass some finite time. This time can be received as a
 command line argument from the respective utility and set in the PGconn
 structure. In order to have a timeout value in PGconn, we could:
 a. Add a new parameter in PGconn to indicate the receive timeout.
 b. Use the existing parameter connect_timeout for the receive timeout
 as well, but this may lead to confusion.
 3. Any other better option?

 Apart from the above issue, there is a possibility that if the network goes
 down during connect time, then it might hang, because connect_timeout by
 default will be NULL and connectDBComplete will start waiting infinitely for
 the connection to become successful.
 So shall we have a separate command line argument for this as well, or any
 other way you suggest?

Yes, I think that we should add something like --conninfo option to
pg_basebackup
and pg_receivexlog. We can easily set not only connect_timeout but also sslmode,
application_name, ... by using such option accepting conninfo string.

 BTW, IIRC the walsender has no timeout mechanism during sending
 backup data to pg_basebackup. So it's also useful to implement the
 timeout mechanism for the walsender during backup.

 Yes, it's useful, but for walsender the main problem is that it uses a
 blocking send call to send the data.
 I have tried using tcp_keepalive settings, but the send call doesn't come
 out in case of a network break.
 The only way I could get it out is to change the corresponding file
 /proc/sys/net/ipv4/tcp_retries2 by using the command
 echo 8 > /proc/sys/net/ipv4/tcp_retries2
 As per the recommendation, its value should be at least 8 (equivalent to 100
 sec).

 Do you have any idea, how it can be achieved?

What about using pq_putmessage_noblock()?

Regards,

-- 
Fujii Masao




Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Alvaro Herrera
Greg Smith wrote:
 On 11/11/12 2:56 PM, Jeff Davis wrote:
 We could have a separate utility, pg_checksums, that can
 alter the state and/or do an offline verification. And initdb would take
 an option that would start everything out fully protected with
 checksums.
 
 Adding an initdb option to start out with everything checksummed
 seems an uncontroversial good first thing to have available.

+1

 Won't a pg_checksums program just grow until it looks like a limited
 version of vacuum though?  It's going to iterate over most of the
 table; it needs the same cost controls as autovacuum (and to respect
 the load of concurrent autovacuum work) to keep I/O under control;
 and those cost control values might change if there's a SIGHUP to
 reload parameters. It looks so much like vacuum that I think there
 needs to be a really compelling reason to split it into something
 new.  Why can't this be yet another autovacuum worker that does its
 thing?

I agree that many of the things it's gonna do are going to be pretty
much the same as vacuum, but vacuum does so many other things that I
think it should be kept separate.  Sure, we can make it be invoked from
autovacuum in background according to some (yet to be devised)
scheduling heuristics.  But I don't see that it needs to share any
vacuum code.

A couple of thoughts about autovacuum: it's important to figure out
whether checksumming can run concurrently with vacuuming the same table;
if not, which one defers to the other in case of lock conflict.  Also,
can checksumming be ignored by concurrent transactions when computing
Xmin (I don't see any reason not to ...)

 One of the really common cases I was expecting here is that
 conversions are done by kicking off a slow background VACUUM
 CHECKSUM job that might run in pieces.  I was thinking of an
 approach like this:
 
 -Initialize a last_checked_block value for each table
 -Loop:
 --Grab the next block after the last checked one
 --When on the last block of the relation, grab an exclusive lock to
 protect against race conditions with extension

Note that we have a separate lock type for relation extension, so we can
use that to avoid a conflict here.

 --If it's marked as checksummed and the checksum matches, skip it
 ---Otherwise, add a checksum and write it out
 --When that succeeds, update last_checked_block
 --If that was the last block, save some state saying the whole table
 is checksummed

"Some state" can be a pg_class field that's updated via
heap_inplace_update.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol

2012-11-12 Thread Merlin Moncure
On Fri, Sep 14, 2012 at 6:42 AM, Amit kapila amit.kap...@huawei.com wrote:
 On Tuesday, September 11, 2012 7:07 PM Amit Kapila wrote:
 On Monday, September 10, 2012 8:20 PM Amit Kapila wrote:
 On Sunday, September 09, 2012 1:37 PM Amit Kapila wrote:
 On Friday, September 07, 2012 11:19 PM Tom Lane wrote:
 Heikki Linnakangas hlinn...@iki.fi writes:
 Would socketpair(2) be simpler?



I've not done anything yet about the potential security issues
associated with untrusted libpq connection strings.  I think this
is still at the proof-of-concept stage; in particular, it's probably
 time to see if we can make it work on Windows before we worry more
about that.

 I have started working on this patch to make it work on Windows. The 3
 main things to make it work are:

The patch which contains Windows implementation as well is attached with this 
mail. It contains changes related to the following:
1. waitpid
2. socketpair
3. fork-exec
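
For reference, the POSIX side of the socketpair + fork-exec pattern is small
(sketch only; the real work in the patch is making the equivalent happen on
Windows):

    int     sv[2];
    pid_t   pid;

    socketpair(AF_UNIX, SOCK_STREAM, 0, sv);    /* anonymous connected pair */
    pid = fork();
    if (pid == 0)
    {
        /* child: becomes the standalone backend, talks over sv[1] */
        close(sv[0]);
        dup2(sv[1], 0);
        dup2(sv[1], 1);
        /* exec() the backend here */
    }
    else
    {
        /* parent: libpq uses sv[0] as the connection socket */
        close(sv[1]);
    }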

The following is still left:
 1. Error handling in all paths

 The modified version 2 contains error handling in all paths.

I didn't see that this patch was added to a commitfest  -- should it
have been?  I very much like Tom's proposed starting point for this
feature as a replacement for --single.  Hate to see this die on the
vine.  Would some testing on Windows be what's needed to get the ball
rolling?

merlin




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2012-11-10 16:24:22 -0500, Tom Lane wrote:
 If any of this stuff were getting used by external modules, changing it
 would be problematic ... but fortunately, it isn't, because we lack
 support for plug-in rmgrs.  So I'm not worried about back-patching the
 change, and would prefer to keep the 9.x branches in sync.

 XLR_BKP_BLOCK_* might be used by things like pg_lesslog and its surely
 used by xlogdump. Not sure if either are worth that much attention, but
 it seems worth noticing that such a change will break stuff.

Hm.  Okay, how about we leave the old macros in place in the back
branches?

regards, tom lane




Re: [HACKERS] Fix errcontext() function

2012-11-12 Thread Heikki Linnakangas

On 10.11.2012 11:46, Chen Huajun wrote:

Unfortunately not all compilers support varargs macros. I bumped into
this in February, see
http://archives.postgresql.org/message-id/4f3b72e0.8040...@enterprisedb.com.
My last attempt to fix this was
at http://archives.postgresql.org/pgsql-hackers/2012-04/msg00812.php.
That patch is probably good to go, I just got busy with other things
and forgot about it back then. Can you take a look at that
patch and see if I missed anything, please?


I think you are right, although the number of changed places is a little
bit large.
Thanks for your answer!


Ok, I've committed this patch now, it will be fixed in 9.3. Thanks for 
reminding me about this.


- Heikki




[HACKERS] Version 4.9 of PostgreSQL buildfarm client released

2012-11-12 Thread Andrew Dunstan
Version 4.9 of the PostgreSQL buildfarm client has been released and is 
available at 
https://github.com/downloads/PGBuildFarm/client-code/build-farm-4_9.tgz


Changes since version 4.8:

. adjust git status checking to allow for messages from recent git releases
. add a module to test FileTextArray_FDW as an example of checking an 
extension

. a few small bug fixes and tweaks


You will need this release if you are using or likely to use a fairly 
recent git version to avoid getting spurious git-Dirty failures.


Enjoy

cheers

andrew




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Andres Freund
On 2012-11-12 10:19:09 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2012-11-10 16:24:22 -0500, Tom Lane wrote:
  If any of this stuff were getting used by external modules, changing it
  would be problematic ... but fortunately, it isn't, because we lack
  support for plug-in rmgrs.  So I'm not worried about back-patching the
  change, and would prefer to keep the 9.x branches in sync.

  XLR_BKP_BLOCK_* might be used by things like pg_lesslog and its surely
  used by xlogdump. Not sure if either are worth that much attention, but
  it seems worth noticing that such a change will break stuff.

 Hm.  Okay, how about we leave the old macros in place in the back
 branches?

Sounds good to me. The RestoreBkpBlocks change seems unproblematic to
me. If anything its good that it has been renamed.

Thanks,

Andres




Re: [HACKERS] Fix errcontext() function

2012-11-12 Thread Alvaro Herrera
Heikki Linnakangas wrote:
 On 10.11.2012 11:46, Chen Huajun wrote:
 Unfortunately not all compilers support varargs macros. I bumped into
 this in February, see
 http://archives.postgresql.org/message-id/4f3b72e0.8040...@enterprisedb.com.
 My last attempt to fix this was
 at http://archives.postgresql.org/pgsql-hackers/2012-04/msg00812.php.
 That patch is probably good to go, I just got busy with other things
 and forgot about it back then. Can you take a look at that
 patch and see if I missed anything, please?
 
 I think you are right,although the number of changed place is a a little
 bit large.
 Thanks for your answer!
 
 Ok, I've committed this patch now, it will be fixed in 9.3. Thanks
 for reminding me about this.

Hopefully you noticed that contrib is broken.

-- 




[HACKERS] Dumping an Extension's Script

2012-11-12 Thread Dimitri Fontaine
Hi,

Please find attached to this email an RFC patch implementing the basics
of the pg_dump --extension-script option. After much discussion around
the concept of an inline extension, we decided last year that a good
first step would be pg_dump support for an extension's script.

The approach I've been using here is to dump the script from the catalog
current dependencies, which mean that a sequence of CREATE EXTENSION
followed by a number of ALTER EXTENSION … UPDATE … will be consolidated
into a single CREATE EXTENSION command in the dump, much the same as
with CREATE TABLE then ALTER TABLE … ADD COLUMN and the like.

Currently the option's behavior is the following, which looks sane to me
and is open for discussion: the dump's schema always includes the CREATE
EXTENSION commands you need. The extensions listed in the -X option
(which you can use more than once) will get dumped with their current
member objects in a script, inline.

To try the attached patch, you could do as following:

createdb foo
psql -c "create extension hstore" -d foo
pg_dump -X hstore -f /tmp/foo.sql foo

createdb bar
psql -1 -f /tmp/foo.sql -d bar

To be able to restore the dump, I've been adding some basic support to
the CREATE EXTENSION command so that it will find the data it needs from
the SQL command rather than the control file.

Note that the extension control file only contains information about how
to install an extension from a script file on disk. That's something we
don't need at all when installing the extension from a dump, using
either pg_restore or psql. We have some exceptions to that principle,
namely: requires (sets the search_path) and relocatable (found in the
catalogs, needs to survive dump/restore).

Given positive feedback on that way to attack the problem, the TODO list
includes:

 - document the new pg_dump --extension-script switch
 - add support for ALTER EXTENSION … WITH $$ script here $$;

The ALTER EXTENSION support is optional as far as pg_dump support goes,
it would be good to have it to make the User Interface complete.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

*** a/src/backend/commands/extension.c
--- b/src/backend/commands/extension.c
***
*** 75,80  typedef struct ExtensionControlFile
--- 75,81 
  	bool		superuser;		/* must be superuser to install? */
  	int			encoding;		/* encoding of the script file, or -1 */
  	List	   *requires;		/* names of prerequisite extensions */
+ 	char   *script;
  } ExtensionControlFile;
  
  /*
***
*** 577,586  parse_extension_control_file(ExtensionControlFile *control,
  }
  
  /*
!  * Read the primary control file for the specified extension.
   */
  static ExtensionControlFile *
! read_extension_control_file(const char *extname)
  {
  	ExtensionControlFile *control;
  
--- 578,587 
  }
  
  /*
!  * Create an ExtensionControlFile with default values.
   */
  static ExtensionControlFile *
! default_extension_control_file(const char *extname)
  {
  	ExtensionControlFile *control;
  
***
*** 593,598  read_extension_control_file(const char *extname)
--- 594,610 
 	control->superuser = true;
 	control->encoding = -1;
  
+ 	return control;
+ }
+ 
+ /*
+  * Read the primary control file for the specified extension.
+  */
+ static ExtensionControlFile *
+ read_extension_control_file(const char *extname)
+ {
+ 	ExtensionControlFile *control = default_extension_control_file(extname);
+ 
  	/*
  	 * Parse the primary control file.
  	 */
***
*** 858,866  execute_extension_script(Oid extensionOid, ExtensionControlFile *control,
  	CurrentExtensionObject = extensionOid;
  	PG_TRY();
  	{
! 		char	   *c_sql = read_extension_script_file(control, filename);
  		Datum		t_sql;
  
  		/* We use various functions that want to operate on text datums */
  		t_sql = CStringGetTextDatum(c_sql);
  
--- 870,883 
  	CurrentExtensionObject = extensionOid;
  	PG_TRY();
  	{
! 		char	   *c_sql;
  		Datum		t_sql;
  
+ 		if (control->script)
+ 			c_sql = control->script;
+ 		else
+ 			c_sql = read_extension_script_file(control, filename);
+ 
  		/* We use various functions that want to operate on text datums */
  		t_sql = CStringGetTextDatum(c_sql);
  
***
*** 1178,1183  void
--- 1195,1203 
  CreateExtension(CreateExtensionStmt *stmt)
  {
  	DefElem*d_schema = NULL;
+ 	DefElem*d_script = NULL;
+ 	DefElem*d_requires = NULL;
+ 	DefElem*d_relocatable = NULL;
  	DefElem*d_new_version = NULL;
  	DefElem*d_old_version = NULL;
  	char	   *schemaName;
***
*** 1229,1248  CreateExtension(CreateExtensionStmt *stmt)
   errmsg("nested CREATE EXTENSION is not supported")));
  
  	/*
! 	 * Read the primary control file.  Note we assume that it does not contain
! 	 * any non-ASCII data, so there is no need to worry about encoding at this
! 	 * point.
! 	 */
! 	pcontrol = 

Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Bruce Momjian
On Sat, Nov 10, 2012 at 05:59:54PM -0500, Bruce Momjian wrote:
 On Sat, Nov 10, 2012 at 02:45:54PM -0800, Jeff Janes wrote:
  On Sat, Nov 10, 2012 at 9:15 AM, Bruce Momjian br...@momjian.us wrote:
   On Fri, Nov  9, 2012 at 04:23:40PM -0800, Jeff Janes wrote:
   On Fri, Nov 9, 2012 at 3:06 PM, Bruce Momjian br...@momjian.us wrote:
   
Again, using SERIAL?
  
   Yep.
  
   Odd why yours is so much after.
  
  You didn't build git head under --enable-cassert, did you?
 
 Yikes, you got me!  I have not done performance testing in so long, I
 had forgotten I changed my defaults.  New numbers to follow.  Sorry.
 
  Any chance you can do a oprofile or gprof of head's pg_dump dumping
  out of head's server?  That really should be a lot faster (since
  commit eeb6f37d89fc60c6449ca12ef9e) than dumping out of 9.2 server.
  If it is not for you, I don't see how to figure it out without a
  profile of the slow system.
 
 Yes, coming.

OK, here are my results.  Again, apologies for posting non-linear
results based on assert builds:

 --  9.2 --   9.3 
 -- normal -- -- bin-up --   -- normal -- -- bin-up --
 dump   rest   dump   rest   dump   rest   dump   rest  pg_upgrade
  1  0.12   0.06   0.12   0.06   0.11   0.07   0.11   0.07  11.11
   1000  7.22   2.40   4.74   2.78   2.20   2.43   4.04   2.86  19.60
   2000  5.67   5.10   8.82   5.57   4.50   4.97   8.07   5.69  30.55
   4000 13.34  11.13  25.16  12.52   8.95  11.24  16.75  12.16  60.70
   8000 29.12  25.98  59.60  28.08  16.68  24.02  30.63  27.08 123.05
  16000 87.36  53.16 189.38  62.72  31.38  55.37  61.55  62.66 365.71


You can see the non-linear dump at 16k in 9.2, and the almost-linear in
9.3.  :-)

pg_upgrade shows non-linear, but that is probably because of the
non-linear behavior of 9.2, and because of the two non-linear loops that
Ants found, that I will address in a separate email.

Thanks for the feedback.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-12 Thread Robert Haas
On Fri, Nov 9, 2012 at 3:31 PM, Simon Riggs si...@2ndquadrant.com wrote:
 So what we're talking about here is a new mode for COPY, that when
 requested will pre-freeze tuples when loading into a newly
 created/truncated table. If the table isn't newly created/truncated
 then we'll just ignore it and continue. I see no need to throw an
 error, since that will just cause annoying usability issues.

Actually, why not just have it work always?  If people want to load
frozen tuples into a table that's not newly created/truncated, why not
let them?  Sure, there could be MVCC violations, but as long as the
behavior is opt-in, who cares?  I think it'd be useful to a lot of
people.

If we want to reduce (not eliminate) the potential MVCC issues, which
I think would be a good idea, we could take AccessExclusiveLock on the
table when COPY (FREEZE) is used.  Someone using an old snapshot but
accessing the table for the first time after AEL is released could
still see MVCC anomalies, but at least it would rule out things
changing in mid-query, which is the case that I think would be most
problematic.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Bruce Momjian
On Fri, Nov  9, 2012 at 12:50:34AM -0500, Tom Lane wrote:
 Jeff Janes jeff.ja...@gmail.com writes:
  Are sure the server you are dumping out of is head?
 
 I experimented a bit with dumping/restoring 16000 tables matching
 Bruce's test case (ie, one serial column apiece).  The pg_dump profile
 seems fairly flat, without any easy optimization targets.  But
 restoring the dump script shows a rather interesting backend profile:
 
samples  %        image name   symbol name
30861    39.6289  postgres AtEOXact_RelationCache
 9911 12.7268  postgres hash_seq_search
 2682  3.4440  postgres init_sequence
 2218  2.8482  postgres _bt_compare
 2120  2.7223  postgres hash_search_with_hash_value
 1976  2.5374  postgres XLogInsert
 1429  1.8350  postgres CatalogCacheIdInvalidate
 1282  1.6462  postgres LWLockAcquire
 973   1.2494  postgres LWLockRelease
 702   0.9014  postgres hash_any
 
 The hash_seq_search time is probably mostly associated with
 AtEOXact_RelationCache, which is run during transaction commit and scans
 the relcache hashtable looking for tables created in the current
 transaction.  So that's about 50% of the runtime going into that one
 activity.

Thanks for finding this.  What is odd is that I am not seeing non-linear
restores at 16k in git head, so I am confused how something that
consumes ~50% of backend time could still perform linearly.  Would this
consume 50% at lower table counts?

I agree we should do something, even if this is a rare case, because 50%
is a large percentage.

 There are at least three ways we could whack that mole:
 
 * Run the psql script in --single-transaction mode, as I was mumbling
 about the other day.  If we were doing AtEOXact_RelationCache only once,
 rather than once per CREATE TABLE statement, it wouldn't be a problem.
 Easy but has only a narrow scope of applicability.
 
 * Keep a separate list (or data structure of your choice) so that
 relcache entries created in the current xact could be found directly
 rather than having to scan the whole relcache.  That'd add complexity
 though, and could perhaps be a net loss for cases where the relcache
 isn't so bloated.

I like this one.  Could we do it only when the cache gets to be above a
certain size, to avoid any penalty?
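
Something like this, perhaps (sketch, invented names): remember the entries
created in the current transaction in a small fixed array, and only fall
back to the full hash_seq_search() when that array overflows:

    #define MAX_EOXACT_LIST 32

    static Oid  eoxact_list[MAX_EOXACT_LIST];
    static int  eoxact_list_len = 0;
    static bool eoxact_list_overflowed = false;

    static void
    remember_created_relcache_entry(Oid relid)
    {
        if (eoxact_list_len < MAX_EOXACT_LIST)
            eoxact_list[eoxact_list_len++] = relid;
        else
            eoxact_list_overflowed = true;
    }

    /*
     * At transaction end: if the list never overflowed, look up just those
     * OIDs in the relcache hash; otherwise do today's full scan.
     */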

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages

2012-11-12 Thread Robert Haas
On Tue, Jul 31, 2012 at 8:09 AM, Amit kapila amit.kap...@huawei.com wrote:
 Based on the discussion and suggestions in this mail chain, following
 features can be implemented:

 1. To compute the value of max LSN in data pages based on user input
 whether he wants it for an individual

   file,  a particular directory or whole database.

 2a. To search the available WAL files for the latest checkpoint record
 and prints the value.
 2b. To search the available WAL files for the latest checkpoint record
 and recreates a pg_control file pointing at

 that checkpoint.

 I have kept both options to address different kind of corruption
 scenarios.

 I think I can see all of those things being potentially useful.  There
 are a couple of pending patches that will revise the WAL format
 slightly; not sure how much those are likely to interfere with any
 development you might do on (2) in the meantime.

 Based on above conclusion, I have prepared a patch which implements Option-1

I wonder if we shouldn't make this a separate utility, rather than
something that is part of pg_resetxlog.  Anyone have a thought on that
topic?
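
(For context, the guts of option 1 are just a scan of every data-page
header; roughly, ignoring endianness and error handling, and assuming the
standard 8K block size and that pd_lsn is the first field of the page
header:)

    #define BLCKSZ 8192

    static uint64
    max_lsn_of_segment(FILE *fp)
    {
        unsigned char   page[BLCKSZ];
        uint64          max_lsn = 0;

        while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
        {
            uint32      xlogid = *(uint32 *) (page + 0);
            uint32      xrecoff = *(uint32 *) (page + 4);
            uint64      lsn = ((uint64) xlogid << 32) | xrecoff;

            if (lsn > max_lsn)
                max_lsn = lsn;
        }
        return max_lsn;
    }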

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-12 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Nov 9, 2012 at 3:31 PM, Simon Riggs si...@2ndquadrant.com wrote:
 So what we're talking about here is a new mode for COPY, that when
 requested will pre-freeze tuples when loading into a newly
 created/truncated table. If the table isn't newly created/truncated
 then we'll just ignore it and continue. I see no need to throw an
 error, since that will just cause annoying usability issues.

 Actually, why not just have it work always?  If people want to load
 frozen tuples into a table that's not newly created/truncated, why not
 let them?  Sure, there could be MVCC violations, but as long as the
 behavior is opt-in, who cares?  I think it'd be useful to a lot of
 people.

I thought about that too, but there's a big problem.  It wouldn't be
just MVCC that would be broken, but transactional integrity: if the
COPY fails partway through, the already-loaded rows still look valid.
The new-file requirement provides a way to roll them back.

I'm willing to have an option that compromises MVCC semantics
transiently, but giving up transactional integrity seems a bit much.

regards, tom lane




Re: [HACKERS] Fwd: question on foreign key lock

2012-11-12 Thread Robert Haas
On Thu, Nov 8, 2012 at 3:45 AM, Filip Rembiałkowski
filip.rembialkow...@gmail.com wrote:
 maybe this is a better group for this question?

 I can't see why creating foreign key on table A referencing table B,
 generates an AccessExclusiveLock on B.
 It seems (to a layman :-) ) that only writes to B should be blocked.

 I'm really interested if this is either expected effect or any open TODO
 item or suboptimal behavior of postgres.

This comment explains it:

/*
 * Grab an exclusive lock on the pk table, so that someone doesn't delete
 * rows out from under us. (Although a lesser lock would do for that
 * purpose, we'll need exclusive lock anyway to add triggers to the pk
 * table; trying to start with a lesser lock will just create a risk of
 * deadlock.)
 */
pkrel = heap_openrv(fkconstraint->pktable, AccessExclusiveLock);

Concurrent DDL is something that's been discussed in detail on this
list in the past; unfortunately, there are some tricky race conditions
around the shared invalidation queue and SnapshotNow that make it hard to
implement properly.  I'm hoping to have some time to work on this at
some point, but it hasn't happened yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-12 Thread Robert Haas
On Mon, Nov 12, 2012 at 11:20 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Fri, Nov 9, 2012 at 3:31 PM, Simon Riggs si...@2ndquadrant.com wrote:
 So what we're talking about here is a new mode for COPY, that when
 requested will pre-freeze tuples when loading into a newly
 created/truncated table. If the table isn't newly created/truncated
 then we'll just ignore it and continue. I see no need to throw an
 error, since that will just cause annoying usability issues.

 Actually, why not just have it work always?  If people want to load
 frozen tuples into a table that's not newly created/truncated, why not
 let them?  Sure, there could be MVCC violations, but as long as the
 behavior is opt-in, who cares?  I think it'd be useful to a lot of
 people.

 I thought about that too, but there's a big problem.  It wouldn't be
 just MVCC that would be broken, but transactional integrity: if the
 COPY fails partway through, the already-loaded rows still look valid.
 The new-file requirement provides a way to roll them back.

 I'm willing to have an option that compromises MVCC semantics
 transiently, but giving up transactional integrity seems a bit much.

Hmm, good point.  There might be some way around that, but figuring it
out is probably material for a separate patch.

But I guess that raises the question - should COPY (FREEZE) silently
ignore the option for not-new relfilenodes, or should it error out?
Simon proposed the former, but I'm wondering if the latter would be
better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 On Fri, Nov  9, 2012 at 12:50:34AM -0500, Tom Lane wrote:
 The hash_seq_search time is probably mostly associated with
 AtEOXact_RelationCache, which is run during transaction commit and scans
 the relcache hashtable looking for tables created in the current
 transaction.  So that's about 50% of the runtime going into that one
 activity.

 Thanks for finding this.  What is odd is that I am not seeing non-linear
 restores at 16k in git head, so I am confused how something that
 consumes ~50% of backend time could still perform linearly.  Would this
 consume 50% at lower table counts?

No, the cost from that is O(N^2), though with a pretty small multiplier.
16K tables is evidently where the cost reaches the point of being
significant --- if you went up from there, you'd probably start to
notice an overall O(N^2) behavior.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Doc patch: Document names of automatically created constraints and indexes

2012-11-12 Thread Karl O. Pinc
Hi,

Attached is a patch (against head) which documents the
name of the constraint and index created when using
PRIMARY KEY with CREATE TABLE, and the name of the
index created when using UNIQUE.  I haven't read all
the docs recently but I don't believe this is presently
documented.

It's unclear to me that this is the right approach
but perhaps this will start a discussion that finds
the right approach.

The big problem I see is that these are somewhat of
an implementation internal while at the same
time being something that the user might have to
concern themselves with.  

First, the constraint and index
names are in the namespace used by the user so there
is potential for collision with user-defined
constraints and indexes.

Second, the only way (I know of) to remove primary-key-ness
is to drop the primary key constraint, by name.

This led me right into another thought:
It would be nice to have ALTER TABLE be able to
drop the primary key constraint.  (Then the
user would not need to know the name of the
constraint related to primary-key-ness.)
However, it is probably more useful to be able
to drop the constraint (and attendant foreign
key meta-information) separately from the unique
index associated with the primary key, if for no
other reason than index re-creation can be expensive
and missing indexes make bad things happen.
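
For illustration, the default names as they come out today look like this
(subject to the truncation and collision corner cases discussed later in
the thread; table and column names here are hypothetical):

    CREATE TABLE items (id int PRIMARY KEY, sku text UNIQUE);

    \d items
    -- Indexes:
    --     "items_pkey" PRIMARY KEY, btree (id)
    --     "items_sku_key" UNIQUE, btree (sku)

    ALTER TABLE items DROP CONSTRAINT items_pkey;  -- drop primary-key-ness by name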

This patch is the improvement I could come up
with.  Someone else can decide to commit or reject,
I don't believe I can contribute much more on this at this
time.

Regards,

Karl k...@meme.com
Free Software:  You don't pay back, you pay forward.
 -- Robert A. Heinlein

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 445ca40..da87574 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -486,6 +486,14 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI
   primary key constraint defined for the table.  (Otherwise it
   would just be the same constraint listed twice.)
  /para
+
+ <para>
+  The name of the index created when <literal>UNIQUE</literal> is
+  used is <literal><replaceable
+  class="PARAMETER">tablename</replaceable>_<replaceable
+  class="PARAMETER">colname</replaceable>_key</literal>.
+ </para>
+
 </listitem>
</varlistentry>
 
@@ -514,6 +522,13 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI
   different from other sets of columns named by any unique
   constraint defined for the same table.
  /para
+
+ <para>
+   The name of the primary key constraint, and the name of the
+   index which enforces uniqueness, is <literal><replaceable
+   class="PARAMETER">tablename</replaceable>_pkey</literal>.
+ </para>
+
 </listitem>
</varlistentry>
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages

2012-11-12 Thread Alvaro Herrera
Robert Haas escribió:
 On Tue, Jul 31, 2012 at 8:09 AM, Amit kapila amit.kap...@huawei.com wrote:

  I think I can see all of those things being potentially useful.  There
  are a couple of pending patches that will revise the WAL format
  slightly; not sure how much those are likely to interfere with any
  development you might do on (2) in the meantime.
 
  Based on above conclusion, I have prepared a patch which implements Option-1
 
 I wonder if we shouldn't make this a separate utility, rather than
 something that is part of pg_resetxlog.  Anyone have a thought on that
 topic?

That thought did cross my mind too.

-- 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix errcontext() function

2012-11-12 Thread Heikki Linnakangas

On 12.11.2012 17:52, Alvaro Herrera wrote:

Hopefully you noticed that contrib is broken.


Oops.. fixed.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Jeff Davis
On Mon, 2012-11-12 at 09:44 +0100, Markus Wanner wrote:
 Can we simply write a progress indicator to pg_control or someplace
 saying that all pages up to X of relation Y are supposed to have valid
 checksums?

pg_control would not be the right place for that structure. It's
intended to be fixed-size (it's just a serialized C structure) and it
should be smaller than a sector so that it doesn't suffer from torn
pages.

Not a bad approach overall, but requires some kind of new structure. And
that increases the risk that it doesn't make 9.3.

Right now, I'm honestly just trying to get the simplest approach that
doesn't restrict these kinds of ideas if we want to do them later.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doc patch: Document names of automatically created constraints and indexes

2012-11-12 Thread Tom Lane
Karl O. Pinc k...@meme.com writes:
 Attached is a patch (against head) which documents the
 name of the constraint and index created when using
 PRIMARY KEY with CREATE TABLE, and the name of the
 index created when using UNIQUE.  I haven't read all
 the docs recently but I don't believe this is presently
 documented.

This is not actually correct: it ignores the corner cases where the name
would be overlength or conflict with something else.

Personally I don't think this should be documented, as it's an
implementation detail that we've changed in the past and may change
again.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Robert Haas
On Fri, Nov 9, 2012 at 12:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jeff Janes jeff.ja...@gmail.com writes:
 Are sure the server you are dumping out of is head?

 I experimented a bit with dumping/restoring 16000 tables matching
 Bruce's test case (ie, one serial column apiece).  The pg_dump profile
 seems fairly flat, without any easy optimization targets.  But
 restoring the dump script shows a rather interesting backend profile:

 samples  %        image name   symbol name
 30861    39.6289  postgres AtEOXact_RelationCache
 9911 12.7268  postgres hash_seq_search
 2682  3.4440  postgres init_sequence
 2218  2.8482  postgres _bt_compare
 2120  2.7223  postgres hash_search_with_hash_value
 1976  2.5374  postgres XLogInsert
 1429  1.8350  postgres CatalogCacheIdInvalidate
 1282  1.6462  postgres LWLockAcquire
 973   1.2494  postgres LWLockRelease
 702   0.9014  postgres hash_any

 The hash_seq_search time is probably mostly associated with
 AtEOXact_RelationCache, which is run during transaction commit and scans
 the relcache hashtable looking for tables created in the current
 transaction.  So that's about 50% of the runtime going into that one
 activity.

 There are at least three ways we could whack that mole:

 * Run the psql script in --single-transaction mode, as I was mumbling
 about the other day.  If we were doing AtEOXact_RelationCache only once,
 rather than once per CREATE TABLE statement, it wouldn't be a problem.
 Easy but has only a narrow scope of applicability.

 * Keep a separate list (or data structure of your choice) so that
 relcache entries created in the current xact could be found directly
 rather than having to scan the whole relcache.  That'd add complexity
 though, and could perhaps be a net loss for cases where the relcache
 isn't so bloated.

 * Limit the size of the relcache (eg by aging out
 not-recently-referenced entries) so that we aren't incurring O(N^2)
 costs for scripts touching N tables.  Again, this adds complexity and
 could be counterproductive in some scenarios.

Although there may be some workloads that access very large numbers of
tables repeatedly, I bet that's not typical.  Rather, I bet that a
session which accesses 10,000 tables is most likely to access them
just once each - and right now we don't handle that case very well;
this is not the first complaint about big relcaches causing problems.
On the flip side, we don't want workloads that exceed some baked-in
cache size to fall off a cliff.  So I think we should be looking for a
solution that doesn't put a hard limit on the size of the relcache,
but does provide at least some latitude to get rid of old entries.

So maybe something like this.  Add a flag to each relcache entry
indicating whether or not it has been used.  After adding 1024 entries
to the relcache, scan all the entries: clear the flag if it's set,
flush the entry if it's already clear.  This allows the size of the
relcache to grow without bound, but only if we're continuing to access
the old tables in between adding new ones to the mix.  As an
additional safeguard, we could count the number of toplevel SQL
commands that have been executed and require that a flush not be
performed more often than, say, every 64 toplevel SQL commands.  That
way, if a single operation on an inheritance parent with many children
sucks a lot of stuff into the relcache, we'll avoid cleaning it out
too quickly.

Maybe this is all too ad-hoc, but I feel like we don't need to
overengineer this.  The existing system is fine in 99% of the cases,
so we really only need to find a way to detect the really egregious
case where we are doing a neverending series of one-time table
accesses and apply a very light tap to avoid the pain point in that
case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Bruce Momjian
On Sat, Nov 10, 2012 at 12:41:44PM -0500, Bruce Momjian wrote:
 On Sat, Nov 10, 2012 at 07:17:34PM +0200, Ants Aasma wrote:
  On Sat, Nov 10, 2012 at 7:10 PM, Bruce Momjian br...@momjian.us wrote:
   I am confused why you see a loop.  transfer_all_new_dbs() does a
   merge-join of old/new database names, then calls gen_db_file_maps(),
   which loops over the relations and calls create_rel_filename_map(),
   which adds to the map via array indexing.   I don't see any file loops
   in there --- can you be more specific?
  
  Sorry, I was too tired when posting that. I actually meant
  transfer_single_new_db(). More specifically the profile clearly showed
  that most of the time was spent in the two loops starting on lines 193
  and 228.
 
 Wow, you are right on target.  I was so focused on making logical
 lookups linear that I did not consider file system vm/fsm and file
 extension lookups.  Let me think a little and I will report back. 
 Thanks.

OK, I have had some time to think about this.  What the current code
does is, for each database, get a directory listing to know about any
vm, fsm, and 1gig extents that exist in the directory.  It caches the
directory listing and does full array scans looking for matches.  If the
tablespace changes, it creates a new directory cache and throws away the
old one.  This code certainly needs improvement!

I can think of two solutions.  The first would be to scan the database
directory, and any tablespaces used by the database, sort it, then allow
binary search of the directory listing looking for file prefixes that
match the current relation.

The second approach would be to simply try to copy the fsm, vm, and
extent files, and ignore any ENOEXIST errors.  This allows code
simplification.  The downside is that it doesn't pull all files with
matching prefixes --- it requires pg_upgrade to _know_ what suffixes
might exist in that directory.  Second, it assumes there can be no
number gaps in the file extent numbering (is that safe?).

I need recommendations on which direction to persue;  this would only be
for 9.3.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doc patch: Document names of automatically created constraints and indexes

2012-11-12 Thread Karl O. Pinc
On 11/12/2012 10:40:00 AM, Tom Lane wrote:
 Karl O. Pinc k...@meme.com writes:
  Attached is a patch (against head) which documents the
  name of the constraint and index created when using
  PRIMARY KEY with CREATE TABLE, and the name of the
  index created when using UNIQUE. 

 Personally I don't think this should be documented, as it's an
 implementation detail that we've changed in the past and may change
 again.

Ok.  

Could ALTER TABLE use an option to drop the
primary key constraint?  I needed to do that,
found it was not obvious, and this lead me to 
try to improve things.


Karl k...@meme.com
Free Software:  You don't pay back, you pay forward.
 -- Robert A. Heinlein



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Suggestion for --truncate-tables to pg_restore

2012-11-12 Thread Karl O. Pinc
Hi,

Attached is version 4.  Version 3 no longer
built against head.

On 10/16/2012 09:48:06 PM, Karl O. Pinc wrote:
 
 On 09/23/2012 08:52:07 PM, Karl O. Pinc wrote:
 
  On 09/23/2012 12:24:27 AM, Karl O. Pinc wrote:
 
   On 09/23/2012 12:19:07 AM, Karl O. Pinc wrote:
On 09/21/2012 10:54:05 AM, Karl O. Pinc wrote:
 On 09/20/2012 12:24:49 PM, Karl O. Pinc wrote:
 
  I've had problems using pg_restore --data-only when
  restoring individual schemas (which contain data which
  has had bad things done to it).  --clean does not work
  well because of dependent objects in other schemas.

Since there wasn't much more to do I've gone ahead
and written the patch.  Works for me.

Against git master.
Passes regression tests, but there's no regression
tests for pg_restore so this does not say much.
Since there's no regression tests I've not written one.

Since this is a real patch for application I've given
it a new name (it's not a v2).

Truncate done right before COPY, since that's what
the parallel restores do.
 
 
 Karl k...@meme.com
 Free Software:  You don't pay back, you pay forward.
  -- Robert A. Heinlein
 
 

--quoted attachment--
 
 -- 
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers
 




Karl k...@meme.com
Free Software:  You don't pay back, you pay forward.
 -- Robert A. Heinlein

diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
index 1b9db6b..23b0d16 100644
--- a/doc/src/sgml/ref/pg_restore.sgml
+++ b/doc/src/sgml/ref/pg_restore.sgml
@@ -546,6 +546,26 @@
  </varlistentry>

  <varlistentry>
+  <term><option>--truncate-tables</option></term>
+  <listitem>
+   <para>
+    This option is only relevant when performing a data-only
+    restore.  It instructs <application>pg_restore</application>
+    to execute commands to truncate the target tables while the
+    data is reloaded.  Use this when restoring tables or schemas
+    and <option>--clean</option> cannot be used because dependent
+    objects would be destroyed.
+   </para>
+
+   <para>
+    The <option>--disable-triggers</option> option will almost
+    always need to be used in conjunction with this option to
+    disable check constraints on foreign keys.
+   </para>
+  </listitem>
+ </varlistentry>
+
+ <varlistentry>
   <term><option>--use-set-session-authorization</option></term>
   <listitem>
    <para>
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 3b49395..0aaf1d3 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -101,6 +101,8 @@ typedef struct _restoreOptions
 	int			noTablespace;	/* Don't issue tablespace-related commands */
 	int			disable_triggers;		/* disable triggers during data-only
 		 * restore */
+	int			truncate_tables;		/* truncate tables during data-only
+		 * restore */
 	int			use_setsessauth;/* Use SET SESSION AUTHORIZATION commands
  * instead of OWNER TO */
 	int			no_security_labels;		/* Skip security label entries */
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 1fead28..c7fdc79 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -309,6 +309,11 @@ RestoreArchive(Archive *AHX)
 	if (ropt->createDB && ropt->single_txn)
 		exit_horribly(modulename, "-C and -1 are incompatible options\n");
 
+	/* When the schema is dropped and re-created then no point
+	 * truncating tables. */
+	if (ropt->dropSchema && ropt->truncate_tables)
+		exit_horribly(modulename, "-c and --truncate-tables are incompatible options\n");
+
 	/*
 	 * If we're going to do parallel restore, there are some restrictions.
 	 */
@@ -403,6 +408,10 @@ RestoreArchive(Archive *AHX)
 		}
 	}
 
+	/* Truncate tables only when restoring data. */
+	if (!ropt->dataOnly && ropt->truncate_tables)
+		exit_horribly(modulename, "--truncate-tables requires the --data-only option\n");
+
 	/*
 	 * Setup the output file if necessary.
 	 */
@@ -562,6 +571,7 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te,
 	int			retval = 0;
 	teReqs		reqs;
 	bool		defnDumped;
+	bool		truncate;
 
 	AH->currentTE = te;
 
@@ -696,14 +706,21 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te,
 		 * server, so no need to see if we should issue BEGIN.
 		 */
 		StartTransaction(AH);
+		truncate = 1;
+	} else
+		/* Truncate the table when asked to. */
+		truncate = ropt->truncate_tables;
 
+	if (truncate) {
 		/*
 		 * If the server version is >= 8.4, make sure we issue
 		 * TRUNCATE with ONLY so that child tables are not
-		 * wiped.
+		 * wiped.  If we don't know the server version
+		 * then err on the side of safety.
 		 */
 		ahprintf(AH, "TRUNCATE TABLE %s%s;\n\n",
- 

Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Jeff Davis
On Sun, 2012-11-11 at 23:55 -0500, Greg Smith wrote:
 Adding an initdb option to start out with everything checksummed seems 
 an uncontroversial good first thing to have available.

OK, so here's my proposal for a first patch (changes from Simon's
patch):

  * Add a flag to the postgres executable indicating that it should use
checksums on everything. This would only be valid if bootstrap mode is
also specified.
  * Add a multi-state checksums flag in pg_control, that would have
three states: OFF, ENABLING, and ON. It would only be set to ON during
bootstrap, and in this first patch, it would not be possible to set
ENABLING.
  * Remove GUC and use this checksums flag everywhere.
  * Use the TLI field rather than the version field of the page header.
  * Incorporate page number into checksum calculation (already done).
  
Does this satisfy the requirements for a first step? Does it interfere
with potential future work?

 Won't a pg_checksums program just grow until it looks like a limited 
 version of vacuum though?

We can dig into the details of that later, but I don't think it's
useless, even if we do have per-table (or better) checksums. For
instance, it would be useful to verify backups offline.

I think it's a legitimate concern that we might reinvent some VACUUM
machinery. Ideally, we'd get better online migration tools for checksums
(perhaps using VACUUM) fast enough that nobody will bother introducing
that kind of bloat into pg_checksums.

 I think it's useful to step back for a minute and consider the larger 
 uncertainty an existing relation has, which amplifies just how ugly this 
 situation is.  The best guarantee I think online checksumming can offer 
 is to tell the user after transaction id X, all new data in relation R 
 is known to be checksummed.

It's slightly better than that. It's more like: we can tell you if any
of your data gets corrupted after transaction X. If old data is
corrupted before transaction X, then there's nothing we can do. But if
it's corrupted after transaction X (even if it's old data), the
checksums should catch it.

 Unless you do this at initdb time, any 
 conversion case is going to have the possibility that a page is 
 corrupted before you get to it--whether you're adding the checksum as 
 part of a let's add them while we're writing anyway page update or the 
 conversion tool is hitting it.

Good point.

 That's why I don't think anyone will find online conversion really 
 useful until they've done a full sweep updating the old pages.

I don't entirely agree. A lot of times, you just want to know whether
your disk is changing your data out from under you. Maybe you miss some
cases and maybe not all of your data is protected, but just knowing
which disks need to be replaced, and which RAID controllers not to buy
again, is quite valuable. And the more data you get checksummed the
faster you'll find out.

 One of the really common cases I was expecting here is that conversions 
 are done by kicking off a slow background VACUUM CHECKSUM job that might 
 run in pieces.

Right now I'm focused on the initial patch and other fairly immediate
goals, so I won't address this now. But I don't want to cut off the
conversation, either.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages

2012-11-12 Thread Alvaro Herrera
Noah Misch wrote:
 On Sun, Nov 11, 2012 at 02:18:11AM -0300, Alvaro Herrera wrote:

  I will
  move all the open patches to CF3, unless someone beats me to it.  I
  probably won't be able to get anything done tomorrow, so if somebody has
  a boring Sunday I would appreciate the help.
 
 Likewise.

Many thanks.  I have closed the CF (just 3 days before the next one
starts, which is somewhat depressing).

-- 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pg_upgrade speed for many tables

2012-11-12 Thread Jeff Janes
On Mon, Nov 5, 2012 at 12:08 PM, Bruce Momjian br...@momjian.us wrote:
 Magnus reported that a customer with a million tables was finding
 pg_upgrade slow.  I had never considered many table to be a problem, but
 decided to test it.  I created a database with 2k tables like this:

 CREATE TABLE test1990 (x SERIAL);

 Running the git version of pg_upgrade on that took 203 seconds.  Using
 synchronous_commit=off dropped the time to 78 seconds.  This was tested
 on magnetic disks with a write-through cache.  (No change on an SSD with
 a super-capacitor.)

 I don't see anything unsafe about having pg_upgrade use
 synchronous_commit=off.  I could set it just for the pg_dump reload, but
 it seems safe to just use it always.  We don't write to the old cluster,
 and if pg_upgrade fails, you have to re-initdb the new cluster anyway.

 Patch attached.  I think it should be applied to 9.2 as well.
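
(For reference, a schema like the quoted test case can be generated with a
DO block along these lines; the loop bound and table-name pattern are just
illustrative:

    DO $$
    BEGIN
        FOR i IN 1..2000 LOOP
            EXECUTE format('CREATE TABLE test%s (x SERIAL)', i);
        END LOOP;
    END
    $$;
)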

Is turning off synchronous_commit enough?  What about turning off fsync?

When I'm doing a pg_upgrade with thousands of tables, the shutdown
checkpoint after restoring the dump to the new cluster takes a very
long time, as the writer drains its operation table by opening and
individually fsync-ing thousands of files.  This takes about 40 ms per
file, which I assume is a combination of slow lap-top disk drive, and
a strange deal with ext4 which makes fsyncing a recently created file
very slow.   But even with faster hdd, this would still be a problem
if it works the same way, with every file needing 4 rotations to be
fsynced and this happens in serial.

Worse, the shutdown only waits for the default of 60 seconds for the
shutdown to take place before it throws an error and the entire
pg_upgrade gives up.  It seems to me that either the -t setting should
be increased, or should be an option to pg_upgrade.

My work around was to invoke a system-wide sync a couple seconds
after the 'pg_ctl stop' is initiated.  Flushing the files wholesale
seems to work to make the checkpoint writer rapidly find it has
nothing to do when it tries to flush them retail.

Anyway, the reason I think turning fsync off might be reasonable is
that as soon as the new cluster is shut down, pg_upgrade starts
overwriting most of those just-fsynced file with other files from the
old cluster, and AFAICT makes no effort to fsync them.  So until there
is a system-wide sync after the pg_upgrade finishes, your new cluster
is already in mortal danger anyway.

Cheers,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pg_upgrade speed for many tables

2012-11-12 Thread Jeff Davis
On Mon, 2012-11-12 at 10:29 -0800, Jeff Janes wrote:
 When I'm doing a pg_upgrade with thousands of tables, the shutdown
 checkpoint after restoring the dump to the new cluster takes a very
 long time, as the writer drains its operation table by opening and
 individually fsync-ing thousands of files.

This reminds me of the fix I did for initdb to sync the files. I think
we do need to make sure they are sync'd, because ext4 can keep buffers
around for quite a long time without cleaning them.

I ended up using sync_file_range(..., SYNC_FILE_RANGE_WRITE) on linux,
and posix_fadvise(..., POSIX_FADV_DONTNEED) on everything else, and that
made subsequent fsyncs more efficient.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol

2012-11-12 Thread Simon Riggs
On 10 September 2012 17:50, Tom Lane t...@sss.pgh.pa.us wrote:

 The point of the proposal that I am making is to have a simple,
 low-maintenance solution for people who need a single-application
 database.  A compromise somewhere in the middle isn't likely to be an
 improvement for anybody.  For instance, if you want to have additional
 connections, you open up a whole collection of communication and
 authentication issues, which potential users of a single-application
 database don't want to cope with.

So the proposal is to implement a database that can't ever have 2 or
more connections. And so any data it stores cannot ever be accessed by
another connection, so all forms of replication are excluded and all
maintenance actions force the database to be unavailable for a period
of time. Those two things are barriers of the most major kind to
anybody working in an enterprise with connected data and devices. The
only people that want this are people with a very short term view of
the purpose of their applications, and disregard for the value and
permanence of the data stored. They may not want to cope with those
issues *now* but they will later and won't thank us for implementing
it in a way that means it can never be achieved.

To be honest, I can't believe I'm reading this.

And worse, it's on our Don't Want list, and nobody has said stop.

It's almost impossible to purchase a CPU these days that doesn't have
multiple cores, so the whole single-process architecture is just dead.
Yes, we want Postgres installed everywhere, but this isn't the way to
achieve that.

I agree we should allow a PostgreSQL installation to work for a single
user, but I don't see that requires other changes. This idea will
cause endless bugs, thinkos and severely waste our time. So without a
much better justification, I don't think we should do this.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-12 Thread Simon Riggs
On 12 November 2012 16:22, Robert Haas robertmh...@gmail.com wrote:

 But I guess that raises the question - should COPY (FREEZE) silently
 ignore the option for not-new relfilenodes, or should it error out?
 Simon proposed the former, but I'm wondering if the latter would be
 better.

It's got some complex pre-conditions, so having scripts fail because
you mis-specified FREEZE would be annoying.

The option indicates I accept the potential MVCC violation, not it
will always freeze.

If there is a better name...

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Simon Riggs
On 11 November 2012 23:24, Tom Lane t...@sss.pgh.pa.us wrote:

 Practically all WAL record types that touch multiple pages have some
 bug of this type.  In addition to btree_xlog_split, I found that
 heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
 spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
 locks as required to make their updates safe for concurrent queries.
 (I'm not totally sure about ginRedoDeletePage, but the original action
 definitely locks the pages simultaneously, and it's not clear that it's
 safe not to.)  Most of these are okay in cases without any full-page
 images, but could fail if the wrong subset of the pages-to-be-touched
 were processed by RestoreBkpBlocks.  Some had bugs even without that :-(

Hmm, not good. Thanks for spotting.

Do these changes do anything to actions that occur across multiple
records? I assume not and think those are OK, agreed?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Simon Riggs
On 12 November 2012 14:51, Tom Lane t...@sss.pgh.pa.us wrote:

 The 9.0 code is broken, however. In
 9.0, when a child page is split, the parent and new children are kept
 locked until the downlinks are inserted/updated. If a concurrent scan
 comes along and sees that incomplete state, it will miss tuples on the
 new right siblings. We rely on a rm_cleanup operation at the end of WAL
 replay to fix that situation, if the downlink insertion record is not
 there. I don't see any easy way to fix that, unfortunately. Perhaps we
 could backpatch the 9.1 rewrite, now that it's gotten some real-world
 testing, but it was a big change so I don't feel very comfortable doing
 that.

 Me either.  Given the lack of field complaints, I think we're better
 advised to just leave it unfixed in 9.0.  It'd not be a step forward
 if we broke something trying to make this work.

Agreed. Most people running 9.0 with GIST indexes have already upgraded.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Markus Wanner
Jeff,

On 11/12/2012 06:52 PM, Jeff Davis wrote:
 OK, so here's my proposal for a first patch (changes from Simon's
 patch):
 
   * Add a flag to the postgres executable indicating that it should use
 checksums on everything. This would only be valid if bootstrap mode is
 also specified.
   * Add a multi-state checksums flag in pg_control, that would have
 three states: OFF, ENABLING, and ON. It would only be set to ON during
 bootstrap, and in this first patch, it would not be possible to set
 ENABLING.
   * Remove GUC and use this checksums flag everywhere.
   * Use the TLI field rather than the version field of the page header.
   * Incorporate page number into checksum calculation (already done).
   
 Does this satisfy the requirements for a first step? Does it interfere
 with potential future work?

As described before in this thread, I think we might be able to do
without the "has checksum" bit, as yet another simplification. But I
don't object to adding it, either.

 It's slightly better than that. It's more like: we can tell you if any
 of your data gets corrupted after transaction X. If old data is
 corrupted before transaction X, then there's nothing we can do. But if
 it's corrupted after transaction X (even if it's old data), the
 checksums should catch it.

I (mis?)read that as Greg referring to the intermediate (enabling)
state, where pages with old data may or may not have a checksum, yet. So
I think it was an argument against staying in that state any longer than
necessary.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 11 November 2012 23:24, Tom Lane t...@sss.pgh.pa.us wrote:
 Practically all WAL record types that touch multiple pages have some
 bug of this type.  In addition to btree_xlog_split, I found that
 heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
 spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
 locks as required to make their updates safe for concurrent queries.
 (I'm not totally sure about ginRedoDeletePage, but the original action
 definitely locks the pages simultaneously, and it's not clear that it's
 safe not to.)  Most of these are okay in cases without any full-page
 images, but could fail if the wrong subset of the pages-to-be-touched
 were processed by RestoreBkpBlocks.  Some had bugs even without that :-(

 Hmm, not good. Thanks for spotting.

 Do these changes do anything to actions that occur across multiple
 records? I assume not and think those are OK, agreed?

Right, we were and still are assuming that any state that exists between
WAL records is consistent and safe to expose to hot-standby queries.
The important thing here is that these WAL replay functions were failing
to ensure that their own actions appear atomic to onlookers.  This is
basically a hangover from pre-hot-standby coding conventions, when no such
concern existed.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-12 Thread Tom Lane
Here's an updated patch that fixes the GIST replay functions as well as
the other minor issues that were mentioned.  Barring objections, I'll
set about back-patching this as far as 9.0.

One thing that could use verification is my fix for
gistRedoPageSplitRecord.  AFAICS, the first page listed in the WAL
record is always the original page, and the ones following it are
pages that were split off from it, and can (as yet) only be reached by
following right-links from the original page.  As such, it should be
okay to release locks on the non-first pages as soon as we've written
them.  We have to hold lock on the original page though to avoid letting
readers follow dangling right-links.  Also, the update of
NSN/FOLLOW_RIGHT on the child page (if any) has to be done atomically
with all this, so that has to be done before releasing the original-page
lock as well.  Does that sound right?

regards, tom lane



restore-backup-blocks-individually-3.patch.gz

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Bruce Momjian
On Mon, Nov 12, 2012 at 12:09:08PM -0500, Bruce Momjian wrote:
 OK, I have had some time to think about this.  What the current code
 does is, for each database, get a directory listing to know about any
 vm, fsm, and 1gig extents that exist in the directory.  It caches the
 directory listing and does full array scans looking for matches.  If the
 tablespace changes, it creates a new directory cache and throws away the
 old one.  This code certainly needs improvement!
 
 I can think of two solutions.  The first would be to scan the database
 directory, and any tablespaces used by the database, sort it, then allow
 binary search of the directory listing looking for file prefixes that
 match the current relation.
 
 The second approach would be to simply try to copy the fsm, vm, and
 extent files, and ignore any ENOEXIST errors.  This allows code
 simplification.  The downside is that it doesn't pull all files with
 matching prefixes --- it requires pg_upgrade to _know_ what suffixes
 might exist in that directory.  Second, it assumes there can be no
 number gaps in the file extent numbering (is that safe?).
 
 I need recommendations on which direction to persue;  this would only be
 for 9.3.

I went with the second idea, patch attached.  Here are the times:

 --  9.2 --   9.3 
 -- normal -- -- bin-up --   -- normal -- -- bin-up --  pg_upgrade
 dump   rest   dump   rest   dump   rest   dump   rest   git   patch
  1  0.12   0.06   0.12   0.06   0.11   0.07   0.11   0.07  11.11  11.02
   1000  7.22   2.40   4.74   2.78   2.20   2.43   4.04   2.86  19.60  19.25
   2000  5.67   5.10   8.82   5.57   4.50   4.97   8.07   5.69  30.55  26.67
   4000 13.34  11.13  25.16  12.52   8.95  11.24  16.75  12.16  60.70  52.31
   8000 29.12  25.98  59.60  28.08  16.68  24.02  30.63  27.08 123.05 102.78
  16000 87.36  53.16 189.38  62.72  31.38  55.37  61.55  62.66 365.71 286.00

You can see a significant speedup with those loops removed.  The 16k
case is improved, but still not linear.  The 16k dump/restore scale
looks fine, so it must be something in pg_upgrade, or in the kernel.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/contrib/pg_upgrade/relfilenode.c b/contrib/pg_upgrade/relfilenode.c
new file mode 100644
index 33a867f..0a49a23
*** a/contrib/pg_upgrade/relfilenode.c
--- b/contrib/pg_upgrade/relfilenode.c
***
*** 17,25 
  
  static void transfer_single_new_db(pageCnvCtx *pageConverter,
  	   FileNameMap *maps, int size);
! static void transfer_relfile(pageCnvCtx *pageConverter,
!  const char *fromfile, const char *tofile,
!  const char *nspname, const char *relname);
  
  
  /*
--- 17,24 
  
  static void transfer_single_new_db(pageCnvCtx *pageConverter,
  	   FileNameMap *maps, int size);
! static int transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
! 			 const char *suffix);
  
  
  /*
*** static void
*** 131,185 
  transfer_single_new_db(pageCnvCtx *pageConverter,
  	   FileNameMap *maps, int size)
  {
- 	char		old_dir[MAXPGPATH];
- 	char		file_pattern[MAXPGPATH];
- 	char		**namelist = NULL;
- 	int			numFiles = 0;
  	int			mapnum;
! 	int			fileno;
! 	bool		vm_crashsafe_change = false;
! 
! 	old_dir[0] = '\0';
! 
  	/* Do not copy non-crashsafe vm files for binaries that assume crashsafety */
  	if (old_cluster.controldata.cat_ver < VISIBILITY_MAP_CRASHSAFE_CAT_VER &&
  		new_cluster.controldata.cat_ver >= VISIBILITY_MAP_CRASHSAFE_CAT_VER)
! 		vm_crashsafe_change = true;
  
  	for (mapnum = 0; mapnum < size; mapnum++)
  	{
! 		char		old_file[MAXPGPATH];
! 		char		new_file[MAXPGPATH];
! 
! 		/* Changed tablespaces?  Need a new directory scan? */
! 		if (strcmp(maps[mapnum].old_dir, old_dir) != 0)
! 		{
! 			if (numFiles > 0)
! 			{
! for (fileno = 0; fileno < numFiles; fileno++)
! 	pg_free(namelist[fileno]);
! pg_free(namelist);
! 			}
! 
! 			snprintf(old_dir, sizeof(old_dir), "%s", maps[mapnum].old_dir);
! 			numFiles = load_directory(old_dir, &namelist);
! 		}
! 
! 		/* Copying files might take some time, so give feedback. */
! 
! 		snprintf(old_file, sizeof(old_file), "%s/%u", maps[mapnum].old_dir,
!  maps[mapnum].old_relfilenode);
! 		snprintf(new_file, sizeof(new_file), "%s/%u", maps[mapnum].new_dir,
!  maps[mapnum].new_relfilenode);
! 		pg_log(PG_REPORT, OVERWRITE_MESSAGE, old_file);
! 
! 		/*
! 		 * Copy/link the relation's primary file (segment 0 of main fork)
! 		 * to the new cluster
! 		 */
! 		unlink(new_file);
! 		transfer_relfile(pageConverter, old_file, new_file,
! 		 maps[mapnum].nspname, maps[mapnum].relname);
  
  		/* fsm/vm files added in PG 8.4 */
 		if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
--- 130,148 
  transfer_single_new_db(pageCnvCtx 

Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Bruce Momjian
On Mon, Nov 12, 2012 at 03:59:27PM -0500, Bruce Momjian wrote:
  The second approach would be to simply try to copy the fsm, vm, and
  extent files, and ignore any ENOEXIST errors.  This allows code
  simplification.  The downside is that it doesn't pull all files with
  matching prefixes --- it requires pg_upgrade to _know_ what suffixes
  might exist in that directory.  Second, it assumes there can be no
  number gaps in the file extent numbering (is that safe?).
  
  I need recommendations on which direction to persue;  this would only be
  for 9.3.
 
 I went with the second idea, patch attached.  Here are the times:
 
  --  9.2 --   9.3 
  -- normal -- -- bin-up --   -- normal -- -- bin-up --  pg_upgrade
  dump   rest   dump   rest   dump   rest   dump   rest   git   patch
   1  0.12   0.06   0.12   0.06   0.11   0.07   0.11   0.07  11.11  11.02
    1000  7.22   2.40   4.74   2.78   2.20   2.43   4.04   2.86  19.60  19.25
    2000  5.67   5.10   8.82   5.57   4.50   4.97   8.07   5.69  30.55  26.67
    4000 13.34  11.13  25.16  12.52   8.95  11.24  16.75  12.16  60.70  52.31
    8000 29.12  25.98  59.60  28.08  16.68  24.02  30.63  27.08 123.05 102.78
   16000 87.36  53.16 189.38  62.72  31.38  55.37  61.55  62.66 365.71 286.00
 
 You can see a significant speedup with those loops removed.  The 16k
 case is improved, but still not linear.  The 16k dump/restore scale
 looks fine, so it must be something in pg_upgrade, or in the kernel.

It is possible that the poor 16k pg_upgrade value is caused by the poor
9.2 binary-upgrade number (189.38).  Perhaps I need to hack up
pg_upgrade to allow a 9.3 to 9.3 upgrade to test this.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Alvaro Herrera
Bruce Momjian escribió:

 --- 17,24 
   
   static void transfer_single_new_db(pageCnvCtx *pageConverter,
  FileNameMap *maps, int size);
 ! static int transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
 !  const char *suffix);

Uh, does this code assume that forks other than the main one are not
split in segments?  I think that's a bug, is it not?

-- 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Alvaro Herrera
Bruce Momjian escribió:

 It is possible that the poor 16k pg_upgrade value is caused by the poor
 9.2 binary-upgrade number (189.38).  Perhaps I need to hack up
 pg_upgrade to allow a 9.3 to 9.3 upgrade to test this.

Hmm?  This already works, since make check uses it, right?

-- 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol

2012-11-12 Thread Merlin Moncure
On Mon, Nov 12, 2012 at 1:21 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On 10 September 2012 17:50, Tom Lane t...@sss.pgh.pa.us wrote:

 The point of the proposal that I am making is to have a simple,
 low-maintenance solution for people who need a single-application
 database.  A compromise somewhere in the middle isn't likely to be an
 improvement for anybody.  For instance, if you want to have additional
 connections, you open up a whole collection of communication and
 authentication issues, which potential users of a single-application
 database don't want to cope with.

 So the proposal is to implement a database that can't ever have 2 or
 more connections. And so any data it stores cannot ever be accessed by
 another connection, so all forms of replication are excluded and all
 maintenance actions force the database to be unavailable for a period
 of time. Those two things are barriers of the most major kind to
 anybody working in an enterprise with connected data and devices. The
 only people that want this are people with a very short term view of
 the purpose of their applications, and disregard for the value and
 permanence of the data stored. They may not want to cope with those
 issues *now* but they will later and won't thank us for implementing
 it in a way that means it can never be achieved.

 To be honest, I can't believe I'm reading this.

 And worse, it's on our Don't Want list, and nobody has said stop.

 It's almost impossible to purchase a CPU these days that doesn't have
 multiple cores, so the whole single-process architecture is just dead.
 Yes, we want Postgres installed everywhere, but this isn't the way to
 achieve that.

 I agree we should allow a PostgreSQL installation to work for a single
 user, but I don't see that requires other changes. This idea will
 cause endless bugs, thinkos and severely waste our time. So without a
 much better justification, I don't think we should do this.

I couldn't disagree more.  The patch is small, logical, and fixes an
awful problem, namely that --single mode is basically unusable.  As to
your wider point (namely, that you can't connect to it, therefore it's
bad), it has got to be refuted by numerous competing solutions in the
market such as http://www.firebirdsql.org/manual/fbmetasecur-embedded.html,
and many others.

While it's not as common as it used to be, now and then a piece of
software needs to distribute an application as part of a boxed
product.  Postgres is horrible at this and doesn't have to be; imagine
how much easier the lives of poker tracker would be (for *some* of its
users) with an integrated standalone mode: google 'poker tracker
postgresql' and take a good long look at problems people face in this
scenario.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Bruce Momjian
On Mon, Nov 12, 2012 at 06:14:59PM -0300, Alvaro Herrera wrote:
 Bruce Momjian escribió:
 
  --- 17,24 

static void transfer_single_new_db(pageCnvCtx *pageConverter,
 FileNameMap *maps, int size);
  ! static int transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
  !const char *suffix);
 
 Uh, does this code assume that forks other than the main one are not
 split in segments?  I think that's a bug, is it not?

Oh, yeah, I must have fixed this long ago.  It only fails if you use
tablespaces:

if (os_info.num_tablespaces > 0 &&
    strcmp(old_cluster.tablespace_suffix, new_cluster.tablespace_suffix) == 0)
    pg_log(PG_FATAL,
           "Cannot upgrade to/from the same system catalog version when\n"
           "using tablespaces.\n");

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Bruce Momjian
On Mon, Nov 12, 2012 at 12:09:08PM -0500, Bruce Momjian wrote:
 The second approach would be to simply try to copy the fsm, vm, and
 extent files, and ignore any ENOEXIST errors.  This allows code
 simplification.  The downside is that it doesn't pull all files with
 matching prefixes --- it requires pg_upgrade to _know_ what suffixes
 might exist in that directory.  Second, it assumes there can be no
 number gaps in the file extent numbering (is that safe?).

Seems our code does the same kind of segment number looping I was
suggesting for pg_upgrade, so I think I am safe:

/*
 * Note that because we loop until getting ENOENT, we will correctly
 * remove all inactive segments as well as active ones.
 */
for (segno = 1;; segno++)
{
sprintf(segpath, "%s.%u", path, segno);
if (unlink(segpath) < 0)
{
/* ENOENT is expected after the last segment... */
if (errno != ENOENT)
ereport(WARNING,
(errcode_for_file_access(),
   errmsg("could not remove file \"%s\": %m", segpath)));
break;
}
}

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol

2012-11-12 Thread Simon Riggs
On 12 November 2012 21:26, Merlin Moncure mmonc...@gmail.com wrote:

 I couldn't disagree more.  The patch is small, logical, and fixes an
 awful problem, namely that --single mode is basically unusable.  As to
 your wider point (namely, that you can't connect to it, therefore it's
 bad), it has got to be refuted by numerous competing solutions in the
 market such as http://www.firebirdsql.org/manual/fbmetasecur-embedded.html,
 and many others.

Small is not an argument in favour, just a statement of ease, like
jumping off a cliff. i.e. lemmings.

 While it's not as common as it used to be, now and then a piece of
 software needs to distribute an application as part of a boxed
 product.  Postgres is horrible at this and doesn't have to be; imagine
 how much easier the lives of poker tracker would be (for *some* of its
 users) with an integrated standalone mode: google 'poker tracker
 postgresql' and take a good long look at problems people face in this
 scenario.

I get the installability thang, very very much, I just don't see the
single process thing as the only solution. At very least an open
minded analysis of the actual problem and ways of solving it is called
for, not just reach for a close to hand solution.

I don't ever want to hear someone reject a patch cos it would mess up
poker tracker. The point is it complicates the code, introduces
restrictions into what is possible and is just more inertia onto
development.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Jeff Janes
On Fri, Nov 9, 2012 at 10:50 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jeff Janes jeff.ja...@gmail.com writes:
 Are sure the server you are dumping out of is head?

 I experimented a bit with dumping/restoring 16000 tables matching
 Bruce's test case (ie, one serial column apiece).  The pg_dump profile
 seems fairly flat, without any easy optimization targets.  But
 restoring the dump script shows a rather interesting backend profile:

 samples  %        image name   symbol name
 30861    39.6289  postgres AtEOXact_RelationCache
 9911 12.7268  postgres hash_seq_search
 ...

 There are at least three ways we could whack that mole:

 * Run the psql script in --single-transaction mode, as I was mumbling
 about the other day.  If we were doing AtEOXact_RelationCache only once,
 rather than once per CREATE TABLE statement, it wouldn't be a problem.
 Easy but has only a narrow scope of applicability.

 That is effective when loading into 9.3 (assuming you make
 max_locks_per_transaction large enough).  But when loading into < 9.3,
 using --single-transaction will evoke the quadratic behavior in the
 resource owner/lock table and make things worse rather than better.

Using --single-transaction gets around the AtEOXact_RelationCache
quadratic, but it activates another quadratic behavior, this one in
get_tabstat_entry.   That is a good trade-off because that one has a
lower constant, but it is still going to bite.

Cheers,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] vacuumlo - use a cursor

2012-11-12 Thread Andrew Dunstan
vacuumlo is rather simpleminded about dealing with the list of LOs to be 
removed - it just fetches them as a straight resultset. For one of my 
clients this resulted in an out of memory condition. The attached patch 
tries to remedy that by using a cursor instead. If this is wanted I will 
add it to the next commitfest. The actual changes are very small - most 
of the patch is indentation changes due to the introduction of an extra 
loop.
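
At the SQL level, the patched loop boils down to something like this
(portal name and default batch size are taken from the patch; vacuum_l is
the temp table vacuumlo already builds):

    DECLARE myportal CURSOR WITH HOLD FOR SELECT lo FROM vacuum_l;
    FETCH FORWARD 1000 IN myportal;
    -- repeated until it returns zero rows; each returned OID is then
    -- removed with lo_unlink(), committing every transaction_limit removals

so only one batch of OIDs is held in client memory at a time.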


cheers

andrew
*** a/contrib/vacuumlo/vacuumlo.c
--- b/contrib/vacuumlo/vacuumlo.c
***
*** 290,362  vacuumlo(const char *database, const struct _param * param)
  	PQclear(res);
  
  	buf[0] = '\0';
! 	strcat(buf, "SELECT lo FROM vacuum_l");
! 	res = PQexec(conn, buf);
! 	if (PQresultStatus(res) != PGRES_TUPLES_OK)
! 	{
! 		fprintf(stderr, "Failed to read temp table:\n");
! 		fprintf(stderr, "%s", PQerrorMessage(conn));
! 		PQclear(res);
  		PQfinish(conn);
  		return -1;
! 	}
  
- 	matched = PQntuples(res);
  	deleted = 0;
! 	for (i = 0; i < matched; i++)
  	{
! 		Oid			lo = atooid(PQgetvalue(res, i, 0));
  
! 		if (param->verbose)
  		{
! 			fprintf(stdout, "\rRemoving lo %6u   ", lo);
! 			fflush(stdout);
  		}
  
! 		if (param->dry_run == 0)
  		{
! 			if (lo_unlink(conn, lo) < 0)
  			{
! fprintf(stderr, "\nFailed to remove lo %u: ", lo);
! fprintf(stderr, "%s", PQerrorMessage(conn));
! if (PQtransactionStatus(conn) == PQTRANS_INERROR)
  {
! 	success = false;
! 	break;
  }
  			}
  			else
  deleted++;
! 		}
! 		else
! 			deleted++;
! 		if (param->transaction_limit > 0 &&
! 			(deleted % param->transaction_limit) == 0)
! 		{
! 			res2 = PQexec(conn, "commit");
! 			if (PQresultStatus(res2) != PGRES_COMMAND_OK)
  			{
! fprintf(stderr, "Failed to commit transaction:\n");
! fprintf(stderr, "%s", PQerrorMessage(conn));
  PQclear(res2);
! PQclear(res);
! PQfinish(conn);
! return -1;
! 			}
! 			PQclear(res2);
! 			res2 = PQexec(conn, "begin");
! 			if (PQresultStatus(res2) != PGRES_COMMAND_OK)
! 			{
! fprintf(stderr, "Failed to start transaction:\n");
! fprintf(stderr, "%s", PQerrorMessage(conn));
  PQclear(res2);
- PQclear(res);
- PQfinish(conn);
- return -1;
  			}
- 			PQclear(res2);
  		}
  	}
  	PQclear(res);
  
  	/*
--- 290,389 
  	PQclear(res);
  
  	buf[0] = '\0';
! 	strcat(buf, 
! 		   "DECLARE myportal CURSOR WITH HOLD FOR SELECT lo FROM vacuum_l");
! res = PQexec(conn, buf);
! if (PQresultStatus(res) != PGRES_COMMAND_OK)
! {
! fprintf(stderr, "DECLARE CURSOR failed: %s", PQerrorMessage(conn));
! PQclear(res);
  		PQfinish(conn);
  		return -1;
! }
! PQclear(res);
! 
! 	snprintf(buf, BUFSIZE, "FETCH FORWARD " INT64_FORMAT " IN myportal", 
! 			 param->transaction_limit > 0 ? param->transaction_limit : 1000);
  
  	deleted = 0;
! 
! 	while (1)
  	{
! 		res = PQexec(conn, buf);
! 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
! 		{
! 			fprintf(stderr, "Failed to read temp table:\n");
! 			fprintf(stderr, "%s", PQerrorMessage(conn));
! 			PQclear(res);
! 			PQfinish(conn);
! 			return -1;
! 		}
  
! 		matched = PQntuples(res);
! 
! 		if (matched <= 0)
  		{
! 			/* at end of resultset */
! 			break;
  		}
  
! 		for (i = 0; i < matched; i++)
  		{
! 			Oid			lo = atooid(PQgetvalue(res, i, 0));
! 			
! 			if (param->verbose)
! 			{
! fprintf(stdout, "\rRemoving lo %6u   ", lo);
! fflush(stdout);
! 			}
! 			
! 			if (param->dry_run == 0)
  			{
! if (lo_unlink(conn, lo) < 0)
  {
! 	fprintf(stderr, "\nFailed to remove lo %u: ", lo);
! 	fprintf(stderr, "%s", PQerrorMessage(conn));
! 	if (PQtransactionStatus(conn) == PQTRANS_INERROR)
! 	{
! 		success = false;
! 		break;
! 	}
  }
+ else
+ 	deleted++;
  			}
  			else
  deleted++;
! 
! 			if (param->transaction_limit > 0 &&
! (deleted % param->transaction_limit) == 0)
  			{
! res2 = PQexec(conn, "commit");
! if (PQresultStatus(res2) != PGRES_COMMAND_OK)
! {
! 	fprintf(stderr, "Failed to commit transaction:\n");
! 	fprintf(stderr, "%s", PQerrorMessage(conn));
! 	PQclear(res2);
! 	PQclear(res);
! 	PQfinish(conn);
! 	return -1;
! }
  PQclear(res2);
! res2 = PQexec(conn, "begin");
! if (PQresultStatus(res2) != PGRES_COMMAND_OK)
! {
! 	fprintf(stderr, "Failed to start transaction:\n");
! 	fprintf(stderr, "%s", PQerrorMessage(conn));
! 	PQclear(res2);
! 	PQclear(res);
! 	PQfinish(conn);
! 	return -1;
! }
  PQclear(res2);
  			}
  		}
  	}
+ 
  	PQclear(res);
  
  	/*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Simon Riggs
On 12 November 2012 16:51, Robert Haas robertmh...@gmail.com wrote:

 Although there may be some workloads that access very large numbers of
 tables repeatedly, I bet that's not typical.

Transactions with large numbers of DDL statements are typical at
upgrade (application or database release level) and the execution time
of those is critical to availability.

I'm guessing you mean large numbers of tables and accessing each one
multiple times?

 Rather, I bet that a
 session which accesses 10,000 tables is most likely to access them
 just once each - and right now we don't handle that case very well;
 this is not the first complaint about big relcaches causing problems.

pg_restore frequently accesses tables more than once as it runs, but
not more than a dozen times each, counting all types of DDL.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-12 Thread Robert Haas
On Mon, Nov 12, 2012 at 5:17 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On 12 November 2012 16:51, Robert Haas robertmh...@gmail.com wrote:

 Although there may be some workloads that access very large numbers of
 tables repeatedly, I bet that's not typical.

 Transactions with large numbers of DDL statements are typical at
 upgrade (application or database release level) and the execution time
 of those is critical to availability.

 I'm guessing you mean large numbers of tables and accessing each one
 multiple times?

Yes, that is what I meant.

 Rather, I bet that a
 session which accesses 10,000 tables is most likely to access them
 just once each - and right now we don't handle that case very well;
 this is not the first complaint about big relcaches causing problems.

 pg_restore frequently accesses tables more than once as it runs, but
 not more than a dozen times each, counting all types of DDL.

Hmm... yeah.  Some of those accesses are probably one right after
another so any cache-flushing behavior would be fine; but index
creations for example might happen quite a bit later in the file,
IIRC.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Greg Smith

On 11/12/12 4:44 AM, Craig Ringer wrote:


Is it absurd to suggest using another bitmap, like the FSM or visibility
map, to store information on page checksumming while checksumming is
enabled but incomplete?


I spent some time thinking about that last week.  One problem with it is 
that the bitmap structure itself has the same issues as every other 
write here--how do we know it's going to disk accurately?  The "put 
'checksum on' bits on the page" idea and "put checksum on bits in a map" 
have the same fundamental issue.  Things might get out of sync in the 
same way, you've just moved the potentially suspicious write to a new place.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Jeff Davis
On Mon, 2012-11-12 at 20:44 +0100, Markus Wanner wrote:
 As described before in this thread, I think we might be able to do
 without the "has checksum" bit, as yet another simplification. But I
 don't object to adding it, either.

I see. For a first patch, I guess that's OK. Might as well make it as
simple as possible. We probably need to decide what to do there before
9.3 is released though.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Greg Smith

On 11/12/12 3:44 AM, Markus Wanner wrote:

Sorry if that has been discussed before, but can't we do without that
bit at all? It adds a checksum switch to each page, where we just agreed
we don't event want a per-database switch.


Once you accept that eventually there need to be online conversion 
tools, there needs to be some easy way to distinguish which pages have 
been processed for several potential implementations.  The options seem 
to be adding some bits just for that or bumping the page format.  I 
would like to just bump the format, but that has a pile of its own 
issues to cross.  I'd rather not make that a prerequisite for this 
month's work.



Can we simply write a progress indicator to pg_control or someplace
saying that all pages up to X of relation Y are supposed to have valid
checksums?


All of the table-based checksum enabling ideas seem destined to add 
metadata to pg_class or something related to it for this purpose.  While 
I think everyone agrees that this is a secondary priority to getting 
basic cluster-level checksums going right now, I'd like to have at least 
a prototype for that before 9.3 development ends.  All of the



I realize this doesn't support Jesper's use case of wanting to have the
checksums only for newly dirtied pages. However, I'd argue that
prolonging the migration to spread the load would allow even big shops
to go through this without much of an impact on performance.


I'm thinking of this in some ways like the way creation of a new (but 
not yet valid) foreign key works.  Once that's active, new activity is 
immediately protected moving forward.  And eventually there's this 
cleanup step needed, one that you can inch forward over a few days.


The main upper limit on load spreading here is that the conversion 
program may need to grab a snapshot.  In that case the conversion taking 
too long will be a problem, as it blocks other vacuum activity past that 
point.   This is why I think any good solution to this problem needs to 
incorporate restartable conversion.  We were just getting complaints 
recently about how losing a CREATE INDEX CONCURRENTLY session can cause 
the whole process to end and need to be started over.  The way 
autovacuum runs right now it can be stopped and restarted later, with 
only a small loss of duplicated work in many common cases.  If it's 
possible to maintain that property for the checksum conversion, that 
would be very helpful to larger sites.  It doesn't matter if adding 
checksums to the old data takes a week if you throttle the load down, so 
long as you're not forced to hold an open snapshot the whole time.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Josh Berkus
Jeff,

 OK, so here's my proposal for a first patch (changes from Simon's
 patch):
 
   * Add a flag to the postgres executable indicating that it should use
 checksums on everything. This would only be valid if bootstrap mode is
 also specified.
   * Add a multi-state checksums flag in pg_control, that would have
 three states: OFF, ENABLING, and ON. It would only be set to ON during
 bootstrap, and in this first patch, it would not be possible to set
 ENABLING.
   * Remove GUC and use this checksums flag everywhere.
   * Use the TLI field rather than the version field of the page header.
   * Incorporate page number into checksum calculation (already done).
   
 Does this satisfy the requirements for a first step? Does it interfere
 with potential future work?

So the idea of this implementation is that checksums are something you
set at initdb time, and if you want checksums on an existing database,
it's a migration process (e.g. dump and reload)?

I think that's valid as a first cut at this.

We'll need interruptible VACUUM CHECKSUM later, but we don't have to
have it for the first version of the feature.
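
Purely as illustration of the proposal quoted above (a hypothetical sketch -- 
these names and this layout are made up, not the real pg_control contents), 
the tri-state flag amounts to something like:

#include <stdbool.h>

typedef enum ChecksumState
{
	CHECKSUMS_OFF,				/* never written, never verified */
	CHECKSUMS_ENABLING,			/* stamped on writes, not yet verified on reads */
	CHECKSUMS_ON				/* stamped and verified everywhere */
} ChecksumState;

/* in this sketch, the value would be read from pg_control at startup */
static ChecksumState cluster_checksum_state = CHECKSUMS_OFF;

static inline bool
checksum_write_enabled(void)
{
	return cluster_checksum_state != CHECKSUMS_OFF;
}

static inline bool
checksum_verify_enabled(void)
{
	return cluster_checksum_state == CHECKSUMS_ON;
}

The ENABLING state is what an interruptible conversion would later run under: 
new writes are already protected, while verification stays off until every 
page has been visited.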

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Index only scans wiki page

2012-11-12 Thread Jeff Janes
https://wiki.postgresql.org/wiki/Index-only_scans

This page is seriously out of date.  It suggests they are not yet
implemented, but only being talked about.

Would someone who knows a lot about the subject (which is why I'm
sending this hackers, not web)  like to take a whack at updating this?
 Otherwise, I'd like to just replace the whole page with a discussion
of the limitations and trade-offs involved in using them as they
currently exist.

Thanks,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index only scans wiki page

2012-11-12 Thread Peter Geoghegan
On 13 November 2012 01:03, Jeff Janes jeff.ja...@gmail.com wrote:
 https://wiki.postgresql.org/wiki/Index-only_scans

 This page is seriously out of date.  It suggests they are not yet
 implemented, but only being talked about.

Attached is a piece I wrote on the feature. That might form the basis
of a new wiki page. Feel free to incorporate this material as you see
fit.

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


index_only_scans.rst
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doc patch to See Also: CREATE TABLE AS in CREATE TABLE docs

2012-11-12 Thread Peter Eisentraut
On Sun, 2012-09-23 at 21:28 -0500, Karl O. Pinc wrote:
 Patch to add CREATE TABLE AS to the See Also: section
 of the CREATE TABLE docs.

committed



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] support for LDAP URLs

2012-11-12 Thread Peter Eisentraut
Here is a patch to support RFC 2255 LDAP URLs in pg_hba.conf.  So,
instead of, say

host ... ldap ldapserver=ldap.example.net ldapbasedn="dc=example, dc=net" 
ldapsearchattribute=uid

you could write

host ... ldap ldapurl="ldap://ldap.example.net/dc=example,dc=net?uid?sub"

Apache and probably other software uses the same format, and it's easier
to have a common format for all such configuration instead of having to
translate the information provided by the LDAP admin into each
software's particular configuration spellings.

I'm using the OpenLDAP-provided URL parsing routine, which means this
wouldn't be supported on Windows.  But we already support different
authentication settings on different platforms, so this didn't seem such
a big problem.
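
To make the mapping concrete, here is a small standalone sketch (mine, for 
illustration only, not code from the patch) of what OpenLDAP's ldap_url_parse() 
returns for a URL of this shape; the URL string is just the example from above, 
and it builds with -lldap:

#include <stdio.h>
#include <ldap.h>

int
main(void)
{
	const char *url = "ldap://ldap.example.net/dc=example,dc=net?uid?sub";
	LDAPURLDesc *ludp;

	if (ldap_url_parse(url, &ludp) != LDAP_URL_SUCCESS)
	{
		fprintf(stderr, "could not parse LDAP URL\n");
		return 1;
	}

	printf("host:   %s\n", ludp->lud_host);		/* ldap.example.net */
	printf("basedn: %s\n", ludp->lud_dn);		/* dc=example,dc=net */
	printf("attr:   %s\n", ludp->lud_attrs ? ludp->lud_attrs[0] : "(none)");	/* uid */
	printf("scope:  %s\n", ludp->lud_scope == LDAP_SCOPE_SUBTREE ? "sub" : "base/one");

	ldap_free_urldesc(ludp);
	return 0;
}
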
diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index d053fce..ba6523b 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -1486,6 +1486,39 @@ <title>LDAP Authentication</title>
 </para>
 </listitem>
   </varlistentry>
+  <varlistentry>
+   <term><literal>ldapurl</literal></term>
+   <listitem>
+    <para>
+     You can write most of the LDAP options alternatively using an RFC 2255
+     LDAP URL.  The format is
+<synopsis>
+ldap://[<replaceable>user</replaceable>[:<replaceable>password</replaceable>]@]<replaceable>host</replaceable>[:<replaceable>port</replaceable>]/<replaceable>basedn</replaceable>[?[<replaceable>attribute</replaceable>][?[<replaceable>scope</replaceable>]]]
+</synopsis>
+     <replaceable>scope</replaceable> must be one
+     of <literal>base</literal>, <literal>one</literal>, <literal>sub</literal>,
+     typically the latter.  Only one attribute is used, and some other
+     components of standard LDAP URLs such as filters and extensions are
+     not supported.
+    </para>
+
+    <para>
+     If you want encrypted LDAP connections, you have to use
+     the <literal>ldaptls</literal> option in addition
+     to <literal>ldapurl</literal>.  The <literal>ldaps</literal> URL
+     scheme (direct SSL connection) is not supported.
+    </para>
+
+    <para>
+     Some other software that supports authentication against LDAP uses the
+     same URL format, so it will be easier to share the configuration.
+    </para>
+
+    <para>
+     LDAP URLs are currently only supported with OpenLDAP, not on Windows.
+    </para>
+   </listitem>
+  </varlistentry>
 </variablelist>
   </para>
 
@@ -1520,6 +1553,13 @@ <title>LDAP Authentication</title>
 If that second connection succeeds, the database access is granted.
   </para>
 
+   <para>
+    Here is the same search+bind configuration written as a URL:
+<programlisting>
+host ... ldap ldapurl="ldap://ldap.example.net/dc=example,dc=net?uid?sub"
+</programlisting>
+   </para>
+
   <tip>
    <para>
     Since LDAP often uses commas and spaces to separate the different
diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index ca470e1..cc1140d 100644
--- a/src/backend/libpq/auth.c
+++ b/src/backend/libpq/auth.c
@@ -2209,7 +2209,7 @@ static int pam_passwd_conv_proc(int num_msg, const struct pam_message ** msg,
 
 		r = ldap_search_s(ldap,
 		  port->hba->ldapbasedn,
-		  LDAP_SCOPE_SUBTREE,
+		  port->hba->ldapscope,
 		  filter,
 		  attributes,
 		  0,
diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index 7502e82..ba630a7 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -37,6 +37,16 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 
+#ifdef USE_LDAP
+#ifndef WIN32
+/* We use a deprecated function to keep the codepath the same as win32. */
+#define LDAP_DEPRECATED 1
 #include <ldap.h>
+#else
 #include <winldap.h>
+#endif
+#endif
+
 
 #define atooid(x)  ((Oid) strtoul((x), NULL, 10))
 #define atoxid(x)  ((TransactionId) strtoul((x), NULL, 10))
@@ -1336,7 +1346,7 @@ static bool parse_hba_auth_opt(char *name, char *val, HbaLine *hbaline,
 			{
 ereport(LOG,
 		(errcode(ERRCODE_CONFIG_FILE_ERROR),
-		 errmsg("cannot use ldapbasedn, ldapbinddn, ldapbindpasswd, or ldapsearchattribute together with ldapprefix"),
+		 errmsg("cannot use ldapbasedn, ldapbinddn, ldapbindpasswd, ldapsearchattribute, or ldapurl together with ldapprefix"),
 		 errcontext("line %d of configuration file \"%s\"",
 	line_num, HbaFileName)));
 return NULL;
@@ -1378,6 +1388,8 @@ static bool parse_hba_auth_opt(char *name, char *val, HbaLine *hbaline,
 static bool
 parse_hba_auth_opt(char *name, char *val, HbaLine *hbaline, int line_num)
 {
+	hbaline->ldapscope = LDAP_SCOPE_SUBTREE;
+
 	if (strcmp(name, "map") == 0)
 	{
 		if (hbaline->auth_method != uaIdent &&
@@ -1437,6 +1449,54 @@ static bool parse_hba_auth_opt(char *name, char *val, HbaLine *hbaline,
 		REQUIRE_AUTH_OPTION(uaPAM, "pamservice", "pam");
 		hbaline-pamservice = pstrdup(val);
 	}
+	else if (strcmp(name, "ldapurl") == 0)
+	{
+		LDAPURLDesc *urldata;
+		int rc;
+
+		REQUIRE_AUTH_OPTION(uaLDAP, "ldapurl", "ldap");
+
+#ifdef 

Re: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL

2012-11-12 Thread Amit kapila
On Monday, November 12, 2012 12:07 PM Greg Smith wrote:
On 11/9/12 11:59 PM, Amit kapila wrote:

 Please let me know if there are any objections or problems in above method 
 of implementation,
 else I can go ahead to prepare the patch for the coming CF.

 It may be the case that the locking scheme Robert described is the best
 approach here.  It seems kind of heavy to me though.  I suspect that
 some more thinking about it might come up with something better.

Yes, we should evaluate multiple options to do this and then choose the best 
among them. 
I am ready to work on evaluating other ways to accomplish this feature.

Is the above opinion only about locking, or also about the way of writing the 
changed values to a file, as mentioned in point-1 in the mail chain upthread?
(Point-1: While writing the .auto file, it will always assume that the .auto file 
contains
 all config parameters.
   Now as this .auto file is of fixed format and fixed record size, it can
 directly write a given record to its particular position.)
My thinking was that if we can decide that the format and size of each 
configuration entry is fixed, a value can be written directly without doing 
anything for it in memory.
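
To illustrate that idea (a toy sketch only -- the record size and record format 
here are made up, not what the patch would use): with fixed-size slots, one 
changed setting can be rewritten in place by seeking to its slot.

#include <stdio.h>
#include <string.h>

#define AUTO_RECORD_SIZE 128	/* illustrative fixed record size */

static int
write_auto_record(FILE *fp, int slot, const char *name, const char *value)
{
	char		rec[AUTO_RECORD_SIZE];
	size_t		len;

	/* build a fixed-size, space-padded "name = 'value'" record */
	memset(rec, ' ', sizeof(rec));
	len = (size_t) snprintf(rec, sizeof(rec), "%s = '%s'", name, value);
	if (len < sizeof(rec))
		rec[len] = ' ';			/* replace the terminating NUL with padding */
	rec[AUTO_RECORD_SIZE - 1] = '\n';

	/* seek straight to this setting's slot and overwrite it in place */
	if (fseek(fp, (long) slot * AUTO_RECORD_SIZE, SEEK_SET) != 0)
		return -1;
	if (fwrite(rec, 1, sizeof(rec), fp) != sizeof(rec))
		return -1;
	return 0;
}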


 Regardless, exactly how to prevent two concurrent processes from writing
 the same file feels like the last step in the small roadmap for what
 this feature needs.  If you wanted to work on it more, I'd suggest
 breaking it into chunks in this order:

 1) Change to add scanning a .conf directory in the default configuration
 using include-dir.  This is a quick fix.  I predict most of the
 headaches around it will end up being for packagers rather than the core
 code to deal with.

 You could submit this as a small thing to be evaluated on its own.  How
 it's done is going to be controversial.  Might as well get that fighting
 focused against a sample implementation as soon as possible.

As per my understanding, 
a. during initdb, a new conf directory can be created and a .auto file created 
in it.
b. use include_dir at the end of postgresql.conf to include the directory 
created in the above step.
c. server start/SIGHUP will take care of the above include_dir.


 2) Get familiar with navigating the GUC data and figuring out what,
 exactly, needs to be written out.  This should include something that
 navigates like things appear after a RESET, ignoring per-user or
 per-session changes when figuring out what goes there.  It seems
 inevitable that some amount of validating against the source
 information--what pg_settings labels source, sourcefile, and sourceline
 will be needed.  An example is the suggestion Magnus made for confirming
 that the include-dir is still active before writing something there.

Okay, what I will do for this is explain in the next mail the way to do it by 
navigating the GUCs.

 3) Add the function to write a new file out.  Work out some test cases
 for that to confirm the logic and error checking in the previous step
 all works.

 I'd next submit what you've got for (2) and (3) to review at this point,
 before complicating things further with the locking parts.

Okay. 

 4) Make the file write atomic and able to work when multiple users try
 it at once.  You have to reach here successfully before the trivial
 around file locking comes into play.  I wouldn't even bother aiming for
 that part in a first patch.   It's obviously a solvable problem in a
 number of ways.  You need a rock solid way to figure out what to write
 there before that solution is useful though.



With Regards,
Amit Kapila.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

2012-11-12 Thread Amit kapila
On Monday, November 12, 2012 8:23 PM Fujii Masao wrote:
On Fri, Nov 9, 2012 at 3:03 PM, Amit Kapila amit.kap...@huawei.com wrote:
 On Thursday, November 08, 2012 10:42 PM Fujii Masao wrote:
 On Thu, Nov 8, 2012 at 5:53 PM, Amit Kapila amit.kap...@huawei.com
 wrote:
  On Thursday, November 08, 2012 2:04 PM Heikki Linnakangas wrote:
  On 19.10.2012 14:42, Amit kapila wrote:
   On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:
   Before implementing the timeout parameter, I think that it's
 better
  to change
   both pg_basebackup background process and pg_receivexlog so that

 BTW, IIRC the walsender has no timeout mechanism during sending
 backup data to pg_basebackup. So it's also useful to implement the
 timeout mechanism for the walsender during backup.

 Yes, it's useful, but for walsender the main problem is that it uses a blocking
 send call to send the data.
 I have tried using tcp_keepalive settings, but the send call doesn't come out
 in case of a network break.
 The only way I could get it out is:
 change the corresponding file /proc/sys/net/ipv4/tcp_retries2 by using
 the command
 echo 8 > /proc/sys/net/ipv4/tcp_retries2
 As per the recommendation, its value should be at least 8 (equivalent to 100
 sec)

 Do you have any idea, how it can be achieved?

 What about using pq_putmessage_noblock()?

I will try this, but do you know why the code uses blocking mode to send files 
in the first place?
I am asking because I am a little worried about breaking any design decision 
that originally made the sending of files blocking. 
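
As a generic POSIX illustration of the problem (this is not walsender code, 
just a sketch of why a plain blocking send() can hang after a network break and 
how a socket-level send timeout bounds it):

#include <stdio.h>
#include <stddef.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>

static int
send_with_timeout(int sock, const void *buf, size_t len, int timeout_sec)
{
	struct timeval tv;

	tv.tv_sec = timeout_sec;
	tv.tv_usec = 0;

	/* bound how long a single blocking send() may wait for buffer space */
	if (setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) < 0)
		return -1;

	if (send(sock, buf, len, 0) < 0)
	{
		if (errno == EAGAIN || errno == EWOULDBLOCK)
			fprintf(stderr, "send timed out; peer may be unreachable\n");
		return -1;
	}
	return 0;
}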

With Regards,
Amit Kapila.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages

2012-11-12 Thread Amit kapila
On Monday, November 12, 2012 9:56 PM Alvaro Herrera wrote:
Robert Haas escribió:
 On Tue, Jul 31, 2012 at 8:09 AM, Amit kapila amit.kap...@huawei.com wrote:

  I think I can see all of those things being potentially useful.  There
  are a couple of pending patches that will revise the WAL format
  slightly; not sure how much those are likely to interfere with any
  development you might do on (2) in the meantime.
 
  Based on above conclusion, I have prepared a patch which implements 
  Option-1

 I wonder if we shouldn't make this a separate utility, rather than
 something that is part of pg_resetxlog.  Anyone have a thought on that
 topic?

 That thought did cross my mind too.

One of the reasons for keeping it with pg_resetxlog is that this was proposed 
as a solution for scenarios where the user's db has become corrupt and he now
wants to start it. To do so he can find the max LSN and set the same using 
pg_resetxlog, which will avoid further corruption of the database after it has 
started.
If we keep it as a separate utility then the user needs to first run this utility to 
find the max LSN and then use pg_resetxlog to achieve the same. I don't see a big 
problem in that,
but maybe it would be better justified if there were other use cases for it.


However, it might also have other uses that I am not able to think of. 

Do you have any particular reason for making it a separate utility?

With Regards,
Amit Kapila.






-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Memory leaks in record_out and record_send

2012-11-12 Thread Tom Lane
I looked into the problem complained of here:
http://archives.postgresql.org/pgsql-general/2012-11/msg00279.php
which turns out to have nothing to do with joins and everything to do
with the fact that record_out() leaks memory like mad.  It leaks both
the strings returned by the per-column output functions and any column
values that have to be detoasted.  You can easily reproduce this with
an example like

create table leak (f1 int, f2 text);

insert into leak select x, 'foo' from generate_series(1,100) x;

select leak from leak;

The attached patch against HEAD fixes this, as well as a similar
leakage in record_send().  The added code is lifted directly from
printtup() so it's not adding any new assumptions to the system.

I wonder though if we ought to think about running output functions in
a short-lived memory context instead of the executor's main context.
We've considered that before, I think, and it's always been the path
of least resistance to fix the output functions instead --- but there
will always be another leak I'm afraid.
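
For the record, a rough sketch of that alternative (an assumption about the 
approach, not part of the attached patch): run each output function in a 
scratch context and destroy it afterwards, so whatever the function leaks is 
reclaimed in one sweep.

#include "postgres.h"
#include "fmgr.h"
#include "lib/stringinfo.h"
#include "utils/memutils.h"

static void
append_output_in_scratch_context(StringInfo buf, FmgrInfo *outfn, Datum value)
{
	MemoryContext scratch;
	MemoryContext oldcxt;
	char	   *str;

	scratch = AllocSetContextCreate(CurrentMemoryContext,
									"output function scratch",
									ALLOCSET_DEFAULT_MINSIZE,
									ALLOCSET_DEFAULT_INITSIZE,
									ALLOCSET_DEFAULT_MAXSIZE);
	oldcxt = MemoryContextSwitchTo(scratch);

	/* any strings or detoasted copies leaked here land in "scratch" */
	str = OutputFunctionCall(outfn, value);

	/* buf was created in the caller's context; appending keeps it there */
	appendStringInfoString(buf, str);

	MemoryContextSwitchTo(oldcxt);
	MemoryContextDelete(scratch);	/* reclaims everything in one sweep */
}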

OTOH I can't see trying to back-patch a solution like that.   If we want
to fix this in the back branches (and note the complaint linked above is
against 8.3), I think we have to do it as attached.

Thoughts?

regards, tom lane

diff --git a/src/backend/utils/adt/rowtypes.c b/src/backend/utils/adt/rowtypes.c
index 13e574d4e82b4412f7ab32e02398d1671d381097..d4ed7d0ca06ac1a0dee6b4963989a5b41713fb3b 100644
*** a/src/backend/utils/adt/rowtypes.c
--- b/src/backend/utils/adt/rowtypes.c
*** typedef struct ColumnIOData
*** 32,37 
--- 32,38 
  	Oid			column_type;
  	Oid			typiofunc;
  	Oid			typioparam;
+ 	bool		typisvarlena;
  	FmgrInfo	proc;
  } ColumnIOData;
  
*** record_out(PG_FUNCTION_ARGS)
*** 364,369 
--- 365,371 
  	{
  		ColumnIOData *column_info = &my_extra->columns[i];
  		Oid			column_type = tupdesc->attrs[i]->atttypid;
+ 		Datum		attr;
  		char	   *value;
  		char	   *tmp;
  		bool		nq;
*** record_out(PG_FUNCTION_ARGS)
*** 387,403 
  		 */
  		if (column_info->column_type != column_type)
  		{
- 			bool		typIsVarlena;
- 
  			getTypeOutputInfo(column_type,
  			  column_info->typiofunc,
! 			  &typIsVarlena);
  			fmgr_info_cxt(column_info->typiofunc, &column_info->proc,
  		  fcinfo->flinfo->fn_mcxt);
  			column_info->column_type = column_type;
  		}
  
! 		value = OutputFunctionCall(&column_info->proc, values[i]);
  
  		/* Detect whether we need double quotes for this value */
  		nq = (value[0] == '\0');	/* force quotes for empty string */
--- 389,412 
  		 */
  		if (column_info->column_type != column_type)
  		{
  			getTypeOutputInfo(column_type,
  			  column_info->typiofunc,
! 			  &column_info->typisvarlena);
  			fmgr_info_cxt(column_info->typiofunc, &column_info->proc,
  		  fcinfo->flinfo->fn_mcxt);
  			column_info->column_type = column_type;
  		}
  
! 		/*
! 		 * If we have a toasted datum, forcibly detoast it here to avoid
! 		 * memory leakage inside the type's output routine.
! 		 */
! 		if (column_info->typisvarlena)
! 			attr = PointerGetDatum(PG_DETOAST_DATUM(values[i]));
! 		else
! 			attr = values[i];
! 
! 		value = OutputFunctionCall(&column_info->proc, attr);
  
  		/* Detect whether we need double quotes for this value */
  		nq = (value[0] == '\0');	/* force quotes for empty string */
*** record_out(PG_FUNCTION_ARGS)
*** 416,432 
  
  		/* And emit the string */
  		if (nq)
! 			appendStringInfoChar(buf, '"');
  		for (tmp = value; *tmp; tmp++)
  		{
  			char		ch = *tmp;
  
  			if (ch == '"' || ch == '\\')
! appendStringInfoChar(buf, ch);
! 			appendStringInfoChar(buf, ch);
  		}
  		if (nq)
! 			appendStringInfoChar(buf, '"');
  	}
  
  	appendStringInfoChar(buf, ')');
--- 425,447 
  
  		/* And emit the string */
  		if (nq)
! 			appendStringInfoCharMacro(buf, '"');
  		for (tmp = value; *tmp; tmp++)
  		{
  			char		ch = *tmp;
  
  			if (ch == '"' || ch == '\\')
! appendStringInfoCharMacro(buf, ch);
! 			appendStringInfoCharMacro(buf, ch);
  		}
  		if (nq)
! 			appendStringInfoCharMacro(buf, '"');
! 
! 		pfree(value);
! 
! 		/* Clean up detoasted copy, if any */
! 		if (DatumGetPointer(attr) != DatumGetPointer(values[i]))
! 			pfree(DatumGetPointer(attr));
  	}
  
  	appendStringInfoChar(buf, ')');
*** record_send(PG_FUNCTION_ARGS)
*** 714,719 
--- 729,735 
  	{
  		ColumnIOData *column_info = &my_extra->columns[i];
  		Oid			column_type = tupdesc->attrs[i]->atttypid;
+ 		Datum		attr;
  		bytea	   *outputbytes;
  
  		/* Ignore dropped columns in datatype */
*** record_send(PG_FUNCTION_ARGS)
*** 734,756 
  		 */
  		if (column_info->column_type != column_type)
  		{
- 			bool		typIsVarlena;
- 
  			getTypeBinaryOutputInfo(column_type,
  	column_info->typiofunc,
! 	&typIsVarlena);
  			fmgr_info_cxt(column_info->typiofunc, &column_info->proc,
  		  fcinfo->flinfo->fn_mcxt);
  			

Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol

2012-11-12 Thread Amit kapila
On Monday, November 12, 2012 8:31 PM Merlin Moncure wrote:
On Fri, Sep 14, 2012 at 6:42 AM, Amit kapila amit.kap...@huawei.com wrote:
 On Tuesday, September 11, 2012 7:07 PM Amit Kapila wrote:
 On Monday, September 10, 2012 8:20 PM Amit Kapila wrote:
 On Sunday, September 09, 2012 1:37 PM Amit Kapila wrote:
 On Friday, September 07, 2012 11:19 PM Tom Lane wrote:
 Heikki Linnakangas hlinn...@iki.fi writes:
  Would socketpair(2) be simpler?

 I have started working on this patch to make it work on Windows. The 3
 main things to make it work are:

The patch which contains Windows implementation as well is attached with 
this mail. It contains changes related to following
1. waitpid
2. socketpair
3. fork-exec

The following is still left:
 1. Error handling in all paths

 The modified version 2 contains error handling in all paths.

 I didn't see that this patch was added to a commitfest  -- should it
 have been?  I very much like Tom's proposed starting point for this
 feature as a replacement for --single.  Hate to see this die on the
 vine.  Would some testing on windows be what's needed to get the ball
 rolling?

After the Windows implementation, I also did first-level tests to make sure 
it works. 
I think Tom is the right person to comment on how to see this patch move 
forward.
I am not sure why he didn't provide feedback or proceed to complete it.
Is it due to time, or might he have foreseen some design or use-case problem? If 
it's due to time then I think he can be persuaded.

With Regards,
Amit Kapila. 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol

2012-11-12 Thread Amit kapila
On Tuesday, November 13, 2012 3:11 AM Simon Riggs wrote:
On 12 November 2012 21:26, Merlin Moncure mmonc...@gmail.com wrote:

 I couldn't disagree more.  The patch is small, logical, and fixes an
 awful problem, namely that --single mode is basically unusable.  As to
 your wider point (namely, that you can't connect to it, therefore it's
 bad), it has got to be refuted by numerous competing solutions in the
 market such as http://www.firebirdsql.org/manual/fbmetasecur-embedded.html,
 and many others.

As far as I remember even MySQL provides such a mode.

 Small is not an argument in favour, just a statement of ease, like
jumping off a cliff. i.e. lemmings.

 While it's not as common as it used to be, now and then a piece of
 software needs to distribute an application as part of a boxed
 product.  Postgres is horrible at this and doesn't have to be; imagine
 how much easier the lives of poker tracker would be (for *some* of its
 users) with an integrated standalone mode: google 'poker tracker
 postgresql' and take a good long look at problems people face in this
 scenario.

I get the installability thang, very very much, I just don't see the
single process thing as the only solution. At very least an open
minded analysis of the actual problem and ways of solving it is called
for, not just reach for a close to hand solution.

Another use case where I have seen this required is in telecom billing apps.
The telecom applications where this solution works need other maintenance 
connections as well.
Some of the reasons for its use are performance and lower maintenance overhead, 
and their data requirements are 
also not very high.
So even if this solution doesn't meet all the requirements of a single-process 
solution (and I don't think it is written to address them all), can't we think 
of it as a first version and then, based on requirements, extend it to have other 
capabilities:
a. a mechanism for other background processes (autovacuum, checkpoint, 
..).
b. more needs to be thought of..


With Regards,
Amit Kapila.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers