[HACKERS] Re: Using quicksort and a merge step to significantly improve on tuplesort's single run external sort

2015-07-30 Thread Simon Riggs
On 30 July 2015 at 12:26, Greg Stark st...@mit.edu wrote:


 On Thu, Jul 30, 2015 at 12:09 PM, Heikki Linnakangas hlinn...@iki.fi
 wrote:


 True, you need a heap to hold the next tuple from each tape in the merge
 step. To avoid exceeding work_mem, you'd need to push some tuples from the
 in-memory array to the tape to make room for that. In practice, though, the
 memory needed for the merge step's heap is tiny. Even if you merge 1000
 tapes, you only need memory for 1000 tuples in the heap. But yeah, you'll
 need some logic to share/divide the in-memory array between the heap and
 the in-memory tail of the last tape.


 It's a bit worse than that because we buffer up a significant chunk of the
 tape to avoid randomly seeking between tapes after every tuple read. But I
 think in today's era of large memory we don't need anywhere near the entire
 work_mem just to buffer to avoid random access. Something simple like a
 fixed buffer size per tape, probably much less than 1MB/tape, would do.


MERGE_BUFFER_SIZE is currently 0.25 MB, but there was benefit seen above
that. I'd say we should scale that up to 1 MB if memory allows.

Yes, that could be tiny for a small number of runs. I mention it because
Heikki's comment that we could use this in the general case would not be
true for larger numbers of runs. The number of runs decreases quickly with more
memory anyway, so the technique is mostly good for larger memory but
certainly not for small memory allocations.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [PATCH] Microvacuum for gist.

2015-07-30 Thread Alexander Korotkov
Hi!

On Thu, Jul 30, 2015 at 2:51 PM, Anastasia Lubennikova 
lubennikov...@gmail.com wrote:

 I have written microvacuum support for the gist access method.
 Briefly, microvacuum includes two steps:
 1. When a search tells us that a tuple is invisible to all transactions, it
 is marked LP_DEAD and the page is marked as having dead tuples.
 2. Then, when an insert touches a full page which has dead tuples, it calls
 microvacuum instead of splitting the page.
 You can find a kind of review here [1].

 [1]
 http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/ivanitskiy_ilya/5629499534213120

 Patch is in attachments. Please review it.


Nice!

Some notes about this patch.

1) Could you provide some test case demonstrating that microvacuum really
works with the patch? Ultimately, we should see the index growing less with
microvacuum.

2) Generating notices for every dead tuple would be too noisy. I suggest
replacing the notice with one of the debug levels.

+  elog(NOTICE, "gistkillitems. Mark Item Dead offnum %hd, blkno %d",
offnum, BufferGetBlockNumber(buffer));


3) Please recheck the coding style. For instance, this line needs more spaces
and the open brace should be on the next line.

+ if ((scan->kill_prior_tuple) && (so->curPageData > 0) && (so->curPageData ==
so->nPageData)) {
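
In pgindent-like style that would look roughly like this (just a reformatting
of the same condition, not the patch code):

if (scan->kill_prior_tuple &&
    so->curPageData > 0 &&
    so->curPageData == so->nPageData)
{
    /* ... mark the previously returned item as dead ... */
}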

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] multivariate statistics / patch v7

2015-07-30 Thread Tomas Vondra

Hi,

On 07/30/2015 10:21 AM, Heikki Linnakangas wrote:

On 05/25/2015 11:43 PM, Tomas Vondra wrote:

There are 6 files attached, but only 0002-0006 are actually part of the
multivariate statistics patch itself.


All of these patches are huge. In order to review this in a reasonable
amount of time, we need to do this in several steps. So let's see what
would be the minimal set of these patches that could be reviewed and
committed, while still being useful.

The main patches are:

1. shared infrastructure and functional dependencies
2. clause reduction using functional dependencies
3. multivariate MCV lists
4. multivariate histograms
5. multi-statistics estimation

Would it make sense to commit only patches 1 and 2 first? Would that be
enough to get a benefit from this?


I agree that the patch can't be reviewed as a single chunk - that was 
the idea when I split the original (single chunk) patch into multiple 
smaller pieces.


And yes, I believe committing pieces 1 & 2 might be enough to get 
something useful, which can then be improved by adding the usual MCV 
and histogram stats on top of that.



I have some doubts about the clause reduction and functional
dependencies part of this. It seems to treat functional dependency as
a boolean property, but even with the classic zipcode and city case,
it's not always an all or nothing thing. At least in some countries,
there can be zipcodes that span multiple cities. So zipcode=X does
not completely imply city=Y, although there is a strong correlation
(if that's the right term). How strong does the correlation need to
be for this patch to decide that zipcode implies city? I couldn't
actually see a clear threshold stated anywhere.

So rather than treating functional dependence as a boolean, I think
it would make more sense to put a 0.0-1.0 number to it. That means
that you can't do clause reduction like it's done in this patch,
where you actually remove clauses from the query for cost estimation
purposes. Instead, you need to calculate the selectivity for each
clause independently, but instead of just multiplying the
selectivities together, apply the dependence factor to it.
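
For example (just a sketch of the general shape, not a concrete proposal for
the exact formula), the two selectivities could be blended with the
dependence degree f like this:

/*
 * f = 0.0 reduces to the usual independence assumption,
 * f = 1.0 treats the implied clause as fully redundant.
 */
static double
combine_selectivities(double s_implying, double s_implied, double f)
{
    return s_implying * (f + (1.0 - f) * s_implied);
}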

Does that make sense? I haven't really looked at the MCV, histogram
and multi-statistics estimation patches yet. Do those patches make
the clause reduction patch obsolete? Should we forget about the
clause reduction and functional dependency patch, and focus on those
later patches instead?


Perhaps. It's true that most real-world data sets are not 100% valid 
with respect to functional dependencies - either because of natural 
imperfections (multiple cities with the same ZIP code) or just noise in 
the data (incorrect entries ...). And it's even mentioned in the code 
comments somewhere, I guess.


But there are two main reasons why I chose not to extend the functional 
dependencies with the [0.0-1.0] value you propose.


Firstly, functional dependencies were meant to be the simplest possible 
implementation, illustrating how the infrastructure is supposed to 
work (which is the main topic of the first patch).


Secondly, all kinds of statistics are simplifications of the actual 
data. So I think it's not incorrect to ignore the exceptions up to some 
threshold.


I also don't think this will make the estimates globally better. Let's 
say you have 1% of rows that contradict the functional dependency - you 
may either ignore them and have good estimates for 99% of the values and 
incorrect estimates for 1%, or tweak the rule a bit and make the 
estimates worse for 99% (and possibly better for 1%).


That being said, I'm not against improving the functional dependencies. 
I already do have some improvements on my TODO - like for example 
dependencies on more columns (not just A=>B but [A,B]=>C and such), but 
I think we should not squash this into those two patches.


And yet another point - ISTM these cases might easily be handled better 
by the statistics based on ndistinct coefficients, as proposed by 
Kyotaro-san some time ago. That is, compute and track


ndistinct(A) * ndistinct(B) / ndistinct(A,B)

for all pairs of columns (or possibly larger groups). That seems to be 
similar to the coefficient you propose.
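
To illustrate with made-up numbers: the coefficient approaches ndistinct(B)
when B is (almost) determined by A, and stays near 1.0 for independent
columns.

#include <stdio.h>

int
main(void)
{
    /* invented statistics for a zipcode/city pair */
    double nd_zip = 10000.0;
    double nd_city = 500.0;
    double nd_both = 10200.0;   /* only slightly above ndistinct(zip) */

    /* prints ~490, i.e. close to ndistinct(city) => strong dependency */
    printf("coefficient = %.1f\n", nd_zip * nd_city / nd_both);
    return 0;
}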


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5a1 BUG FIX: pgbench negative latencies

2015-07-30 Thread Heikki Linnakangas

On 07/27/2015 03:43 PM, Fabien COELHO wrote:

Under 9.5a1, pgbench -r reports negative latencies on meta commands,
probably as an oversight of 84f0ea3f.

This patch ensures that "now" is reset on each loop inside doCustom.


Applied, thanks!

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Row-Level Security Policies (RLS)

2015-07-30 Thread Dean Rasheed
On 30 July 2015 at 01:35, Joe Conway m...@joeconway.com wrote:

 On 06/01/2015 02:21 AM, Dean Rasheed wrote:
 While going through this, I spotted another issue --- in a DML
 query with additional non-target relations, such as UPDATE t1 ..
 FROM t2 .., the old code was checking the UPDATE policies of both
 t1 and t2, but really I think it ought to be checking the SELECT
 policies of t2 (in the same way as this query requires SELECT table
 permissions on t2, not UPDATE permissions). I've changed that and
 added new regression tests to test that change.

 I assume the entire refactoring patch needs a fair bit of work to
 rebase against current HEAD,

Actually, there haven't been any conflicting changes so far, so a git
rebase was able to automatically merge correctly -- new patch
attached, with some minor comment rewording (not affecting the bug-fix
part).

Even so, I agree that it makes sense to apply the bug-fix separately,
since it's not really anything to do with the refactoring.


 but I picked out the attached to address
 just the above issue. Does this look correct, and if so does it make
 sense to apply at least this part right now?


Looks correct to me.

Thanks.

Regards,
Dean
diff --git a/src/backend/commands/policy.c b/src/backend/commands/policy.c
new file mode 100644
index bcf4a8f..cb689ec
--- a/src/backend/commands/policy.c
+++ b/src/backend/commands/policy.c
@@ -186,9 +186,6 @@ policy_role_list_to_array(List *roles, i
 /*
  * Load row security policy from the catalog, and store it in
  * the relation's relcache entry.
- *
- * We will always set up some kind of policy here.  If no explicit policies
- * are found then an implicit default-deny policy is created.
  */
 void
 RelationBuildRowSecurity(Relation relation)
@@ -246,7 +243,6 @@ RelationBuildRowSecurity(Relation relati
 			char	   *with_check_value;
 			Expr	   *with_check_qual;
 			char	   *policy_name_value;
-			Oid			policy_id;
 			bool		isnull;
 			RowSecurityPolicy *policy;
 
@@ -298,14 +294,11 @@ RelationBuildRowSecurity(Relation relati
 			else
 with_check_qual = NULL;
 
-			policy_id = HeapTupleGetOid(tuple);
-
 			/* Now copy everything into the cache context */
 			MemoryContextSwitchTo(rscxt);
 
 			policy = palloc0(sizeof(RowSecurityPolicy));
 			policy->policy_name = pstrdup(policy_name_value);
-			policy->policy_id = policy_id;
 			policy->polcmd = cmd_value;
 			policy->roles = DatumGetArrayTypePCopy(roles_datum);
 			policy->qual = copyObject(qual_expr);
@@ -326,40 +319,6 @@ RelationBuildRowSecurity(Relation relati
 
 		systable_endscan(sscan);
 		heap_close(catalog, AccessShareLock);
-
-		/*
-		 * Check if no policies were added
-		 *
-		 * If no policies exist in pg_policy for this relation, then we need
-		 * to create a single default-deny policy.  We use InvalidOid for the
-		 * Oid to indicate that this is the default-deny policy (we may decide
-		 * to ignore the default policy if an extension adds policies).
-		 */
-		if (rsdesc->policies == NIL)
-		{
-			RowSecurityPolicy *policy;
-			Datum		role;
-
-			MemoryContextSwitchTo(rscxt);
-
-			role = ObjectIdGetDatum(ACL_ID_PUBLIC);
-
-			policy = palloc0(sizeof(RowSecurityPolicy));
-			policy->policy_name = pstrdup("default-deny policy");
-			policy->policy_id = InvalidOid;
-			policy->polcmd = '*';
-			policy->roles = construct_array(&role, 1, OIDOID, sizeof(Oid), true,
-			'i');
-			policy->qual = (Expr *) makeConst(BOOLOID, -1, InvalidOid,
-		   sizeof(bool), BoolGetDatum(false),
-			  false, true);
-			policy->with_check_qual = copyObject(policy->qual);
-			policy->hassublinks = false;
-
-			rsdesc->policies = lcons(policy, rsdesc->policies);
-
-			MemoryContextSwitchTo(oldcxt);
-		}
 	}
 	PG_CATCH();
 	{
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
new file mode 100644
index 2c65a90..c28eb2b
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1815,14 +1815,26 @@ ExecWithCheckOptions(WCOKind kind, Resul
 	break;
 case WCO_RLS_INSERT_CHECK:
 case WCO_RLS_UPDATE_CHECK:
-	ereport(ERROR,
-			(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+	if (wco->polname != NULL)
+		ereport(ERROR,
+(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+			 errmsg("new row violates row level security policy \"%s\" for \"%s\"",
+	wco->polname, wco->relname)));
+	else
+		ereport(ERROR,
+(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 			 errmsg("new row violates row level security policy for \"%s\"",
 	wco->relname)));
 	break;
 case WCO_RLS_CONFLICT_CHECK:
-	ereport(ERROR,
-			(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+	if (wco->polname != NULL)
+		ereport(ERROR,
+(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+			 errmsg("new row violates row level security policy \"%s\" (USING expression) for \"%s\"",
+	wco->polname, wco->relname)));
+	else
+		ereport(ERROR,
+

[HACKERS] [PATCH] Microvacuum for gist.

2015-07-30 Thread Anastasia Lubennikova
Hi,

I have written microvacuum support for the gist access method.
Briefly, microvacuum includes two steps:
1. When a search tells us that a tuple is invisible to all transactions, it
is marked LP_DEAD and the page is marked as having dead tuples.
2. Then, when an insert touches a full page which has dead tuples, it calls
microvacuum instead of splitting the page.
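
As a rough sketch, the insert-path part (step 2) looks conceptually like this
(GistPageHasDeadTuples() stands for the hypothetical "has dead tuples" flag
test from step 1, and the helper name is made up; the page primitives are the
existing ones):

static bool
gist_try_microvacuum(Buffer buffer)
{
	Page		page = BufferGetPage(buffer);
	OffsetNumber deletable[MaxOffsetNumber];
	int			ndeletable = 0;
	OffsetNumber off,
				maxoff = PageGetMaxOffsetNumber(page);

	if (!GistPageHasDeadTuples(page))
		return false;

	/* collect the items that a scan marked LP_DEAD in step 1 */
	for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
	{
		if (ItemIdIsDead(PageGetItemId(page, off)))
			deletable[ndeletable++] = off;
	}

	if (ndeletable == 0)
		return false;

	/* reclaim the space instead of splitting the page */
	PageIndexMultiDelete(page, deletable, ndeletable);
	MarkBufferDirty(buffer);	/* WAL logging omitted in this sketch */
	return true;
}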
You can find a kind of review here [1].

[1]
http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/ivanitskiy_ilya/5629499534213120

Patch is in attachments. Please review it.

-- 
Best regards,
Lubennikova Anastasia


microvacuum_for_gist.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] 64-bit XIDs again

2015-07-30 Thread Alexander Korotkov
Hackers,

I know there were already a couple of threads about 64-bit XIDs.
http://www.postgresql.org/message-id/42cccfe9.9040...@intellilink.co.jp
http://www.postgresql.org/message-id/4f6c0e13.3080...@wiesinger.com
I read them carefully, but I didn't find all the arguments for 64-bit XIDs
mentioned there. That's why I'd like to raise this subject again.

Hardware capabilities are now much higher than when Postgres was designed.
In modern PostgreSQL scalability tests it's typical to achieve 400,000 -
500,000 tps with pgbench. At such rates it takes only a few minutes to reach
the default autovacuum_freeze_max_age = 200 million.

The notion of wraparound has been evolving over time. Initially it was
something that almost never happened. Then it became something that could
happen rarely, and that we should be prepared for (by freezing tuples in
advance). Now it has become a fairly frequent, periodic event for high-load
databases, and DB admins have to take its performance impact into account.

A typical scenario that I've faced in real life goes like this. The database
is divided into operative and archive parts. The operative part is small
(dozens of gigabytes) and serves most of the transactions. The archive part
is relatively large (some terabytes) and serves rare selects and bulk
inserts. Autovacuum works very actively on the operative part and very
lazily on the archive part (as expected). The system works well until one
day the age of the archive tables exceeds autovacuum_freeze_max_age. Then
all autovacuum workers start doing anti-wraparound vacuum on the archive
tables. Even if the system I/O survives this, the operative tables get
bloated because all autovacuum workers are busy with the archive tables. In
such situations I typically advise increasing autovacuum_freeze_max_age and
running vacuum freeze manually when the system has enough free resources.

As I mentioned in the CSN thread, it would be nice to replace the XID with a
CSN when setting hint bits for a tuple. In that case, once the hint bits are
set, we don't need any additional lookups to check visibility.
http://www.postgresql.org/message-id/CAPpHfdv7BMwGv=ofug3s-jgvfkqhi79pr_zk1wsk-13oz+c...@mail.gmail.com
Introducing a 32-bit CSN doesn't seem reasonable to me, because it would
double our troubles with wraparound.

Also, I think it's possible to migrate to 64-bit XIDs without breaking
pg_upgrade. Old tuples can be left with 32-bit XIDs while new tuples
would be created with 64-bit XIDs. We can use free bits in t_infomask2 to
distinguish the old and new formats.

Any thoughts? Do you think 64-bit XIDs are worth it?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


[HACKERS] [BUG] CustomScan-custom_plans are not copied

2015-07-30 Thread Kouhei Kaigai
Hello,

The attached patch fixes up my careless mistakes when the custom_plans field
was added.
Please apply this fix.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com




pgsql-v9.5-fix-copy-custom-scan.patch
Description: pgsql-v9.5-fix-copy-custom-scan.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] creating extension including dependencies

2015-07-30 Thread Petr Jelinek

On 2015-07-27 15:18, Michael Paquier wrote:

On Sun, Jul 26, 2015 at 1:01 AM, Petr Jelinek wrote:

Yes, that's what I meant by the change of checking order in the explanation
above. I did that because I thought the code would be more complicated
otherwise, but apparently I was stupid...


+In case the extension specifies schema in its control file, the schema
s/schema/<literal>schema</literal>/

+++ b/src/test/modules/test_extensions/.gitignore
@@ -0,0 +1,3 @@
+# Generated subdirectories
+/results/
+/tmp_check/
test_extensions/.gitignore is missing /log/.

Something also has not been discussed yet: what to do with new_version
and old_version (the options of CreateExtensionStmt)? As of now if
those options are defined they are not passed down to the parent
extensions but shouldn't we raise an error if they are used in
combination with CASCADE? In any case, I think that the behavior
chosen should be documented.



I don't see why we should raise an error; they are used just for the 
top-level extension, and I think it makes sense that way. The CASCADE 
documentation says the SCHEMA option is cascaded to required extensions; do 
we need to say something more than that (i.e. explicitly saying that 
old_version and new_version aren't)?


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 64-bit XIDs again

2015-07-30 Thread Heikki Linnakangas

On 07/30/2015 05:57 PM, Alexander Korotkov wrote:

On Thu, Jul 30, 2015 at 5:24 PM, Heikki Linnakangas hlinn...@iki.fi wrote:


I think we should move to 64-bit XIDs in in-memory structs snapshots, proc
array etc. And expand clog to handle 64-bit XIDs. But keep the xmin/xmax
fields on heap pages at 32-bits, and add an epoch-like field to the page
header so that logically the xmin/xmax fields on the page are 64 bits wide,
but physically stored in 32 bits. That's possible as long as no two XIDs on
the same page are more than 2^31 XIDs apart. So you still need to freeze
old tuples on the page when that's about to happen, but it would make it
possible to have more than 2^32 XID transactions in the clog. You'd never
be forced to do anti-wraparound vacuums, you could just let the clog grow
arbitrarily large.


Nice idea. Storing an extra epoch would mean an extra 4 bytes per heap tuple
instead of an extra 8 bytes per tuple for 64-bit xmin/xmax.


No, I was thinking that the epoch would be stored *per page*, in the 
page header.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LWLock deadlock and gdb advice

2015-07-30 Thread Heikki Linnakangas

On 07/29/2015 09:35 PM, Andres Freund wrote:

On 2015-07-29 20:23:24 +0300, Heikki Linnakangas wrote:

Backend A has called LWLockWaitForVar(X) on a lock, and is now waiting on
it. The lock holder releases the lock, and wakes up A. But before A wakes up
and sees that the lock is free, another backend acquires the lock again. It
runs LWLockAcquireWithVar to the point just before setting the variable's
value. Now A wakes up, sees that the lock is still (or again) held, and that
the variable's value still matches the old one, and goes back to sleep. The
new lock holder won't wake it up until it updates the value again, or
releases the lock.


I'm not sure whether this observation is about my patch or the general
lwlock variable mechanism. In my opinion that behaviour exists today
both in 9.4 and 9.5.


In 9.4, LWLockAcquire holds the spinlock when it marks the lock as held, 
until it has updated the variable. And LWLockWaitForVar() holds the 
spinlock when it checks that the lock is held and that the variable's 
value matches. So it cannot happen on 9.4.


To reiterate, with 9.5, it's possible that a backend is sleeping in 
LWLockWaitForVar(oldvar=123), even though the lock is currently held by 
another backend with value 124. That seems wrong, or surprising at the 
very least.



But I think that's fine because that race seems pretty
fundamental. After all, you could have called LWLockWaitForVar() just
after the second locker had set the variable to the same value.


I'm not talking about setting it to the same value. I'm talking about 
setting it to a different value. (I talked about setting it to the same 
value later in the email, and I agree that's a pretty fundamental 
problem and exists with 9.4 as well).


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 64-bit XIDs again

2015-07-30 Thread Joshua D. Drake


On 07/30/2015 08:04 AM, Simon Riggs wrote:


There is a big downside to expanding xmin/xmax to 64 bits: it takes
space. More space means more memory needed for caching, more memory
bandwidth, more I/O, etc.


My feeling is that the overhead will recede in time. Having a nice,
simple change to remove old bugs and new would help us be more robust.

But let's measure the overhead before we try to optimize it away.


My field experience would agree with you. The amount of memory people 
are arbitrarily throwing at databases now is pretty significant. It is 
common to have 64GB of memory. Heck, I run into 128GB all the time, and 
seeing 192GB is no longer a "wow".


JD



--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 64-bit XIDs again

2015-07-30 Thread Joe Conway

On 07/30/2015 07:14 AM, Simon Riggs wrote:
 On 30 July 2015 at 14:26, Alexander Korotkov
 a.korot...@postgrespro.ru
 
 Any thoughts? Do you think 64-bit XIDs are worth it?
 
 The problem of freezing is painful, but not impossible, which is
 why we have held out so long.
 
 The problem of very long lived snapshots is coming closer at the
 same speed as freezing; there is no solution to that without 64-bit
 xids throughout whole infrastructure, or CSNs.
 
 The opportunity for us to have SQL Standard historical databases
 becomes possible with 64-bit xids, or CSNs. That is a high value
 goal.
 
 I personally now think we should thoroughly investigate 64-bit
 xids. I don't see this as mere debate, I see this as something that
 we can make a patch for and scientifically analyze the pros and
 cons through measurement.

+1

I've been thinking along similar lines to both of you for quite some
time now. I think at the least we should explore an initdb-time option
-- we can and should measure the pros and cons.


-- 
Joe Conway


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 64-bit XIDs again

2015-07-30 Thread Alexander Korotkov
On Thu, Jul 30, 2015 at 5:24 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 07/30/2015 04:26 PM, Alexander Korotkov wrote:

 Also, I think it's possible to migrate to 64-bit XIDs without breaking
 pg_upgrade. Old tuples can be left with 32-bit XIDs while new tuples
 would be created with 64-bit XIDs. We can use free bits in t_infomask2 to
 distinguish old and new formats.


 I think we should move to 64-bit XIDs in in-memory structs snapshots, proc
 array etc. And expand clog to handle 64-bit XIDs. But keep the xmin/xmax
 fields on heap pages at 32-bits, and add an epoch-like field to the page
 header so that logically the xmin/xmax fields on the page are 64 bits wide,
 but physically stored in 32 bits. That's possible as long as no two XIDs on
 the same page are more than 2^31 XIDs apart. So you still need to freeze
 old tuples on the page when that's about to happen, but it would make it
 possible to have more than 2^32 XID transactions in the clog. You'd never
 be forced to do anti-wraparound vacuums, you could just let the clog grow
 arbitrarily large.


Nice idea. Storing an extra epoch would mean an extra 4 bytes per heap tuple
instead of an extra 8 bytes per tuple for 64-bit xmin/xmax.
But if the first column is aligned to 8 bytes (e.g. bigserial), would we lose
this 4-byte win to alignment?


There is a big downside to expanding xmin/xmax to 64 bits: it takes space.
 More space means more memory needed for caching, more memory bandwidth,
 more I/O, etc.



--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] 64-bit XIDs again

2015-07-30 Thread Simon Riggs
On 30 July 2015 at 14:26, Alexander Korotkov a.korot...@postgrespro.ru
wrote:


 Any thoughts? Do you think 64-bit XIDs are worth it?


The problem of freezing is painful, but not impossible, which is why we
have held out so long.

The problem of very long lived snapshots is coming closer at the same speed
as freezing; there is no solution to that without 64-bit xids throughout
whole infrastructure, or CSNs.

The opportunity for us to have SQL Standard historical databases becomes
possible with 64-bit xids, or CSNs. That is a high value goal.

I personally now think we should thoroughly investigate 64-bit xids. I
don't see this as mere debate, I see this as something that we can make a
patch for and scientifically analyze the pros and cons through measurement.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] 64-bit XIDs again

2015-07-30 Thread Heikki Linnakangas

On 07/30/2015 04:26 PM, Alexander Korotkov wrote:

Also, I think it's possible to migrate to 64-bit XIDs without breaking
pg_upgrade. Old tuples can be left with 32-bit XIDs while new tuples
would be created with 64-bit XIDs. We can use free bits in t_infomask2 to
distinguish old and new formats.


I think we should move to 64-bit XIDs in in-memory structs snapshots, 
proc array etc. And expand clog to handle 64-bit XIDs. But keep the 
xmin/xmax fields on heap pages at 32-bits, and add an epoch-like field 
to the page header so that logically the xmin/xmax fields on the page 
are 64 bits wide, but physically stored in 32 bits. That's possible as 
long as no two XIDs on the same page are more than 2^31 XIDs apart. So 
you still need to freeze old tuples on the page when that's about to 
happen, but it would make it possible to have more than 2^32 XID 
transactions in the clog. You'd never be forced to do anti-wraparound 
vacuums, you could just let the clog grow arbitrarily large.
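
To illustrate the arithmetic (a sketch only, names invented; a real
implementation would also have to cope with XIDs that straddle an epoch
boundary within the 2^31 window):

#include <stdint.h>

/* reconstruct the logical 64-bit XID from a hypothetical per-page epoch
 * plus the 32-bit xmin/xmax stored in the tuple header */
static inline uint64_t
page_xid_to_logical(uint32_t page_epoch, uint32_t xid32)
{
    return ((uint64_t) page_epoch << 32) | xid32;
}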


There is a big downside to expanding xmin/xmax to 64 bits: it takes 
space. More space means more memory needed for caching, more memory 
bandwidth, more I/O, etc.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 64-bit XIDs again

2015-07-30 Thread Simon Riggs
On 30 July 2015 at 15:24, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 07/30/2015 04:26 PM, Alexander Korotkov wrote:

 Also, I think it's possible to migrate to 64-bit XIDs without breaking
 pg_upgrade. Old tuples can be left with 32-bit XIDs while new tuples
 would be created with 64-bit XIDs. We can use free bits in t_infomask2 to
 distinguish old and new formats.


 I think we should move to 64-bit XIDs in in-memory structs snapshots, proc
 array etc. And expand clog to handle 64-bit XIDs. But keep the xmin/xmax
 fields on heap pages at 32-bits, and add an epoch-like field to the page
 header so that logically the xmin/xmax fields on the page are 64 bits wide,
 but physically stored in 32 bits. That's possible as long as no two XIDs on
 the same page are more than 2^31 XIDs apart. So you still need to freeze
 old tuples on the page when that's about to happen, but it would make it
 possible to have more than 2^32 XID transactions in the clog. You'd never
 be forced to do anti-wraparound vacuums, you could just let the clog grow
 arbitrarily large.


This is a good scheme, but it assumes, as you say, that you can freeze
tuples that are more than 2^31 xids apart. That is no longer a safe
assumption on high transaction rate systems with longer lived snapshots.


 There is a big downside to expanding xmin/xmax to 64 bits: it takes space.
 More space means more memory needed for caching, more memory bandwidth,
 more I/O, etc.


My feeling is that the overhead will recede in time. Having a nice, simple
change to remove old bugs and new would help us be more robust.

But let's measure the overhead before we try to optimize it away.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] brin index vacuum versus transaction snapshots

2015-07-30 Thread Tom Lane
Alvaro Herrera alvhe...@2ndquadrant.com writes:
 Kevin Grittner wrote:
 If you run `make installcheck` against a cluster with
 default_transaction_isolation = 'repeatable read' you get one
 failure:

 I am tempted to say that we should just disallow to run vacuum on a
 table containing a brin index in a transaction-snapshot transaction.

Huh?  We don't allow vacuum inside a (user started) transaction now.
What new restriction are you proposing?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 64-bit XIDs again

2015-07-30 Thread Tom Lane
Gavin Flower gavinflo...@archidevsys.co.nz writes:
 On 31/07/15 02:24, Heikki Linnakangas wrote:
 There is a big downside to expanding xmin/xmax to 64 bits: it takes 
 space. More space means more memory needed for caching, more memory 
 bandwidth, more I/O, etc.

 I think having a special case to save 32 bits per tuple would cause 
 unnecessary complications, and the savings are minimal compared to the 
 size of current modern storage devices and the typical memory used in 
 serious database servers.

I think the argument that the savings are minimal is pretty thin.
It all depends on how wide your tables are --- but on a narrow table, say
half a dozen ints, the current tuple size is 24 bytes header plus the same
number of bytes of data.  We'd be going up to 32 bytes header which makes
for a 16% increase in physical table size.  If your table is large,
claiming that 16% doesn't hurt is just silly.

But the elephant in the room is on-disk compatibility.  There is
absolutely no way that we can just change xmin/xmax to 64 bits without a
disk format break.  However, if we do something like what Heikki is
suggesting, it's at least conceivable that we could convert incrementally
(ie, if you find a page with the old header format, assume all tuples in
it are part of epoch 0; and do not insert new tuples into it unless there
is room to convert the header to new format ... but I'm not sure what we
do about tuple deletion if the old page is totally full and we need to
write an xmax that's past 4G).

Only if you are willing to kiss off on-disk compatibility is it even
worth having a discussion about whether we can afford more bloat in
HeapTupleHeader.  And that would be a pretty big pain point for a lot
of users.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 64-bit XIDs again

2015-07-30 Thread Josh Berkus
On 07/30/2015 07:24 AM, Heikki Linnakangas wrote:
 
 I think we should move to 64-bit XIDs in in-memory structs snapshots,
 proc array etc. And expand clog to handle 64-bit XIDs. But keep the
 xmin/xmax fields on heap pages at 32-bits, and add an epoch-like field
 to the page header so that logically the xmin/xmax fields on the page
 are 64 bits wide, but physically stored in 32 bits. That's possible as
 long as no two XIDs on the same page are more than 2^31 XIDs apart. So
 you still need to freeze old tuples on the page when that's about to
 happen, but it would make it possible to have more than 2^32 XID
 transactions in the clog. You'd never be forced to do anti-wraparound
 vacuums, you could just let the clog grow arbitrarily large

When I introduced the same idea a few years back, having the clog get
arbitrarily large was cited as a major issue.  I was under the
impression that clog size had some major performance impacts.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 64-bit XIDs again

2015-07-30 Thread Gavin Flower

On 31/07/15 02:24, Heikki Linnakangas wrote:

On 07/30/2015 04:26 PM, Alexander Korotkov wrote:

Also, I think it's possible to migrate to 64-bit XIDs without breaking
pg_upgrade. Old tuples can be left with 32-bit XIDs while new tuples
would be created with 64-bit XIDs. We can use free bits in 
t_infomask2 to

distinguish old and new formats.


I think we should move to 64-bit XIDs in in-memory structs snapshots, 
proc array etc. And expand clog to handle 64-bit XIDs. But keep the 
xmin/xmax fields on heap pages at 32-bits, and add an epoch-like field 
to the page header so that logically the xmin/xmax fields on the page 
are 64 bits wide, but physically stored in 32 bits. That's possible as 
long as no two XIDs on the same page are more than 2^31 XIDs apart. So 
you still need to freeze old tuples on the page when that's about to 
happen, but it would make it possible to have more than 2^32 XID 
transactions in the clog. You'd never be forced to do anti-wraparound 
vacuums, you could just let the clog grow arbitrarily large.


There is a big downside to expanding xmin/xmax to 64 bits: it takes 
space. More space means more memory needed for caching, more memory 
bandwidth, more I/O, etc.


- Heikki



I think having a special case to save 32 bits per tuple would cause 
unnecessary complications, and the savings are minimal compared to the 
size of current modern storage devices and the typical memory used in 
serious database servers.


I think it is too much pain for very little gain, especially when 
looking into the future growth in storage capacity and bandwidth.


The early mainframes used a base displacement technique to keep the size 
of addresses down in instructions: 16 bit addresses, comprising 4 bits 
for a base register and 12 bits for the displacement (hence the use of 
4KB page sizes now!).  Necessary at the time when mainframes were often 
less than 128 KB!  Now it would be ludicrous to do that for modern servers!



Cheers,
Gavin

(Who is ancient enough to have programmed such mainframes!)


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics / patch v7

2015-07-30 Thread Tomas Vondra

Hi,

On 07/30/2015 06:58 PM, Heikki Linnakangas wrote:


The problem with a threshold is that around that threshold, even a
small change in the data set can drastically change the produced
estimates. For example, imagine that we know from the stats that zip
code implies city. But then someone adds a single row to the table
with an odd zip code & city combination, which pushes the estimator
over the threshold, and the columns are no longer considered
dependent, and the estimates are now completely different. We should
avoid steep cliffs like that.

BTW, what is the threshold in the current patch?


There's not a simple threshold - the algorithm mining the functional 
dependencies is a bit more complicated. I tried to explain it in the 
comment before build_mv_dependencies (in dependencies.c), but let me 
briefly summarize it here.


To mine dependency [A => B], build_mv_dependencies does this:

(1) sort the sample by {A,B}

(2) split the sample into groups with the same value of A

(3) for each group, decide if it's consistent with the dependency

(a) if the group is too small (less than 3 rows), ignore it

(b) if the group is consistent, update

n_supporting
n_supporting_rows

(c) if the group is inconsistent, update

n_contradicting
n_contradicting_rows

(4) decide whether the dependency is valid by checking

n_supporting_rows >= n_contradicting_rows * 10

The limit is rather arbitrary and yes - I can imagine a more complex 
condition (e.g. looking at average number of tuples per group etc.), but 
I haven't looked into that - the point was to use something very simple, 
only to illustrate the infrastructure.
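
A compact sketch of that procedure, operating on an already-sorted array of
(A, B) value pairs (the representation is simplified compared to what
dependencies.c actually does with the sampled rows):

#include <stdbool.h>

typedef struct { int a; int b; } SampleRow;

static bool
dependency_a_implies_b(const SampleRow *rows, int nrows)
{
    int n_supporting_rows = 0;
    int n_contradicting_rows = 0;
    int i = 0;

    while (i < nrows)
    {
        int  start = i;
        bool consistent = true;

        /* (2) collect the group of rows sharing the same value of A */
        while (i < nrows && rows[i].a == rows[start].a)
        {
            if (rows[i].b != rows[start].b)
                consistent = false;     /* some B differs within the group */
            i++;
        }

        /* (a) ignore groups that are too small to be meaningful */
        if (i - start < 3)
            continue;

        if (consistent)
            n_supporting_rows += (i - start);       /* (b) */
        else
            n_contradicting_rows += (i - start);    /* (c) */
    }

    /* (4) accept the dependency only with a 10:1 majority of supporters */
    return n_supporting_rows >= n_contradicting_rows * 10;
}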


I think we might come up with some elaborate way of associating a degree 
with the functional dependency, but at that point we really lose the 
simplicity, and also make it indistinguishable from the remaining 
statistics (because it won't be possible to reduce the clauses like 
this, before performing the regular estimation). Which is exactly what 
makes the functional dependencies so neat and efficient, so I'm not 
overly enthusiastic about doing that.


What seems more interesting is implementing the ndistinct coefficient 
instead, as proposed by Kyotaro-san - that seems to have the nice 
smooth behavior you desire, while keeping the simplicity.


Both statistics types (functional dependencies and ndistinct coeff) have 
one weak point, though - they somehow assume the queries use 
compatible values. For example if you use a query with


   WHERE city = 'New York' AND zip = 'zip for Detroit'

they can't detect cases like this, because those statistics types are 
oblivious to individual values. I don't see this as a fatal flaw, though 
- it's rather a consequence of the nature of the stats. And I tend to 
look at the functional dependencies the same way.


If you need stats without these issues you'll have to use MCV list or 
a histogram. Trying to fix the simple statistics types is futile, IMHO.


regards
Tomas

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] brin index vacuum versus transaction snapshots

2015-07-30 Thread Alvaro Herrera
Kevin Grittner wrote:
 If you run `make installcheck` against a cluster with
 default_transaction_isolation = 'repeatable read' you get one
 failure:

I am tempted to say that we should just disallow running vacuum on a
table containing a brin index in a transaction-snapshot transaction.
It is possible to silence the problem by checking for vacuum flags, but
(without testing) I think there will still be a problem, because the snapshot
is acquired too early and it is possible for concurrent transactions to
insert tuples into the table that the summarizing scan will not see, which
will cause the index to become effectively corrupt.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-07-30 Thread Robert Haas
On Sun, Jul 19, 2015 at 4:16 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Josh Berkus j...@agliodbs.com writes:
 On 07/17/2015 04:36 PM, Jim Nasby wrote:
 I'm guessing it'd be really ugly/hard to support at least this GUC being
 multi-line?

 Mind you, multi-line GUCs would be useful otherwise, but we don't want
 to hinge this feature on making that work.

 I'm pretty sure that changing the GUC parser to allow quoted strings to
 continue across lines would be trivial.  The problem with it is not that
 it's hard, it's that omitting a closing quote mark would then result in
 the entire file being syntactically broken, with the error message(s)
 almost certainly pointing somewhere else than where the actual mistake is.
 Do we really want such a global reduction in friendliness to make this
 feature easier?

Maybe shoehorning this into the GUC mechanism is the wrong thing, and
what we really need is a new config file for this.  The information
we're proposing to store seems complex enough to justify that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 64-bit XIDs again

2015-07-30 Thread Arthur Silva
On Thu, Jul 30, 2015 at 5:31 PM, Gavin Flower gavinflo...@archidevsys.co.nz
 wrote:

 On 31/07/15 02:24, Heikki Linnakangas wrote:

 On 07/30/2015 04:26 PM, Alexander Korotkov wrote:

 Also, I think it's possible to migrate to 64-bit XIDs without breaking
 pg_upgrade. Old tuples can be left with 32-bit XIDs while new tuples
 would be created with 64-bit XIDs. We can use free bits in t_infomask2 to
 distinguish old and new formats.


 I think we should move to 64-bit XIDs in in-memory structs snapshots,
 proc array etc. And expand clog to handle 64-bit XIDs. But keep the
 xmin/xmax fields on heap pages at 32-bits, and add an epoch-like field to
 the page header so that logically the xmin/xmax fields on the page are 64
 bits wide, but physically stored in 32 bits. That's possible as long as no
 two XIDs on the same page are more than 2^31 XIDs apart. So you still need
 to freeze old tuples on the page when that's about to happen, but it would
 make it possible to have more than 2^32 XID transactions in the clog. You'd
 never be forced to do anti-wraparound vacuums, you could just let the clog
 grow arbitrarily large.

 There is a big downside to expanding xmin/xmax to 64 bits: it takes
 space. More space means more memory needed for caching, more memory
 bandwidth, more I/O, etc.

 - Heikki



 I think having a special case to save 32 bits per tuple would cause
 unnecessary complications, and the savings are minimal compared to the size
 of current modern storage devices and the typical memory used in serious
 database servers.

 I think it is too much pain for very little gain, especially when looking
 into the future growth in storage capacity and bandwidth.

 The early mainframes used a base displacement technique to keep the size
 of addresses down in instructions: 16 bit addresses, comprising 4 bits for
 a base register and 12 bits for the displacement (hence the use of 4KB
 page sizes now!).  Necessary at the time when mainframes were often less
 than 128 KB!  Now it would be ludicrous to do that for modern servers!


 Cheers,
 Gavin

 (Who is ancient enough, to have programmed such MainFrames!)



 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers


On the other hand, PG's tuple overhead is already the largest among the
alternatives.
Even if storage keeps getting faster and cheaper, you can't ignore the
overhead of adding yet another 8 bytes to each tuple.
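
For reference, a simplified picture of the fixed part of the heap tuple
header (loosely following htup_details.h; the real struct keeps the first
three fields in a union):

#include <stdint.h>

/* ~23 bytes of fixed header, MAXALIGN'd to 24; widening xmin/xmax to
 * 64 bits would add another 8 bytes before the user data even starts */
struct heap_tuple_header_sketch
{
    uint32_t t_xmin;        /* inserting transaction ID */
    uint32_t t_xmax;        /* deleting/locking transaction ID */
    uint32_t t_cid;         /* command ID (or t_xvac) */
    uint8_t  t_ctid[6];     /* TID of this or newer row version */
    uint16_t t_infomask2;   /* number of attributes + flag bits */
    uint16_t t_infomask;    /* flag bits */
    uint8_t  t_hoff;        /* offset to user data */
};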


Re: [HACKERS] Improving test coverage of extensions with pg_dump

2015-07-30 Thread Michael Paquier
On Fri, Jul 31, 2015 at 6:41 AM, Andreas Karlsson andr...@proxel.se wrote:
 The comment is good, but what I personally do not like is that the path to
 the test suite is non-obvious and not self-explanatory: I would not expect
 src/test/tables_fk/t/ to test pg_dump and extensions. I find it too confusing
 to regard the test suite as testing an extension called tables_fk, since in
 my mind that is just a necessary tool built to test something else.

 This is obviously a subjective thing, and I see that for example Heikki
 likes it the way it is. I am fine with setting this as ready for committer
 and letting a committer weigh in on the naming.

Well, honestly, I would just have it in src/test/modules and call it a
deal. Because it is now the only extension that has such interactions
with perl tests. And because if modules/ gets bloated later on, we
could extend prove_check to install modules from extra paths, just
that it does not seem worth it to do it now for one module, and there
is no guarantee that we'll have that many around. Of course there is
no guarantee either that modules/ is not going to get bloated.

Note as well that I will be fine with any decision taken by the
committer who picks up this, this test case is useful by itself in any
case.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Microvacuum for gist.

2015-07-30 Thread Jim Nasby

On 7/30/15 7:33 AM, Alexander Korotkov wrote:

2) Generating notices for every dead tuple would be too noisy. I suggest
replacing the notice with one of the debug levels.

+ elog(NOTICE, "gistkillitems. Mark Item Dead offnum %hd, blkno %d",
offnum, BufferGetBlockNumber(buffer));


Even that seems like pretty serious overkill. vacuumlazy.c doesn't have 
anything like that, and I don't think the BTree code does either. If you 
were debugging something and actually needed it I'd say drop in a 
temporary printf().

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort

2015-07-30 Thread Peter Geoghegan
On Thu, Jul 30, 2015 at 12:00 AM, Heikki Linnakangas hlinn...@iki.fi wrote:
 Hmm. You don't really need to merge the in-memory array into the tape, as
 you know that all the tuples in the in-memory must come after the tuples
 already on the tape. You can just return all the tuples from the tape first,
 and then all the tuples from the array.

It's more complicated than it appears, I think. Tuples may be variable
sized. WRITETUP() performs a pfree(), and gives us back a variable
amount of availMem. What if we dumped a single, massive, outlier tuple
out when a caller passes it and it goes to the root of the heap? We'd
dump that massive tuple in one go (this would be an incremental
dumptuples() call, which we still do in the patch), making things
!LACKMEM() again, but by an unusually comfortable margin. We read in a
few more regular tuples, but we're done consuming tuples before things
ever get LACKMEM() again (no more dumping needed, at least with this
patch applied).

What prevents the tuple at the top of the in-memory heap at the point
of tuplesort_performsort() (say, one of the ones added to the heap as
our glut of memory was *partially* consumed) being less than the
last/greatest tuple on tape? If the answer is nothing, a merge step
is clearly required.

This is not a problem when every single tuple is dumped, but that
doesn't happen anymore. I probably should have shown more tests, that
tested HeapTuple sorts (not just datum tuple sorts). I agree that
things at least usually happen as you describe, FWIW.

 I think it'd make sense to structure the code differently, to match the way
 I described this optimization above. Instead of adding a new tuplesort state
 for this, abstract this in the logical tape code. Add a function to attach
 an in-memory tail to a tape, and have LogicalTapeRead() read from the tail
 after reading the on-disk tape. The rest of the code wouldn't need to care
 that sometimes part of the tape is actually in memory.

I'll need to think about all of that. Certainly, quicksorting runs in
a more general way seems like a very promising idea, and I know that
this patch does not go far enough. But it also adds this idea of not
dumping out most tuples, which seems impossible to generalize further,
so maybe it's a useful special case to start from for that reason (and
also because it's where the pain is currently felt most often).
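
Just to make sure I'm picturing the same thing, the tail-attachment idea
would be roughly (hypothetical API; SortTuple and Tuplesortstate are
tuplesort.c's private types, and tape_gettuple_from_disk() stands in for the
existing read path):

typedef struct TapeTail
{
	SortTuple  *tuples;		/* quicksorted in-memory tail of the run */
	int			ntuples;
	int			next;		/* next tail tuple to hand out */
} TapeTail;

static bool
tape_gettuple_with_tail(Tuplesortstate *state, int tapenum,
						TapeTail *tail, SortTuple *out)
{
	/* serve the on-disk portion of the run first */
	if (tape_gettuple_from_disk(state, tapenum, out))
		return true;

	/* then fall back to the in-memory tail */
	if (tail->next < tail->ntuples)
	{
		*out = tail->tuples[tail->next++];
		return true;
	}
	return false;			/* run exhausted */
}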

 +* Note that there might actually be 2 runs, but only the
 +* contents of one of them went to tape, and so we can
 +* safely pretend that there is only 1 run (since we're
 +* about to give up on the idea of the memtuples array being
 +* a heap).  This means that if our sort happened to require
 +* random access, the similar single run optimization
 +* below (which sets TSS_SORTEDONTAPE) might not be used at
 +* all.  This is because dumping all tuples out might have
 +* forced an otherwise equivalent randomAccess case to
 +* acknowledge a second run, which we can avoid.


 Is that really true? We don't start a second run until we have to, i.e. when
 it's time to dump the first tuple of the second run to tape. So I don't
 think the case you describe above, where you have two runs but only one of
 them has tuples on disk, can actually happen.

I think we're talking about two slightly different things. I agree
that I am avoiding starting a second run because I am avoiding
dumping tuples, just as you say (I describe this as avoiding
acknowledging a second run). But there could still be SortTuples
that have a tupindex that is > 0 (they could be 1, to be specific).
It's pretty clear from looking at the TSS_BUILDRUNS case within
puttuple_common() that this is true. So, if instead you define
starting a tuple as adding a sortTuple with a tupindex that is > 0,
then yes, this comment is true.

The important thing is that since we're not dumping every tuple, it
doesn't matter whether or not that TSS_BUILDRUNS case within
puttuple_common() ever took the currentRun + 1 insertion path (which
can easily happen), provided things aren't so skewed that it ends up
on tape even without dumping all tuples (which seems much less
likely). As I've said, this optimization will occur a lot more often
than the existing one run optimization (assuming !randomAccess), as a
nice side benefit of not dumping every tuple. Quicksort does not use
HEAPCOMPARE(), so clearly having multiple runs in that subrun is a
non-issue.

Whether or not we say that a second run started, or that there was
merely the unfulfilled intent to start a new, second run is just
semantics. While I certainly care about semantics, my point is that we
agree that this useful "pretend there is only one run" thing happens
(I think). The existing one run optimization only really occurs when
the range of values in the set of tuples is well characterized by the
tuple values observed during initial heapification, which is bad. Or
would be bad, if the existing 

Re: [HACKERS] TAP tests are badly named

2015-07-30 Thread Andrew Dunstan


On 07/30/2015 12:40 PM, Andrew Dunstan wrote:


We should describe test sets by what they test, not by how they test. 
TAP is a testing tool/protocol. The current set of tests we have tests 
the programs in src/bin, and we should really name the test set by a 
name that reflects that, rather than the fact that we are using TAP 
tools to run the tests. What if we decide to test something else using 
TAP? Would we call that set of tests TAP tests too?


--enable-tap-tests is a reasonable configuration setting, because it's 
about whether or not we have a TAP testing framework available, but I 
think we should stop calling the bin tests TAP tests and we should 
change the test name in vcregress.pl to a more appropriate name. In 
the buildfarm I'm calling the step bin-check: 
http://www.pgbuildfarm.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2015-07-30%2012%3A25%3A58&stg=bin-check


Thoughts?





In fact, looking more closely at the changes that have been made to 
vcregress.pl, I don't really like the way this has been done. I'm 
putting together some changes to bring things more into line with how 
the Makefiles work.


cheers

andrew





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] patch: prevent user from setting wal_buffers over 2GB bytes

2015-07-30 Thread Josh Berkus
Hackers,

In guc.c, the maximum for wal_buffers is INT_MAX.  However, wal_buffers
is actually measured in 8KB buffers, not in bytes.  This means that
users are able to set wal_buffers > 2GB.  When the database is started,
this can cause a core dump if the WAL offset is > 2GB.
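
The arithmetic, as a standalone illustration: anything above
INT_MAX / XLOG_BLCKSZ blocks makes the byte offset overflow a 32-bit int.

#include <limits.h>
#include <stdio.h>

#define XLOG_BLCKSZ 8192	/* default WAL block size */

int
main(void)
{
	int			max_safe = INT_MAX / XLOG_BLCKSZ;	/* 262143 */
	long long	bytes = (long long) (max_safe + 1) * XLOG_BLCKSZ;

	printf("max safe wal_buffers = %d blocks\n", max_safe);
	printf("one block more = %lld bytes > INT_MAX (%d)\n", bytes, INT_MAX);
	return 0;
}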

Attached patch fixes this issue by lowering the maximum for wal_buffers:

josh@Radegast:~/pg94a$ bin/pg_ctl -D data start
server starting
josh@Radegast:~/pg94a$ LOG:  393216 is outside the valid range for
parameter "wal_buffers" (-1 .. 262143)
FATAL:  configuration file /home/josh/pg94a/data/postgresql.conf
contains errors

Thanks to Andrew Gierth for diagnosing this issue.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index 1b7b914..b3dac51
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*** static struct config_int ConfigureNamesI
*** 2215,2221 
  			GUC_UNIT_XBLOCKS
  		},
  		&XLOGbuffers,
! 		-1, -1, INT_MAX,
  		check_wal_buffers, NULL, NULL
  	},
  
--- 2215,2221 
  			GUC_UNIT_XBLOCKS
  		},
  		&XLOGbuffers,
! 		-1, -1, (INT_MAX / XLOG_BLCKSZ),
  		check_wal_buffers, NULL, NULL
  	},
  

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 64-bit XIDs again

2015-07-30 Thread Petr Jelinek

On 2015-07-30 23:23, Tom Lane wrote:

Gavin Flower gavinflo...@archidevsys.co.nz writes:

On 31/07/15 02:24, Heikki Linnakangas wrote:

There is a big downside to expanding xmin/xmax to 64 bits: it takes
space. More space means more memory needed for caching, more memory
bandwidth, more I/O, etc.



I think having a special case to save 32 bits per tuple would cause
unnecessary complications, and the savings are minimal compared to the
size of current modern storage devices and the typical memory used in
serious database servers.


I think the argument that the savings are minimal is pretty thin.
It all depends on how wide your tables are --- but on a narrow table, say
half a dozen ints, the current tuple size is 24 bytes header plus the same
number of bytes of data.  We'd be going up to 32 bytes header which makes
for a 16% increase in physical table size.  If your table is large,
claiming that 16% doesn't hurt is just silly.
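
As a quick back-of-the-envelope check of that figure (my own arithmetic,
ignoring alignment, line pointers and other per-page overhead):

#include <stdio.h>

int
main(void)
{
    int header_now = 24;    /* current heap tuple header, per the example */
    int header_64  = 32;    /* header widened for 64-bit xmin/xmax */
    int data       = 24;    /* half a dozen 4-byte integers */

    printf("per-tuple growth: %.1f%%\n",
           100.0 * (header_64 - header_now) / (header_now + data));
    return 0;
}

which prints roughly 16.7%.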

But the elephant in the room is on-disk compatibility.  There is
absolutely no way that we can just change xmin/xmax to 64 bits without a
disk format break.  However, if we do something like what Heikki is
suggesting, it's at least conceivable that we could convert incrementally
(ie, if you find a page with the old header format, assume all tuples in
it are part of epoch 0; and do not insert new tuples into it unless there
is room to convert the header to new format ...


We could theoretically do a similar thing with 64-bit xmin/xmax though - 
detect that a page is in the old format and convert all tuples there to 64-bit 
xmin/xmax.


But I agree that we don't want to increase bloat per tuple as it's 
already too big.



but I'm not sure what we
do about tuple deletion if the old page is totally full and we need to
write an xmax that's past 4G).



If the page is too full we could move some data to a different (or new) page.

For me the bigger issue is that we'll still have to refreeze pages: if 
tuples are updated or deleted in a different epoch than the one they 
were inserted in, the new version of the tuple has to go to a different page, 
and the old page will have free space that can't be used by new tuples 
since the system is now in a different epoch.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Reload SSL certificates on SIGHUP

2015-07-30 Thread Andreas Karlsson

On 07/29/2015 03:24 AM, Tom Lane wrote:

Peter Eisentraut pete...@gmx.net writes:

I don't have a problem with rebuilding the SSL context on every reload
cycle.  We already do a lot of extra reloading every time, so a bit more
shouldn't hurt.  But I'm not so sure whether we should do that in the
SIGHUP handler.  I don't know how we got into the situation of doing all
the file reloads directly in the handler, but at least we can control
that code.  Making a bunch of calls into an external library is a
different thing, though.  Can we find a way to do this differently?


Do we have an idea how expensive it is to load that data?

A brute-force answer is to not have the postmaster load it at all,
but to have new backends do so (if needed) during their connection
acceptance/authentication phase.  I'm not sure how much that would
add to the SSL connection startup time though.  It would also mean
that problems with the SSL config files would only be reported during
subsequent connection starts, not at SIGHUP time, and indeed that
SIGHUP is more or less meaningless for SSL file changes: the instant
you change a file, it's live for later connections.  On the plus side,
it would make Windows and Unix behavior closer, since (I suppose)
we're reloading that stuff anyway in EXEC_BACKEND builds.


I measured it taking ~0.3ms to build the new SSL context in a simple 
benchmark (certificate + CA + small crl).


Personally I do not think moving this to connection start would be worth 
it, since reporting errors that late is not nice for people who have 
misconfigured their database, and even if my benchmarks indicate it is 
relatively cheap to reload SSL, adding more work to connection 
establishment is something I would want to avoid unless it gives us a 
clear benefit.
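
For anyone wanting to reproduce a number like that, a rough standalone
sketch of the measurement against plain OpenSSL (this is not the
backend's be-secure code; the file names are placeholders, and older
OpenSSL versions need SSLv23_server_method() instead of
TLS_server_method()):

#include <stdio.h>
#include <time.h>
#include <openssl/ssl.h>
#include <openssl/err.h>

int
main(void)
{
    SSL_library_init();
    SSL_load_error_strings();

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* roughly what a reload has to redo: build a fresh context */
    SSL_CTX *ctx = SSL_CTX_new(TLS_server_method());
    if (ctx == NULL ||
        SSL_CTX_use_certificate_chain_file(ctx, "server.crt") != 1 ||
        SSL_CTX_use_PrivateKey_file(ctx, "server.key", SSL_FILETYPE_PEM) != 1 ||
        SSL_CTX_load_verify_locations(ctx, "root.crt", NULL) != 1)
    {
        ERR_print_errors_fp(stderr);
        return 1;
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("context rebuild: %.3f ms\n",
           (t1.tv_sec - t0.tv_sec) * 1000.0 +
           (t1.tv_nsec - t0.tv_nsec) / 1e6);

    SSL_CTX_free(ctx);
    return 0;
}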



I'm not entirely sure your concern is valid, though.  We have always had
the principle that almost everything of interest in the postmaster happens
in signal handler functions.  We could possibly change things so that
reloading config files is done in the main loop of ServerLoop, but
if we did, it would have to execute with all signals blocked, which seems
like just about as much of a risk for third-party code as executing that
code in a signal handler is.

Agreed, I am not sure what moving it to the main loop would gain us.

--
Andreas


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort

2015-07-30 Thread Peter Geoghegan
On Thu, Jul 30, 2015 at 11:32 AM, Robert Haas robertmh...@gmail.com wrote:
 Very interesting.  And great performance numbers.  Thanks for taking
 the time to investigate this - really cool.

Thanks.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort

2015-07-30 Thread Peter Geoghegan
On Thu, Jul 30, 2015 at 3:47 AM, Simon Riggs si...@2ndquadrant.com wrote:
 This is a good optimization for the common case where tuples are mostly
 already in order. We could increase the usefulness of this by making UPDATE
 pick blocks that are close to a tuple's original block, rather than putting
 them near the end of a relation.

Not sure what you mean here.

 So here's a shorter/different explanation of this optimization: When it's
 time to perform the sort, instead of draining the in-memory heap one tuple
 at a time to the last tape, you sort the heap with quicksort, and pretend
 that the sorted heap belongs to the last tape, after all the other tuples in
 the tape.

 Some questions/thoughts on that:

 Isn't that optimization applicable even when you have multiple runs?
 Quicksorting the heap and keeping it as an array in memory is surely always
 faster than heapsorting and pushing it to the tape.


 It's about use of memory. If you have multiple runs on tape, then they will
 need to be merged and you need memory to do that efficiently. If there are
 tuples in the last batch still in memory then it can work, but it depends
 upon how full memory is from the last batch and how many batches there are.

I agree that this optimization has a lot to do with exploiting the
fact that you don't need to free the memtuples array for future runs
because you've already received all tuples (or keep space free for
previous runs). I think that we should still use quicksort on runs
where this optimization doesn't work out, but I also still think that
that's a different idea. Doing what I've proposed here when there are
multiple runs seems less valuable, if only because you're not going to
avoid that much writing.
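
For readers following along, the shape of the single-run case being
discussed is roughly the following toy sketch (integers and a plain
array standing in for the tape; nothing here is from tuplesort.c):
quicksort whatever is still in memory and merge it, as a virtual final
run, with the run that was already written out.

#include <stdio.h>
#include <stdlib.h>

static int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;
    return (x > y) - (x < y);
}

int
main(void)
{
    /* the run already dumped to "tape" (runs are always sorted) */
    int tape[] = {1, 4, 7, 10, 13};
    /* tuples still held in memory when the input ran out */
    int mem[]  = {12, 2, 9, 5, 14};
    size_t ntape = sizeof(tape) / sizeof(tape[0]);
    size_t nmem  = sizeof(mem) / sizeof(mem[0]);
    size_t i = 0, j = 0;

    /* instead of heapifying mem onto the tape, just quicksort it ... */
    qsort(mem, nmem, sizeof(int), cmp_int);

    /* ... and merge the on-tape run with the in-memory "virtual" run */
    while (i < ntape || j < nmem)
    {
        if (j >= nmem || (i < ntape && tape[i] <= mem[j]))
            printf("%d ", tape[i++]);
        else
            printf("%d ", mem[j++]);
    }
    printf("\n");
    return 0;
}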

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Updatable view?

2015-07-30 Thread Jim Nasby

On 7/30/15 1:03 AM, Tatsuo Ishii wrote:

Is anyone working on implementing or interested in implementing
automatic updatable view which uses two or more tables involved
(join)? SQL1999 allows it in certain conditions.


I think it would be nice to have... but not to the point of working on 
it myself.


Might be worth an email to -general to see how many people have 
immediate use for it.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: backend niceness / session_priority

2015-07-30 Thread Jim Nasby

On 7/30/15 10:54 AM, Tom Lane wrote:

=?ISO-8859-15?Q?Jos=E9_Luis_Tall=F3n?= jltal...@adv-solutions.net writes:

  Since PostgreSQL lacks the resource management capabilities of the
Big Ones ( Resource Groups - Red, WorkLoad Manager - Blue ) or the
Resource Governor in MS SQL Server, we can try and approximate the
requested behaviour by reducing the CPU priority (nice) of the backend
in question. Please note that we would be using scheduler priority to
try and modulate I/O, though I'm aware of the limitations of this mechanism.


This has been proposed before, and rejected before, and I'm not seeing
anything particularly new here.  Without a credible mechanism for
throttling I/O, nice alone does not seem very promising.


Some OSes respect nice when it comes to IO scheduling, so it might still 
be useful. What I'm worried about is priority inversion[1].


What might be useful would be to add a set of GUCs similar to 
vacuum_cost_* that operated at the shared buffer level. Dunno where 
you'd put the sleep though (presumably all the functions where you'd put 
the accounting are too low-level to sleep in).


[1] https://en.wikipedia.org/wiki/Priority_inversion
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Using quicksort and a merge step to significantly improve on tuplesort's single run external sort

2015-07-30 Thread Peter Geoghegan
On Thu, Jul 30, 2015 at 4:26 AM, Greg Stark st...@mit.edu wrote:
 I'm a bit confused where the big win comes from though. Is what's going on
 that the external sort only exceeded memory by a small amount  so nearly all
 the tuples are still in memory?

Yes, that's why this can be much faster just as the work_mem threshold
is crossed. You get an almost internal sort, which means you can
mostly quicksort, and you can avoid dumping most tuples. It's still a
pretty nice win when less than half of tuples fit in memory, though --
just not as nice. Below that, the optimization isn't used.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improving test coverage of extensions with pg_dump

2015-07-30 Thread Andreas Karlsson

On 07/30/2015 04:48 AM, Michael Paquier wrote:

On Thu, Jul 30, 2015 at 5:54 AM, Andreas Karlsson andr...@proxel.se wrote:

What I do not like though is how the path src/test/tables_fk/t/ tells us
nothing about what features of PostgreSQL are tested here. For this I
personally prefer the earlier versions, where I think that was clear.


+Simple module used to check data dump ordering of pg_dump with tables
+linked with foreign keys.
The README mentions that this is to test pg_dump, perhaps referring to
the TAP tests makes sense as well?


The comment is good, but what I personally do not like is that the path 
to the test suite is non-obvious and not self-explanatory: I would not 
expect src/test/tables_fk/t/ to test pg_dump and extensions. I find it 
too confusing to regard the test suite as testing an extension called 
tables_fk, since in my mind that is just a necessary tool built to test 
something else.


This is obviously a subjective thing, and I see that for example Heikki 
likes it the way it is. I am fine with setting this as ready for 
committer and and let a committer weigh in on the naming.



Another thought: would it be worthwhile to also add an assertion to check if
the data really was restored properly or would that just be redundant code?


That seems worth doing, hence added something for it. Thanks for the suggestion.

At the same time I have added an entry in .gitignore for the generated
tmp_check/.


Nice changes.

--
Andreas



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] creating extension including dependencies

2015-07-30 Thread Michael Paquier
On Thu, Jul 30, 2015 at 10:58 PM, Petr Jelinek wrote:
 On 2015-07-27 15:18, Michael Paquier wrote:
 Something also has not been discussed yet: what to do with new_version
 and old_version (the options of CreateExtensionStmt)? As of now if
 those options are defined they are not passed down to the parent
 extensions but shouldn't we raise an error if they are used in
 combination with CASCADE? In any case, I think that the behavior
 chosen should be documented.


 I don't see why we should raise error, they are used just for the top-level
 extension and I think it makes sense that way. CASCADE documentation says
 SCHEMA option is cascaded to required extensions, do we need to say
 something more than that (ie explicitly saying that the old_version and
 new_version aren't)?

OK, let's do so then. I think that we should still document, for clarity,
the fact that the old and new version strings are not passed to the parent
extensions when cascade is used. Something like:
Other options are not recursively applied when the CASCADE clause is used.

I have been through this patch one last time and changed the following:
- Improved documentation: missing markups with literal, SCHEMA is a
clause and not a parameter, added explanation that options other than
SCHEMA are not applied recursively with CASCADE
- Corrected .gitignore in test_extensions, log/ was missing.
Those are minor things though, hence I just switched patch as Ready
for committer.
-- 
Michael
diff --git a/contrib/hstore_plperl/expected/hstore_plperl.out b/contrib/hstore_plperl/expected/hstore_plperl.out
index cf384eb..25fc506 100644
--- a/contrib/hstore_plperl/expected/hstore_plperl.out
+++ b/contrib/hstore_plperl/expected/hstore_plperl.out
@@ -1,6 +1,6 @@
-CREATE EXTENSION hstore;
-CREATE EXTENSION plperl;
-CREATE EXTENSION hstore_plperl;
+CREATE EXTENSION hstore_plperl CASCADE;
+NOTICE:  installing required extension hstore
+NOTICE:  installing required extension plperl
 SELECT transforms.udt_schema, transforms.udt_name,
routine_schema, routine_name,
group_name, transform_type
diff --git a/contrib/hstore_plperl/expected/hstore_plperlu.out b/contrib/hstore_plperl/expected/hstore_plperlu.out
index 8c689ad..3fc3201 100644
--- a/contrib/hstore_plperl/expected/hstore_plperlu.out
+++ b/contrib/hstore_plperl/expected/hstore_plperlu.out
@@ -1,6 +1,6 @@
-CREATE EXTENSION hstore;
-CREATE EXTENSION plperlu;
-CREATE EXTENSION hstore_plperlu;
+CREATE EXTENSION hstore_plperlu CASCADE;
+NOTICE:  installing required extension hstore
+NOTICE:  installing required extension plperlu
 SELECT transforms.udt_schema, transforms.udt_name,
routine_schema, routine_name,
group_name, transform_type
diff --git a/contrib/hstore_plperl/sql/hstore_plperl.sql b/contrib/hstore_plperl/sql/hstore_plperl.sql
index 0f70f14..9398aed 100644
--- a/contrib/hstore_plperl/sql/hstore_plperl.sql
+++ b/contrib/hstore_plperl/sql/hstore_plperl.sql
@@ -1,6 +1,4 @@
-CREATE EXTENSION hstore;
-CREATE EXTENSION plperl;
-CREATE EXTENSION hstore_plperl;
+CREATE EXTENSION hstore_plperl CASCADE;
 
 SELECT transforms.udt_schema, transforms.udt_name,
routine_schema, routine_name,
diff --git a/contrib/hstore_plperl/sql/hstore_plperlu.sql b/contrib/hstore_plperl/sql/hstore_plperlu.sql
index 3cfb2fd..8d8508c 100644
--- a/contrib/hstore_plperl/sql/hstore_plperlu.sql
+++ b/contrib/hstore_plperl/sql/hstore_plperlu.sql
@@ -1,6 +1,4 @@
-CREATE EXTENSION hstore;
-CREATE EXTENSION plperlu;
-CREATE EXTENSION hstore_plperlu;
+CREATE EXTENSION hstore_plperlu CASCADE;
 
 SELECT transforms.udt_schema, transforms.udt_name,
routine_schema, routine_name,
diff --git a/contrib/hstore_plpython/expected/hstore_plpython.out b/contrib/hstore_plpython/expected/hstore_plpython.out
index b7a6a92..55f4efe 100644
--- a/contrib/hstore_plpython/expected/hstore_plpython.out
+++ b/contrib/hstore_plpython/expected/hstore_plpython.out
@@ -1,5 +1,5 @@
-CREATE EXTENSION plpython2u;
-CREATE EXTENSION hstore_plpython2u;
+CREATE EXTENSION hstore_plpython2u CASCADE;
+NOTICE:  installing required extension plpython2u
 -- test hstore - python
 CREATE FUNCTION test1(val hstore) RETURNS int
 LANGUAGE plpythonu
diff --git a/contrib/hstore_plpython/sql/hstore_plpython.sql b/contrib/hstore_plpython/sql/hstore_plpython.sql
index 9ff2ebc..d55beda 100644
--- a/contrib/hstore_plpython/sql/hstore_plpython.sql
+++ b/contrib/hstore_plpython/sql/hstore_plpython.sql
@@ -1,5 +1,4 @@
-CREATE EXTENSION plpython2u;
-CREATE EXTENSION hstore_plpython2u;
+CREATE EXTENSION hstore_plpython2u CASCADE;
 
 
 -- test hstore - python
diff --git a/contrib/ltree_plpython/expected/ltree_plpython.out b/contrib/ltree_plpython/expected/ltree_plpython.out
index 934529e..9bee6be 100644
--- a/contrib/ltree_plpython/expected/ltree_plpython.out
+++ b/contrib/ltree_plpython/expected/ltree_plpython.out
@@ -1,5 +1,5 @@
-CREATE EXTENSION plpython2u;
-CREATE EXTENSION ltree_plpython2u;
+CREATE EXTENSION ltree_plpython2u CASCADE;
+NOTICE:  

Re: [HACKERS] Proposal: backend niceness / session_priority

2015-07-30 Thread Andreas Karlsson

On 07/31/2015 12:18 AM, Jim Nasby wrote:

This has been proposed before, and rejected before, and I'm not seeing
anything particularly new here.  Without a credible mechanism for
throttling I/O, nice alone does not seem very promising.


Some OSes respect nice when it comes to IO scheduling, so it might still
be useful.


Wouldn't the bgwriter remove a lot of the usefulness?

Andreas


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reduce ProcArrayLock contention

2015-07-30 Thread Amit Kapila
On Wed, Jul 29, 2015 at 11:48 PM, Andres Freund and...@anarazel.de wrote:

 On 2015-07-29 12:54:59 -0400, Robert Haas wrote:
  I would try to avoid changing lwlock.c.  It's pretty easy when so
  doing to create mechanisms that work now but make further upgrades to
  the general lwlock mechanism difficult.  I'd like to avoid that.

 I'm massively doubtful that re-implementing parts of lwlock.c is the
 better outcome. Then you have two different infrastructures you need to
 improve over time.

I agree and modified the patch to use 32-bit atomics based on idea
suggested by Robert and didn't modify lwlock.c.



 I understand. IMHO it will be a good idea though to ensure that the patch
does not cause regression for other setups such as a less powerful
 machine or while running with lower number of clients.


Okay, I have tried it on a CentOS VM, but the data is too noisy (at the
same client count across 2 runs the variation is quite high, ranging from 300
to 3000 tps) to make any meaning out of it.  I think if you want to collect
data on a less powerful machine, then at least we should ensure that it is taken
in a way that we are sure it is not due to noise and that there is an actual
regression; only then can we decide whether we should investigate it or not.
Can you please try taking data with the attached script
(perf_pgbench_tpcb_write.sh)?  A few things need to be changed in the script
based on your setup (environment variables defined at the beginning of the
script, like PGDATA); the other thing is to ensure that the names of the
binaries for HEAD and Patch are changed in the script if you are naming them
differently.  It will generate the performance data in test*.txt files.
Also try to ensure that checkpoints are configured such that they do not
occur in-between tests, else it will be difficult to draw conclusions from
the results.  Some parameters you might want to consider for this are
checkpoint_timeout, checkpoint_completion_target, min_wal_size,
max_wal_size.



 I was telling that fact even without my patch. Basically I have
 tried by commenting ProcArrayLock in ProcArrayEndTransaction.


 I did not get that. You mean the TPS is same if you run with commenting
out ProcArrayLock in ProcArrayEndTransaction?

Yes, TPS is almost same as with Patch.

 Is that safe to do?

No, that is not safe.  I have done that just to see what is the maximum
gain we can get by reducing the contention around ProcArrayLock.




 No, autovacuum generates I/O due to which sometimes there
 is more variation in Write tests.


 Sure, but on an average does it still show similar improvements?


Yes, the numbers with and without autovacuum show a similar trend; it's just
that you can see somewhat better results (more performance improvement)
with autovacuum=off and a manual vacuum at the end of each test.

 Or does the test becomes IO bound and hence the bottleneck shifts
somewhere
 else? Can you please post those numbers as well when you get chance?

The numbers posted in initial mail or in this mail are with autovacuum =on.




 So among these only step 2 can be common among different
 algorithms, other's need some work specific to each optimization.


 Right, but if we could encapsulate that work in a function that just
needs to work on some shared memory, I think we can build such
infrastructure.

For now, I have encapsulated the code into 2 separate functions rather
than extending LWLock.c, as that could easily lead to code which might
not be used in future even though currently it sounds like it will be; if
in future we need to use the same technique elsewhere then we can look into it.



 
  Regarding the patch, the compare-and-exchange function calls that
you've used would work only for 64-bit machines, right? You would need to
use equivalent 32-bit calls on a 32-bit machine.
 

 I thought that internal API will automatically take care of it,
 example for msvc it uses _InterlockedCompareExchange64
 which if doesn't work on 32-bit systems or is not defined, then
 we have to use 32-bit version, but I am not certain about
 that fact.


 Hmm. The pointer will be a 32-bit field on a 32-bit machine. I don't know
how exchanging that with 64-bit integer be safe.


True, but that in general has to be taken care of by the 64-bit atomic APIs;
in this case it should fall back to either a 32-bit version of the API or
something that can work on a 32-bit machine.  I think that fallback support
is missing as of now in the 64-bit APIs, and we should have it if we want
to use those APIs, but anyway for now I have modified the patch to
use 32-bit atomics.
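
For illustration only, here is the kind of 32-bit compare-and-exchange
retry loop involved, written against C11 atomics rather than
PostgreSQL's pg_atomic_* wrappers; the "first arrival becomes the
leader" idea is sketched with a simple pending bitmask (the patch's
actual data structure may well differ):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <stdatomic.h>

static _Atomic uint32_t pending_mask = 0;

/*
 * Add proc number (0..31) to the pending mask; return true if we were
 * the first member, i.e. the one that should do the work for the group.
 */
static bool
join_group(int procno)
{
    uint32_t oldval = atomic_load(&pending_mask);

    for (;;)
    {
        uint32_t newval = oldval | ((uint32_t) 1 << procno);

        if (atomic_compare_exchange_weak(&pending_mask, &oldval, newval))
            return oldval == 0;     /* mask was empty => we lead */
        /* on failure oldval has been reloaded with the current value */
    }
}

int
main(void)
{
    printf("proc 3 leader? %d\n", join_group(3));   /* prints 1 */
    printf("proc 7 leader? %d\n", join_group(7));   /* prints 0 */
    return 0;
}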


Performance Data with modified patch.

pgbench setup

scale factor - 300
Data is on magnetic disk and WAL on ssd.
pgbench -M prepared tpc-b

Data is median of 3 - 30 min runs, the detailed data for
all the 3 runs is in attached document
(perf_write_procarraylock_data.ods)

Head : commit c53f7387
Patch : group_xid_clearing_at_trans_end_rel_v2

Client_Count/Patch_ver (TPS)    1      8      16     32     64     128
HEAD                            972    6004   11060  20074  23839

Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-07-30 Thread Rahila Syed
I think it's related to the problem of figuring out how many dead tuples
you expect to find in the overall heap, which you need to do to have any
hope of this being a comprehensive estimate.

An estimate of the number of index scans while vacuuming can be made using
an estimate of the total dead tuples in the relation and maintenance_work_mem.
n_dead_tuples in pg_stat_all_tables can be used as an estimate of dead
tuples.

Following can be a way to estimate:

if nindexes == 0
  index_scans = 0
else if pages_all_visible
  index_scans = 0
else
  index_scans = Max((n_dead_tuples * space occupied by a single dead tuple) / m_w_m, 1)

This estimates index_scans = 1 if n_dead_tuples = 0, assuming the lazy heap
scan is likely to find some dead tuples.
If n_dead_tuples is non-zero, the above estimate gives a lower bound on the
number of index scans possible.
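
A minimal C sketch of that estimate (my own translation of the
pseudocode above, assuming roughly 6 bytes per dead-tuple pointer and
m_w_m given in bytes):

#include <stdio.h>

#define DEAD_TUPLE_BYTES 6      /* assumed size of one dead-tuple pointer */

/* lower-bound estimate of index vacuum passes, per the pseudocode above */
static long
estimate_index_scans(int nindexes, int pages_all_visible,
                     long n_dead_tuples, long m_w_m_bytes)
{
    long scans;

    if (nindexes == 0 || pages_all_visible)
        return 0;

    scans = (n_dead_tuples * DEAD_TUPLE_BYTES) / m_w_m_bytes;
    return scans > 1 ? scans : 1;       /* Max(..., 1) */
}

int
main(void)
{
    /* e.g. 50 million dead tuples with 16MB of maintenance_work_mem */
    printf("estimated index scans: %ld\n",
           estimate_index_scans(2, 0, 50000000L, 16L * 1024 * 1024));
    return 0;
}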


Thank you,
Rahila Syed


Re: [HACKERS] Updatable view?

2015-07-30 Thread Tatsuo Ishii
 I think it would be nice to have... but not to the point of working on
 it myself.
 
 Might be worth an email to -general to see how many people have
 immediate use for it.

What I am thinking about is,

1) Implement certain class of updatable views allowed in SQL:1999
   (UNION ALL, natural joins)

2) Anything beyond #1 (I have no idea for now)

Let me see how people are interested in...

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );

2015-07-30 Thread Fabrízio de Royes Mello
On Thu, Jul 30, 2015 at 11:28 PM, Andres Freund and...@anarazel.de wrote:

  @@ -57,7 +57,8 @@ static relopt_bool boolRelOpts[] =

 If we go through this list, I'd rather make informed decisions about
 each reloption. Otherwise we're going to get patches for each of them
 separately over the next versions.


I have no problem to do this change now instead of wait for next versions...


  @@ -73,7 +75,8 @@ static relopt_bool boolRelOpts[] =
{
fastupdate,
Enables \fast update\ feature for this GIN
index,
  - RELOPT_KIND_GIN
  + RELOPT_KIND_GIN,
  + AccessExclusiveLock
},
true
},


  @@ -95,7 +99,8 @@ static relopt_int intRelOpts[] =
{
fillfactor,
Packs table pages only to this percentage,
  - RELOPT_KIND_HEAP
  + RELOPT_KIND_HEAP,
  + AccessExclusiveLock
},
HEAP_DEFAULT_FILLFACTOR, HEAP_MIN_FILLFACTOR, 100

  [some other fillfactor settings]

{
{
gin_pending_list_limit,
Maximum size of the pending list for this GIN
index, in kilobytes.,
  - RELOPT_KIND_GIN
  + RELOPT_KIND_GIN,
  + AccessExclusiveLock
},
-1, 64, MAX_KILOBYTES
},
  @@ -297,7 +325,8 @@ static relopt_string stringRelOpts[] =
{
buffering,
Enables buffering build for this GiST index,
  - RELOPT_KIND_GIST
  + RELOPT_KIND_GIST,
  + AccessExclusiveLock
},
4,
false,

 Why? These options just change things for the future and don't influence
 past decisions. It seems unproblematic to change them.


+1


  @@ -259,7 +283,8 @@ static relopt_real realRelOpts[] =
{
seq_page_cost,
Sets the planner's estimate of the cost of a
sequentially fetched disk page.,
  - RELOPT_KIND_TABLESPACE
  + RELOPT_KIND_TABLESPACE,
  + AccessExclusiveLock
},
-1, 0.0, DBL_MAX
},
  @@ -267,7 +292,8 @@ static relopt_real realRelOpts[] =
{
random_page_cost,
Sets the planner's estimate of the cost of a
nonsequentially fetched disk page.,
  - RELOPT_KIND_TABLESPACE
  + RELOPT_KIND_TABLESPACE,
  + AccessExclusiveLock
},
-1, 0.0, DBL_MAX
},
  @@ -275,7 +301,8 @@ static relopt_real realRelOpts[] =
{
n_distinct,
Sets the planner's estimate of the number of
distinct values appearing in a column (excluding child relations).,
  - RELOPT_KIND_ATTRIBUTE
  + RELOPT_KIND_ATTRIBUTE,
  + AccessExclusiveLock
},
0, -1.0, DBL_MAX
},
  @@ -283,7 +310,8 @@ static relopt_real realRelOpts[] =
{
n_distinct_inherited,
Sets the planner's estimate of the number of
distinct values appearing in a column (including child relations).,
  - RELOPT_KIND_ATTRIBUTE
  + RELOPT_KIND_ATTRIBUTE,
  + AccessExclusiveLock
},
0, -1.0, DBL_MAX
},

 These probably are the settings that are most interesting to change
 without access exlusive locks.


+1


j = 0;
for (i = 0; boolRelOpts[i].gen.name; i++)
  + {
  + Assert(DoLockModesConflict(boolRelOpts[i].gen.lockmode,
  +
 boolRelOpts[i].gen.lockmode));
j++;
  + }
for (i = 0; intRelOpts[i].gen.name; i++)
  + {
  + Assert(DoLockModesConflict(intRelOpts[i].gen.lockmode,
  +
 intRelOpts[i].gen.lockmode));
j++;
  + }
for (i = 0; realRelOpts[i].gen.name; i++)
  + {
  + Assert(DoLockModesConflict(realRelOpts[i].gen.lockmode,
  +
 realRelOpts[i].gen.lockmode));
j++;
  + }
for (i = 0; stringRelOpts[i].gen.name; i++)
  + {
  + Assert(DoLockModesConflict(stringRelOpts[i].gen.lockmode,
  +
 stringRelOpts[i].gen.lockmode));
j++;
  + }
j += num_custom_options;

 Doesn't really seem worth it to assert individually in each case here to
 me.


What do you suggest then?



  +/*
  + * Determine the required LOCKMODE from an option list
  + */
  +LOCKMODE
  +GetRelOptionsLockLevel(List *defList)
  +{
  + 

Re: [HACKERS] security labels on databases are bad for dump restore

2015-07-30 Thread Adam Brightwell
 1. pg_dumpall -g
 2. pg_dump --create per database

 Gah, OK, I see your point.  But we better document this, because if
 you need a PhD in PostgreSQL-ology to take a backup, we're not in a
 good place.

Agreed.  Though, honestly, I find this to be a cumbersome approach.  I
think it just makes things more confusing, even if it is well
documented.  Perhaps it might be necessary as a bridge to get to a
better place.  But my first question as an end user would be, 'why
can't one tool do this?'.  Also, by using 'pg_dumpall -g' aren't you
potentially getting things that you don't want/need/care about?  For
instance, if database 'foo' is owned by 'user1' and database 'bar' is
owned by 'user2' and neither have any knowledge/relation of/to the
other, then when I dump 'foo', in this manner, wouldn't I also be
including 'user2'?  Said differently, a restore of a 'foo'-only dump
would also include a 'bar' related role.  That seems like a bad idea,
IMHO.  Maybe it can't be avoided, but I'd expect that only relevant
information for the database being dumped would be included.

-Adam

-- 
Adam Brightwell - adam.brightw...@crunchydatasolutions.com
Database Engineer - www.crunchydatasolutions.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );

2015-07-30 Thread Michael Paquier
On Fri, Jul 31, 2015 at 11:28 AM, Andres Freund and...@anarazel.de wrote:

 @@ -57,7 +57,8 @@ static relopt_bool boolRelOpts[] =

 If we go through this list, I'd rather make informed decisions about
 each reloption. Otherwise we're going to get patches for each of them
 separately over the next versions.

Just dropping a quick reply: I meant table relopts only, excluding
the index stuff for now, as far as the isolation tests are concerned.

 + AccessExclusiveLock
 + foreach(cell, defList)
 + {
 + DefElem *def = (DefElem *) lfirst(cell);
 + int i;
 +
 + for (i = 0; relOpts[i]; i++)
 + {
 + if (pg_strncasecmp(relOpts[i]->name, def->defname, 
 relOpts[i]->namelen + 1) == 0)
 + {
 + if (lockmode < relOpts[i]->lockmode)
 + lockmode = relOpts[i]->lockmode;
 + }
 + }
 + }
 +
 + return lockmode;
 +}

 We usually don't compare lock values that way, i.e. there's not
 guaranteed to be a strict monotonicity between lock levels. I don't
 really agree with that policy, but it's nonetheless there.

Yeah, there are some in lock.c but that's rather localized.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );

2015-07-30 Thread Fabrízio de Royes Mello
On Thu, Jul 30, 2015 at 10:46 PM, Michael Paquier michael.paqu...@gmail.com
wrote:

 This patch size has increased from 16k to 157k because of the output
 of the isolation tests you just added. This is definitely too large
 and actually the test coverage is rather limited. Hence I think that
 they should be changed as follows:
 - Use only one table, the locks of tables a and b do not conflict, and
 that's what we want to look at here.
 - Check the locks of all the relation parameters, by that I mean as
 well fillfactor and user_catalog_table which still take
 AccessExclusiveLock on the relation
 - Use custom permutations. I doubt that there is much sense to test
 all the permutations in this context, and this will reduce the
 expected output size drastically.


Ok.


 On further notice, I would recommend not to use the same string name
 for the session and the query identifiers. And I think that inserting
 only one tuple at initialization is just but fine.

 Here is an example of such a spec:
 setup
 {
  CREATE TABLE a (id int PRIMARY KEY);
  INSERT INTO a SELECT generate_series(1,100);
 }
 teardown
 {
  DROP TABLE a;
 }
 session s1
 step b1   { BEGIN; }
 # TODO add one query per parameter
 step at11 { ALTER TABLE a SET (fillfactor=10); }
 step at12 { ALTER TABLE a SET
(autovacuum_vacuum_scale_factor=0.001); }
 step c1   { COMMIT; }
 session s2
 step b2   { BEGIN; }
 step wx1  { UPDATE a SET id = id + 1; }
 step c2   { COMMIT; }

 And by testing permutations like for example that it is possible to
 see which session is waiting for what. Input:
 permutation b1 b2 at11 wx1 c1 c2
 Output where session 2 waits for lock taken after fillfactor update:
 step b1: BEGIN;
 step b2: BEGIN;
 step at11: ALTER TABLE a SET (fillfactor=10);
 step wx1: UPDATE a SET id = id + 1; waiting ...
 step c1: COMMIT;
 step wx1: ... completed
 step c2: COMMIT;

 Be careful as well to not include incorrect permutations in the
 expected output file, the isolation tester is smart enough to ping you
 about that.


Changed the isolation tests according your suggestions.

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
 Timbira: http://www.timbira.com.br
 Blog: http://fabriziomello.github.io
 Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello
 Github: http://github.com/fabriziomello
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 1c1c181..ad985cd 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -543,6 +543,10 @@ ALTER TABLE ALL IN TABLESPACE <replaceable class="PARAMETER">name</replaceable>
   of <command>ALTER TABLE</> that forces a table rewrite.
  </para>
 
+ <para>
+  Changing autovacuum storage parameters acquires a <literal>SHARE UPDATE EXCLUSIVE</literal> lock.
+ </para>
+
  <note>
   <para>
    While <command>CREATE TABLE</> allows <literal>OIDS</> to be specified
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 8176b6a..b8d2a92 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -57,7 +57,8 @@ static relopt_bool boolRelOpts[] =
 		{
 			autovacuum_enabled,
 			Enables autovacuum in this relation,
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
 		},
 		true
 	},
@@ -65,7 +66,8 @@ static relopt_bool boolRelOpts[] =
 		{
 			user_catalog_table,
 			Declare a table as an additional catalog table, e.g. for the purpose of logical replication,
-			RELOPT_KIND_HEAP
+			RELOPT_KIND_HEAP,
+			AccessExclusiveLock
 		},
 		false
 	},
@@ -73,7 +75,8 @@ static relopt_bool boolRelOpts[] =
 		{
 			fastupdate,
 			Enables \fast update\ feature for this GIN index,
-			RELOPT_KIND_GIN
+			RELOPT_KIND_GIN,
+			AccessExclusiveLock
 		},
 		true
 	},
@@ -81,7 +84,8 @@ static relopt_bool boolRelOpts[] =
 		{
 			security_barrier,
 			View acts as a row security barrier,
-			RELOPT_KIND_VIEW
+			RELOPT_KIND_VIEW,
+			AccessExclusiveLock
 		},
 		false
 	},
@@ -95,7 +99,8 @@ static relopt_int intRelOpts[] =
 		{
 			fillfactor,
 			Packs table pages only to this percentage,
-			RELOPT_KIND_HEAP
+			RELOPT_KIND_HEAP,
+			AccessExclusiveLock
 		},
 		HEAP_DEFAULT_FILLFACTOR, HEAP_MIN_FILLFACTOR, 100
 	},
@@ -103,7 +108,8 @@ static relopt_int intRelOpts[] =
 		{
 			fillfactor,
 			Packs btree index pages only to this percentage,
-			RELOPT_KIND_BTREE
+			RELOPT_KIND_BTREE,
+			AccessExclusiveLock
 		},
 		BTREE_DEFAULT_FILLFACTOR, BTREE_MIN_FILLFACTOR, 100
 	},
@@ -111,7 +117,8 @@ static relopt_int intRelOpts[] =
 		{
 			fillfactor,
 			Packs hash index pages only to this percentage,
-			RELOPT_KIND_HASH
+			RELOPT_KIND_HASH,
+			AccessExclusiveLock
 		},
 		HASH_DEFAULT_FILLFACTOR, HASH_MIN_FILLFACTOR, 100
 	},
@@ -119,7 +126,8 @@ static relopt_int intRelOpts[] =
 		{
 			fillfactor,
 			Packs gist index pages only 

Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );

2015-07-30 Thread Andres Freund

 @@ -57,7 +57,8 @@ static relopt_bool boolRelOpts[] =

If we go through this list, I'd rather make informed decisions about
each reloption. Otherwise we're going to get patches for each of them
separately over the next versions.

 @@ -73,7 +75,8 @@ static relopt_bool boolRelOpts[] =
   {
   fastupdate,
   Enables \fast update\ feature for this GIN index,
 - RELOPT_KIND_GIN
 + RELOPT_KIND_GIN,
 + AccessExclusiveLock
   },
   true
   },


 @@ -95,7 +99,8 @@ static relopt_int intRelOpts[] =
   {
   fillfactor,
   Packs table pages only to this percentage,
 - RELOPT_KIND_HEAP
 + RELOPT_KIND_HEAP,
 + AccessExclusiveLock
   },
   HEAP_DEFAULT_FILLFACTOR, HEAP_MIN_FILLFACTOR, 100

 [some other fillfactor settings]

   {
   {
   gin_pending_list_limit,
   Maximum size of the pending list for this GIN index, 
 in kilobytes.,
 - RELOPT_KIND_GIN
 + RELOPT_KIND_GIN,
 + AccessExclusiveLock
   },
   -1, 64, MAX_KILOBYTES
   },
 @@ -297,7 +325,8 @@ static relopt_string stringRelOpts[] =
   {
   buffering,
   Enables buffering build for this GiST index,
 - RELOPT_KIND_GIST
 + RELOPT_KIND_GIST,
 + AccessExclusiveLock
   },
   4,
   false,

Why? These options just change things for the future and don't influence
past decisions. It seems unproblematic to change them.

 @@ -259,7 +283,8 @@ static relopt_real realRelOpts[] =
   {
   seq_page_cost,
   Sets the planner's estimate of the cost of a 
 sequentially fetched disk page.,
 - RELOPT_KIND_TABLESPACE
 + RELOPT_KIND_TABLESPACE,
 + AccessExclusiveLock
   },
   -1, 0.0, DBL_MAX
   },
 @@ -267,7 +292,8 @@ static relopt_real realRelOpts[] =
   {
   random_page_cost,
   Sets the planner's estimate of the cost of a 
 nonsequentially fetched disk page.,
 - RELOPT_KIND_TABLESPACE
 + RELOPT_KIND_TABLESPACE,
 + AccessExclusiveLock
   },
   -1, 0.0, DBL_MAX
   },
 @@ -275,7 +301,8 @@ static relopt_real realRelOpts[] =
   {
   n_distinct,
   Sets the planner's estimate of the number of distinct 
 values appearing in a column (excluding child relations).,
 - RELOPT_KIND_ATTRIBUTE
 + RELOPT_KIND_ATTRIBUTE,
 + AccessExclusiveLock
   },
   0, -1.0, DBL_MAX
   },
 @@ -283,7 +310,8 @@ static relopt_real realRelOpts[] =
   {
   n_distinct_inherited,
   Sets the planner's estimate of the number of distinct 
 values appearing in a column (including child relations).,
 - RELOPT_KIND_ATTRIBUTE
 + RELOPT_KIND_ATTRIBUTE,
 + AccessExclusiveLock
   },
   0, -1.0, DBL_MAX
   },

These probably are the settings that are most interesting to change
without access exlusive locks.

   j = 0;
   for (i = 0; boolRelOpts[i].gen.name; i++)
 + {
 + Assert(DoLockModesConflict(boolRelOpts[i].gen.lockmode,
 +
 boolRelOpts[i].gen.lockmode));
   j++;
 + }
   for (i = 0; intRelOpts[i].gen.name; i++)
 + {
 + Assert(DoLockModesConflict(intRelOpts[i].gen.lockmode,
 +
 intRelOpts[i].gen.lockmode));
   j++;
 + }
   for (i = 0; realRelOpts[i].gen.name; i++)
 + {
 + Assert(DoLockModesConflict(realRelOpts[i].gen.lockmode,
 +
 realRelOpts[i].gen.lockmode));
   j++;
 + }
   for (i = 0; stringRelOpts[i].gen.name; i++)
 + {
 + Assert(DoLockModesConflict(stringRelOpts[i].gen.lockmode,
 +
 stringRelOpts[i].gen.lockmode));
   j++;
 + }
   j += num_custom_options;

Doesn't really seem worth it to assert individually in each case here to
me.

 +/*
 + * Determine the required LOCKMODE from an option list
 + */
 +LOCKMODE
 +GetRelOptionsLockLevel(List *defList)
 +{
 + LOCKMODE	lockmode = NoLock;
 + ListCell   *cell;

Re: [HACKERS] LWLock deadlock and gdb advice

2015-07-30 Thread Andres Freund
On 2015-07-30 09:03:01 -0700, Jeff Janes wrote:
 On Wed, Jul 29, 2015 at 6:10 AM, Andres Freund and...@anarazel.de wrote:
  What do you think about something roughly like the attached?
 
 
 I've not evaluated the code, but applying it does solve the problem I was
 seeing.

Cool, thanks for testing! How long did you run the test and how long did
it, on average, previously take for the issue to occur?

Regards,

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Updatable view?

2015-07-30 Thread Tatsuo Ishii
Hi,

Is anyone working on implementing or interested in implementing
automatic updatable view which uses two or more tables involved
(join)? SQL1999 allows it in certain conditions.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] security labels on databases are bad for dump restore

2015-07-30 Thread Noah Misch
On Wed, Jul 29, 2015 at 10:50:53AM -0400, Robert Haas wrote:
 On Wed, Jul 29, 2015 at 12:39 AM, Noah Misch n...@leadboat.com wrote:
  On Tue, Jul 28, 2015 at 03:36:13PM -0400, Robert Haas wrote:
  On Tue, Jul 28, 2015 at 3:33 PM, Andres Freund and...@anarazel.de wrote:
   Hm?  Let me try again: If the admin does a ALTER DATABASE ... SET guc =
   ... *before* restoring a backup and the backup does contain a setting
   for the same guc, but with a different value it'll overwrite the
   previous explicit action by the DBA without any warning.  If the backup
   does *not* contain that guc the previous action survives.  That's
   confusing, because you're more likely to be in the 'the backup does not
   contain the guc' situation when testing where it thus will work.
 
  True.  But I don't think modifying a database before restoring into it
  is terribly supported.  Even pg_dump --clean, which is supposed to do
  this sort of thing, doesn't seem to work terribly reliably.  We could
  try to fix this by having a command like ALTER DATABASE ... RESET ALL
  that we issue before restoring the settings, but I'm afraid that will
  take us into all sorts of unreasonable scenarios that are better just
  labeled as don't do that.
 
  Andres's example is a harbinger of the semantic morass ahead.  Excepting
  database objects and the public schema object, pg_dump and pg_dumpall 
  mutate
  only the objects they CREATE.  They consistently restore object properties
  (owner, ACLs, security label, etc.) if and only if issuing a CREATE 
  statement
  for the object.  For example, restoring objects contained in a schema 
  without
  restoring the schema itself changes none of those schema properties.  
  pg_dump
  and pg_dumpall have mostly followed that rule for databases, too, but they
  depart from it for comment and security label.  That was a mistake.  We 
  can't
  in general mutate an existing database to match, because we can't mutate the
  encoding, datcollate or datctype.  Even discounting that problem, I value
  consistency with the rest of the dumpable object types.
 
 What we've proven so far (if Craig's comments are to be believed) is
 that the oft-recommended formula of pg_dumpall -g plus pg_dump of each
 database doesn't completely work.  That's absolutely gotta be fixed.

What exact formula did you have in mind?  It must not be merely

1. pg_dumpall -g
2. pg_dump (without --create) per database

which _never_ works: it emits no CREATE DATABASE statements.  Perhaps this?

1. pg_dumpall -g
2. Issue a handwritten CREATE DATABASE statement per database with correct
   encoding, lc_ctype and lc_collate parameters.  All other database
   properties can be wrong; the dump will fix them.
3. pg_dump (without --create) per database

That neglects numerous database properties today, but we could make it work.
Given the problems I described upthread, it's an inferior formula that I
recommend against propping up.  I much prefer making this work completely:

1. pg_dumpall -g
2. pg_dump --create per database

Another formula I wouldn't mind offering:

1. pg_dumpall -g
2. pg_dumpall --empty-databases
3. pg_dump (without --create) per database

Code for an --empty-databases option already exists for pg_dumpall -g
--binary-upgrade.  A patch turning that into a user-facing feature might be
quite compact.  I don't see much point given a complete pg_dump --create,
but I wouldn't object.

Thanks,
nm


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Don't start streaming after creating a slot in pg_receivexlog

2015-07-30 Thread Michael Paquier
On Wed, Jul 29, 2015 at 10:20 PM, Andres Freund and...@anarazel.de wrote:
 On 2015-07-29 22:17:27 +0900, Michael Paquier wrote:
 Here is a patch implementing those things. IMO if-not-exists does not
 make much sense anymore

 What? It's rather useful to be able to discern between 'slot was already
 there' and 'oops, some error occured'. -1

OK, fine.

 To me the pg_recvlogical changes are pretty pointless.

OK, you wrote it after all. I won't insist on it. So, perhaps the
attached is more convincing then? It just changes --create-slot to
leave immediately after creation to address the complaint of this
thread.
-- 
Michael


20150730_pgrecv_slots_v2.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics / patch v7

2015-07-30 Thread Heikki Linnakangas

On 05/25/2015 11:43 PM, Tomas Vondra wrote:

There are 6 files attached, but only 0002-0006 are actually part of the
multivariate statistics patch itself.


All of these patches are huge. In order to review this in a reasonable 
amount of time, we need to do this in several steps. So let's see what 
would be the minimal set of these patches that could be reviewed and 
committed, while still being useful.


The main patches are:

1. shared infrastructure and functional dependencies
2. clause reduction using functional dependencies
3. multivariate MCV lists
4. multivariate histograms
5. multi-statistics estimation

Would it make sense to commit only patches 1 and 2 first? Would that be 
enough to get a benefit from this?


I have some doubts about the clause reduction and functional 
dependencies part of this. It seems to treat functional dependency as a 
boolean property, but even with the classic zipcode and city case, it's 
not always an all or nothing thing. At least in some countries, there 
can be zipcodes that span multiple cities. So zipcode=X does not 
completely imply city=Y, although there is a strong correlation (if 
that's the right term). How strong does the correlation need to be for 
this patch to decide that zipcode implies city? I couldn't actually see 
a clear threshold stated anywhere.


So rather than treating functional dependence as a boolean, I think it 
would make more sense to put a 0.0-1.0 number to it. That means that you 
can't do clause reduction like it's done in this patch, where you 
actually remove clauses from the query for cost esimation purposes. 
Instead, you need to calculate the selectivity for each clause 
independently, but instead of just multiplying the selectivities 
together, apply the dependence factor to it.
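
One possible way to fold such a 0.0-1.0 dependency degree d into the
estimate for two clauses on A and B (my own sketch, not something taken
from the patch) is s(A) * (d + (1 - d) * s(B)), which reduces to plain
multiplication at d = 0 and to "B is implied by A" at d = 1:

#include <stdio.h>

/*
 * Combine two clause selectivities given a functional-dependency degree
 * d (0.0 = independent, 1.0 = B fully determined by A).  Illustration only.
 */
static double
combine_selectivity(double sel_a, double sel_b, double d)
{
    return sel_a * (d + (1.0 - d) * sel_b);
}

int
main(void)
{
    double sel_zip = 0.001, sel_city = 0.01;

    printf("independent: %g\n", combine_selectivity(sel_zip, sel_city, 0.0));
    printf("d = 0.95   : %g\n", combine_selectivity(sel_zip, sel_city, 0.95));
    printf("d = 1.0    : %g\n", combine_selectivity(sel_zip, sel_city, 1.0));
    return 0;
}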


Does that make sense? I haven't really looked at the MCV, histogram and 
multi-statistics estimation patches yet. Do those patches make the 
clause reduction patch obsolete? Should we forget about the clause 
reduction and functional dependency patch, and focus on those later 
patches instead?


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dblink: add polymorphic functions.

2015-07-30 Thread Tom Lane
Joe Conway m...@joeconway.com writes:
 What about just TYPE then?

 SELECT x::TYPE(some_expression) FROM ...
 SELECT CAST (x AS TYPE(some_expression)) FROM ...

Yeah, that would work.  Quick-hack proof-of-concept patch attached.
Some usage examples in the regression database:

regression=# select pg_typeof(43::type(q1)) from int8_tbl;
 pg_typeof 
---
 bigint
 bigint
 bigint
 bigint
 bigint
(5 rows)

regression=# select pg_typeof(43::type(q1/0.0)) from int8_tbl;
 pg_typeof 
---
 numeric
 numeric
 numeric
 numeric
 numeric
(5 rows)

regression=# select pg_typeof(43::type(f1)) from point_tbl;
ERROR:  cannot cast type integer to point
LINE 1: select pg_typeof(43::type(f1)) from point_tbl;
   ^

The main limitation of this patch is that it won't work for call sites
that pass pstate == NULL to LookupTypeName.  There are a fair number
of them, some of which wouldn't care because they could never invoke
this notation anyway, but for others we'd need to do some work to cons
up a suitable pstate.

regards, tom lane

diff --git a/src/backend/parser/parse_type.c b/src/backend/parser/parse_type.c
index 6616639..d5d0f73 100644
*** a/src/backend/parser/parse_type.c
--- b/src/backend/parser/parse_type.c
***
*** 19,25 
--- 19,27 
  #include catalog/pg_type.h
  #include lib/stringinfo.h
  #include nodes/makefuncs.h
+ #include nodes/nodeFuncs.h
  #include parser/parser.h
+ #include parser/parse_expr.h
  #include parser/parse_type.h
  #include utils/array.h
  #include utils/builtins.h
*** static int32 typenameTypeMod(ParseState 
*** 52,58 
   * found but is a shell, and there is typmod decoration, an error will be
   * thrown --- this is intentional.
   *
!  * pstate is only used for error location info, and may be NULL.
   */
  Type
  LookupTypeName(ParseState *pstate, const TypeName *typeName,
--- 54,61 
   * found but is a shell, and there is typmod decoration, an error will be
   * thrown --- this is intentional.
   *
!  * In most cases pstate is only used for error location info, and may be NULL.
!  * However, the TYPE(expression) syntax is not accepted when pstate is NULL.
   */
  Type
  LookupTypeName(ParseState *pstate, const TypeName *typeName,
*** LookupTypeName(ParseState *pstate, const
*** 143,148 
--- 146,188 
  			format_type_be(typoid;
  		}
  	}
+ 	else if (pstate != NULL &&
+ 			 list_length(typeName->typmods) == 1 &&
+ 			 list_length(typeName->names) == 1 &&
+ 			 strcmp(strVal(linitial(typeName->names)), "type") == 0)
+ 	{
+ 		/* TYPE(expression) notation */
+ 		Node	   *typexpr = (Node *) linitial(typeName->typmods);
+ 
+ 		/* XXX should invent a new EXPR_KIND for this, likely */
+ 		typexpr = transformExpr(pstate, typexpr, EXPR_KIND_SELECT_TARGET);
+ 
+ 		/* We needn't bother assigning collations to the expr */
+ 
+ 		/* We use the expression's type/typmod and then throw the expr away */
+ 		typoid = exprType(typexpr);
+ 
+ 		/* If an array reference, return the array type instead */
+ 		if (typeName->arrayBounds != NIL)
+ 			typoid = get_array_type(typoid);
+ 
+ 		if (!OidIsValid(typoid))
+ 		{
+ 			if (typmod_p)
+ *typmod_p = -1;
+ 			return NULL;
+ 		}
+ 
+ 		if (typmod_p)
+ 			*typmod_p = exprTypmod(typexpr);
+ 
+ 		/* Duplicative, but I'm too lazy to refactor this function right now */
+ 		tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(typoid));
+ 		if (!HeapTupleIsValid(tup)) /* should not happen */
+ 			elog(ERROR, cache lookup failed for type %u, typoid);
+ 
+ 		return (Type) tup;
+ 	}
  	else
  	{
  		/* Normal reference to a type name */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CREATE FUNCTION .. LEAKPROOF docs

2015-07-30 Thread Joe Conway

On 07/29/2015 04:46 PM, Joe Conway wrote:
 On 06/14/2015 03:46 AM, Dean Rasheed wrote:
 I think the docs for the LEAKPROOF option in
 create_function.sgml ought to mention RLS as well as security
 barrier views. Also the current text is no longer strictly
 correct in light of commit 
 dcbf5948e12aa60b4d6ab65b6445897dfc971e01. Suggested rewording 
 attached.
 
 Any objections out there to this doc patch? It rings true and
 reads well to me. I'd like to check another item off the 9.5 Open
 Items list..

Pushed to HEAD and 9.5

Joe




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LWLock deadlock and gdb advice

2015-07-30 Thread Jeff Janes
On Wed, Jul 29, 2015 at 6:10 AM, Andres Freund and...@anarazel.de wrote:

 On 2015-07-29 14:22:23 +0200, Andres Freund wrote:
  On 2015-07-29 15:14:23 +0300, Heikki Linnakangas wrote:
   Ah, ok, that should work, as long as you also re-check the variable's
 value
   after queueing. Want to write the patch, or should I?
 
  I'll try. Shouldn't be too hard.

 What do you think about something roughly like the attached?


I've not evaluated the code, but applying it does solve the problem I was
seeing.

Cheers,

Jeff


[HACKERS] TAP tests are badly named

2015-07-30 Thread Andrew Dunstan


We should describe test sets by what they test, not by how they test. 
TAP is a testing tool/protocol. The current set of tests we have test 
the programs in src/bin, and we should really name the test set by a 
name that reflects that, rather than the fact that we are using TAP 
tools to run the tests. What if we decide to test something else using 
TAP? Would we call that set of tests TAP tests too?


--enable-tap-tests is a reasonable configuration setting, because it's 
about whether or not we have a TAP testing framework available, but I 
think we should stop calling the bin tests TAP tests and we should 
change the test name in vcregress.pl to a more appropriate name. In the 
buildfarm I'm calling the step bin-check: 
http://www.pgbuildfarm.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2015-07-30%2012%3A25%3A58&stg=bin-check


Thoughts?

cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LWLock deadlock and gdb advice

2015-07-30 Thread Andres Freund
On 2015-07-30 17:36:52 +0300, Heikki Linnakangas wrote:
 In 9.4, LWLockAcquire holds the spinlock when it marks the lock as held,
 until it has updated the variable. And LWLockWaitForVar() holds the spinlock
 when it checks that the lock is held and that the variable's value matches.
 So it cannot happen on 9.4.

The first paragraph talked about the same value, but that was just
referring to it not yet having been cleared, I think...

 To reiterate, with 9.5, it's possible that a backend is sleeping in
 LWLockWaitForVar(oldvar=123), even though the lock is currently held by
 another backend with value 124. That seems wrong, or surprising at the very
 least.

With my patch that can't really happen that way though? The value is
re-checked after queuing. If it has changed by then we're done. And if
it hasn't yet changed we're guaranteed to be woken up once it's being
changed?


I generally don't mind adding some sort of flag clearing or such, but I'd
rather not have it in the retry loop in the general LWLockAttemptLock
path - I found that even very small numbers of instructions in there have a
measurable impact.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Volatility of pg_xact_commit_timestamp() and pg_last_committed_xact()

2015-07-30 Thread Alvaro Herrera
Alvaro Herrera wrote:
 Robert Haas wrote:
  On Thu, Jul 16, 2015 at 9:49 AM, Fujii Masao masao.fu...@gmail.com wrote:
   Volatilities of pg_xact_commit_timestamp() and pg_last_committed_xact()
   are now STABLE. But ISTM that those functions can return different results
   even within a single statement. So we should change the volatilities of 
   them
   to VOLATILE?
  
  Sounds right to me.
 
 Will fix.

Fix pushed.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] security labels on databases are bad for dump restore

2015-07-30 Thread Robert Haas
On Thu, Jul 30, 2015 at 2:49 AM, Noah Misch n...@leadboat.com wrote:
 What exact formula did you have in mind?  It must not be merely

 1. pg_dumpall -g
 2. pg_dump (without --create) per database

 which _never_ works: it emits no CREATE DATABASE statements.  Perhaps this?

 1. pg_dumpall -g
 2. Issue a handwritten CREATE DATABASE statement per database with correct
encoding, lc_ctype and lc_collate parameters.  All other database
properties can be wrong; the dump will fix them.
 3. pg_dump (without --create) per database

 That neglects numerous database properties today, but we could make it work.
 Given the problems I described upthread, it's an inferior formula that I
 recommend against propping up.  I much prefer making this work completely:

 1. pg_dumpall -g
 2. pg_dump --create per database

Gah, OK, I see your point.  But we better document this, because if
you need a PhD in PostgreSQL-ology to take a backup, we're not in a
good place.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort

2015-07-30 Thread Robert Haas
On Wed, Jul 29, 2015 at 9:05 PM, Peter Geoghegan p...@heroku.com wrote:
 The behavior of external sorts that do not require any merge step due
 to only having one run (what EXPLAIN ANALYZE output shows as an
 external sort, and not a merge sort) seems like an area that can
 be significantly improved upon. As noted in code comments, this
 optimization did not appear in The Art of Computer Programming, Volume
 III. It's not an unreasonable idea, but it doesn't work well on modern
 machines due to using heapsort, which is known to use the cache
 ineffectively. It also often implies significant additional I/O for
 little or no benefit. I suspect that all the reports we've heard of
 smaller work_mem sizes improving sort performance are actually down to
 this one-run optimization *hurting* performance.

Very interesting.  And great performance numbers.  Thanks for taking
the time to investigate this - really cool.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Reduce ProcArrayLock contention

2015-07-30 Thread Robert Haas
On Wed, Jul 29, 2015 at 2:18 PM, Andres Freund and...@anarazel.de wrote:
 On 2015-07-29 12:54:59 -0400, Robert Haas wrote:
 I would try to avoid changing lwlock.c.  It's pretty easy when so
 doing to create mechanisms that work now but make further upgrades to
 the general lwlock mechanism difficult.  I'd like to avoid that.

 I'm massively doubtful that re-implementing parts of lwlock.c is the
 better outcome. Then you have two different infrastructures you need to
 improve over time.

That is also true, but I don't think we're going to be duplicating
anything from lwlock.c in this case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort

2015-07-30 Thread Heikki Linnakangas

On 07/30/2015 04:05 AM, Peter Geoghegan wrote:

Patch -- quicksort with spillover
=

With the attached patch, I propose to add an additional, better one
run special case optimization. This new special case splits the
single run into 2 subruns. One of the runs is comprised of whatever
tuples were in memory when the caller finished passing tuples to
tuplesort. To sort that, we use quicksort, which in general has
various properties that make it much faster than heapsort -- it's a
cache oblivious algorithm, which is important these days. The other
subrun is whatever tuples were on-tape when tuplesort_performsort()
was called. This will often be a minority of the total, but certainly
never much more than half. This is already sorted when
tuplesort_performsort() is reached. This spillover is already inserted
at the front of the sorted-on-tape tuples, and so already has
reasonably good cache characteristics.

With the patch, we perform an on-the-fly merge that is somewhat
similar to the existing (unaffected) merge sort TSS_FINALMERGE case,
except that one of the runs is in memory, and is potentially much
larger than the on-tape/disk run (but never much smaller), and is
quicksorted. The existing merge sort case similarly is only used
when the caller specified !randomAccess.


Hmm. You don't really need to merge the in-memory array into the tape, 
as you know that all the tuples in the in-memory must come after the 
tuples already on the tape. You can just return all the tuples from the 
tape first, and then all the tuples from the array.


So here's a shorter/different explanation of this optimization: When 
it's time to perform the sort, instead of draining the in-memory heap 
one tuple at a time to the last tape, you sort the heap with quicksort, 
and pretend that the sorted heap belongs to the last tape, after all the 
other tuples in the tape.


Some questions/thoughts on that:

Isn't that optimization applicable even when you have multiple runs? 
Quicksorting the heap and keeping it as an array in memory is surely 
always faster than heapsorting and pushing it to the tape.


I think it'd make sense to structure the code differently, to match the 
way I described this optimization above. Instead of adding a new 
tuplesort state for this, abstract this in the logical tape code. Add a 
function to attach an in-memory tail to a tape, and have 
LogicalTapeRead() read from the tail after reading the on-disk tape. The 
rest of the code wouldn't need to care that sometimes part of the tape 
is actually in memory.
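
For what it's worth, here is a toy sketch of that reading order in plain C. 
It has nothing to do with the actual logtape.c/tuplesort.c code; the struct, 
the function names and the integer "tuples" are all made up for illustration. 
The point is only that the reader needs no merge step, because the in-memory 
tail is known to sort after everything already on the tape; in the real thing 
the on-disk part would still be read through LogicalTapeRead().

#include <stdio.h>
#include <stdlib.h>

/* Toy "tape": a sorted on-disk part plus an optional in-memory tail. */
typedef struct ToyTape
{
    const int  *disk;       /* tuples already written out, in sorted order */
    size_t      ndisk;
    int        *tail;       /* quicksorted in-memory tail attached to the tape */
    size_t      ntail;
    size_t      pos;        /* read position across both parts */
} ToyTape;

/* Read the next tuple: the on-disk part first, then fall through to the tail. */
static int
toy_tape_read(ToyTape *tape, int *value)
{
    if (tape->pos < tape->ndisk)
        *value = tape->disk[tape->pos++];
    else if (tape->pos < tape->ndisk + tape->ntail)
        *value = tape->tail[tape->pos++ - tape->ndisk];
    else
        return 0;           /* end of tape */
    return 1;
}

static int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a;
    int y = *(const int *) b;

    return (x > y) - (x < y);
}

int
main(void)
{
    int     disk[] = {1, 3, 4, 7};      /* already sorted and "on tape" */
    int     mem[] = {12, 9, 15, 8};     /* unsorted in-memory tuples, all > 7 */
    ToyTape tape = {disk, 4, mem, 4, 0};
    int     v;

    /* quicksort the in-memory part instead of heap-draining it to the tape */
    qsort(mem, 4, sizeof(int), cmp_int);

    while (toy_tape_read(&tape, &v))
        printf("%d ", v);               /* prints 1 3 4 7 8 9 12 15 */
    printf("\n");
    return 0;
}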


It should be pretty easy to support randomAccess too. If you think of 
the in-memory heap as a tail of the last tape, you can easily move 
backwards from the in-memory heap back to the on-disk tape, too.



+* Note that there might actually be 2 runs, but only the
+* contents of one of them went to tape, and so we can
+* safely pretend that there is only 1 run (since we're
+* about to give up on the idea of the memtuples array being
+* a heap).  This means that if our sort happened to require
+* random access, the similar single run optimization
+* below (which sets TSS_SORTEDONTAPE) might not be used at
+* all.  This is because dumping all tuples out might have
+* forced an otherwise equivalent randomAccess case to
+* acknowledge a second run, which we can avoid.


Is that really true? We don't start a second run until we have to, i.e. 
when it's time to dump the first tuple of the second run to tape. So I 
don't think the case you describe above, where you have two runs but 
only one of them has tuples on disk, can actually happen.



Performance
==


Impressive!


Predictability
==


Even more impressive!


Future work
=


As an extra optimization, you could delay quicksorting the in-memory 
array until it's time to read the first tuple from it. If the caller 
reads only the top-N tuples from the sort for some reason (other than 
LIMIT, which we already optimize for), that could avoid a lot of work.
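
A hedged sketch of that lazy-sort idea, again as a standalone toy in C rather 
than anything resembling the tuplesort code (the flag and names are invented): 
defer the qsort() until the first read, so that if the in-memory portion is 
never read, no sorting work is done for it at all.

#include <stdio.h>
#include <stdlib.h>

typedef struct LazyArray
{
    int        *items;
    size_t      count;
    size_t      next;       /* read cursor */
    int         sorted;     /* quicksort deferred until the first read */
} LazyArray;

static int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a;
    int y = *(const int *) b;

    return (x > y) - (x < y);
}

/* Return the next item in sorted order, sorting lazily on the first call. */
static int
lazy_array_next(LazyArray *la, int *value)
{
    if (!la->sorted)
    {
        qsort(la->items, la->count, sizeof(int), cmp_int);
        la->sorted = 1;
    }
    if (la->next >= la->count)
        return 0;
    *value = la->items[la->next++];
    return 1;
}

int
main(void)
{
    int         data[] = {5, 2, 9, 1};
    LazyArray   la = {data, 4, 0, 0};
    int         v;

    while (lazy_array_next(&la, &v))
        printf("%d ", v);   /* prints 1 2 5 9 */
    printf("\n");
    return 0;
}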


- Heikki





Re: [HACKERS] LWLock deadlock and gdb advice

2015-07-30 Thread Andres Freund
On 2015-07-29 20:23:24 +0300, Heikki Linnakangas wrote:
 Backend A has called LWLockWaitForVar(X) on a lock, and is now waiting on
 it. The lock holder releases the lock, and wakes up A. But before A wakes up
 and sees that the lock is free, another backend acquires the lock again. It
 runs LWLockAcquireWithVar to the point just before setting the variable's
 value. Now A wakes up, sees that the lock is still (or again) held, and that
 the variable's value still matches the old one, and goes back to sleep. The
 new lock holder won't wake it up until it updates the value again, or to
 releases the lock.

I'm not sure whether this observation is about my patch or the general
lwlock variable mechanism. In my opinion that behaviour exists today
both in 9.4 and 9.5.

But I think that's fine because that race seems pretty
fundamental. After all, you could have called LWLockWaitForVar() just
after the second locker had set the variable to the same value.


 You didn't like the new LW_FLAG_VAR_SET flag and the API changes I
 proposed?

I do slightly dislike the additional bit of math in AttemptLock. But
what I really disliked about the patch is the reliance on holding the
spinlock over longer times and conditionally acquiring spinlocks.  I
didn't see a need for the flags change to fix the issue at hand, that's
why I didn't incorporate it. I'm not fundamentally against it.

 Either way, there is a race condition that if the new lock holder sets the
 variable to the *same* value as before, the old waiter won't necessarily
 wake up even though the lock was free or had a different value in between.
 But that's easier to explain and understand than the fact that the value set
 by LWLockAcquireWithVar() might not be seen by a waiter, until you release
 the lock or update it again.

I can't really see a use for the API that'd allow that and care about the
waits. Because quite fundamentally you could just as well have started waiting
at the same time the variable was set to the value.

 Another idea is to have LWLockAcquire check the wait queue after setting the
 variable, and wake up anyone who's already queued up.

Ugh, not a fan of that.

 The changes in LWLockWaitForVar() and LWLockConflictsWithVar() seem OK in
 principle. You'll want to change LWLockConflictsWithVar() so that the
 spinlock is held only over the value = *valptr line, though; the other
 stuff just modifies local variables and don't need to be protected by the
 spinlock.

Good point.

 Also, when you enter LWLockWaitForVar(), you're checking if the
 lock is held twice in a row; first at the top of the function, and again
 inside LWLockConflictsWithVar. You could just remove the quick test at the
 top.

Yea, I was thinking about removing it. The first check was there
previously, so I left it in place. We do execute a bit more code once
we've disabled interrupts (extraWaits, re-enabling interrupts). I don't
think it'll matter much, but it seemed like an independent thing.

Greetings,

Andres Freund




Re: [HACKERS] Reduce ProcArrayLock contention

2015-07-30 Thread Andres Freund
On 2015-07-29 12:54:59 -0400, Robert Haas wrote:
 I would try to avoid changing lwlock.c.  It's pretty easy when so
 doing to create mechanisms that work now but make further upgrades to
 the general lwlock mechanism difficult.  I'd like to avoid that.

I'm massively doubtful that re-implementing parts of lwlock.c is the
better outcome. Then you have two different infrastructures you need to
improve over time.




Re: [HACKERS] security labels on databases are bad for dump & restore

2015-07-30 Thread Alvaro Herrera
Noah Misch wrote:

 What exact formula did you have in mind?  It must not be merely
 
 1. pg_dumpall -g
 2. pg_dump (without --create) per database
 
 which _never_ works: it emits no CREATE DATABASE statements.  Perhaps this?
 
 1. pg_dumpall -g
 2. Issue a handwritten CREATE DATABASE statement per database with correct
encoding, lc_ctype and lc_collate parameters.  All other database
properties can be wrong; the dump will fix them.
 3. pg_dump (without --create) per database
 
 That neglects numerous database properties today, but we could make it work.
 Given the problems I described upthread, it's an inferior formula that I
 recommend against propping up.

Agreed, and IMO it's embarrassing that it's so complicated to get a fully
working backup.

 I much prefer making this work completely:
 
 1. pg_dumpall -g
 2. pg_dump --create per database

My full support for this proposal.

 Another formula I wouldn't mind offering:
 
 1. pg_dumpall -g
 2. pg_dumpall --empty-databases
 3. pg_dump (without --create) per database
 
 Code for an --empty-databases option already exists for pg_dumpall -g
 --binary-upgrade.  A patch turning that into a user-facing feature might be
 quite compact.

I don't mind if this one is also made to work, but I don't care about
this case all that much.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Proposal: backend niceness / session_priority

2015-07-30 Thread Tom Lane
José Luis Tallón jltal...@adv-solutions.net writes:
  Since PostgreSQL lacks the resource management capabilities of the 
 Big Ones ( Resource Groups - Red, WorkLoad Manager - Blue ) or the 
 Resource Governor in MS SQL Server, we can try and approximate the 
 requested behaviour by reducing the CPU priority (nice) of the backend 
 in question. Please note that we would be using scheduler priority to 
 try and modulate I/O, though I'm aware of the limitations of this mechanism.

This has been proposed before, and rejected before, and I'm not seeing
anything particularly new here.  Without a credible mechanism for
throttling I/O, nice alone does not seem very promising.

regards, tom lane




Re: [HACKERS] pgbench stats per script & other stuff

2015-07-30 Thread Fabien COELHO



v6 is just a rebase after a bug fix by Andres Freund.

Also a small question: The patch currently displays pgbench scripts 
starting numbering at 0. Probably a little too geek... should start at 
1?


v7 is a rebase after another small bug fix in pgbench.

--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 2517a3a..99670d4 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -261,6 +261,23 @@ pgbench optional replaceableoptions/ /optional replaceabledbname/
 benchmarking arguments:
 
 variablelist
+ varlistentry
+  termoption-b/ replaceablescriptname[@weight]//term
+  termoption--builtin/ replaceablescriptname[@weight]//term
+  listitem
+   para
+Add the specified builtin script to the list of executed scripts.
+An optional integer weight after literal@/ allows to adjust the
+probability of drawing the test.
+Available builtin scripts are: literaltpcb-like/,
+literalsimple-update/ and literalselect-only/.
+The provided repleacablescriptname/ needs only to be a prefix
+of the builtin name, hence literalsimp/ would be enough to select
+literalsimple-update/.
+   /para
+  /listitem
+ /varlistentry
+
 
  varlistentry
   termoption-c/option replaceableclients//term
@@ -307,14 +324,15 @@ pgbench optional replaceableoptions/ /optional replaceabledbname/
  /varlistentry
 
  varlistentry
-  termoption-f/option replaceablefilename//term
-  termoption--file=/optionreplaceablefilename//term
+  termoption-f/ replaceablefilename[@weight]//term
+  termoption--file=/replaceablefilename[@weight]//term
   listitem
para
-Read transaction script from replaceablefilename/.
+Add a transaction script read from replaceablefilename/ to
+the list of executed scripts.
+An optional integer weight after literal@/ allows to adjust the
+probability of drawing the test.
 See below for details.
-option-N/option, option-S/option, and option-f/option
-are mutually exclusive.
/para
   /listitem
  /varlistentry
@@ -404,10 +422,7 @@ pgbench optional replaceableoptions/ /optional replaceabledbname/
   termoption--skip-some-updates/option/term
   listitem
para
-Do not update structnamepgbench_tellers/ and
-structnamepgbench_branches/.
-This will avoid update contention on these tables, but
-it makes the test case even less like TPC-B.
+Shorthand for option-b simple-update@1/.
/para
   /listitem
  /varlistentry
@@ -499,9 +514,9 @@ pgbench optional replaceableoptions/ /optional replaceabledbname/
 Report the specified scale factor in applicationpgbench/'s
 output.  With the built-in tests, this is not necessary; the
 correct scale factor will be detected by counting the number of
-rows in the structnamepgbench_branches/ table.  However, when testing
-custom benchmarks (option-f/ option), the scale factor
-will be reported as 1 unless this option is used.
+rows in the structnamepgbench_branches/ table.
+However, when testing only custom benchmarks (option-f/ option),
+the scale factor will be reported as 1 unless this option is used.
/para
   /listitem
  /varlistentry
@@ -511,7 +526,7 @@ pgbench optional replaceableoptions/ /optional replaceabledbname/
   termoption--select-only/option/term
   listitem
para
-Perform select-only transactions instead of TPC-B-like test.
+Shorthand for option-b select-only@1/.
/para
   /listitem
  /varlistentry
@@ -567,6 +582,16 @@ pgbench optional replaceableoptions/ /optional replaceabledbname/
  /varlistentry
 
  varlistentry
+  termoption--per-script-stats/option/term
+  listitem
+   para
+Report some statistics per script run by pgbench.
+   /para
+  /listitem
+ /varlistentry
+
+
+ varlistentry
   termoption--sampling-rate=replaceablerate//option/term
   listitem
para
@@ -661,7 +686,20 @@ pgbench optional replaceableoptions/ /optional replaceabledbname/
   titleWhat is the quoteTransaction/ Actually Performed in pgbench?/title
 
   para
-   The default transaction script issues seven commands per transaction:
+   Pgbench executes test scripts chosen randomly from a specified list.
+   They include built-in scripts with option-b/ and
+   user-provided custom scripts with option-f/.
+   Each script may be given a relative weight specified after a
+   literal@/ so as to change its drawing probability.
+   The default weight is literal1/.
+ /para
+
+  para
+   The default builtin transaction script (also invoked with option-b tpcb-like/)
+   issues seven commands per transaction over randomly chosen literalaid/,
+   literaltid/, literalbid/ and 

Re: [HACKERS] multivariate statistics / patch v7

2015-07-30 Thread Heikki Linnakangas

On 07/30/2015 03:55 PM, Tomas Vondra wrote:

On 07/30/2015 10:21 AM, Heikki Linnakangas wrote:

I have some doubts about the clause reduction and functional
dependencies part of this. It seems to treat functional dependency as
a boolean property, but even with the classic zipcode and city case,
it's not always an all or nothing thing. At least in some countries,
there can be zipcodes that span multiple cities. So zipcode=X does
not completely imply city=Y, although there is a strong correlation
(if that's the right term). How strong does the correlation need to
be for this patch to decide that zipcode implies city? I couldn't
actually see a clear threshold stated anywhere.

So rather than treating functional dependence as a boolean, I think
it would make more sense to put a 0.0-1.0 number to it. That means
that you can't do clause reduction like it's done in this patch,
where you actually remove clauses from the query for cost esimation
purposes. Instead, you need to calculate the selectivity for each
clause independently, but instead of just multiplying the
selectivities together, apply the dependence factor to it.

Does that make sense? I haven't really looked at the MCV, histogram
and multi-statistics estimation patches yet. Do those patches make
the clause reduction patch obsolete? Should we forget about the
clause reduction and functional dependency patch, and focus on those
later patches instead?


Perhaps. It's true that most real-world data sets are not 100% valid
with respect to functional dependencies - either because of natural
imperfections (multiple cities with the same ZIP code) or just noise in
the data (incorrect entries ...). And it's even mentioned in the code
comments somewhere, I guess.

But there are two main reasons why I chose not to extend the functional
dependencies with the [0.0-1.0] value you propose.

Firstly, functional dependencies were meant to be the simplest possible
implementation, illustrating how the infrastructure is supposed to
work (which is the main topic of the first patch).

Secondly, all kinds of statistics are simplifications of the actual
data. So I think it's not incorrect to ignore the exceptions up to some
threshold.


The problem with a threshold is that around that threshold, even a small 
change in the data set can drastically change the produced estimates. 
For example, imagine that we know from the stats that zip code implies 
city. But then someone adds a single row to the table with an odd zip 
code  city combination, which pushes the estimator over the threshold, 
and the columns are no longer considered dependent, and the estimates 
are now completely different. We should avoid steep cliffs like that.
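
For illustration only (this is my sketch of how such a factor could be 
applied, not something the current patch computes): with a continuous 
dependency degree d in [0,1] between the implying column A and the implied 
column B, the combined selectivity could be formed as

    sel(A and B) = sel(A) * (d + (1 - d) * sel(B))

At d = 1 this reduces to sel(A), i.e. full implication as in the current 
clause-reduction behaviour; at d = 0 it reduces to the usual independence 
assumption sel(A) * sel(B); and adding a few odd rows only nudges d, and 
hence the estimate, slightly, so there is no cliff.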


BTW, what is the threshold in the current patch?

- Heikki




[HACKERS] Proposal: backend niceness / session_priority

2015-07-30 Thread José Luis Tallón

Hackers,

I have found myself needing to run some maintenance routines 
(VACUUM, REINDEX, REFRESH MATERIALIZED VIEW mostly) at a lower priority 
so as not to disturb concurrent *highly transactional* connections. This 
issue is also noted within the TODO [0] list in the Wiki.


* There was some discussion in 2007 [1] regarding Priorities for users 
or queries? 
http://archives.postgresql.org/pgsql-general/2007-02/msg00493.php


Since PostgreSQL lacks the resource management capabilities of the 
Big Ones ( Resource Groups - Red, WorkLoad Manager - Blue ) or the 
Resource Governor in MS SQL Server, we can try and approximate the 
requested behaviour by reducing the CPU priority (nice) of the backend 
in question. Please note that we would be using scheduler priority to 
try and modulate I/O, though I'm aware of the limitations of this mechanism.


Using renice(1) from outside is not only cumbersome and error-prone 
but very much unusable for the use cases I am contemplating.



* Moreover, as seen in the Priorities wiki page [2], there exists an 
extension providing a set_backend_priority() function, to be called 
set_backend_priority(pg_backend_pid(), 20).
This approach is, sadly, not portable to non-POSIX operating systems 
(e.g. Windows), and IMO quite too convoluted to use and tied to actual 
implementation details.




* I have been playing with some code which uses a GUC for this purpose, 
though IMO only defining/supporting three different priorities would make 
sense for the final implementation: NORMAL, LOW_PRIORITY, IDLE.
I have checked platform compatibility too: this behaviour can be 
implemented on Windows as well. For everything else, there's nice(2).
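
Just to make the mechanism concrete, below is a minimal sketch of the 
priority-lowering call itself. This is my illustration, not the actual patch: 
error reporting, the GUC plumbing and the exact mapping to Windows priority 
classes are all hand-waved.

#include <stdbool.h>

#ifndef _WIN32
#include <sys/resource.h>       /* setpriority() */
#else
#include <windows.h>            /* SetPriorityClass() */
#endif

/* Apply a niceness value to the current backend process. */
static bool
set_backend_niceness(int niceness)
{
#ifndef _WIN32
    /* who = 0 means the calling process, i.e. this backend */
    return setpriority(PRIO_PROCESS, 0, niceness) == 0;
#else
    /* illustrative mapping of NORMAL / LOW_PRIORITY / IDLE to Win32 classes */
    DWORD   cls = (niceness <= 0) ? NORMAL_PRIORITY_CLASS :
                  (niceness < 19) ? BELOW_NORMAL_PRIORITY_CLASS :
                                    IDLE_PRIORITY_CLASS;

    return SetPriorityClass(GetCurrentProcess(), cls) != 0;
#endif
}

int
main(void)
{
    /* e.g. SET session_priority TO 'low' might map to niceness 10 */
    return set_backend_niceness(10) ? 0 : 1;
}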




However, there is a relatively minor catch here which is the reason 
behind this e-mail: user interface


- Inventing a new command seems overkill to me. Plus, I don't know 
what we could model it on --- given that the real solution for this 
problem would be a fully featured priority manager ---


- I have been playing with a GUC that ignores being reset --- so as to 
comply with nice's specification when not running as a privileged user 
--- but I reckon that this behaviour might be surprising at best:

SET session_priority TO 'low';    -- Ok, low priority
VACUUM FREEZE my_test_table;
RESET session_priority;           -- Nope, still low prio. Emit notice?

The way to reset the priority would be to RECONNECT. And this is my 
main pain point, though it does fulfill the need.



However, this approach does fulfill my needs and ---it seems--- 
the OP's needs: being able to run a maintenance task at a low priority 
(i.e. disturbing other concurrent queries as little as possible). 
Expected use case: a cron job running psql -c 'SET session_priority TO 
low; REINDEX blabla CONCURRENTLY; VACUUM foobar;'



All suggestions welcome.

I'll be wrapping up a more-or-less-done patch on Monday if somebody wants 
to take a look and criticize the actual code (I won't be working on this 
tomorrow), unless somebody points me at a better solution.



Thanks,

/ J.L.



[0] https://wiki.postgresql.org/wiki/Todo - Miscellaneous performance
[1] http://archives.postgresql.org/pgsql-general/2007-02/msg00493.php
[2] https://wiki.postgresql.org/wiki/Priorities

[3] http://docs.oracle.com/cd/E11882_01/server.112/e25494/dbrm.htm
[4] 
http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.1.0/com.ibm.db2.luw.doc/com.ibm.db2.luw.doc-gentopic6.html

[5] https://msdn.microsoft.com/en-us/library/bb933866.aspx


Re: [HACKERS] Division by zero in selfuncs.c:estimate_hash_bucketsize()

2015-07-30 Thread Tom Lane
Piotr Stefaniak postg...@piotr-stefaniak.me writes:
 the two asserts below will fail with this query (ran against the 
 regression db):

I've applied fixes for this and the other thing.  Thanks for the report!

regards, tom lane




Re: [HACKERS] [COMMITTERS] pgsql: Row-Level Security Policies (RLS)

2015-07-30 Thread Joe Conway

On 07/30/2015 06:52 AM, Dean Rasheed wrote:
 On 30 July 2015 at 01:35, Joe Conway m...@joeconway.com wrote:
 On 06/01/2015 02:21 AM, Dean Rasheed wrote:
 While going through this, I spotted another issue --- in a DML 
 query with additional non-target relations, such as UPDATE t1
 .. FROM t2 .., the old code was checking the UPDATE policies of
 both t1 and t2, but really I think it ought to be checking the
 SELECT policies of t2 (in the same way as this query requires
 SELECT table permissions on t2, not UPDATE permissions). I've
 changed that and added new regression tests to test that
 change.
 
 I assume the entire refactoring patch needs a fair bit of work
 to rebase against current HEAD,
 
 Actually, there haven't been any conflicting changes so far, so a
 git rebase was able to automatically merge correctly -- new patch 
 attached, with some minor comment rewording (not affecting the
 bug-fix part).

Good to hear.

 Even so, I agree that it makes sense to apply the bug-fix
 separately, since it's not really anything to do with the
 refactoring.
 
 but I picked out the attached to address just the above issue.
 Does this look correct, and if so does it make sense to apply at
 least this part right now?
 
 Looks correct to me.

Thanks -- committed and pushed to HEAD and 9.5

-- 
Joe Conway




Re: [HACKERS] dblink: add polymorphic functions.

2015-07-30 Thread Alvaro Herrera
Tom Lane wrote:
 Joe Conway m...@joeconway.com writes:
  What about just TYPE then?
 
  SELECT x::TYPE(some_expression) FROM ...
  SELECT CAST (x AS TYPE(some_expression)) FROM ...
 
 Yeah, that would work.  Quick-hack proof-of-concept patch attached.

I'm amazed that this works without hacking the grammar itself, but in
retrospect that's expected.

 + else if (pstate != NULL 
 +  list_length(typeName-typmods) == 1 
 +  list_length(typeName-names) == 1 
 +  strcmp(strVal(linitial(typeName-names)), type) == 0)
 + {
 + /* TYPE(expression) notation */
 + Node   *typexpr = (Node *) linitial(typeName-typmods);

This is rather ugly, but I guess not untenable.  I suppose we don't
care about any actual typmod, do we?  Will this be of any use with the
PostGIS types and such, for which the typmod is not merely a size limit?
Also INTERVAL has some funny typmod rules, not sure if that affects
usage of this construct.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Planner debug views

2015-07-30 Thread Jim Nasby

On 7/29/15 2:40 PM, Alvaro Herrera wrote:

Qingqing Zhou wrote:


Can we simplify above with foreign table methods? There are two major
concerns about this method per previous discussions: security and
usability. I think the main cause is the sharing foreign table design.


I think foreign data wrappers are great.  I do not think that we should
try to shape every problem to look like foreign data so that we can
solve it with a foreign data wrapper.  I am a bit nervous that this
keeps being brought up.


Agreed.

I think a better option would be shoving it into a backend tuplestore 
and just leaving it there (maybe with a command to clear it for the 
paranoid). That gives a relation you can query against, insert into 
another table, etc.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] dblink: add polymorphic functions.

2015-07-30 Thread Tom Lane
Alvaro Herrera alvhe...@2ndquadrant.com writes:
 Tom Lane wrote:
 Yeah, that would work.  Quick-hack proof-of-concept patch attached.

 This is a rather ugly, but I guess not untenable.  I suppose we don't
 care about any actual typmod, do we?.

We get the type and typmod both from the expression.  Example:

regression=# create table ref (f1 varchar, f2 varchar(3));
CREATE TABLE
regression=# insert into ref values('1','2');
INSERT 0 1
regression=# select '1234567890'::type(f1) from ref;
    type    
------------
 1234567890
(1 row)

regression=# select '1234567890'::type(f2) from ref;
 type 
------
 123
(1 row)

 Will this be of any use with the
 PostGIS types and such, for which the typmod is not merely a size limit?

Don't see why not.  They still have to follow the rule that typmod is a
static property of an expression.

 Also INTERVAL has some funny typmod rules, not sure if that affects
 usage of this construct.

I would not think so.  The weirdness about INTERVAL mainly has to do with
SQL's, ahem, inventive collection of ways to write an interval constant,
and that wouldn't really be an issue for any practical use of this
construct AFAICS.  Whatever you write as the expression, we're going to
be able to reduce to a type OID and typmod.

regards, tom lane




Re: [HACKERS] Failing assertions in indxpath.c, placeholder.c and brin_minmax.c

2015-07-30 Thread Alvaro Herrera
Tom Lane wrote:
 Alvaro Herrera alvhe...@2ndquadrant.com writes:
  Tom Lane wrote:
  Bottom line is that somebody failed to consider the possibility of a
  null comparison value reaching the BRIN index lookup machinery.
  The code stanza that's failing supposes that only IS NULL or IS NOT NULL
  tests could have SK_ISNULL set, but that's just wrong.
 
  I think the easiest way to solve this is to consider that all indexable
  operators are strict, and have the function return false in that case.
  The attached patch implements that.
 
 This looks fine to me as a localized fix.  I was wondering whether we
 could short-circuit the index lookup further upstream, but I take it from
 your comment about _bt_preprocess_keys that BRIN has no convenient place
 for that today.  (Even if it did, I'd still vote for making this change,
 for safety's sake.)

Yeah, it doesn't currently.  Hopefully we will improve that in the
future.  Particularly with regards to array keys I think there's a lot
to be done there.

I pushed this as is.  I hesitated about adding a regression test, but it
didn't seem worthwhile in the end, because if in the future we improve
scan key analysis, we will need much more extensive testing, and this
doesn't look like the type of bug that we could reintroduce on a whim.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );

2015-07-30 Thread Michael Paquier
On Fri, Jul 31, 2015 at 8:30 AM, Fabrízio de Royes Mello
fabriziome...@gmail.com wrote:

 On Fri, Jul 24, 2015 at 4:05 AM, Michael Paquier michael.paqu...@gmail.com
 wrote:

 On Fri, Jul 24, 2015 at 7:11 AM, Fabrízio de Royes Mello
 fabriziome...@gmail.com wrote:
  On Thu, Jul 2, 2015 at 2:03 PM, Simon Riggs si...@2ndquadrant.com
  wrote:
  Looks functionally complete
 
  Need a test to show that ALTER TABLE works on views, as discussed on
  this
  thread. And confirmation that pg_dump is not broken by this.
 
  Message-ID: 20140321205828.gb3969...@tornado.leadboat.com
 
 
  Added more test cases to cover ALTER TABLE on views.
 
  I'm thinking about the isolation tests, what about add another
  'alter-table'
  spec for isolation tests enabling and disabling 'autovacuum' options?

 Yes, please.


 Added. I really don't know if my isolation tests are completely correct, is
 my first time writing this kind of tests.

The patch size has increased from 16k to 157k because of the output
of the isolation tests you just added. This is definitely too large
and actually the test coverage is rather limited. Hence I think that
they should be changed as follows:
- Use only one table, the locks of tables a and b do not conflict, and
that's what we want to look at here.
- Check the locks of all the relation parameters, by that I mean as
well fillfactor and user_catalog_table which still take
AccessExclusiveLock on the relation
- Use custom permutations. I doubt that there is much sense to test
all the permutations in this context, and this will reduce the
expected output size drastically.

On a further note, I would recommend not using the same string name
for the session and the query identifiers. And I think that inserting
only one tuple at initialization is just fine.

Here is an example of such a spec:
setup
{
 CREATE TABLE a (id int PRIMARY KEY);
 INSERT INTO a SELECT generate_series(1,100);
}
teardown
{
 DROP TABLE a;
}
session s1
step b1   { BEGIN; }
# TODO add one query per parameter
step at11 { ALTER TABLE a SET (fillfactor=10); }
step at12 { ALTER TABLE a SET (autovacuum_vacuum_scale_factor=0.001); }
step c1   { COMMIT; }
session s2
step b2   { BEGIN; }
step wx1  { UPDATE a SET id = id + 1; }
step c2   { COMMIT; }

And by testing permutations like the one below, for example, it is possible to
see which session is waiting for what. Input:
permutation b1 b2 at11 wx1 c1 c2
Output where session 2 waits for lock taken after fillfactor update:
step b1: BEGIN;
step b2: BEGIN;
step at11: ALTER TABLE a SET (fillfactor=10);
step wx1: UPDATE a SET id = id + 1; waiting ...
step c1: COMMIT;
step wx1: ... completed
step c2: COMMIT;

Be careful as well not to include incorrect permutations in the
expected output file; the isolation tester is smart enough to ping you
about that.

 +GetRelOptionsLockLevel(List *defList)
 +{
+   LOCKMODE    lockmode = NoLock;
 Shouldn't this default to AccessExclusiveLock instead of NoLock?


 No, because it will break the logic below (get the highest lock level
 according to the defList), but I force it to return an AccessExclusiveLock if
 defList == NIL.

Yep, OK with this change.
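
For readers following along, the logic being described is roughly the 
following. This is only a toy restatement: the lock levels, the option table 
and the function names are invented stand-ins for the DefElem list and the 
reloptions machinery the patch actually works with.

#include <stdio.h>
#include <string.h>

typedef enum
{
    TOY_NO_LOCK = 0,
    TOY_SHARE_UPDATE_EXCLUSIVE = 4,
    TOY_ACCESS_EXCLUSIVE = 8
} ToyLockMode;

typedef struct
{
    const char *name;
    ToyLockMode lockmode;
} ToyRelOption;

/* per-option lock levels, as discussed in this thread */
static const ToyRelOption toy_options[] = {
    {"fillfactor", TOY_ACCESS_EXCLUSIVE},
    {"user_catalog_table", TOY_ACCESS_EXCLUSIVE},
    {"autovacuum_enabled", TOY_SHARE_UPDATE_EXCLUSIVE},
    {"autovacuum_vacuum_scale_factor", TOY_SHARE_UPDATE_EXCLUSIVE},
};

/* Take the highest lock level among the options being set. */
static ToyLockMode
toy_get_options_lock_level(const char **defnames, int ndefs)
{
    ToyLockMode lockmode = TOY_NO_LOCK;
    int         i,
                j;

    if (ndefs == 0)
        return TOY_ACCESS_EXCLUSIVE;    /* defList == NIL: stay conservative */

    for (i = 0; i < ndefs; i++)
        for (j = 0; j < (int) (sizeof(toy_options) / sizeof(toy_options[0])); j++)
            if (strcmp(defnames[i], toy_options[j].name) == 0 &&
                toy_options[j].lockmode > lockmode)
                lockmode = toy_options[j].lockmode;

    return lockmode;
}

int
main(void)
{
    const char *defs[] = {"autovacuum_enabled", "fillfactor"};

    /* fillfactor still needs AccessExclusiveLock, so that level wins: prints 8 */
    printf("%d\n", toy_get_options_lock_level(defs, 2));
    return 0;
}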

The rest of the patch looks good to me, so once the isolation test
coverage is done I think that it could be put in the hands of a
committer.
Regards
-- 
Michael




Re: [HACKERS] Remaining 'needs review' patchs in July commitfest

2015-07-30 Thread Robert Haas
On Wed, Jul 29, 2015 at 5:08 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Tue, Jul 28, 2015 at 3:51 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
 plpgsql raise statement with context
 Impasse. Everyone wants this feature in some form, but no consensus on
 whether to do this client-side or server-side.

 +1 for server-side.  Does anyone other than you even think that the
 client side is a reasonable way to go?

 Yes.  This is presupposing on the server side what the client will want
 to display.

Fair enough.  I'm still not convinced we're doing anything other than
complicating what ought to be a simple matter.  It is just a fact that
logging tracing messages in PL/pgsql functions is a pain in the butt
right now in some situations because you get a huge number of CONTEXT
lines that you don't want.  Can we agree on some solution to that
problem without over-engineering this to infinity?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pg_dump quietly ignore missing tables - is it bug?

2015-07-30 Thread Heikki Linnakangas

On 07/25/2015 07:08 PM, Pavel Stehule wrote:

I am sending a new patch - without checking wildcard chars.


The documentation says the option is called --strict-names, while the 
code has --strict-mode. I like --strict-names more; mode seems 
redundant, and it's not clear what it's strict about.


For symmetry, it would be good to also support this option in 
pg_restore. It seems even more useful there.


Can we do better than issuing a separate query for each table/schema 
name? The performance of this isn't very important, but still it seems 
like you could fairly easily refactor the code to avoid that. Perhaps 
return an extra constant for part of the UNION to distinguish which 
result row came from which pattern, and check that at least one row is 
returned for each.


- Heikki





Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort

2015-07-30 Thread Simon Riggs
On 30 July 2015 at 08:00, Heikki Linnakangas hlinn...@iki.fi wrote:


 Hmm. You don't really need to merge the in-memory array into the tape, as
 you know that all the tuples in the in-memory must come after the tuples
 already on the tape. You can just return all the tuples from the tape
 first, and then all the tuples from the array.


Agreed

This is a good optimization for the common case where tuples are mostly
already in order. We could increase the usefulness of this by making UPDATE
pick blocks that are close to a tuple's original block, rather than putting
them near the end of a relation.


 So here's a shorter/different explanation of this optimization: When it's
 time to perform the sort, instead of draining the in-memory heap one tuple
 at a time to the last tape, you sort the heap with quicksort, and pretend
 that the sorted heap belongs to the last tape, after all the other tuples
 in the tape.

 Some questions/thoughts on that:

 Isn't that optimization applicable even when you have multiple runs?
 Quicksorting the heap and keeping it as an array in memory is surely always
 faster than heapsorting and pushing it to the tape.


It's about use of memory. If you have multiple runs on tape, then they will
need to be merged and you need memory to do that efficiently. If there are
tuples in the last batch still in memory then it can work, but it depends
upon how full memory is from the last batch and how many batches there are.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Autonomous Transaction is back

2015-07-30 Thread Rajeev rastogi
On 28 July 2015 03:21, Josh Berkus Wrote:
 
On 07/27/2015 02:47 AM, Rajeev rastogi wrote:
 Why have any fixed maximum?
 Since we are planning to have nested autonomous transaction, so it is 
 required to have limit on this so that resources can be controlled.

Is there a particular reason why this limit wouldn't just be max_stack_depth?

We will need to allocate some initial resources in order to handle all 
nested autonomous transactions. 
So I think it is better to have a separate configuration parameter.

Thanks and Regards,
Kumar Rajeev Rastogi



Re: [HACKERS] Configurable location for extension .control files

2015-07-30 Thread Heikki Linnakangas

On 07/02/2015 11:37 PM, Oskari Saarenmaa wrote:

I'm somewhat interested in both #1  #2 for other projects, but I wrote
this patch to address #3, i.e. to simplify the test setup we have in
place for pgmemcache
(https://github.com/ohmu/pgmemcache/blob/master/localtests.sh) and other
extensions. I'd like to be able to set up a local PG cluster in /tmp or
some other location and load the extensions I just built in there.


Now that's a laudable goal. It indeed would be nice to be able to do 
make check to test an extension, using pgxs. The way make check 
within the PostgreSQL source tree works is that it performs make 
install to install PostgreSQL to a temporary location, and installs the 
extension to that. We can't use make install to create a new 
PostgreSQL installation in an extension, but perhaps you could have a 
substitute of that that copies an existing installation to a temporary 
location. I hacked that pgmemcache localtest.sh script to do just that, 
see attached. That's usable for pgmemcache as it is, but for a general 
facility, you'd need to put that logic into pgxs instead, and it should 
also take care of running initdb and starting and stopping the cluster. 
But pg_regress can already do those things, so that should be easy.


So, I think that's the direction this should be going. In summary, the 
goal is to make make check work for extensions, and the way to do 
that is to teach pgxs to create a temporary installation.


- Heikki

diff --git a/.travis.yml b/.travis.yml
index b30af7e..9197690 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -11,7 +11,7 @@ script:
   - PATH=/usr/lib/postgresql/$PGVER/bin/:$PATH ./localtests.sh
 
 after_failure:
-  - cat regressiondata/regression.diffs
+  - cat regression.diffs
 
 env:
   - PGVER=9.1
diff --git a/localtests.sh b/localtests.sh
index d43e15f..956e300 100755
--- a/localtests.sh
+++ b/localtests.sh
@@ -2,17 +2,55 @@
 
 # Create a clean PostgreSQL cluster for our testing
 
-TESTDIR=$(pwd)/regressiondata
+TESTDIR=$(pwd)/tmp_check
 export PGPORT=$((10240 + RANDOM / 2))
-export PGDATA=$TESTDIR/pg
+export PGDATA=$TESTDIR/data
 
 rm -rf $TESTDIR
+
+# Create a minimal PostgreSQL installation, by copying an existing
+# installation. This is a replacement for doing make install in the
+# PostgreSQL source tree, for when you don't have the source tree available.
+#
+# The minimum we need to copy from the existing installation are server 'lib'
+# and 'share' directories, and a few binaries.
+#
+# Note that 'pg_config --libdir' is the path to client-side libraries, i.e.
+# libpq, and we don't want to copy that. (on many distributions, libdir points
+# to just /usr/lib, and we certainly don't want to copy that in whole).
+# Also note that we cannot use symlinks for the binaries, because the binaries
+# look at the absolute path they are run from to find the rest of the files
+# they need.
+
+TMP_INSTALL=$TESTDIR/install
+
+bindir=`pg_config --bindir`
+pkglibdir=`pg_config --pkglibdir`
+sharedir=`pg_config --sharedir`
+
+mkdir -p $TMP_INSTALL
+mkdir -p $TMP_INSTALL$bindir
+mkdir -p $TMP_INSTALL$pkglibdir
+mkdir -p $TMP_INSTALL$sharedir
+
+cp -a $bindir/postgres $TMP_INSTALL$bindir
+cp -a $bindir/initdb $TMP_INSTALL$bindir
+cp -a $bindir/pg_ctl $TMP_INSTALL$bindir
+cp -a $pkglibdir/* $TMP_INSTALL$pkglibdir
+cp -a $sharedir/* $TMP_INSTALL$sharedir
+
+export PATH=$TMP_INSTALL$bindir:$PATH
+
+# Install pgmemcache to the temporary installation
+make install DESTDIR=$TMP_INSTALL
+
+# Set up a temporary cluster
 mkdir -p $PGDATA
 
 initdb -E UTF-8 --no-locale
+# XXX: Should use pg_config --config-auth to lock down the cluster
 sed -e s%^#port =.*%port = $PGPORT% \
 -e s%^#\(unix_socket_director[a-z]*\) =.*%\1 = '$PGDATA'% \
--e s%^#dynamic_library_path = .*%dynamic_library_path = '$(pwd):\$libdir'% \
 -e s%^#fsync = .*%fsync = off% \
 -e s%^#synchronous_commit = .*%synchronous_commit = off% \
 -i $PGDATA/postgresql.conf
@@ -20,15 +58,5 @@ pg_ctl -l $PGDATA/log start
 while [ ! -S $PGDATA/.s.PGSQL.$PGPORT ]; do sleep 2; done
 trap pg_ctl stop -m immediate EXIT
 
-# It's not possible to override the extension path, so we'll just execute
-# the extension SQL directly after mangling it a bit with sed
-
-cp -a Makefile test.sql sql/ expected/ $TESTDIR
-sed -e s%MODULE_PATHNAME%pgmemcache% \
--e /CREATE EXTENSION/d -e /^--/d -e /^$/d \
-ext/pgmemcache.sql  $TESTDIR/sql/init.sql
-cp $TESTDIR/sql/init.sql $TESTDIR/expected/init.out
-
-# Run the actual tests
-
-make -C $TESTDIR installcheck REGRESS_OPTS=--host=$PGDATA --port=$PGPORT
+# Run the actual tests in the temporary cluster.
+make installcheck REGRESS_OPTS=--host=$PGDATA --port=$PGPORT



Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-07-30 Thread Etsuro Fujita

On 2015/07/07 19:15, Etsuro Fujita wrote:

On 2015/07/06 9:42, Kouhei Kaigai wrote:

However, we need to pay attention on advantages towards the alternatives.
Hooks on add_paths_to_joinrel() enables to implement identical stuff,
with
less complicated logic to reproduce left / right relations from
RelOptInfo
of the joinrel. (Note that RelOptInfo-fdw_private enables to avoid path-
construction multiple times.)
I'm uncertain why this API change is necessary to fix up the problem
around EvalPlanQual.


Yeah, maybe we wouldn't need any API change.  I think we would be able
to fix this by complicating add_path as you pointed out upthread.  I'm
not sure that complicating it is a good idea, though.  I think that it
might be possible that the callback in standard_join_search would allow
us to fix this without complicating the core path-cost-comparison stuff
such as add_path.  I noticed that what I proposed upthread doesn't work
properly, though.


To resolve this issue, I tried to make the core create an alternative 
plan that will be used in an EvalPlanQual recheck, instead of a foreign 
scan that performs a foreign join remotely (ie, scanrelid = 0).  But I 
changed that idea.  Instead, I'd like to propose that it's the FDW's 
responsibility to provide such a plan.  Specifically, I'd propose that 
(1) we add a new Path field, say subpath, to the ForeignPath data 
structure and that (2) when generating a ForeignPath node for a foreign 
join, an FDW must provide the subpath Path node by itself.  As before, 
it'd be recommended to use


ForeignPath *
create_foreignscan_path(PlannerInfo *root, RelOptInfo *rel,
double rows, Cost startup_cost, Cost total_cost,
List *pathkeys,
Relids required_outer,
Path *subpath,
List *fdw_private)

where subpath is the subpath Path node that has the pathkeys and the 
required_outer rels.  (subpath is NULL if scanning a base relation.) 
Also, it'd be recommended that an FDW generates such ForeignPath nodes 
by considering, for each of paths in the rel's pathlist, whether to push 
down that path (to generate a ForeignPath node for a foreign join that 
has the same pathkeys and parameterization as that path).  So, if 
generating the ForeignPath node, that path could be used as the subpath 
Path node.
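
In other words, the data structure change being proposed here would look 
roughly like the declaration sketch below. Path and List are of course the 
existing planner types; they are stubbed out only so the fragment stands on 
its own.

#include <stddef.h>

/* stubs standing in for the real planner types, for illustration only */
typedef struct Path { int dummy; } Path;
typedef struct List List;

typedef struct ForeignPath
{
    Path        path;
    Path       *subpath;        /* proposed: local path usable for EvalPlanQual
                                 * rechecks; NULL when scanning a base relation */
    List       *fdw_private;
} ForeignPath;

int
main(void)
{
    ForeignPath fp = {{0}, NULL, NULL};

    return fp.path.dummy;       /* nothing to run; this only shows the shape */
}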


(I think the current postgres_fdw patch only considers an unsorted, 
unparameterized path for performing a foreign join remotely, but I think 
we should also consider presorted and/or parameterized paths.)


I think this idea would apply to the API location that you proposed. 
However, ISTM that this idea would work better for the API call from 
standard_join_search because the rel's pathlist at that point has more 
paths worthy of consideration, in view of not only costs and sizes but 
pathkeys and parameterization.


I think the subplan created from the subpath Path node could be used in 
an EvalPlanQual recheck, instead of a foreign scan that performs a 
foreign join remotely, as discussed previously.


Comments welcome!

Best regards,
Etsuro Fujita




Re: [HACKERS] We need to support ForeignRecheck for late row locking, don't we?

2015-07-30 Thread Etsuro Fujita
On 2015/07/27 18:16, Kouhei Kaigai wrote:
 On 2015/07/24 23:51, Kouhei Kaigai wrote:
 On 2015/07/22 19:10, Etsuro Fujita wrote:
 While working on the issue Foreign join pushdown vs EvalPlanQual, I
 happened to notice odd behaviors of late row locking in FDWs.

 I think the reason for that is because we don't check pushed-down quals
 inside an EPQ testing even if what was fetched by RefetchForeignRow was
 an updated version of the tuple rather than the same version previously
 obtained.  So, to fix this, I'd like to propose that pushed-down quals
 be checked in ForeignRecheck.

 * I've modified ForeignRecheck so as to check pushed-down quals whether
 doing late locking or early locking.

 Isn't it an option to put a new callback in ForeignRecheck?

 FDW driver knows its private data structure includes expression node
 that was pushed down to the remote side. So, it seems to me the best
 way to consult FDW driver whether the supplied tuple should be visible
 according to the pushed down qualifier.

 More or less, this fix need a new interface contract around EvalPlanQual
 logic. It is better to give FDW driver more flexibility of its private
 data structure and the way to process recheck logic, rather than special
 purpose variable.

 If FDW driver managed pushed-down expression in its own format, requirement
 to pushedDownQual makes them to have qualifier redundantly.
 The callback approach does not have such kind of concern.

 That might be an idea, but is there any performance disadvantage as
 discussed in [1]?; it looks like that that needs to perform another
 remote query to see if the supplied tuple satisfies the pushed-down
 quals during EPQ testing.

 I expect the callback of ForeignRecheck runs ExecQual() towards
 the qualifier expression pushed-down but saved on the private data
 of ForeignScanState. It does not need to kick another remote query
 (unless FDW driver is not designed), so performance disadvantage is
 none or quite limited.

The advantages are not clear to me.

I think the callback approach would be a good idea if FDWs were able to
do the recheck more efficiently in their own ways than the core, for
example.
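
For concreteness, here is the shape of the callback idea being discussed, 
reduced to a self-contained toy. Nothing below is real FDW or executor API; 
in the real thing the callback would presumably run ExecQual() over the qual 
expressions saved in the FDW's private state, as described in the quoted text.

#include <stdio.h>
#include <stdbool.h>

/* Toy stand-ins: a "row" and a locally kept, previously pushed-down qual. */
typedef struct ToyRow
{
    int     id;
    int     value;
} ToyRow;

typedef bool (*ToyQual) (const ToyRow *row);

typedef struct ToyFdwState
{
    ToyQual pushed_down_qual;   /* what we shipped to the remote side */
} ToyFdwState;

static bool
qual_value_gt_10(const ToyRow *row)
{
    return row->value > 10;
}

/* The recheck idea: re-evaluate the saved qual against the refetched row. */
static bool
toy_recheck(const ToyFdwState *state, const ToyRow *refetched)
{
    return state->pushed_down_qual(refetched);
}

int
main(void)
{
    ToyFdwState state = {qual_value_gt_10};
    ToyRow      updated = {1, 5};   /* the row changed since we first saw it */

    if (!toy_recheck(&state, &updated))
        printf("row no longer passes the pushed-down qual\n");
    return 0;
}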

Best regards,
Etsuro Fujita




Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort

2015-07-30 Thread Heikki Linnakangas

On 07/30/2015 01:47 PM, Simon Riggs wrote:

On 30 July 2015 at 08:00, Heikki Linnakangas hlinn...@iki.fi wrote:

So here's a shorter/different explanation of this optimization: When it's
time to perform the sort, instead of draining the in-memory heap one tuple
at a time to the last tape, you sort the heap with quicksort, and pretend
that the sorted heap belongs to the last tape, after all the other tuples
in the tape.

Some questions/thoughts on that:

Isn't that optimization applicable even when you have multiple runs?
Quicksorting the heap and keeping it as an array in memory is surely always
faster than heapsorting and pushing it to the tape.


It's about use of memory. If you have multiple runs on tape, then they will
need to be merged and you need memory to do that efficiently. If there are
tuples in the last batch still in memory then it can work, but it depends
upon how full memory is from the last batch and how many batches there are.


True, you need a heap to hold the next tuple from each tape in the merge 
step. To avoid exceeding work_mem, you'd need to push some tuples from 
the in-memory array to the tape to make room for that. In practice, 
though, the memory needed for the merge step's heap is tiny. Even if you 
merge 1000 tapes, you only need memory for 1000 tuples in the heap. But 
yeah, you'll need some logic to share/divide the in-memory array between 
the heap and the in-memory tail of the last tape.


- Heikki





[HACKERS] Re: Using quicksort and a merge step to significantly improve on tuplesort's single run external sort

2015-07-30 Thread Greg Stark
On Thu, Jul 30, 2015 at 12:09 PM, Heikki Linnakangas hlinn...@iki.fi
wrote:


 True, you need a heap to hold the next tuple from each tape in the merge
 step. To avoid exceeding work_mem, you'd need to push some tuples from the
 in-memory array to the tape to make room for that. In practice, though, the
 memory needed for the merge step's heap is tiny. Even if you merge 1000
 tapes, you only need memory for 1000 tuples in the heap. But yeah, you'll
 need some logic to share/divide the in-memory array between the heap and
 the in-memory tail of the last tape.


It's a bit worse than that because we buffer up a significant chunk of the
tape to avoid randomly seeking between tapes after every tuple read. But I
think in today's era of large memory we don't need anywhere near the entire
work_mem just to buffer to avoid random access. Something simple like a
fixed buffer size per tape probably much less than 1MB/tape.

I'm a bit confused where the big win comes from though. Is what's going on
that the external sort only exceeded memory by a small amount, so nearly
all the tuples are still in memory?


-- 
greg