Re: [HACKERS] Parallel Index-only scan

2017-02-15 Thread Rahila Syed
I reviewed the patch. Overall, it looks fine to me.

One comment,

>-   if (index->amcanparallel &&
>-   !index_only_scan &&
>+   if ((index->amcanparallel ||
>+   index_only_scan) &&

Why do we need to check for index_only_scan in the above condition?
IIUC, index->amcanparallel is necessary for a parallel index scan to be
chosen. We cannot choose a parallel index-only scan, when
index_only_scan is happening, without considering the
index->amcanparallel value. So OR-ing index->amcanparallel with
index_only_scan is probably not correct.
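
To make the point concrete, something like the following is presumably
what the condition should look like, if the intent is that amcanparallel
gates parallel scans of either flavor (an illustrative sketch, not the
actual patch text):

    /*
     * Sketch: the index must advertise parallel support regardless of
     * whether we are building a plain or an index-only scan path;
     * index_only_scan must not be OR-ed in as a substitute for it.
     */
    if (index->amcanparallel)
    {
        /* consider a partial (parallel) index [only] scan path */
    }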

Thank you,
Rahila Syed




On Thu, Feb 16, 2017 at 1:06 AM, Robert Haas  wrote:

> On Thu, Jan 19, 2017 at 7:07 AM, Rafia Sabih
>  wrote:
> > Please find the attached file rebased patch of parallel index-only
> > scan on the latest Parallel index-scan patch [1].
>
> This again needs minor rebasing but basically looks fine.  It's a
> pretty straightforward extension of the parallel index scan work.
>
> Please make sure that this is pgindent-clean - i.e. that when you
> pgindent the files that it touches, pgindent doesn't change anything
> of the same parts of the file that you've changed in the patch.  Also,
> I believe Amit may have made some adjustments to the logic in
> nodeIndexScan.c; if so, it would be good to make sure that the
> nodeIndexOnlyScan.c changes match what was done there.  In particular,
> he's got this:
>
> if (reset_parallel_scan && node->iss_ScanDesc->parallel_scan)
> index_parallelrescan(node->iss_ScanDesc);
>
> And you've got this:
>
> +   if (reset_parallel_scan)
> +   index_parallelrescan(node->ioss_ScanDesc);
>
> There might be some other inconsistencies as well that I didn't notice
> on a quick look.
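
Presumably the index-only side just needs the same guard on its own scan
descriptor; a sketch of the matching hunk, mirroring the nodeIndexScan.c
logic quoted above (illustrative, not the committed change):

    /* mirror nodeIndexScan.c: only rescan when a parallel scan is active */
    if (reset_parallel_scan && node->ioss_ScanDesc->parallel_scan)
        index_parallelrescan(node->ioss_ScanDesc);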
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: [HACKERS] UPDATE of partition key

2017-02-15 Thread Amit Langote
On 2017/02/16 15:50, Amit Khandekar wrote:
> On 15 February 2017 at 20:26, David Fetter  wrote:
>> When an UPDATE can't happen, there are often ways to hint at
>> what went wrong and how to correct it.  Violating a uniqueness
>> constraint would be one example.
>>
>> When an UPDATE can't happen and the depth of the subtree is a
>> plausible candidate for what prevents it, there might be a way to say
>> so.
>>
>> Let's imagine a table called log with partitions on "stamp" log_YYYY
>> and subpartitions, also on "stamp", log_YYYYMM.  If you do something
>> like
>>
>> UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ...
>>
>> it's possible to know that it might have worked had the UPDATE taken
>> place on log rather than on log_2017.
>>
>> Does that make sense, and if so, is it super invasive to HINT that?
> 
> Yeah, I think it should be possible to find the root partition with

I assume you mean root *partitioned* table.

> the help of pg_partitioned_table,

The pg_partitioned_table catalog does not store parent-child
relationships, just information about the partition key of a table.  To
get the root partitioned table, you might want to create a recursive
version of get_partition_parent(), maybe called
get_partition_root_parent().  By the way, get_partition_parent() scans
pg_inherits to find the inheritance parent.
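
A rough sketch of what such a helper could look like (hypothetical: it
assumes a lsyscache-style lookup for relispartition and omits locking and
error handling):

    /*
     * Hypothetical helper: walk up the inheritance chain via pg_inherits
     * until we reach a table that is not itself a partition.
     */
    static Oid
    get_partition_root_parent(Oid relid)
    {
        while (get_rel_relispartition(relid))    /* assumed helper */
            relid = get_partition_parent(relid);

        return relid;
    }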

> and then run ExecFindPartition()
> again using the root. Will check. I am not sure right now how involved
> that would turn out to be, but I think that logic would not change the
> existing code, so in that sense it is not invasive.

I couldn't understand why we would run ExecFindPartition() again on the
root partitioned table; can you clarify?  ISTM we just want to tell the
user in the HINT that trying the same update query on the root partitioned
table might work.  I'm not sure whether it would work instead to find some
intermediate partitioned table (that is, between the root and the one the
update query was tried with) to include in the HINT.
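
So the error site could presumably emit something along these lines (a
sketch; the message wording and the get_partition_root_parent() helper are
illustrative, not actual patch code):

    ereport(ERROR,
            (errcode(ERRCODE_CHECK_VIOLATION),
             errmsg("new row for relation \"%s\" violates partition constraint",
                    RelationGetRelationName(resultRel)),
             errhint("The same UPDATE might succeed on the root partitioned table \"%s\".",
                     get_rel_name(get_partition_root_parent(
                                      RelationGetRelid(resultRel))))));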

Thanks,
Amit






Re: [HACKERS] Documentation improvements for partitioning

2017-02-15 Thread Amit Langote
On 2017/02/16 2:08, Robert Haas wrote:
> On Wed, Feb 15, 2017 at 11:34 AM, Alvaro Herrera
>  wrote:
>> I think new-style partitioning is supposed to consider each partition as
>> an implementation detail of the table; the fact that you can manipulate
>> partitions separately does not really mean that they are their own
>> independent object.  You don't stop to think "do I really want to drop
>> the TOAST table attached to this main table?" and attach a CASCADE
>> clause if so.  You just drop the main table, and the toast one is
>> dropped automatically.  I think new-style partitions should behave
>> equivalently.
> 
> That's a reasonable point of view.  I'd like to get some more opinions
> on this topic.  I'm happy to have us do whatever most people want, but
> I'm worried that having table inheritance and table partitioning work
> differently will create confusion.  I'm also suspicious that there
> may be some implementation difficulties.  On the other hand, it does seem a
> little silly to say that DROP TABLE partitioned_table should always
> fail except in the degenerate case where there are no partitions, so
> maybe changing it is for the best.

So I count more than a few votes saying that we should be able to DROP
partitioned tables without specifying CASCADE.

I tried to implement that using the attached patch by having
StoreCatalogInheritance1() create DEPENDENCY_AUTO dependency between
parent and child if the child is a partition, instead of DEPENDENCY_NORMAL
that would otherwise be created.  That is one way of making sure that
partitions are dropped when the root partitioned table is dropped, though
I'm not sure it's the best one; why create the pg_depend entries at all,
one might ask.  I chose it for now because it's the approach with the
fewest lines of
change.  Adjusted regression tests as well, since we recently tweaked
tests [1] to work around the irregularities of test output when using CASCADE.
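
For illustration, the end-user behavior the patch aims for would be (table
names are just examples):

    CREATE TABLE log (stamp timestamptz NOT NULL) PARTITION BY RANGE (stamp);
    CREATE TABLE log_2017 PARTITION OF log
        FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');

    -- With the auto dependency, no CASCADE is needed; log_2017 is
    -- dropped automatically along with its parent.
    DROP TABLE log;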

Thanks,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c397814
>From 3b75f82b4f47e840074d0209323d8ab1f39ed676 Mon Sep 17 00:00:00 2001
From: amit 
Date: Thu, 16 Feb 2017 15:56:44 +0900
Subject: [PATCH] Allow dropping partitioned table without CASCADE

Currently, a normal dependency is created between an inheritance
parent and child when creating the child.  That means one must
specify CASCADE to drop the parent table if a child table exists.
When creating partitions as inheritance children, create an auto
dependency instead, so that partitions are dropped automatically
when the parent is dropped, i.e., without specifying CASCADE.
---
 src/backend/commands/tablecmds.c   | 26 ++
 src/test/regress/expected/alter_table.out  | 10 --
 src/test/regress/expected/create_table.out |  9 ++---
 src/test/regress/expected/inherit.out  | 18 --
 src/test/regress/expected/insert.out   |  7 ++-
 src/test/regress/expected/update.out   |  5 -
 src/test/regress/sql/alter_table.sql   | 10 --
 src/test/regress/sql/create_table.sql  |  9 ++---
 src/test/regress/sql/insert.sql|  7 ++-
 9 files changed, 34 insertions(+), 67 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index f33aa70da6..27b6556a71 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -289,9 +289,11 @@ static List *MergeAttributes(List *schema, List *supers, char relpersistence,
 static bool MergeCheckConstraint(List *constraints, char *name, Node *expr);
 static void MergeAttributesIntoExisting(Relation child_rel, Relation parent_rel);
 static void MergeConstraintsIntoExisting(Relation child_rel, Relation parent_rel);
-static void StoreCatalogInheritance(Oid relationId, List *supers);
+static void StoreCatalogInheritance(Oid relationId, List *supers,
+		bool child_is_partition);
 static void StoreCatalogInheritance1(Oid relationId, Oid parentOid,
-		 int16 seqNumber, Relation inhRelation);
+		 int16 seqNumber, Relation inhRelation,
+		 bool child_is_partition);
 static int	findAttrByName(const char *attributeName, List *schema);
 static void AlterIndexNamespaces(Relation classRel, Relation rel,
    Oid oldNspOid, Oid newNspOid, ObjectAddresses *objsMoved);
@@ -730,7 +732,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 		  typaddress);
 
 	/* Store inheritance information for new rel. */
-	StoreCatalogInheritance(relationId, inheritOids);
+	StoreCatalogInheritance(relationId, inheritOids, stmt->partbound != NULL);
 
 	/*
 	 * We must bump the command counter to make the newly-created relation
@@ -2245,7 +2247,8 @@ MergeCheckConstraint(List *constraints, char *name, Node *expr)
  * supers is a list of the OIDs of the new relation's direct ancestors.
  */
 static void
-StoreCatalogInheritance(Oid relationId, List *supers)
+StoreCatalogInheritance(Oid 

Re: [HACKERS] UPDATE of partition key

2017-02-15 Thread Amit Khandekar
On 15 February 2017 at 20:26, David Fetter  wrote:
> When an UPDATE can't happen, there are often ways to hint at
> what went wrong and how to correct it.  Violating a uniqueness
> constraint would be one example.
>
> When an UPDATE can't happen and the depth of the subtree is a
> plausible candidate for what prevents it, there might be a way to say
> so.
>
> Let's imagine a table called log with partitions on "stamp" log_YYYY
> and subpartitions, also on "stamp", log_YYYYMM.  If you do something
> like
>
> UPDATE log_2017 SET "stamp"='2016-11-08 23:03:00' WHERE ...
>
> it's possible to know that it might have worked had the UPDATE taken
> place on log rather than on log_2017.
>
> Does that make sense, and if so, is it super invasive to HINT that?

Yeah, I think it should be possible to find the root partition with
the help of pg_partitioned_table, and then run ExecFindPartition()
again using the root. Will check. I am not sure right now how involved
that would turn out to be, but I think that logic would not change the
existing code, so in that sense it is not invasive.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] ICU integration

2017-02-15 Thread Peter Geoghegan
On Wed, Feb 15, 2017 at 9:17 PM, Peter Eisentraut
 wrote:
>> Clearly when you upgrade your system from (say) Debian 8 to Debian 9
>> and the ICU major version changes we expect to have to REINDEX, but
>> does anyone know if such data changes are ever pulled into the minor
>> version package upgrades you get from regular apt-get update of (say)
>> a Debian 8 or CentOS 7 or FreeBSD 11 system?  In other words, do we
>> expect collversion changes to happen basically any time in response to
>> regular system updates, or only when you're doing a major upgrade of
>> some kind, as the above-quoted documentation patch implies?
>
> Stable operating systems shouldn't do major library upgrades, so I feel
> pretty confident about this.

There is one case where it *is* appropriate for a bugfix release of
ICU to bump collator version: When there was a bug in the collator
itself, leading to broken binary blobs [1]. We should make the user
REINDEX when this happens.

I think that ICU may well do this in point releases, which is actually
a good thing. So, it seems like a good idea to test that there is no
change, in a place like check_strxfrm_bug(). In my opinion, we should
LOG a complaint in the event of a mismatch rather than refusing to
start the server, since there probably isn't that much chance of the
user being harmed by the bug. The cost of not starting the server
properly until a REINDEX finishes or similar looks particularly
unappealing when one considers that the app was probably affected by
any corruption for weeks or months before the ICU update enabled its
detection.
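
A sketch of the kind of check I have in mind (hypothetical names and
placement; the point is only that we LOG on a mismatch instead of refusing
to start):

    /*
     * Hypothetical check: compare the collation version recorded in the
     * catalog with what the ICU library reports now.
     */
    if (strcmp(recorded_collversion, current_collversion) != 0)
        ereport(LOG,
                (errmsg("collation \"%s\" version mismatch: stored %s, library has %s",
                        collname, recorded_collversion, current_collversion),
                 errhint("REINDEX indexes that depend on this collation.")));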

[1] http://userguide.icu-project.org/collation/api#TOC-Sort-Key-Features
-- 
Peter Geoghegan




Re: [HACKERS] Parallel Append implementation

2017-02-15 Thread Amit Khandekar
On 15 February 2017 at 18:40, Robert Haas  wrote:
> On Wed, Feb 15, 2017 at 4:43 AM, Amit Khandekar  
> wrote:
>>> On 14 February 2017 at 22:35, Robert Haas  wrote:
 For example, suppose that I have a scan of two children, one
 of which has parallel_workers of 4, and the other of which has
 parallel_workers of 3.  If I pick parallel_workers of 7 for the
 Parallel Append, that's probably too high.
>>
>> In the patch, in such case, 7 workers are indeed selected for Parallel
>> Append path, so that both the subplans are able to execute in parallel
>> with their full worker capacity. Are you suggesting that we should not
>> ?
>
> Absolutely.  I think that's going to be way too many workers.  Imagine
> that there are 100 child tables and each one is big enough to qualify
> for 2 or 3 workers.  No matter what value the user has selected for
> max_parallel_workers_per_gather, they should not get a scan involving
> 200 workers.
>
> What I was thinking about is something like this:
>
> 1. First, take the maximum parallel_workers value from among all the children.
>
> 2. Second, compute log2(num_children)+1 and round up.  So, for 1
> child, 1; for 2 children, 2; for 3-4 children, 3; for 5-8 children, 4;
> for 9-16 children, 5, and so on.
>
> 3. Use as the number of parallel workers for the children the maximum
> of the value computed in step 1 and the value computed in step 2.

Ah, now that I closely look at compute_parallel_worker(), I see what
you are getting at.

For a plain unpartitioned table, parallel_workers is calculated as
roughly log(num_pages) (the base is actually 3 in
compute_parallel_worker()).  So if the table size is n, the workers will
be log(n).  If that table is instead partitioned into p partitions of
size n/p each, the number of workers should still be log(n).  Whereas in
the patch it is calculated as the total of all the child workers, i.e.
p * log(n/p) in this case.  But log(n) != p * log(n/p); for example,
log(1000) is much less than log(300) + log(300) + log(300).

That means the value calculated in the patch turns out to be much larger
than one calculated as log(total of the sizes of all children).  So I
think for step 2 above, a log(total_rel_size) formula seems appropriate.
What do you think?

BTW this formula is just an extension of how parallel_workers is
calculated for an unpartitioned table.
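
For concreteness, a sketch of the log(total size) idea, imitating the
tripling loop used by compute_parallel_worker() (the helper name and the
threshold parameter are illustrative; the max_parallel_workers_per_gather
cap still applies elsewhere):

    /*
     * Treat the children as one table of their combined size: one more
     * worker each time the total triples past the threshold, but never
     * fewer workers than the largest child alone would get.
     */
    static int
    append_parallel_workers(int max_child_workers, double total_child_pages,
                            double threshold_pages)
    {
        int     workers = 1;

        while (total_child_pages >= threshold_pages * 3.0)
        {
            workers++;
            threshold_pages *= 3.0;
        }

        return Max(max_child_workers, workers);
    }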

>>> For example, suppose that I have a scan of two children, one
>>> of which has parallel_workers of 4, and the other of which has
>>> parallel_workers of 3.  If I pick parallel_workers of 7 for the
>>> Parallel Append, that's probably too high.  Had those two tables been
>>> a single unpartitioned table, I would have picked 4 or 5 workers, not
>>> 7.  On the other hand, if I pick parallel_workers of 4 or 5 for the
>>> Parallel Append, and I finish with the larger table first, I think I
>>> might as well throw all 4 of those workers at the smaller table even
>>> though it would normally have only used 3 workers.
>>
>>> Having the extra 1-2 workers exit does not seem better.
>>
>> It is here where I didn't understand exactly why we would want to
>> assign these extra workers to a subplan which tells us that it is
>> already being run by 'parallel_workers' number of workers.
>
> The decision to use fewer workers for a smaller scan isn't really
> because we think that using more workers will cause a regression.
> It's because we think it may not help very much, and because it's not
> worth firing up a ton of workers for a relatively small scan given
> that workers are a limited resource.  I think once we've got a bunch
> of workers started, we might as well try to use them.

One possible side-effect I see: other sessions might not get a fair
share of workers.  But there is a counter-argument: because Append is now
focusing all the workers on the last subplan, it may finish faster and
release *all* of its workers earlier.

BTW, there is going to be some logic change in the choose-next-subplan
algorithm if we consider giving extra workers to subplans.




Re: [HACKERS] ICU integration

2017-02-15 Thread Peter Geoghegan
On Wed, Feb 15, 2017 at 9:17 PM, Peter Eisentraut
 wrote:
> I have changed it to use ucol_nextSortKeyPart() when appropriate.

I think it would be fine if the second-to-last argument were
"sizeof(Datum)", not "Min(sizeof(Datum), sss->buflen2)" here:

> +   if (GetDatabaseEncoding() == PG_UTF8)
> +   {
> +   UCharIterator iter;
> +   uint32_tstate[2];
> +   UErrorCode  status;
> +
> +   uiter_setUTF8(, sss->buf1, len);
> +   state[0] = state[1] = 0;  /* won't need that again */
> +   status = U_ZERO_ERROR;
> +   bsize = ucol_nextSortKeyPart(sss->locale->info.icu.ucol,
> +,
> +state,
> +(uint8_t *) sss->buf2,
> +Min(sizeof(Datum), sss->buflen2),
> +);
> +   if (U_FAILURE(status))
> +   ereport(ERROR,
> +   (errmsg("sort key generation failed: %s", u_errorName(status))));
> +   }

Does this really need to be in the strxfrm() loop at all? I don't see
why you need to ever retry here.

There is an issue of style here:

> -   /* Just like strcoll(), strxfrm() expects a NUL-terminated string */
> memcpy(sss->buf1, authoritative_data, len);
> +   /* Just like strcoll(), strxfrm() expects a NUL-terminated string.
> +* Not necessary for ICU, but doesn't hurt. */

I see that you have preserved strcoll() comparison caching (or should
I say ucol_strcollUTF8() comparison caching?), at the cost of having
to keep around a buffer which we must continue to copy every text
string into within varstr_abbrev_convert(). That was probably the
right trade-off. It doesn't hurt that it required minimal divergence
within new ICU code, too.

-- 
Peter Geoghegan




Re: [HACKERS] Patch to implement pg_current_logfile() function

2017-02-15 Thread Tom Lane
"Karl O. Pinc"  writes:
> On a different topic, I pulled from master and now
> I see that git finds the following untracked:
>   src/bin/pg_basebackup/pg_receivexlog
>   src/bin/pg_resetxlog/
>   src/bin/pg_xlogdump/

Those got renamed, cf
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=85c11324cabaddcfaf3347df78555b30d27c5b5a

regards, tom lane




Re: [HACKERS] Patch to implement pg_current_logfile() function

2017-02-15 Thread Karl O. Pinc
On Wed, 15 Feb 2017 15:23:00 -0500
Robert Haas  wrote:

> +ereport(WARNING,
> +(errcode(ERRCODE_INTERNAL_ERROR),
> + errmsg("corrupted data found in \"%s\"",
> +LOG_METAINFO_DATAFILE)));
> 
> elog seems fine here.  There's no need for this to be translatable.
> Also, I'd change the level to ERROR.
> 
> + errhint("The supported log formats are
> \"stderr\""
> +" and \"csvlog\".")));
> 
> I think our preferred style is not to break strings across lines like
> this.
> 
> +log_filepath[strcspn(log_filepath, "\n")] = '\0';
> 
> We have no existing dependency on strcspn().  It might be better not
> to add one just for this feature; I suggest using strchr() instead,
> which is widely used in our existing code.


Attached is a v29 patch which fixes the above problems.

The Syslogger hunk remains to be fixed.  I have no plans
to do so at this time.

Note that since I have to write an "if" anyway, I'm going ahead
and raising an error condition when there's no newline in the
current_logfiles file.  The strcspn() ignored the missing newline.
The new code could do so as well by negating the if condition
should that be preferable.


On a different topic, I pulled from master and now
I see that git finds the following untracked:

src/bin/pg_basebackup/pg_receivexlog
src/bin/pg_resetxlog/
src/bin/pg_xlogdump/

I'd appreciate knowing if I'm doing something dumb
on my end to make this happen.  Thanks.

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 963e70c..ae1fa0b 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -906,6 +906,7 @@ pg_current_logfile(PG_FUNCTION_ARGS)
 	char*logfmt;
 	char*log_filepath;
 	char*log_format = lbuffer;
+	char*nlpos;
 
 	/* The log format parameter is optional */
 	if (PG_NARGS() == 0 || PG_ARGISNULL(0))
@@ -918,8 +919,7 @@ pg_current_logfile(PG_FUNCTION_ARGS)
 			ereport(ERROR,
 	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 	 errmsg("log format \"%s\" is not supported", logfmt),
-	 errhint("The supported log formats are \"stderr\""
-	" and \"csvlog\".")));
+	 errhint("The supported log formats are \"stderr\" and \"csvlog\".")));
 	}
 
 	fd = AllocateFile(LOG_METAINFO_DATAFILE, "r");
@@ -947,19 +947,28 @@ pg_current_logfile(PG_FUNCTION_ARGS)
 		if (log_filepath == NULL)
 		{
 			/*
-			 * Corrupted data found, return NULL to the caller and
+			 * Corrupted data found, no space.  Return NULL to the caller and
 			 * inform him on the situation.
 			 */
-			ereport(WARNING,
-	(errcode(ERRCODE_INTERNAL_ERROR),
-	 errmsg("corrupted data found in \"%s\"",
-			LOG_METAINFO_DATAFILE)));
+			elog(ERROR,
+ "missing space character in \"%s\"", LOG_METAINFO_DATAFILE);
 			break;
 		}
 
 		*log_filepath = '\0';
 		log_filepath++;
-		log_filepath[strcspn(log_filepath, "\n")] = '\0';
+		nlpos = strchr(log_filepath, '\n');
+		if (nlpos == NULL)
+		{
+			/*
+			 * Corrupted data found, no newline.  Return NULL to the caller
+			 * and inform him on the situation.
+			 */
+			elog(ERROR,
+ "missing newline character in \"%s\"", LOG_METAINFO_DATAFILE);
+			break;
+		}
+		*nlpos = '\0';
 
 		if (logfmt == NULL || strcmp(logfmt, log_format) == 0)
 		{



Re: [HACKERS] case_preservation_and_insensitivity = on

2017-02-15 Thread Tom Lane
Joel Jacobson  writes:
> Case Preservation + Case Insensitivity = A good combination
> Thoughts?

Have you read any of our innumerable previous discussions about this?
The last one was barely a month ago, cf
https://www.postgresql.org/message-id/flat/ACF85C502E55A143AB9F4ECFE960660A17282D%40mailserver2.local.mstarlabs.com
https://www.postgresql.org/message-id/flat/CA%2BTgmoYcDXCp5E-2ga2%2BbBz%3DcdQN6T_bBDXksiggm6BtR7UA1A%40mail.gmail.com
(somehow the thread got broken in two in the archives)

The short answer is that nobody can see a way to modify the identifier
case-folding rules that isn't going to add more pain than it subtracts.
And much of the added pain will be felt by people who aren't getting
any benefit, who will therefore be vociferously against the whole thing.

regards, tom lane




Re: [HACKERS] increasing the default WAL segment size

2017-02-15 Thread Tom Lane
Andres Freund  writes:
> On 2017-02-15 22:46:38 -0300, Alvaro Herrera wrote:
>> Now that we've renamed "xlog" to "wal" in user-facing elements, I think
>> we should strive to use the name "wal" internally too in new code, not
>> "xlog" anymore.  This patch introduces several variables, macros,
>> functions that ought to change names now -- XLogSegmentOffset should be
>> WALSegmentOffset for example.  (I expect that as we touch code over
>> time, the use of "xlog" will decrease, though not fully disappear).

> I think this will just decrease the consistency in xlog.c (note the
> name) et al.

It's also going to make back-patching bug fixes in the area a real
nightmare.  Let's not change the code more than necessary to implement
the desired user-facing behavior.

regards, tom lane




Re: [HACKERS] Proposal: GetOldestXminExtend for ignoring arbitrary vacuum flags

2017-02-15 Thread Seki, Eiji
Simon Riggs  wrote:
> Please persuade us with measurements that allowing this impact on
> ANALYZE would really improve performance at least in your case, and
> also examine the effect of this on the accuracy and usefulness of the
> gathered statistics.

Let me explain the results of the test that Haribabu mentioned in [1].

The purpose of this test is the following.

- Check the effect of VCI on an OLTP workload
- Check whether VCI can be used in OLAP queries
  even if there is an OLTP workload on the same table

The test is done as follows.

- Use the Tiny TPC-C [2] benchmark as OLTP workload
- Scale factor: 100
- Create VCI on the 'stock' table before starting benchmark
- Planner doesn't select VCI for queries of the Tiny TPC-C.

I attach the result graph.

This graph shows how the number of rows in the WOS changes over time. In
our environment, when the WOS size exceeds about 700,000, VCI is no longer
used for queries such as the following.

select count(*) from stock where s_order_cnt > 4;

While under low load (the "Number of clients = 10" line; the throughput was
about 1,000) the WOS size didn't exceed about 500,000, under high load (the
"Number of clients = 30 (Without patch)" line; the throughput was about
1,400) the WOS size frequently exceeded 700,000.

While the WOS size continued to increase, an ANALYZE-only (without VACUUM)
process created by the autovacuum daemon was always running, and the
conversion process from WOS to ROS didn't run. After changing the code to
ignore ANALYZE-only processes using my patch, the WOS size no longer
exceeded about 500,000 (the "Number of clients = 30 (With patch)" line;
the throughput was about 1,400).

Please let me know if you need any further information.

[1] - 
https://www.postgresql.org/message-id/CAJrrPGen1bJYRHu7VFp13QZUyaLdX5N4AH1cqQdiNd8uLVZWow%40mail.gmail.com
[2] - http://hp.vector.co.jp/authors/VA052413/jdbcrunner/manual_ja/tpc-c.html 
(Sorry, the document is available in Japanese only.)

--
Regards,
Eiji Seki
Fujitsu






Re: [HACKERS] ICU integration

2017-02-15 Thread Peter Eisentraut
On 1/31/17 16:59, Thomas Munro wrote:
> I assume (but haven't checked) that ucol_nextSortKeyPart accesses only
> the start of the string via the UCharIterator passed in, unless you
> have the rare reverse-accent-sort feature enabled (maybe used only in
> fr_CA, it looks like it is required to scan the whole string looking
> for the last accent).  So I assume that uiter_setUTF8 and
> ucol_nextSortKeyPart would usually do a small fixed amount of work,
> whereas this patch's icu_to_uchar allocates space and converts the
> whole variable length string every time.

I have changed it to use ucol_nextSortKeyPart() when appropriate.

> That's about abbreviation, but I note that you can also compare
> strings using iterators with ucol_strcollIter, avoiding the need to
> allocate and transcode up front.  I have no idea whether that'd pay
> off.

These iterators don't actually transcode on the fly.  You can only set
up iterators for UTF-8 or UTF-16 strings.  And for UTF-8, we already
have a special code path using ucol_strcollUTF8(), which uses an
iterator internally.

> Clearly when you upgrade your system from (say) Debian 8 to Debian 9
> and the ICU major version changes we expect to have to REINDEX, but
> does anyone know if such data changes are ever pulled into the minor
> version package upgrades you get from regular apt-get update of (say)
> a Debian 8 or CentOS 7 or FreeBSD 11 system?  In other words, do we
> expect collversion changes to happen basically any time in response to
> regular system updates, or only when you're doing a major upgrade of
> some kind, as the above-quoted documentation patch implies?

Stable operating systems shouldn't do major library upgrades, so I feel
pretty confident about this.

> 
> +   collversion = ntohl(*((uint32 *) versioninfo));
> 
> UVersionInfo is an array of four uint8_t.  That doesn't sound like
> something that needs to be bit-swizzled... is it really?  Even if it
> is arranged differently on different architectures, I'm not sure why
> you care since we only ever use it to compare for equality on the same
> system.  But aside from that, I don't love this cast to uint32.  I
> wonder if we should use u_versionToString to respect boundaries a bit
> better?

Makes sense, changed the column type to text.
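
For reference, the ICU calls involved look roughly like this (a sketch;
'collator' stands for an already-opened UCollator, and error handling is
omitted):

    UVersionInfo    versioninfo;
    char            versionstr[U_MAX_VERSION_STRING_LENGTH];

    /* ask the collator for its version and render it as a dotted string */
    ucol_getVersion(collator, versioninfo);
    u_versionToString(versioninfo, versionstr);
    /* versionstr can now be stored in a text column and compared later */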

> I have another motivation for wanting to model collation versions as
> strings: I have been contemplating a version check for system-provided
> collations too, piggy-backing on your work here.  Obviously PostgreSQL
> can't directly know anything about system collation versions, but I'm
> thinking of a GUC system_collation_version_command which you could
> optionally set to a script that knows how to inspect the local
> operating system.  For example, a package maintainer might be
> interested in writing such a script that knows how to ask the package
> system for a locale data version.  A brute force approach that works
> on many systems could be 'tar c /usr/share/local/*/LC_COLLATE | md5'.

That sounds like an idea worth pursuing.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] case_preservation_and_insensitivity = on

2017-02-15 Thread Joel Jacobson
Hi hackers,

"A system that is not case-preserving is necessarily case-insensitive,
but it is possible and common for a system to be case-insensitive, yet
case-preserving" [1]

Imagine if you could turn on a GUC that would turn PostgreSQL into
such a system, where case is preserved by default for all created
objects, and where those objects can still be referred to
case-insensitively, all without having to put double-quotes around
identifiers.

Today, you have to sacrifice the nice case-insensitivity feature if
you wish to preserve case information, which is a shame.

This would make a huge difference in terms of usability when using
PostgreSQL together with external systems where casing is important.

Today, you have to maintain silly look-up tables to translate between
PostgreSQL's internal lowercase objects,
and the outside world's e.g. CamelCase names for the corresponding objects,
or you have to sacrifice the nice case-insensitivity feature.
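
To make the pain point concrete, this is today's folding behavior
(standard PostgreSQL, shown for illustration):

    CREATE TABLE CustomerOrders (id int);    -- stored as customerorders
    SELECT * FROM customerorders;            -- works: insensitive, case lost

    CREATE TABLE "CustomerOrders2" (id int); -- case preserved
    SELECT * FROM customerorders2;           -- ERROR: relation does not exist
    SELECT * FROM "CustomerOrders2";         -- works, but quoting is now mandatory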

Case Preservation + Case Insensitivity = A good combination

Thoughts?

[1] https://en.wikipedia.org/wiki/Case_preservation




Re: [HACKERS] Parallel Append implementation

2017-02-15 Thread Ashutosh Bapat
On Wed, Feb 15, 2017 at 6:40 PM, Robert Haas  wrote:
> On Wed, Feb 15, 2017 at 4:43 AM, Amit Khandekar  
> wrote:
>>> On 14 February 2017 at 22:35, Robert Haas  wrote:
 For example, suppose that I have a scan of two children, one
 of which has parallel_workers of 4, and the other of which has
 parallel_workers of 3.  If I pick parallel_workers of 7 for the
 Parallel Append, that's probably too high.
>>
>> In the patch, in such case, 7 workers are indeed selected for Parallel
>> Append path, so that both the subplans are able to execute in parallel
>> with their full worker capacity. Are you suggesting that we should not
>> ?
>
> Absolutely.  I think that's going to be way too many workers.  Imagine
> that there are 100 child tables and each one is big enough to qualify
> for 2 or 3 workers.  No matter what value the user has selected for
> max_parallel_workers_per_gather, they should not get a scan involving
> 200 workers.

If the user is ready to throw 200 workers at it, and the subplans can
use them to speed up the query 200 times (obviously I am exaggerating),
why not use them? When the user sets max_parallel_workers_per_gather to
that high a number, they mean it to be used by a Gather, and that's what
we should be doing.

>
> What I was thinking about is something like this:
>
> 1. First, take the maximum parallel_workers value from among all the children.
>
> 2. Second, compute log2(num_children)+1 and round up.  So, for 1
> child, 1; for 2 children, 2; for 3-4 children, 3; for 5-8 children, 4;
> for 9-16 children, 5, and so on.

Can you please explain the rationale behind this maths?

>
> 3. Use as the number of parallel workers for the children the maximum
> of the value computed in step 1 and the value computed in step 2.
>
> With this approach, a plan with 100 children qualifies for 8 parallel
> workers (unless one of the children individually qualifies for some
> larger number, or unless max_parallel_workers_per_gather is set to a
> smaller value).  That seems fairly reasonable to me.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] SCRAM authentication, take three

2017-02-15 Thread Michael Paquier
On Wed, Feb 15, 2017 at 8:28 PM, Heikki Linnakangas  wrote:
> Ah, found it. It was because of this change:
>
> Having two error codes with the same SQLSTATE is not cool, and tripped the
> assertion in PL/python. I removed the new error code, it was only used in
> one place, and ERRCODE_PROTOCOL_VIOLATION was more appropriate there anyway.

It seems that this one is on me. Thanks for looking at it.
-- 
Michael




Re: [HACKERS] Parallel Index Scans

2017-02-15 Thread Amit Kapila
On Thu, Feb 16, 2017 at 12:27 AM, Robert Haas  wrote:
> On Wed, Feb 15, 2017 at 1:39 PM, Robert Haas  wrote:
>> On Wed, Feb 15, 2017 at 7:11 AM, Amit Kapila  
>> wrote:>
>>> support related patch.  In anycase, to avoid confusion I am attaching
>>> all the three patches with this e-mail.
>>
>> Committed guc_parallel_index_scan_v1.patch, with changes to the
>> documentation and GUC descriptions.
>
> And committed parallel_index_opt_exec_support_v10.patch as well, with
> a couple of minor tweaks.
>

Thanks a lot!  I think this is a big step forward for parallelism in
PostgreSQL.  Now, we have another way to drive parallel scans and I
hope many more queries can use parallelism.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-02-15 Thread Andres Freund
Hi,

Just to summarize what you could read between the lines in the previous
mail: from a higher-level POV the design here makes sense to me; I do,
however, think a good chunk of code-level improvement is needed.

Regards,

Andres




Re: [HACKERS] DROP SUBSCRIPTION and ROLLBACK

2017-02-15 Thread Masahiko Sawada
On Thu, Feb 16, 2017 at 7:52 AM, Petr Jelinek
 wrote:
> On 15/02/17 06:43, Masahiko Sawada wrote:
>> On Tue, Feb 14, 2017 at 1:13 AM, Fujii Masao  wrote:
>>> On Mon, Feb 13, 2017 at 4:05 PM, Masahiko Sawada  
>>> wrote:
 On Sat, Feb 11, 2017 at 8:18 PM, Petr Jelinek
  wrote:
> On 10/02/17 19:55, Masahiko Sawada wrote:
>> On Thu, Feb 9, 2017 at 12:44 AM, Petr Jelinek
>>  wrote:
>>> On 08/02/17 07:40, Masahiko Sawada wrote:
 On Wed, Feb 8, 2017 at 9:01 AM, Michael Paquier
  wrote:
> On Wed, Feb 8, 2017 at 1:30 AM, Fujii Masao  
> wrote:
>> On Wed, Feb 8, 2017 at 12:26 AM, Petr Jelinek
>>  wrote:
>>> For example what happens if apply crashes during the DROP
>>> SUBSCRIPTION/COMMIT and is not started because the delete from 
>>> catalog
>>> is now visible so the subscription is no longer there?
>>
>> Another idea is to treat DROP SUBSCRIPTION in the same way as 
>> VACUUM, i.e.,
>> make it emit an error if it's executed within user's transaction 
>> block.
>
> It seems to me that this is exactly Petr's point: using
> PreventTransactionChain() to prevent things to happen.

 Agreed. It's better to prevent it from being executed inside a user
 transaction block. And I understood there are too many failure scenarios
 we would need to handle.

>> Also DROP SUBSCRIPTION should call CommitTransactionCommand() just
>> after removing the entry from pg_subscription, then connect to the 
>> publisher
>> and remove the replication slot.
>
> For consistency that may be important.

 Agreed.

 Attached patch, please give me feedback.

>>>
>>> This looks good (and similar to what initial patch had btw). Works fine
>>> for me as well.
>>>
>>> Remaining issue is, what to do about CREATE SUBSCRIPTION then, there are
>>> similar failure scenarios there, should we prevent it from running
>>> inside transaction as well?
>>>
>>
>> Hmm, on further thought I have doubts about the approach currently
>> being discussed. For example, please imagine the case where CREATE
>> SUBSCRIPTION creates the subscription successfully but fails to create
>> the replication slot for whatever reason, and then DROP SUBSCRIPTION
>> drops the subscription but dropping the replication slot fails. In such
>> a case, CREATE SUBSCRIPTION and DROP SUBSCRIPTION return ERROR but the
>> subscription is created and dropped successfully. I think this
>> behaviour confuses the user.
>>
>> I think we should just prevent calling DROP SUBSCRIPTION in user's
>> transaction block. Or I guess that it could be better to separate the
>> starting/stopping logical replication from subscription management.
>>
>
> We need to stop the replication worker(s) in order to be able to drop
> the slot. There is no such issue with startup of the worker as that one
> is launched by launcher after the transaction has committed.
>
> IMO the best option is to just not allow DROP/CREATE SUBSCRIPTION inside a
> transaction block and don't do any commits inside of those (so that
> there are no rollbacks, which solves your initial issue I believe). That
> way failure to create/drop slot will result in subscription not being
> created/dropped which is what we want.
>>>
>>> On second thought, +1.
>>>
 I basically agree with this option, but why do we need to change CREATE
 SUBSCRIPTION as well?
>>>
>>> Because the window between the creation of replication slot and the 
>>> transaction
>>> commit of CREATE SUBSCRIPTION should be short. Otherwise, if any error 
>>> happens
>>> during that window, the replication slot unexpectedly remains while there 
>>> is no
>>> corresponding subscription. Of course, even if we prevent CREATE 
>>> SUBSCRIPTION
>>> from being executed within user's transaction block, there is still such
>>> window. But we can reduce the possibility of that problem.
>>
>> Thank you for the explanation. I understood and agree.
>>
>> I think we should disallow to call ALTER SUBSCRIPTION inside a user's
>> transaction block as well.
>
> Why? ALTER SUBSCRIPTION does not create/drop anything on remote server ever.
>

Hmm, you're right. ALTER SUBSCRIPTION can support transactions. Attached
is a fixed version of the patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


disallow_sub_ddls_in_transaction_block_v2.patch
Description: Binary data


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-02-15 Thread Andres Freund
Hi,

On 2017-02-13 23:57:00 +1300, Thomas Munro wrote:
> Here's a new version to fix the problems reported by Rafia above.  The
> patch descriptions are as before but it starts from 0002 because 0001
> was committed as 7c5d8c16 (thanks, Andres).

FWIW, I'd appreciate it if you added a short commit message to the
individual patches - I find it helpful to have a little bit more context
while looking at them than just the titles.  Alternatively you can
include that text when re-posting the series, but it's imo just as easy
to have a short commit message (and just use format-patch).

I'm for now using [1] as context.


0002-hj-add-dtrace-probes-v5.patch

Hm. I'm personally very unenthusiastic about adding more of these, and
would rather rip all of them out.  I tend to believe that static
probes simply aren't a good approach for anything requiring a lot of
detail.  But whatever.


0003-hj-refactor-memory-accounting-v5.patch
@@ -424,15 +422,29 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, 
bool useskew,
if (ntuples <= 0.0)
ntuples = 1000.0;
 
-   /*
-* Estimate tupsize based on footprint of tuple in hashtable... note 
this
-* does not allow for any palloc overhead.  The manipulations of 
spaceUsed
-* don't count palloc overhead either.
-*/
+   /* Estimate tupsize based on footprint of tuple in hashtable. */

palloc overhead is still unaccounted for, no? In the chunked case that
might not be much, I realize that (so that comment should probably have
been updated when chunking was introduced).

-   SizespaceUsed;  /* memory space currently used 
by tuples */
+   SizespaceUsed;  /* memory space currently used 
by hashtable */

It's not really the full hashtable, is it? The ->buckets array appears
to still be unaccounted for.

Looks ok.


0004-hj-refactor-batch-increases-v5.patch

@@ -1693,10 +1689,12 @@ ExecHashRemoveNextSkewBucket(HashJoinTable hashtable)
 }
 
 /*
- * Allocate 'size' bytes from the currently active HashMemoryChunk
+ * Allocate 'size' bytes from the currently active HashMemoryChunk.  If
+ * 'respect_work_mem' is true, this may cause the number of batches to be
+ * increased in an attempt to shrink the hash table.
  */
 static void *
-dense_alloc(HashJoinTable hashtable, Size size)
+dense_alloc(HashJoinTable hashtable, Size size, bool respect_work_mem)

{
HashMemoryChunk newChunk;
char   *ptr;
@@ -1710,6 +1708,15 @@ dense_alloc(HashJoinTable hashtable, Size size)
 */
if (size > HASH_CHUNK_THRESHOLD)
{
+   if (respect_work_mem &&
+   hashtable->growEnabled &&
+   hashtable->spaceUsed + HASH_CHUNK_HEADER_SIZE + size >
+   hashtable->spaceAllowed)
+   {
+   /* work_mem would be exceeded: try to shrink hash table 
*/
+   ExecHashIncreaseNumBatches(hashtable);
+   }
+

Isn't it kinda weird to do this from within dense_alloc()?  I mean that
dumps a lot of data to disk, frees a bunch of memory and so on - not
exactly what "dense_alloc" implies.  Isn't the free()ing part also
dangerous, because the caller might actually use some of that memory,
e.g. in ExecHashRemoveNextSkewBucket()?  I haven't looked
deeply enough to check whether that's an active bug, but it seems like
inviting one if not.
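
One way to address that concern might be to hoist the check into a helper
that callers invoke only at points where they hold no pointers into the
current chunks (a sketch derived from the quoted hunk, not a tested
proposal):

    /*
     * Hypothetical helper: let the caller decide when it is safe for the
     * hash table to shrink, instead of doing it inside dense_alloc().
     */
    static void
    ExecHashRespectWorkMem(HashJoinTable hashtable, Size size)
    {
        if (hashtable->growEnabled &&
            hashtable->spaceUsed + HASH_CHUNK_HEADER_SIZE + size >
            hashtable->spaceAllowed)
            ExecHashIncreaseNumBatches(hashtable);
    }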


0005-hj-refactor-unmatched-v5.patch

I'm a bit confused as to why unmatched tuple scan is a good parallelism
target, but I might see later...

0006-hj-barrier-v5.patch

Skipping that here.


0007-hj-exec-detach-node-v5.patch

Hm. You write elsewhere:
> By the time ExecEndNode() runs in workers, ExecShutdownNode() has
> already run.  That's done on purpose because, for example, the hash
> table needs to survive longer than the parallel environment to allow
> EXPLAIN to peek at it.  But it means that the Gather node has thrown
> out the shared memory before any parallel-aware node below it gets to
> run its Shutdown and End methods.  So I invented ExecDetachNode()
> which runs before ExecShutdownNode(), giving parallel-aware nodes a
> chance to say goodbye before their shared memory vanishes.  Better
> ideas?

To me that is a weakness in the ExecShutdownNode() API - imo child nodes
should get the chance to shutdown before the upper-level node.
ExecInitNode/ExecEndNode etc give individual nodes the freedom to do
things in the right order, but ExecShutdownNode() doesn't.  I don't
quite see why we'd want to invent a separate ExecDetachNode() that'd be
called immediately before ExecShutdownNode().

An easy way to change that would be to return in the
ExecShutdownNode()'s T_GatherState case, and delegate the responsibility
of calling it on Gather's children to ExecShutdownGather().
Alternatively we could make it a full-blown thing like ExecInitNode()
that every node needs to implement, but that seems a 

Re: [HACKERS] Proposal: GetOldestXminExtend for ignoring arbitrary vacuum flags

2017-02-15 Thread Michael Paquier
On Thu, Feb 16, 2017 at 10:48 AM, Haribabu Kommi
 wrote:
> Because of the above reason, we need a new API or some change in API
> to provide the Oldest xmin by ignoring the ANALYZE transactions, so that
> it will reduce the size of WOS and improves the VCI query performance.

A new API in core is not strictly mandatory either. As mentioned to
Seki-san at last week's developer meeting, since the lists of PGXACT and
PGPROC are in shared memory, you could develop your own version. Still,
there is a limitation with KnownAssignedXidsGetOldestXmin in recovery,
which at a quick glance might be better exposed, but you don't actually
care about that case for storage, right? If the change makes sense, it
seems to me that making the routine more extensible is best, provided the
API extension can be used by modules with broader goals in mind.
-- 
Michael




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-15 Thread Thomas Munro
On Sat, Feb 4, 2017 at 2:45 PM, Peter Geoghegan  wrote:
> It might just have been that the table was too small to be an
> effective target for parallel sequential scan with so many workers,
> and so a presorted best case CREATE INDEX, which isn't that different,
> also fails to see much benefit (compared to what you'd see with a
> similar case involving a larger table). In other words, I might have
> jumped the gun in emphasizing issues with hardware and I/O bandwidth
> over issues around data volume (that I/O parallelism is inherently not
> very helpful with these relatively small tables).
>
> As I've pointed out a couple of times before, bigger sorts will be
> more CPU bound because sorting itself has costs that grow
> linearithmically, whereas writing out runs has costs that grow
> linearly. The relative cost of the I/O can be expected to go down as
> input goes up for this reason. At the same time, a larger input might
> make better use of I/O parallelism, which reduces the cost paid in
> latency to write out runs in absolute terms.

Here are some results with your latest patch, using the same test as
before but this time with SCALE=100 (= 100,000,000 rows).  The table
sizes are:

 List of relations
 Schema | Name | Type  |Owner | Size  | Description
+--+---+--+---+-
 public | million_words| table | thomas.munro | 42 MB |
 public | some_words   | table | thomas.munro | 19 MB |
 public | test_intwide_u_asc   | table | thomas.munro | 18 GB |
 public | test_intwide_u_desc  | table | thomas.munro | 18 GB |
 public | test_intwide_u_rand  | table | thomas.munro | 18 GB |
 public | test_textwide_u_asc  | table | thomas.munro | 19 GB |
 public | test_textwide_u_desc | table | thomas.munro | 19 GB |
 public | test_textwide_u_rand | table | thomas.munro | 19 GB |

To reduce the number of combinations I did only unique data and built
only non-unique indexes with only 'wide' tuples (= key plus a text
column that holds a 151-character wide string, rather than just the
key), and also didn't bother with the 1MB memory size as suggested.
Here are the results up to 4 workers (a results table going up to 8
workers is attached, since it wouldn't format nicely if I pasted it
here).  Again, the w = 0 time is seconds, the rest show relative
speed-up.  This data was all in the OS page cache because of a dummy
run done first, and I verified with 'sar' that there was exactly 0
reading from the block device.  The CPU was pegged on leader + workers
during sort runs, and then the leader's CPU hovered around 93-98%
during the merge/btree build.  I had some technical problems getting a
cold-cache read-from-actual-disk-each-time test run to work properly,
but can go back and do that again if anyone thinks that would be
interesting data to see.

   tab| ord  | mem |  w = 0  | w = 1 | w = 2 | w = 3 | w = 4
--+--+-+-+---+---+---+---
 intwide  | asc  |  64 |   67.91 | 1.26x | 1.46x | 1.62x | 1.73x
 intwide  | asc  | 256 |   67.84 | 1.23x | 1.48x | 1.63x | 1.79x
 intwide  | asc  | 512 |   69.01 | 1.25x | 1.50x | 1.63x | 1.80x
 intwide  | desc |  64 |   98.08 | 1.48x | 1.83x | 2.03x | 2.25x
 intwide  | desc | 256 |   99.87 | 1.43x | 1.80x | 2.03x | 2.29x
 intwide  | desc | 512 |  104.09 | 1.44x | 1.85x | 2.09x | 2.33x
 intwide  | rand |  64 |  138.03 | 1.56x | 2.04x | 2.42x | 2.58x
 intwide  | rand | 256 |  139.44 | 1.61x | 2.04x | 2.38x | 2.56x
 intwide  | rand | 512 |  138.96 | 1.52x | 2.03x | 2.28x | 2.57x
 textwide | asc  |  64 |  207.10 | 1.20x | 1.07x | 1.09x | 1.11x
 textwide | asc  | 256 |  200.62 | 1.19x | 1.06x | 1.04x | 0.99x
 textwide | asc  | 512 |  191.42 | 1.16x | 0.97x | 1.01x | 0.94x
 textwide | desc |  64 | 1382.48 | 1.89x | 2.37x | 3.18x | 3.87x
 textwide | desc | 256 | 1427.99 | 1.89x | 2.42x | 3.24x | 4.00x
 textwide | desc | 512 | 1453.21 | 1.86x | 2.39x | 3.23x | 3.75x
 textwide | rand |  64 | 1587.28 | 1.89x | 2.37x | 2.66x | 2.75x
 textwide | rand | 256 | 1557.90 | 1.85x | 2.34x | 2.64x | 2.73x
 textwide | rand | 512 | 1547.97 | 1.87x | 2.32x | 2.64x | 2.71x

"textwide" "asc" is nearly an order of magnitude faster than other
initial orders without parallelism, but then parallelism doesn't seem
to help it much.  Also, using more than 64MB doesn't ever seem to help
very much; in the "desc" case it hinders.

I was curious to understand how performance changes if we become just
a bit less correlated (rather than completely uncorrelated or
perfectly inversely correlated), so I tried out a 'banana skin' case:
I took the contents of the textwide asc table and copied it to a new
table, and then moved the 900 words matching 'banana%' to the physical
end of the heap by deleting and reinserting them in one transaction.
I guess if we were to use this technology for CLUSTER, this might be
representative of a situation where you regularly 

Re: [HACKERS] [Bug fix] PQsendQuery occurs error when target_session_attrs is set to read-write

2017-02-15 Thread Michael Paquier
On Thu, Feb 16, 2017 at 1:08 AM, Robert Haas  wrote:
> Great, thanks!  This looks good to me, so committed.

Thanks.

> Is there any
> chance you can use your amazing TAP-test-creation powers to do some
> automated testing of this feature?  For example, maybe we could at
> least set up a master and a standby and somehow test that asking for a
> read-only server picks the standby if it's first but asking for a
> read-write server tries the standby and then switches to the master?
> It would also be nice to test that probing a server that doesn't exist
> fails, but I'm not sure how to create an IP/port combination that's
> guaranteed to not work.

It is possible to get a test easily in this area by abusing the
fact that multiple -d switches given to psql make it use only the
last value. Looking at psql() in PostgresNode.pm, you will see what
I mean, as -d is defined by default. Or we could just enforce
PGHOST/PGPORT, as there is a method to get the Unix domain socket
directory. I'm not sure which of those two methods is more elegant;
I would tend toward just using PGHOST and PGPORT.
-- 
Michael




Re: [HACKERS] increasing the default WAL segment size

2017-02-15 Thread Andres Freund
On 2017-02-15 22:46:38 -0300, Alvaro Herrera wrote:
> Now that we've renamed "xlog" to "wal" in user-facing elements, I think
> we should strive to use the name "wal" internally too in new code, not
> "xlog" anymore.  This patch introduces several variables, macros,
> functions that ought to change names now -- XLogSegmentOffset should be
> WALSegmentOffset for example.  (I expect that as we touch code over
> time, the use of "xlog" will decrease, though not fully disappear).

I think this will just decrease the consistency in xlog.c (note the
name) et al.




Re: [HACKERS] Proposal: GetOldestXminExtend for ignoring arbitrary vacuum flags

2017-02-15 Thread Haribabu Kommi
On Wed, Feb 15, 2017 at 11:35 PM, Amit Kapila 
wrote:

> On Wed, Feb 15, 2017 at 12:03 PM, Seki, Eiji 
> wrote:
> > Amit Kapila wrote:
> >> How will you decide just based on oldest xmin whether the tuple is
> visible or not?  How will you take decisions about tuples which have xmax
> set?
> >
> > In our use case, GetOldestXmin is used by our own maintainer
> > process(es) on our own control table(s). The table can be concurrently
> > read from or inserted into by any transaction. However, rows in the
> > table can be deleted (xmax set) only by the maintainer process, and one
> > control table can be processed by only one maintainer process at a time.
> >
> > So I do MVCC as follows.
> >
> > - The maintainer's transaction:
> >   - If xmax is set, simply ignore the tuple.
> >   - For other tuples, read tuples if GetOldestXmin() > xmin.
> > - Other transactions: Do ordinary MVCC using their own XIDs.
> >
>
> Oh, this is a very specific case for which such an API can be useful.
> Earlier, I have seen people propose some kind of hook that can be used
> for their specific purpose, but exposing an API or changing the
> signature of an API sounds a bit awkward.  Consider that tomorrow
> someone decides to change this API for some reason; it might become
> difficult to decide how, because we can't find its usage.


The proposed new API change is in the context of fixing a performance
problem in the vertical clustered index (VCI) feature [1].

During performance testing of VCI in parallel with OLTP queries, query
performance sometimes drops because VCI is not chosen as the best plan.
This is due to a growing number of records in the WOS relation that need
to be moved to the ROS.  If there are more records in the WOS relation,
it takes more time to generate the LOCAL ROS, and for this reason the
VCI plan is not chosen.

Why are there more records in the WOS?  We used the GetOldestXmin()
function to identify the minimum transaction present in the cluster in
order to move the data from WOS to ROS.  This function doesn't ignore
ANALYZE transactions running in the system, even though those
transactions don't make any changes that would affect the data movement
from WOS to ROS.

For the above reason, we need a new API, or a change to an existing one,
that provides the oldest xmin while ignoring ANALYZE transactions; that
will reduce the size of the WOS and improve VCI query performance.
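
To sketch the shape such a change could take (the function name comes from
this thread's subject line; the body is heavily simplified relative to the
real GetOldestXmin() and is illustrative only):

    /*
     * Simplified sketch: compute the oldest xmin while skipping any
     * backend whose vacuumFlags match ignoreFlags (e.g. PROC_IN_ANALYZE
     * to ignore ANALYZE-only transactions).  A real implementation would
     * also have to consider pgxact->xid, replication slots, and the
     * recovery case, as GetOldestXmin() does.
     */
    TransactionId
    GetOldestXminExtend(Relation rel, uint8 ignoreFlags)
    {
        ProcArrayStruct *arrayP = procArray;
        TransactionId   result;
        int             index;

        LWLockAcquire(ProcArrayLock, LW_SHARED);

        /* start from the latest completed xid, as GetOldestXmin() does */
        result = ShmemVariableCache->latestCompletedXid;
        TransactionIdAdvance(result);

        for (index = 0; index < arrayP->numProcs; index++)
        {
            int             pgprocno = arrayP->pgprocnos[index];
            volatile PGXACT *pgxact = &allPgXact[pgprocno];

            if (pgxact->vacuumFlags & ignoreFlags)
                continue;       /* skip, e.g., ANALYZE-only backends */

            if (TransactionIdIsNormal(pgxact->xmin) &&
                TransactionIdPrecedes(pgxact->xmin, result))
                result = pgxact->xmin;
        }

        LWLockRelease(ProcArrayLock);
        return result;
    }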

[1] -
https://www.postgresql.org/message-id/cajrrpgfac7wc9nk6ptty6yn-nn+hcy8xolah2doyhvg5d6h...@mail.gmail.com

Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] increasing the default WAL segment size

2017-02-15 Thread Robert Haas
On Wed, Feb 15, 2017 at 8:46 PM, Alvaro Herrera
 wrote:
> Now that we've renamed "xlog" to "wal" in user-facing elements, I think
> we should strive to use the name "wal" internally too in new code, not
> "xlog" anymore.  This patch introduces several variables, macros,
> functions that ought to change names now -- XLogSegmentOffset should be
> WALSegmentOffset for example.  (I expect that as we touch code over
> time, the use of "xlog" will decrease, though not fully disappear).

Ugh.

I think that's going to lead to a complete mess.  We'll end up with
newer and older sections of the code being randomly inconsistent with
each other.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Documentation improvements for partitioning

2017-02-15 Thread Amit Langote
On 2017/02/15 13:31, Robert Haas wrote:
> On Mon, Feb 13, 2017 at 5:57 AM, Amit Langote
>  wrote:
>> On 2017/02/13 14:21, Amit Langote wrote:
>>> On 2017/02/10 19:19, Simon Riggs wrote:
 * The OID inheritance needs work - you shouldn't need to specify a
 partition needs OIDS if the parent has it already.
>>>
>>> That sounds right.  It's better to keep the behavior same as for regular
>>> inheritance.  Will post a patch to get rid of the unnecessary check.
>>>
>>> FWIW, the check was added to prevent the command from succeeding in the
>>> case where WITHOUT OIDS has been specified for a partition and the parent
>>> partitioned table has the OID column.  Regular inheritance simply
>>> *overrides* the WITHOUT OIDS specification, which might be seen as 
>>> surprising.
>>
>> 0001 of the attached patches takes care of this.
> 
> I think 0001 needs to remove this hunk of documentation:
> 
>   
>
> If the partitioned table specified WITH OIDS then
> each partition must also specify WITH OIDS. Oids
> are not automatically inherited by partitions.
>
>  

Attached updated 0001 which does that.

> I think 0001 is better than the status quo, but I'm wondering whether
> we should try to do something slightly different.  Maybe it should
> always work for the child table to specify neither WITH OIDS nor
> WITHOUT OIDS, but if you do specify one of them then it has to be the
> one that matches the parent partitioned table?  With this patch, IIUC,
> WITH OIDS is allowed only if the parent has the same, but WITHOUT OIDS
> is allowed (but ignored) regardless of the parent setting.

With the patch, one can always specify (or not) WITH/WITHOUT OIDS when
creating partitions.  If WITH OIDS is specified and the parent doesn't
have OIDs, then an error is raised.  Then just like with normal
inheritance, WITHOUT OIDS specification for a partition will be
*overridden* if the parent has OIDs.  By the way, CREATE TABLE page says
this about inheritance and OIDS:

  (If the new table inherits from any tables that have OIDs, then
  OIDS=TRUE is forced even if the command says
  OIDS=FALSE.)

Hopefully it's clear to someone reading "If the table inherits from any
tables ..." that it also refers to creating partition of a partitioned table.


Also attaching 0002 (unchanged) for tab-completion support for the new
partitioning syntax.

0003 changes how ExecFindPartition() shows the row for which
get_partition_for_tuple() failed to find a partition.  As Simon commented
upthread, we should show just the partition key, not the whole row in the
error DETAIL.  So the DETAIL now looks like what's shown by
_bt_check_unique() upon uniqueness violation:

DETAIL: Partition key of the failing row contains (key1, key2, ...)=(val1,
val2, ...)

The rules about which columns to show or whether to show the DETAIL at all
are similar to those in BuildIndexValueDescription():

 - if user has SELECT privilege on the whole table, simply go ahead

 - if user doesn't have SELECT privilege on the table, check that they
   can see all the columns in the key (no point in showing a partial
   key); however, abort on finding an expression, since we don't try to
   determine the privilege situation of whatever columns it may contain
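
To make the intended shape concrete, the report could look roughly like
this (a sketch only: the helper that renders the key values is a
hypothetical name, and val_desc is left NULL when the privilege checks
above fail, so the DETAIL is simply omitted):

val_desc = BuildPartitionKeyDescription(rel, values, isnull);	/* assumed helper */
ereport(ERROR,
		(errcode(ERRCODE_CHECK_VIOLATION),
		 errmsg("no partition of relation \"%s\" found for row",
				RelationGetRelationName(rel)),
		 val_desc ? errdetail("Partition key of the failing row contains %s.",
							  val_desc) : 0));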

Thanks,
Amit
>From 1f249c0a395d18532c470ebe4912e33aa61409fd Mon Sep 17 00:00:00 2001
From: amit 
Date: Mon, 13 Feb 2017 13:32:18 +0900
Subject: [PATCH 1/3] Inherit OID system column automatically for partitions

Currently, WITH OIDS must be explicitly specified when creating a
partition if the parent table has the OID system column.  Instead,
inherit it automatically, possibly overriding any explicit WITHOUT
OIDS specification.  Per review comment from Simon Riggs
---
 doc/src/sgml/ddl.sgml  |  8 
 doc/src/sgml/ref/create_table.sgml | 12 +---
 src/backend/commands/tablecmds.c   | 21 -
 src/test/regress/expected/create_table.out | 18 +-
 src/test/regress/sql/create_table.sql  | 10 ++
 5 files changed, 32 insertions(+), 37 deletions(-)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index f909242e4c..5779eac43d 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2863,14 +2863,6 @@ VALUES ('Albany', NULL, NULL, 'NY');
 
  
   
-   If the partitioned table specified WITH OIDS then
-   each partition must also specify WITH OIDS. Oids
-   are not automatically inherited by partitions.
-  
- 
-
- 
-  
One cannot drop a NOT NULL constraint on a
partition's column, if the constraint is present in the parent table.
   
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index e0f7cd9b93..87a3443ee2 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -301,13 +301,11 @@ 

Re: [HACKERS] increasing the default WAL segment size

2017-02-15 Thread Alvaro Herrera
Now that we've renamed "xlog" to "wal" in user-facing elements, I think
we should strive to use the name "wal" internally too in new code, not
"xlog" anymore.  This patch introduces several variables, macros,
functions that ought to change names now -- XLogSegmentOffset should be
WALSegmentOffset for example.  (I expect that as we touch code over
time, the use of "xlog" will decrease, though not fully disappear).

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-02-15 Thread Robert Haas
On Mon, Feb 13, 2017 at 10:22 AM, Amit Kapila  wrote:
> As discussed, attached are refactoring patches and a patch to enable
> WAL for the hash index on top of them.

Thanks.  I think that the refactoring patches shouldn't add
START_CRIT_SECTION() and END_CRIT_SECTION() calls; let's leave that
for the final patch.  Nor should their comments reference the idea of
writing WAL; that should also be for the final patch.

PageGetFreeSpaceForMulTups doesn't seem like a great name.
PageGetFreeSpaceForMultipleTuples?  Or maybe just use
PageGetExactFreeSpace and then do the accounting in the caller.  I'm not
sure it's really worth having a function just to subtract a multiple
of sizeof(ItemIdData), and it would actually be more efficient to have
the caller take care of this, since you wouldn't need to keep
recalculating the value for every iteration of the loop.
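
A sketch of that caller-side accounting (variable names assumed), which
fetches the free space once and then charges each tuple for its line
pointer inside the loop:

Size		freespace = PageGetExactFreeSpace(page);

for (i = 0; i < nitups; i++)
{
	Size		needed = MAXALIGN(IndexTupleSize(itups[i])) + sizeof(ItemIdData);

	if (freespace < needed)
		break;					/* page is full, start a new one */
	/* ... add itups[i] to the page ... */
	freespace -= needed;
}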

I think we ought to just rip (as a preliminary patch) out the support
for freeing an overflow page in the middle of a bucket chain.  That
code is impossible to hit, I believe, and instead of complicating the
WAL machinery to cater to a nonexistent case, maybe we ought to just
get rid of it.

+   /*
+    * We need to release and if required reacquire the lock on
+    * rbuf to ensure that standby shouldn't see an intermediate
+    * state of it.
+    */

How does releasing and reacquiring the lock on the master affect
whether the standby can see an intermediate state?  That sounds
totally wrong to me (and also doesn't belong in a refactoring patch
anyway since we're not emitting any WAL here).

-   Assert(PageIsNew(page));

This worries me a bit.  What's going on here?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] CREATE TABLE with parallel workers, 10.0?

2017-02-15 Thread Andres Freund
On 2017-02-15 20:35:16 -0500, Stephen Frost wrote:
> Perhaps, but until we've got a system worked out for having the workers
> do the writes, we aren't getting anything.  Being able to have the
> leader do the writes based on the tuples fed back from the workers is
> clearly better than nothing.

Never argued differently.




Re: [HACKERS] CREATE TABLE with parallel workers, 10.0?

2017-02-15 Thread Stephen Frost
Andres,

* Andres Freund (and...@anarazel.de) wrote:
> On 2017-02-15 20:28:43 -0500, Stephen Frost wrote:
> > * Andres Freund (and...@anarazel.de) wrote:
> > > On February 15, 2017 5:20:20 PM PST, Stephen Frost  
> > > wrote:
> > > >In many cases, I expect this would work just as well, if not better,
> > > >than trying to actually do writes in parallel.
> > > 
> > > Why? IPCing tuples around is quite expensive.  Or do you mean because 
> > > it'll be more suitable because of the possible plans?
> > 
> > Because I've seen some serious problems when trying to have multiple
> > processes writing into the same relation due to the relation extension
> > lock, cases where it was much faster to have each process write into its
> > own table.  Admittedly, we've improved things there, so perhaps this isn't
> > an issue any longer, but we also don't yet really know what the locking
> > implementation looks like for having multiple parallel workers
> > writing into the same relation, so it may be that sending a few records
> > back to the leader is cheaper than working out the locking to allow
> > parallel workers to write to the same relation, or at least not any more
> > expensive.
> 
> I quite seriously doubt that you will get enough rows to the master via
> tuplequeues that it'll be faster than inserting on the workers, even
> with such scalability problems present. Even before the 9.6
> improvements, and even more so after.

Perhaps, but until we've got a system worked out for having the workers
do the writes, we aren't getting anything.  Being able to have the
leader do the writes based on the tuples fed back from the workers is
clearly better than nothing.

Thanks!

Stephen




Re: [HACKERS] CREATE TABLE with parallel workers, 10.0?

2017-02-15 Thread Andres Freund
On 2017-02-15 20:28:43 -0500, Stephen Frost wrote:
> Andres,
> 
> * Andres Freund (and...@anarazel.de) wrote:
> > On February 15, 2017 5:20:20 PM PST, Stephen Frost  
> > wrote:
> > >In many cases, I expect this would work just as well, if not better,
> > >than trying to actually do writes in parallel.
> > 
> > Why? IPCing tuples around is quite expensive.  Or do you mean because it'll 
> > be more suitable because of the possible plans?
> 
> Because I've seen some serious problems when trying to have multiple
> processes writing into the same relation due to the relation extension
> lock, cases where it was much faster to have each process write into its
> own table.  Admittedly, we've improved things there, so perhaps this isn't
> an issue any longer, but we also don't yet really know what the locking
> implementation looks like for having multiple parallel workers
> writing into the same relation, so it may be that sending a few records
> back to the leader is cheaper than working out the locking to allow
> parallel workers to write to the same relation, or at least not any more
> expensive.

I quite seriously doubt that you will get enough rows to the master via
tuplequeues that it'll be faster than inserting on the workers, even
with such scalability problems present. Even before the 9.6
improvements, and even more so after.

Greetings,

Andres Freund




Re: [HACKERS] CREATE TABLE with parallel workers, 10.0?

2017-02-15 Thread Stephen Frost
Andres,

* Andres Freund (and...@anarazel.de) wrote:
> On February 15, 2017 5:20:20 PM PST, Stephen Frost  wrote:
> >In many cases, I expect this would work just as well, if not better,
> >than trying to actually do writes in parallel.
> 
> Why? IPCing tuples around is quite expensive.  Or do you mean because it'll 
> be more suitable because of the possible plans?

Because I've seen some serious problems when trying to have multiple
processes writing into the same relation due to the relation extension
lock, cases where it was much faster to have each process write into its
own table.  Admittedly, we've improved things there, so perhaps this isn't
an issue any longer, but we also don't yet really know what the locking
implementation looks like for having multiple parallel workers
writing into the same relation, so it may be that sending a few records
back to the leader is cheaper than working out the locking to allow
parallel workers to write to the same relation, or at least not any more
expensive.

Thanks!

Stephen




Re: [HACKERS] CREATE TABLE with parallel workers, 10.0?

2017-02-15 Thread Andres Freund


On February 15, 2017 5:20:20 PM PST, Stephen Frost  wrote:
>In many cases, I expect this would work just as well, if not better,
>than trying to actually do writes in parallel.

Why? IPCing tuples around is quite expensive.  Or do you mean because it'll be 
more suitable because of the possible plans?

Andres

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [HACKERS] CREATE TABLE with parallel workers, 10.0?

2017-02-15 Thread Stephen Frost
All,

* Andres Freund (and...@anarazel.de) wrote:
> On 2017-02-15 08:48:44 -0500, Robert Haas wrote:
> > The other way of fixing this problem is to have each worker generate a
> > subset of the tuples and funnel them all back to the leader through a
> > Gather node; the leader then does all the inserts.  That avoids having
> > to solve the problems mentioned above, but it probably doesn't perform
> > nearly as well.
> 
> I think it'd already be tremendously useful however.  I think it'd not
> be an unreasonable first step. It'd be a good fallback that'd be useful
> for !insert and such anyway.

Absolutely.  I had always figured this would be what we would do first,
before coming up with something more clever down the road.  In
particular, this allows filters to be pushed down and performed in
parallel, which may significantly reduce the result which is passed back
up to the leader.

In many cases, I expect this would work just as well, if not better,
than trying to actually do writes in parallel.

Thanks!

Stephen




Re: [HACKERS] CREATE TABLE with parallel workers, 10.0?

2017-02-15 Thread Andres Freund
Hi,

On 2017-02-15 08:48:44 -0500, Robert Haas wrote:
> We got rid of the major existing use of page locks in
> 6d46f4783efe457f74816a75173eb23ed8930020, which extirpated them from
> hash indexes, and I was kind of hoping they could go away altogether,
> but we can't do that as long as GIN is using them.

Learned a new word today.


> Anyway, if we solve those problems, we can allow inserts (not updates
> or deletes, those have other problems, principally relating to combo
> CIDs) in parallel mode, which would make it possible to allow the
> kinds of things you are asking about here.

I don't think general INSERTs are safe, if you consider unique indexes
and foreign keys (both setting xmax in the simple cases and multixacts
are likely to be problematic).


> The other way of fixing this problem is to have each worker generate a
> subset of the tuples and funnel them all back to the leader through a
> Gather node; the leader then does all the inserts.  That avoids having
> to solve the problems mentioned above, but it probably doesn't perform
> nearly as well.

I think it'd already be tremendously useful however.  I think it'd not
be an unreasonable first step. It'd be a good fallback that'd be useful
for !insert and such anyway.


Regards,

Andres




Re: [HACKERS] CREATE TABLE with parallel workers, 10.0?

2017-02-15 Thread Haribabu Kommi
On Thu, Feb 16, 2017 at 12:48 AM, Robert Haas  wrote:

> On Wed, Feb 15, 2017 at 12:24 AM, Joshua Chamberlain 
> wrote:
> > Hello,
> >
> > (I'm posting to hackers since I got no response on the general list.)
> >
> > I use Postgres + PostGIS quite heavily, and recently have been taking
> full
> > advantage of the new parallelism in 9.6. I'm now running queries in a few
> > hours that would otherwise take more than a day.
> >
> > However, parallelism is disabled for all queries that perform writes (as
> > documented). I would normally run "CREATE TABLE AS [some super-expensive
> > query]", but since that can't use parallelism I'm using the \o option in
> > psql, creating the table separately, and then \copy-ing in the results.
> That
> > works, but "CREATE TABLE AS" would be more convenient.
> >
> > Are there plans in 10.0 to allow parallelism in queries that write, or at
> > least in "CREATE TABLE AS" queries? (Support in materialized views would
> be
> > great, too!)
>
> Somebody else asked me about this in private email.  Nobody at
> EnterpriseDB is working on this right now, and I don't think anybody
> else is working on it either.  There are several ways to do it, but
> none of them are entirely easy and the ones likely to perform better
> are less easy.
>
> I think that what we need to do to make this work is come up with a
> way to fix the interaction between group locking (which allows
> multiple processes to hold mutually conflicting locks at the same time
> if they are in a parallel group) and relation extension (which uses
> heavyweight locking to prevent multiple processes from trying to
> extend the same relation at the same time).  I think that perhaps the
> right way to solve that problem is to come up with a way of providing
> mutual exclusion around relation extension that doesn't need the
> heavyweight lock manager.  Relation extension locks don't need
> multiple lock modes or deadlock detection or any of the other
> frammishes that the heavyweight lock manager provides, but they do
> need to be fast, which the heavyweight lock manager isn't especially
> good at.  So moving this out of the heavyweight lock manager might be
> a way to solve two problems at once.
>
> There's also a hazard related to GIN indexes since
> e2c79e14d998cd31f860854bc9210b37b457bb01, which introduced a new use
> of Page locks, which have a similar kind of problem.  We got rid of
> the major existing use of page locks in
> 6d46f4783efe457f74816a75173eb23ed8930020, which extirpated them from
> hash indexes, and I was kind of hoping they could go away altogether,
> but we can't do that as long as GIN is using them.
>
> Anyway, if we solve those problems, we can allow inserts (not updates
> or deletes, those have other problems, principally relating to combo
> CIDs) in parallel mode, which would make it possible to allow the
> kinds of things you are asking about here.  Then you could fix things
> so that each worker generates a subset of the tuples and inserts the
> ones it generates.  You'd end up with a parallel plan but no Gather
> node!   The workers could feed tuples directly to a suitable
> DestReceiver, which would be really spiffy.
>
> The other way of fixing this problem is to have each worker generate a
> subset of the tuples and funnel them all back to the leader through a
> Gather node; the leader then does all the inserts.  That avoids having
> to solve the problems mentioned above, but it probably doesn't perform
> nearly as well.
>

How about supporting something where the backend does the write
operations while the workers produce the results? It may not perform as
well in all cases as having all the parallel workers do the writes, but
it could be useful for some scenarios, such as CREATE MATERIALIZED VIEW.

Below are explain plans produced with minimal changes in the code to
allow write operations. I haven't verified all the scenarios. How about
supporting write operations this way first, and later enhancing it so
that the parallel workers do the writes as well?

POC patch is attached.

postgres=# explain create materialized view mat_view as select * from tbl
where f1 =10;
                              QUERY PLAN
------------------------------------------------------------------------
 Gather  (cost=1000.00..37458.43 rows=1 width=214)
   Workers Planned: 2
   ->  Parallel Seq Scan on tbl  (cost=0.00..36458.33 rows=1 width=214)
 Filter: (f1 = 10)
(4 rows)

postgres=# explain insert into tbl select * from tbl where f1 = 10;
                              QUERY PLAN
------------------------------------------------------------------------
 Insert on tbl  (cost=1000.00..37458.43 rows=1 width=214)
   ->  Gather  (cost=1000.00..37458.43 rows=1 width=214)
 Workers Planned: 2
 ->  Parallel Seq Scan on tbl tbl_1  (cost=0.00..36458.33 rows=1
width=214)
   Filter: (f1 = 10)

Re: [HACKERS] AT detach partition is broken

2017-02-15 Thread Amit Langote
On 2017/02/15 2:37, Robert Haas wrote:
> On Mon, Feb 13, 2017 at 2:30 AM, Amit Langote
>  wrote:
>> I noticed that running ALTER TABLE table_name DETACH PARTITION crashes, if
>> table_name is not a partitioned table.  That's because of an  Assert in
>> ATExecDetachPartition().  We really should error out much sooner in this
>> case, IOW during transformAlterTableStmt(), as is done in the case of
>> ATTACH PARTITION.
>>
>> Attached patch fixes that.
> 
> -/* assign transformed values */
> -partcmd->bound = cxt.partbound;
> +/*
> + * Assign transformed value of the partition bound, if
> + * any.
> + */
> +if (cxt.partbound != NULL)
> +partcmd->bound = cxt.partbound;
> 
> This hunk isn't really needed, is it?  I mean, if cxt.partbound comes
> out NULL, then partcmd->bound will be NULL with or without adding an
> "if" here, won't it?

You're right.  Took this one out (except slightly tweaking the comment) in
the attached updated patch.

Thanks,
Amit
>From 8df13147eecc1b6a6f3b6187d0b54085c4afb634 Mon Sep 17 00:00:00 2001
From: amit 
Date: Mon, 13 Feb 2017 16:10:17 +0900
Subject: [PATCH] Unbreak ALTER TABLE DETACH PARTITION

DETACH PARTITION should throw error much sooner if the specified
(parent) table is not a partitioned table.  In case of ATTACH
PARTITION that's done during transformAlterTableStmt.  Do the same
for DETACH PARTITION.  Instead of inventing a transformDetachPartition,
rename the existing transformAttachPartition to transformPartitionCmd
and have it serve both cases.
---
 src/backend/parser/parse_utilcmd.c| 24 ++--
 src/test/regress/expected/alter_table.out |  5 +
 src/test/regress/sql/alter_table.sql  |  5 +
 3 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index 0f78abaae2..209034d1e3 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -133,7 +133,7 @@ static void transformConstraintAttrs(CreateStmtContext *cxt,
 		 List *constraintList);
 static void transformColumnType(CreateStmtContext *cxt, ColumnDef *column);
 static void setSchemaName(char *context_schema, char **stmt_schema_name);
-static void transformAttachPartition(CreateStmtContext *cxt, PartitionCmd *cmd);
+static void transformPartitionCmd(CreateStmtContext *cxt, PartitionCmd *cmd);
 
 
 /*
@@ -2654,12 +2654,12 @@ transformAlterTableStmt(Oid relid, AlterTableStmt *stmt,
 }
 
 			case AT_AttachPartition:
+			case AT_DetachPartition:
 {
 	PartitionCmd *partcmd = (PartitionCmd *) cmd->def;
 
-	transformAttachPartition(&cxt, partcmd);
-
-	/* assign transformed values */
+	transformPartitionCmd(&cxt, partcmd);
+	/* assign transformed value of the partition bound */
 	partcmd->bound = cxt.partbound;
 }
 
@@ -3032,11 +3032,14 @@ setSchemaName(char *context_schema, char **stmt_schema_name)
 }
 
 /*
- * transformAttachPartition
- *		Analyze ATTACH PARTITION ... FOR VALUES ...
+ * transformPartitionCmd
+ *		Analyze the ATTACH/DETACH PARTITION command
+ *
+ * In case of the ATTACH PARTITION command, cxt->partbound is set to the
+ * transformed value of cmd->bound.
  */
 static void
-transformAttachPartition(CreateStmtContext *cxt, PartitionCmd *cmd)
+transformPartitionCmd(CreateStmtContext *cxt, PartitionCmd *cmd)
 {
 	Relation	parentRel = cxt->rel;
 
@@ -3050,10 +3053,11 @@ transformAttachPartition(CreateStmtContext *cxt, PartitionCmd *cmd)
  errmsg("\"%s\" is not partitioned",
 		RelationGetRelationName(parentRel))));
 
-	/* transform the values */
+	/* transform the partition bound, if any */
 	Assert(RelationGetPartitionKey(parentRel) != NULL);
-	cxt->partbound = transformPartitionBound(cxt->pstate, parentRel,
-			 cmd->bound);
+	if (cmd->bound != NULL)
+		cxt->partbound = transformPartitionBound(cxt->pstate, parentRel,
+ cmd->bound);
 }
 
 /*
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index b0e80a7788..e84af67fb2 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -3259,6 +3259,11 @@ DETAIL:  "list_parted2" is already a child of "list_parted2".
 --
 -- DETACH PARTITION
 --
+-- check that the table is partitioned at all
+CREATE TABLE regular_table (a int);
+ALTER TABLE regular_table DETACH PARTITION any_name;
+ERROR:  "regular_table" is not partitioned
+DROP TABLE regular_table;
 -- check that the partition being detached exists at all
 ALTER TABLE list_parted2 DETACH PARTITION part_4;
 ERROR:  relation "part_4" does not exist
diff --git a/src/test/regress/sql/alter_table.sql b/src/test/regress/sql/alter_table.sql
index 7513769359..a403fd8cb4 100644
--- 

Re: [HACKERS] bytea_output vs make installcheck

2017-02-15 Thread Tom Lane
Andres Freund  writes:
> On 2017-02-15 18:30:30 -0500, Tom Lane wrote:
>> If we tried to lock that down it'd be counterproductive for the reason
>> Andres mentions: sometimes you *want* to see what you get for other
>> settings.

> We could kinda address that by doing it in a separate file early in the
> schedule, which could just be commented out when doing something like
> this.  But I'm still unconvinced it's worth caring.

Actually, that idea might be worth pursuing.  Right now pg_regress.c has a
hard-wired notion that it should ALTER DATABASE SET certain parameters on
the regression database.  The best you can say for that is it's ugly as
sin.  It'd definitely be nicer if we could move that into someplace where
it's more readily adjustable.  Having done that, locking down most stuff
by default might become more practical.

However, I'm not sure that just dumping the responsibility into an initial
test script is very workable.  We'd need another copy for each contrib
and PL test suite, the isolation tests, yadda yadda.  And maintaining
that many copies would be a nightmare.  I think we'd need some way to have
pg_regress pick up the desired settings from a central place.
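
One possible shape, sketched here with made-up contents (only
lc_messages below is something pg_regress really forces today; the rest
are placeholders): keep the forced settings in a single table that
pg_regress and the other suites share, and loop over it where the
hard-wired ALTER DATABASE commands are issued now.

static const char *const forced_settings[] = {
	"lc_messages = 'C'",
	"bytea_output = 'hex'",		/* placeholder, not a vetted choice */
	NULL
};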

regards, tom lane




Re: [HACKERS] Logical replication existing data copy

2017-02-15 Thread Petr Jelinek
On 13/02/17 14:51, Erik Rijkers wrote:
> On 2017-02-11 11:16, Erik Rijkers wrote:
>> On 2017-02-08 23:25, Petr Jelinek wrote:
>>
>>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
>>> 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
>>> 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
>>> 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
>>> 0001-Logical-replication-support-for-initial-data-copy-v4.patch
>>
>> This often works but it also fails far too often (in my hands).  I
>> test whether the tables are identical by comparing an md5 from an
>> ordered resultset, from both replica and master.  I estimate that 1 in
>> 5 tries fail; 'fail'  being a somewhat different table on replica
>> (compared to master), most often pgbench_accounts (typically there are
>> 10-30 differing rows).  No errors or warnings in either logfile.   I'm
>> not sure but I think testing on faster machines seems to be doing
>> somewhat better ('better' being less replication error).
>>
> 
> I have noticed that when I insert a few seconds wait-state after the
> create subscription (or actually: the 'enable'ing of the subscription)
> the problem does not occur.  Apparently, (I assume) the initial snapshot
> occurs somewhere when the subsequent pgbench-run has already started, so
> that the logical replication also starts somewhere 'into' that
> pgbench-run. Does that make sense?
> 
> I don't know what to make of it.  Now that I think that I understand
> what happens I hesitate to call it a bug. But I'd say it's still a
> useability problem that the subscription is only 'valid' after some
> time, even if it's only a few seconds.
> 

It is a bug; we go to great lengths to create a data snapshot that
corresponds to a specific LSN so that we are able to decode exactly the
changes that happened since the data snapshot was taken. And tablecopy.c
does quite a lot to synchronize the table handover to the main apply
process so that the data stream continues correctly as well. So the end
result is that concurrent changes are supposed to be okay; eventually
replication should catch up and the contents should be the same.

That being said, I am so far having problems reproducing this on my test
machine(s) so no idea what causes it yet.

Could you periodically dump contents of the pg_subscription_rel on
subscriber (ideally when dumping the md5 of the data) and attach that as
well?

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] bytea_output vs make installcheck

2017-02-15 Thread Andres Freund
On 2017-02-15 18:30:30 -0500, Tom Lane wrote:
> If we tried to lock that down it'd be counterproductive for the reason
> Andres mentions: sometimes you *want* to see what you get for other
> settings.

We could kinda address that by doing it in a separate file early in the
schedule, which could just be commented out when doing something like
this.  But I'm still unconvinced it's worth caring.




Re: [HACKERS] bytea_output vs make installcheck

2017-02-15 Thread Tom Lane
Andres Freund  writes:
> On 2017-02-15 09:30:39 -0800, Jeff Janes wrote:
>> I don't really see the cost here.

> Because that means we essentially need to make sure that our tests pass
> with a combination of about ~20-30 behaviour changing gucs, and ~5
> different compilation settings that change output.

Yeah, the problem with addressing this non-systematically is that it'll
never stay fixed for long.

> Alternatively we could ALTER DATABASE a bunch of settings on the
> regression database at the start, but I'm not sure that's nice either,
> because it makes ad-hoc tests with unusual settings harder.

I'd definitely be -1 on that.

I think that it is worth fixing cases where a parameter change leads to
surprising results, like the operator_precedence_warning case just now.
But people should not be surprised when, say, changes in planner
parameters lead to different EXPLAIN output or different row ordering.
If we tried to lock that down it'd be counterproductive for the reason
Andres mentions: sometimes you *want* to see what you get for other
settings.

regards, tom lane




Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-02-15 Thread Thomas Munro
Out of archeological curiosity, I was digging around in the hash join
code and RCS history from Postgres 4.2[1], and I was astounded to
discover that it had a parallel executor for Sequent SMP systems and
was capable of parallel hash joins as of 1991.  At first glance, it
seems to follow approximately the same design as I propose: share a
hash table and use a barrier to coordinate the switch from build phase
to probe phase and deal with later patches.  It uses mmap to get space
and then works with relative pointers.  See
src/backend/executor/n_hash.c and src/backend/executor/n_hashjoin.c.
Some of this might be described in Wei Hong's PhD thesis[2] which I
haven't had the pleasure of reading yet.
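
The relative-pointer idiom is simple enough to show in miniature (a
sketch of the idea, not the 4.2 code itself): store offsets from the
mmap'd base rather than raw pointers, so every cooperating process can
address the shared hash table even if the segment is mapped at a
different address in each of them.

typedef long RelativeAddr;

/* turn a base-relative offset back into a usable pointer */
#define ABSADDR(base, ra)	((void *) ((char *) (base) + (ra)))
/* and store a pointer as a base-relative offset */
#define RELADDR(base, ptr)	((RelativeAddr) ((char *) (ptr) - (char *) (base)))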

The parallel support is absent from the first commit in our repo
(1996), but there are some vestiges like RelativeAddr and ABSADDR used
to access the hash table (presumably needlessly) and also some
mentions of parallel machines in comments that survived up until
commit 26069a58 (1999).

[1] http://db.cs.berkeley.edu/postgres.html
[2] http://db.cs.berkeley.edu/papers/ERL-M93-28.pdf

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] gitlab post-mortem: pg_basebackup waiting for checkpoint

2017-02-15 Thread Jeff Janes
On Tue, Feb 14, 2017 at 9:06 AM, Magnus Hagander 
wrote:

> Yeah, that's my view as well. I'm all for including it in verbose mode.
>
> *Iff* we can get a progress indicator through the checkpoint we could
> include that in --progress mode. But that's a different patch, of course,
> but it shouldn't be included in the default output even if we find it.
>
>
I think it should show up in --progress mode.  It would be great if we
could show fine-grained progress reports on the checkpoint, but if we can't
do that we should still report as finely as we are able to, which is that
a checkpoint is in progress. Otherwise we are making the perfect the
enemy of the good.

Cheers,

Jeff


Re: [HACKERS] bytea_output vs make installcheck

2017-02-15 Thread Andres Freund
Hi,

On 2017-02-15 09:30:39 -0800, Jeff Janes wrote:
> On Tue, Feb 14, 2017 at 9:17 PM, Andres Freund  wrote:
> > What's your reason to get this fixed?
> >
> 
> More testing is better than less testing, and a good way to get less
> testing is requiring the tester to memorize a list of false positives they
> might encounter.  I'd like to systematically clone my production system,
> run it through pg_upgrade, and then do installcheck (plus my own
> app-specific) on the result, and I'd like others to do that as well with
> their own production systems.

> I don't really see the cost here.

Because that means we essentially need to make sure that our tests pass
with a combination of about ~20-30 behaviour changing gucs, and ~5
different compilation settings that change output.  Either we do that
systematically - which'd be a fair amount of effort - or we're not
getting anywhere, because the setting around the next corner breaks a
bunch of different tests.

Should tests pass with random client_min_messages settings? Random
enable_*? Switched default_with_oids? statement_timeout? I could go on
for a long while.

Having to think about all GUCs/compile time settings that could
potentially affect a test and manually set everything up so they don't
makes writing tests a lot harder, and we'll fail anyway unless it's
checked in an automated fashion.  Unless that testing is done you're not
really benefiting, because you can't rely on test failures meaning much.

If we want to have a list of GUCs that we want tests to pass under, then
the proponents of that at least should bring up a buildfarm animal
afterwards that test that reasonable combinations of those pass and
continue to pass.


Alternatively we could ALTER DATABASE a bunch of settings on the
regression database at the start, but I'm not sure that's nice either,
because it makes ad-hoc tests with unusual settings harder.

Greetings,

Andres Freund




Re: [HACKERS] Logical Replication and Character encoding

2017-02-15 Thread Petr Jelinek
On 14/02/17 03:23, Kyotaro HORIGUCHI wrote:
> At Sat, 4 Feb 2017 21:27:32 +0100, Petr Jelinek 
>  wrote in 
> 
>> Hmm I wonder if we should just make the subscriber send the
>> client_encoding always (based on server encoding of the subscriber).
>> That should solve the issue in combination with your patch no?
> 
> Yeah, right. I considered that a subscriber might want to set its
> own value for that but that is useless.
> 
> The attached patch does the following things to just prevent
> making a logical replication connection between databases with
> inconsistent encodings.
> 
> - added client_encoding with the subscriber's (or standby's) encoding
>   as the last of the options in libpqrcv_connect.
> 
> - CheckLogicalDecodingRequirements refuses connection for a
>   request with inconsistent encodings.
> 
>> ERROR:  logical replication requires consistent encodings on both side 
>> (publisher = UTF8, subscriber = EUC_JP)
> 

I am not quite convinced that this should be handled by logical decoding
itself. It's quite possible to have output plugins that will handle this
correctly for their use-cases (by doing a conversion similar to what you
did in the original patch), so they should not be prevented from doing
so. So it's probably better to check this in the plugin.

I do like the idea of just using client_encoding in libpqrcv_connect though.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] DROP SUBSCRIPTION and ROLLBACK

2017-02-15 Thread Petr Jelinek
On 15/02/17 06:43, Masahiko Sawada wrote:
> On Tue, Feb 14, 2017 at 1:13 AM, Fujii Masao  wrote:
>> On Mon, Feb 13, 2017 at 4:05 PM, Masahiko Sawada  
>> wrote:
>>> On Sat, Feb 11, 2017 at 8:18 PM, Petr Jelinek
>>>  wrote:
 On 10/02/17 19:55, Masahiko Sawada wrote:
> On Thu, Feb 9, 2017 at 12:44 AM, Petr Jelinek
>  wrote:
>> On 08/02/17 07:40, Masahiko Sawada wrote:
>>> On Wed, Feb 8, 2017 at 9:01 AM, Michael Paquier
>>>  wrote:
 On Wed, Feb 8, 2017 at 1:30 AM, Fujii Masao  
 wrote:
> On Wed, Feb 8, 2017 at 12:26 AM, Petr Jelinek
>  wrote:
>> For example what happens if apply crashes during the DROP
>> SUBSCRIPTION/COMMIT and is not started because the delete from 
>> catalog
>> is now visible so the subscription is no longer there?
>
> Another idea is to treat DROP SUBSCRIPTION in the same way as VACUUM, 
> i.e.,
> make it emit an error if it's executed within user's transaction 
> block.

 It seems to me that this is exactly Petr's point: using
 PreventTransactionChain() to prevent things from happening.
>>>
>>> Agreed. It's better to prevent to be executed inside user transaction
>>> block. And I understood there is too many failure scenarios we need to
>>> handle.
>>>
> Also DROP SUBSCRIPTION should call CommitTransactionCommand() just
> after removing the entry from pg_subscription, then connect to the 
> publisher
> and remove the replication slot.

 For consistency that may be important.
>>>
>>> Agreed.
>>>
>>> Attached patch, please give me feedback.
>>>
>>
>> This looks good (and similar to what initial patch had btw). Works fine
>> for me as well.
>>
>> Remaining issue is, what to do about CREATE SUBSCRIPTION then, there are
>> similar failure scenarios there, should we prevent it from running
>> inside transaction as well?
>>
>
> Hmm, after some thought I have doubts about the approach currently
> being discussed. For example, please imagine the case where CREATE
> SUBSCRIPTION creates the subscription successfully but fails to create
> the replication slot for whatever reason, and then DROP SUBSCRIPTION
> drops the subscription but dropping the replication slot fails. In such
> a case, CREATE SUBSCRIPTION and DROP SUBSCRIPTION return ERROR but the
> subscription is created and dropped successfully. I think that this
> behaviour would confuse the user.
>
> I think we should just prevent calling DROP SUBSCRIPTION in user's
> transaction block. Or I guess that it could be better to separate the
> starting/stopping logical replication from subscription management.
>

 We need to stop the replication worker(s) in order to be able to drop
 the slot. There is no such issue with startup of the worker as that one
 is launched by launcher after the transaction has committed.

 IMO best option is to just don't allow DROP/CREATE SUBSCRIPTION inside a
 transaction block and don't do any commits inside of those (so that
 there are no rollbacks, which solves your initial issue I believe). That
 way failure to create/drop slot will result in subscription not being
 created/dropped which is what we want.
>>
>> On second thought, +1.
>>
>>> I basically agree with this option, but why do we need to change CREATE
>>> SUBSCRIPTION as well?
>>
>> Because the window between the creation of replication slot and the 
>> transaction
>> commit of CREATE SUBSCRIPTION should be short. Otherwise, if any error 
>> happens
>> during that window, the replication slot unexpectedly remains while there is 
>> no
>> corresponding subscription. Of course, even If we prevent CREATE SUBSCRIPTION
>> from being executed within user's transaction block, there is still such
>> window. But we can reduce the possibility of that problem.
> 
> Thank you for the explanation. I understood and agree.
> 
> I think we should disallow to call ALTER SUBSCRIPTION inside a user's
> transaction block as well.

Why? ALTER SUBSCRIPTION does not create/drop anything on remote server ever.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] bytea_output vs make installcheck

2017-02-15 Thread neha khatri
Agreed with Jeff; false alarms should be avoided whenever it is easy to
put an avoidance mechanism in place.

Regards,
Neha


Re: [HACKERS] Proposal: GetOldestXminExtend for ignoring arbitrary vacuum flags

2017-02-15 Thread Tom Lane
Andres Freund  writes:
> On 2017-02-15 12:27:11 -0500, Robert Haas wrote:
>> I agree; also, many years ago a guy named Tom Lane told me that flags
>> argument should typically be declared as type "int".  I've followed
>> that advice ever since.

> Why is that?  I think uint makes a lot more sense for flags where the
> flags are individual bits that set/unset. Doing that with the sign bit
> isn't a good idea.

It would only matter if you got to the point of needing the sign bit
for a flag, which basically never happens for this sort of usage
(or if it does, you'd better be thinking about switching to int64
anyway).  So the extra mental and typing load of choosing some other
type than plain old "int" isn't really justified, IMO.

Anyway the real point is that "int" is preferable to "uint8" which
was the original choice here.  "uint8" unduly constrains the number
of flags you can add without an ABI break, and on many machines it's
more efficient to pass "int" function arguments than some other width.
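
In other words (names here are purely illustrative, not a real API):

#define FROB_VERBOSE	0x0001
#define FROB_FORCE		0x0002
/* ... room for roughly thirty more flags without an ABI break ... */

extern void frob_whatzit(int flags);	/* "int", not uint8 */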

regards, tom lane




Re: [HACKERS] Patch to implement pg_current_logfile() function

2017-02-15 Thread Tom Lane
"Karl O. Pinc"  writes:
> On Wed, 15 Feb 2017 15:23:00 -0500
> Robert Haas  wrote:
>> I think our preferred style is not to break strings across lines like
>> this.

> How do you do that and not exceed the 80 character line limit?
> Just ignore the line length limit?

Right.  It depends on context, but letting isolated lines wrap doesn't
do that much damage to readability, IMO anyway.  (If you've got a bunch
of these adjacent to each other, you might choose to break them to
reduce confusion.)

The advantage of not breaking up error strings is that it makes grepping
for them more reliable, when you're wondering "where in the sources did
this error come from?".  If you get a report about "could not frob a
whatzit" and grep for "frob a whatzit", but somebody decided to break
that string in the middle to avoid a line wrap, you won't find the spot.

regards, tom lane




Re: [HACKERS] Proposal: GetOldestXminExtend for ignoring arbitrary vacuum flags

2017-02-15 Thread Andres Freund
On 2017-02-15 12:27:11 -0500, Robert Haas wrote:
> On Tue, Feb 14, 2017 at 1:38 PM, Jim Nasby  wrote:
> > On 2/14/17 3:13 AM, Seki, Eiji wrote:
> >>   +extern TransactionId GetOldestXmin(Relation rel, uint8 ignoreFlags);
> >
> >
> > My impression is that most other places that do this sort of thing just call
> > the argument 'flags', so as not to "lock in" a single idea of what the flags
> > are for. I can't readily think of another use for flags in GetOldestXmin,
> > but ISTM it's better to just go with "flags" instead of "ignoreFlags".
> 
> I agree; also, many years ago a guy named Tom Lane told me that flags
> argument should typically be declared as type "int".  I've followed
> that advice ever since.

Why is that?  I think uint makes a lot more sense for flags where the
flags are individual bits that set/unset. Doing that with the sign bit
isn't a good idea.

Andres




Re: [HACKERS] Patch to implement pg_current_logfile() function

2017-02-15 Thread Alvaro Herrera
Robert Haas wrote:
> On Fri, Jan 20, 2017 at 2:09 AM, Michael Paquier
>  wrote:
> > Okay I just did it. At the same time the check for ferror is not
> > necessary as fgets() returns NULL on an error as well so that's dead
> > code. I have also removed the useless call to FreeFile().
> 
> diff --git a/src/backend/postmaster/postmaster.c
> b/src/backend/postmaster/postmaster.c
> index 271c492..a7ebb74 100644
> --- a/src/backend/postmaster/postmaster.c
> +++ b/src/backend/postmaster/postmaster.c
> @@ -1733,7 +1733,7 @@ ServerLoop(void)
>  }
> 
>  /* If we have lost the log collector, try to start a new one */
> -if (SysLoggerPID == 0 && Logging_collector)
> +if (SysLoggerPID == 0)
>  SysLoggerPID = SysLogger_Start();
> 
>  /*
> 
> This hunk has zero chance of being acceptable, I think.  We're not
> > going to start running the syslogger even when it's not configured
> to run.

So what is going on here is that SysLogger_Start() wants to unlink the
current-logfile file if the collector is not enabled.  This should
probably be split out into a separate new function, for two reasons:
first, it doesn't seem good idea to have SysLogger_Start do something
other than start the logger; and second, so that we don't have a syscall
on each ServerLoop iteration.  That new function should be called from
some other place -- perhaps reaper() and just before entering
ServerLoop, so that the file is deleted if the syslogger goes away or is
not started.
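
A sketch of that split (the function name is invented here):

static void
RemoveLogMetainfoFile(void)
{
	if (!Logging_collector)
		unlink(LOG_METAINFO_DATAFILE);
}

SysLogger_Start() then goes back to only starting the logger, and this
helper runs from reaper() and once before entering ServerLoop().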

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Missing CHECK_FOR_INTERRUPTS in hash joins

2017-02-15 Thread Tom Lane
Thomas Munro  writes:
> On Wed, Feb 15, 2017 at 2:22 PM, Tom Lane  wrote:
>> Adding a C.F.I. inside this loop is the most straightforward fix, but
>> I am leaning towards adding one in ExecHashJoinGetSavedTuple instead,

> Would it also make sense to put one in the loop in
> ExecHashIncreaseNumBatches (or perhaps
> ExecHashJoinSaveTuple for symmetry with the above)?  Otherwise you
> might have to wait for a few hundred MB of tuples to be written out
> which could be slow if IO is somehow overloaded.

Mmm, good point.  I think in that case the C.F.I. had better be in
the loop in ExecHashIncreaseNumBatches, because if you were unlucky
the loop might not take the ExecHashJoinSaveTuple path for a long time.

Looking around at other callers of ExecHashJoinSaveTuple, the only one
that seems to be in need of a C.F.I. is the loop in
ExecHashRemoveNextSkewBucket, and there again there's a code path
whereby the loop doesn't call ExecHashJoinSaveTuple.

Will CFI-ify all three places.
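
In outline, each of those loops would become (the tuple-fetching call
here is a stand-in, not the real code):

while ((tuple = GetNextBatchTuple(hashtable)) != NULL)
{
	/* serviced even on iterations that never reach ExecHashJoinSaveTuple() */
	CHECK_FOR_INTERRUPTS();

	/* ... rehash the tuple; keep it in memory or dump it to its new batch ... */
}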

regards, tom lane




Re: [HACKERS] Patch to implement pg_current_logfile() function

2017-02-15 Thread Karl O. Pinc
On Wed, 15 Feb 2017 15:23:00 -0500
Robert Haas  wrote:

> + errhint("The supported log formats are
> \"stderr\""
> +" and \"csvlog\".")));
> 
> I think our preferred style is not to break strings across lines like
> this.

How do you do that and not exceed the 80 character line limit?
Just ignore the line length limit?


Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein




Re: [HACKERS] [PROPOSAL] Temporal query processing with range types

2017-02-15 Thread David G. Johnston
On Wed, Feb 15, 2017 at 12:24 PM, Robert Haas  wrote:

> On Tue, Jan 24, 2017 at 4:32 AM, Peter Moser  wrote:
> >> Using common terms such as ALIGN and NORMALIZE for such a specific
> >> functionality seems a bit wrong.
> >
> > Would ALIGN RANGES/RANGE ALIGN and NORMALIZE RANGES/RANGE NORMALIZE be
> better
> > options? We are also thankful for any suggestion or comments about the
> syntax.
>
> So it seems like an ALIGN or NORMALIZE option is kind of like a JOIN,
> except apparently there's no join type and the optimizer can never
> reorder these operations with each other or with other joins.  Is that
> right?  The optimizer changes in this patch seem fairly minimal, so
> I'm guessing it can't be doing anything very complex here.
>
> + * INPUT:
> + * (r ALIGN s ON q WITH (r.ts, r.te, s.ts, s.te)) c
> + * where q can be any join qualifier, and r.ts, r.te, s.ts, and s.te
> + * can be any column name.
> + *
> + * OUTPUT:
> + * (
> + * SELECT r.*, GREATEST(r.ts, s.ts) P1, LEAST(r.te, s.te) P2
> + *  FROM
> + *  (
> + * SELECT *, row_id() OVER () rn FROM r
> + *  ) r
> + *  LEFT OUTER JOIN
> + *  s
> + *  ON q AND r.ts < s.te AND r.te > s.ts
> + *  ORDER BY rn, P1, P2
> + *  ) c
>
> It's hard to see what's going on here.  What's ts?  What's te?  If you
> used longer names for these things, it might be a bit more
> self-documenting.
>

Just reasoning out loud here...

ISTM ts and te are "temporal [range] start" and "temporal [range] end" (or
probably just the common "timestamp start/end").

From what I can see it is effecting an intersection of the two ranges
and, furthermore, splitting the LEFT range into sub-ranges that match up
with the sub-ranges found on the right side.  From the example above this
seems like it should be acting on self-normalized ranges - but I may be
missing something by evaluating this out of context.

r1 [1, 6] [ts, te] [time period start, time period end]
s1 [2, 3]
s2 [3, 4]
s3 [5, 7]

r LEFT JOIN s ON (r.ts < s.te AND r.te > s.ts)

r1[1, 6],s1[2, 3] => [max(r.ts, s.ts),min(r.te, s.te)] => r1[1, 6],d[2, 3]
r1[1, 6],s2[3, 4] => [max(r.ts, s.ts),min(r.te, s.te)] => r1[1, 6],d[3, 4]
r1[1, 6],s3[5, 7] => [max(r.ts, s.ts),min(r.te, s.te)] => r1[1, 6],d[5, 6]

Thus the intersection is [2,6], but since s has three ranges that begin
between 2 and 6 (i.e., 2, 3, and 5) there are three output records that
correspond to those sub-ranges.

The description in the OP basically distinguishes between NORMALIZE and
ALIGN in that ALIGN, as described above, effects an INTERSECTION of the two
ranges - discarding the non-overlapping data - while NORMALIZE performs the
alignment while also retaining the non-overlapping data.

The rest of the syntax seems to deal with selecting subsets of range
records based upon attribute data.

David J.


Re: [HACKERS] operator_precedence_warning vs make installcheck

2017-02-15 Thread Jeff Janes
On Wed, Feb 15, 2017 at 11:47 AM, Tom Lane  wrote:

> I wrote:
> > We could possibly prevent the difference by having exprLocation look
> > through such nodes.  I'm not sure offhand if there are cases where
> > that would be worse than before.  We've definitely made some other
> > hacks to hide the difference between operator_precedence_warning on
> > and off.
>
> After some study I concluded the best fix is just to make the AEXPR_PAREN
> node have the same reportable location as its child node to begin with.
> None of the code dealing with precedence errors was using the location
> of the left parenthesis, so there's no good reason to store that.
>
> Pushed a fix along that line.
>
> regards, tom lane
>

Thanks.


Re: [HACKERS] Patch to implement pg_current_logfile() function

2017-02-15 Thread Robert Haas
On Fri, Jan 20, 2017 at 2:09 AM, Michael Paquier
 wrote:
> Okay I just did it. At the same time the check for ferror is not
> necessary as fgets() returns NULL on an error as well so that's dead
> code. I have also removed the useless call to FreeFile().

diff --git a/src/backend/postmaster/postmaster.c
b/src/backend/postmaster/postmaster.c
index 271c492..a7ebb74 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1733,7 +1733,7 @@ ServerLoop(void)
 }

 /* If we have lost the log collector, try to start a new one */
-if (SysLoggerPID == 0 && Logging_collector)
+if (SysLoggerPID == 0)
 SysLoggerPID = SysLogger_Start();

 /*

This hunk has zero chance of being acceptable, I think.  We're not
going to start running the syslogger even when it's not configured
to run.

I think what should happen is that the postmaster should try to remove
any old metainfo datafile before it reaches this point, and then if
the logging collector is started it will recreate it.

+ereport(WARNING,
+(errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("corrupted data found in \"%s\"",
+LOG_METAINFO_DATAFILE)));

elog seems fine here.  There's no need for this to be translatable.
Also, I'd change the level to ERROR.

+ errhint("The supported log formats are \"stderr\""
+" and \"csvlog\".")));

I think our preferred style is not to break strings across lines like this.

+log_filepath[strcspn(log_filepath, "\n")] = '\0';

We have no existing dependency on strcspn().  It might be better not
to add one just for this feature; I suggest using strchr() instead,
which is widely used in our existing code.
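
That is, something like:

char	   *nl = strchr(log_filepath, '\n');

if (nl)
	*nl = '\0';

which trims the trailing newline the same way the strcspn() version
does.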

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Sum aggregate calculation for single precsion real

2017-02-15 Thread Claudio Freire
On Wed, Feb 15, 2017 at 9:52 AM, Robert Haas  wrote:
> On Tue, Feb 14, 2017 at 11:45 PM, Tom Lane  wrote:
>> You could perhaps make an argument that sum(float4) would have less risk
>> of overflow if it accumulated in and returned float8, but frankly that
>> seems a bit thin.
>
> I think that's more or less the argument Konstantin is in fact making.
> Whether it's a good argument or a thin one is a value judgement.
> Personally, I find it somewhere in the middle: I think the way it
> works now is reasonable, and I think what he wants would have been
> reasonable as well.  However, I find it hard to believe it would be
> worth changing now on backward compatibility grounds.  He doesn't like
> the way it works currently, but we have no way of knowing how many
> people who are happy with the way it works today would become unhappy
> if we changed it.  We need a fairly compelling reason to risk breaking
> somebody's SQL, and I don't think this rises to that level.


I know this is said from time to time in this list, but a third option
that wouldn't break anybody's SQL would be using compensated summation
in the input type.

AFAIK, that can only increase precision, but it will cost cycles. The
impact could however fall below the noise and be worth a try.
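
For reference, a minimal sketch of classic Kahan (compensated) summation
as it could look for float4 input; this is only an illustration of the
technique, not a proposed patch:

    /* Sum float4 values, carrying the rounding error in "c". */
    float
    sum_float4_kahan(const float *vals, int n)
    {
        float       sum = 0.0f;
        float       c = 0.0f;   /* running compensation */
        int         i;

        for (i = 0; i < n; i++)
        {
            float       y = vals[i] - c;
            float       t = sum + y;

            c = (t - sum) - y;  /* error lost by the addition above */
            sum = t;
        }
        return sum;
    }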


Re: [HACKERS] operator_precedence_warning vs make installcheck

2017-02-15 Thread Tom Lane
I wrote:
> We could possibly prevent the difference by having exprLocation look
> through such nodes.  I'm not sure offhand if there are cases where
> that would be worse than before.  We've definitely made some other
> hacks to hide the difference between operator_precedence_warning on
> and off.

After some study I concluded the best fix is just to make the AEXPR_PAREN
node have the same reportable location as its child node to begin with.
None of the code dealing with precedence errors was using the location
of the left parenthesis, so there's no good reason to store that.

Pushed a fix along that line.

regards, tom lane


Re: [HACKERS] Should we cacheline align PGXACT?

2017-02-15 Thread Robert Haas
On Wed, Feb 15, 2017 at 2:15 PM, Andres Freund  wrote:
> I don't think that's true for several reasons.  Separating out PGXACT
> didn't just mean reducing the stride size of the access / preventing
> sharing. It also meant that frequently changing fields in PGPROC aren't
> on the same cache-line as fields in PGXACT.  That makes quite a
> difference, because with the split a lot of the cachelines "backing"
> PGPROC can stay in 'shared' mode in several CPU caches, while
> modifications to PGPROC largely can stay in 'exclusive' mode locally on
> the CPU the backend is currently running on.  I think I previously
> mentioned, even just removing the MyPgXact->xmin assignment in
> SnapshotResetXmin() is measurable performance wise and cache-hit ratio
> wise.

Oh, hmm.  I didn't think about that angle.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Does having a NULL column automatically exclude the table from the tupleDesc cache?

2017-02-15 Thread Ryan Murphy
> attcacheoff can only be set positive for fields preceding any varlena
> (typlen<0, but including the first such) or nullable values.  I don't
> know how much faster it is with the cache; you can measure it if your
> curiosity is strong enough -- just set the first column to nullable.
>
>
Thanks!  Maybe I'll do some benchmarks.


Re: [HACKERS] Parallel Index-only scan

2017-02-15 Thread Robert Haas
On Thu, Jan 19, 2017 at 7:07 AM, Rafia Sabih
 wrote:
> Please find the attached file rebased patch of parallel index-only
> scan on the latest Parallel index-scan patch [1].

This again needs minor rebasing but basically looks fine.  It's a
pretty straightforward extension of the parallel index scan work.

Please make sure that this is pgindent-clean - i.e. that when you
pgindent the files that it touches, pgindent doesn't change anything
of the same parts of the file that you've changed in the patch.  Also,
I believe Amit may have made some adjustments to the logic in
nodeIndexScan.c; if so, it would be good to make sure that the
nodeIndexOnlyScan.c changes match what was done there.  In particular,
he's got this:

if (reset_parallel_scan && node->iss_ScanDesc->parallel_scan)
index_parallelrescan(node->iss_ScanDesc);

And you've got this:

+   if (reset_parallel_scan)
+   index_parallelrescan(node->ioss_ScanDesc);

There might be some other inconsistencies as well that I didn't notice
on a quick look.
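
Presumably the index-only-scan side should grow the same guard,
something like (untested sketch):

    if (reset_parallel_scan && node->ioss_ScanDesc->parallel_scan)
        index_parallelrescan(node->ioss_ScanDesc);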

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] [PROPOSAL] Temporal query processing with range types

2017-02-15 Thread Robert Haas
On Tue, Jan 24, 2017 at 4:32 AM, Peter Moser  wrote:
>> Using common terms such as ALIGN and NORMALIZE for such a specific
>> functionality seems a bit wrong.
>
> Would ALIGN RANGES/RANGE ALIGN and NORMALIZE RANGES/RANGE NORMALIZE be better
> options? We are also thankful for any suggestion or comments about the syntax.

So it seems like an ALIGN or NORMALIZE option is kind of like a JOIN,
except apparently there's no join type and the optimizer can never
reorder these operations with each other or with other joins.  Is that
right?  The optimizer changes in this patch seem fairly minimal, so
I'm guessing it can't be doing anything very complex here.

What happens if you perform the ALIGN or NORMALIZE operation using
something other than an equality operator, like, say, less-than?  Or
an arbitrary user-defined operator.

There's no documentation in this patch.  I'm not sure you want to go
to the trouble of writing SGML documentation until this has been
reviewed enough that it has a real chance of getting committed, but on
the other hand we're obviously all struggling to understand what it
does, so I think if not SGML documentation it at least needs a real
clear explanation of what the syntax is and does in a README or
something, even just for initial review.

We don't have anything quite like this in PostgreSQL today.  An
ordinary join just matches up things in relation A and relation B and
outputs the matching rows, and something like a SRF takes a set of
input rows and returns a set of output rows.  This is a hybrid - it
takes in two sets of rows, one from each relation being joined, and
produces a derived set of output rows that takes into account both
inputs.

+ * INPUT:
+ * (r ALIGN s ON q WITH (r.ts, r.te, s.ts, s.te)) c
+ * where q can be any join qualifier, and r.ts, r.te, s.ts, and s.te
+ * can be any column name.
+ *
+ * OUTPUT:
+ * (
+ * SELECT r.*, GREATEST(r.ts, s.ts) P1, LEAST(r.te, s.te) P2
+ *  FROM
+ *  (
+ * SELECT *, row_id() OVER () rn FROM r
+ *  ) r
+ *  LEFT OUTER JOIN
+ *  s
+ *  ON q AND r.ts < s.te AND r.te > s.ts
+ *  ORDER BY rn, P1, P2
+ *  ) c

It's hard to see what's going on here.  What's ts?  What's te?  If you
used longer names for these things, it might be a bit more
self-documenting.

If we are going to transform an ALIGN operator in to a left outer
join, why do we also have an executor node for it?

+   fcLowerLarg = makeFuncCall(SystemFuncName("lower"),
+  list_make1(crLargTs),
+  UNKNOWN_LOCATION);
+   fcLowerRarg = makeFuncCall(SystemFuncName("lower"),
+  list_make1(crRargTs),
+  UNKNOWN_LOCATION);
+   fcUpperLarg = makeFuncCall(SystemFuncName("upper"),
+  list_make1(crLargTs),
+  UNKNOWN_LOCATION);
+   fcUpperRarg = makeFuncCall(SystemFuncName("upper"),
+  list_make1(crRargTs),
+  UNKNOWN_LOCATION);

Why is a temporal operator calling functions that upper-case and
lower-case strings?  In one sense this whole function (and much of the
nearby code) is very straightforward code and you can see exactly why
it's doing it.  In another sense it's totally inscrutable: WHY is it
doing any of that stuff?

-   char   *strategy;   /* partitioning strategy
('list' or 'range') */
-   List   *partParams; /* List of PartitionElems */
-   int location;   /* token
location, or -1 if unknown */
+   char   *strategy;   /* partitioning strategy ('list' or 'range') */
+   List   *partParams; /* List of PartitionElems */
+   int location;   /* token location, or
-1 if unknown */

I think this is some kind of mistake on your end while generating the
patch.  It looks like you patched one version of the source code, and
diffed against another.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Should we cacheline align PGXACT?

2017-02-15 Thread Andres Freund
Hi,

On 2017-02-15 12:24:44 -0500, Robert Haas wrote:
> If you pad PGXACT out to one cache line, you could likewise take a
> snapshot by touching 1 cache line per backend, and they'd be
> consecutive.  Maybe that difference matters to the memory prefetching
> controller, I dunno,

Unfortunately it's currently *not* consecutively accessed afaik.  We're
not iterating PGXACT sequentially, we're doing it in pgprocno order.
The reason is that otherwise we have to scan unused PGXACT entries.

I've previously played around with using actual sequential access, and
it was a good bit faster as long as nearly all connections are used -
but slower when only a few are in use, which is quite common...


> but it seems funny that we did the PGXACT work to
> reduce the number of cache lines that had to be touched in order to
> take a snapshot to improve performance, and now we're talking about
> increasing it again, also to improve performance.

I don't think it's that weird that aligning things properly is important
for performance, nor that reducing the amount of memory accessed is
beneficial.  Engineering is tradeoffs, and we might just have erred a
bit too far in this case.

Greetings,

Andres Freund


Re: [HACKERS] Should we cacheline align PGXACT?

2017-02-15 Thread Andres Freund
Hi,

On 2017-02-15 11:43:17 -0500, Robert Haas wrote:
> On Mon, Feb 13, 2017 at 11:07 AM, Alvaro Herrera
>  wrote:
> > It seems to me that Andres comments here were largely ignored:
> > https://www.postgresql.org/message-id/20160822021747.u5bqx2xwwjzac...@alap3.anarazel.de
> > He was suggesting to increase the struct size to 16 bytes rather than
> > going all the way up to 128.  Did anybody test this?
> 
> So, I think that going up to 128 bytes can't really make sense.  If
> that's the best-performing solution here, then maybe what we ought to
> be doing is reverting the PGXACT/PGPROC separation, sticking these
> critical members at the beginning, and padding the whole PGXACT out to
> a multiple of the cache line size.  Something that only touches the
> PGXACT fields will touch the exact same number of cache lines with
> that design, but cases that touch more than just the PGXACT fields
> should perform better.

I don't think that's true for several reasons.  Separating out PGXACT
didn't just mean reducing the stride size of the access / preventing
sharing. It also meant that frequently changing fields in PGPROC aren't
on the same cache-line as fields in PGXACT.  That makes quite a
difference, because with the split a lot of the cachelines "backing"
PGPROC can stay in 'shared' mode in several CPU caches, while
modifications to PGPROC largely can stay in 'exclusive' mode locally on
the CPU the backend is currently running on.  I think I previously
mentioned, even just removing the MyPgXact->xmin assignment in
SnapshotResetXmin() is measurable performance wise and cache-hit ratio
wise.

While Nasby is right that going to multiple of the actual line size can
cause problems due to the N-way associativity of caches, it's not like
unused padding memory >= cacheline-size actually causes cache to be
wasted; it just gets used for other things.


FWIW, I however think we should just set the cacheline size based on the
architecture, rather than defaulting to 128 - it's not like that
actually changes particularly frequently.  I doubt a special-case for
each of x86, arm, power, itanium, and the rest will cause us a lot of
maintenance issues.


> The point of moving PGXACT into a separate
> structure was to reduce the number of cache lines required to build a
> snapshot.  If that's going to go back up to 1 per PGPROC anyway, we
> might as well at least use the rest of that cache line to store
> something else that might be useful instead of pad bytes.  I think,
> anyway.

I don't think that's true, due to the aforementioned dirtyness issue,
which causes inter-node cache coherency traffic (busier bus & latency).

> I think our now-standard way of doing this is to have a "Padded"
> version that is a union between the unpadded struct and char[].  c.f.
> BufferDescPadded, LWLockPadded.

That's what my original POC patch did IIRC, but somebody complained it
changed a lot of accesses.

- Andres


Re: [HACKERS] [PROPOSAL] Temporal query processing with range types

2017-02-15 Thread Robert Haas
On Tue, Jan 24, 2017 at 4:32 AM, Peter Moser  wrote:
> NORMALIZE: splits all the ranges of one relation according to all the range
> boundaries of another (but possibly the same) relation whenever some equality
> condition over some given attributes is satisfied.
>
> When the two relations are the same, all ranges with the given equal 
> attributes
> are either equal or disjoint. After this, the traditional GROUP BY or DISTINCT
> can be applied. The attributes given to NORMALIZE for a (temporal) GROUP BY 
> are
> the grouping attributes and for a (temporal) DISTINCT the target list
> attributes.
>
> When the two relations are different, but they each contain disjoint ranges
> for the same attributes (as the current limitation for the set operations is)
> we perform a symmetric NORMALIZE on each of them. Then we have a similar
> situation as before, i.e., in both relations ranges with the same attributes
> are either equal or disjoint and a traditional set operation
> (EXCEPT/INTERSECT/UNION) can be applied. The attributes given to NORMALIZE for
> a (temporal) EXCEPT/INTERSECT/UNION are the target list attributes.
>
>
> ALIGN: splits all the ranges of one relation according to all the range
> intersections of another relation, i.e., it produces all intersections and
> non-overlapping parts, whenever some condition is satisfied.
>
> We perform a symmetric ALIGN on each relation, after which a traditional inner
> or outer join can be applied using equality on the ranges to calculate the
> overlap. The condition given to a (temporal) inner or outer join is the
> join condition without overlap.

I don't quite understand the difference between NORMALIZE and ALIGN.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel Index Scans

2017-02-15 Thread Robert Haas
On Wed, Feb 15, 2017 at 1:39 PM, Robert Haas  wrote:
> On Wed, Feb 15, 2017 at 7:11 AM, Amit Kapila  wrote:
>> support related patch.  In any case, to avoid confusion I am attaching
>> all the three patches with this e-mail.
>
> Committed guc_parallel_index_scan_v1.patch, with changes to the
> documentation and GUC descriptions.

And committed parallel_index_opt_exec_support_v10.patch as well, with
a couple of minor tweaks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel Index Scans

2017-02-15 Thread Robert Haas
On Wed, Feb 15, 2017 at 7:11 AM, Amit Kapila  wrote:
> support related patch.  In any case, to avoid confusion I am attaching
> all the three patches with this e-mail.

Committed guc_parallel_index_scan_v1.patch, with changes to the
documentation and GUC descriptions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Does having a NULL column automatically exclude the table from the tupleDesc cache?

2017-02-15 Thread Alvaro Herrera
Ryan Murphy wrote:

> My question kind of remains though - does that mean that having any nulls
> in the tuple loses the ability to use the tupleDesc cache? and how much of
> a performance impact is this?  I'm sure there's a good reason why you can't
> really use the cache if you have a null column, just was curious of the
> implications.

attcacheoff can only be set positive for fields preceding any varlena
(typlen<0, but including the first such) or nullable values.  I don't
know how much faster it is with the cache; you can measure it if your
curiosity is strong enough -- just set the first column to nullable.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Reporting xmin from VACUUMs

2017-02-15 Thread Alvaro Herrera
Jim Nasby wrote:

> ISTR previous discussion of allowing more stats files; if that happened
> I think having stats that were dedicated to (auto)vacuum would be very
> useful. That's clearly a lot more work though.

What?

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Should we cacheline align PGXACT?

2017-02-15 Thread Robert Haas
On Wed, Feb 15, 2017 at 12:48 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> ... Maybe that difference matters to the memory prefetching
>> controller, I dunno, but it seems funny that we did the PGXACT work to
>> reduce the number of cache lines that had to be touched in order to
>> take a snapshot to improve performance, and now we're talking about
>> increasing it again, also to improve performance.
>
> Yes.  I was skeptical that the original change was adequately proven
> to be a good idea, and I'm even more skeptical this time.  I think
> every single number that's been reported about this is completely
> machine-specific, and likely workload-specific too, and should not
> be taken as a reason to do anything.

The original change definitely worked on read-only pgbench workloads
on both x86 and Itanium, and the gains were pretty significant at
higher client counts.  I don't know whether we tested POWER.
Read-only pgbench throughput is not the world, of course, but it's a
reasonable proxy for high-concurrency, read-mostly workloads involving
short transactions, so it's not nothing, either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Help text for pg_basebackup -R

2017-02-15 Thread Alvaro Herrera
Magnus Hagander wrote:

>   printf(_("  -R, --write-recovery-conf\n"
> -  " write recovery.conf after 
> backup\n"));
> +  " write recovery.conf for 
> replication\n"));
>   printf(_("  -S, --slot=SLOTNAMEreplication slot to use\n"));

LGTM.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] IF NOT EXISTS option for CREATE SERVER and CREATE USER MAPPING statements

2017-02-15 Thread Anastasia Lubennikova

13.02.2017 19:34, Andrew Dunstan:
> On 01/13/2017 08:36 AM, Anastasia Lubennikova wrote:
>> I implemented IF NOT EXISTS option for CREATE SERVER and CREATE USER
>> MAPPING statements for one of our customers.
>> I think other users can also find it useful for scripting and
>> automated tasks.
>> The patches themselves are small and simple. Documentation is updated
>> as well.
>
> This looks good and useful. Please add some regression tests.

Done.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/ref/create_server.sgml b/doc/src/sgml/ref/create_server.sgml
index 734c6c9..6777679 100644
--- a/doc/src/sgml/ref/create_server.sgml
+++ b/doc/src/sgml/ref/create_server.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-CREATE SERVER server_name [ TYPE 'server_type' ] [ VERSION 'server_version' ]
+CREATE SERVER [IF NOT EXISTS] server_name [ TYPE 'server_type' ] [ VERSION 'server_version' ]
 FOREIGN DATA WRAPPER fdw_name
 [ OPTIONS ( option 'value' [, ... ] ) ]
 
@@ -56,6 +56,19 @@ CREATE SERVER server_name [ TYPE '<
   Parameters
 
   
+
+  
+IF NOT EXISTS
+
+ 
+  Do not throw an error if a server with the same name already exists.
+  A notice is issued in this case.  Note that there is no guarantee that
+  the existing server is anything like the one that would have been
+  created.
+ 
+
+   
+

 server_name
 
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index 6ff8b69..b4ae5de 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -883,12 +883,25 @@ CreateForeignServer(CreateForeignServerStmt *stmt)
 
 	/*
 	 * Check that there is no other foreign server by this name.
+	 * Do nothing if IF NOT EXISTS was enforced.
 	 */
 	if (GetForeignServerByName(stmt->servername, true) != NULL)
-		ereport(ERROR,
-(errcode(ERRCODE_DUPLICATE_OBJECT),
- errmsg("server \"%s\" already exists",
-		stmt->servername)));
+	{
+		if (stmt->if_not_exists)
+		{
+			ereport(NOTICE,
+	(errcode(ERRCODE_DUPLICATE_OBJECT),
+	 errmsg("foreign server \"%s\" already exists, skipping",
+			stmt->servername)));
+			heap_close(rel, RowExclusiveLock);
+			return InvalidObjectAddress;
+		}
+		else
+			ereport(ERROR,
+	(errcode(ERRCODE_DUPLICATE_OBJECT),
+	 errmsg("foreign server \"%s\" already exists",
+			stmt->servername)));
+	}
 
 	/*
 	 * Check that the FDW exists and that we have USAGE on it. Also get the
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index a4edea0..d007468 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -4655,6 +4655,19 @@ CreateForeignServerStmt: CREATE SERVER name opt_type opt_foreign_server_version
 	n->version = $5;
 	n->fdwname = $9;
 	n->options = $10;
+	n->if_not_exists = false;
+	$$ = (Node *) n;
+}
+| CREATE SERVER IF_P NOT EXISTS name opt_type opt_foreign_server_version
+		 FOREIGN DATA_P WRAPPER name create_generic_options
+{
+	CreateForeignServerStmt *n = makeNode(CreateForeignServerStmt);
+	n->servername = $6;
+	n->servertype = $7;
+	n->version = $8;
+	n->fdwname = $12;
+	n->options = $13;
+	n->if_not_exists = true;
 	$$ = (Node *) n;
 }
 		;
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 07a8436..704bc6b 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2113,6 +2113,7 @@ typedef struct CreateForeignServerStmt
 	char	   *servertype;		/* optional server type */
 	char	   *version;		/* optional server version */
 	char	   *fdwname;		/* FDW name */
+	bool		if_not_exists;	/* just do nothing if it already exists? */
 	List	   *options;		/* generic options to server */
 } CreateForeignServerStmt;
 
diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index 3a9fb8f..17f9f40 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -283,7 +283,9 @@ ERROR:  foreign-data wrapper "foo" does not exist
 CREATE FOREIGN DATA WRAPPER foo OPTIONS ("test wrapper" 'true');
 CREATE SERVER s1 FOREIGN DATA WRAPPER foo;
 CREATE SERVER s1 FOREIGN DATA WRAPPER foo;  -- ERROR
-ERROR:  server "s1" already exists
+ERROR:  foreign server "s1" already exists
+CREATE SERVER IF NOT EXISTS s1 FOREIGN DATA WRAPPER foo;	-- No ERROR, just NOTICE
+NOTICE:  foreign server "s1" already exists, skipping
 CREATE SERVER s2 FOREIGN DATA WRAPPER foo OPTIONS (host 'a', dbname 'b');
 CREATE SERVER s3 TYPE 'oracle' FOREIGN DATA WRAPPER foo;
 CREATE SERVER s4 TYPE 'oracle' FOREIGN DATA WRAPPER foo OPTIONS (host 'a', dbname 'b');
diff --git a/src/test/regress/sql/foreign_data.sql b/src/test/regress/sql/foreign_data.sql
index 38e1d41..a1fc10f 100644
--- a/src/test/regress/sql/foreign_data.sql
+++ 

Re: [HACKERS] Should we cacheline align PGXACT?

2017-02-15 Thread Alvaro Herrera
Alexander Korotkov wrote:

> Difference between master, pgxact-align-2 and pgxact-align-3 doesn't exceed
> per run variation.

FWIW this would be more visible if you added error bars to each data
point.  Should be simple enough in gnuplot ...

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Should we cacheline align PGXACT?

2017-02-15 Thread Tom Lane
Robert Haas  writes:
> ... Maybe that difference matters to the memory prefetching
> controller, I dunno, but it seems funny that we did the PGXACT work to
> reduce the number of cache lines that had to be touched in order to
> take a snapshot to improve performance, and now we're talking about
> increasing it again, also to improve performance.

Yes.  I was skeptical that the original change was adequately proven
to be a good idea, and I'm even more skeptical this time.  I think
every single number that's been reported about this is completely
machine-specific, and likely workload-specific too, and should not
be taken as a reason to do anything.

My druthers at this point would be to revert the separation on code
cleanliness grounds and call it a day, more or less independently of any
claims about performance.  I'd be willing to talk about padding PGPROC
to some reasonable stride, but I remain dubious that any changes of
that sort would have a half-life worth complicating the code for.

regards, tom lane


Re: [HACKERS] Sum aggregate calculation for single precision real

2017-02-15 Thread Tom Lane
Robert Haas  writes:
> I think that's more or less the argument Konstantin is in fact making.
> Whether it's a good argument or a thin one is a value judgement.
> Personally, I find it somewhere in the middle: I think the way it
> works now is reasonable, and I think what he wants would have been
> reasonable as well.  However, I find it hard to believe it would be
> worth changing now on backward compatibility grounds.

I agree with that conclusion.  It might have been reasonable to change
it fifteen years ago when we changed the integer sum() implementations,
but I think doing so now would make more people unhappy than it would
make happy; or at least, there's little reason to think the reverse
is true.

regards, tom lane


Re: [HACKERS] bytea_output vs make installcheck

2017-02-15 Thread Jeff Janes
On Tue, Feb 14, 2017 at 9:17 PM, Andres Freund  wrote:

>
>
> On February 14, 2017 9:02:14 PM PST, neha khatri 
> wrote:
> >On Wed, Feb 15, 2017 at 10:04 AM, neha khatri 
> > wrote:.
> >>
> >>
> >>> Attached are two options for doing that.  One overrides bytea_output
> >>> locally where-ever needed, and the other overrides it for the entire
> >>> 'regression' database.
> >>>
> >>
> >> The solution that overrides bytea_output locally looks appropriate.
> >It may
> >> not be required to change the format for entire database.
> >> Had there been a way to convert  bytea_output from 'hex' to 'escape'
> >> internally, that could have simplified this customization even more.
> >>
> >
> >Well, the conversion from 'hex' to 'escape' is available using the
> >function
> >encode().
> >So the queries that are failing due to the setting bytea_output =
> >escape,
> >can be wrapped under encode(), to obtain the result in 'escape' format.
> >Here is another way to resolve the same problem. The patch is attached.
>
> I don't quite see the point of this - there's a lot of settings that cause
> spurious test failures. I don't see any point fixing random cases of that.
> And I don't think the continual cost of doing so overall is worth the
> minimal gain.
>
> What's your reason to get this fixed?
>

More testing is better than less testing, and a good way to get less
testing is requiring the tester to memorize a list of false positives they
might encounter.  I'd like to systematically clone my production system,
run it through pg_upgrade, and then do installcheck (plus my own
app-specific) on the result, and I'd like others to do that as well with
their own production systems.

I don't really see the cost here.  It took me longer to satisfy myself that
this was not actually a real bug than it did to write the patch.  Now much
of that time was just because there were 3 other problems as well, which
makes isolating and evaluating them exponentially harder--which itself is a
good reason for fixing the ones that are easy to fix, so we don't have to
get distracted by those while investigating the other ones.  And if we go
with the option 2 patch, it is one line which seems pretty self-documenting
and easy to maintain. I'd rather light a candle than to curse the darkness,
at least when the candle is this easy to light.

Cheers,

Jeff


Re: [HACKERS] Proposal: GetOldestXminExtend for ignoring arbitrary vacuum flags

2017-02-15 Thread Robert Haas
On Tue, Feb 14, 2017 at 1:38 PM, Jim Nasby  wrote:
> On 2/14/17 3:13 AM, Seki, Eiji wrote:
>>   +extern TransactionId GetOldestXmin(Relation rel, uint8 ignoreFlags);
>
>
> My impression is that most other places that do this sort of thing just call
> the argument 'flags', so as not to "lock in" a single idea of what the flags
> are for. I can't readily think of another use for flags in GetOldestXmin,
> but ISTM it's better to just go with "flags" instead of "ignoreFlags".

I agree; also, many years ago a guy named Tom Lane told me that a flags
argument should typically be declared as type "int".  I've followed
that advice ever since.
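
Taken together, the suggested declaration would be something like this
(a hypothetical rendering of the advice above, not an agreed interface):

    /* "flags" rather than "ignoreFlags", and int rather than uint8 */
    extern TransactionId GetOldestXmin(Relation rel, int flags);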

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Does having a NULL column automatically exclude the table from the tupleDesc cache?

2017-02-15 Thread Ryan Murphy
> No, that tests whether the particular tuple contains any null fields, not
> whether the whole table is declared NOT NULL.
>
> regards, tom lane
>

Ok, thanks, that makes sense.

My question kind of remains though - does that mean that having any nulls
in the tuple loses the ability to use the tupleDesc cache? and how much of
a performance impact is this?  I'm sure there's a good reason why you can't
really use the cache if you have a null column, just was curious of the
implications.  Thanks again!


Re: [HACKERS] Should we cacheline align PGXACT?

2017-02-15 Thread Robert Haas
On Wed, Feb 15, 2017 at 12:14 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Mon, Feb 13, 2017 at 11:07 AM, Alvaro Herrera
>>  wrote:
>>> It seems to me that Andres comments here were largely ignored:
>>> https://www.postgresql.org/message-id/20160822021747.u5bqx2xwwjzac...@alap3.anarazel.de
>>> He was suggesting to increase the struct size to 16 bytes rather than
>>> going all the way up to 128.  Did anybody test this?
>
>> So, I think that going up to 128 bytes can't really make sense.  If
>> that's the best-performing solution here, then maybe what we ought to
>> be doing is reverting the PGXACT/PGPROC separation, sticking these
>> critical members at the beginning, and padding the whole PGXACT out to
>> a multiple of the cache line size.
>
> Yes.  That separation was never more than a horribly ugly kluge.
> I would love to see it go away.  But keeping it *and* padding
> PGXACT to something >= the size of PGPROC borders on insanity.

I don't think it would be bigger than a PGPROC.  PGPROCs are really
big, 816 bytes on my MacBook Pro.  But if you did what I suggested,
you could take a snapshot by touching 1 cache line per backend.  They
wouldn't be consecutive; it would be an upward pattern, with skips.
If you pad PGXACT out to one cache line, you could likewise take a
snapshot by touching 1 cache line per backend, and they'd be
consecutive.  Maybe that difference matters to the memory prefetching
controller, I dunno, but it seems funny that we did the PGXACT work to
reduce the number of cache lines that had to be touched in order to
take a snapshot to improve performance, and now we're talking about
increasing it again, also to improve performance.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Does having a NULL column automatically exclude the table from the tupleDesc cache?

2017-02-15 Thread Tom Lane
Ryan Murphy  writes:
> My question is this:  HeapTupleNoNulls() is run first to see if there are
> any columns that can be NULL.  It looks like the fetchatt() that uses the
> cache in the tupleDesc can only be used if there are no NULLable columns in
> the table.  Is my understanding correct?

No, that tests whether the particular tuple contains any null fields, not
whether the whole table is declared NOT NULL.

regards, tom lane


Re: [HACKERS] [PATCH] Fix pg_proc comment grammar

2017-02-15 Thread Robert Haas
On Wed, Feb 15, 2017 at 10:40 AM, David Christensen  wrote:
> Fixes some DESCR() grammar mistakes introduced by the xlog -> wal changes.

Thanks, David!

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Should we cacheline align PGXACT?

2017-02-15 Thread Tom Lane
Robert Haas  writes:
> On Mon, Feb 13, 2017 at 11:07 AM, Alvaro Herrera
>  wrote:
>> It seems to me that Andres comments here were largely ignored:
>> https://www.postgresql.org/message-id/20160822021747.u5bqx2xwwjzac...@alap3.anarazel.de
>> He was suggesting to increase the struct size to 16 bytes rather than
>> going all the way up to 128.  Did anybody test this?

> So, I think that going up to 128 bytes can't really make sense.  If
> that's the best-performing solution here, then maybe what we ought to
> be doing is reverting the PGXACT/PGPROC separation, sticking these
> critical members at the beginning, and padding the whole PGXACT out to
> a multiple of the cache line size.

Yes.  That separation was never more than a horribly ugly kluge.
I would love to see it go away.  But keeping it *and* padding
PGXACT to something >= the size of PGPROC borders on insanity.

regards, tom lane


Re: [HACKERS] Documentation improvements for partitioning

2017-02-15 Thread Tom Lane
Alvaro Herrera  writes:
> Robert Haas wrote:
>> True.  I think the question here is: do we want to view the dependency
>> between a partitioned table and a partition of that table as
>> DEPENDENCY_NORMAL or as DEPENDENCY_AUTO?  With table inheritance, it's
>> always been "normal" and I'm not sure there's any good reason for
>> partitioning to make the opposite decision.

> I think new-style partitioning is supposed to consider each partition as
> an implementation detail of the table; the fact that you can manipulate
> partitions separately does not really mean that they are their own
> independent object.  You don't stop to think "do I really want to drop
> the TOAST table attached to this main table?" and attach a CASCADE
> clause if so.  You just drop the main table, and the toast one is
> dropped automatically.  I think new-style partitions should behave
> equivalently.

I agree with Alvaro's position.  If you need CASCADE to get rid of the
individual partitions, that's going to be a serious usability fail.
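
For reference, the AUTO behavior would presumably just mean recording the
partition's dependency on its parent along these lines; partRelId and
parentRelId are placeholders, and the real call site would be wherever
the partition gets attached:

    ObjectAddress partition;
    ObjectAddress parent;

    ObjectAddressSet(partition, RelationRelationId, partRelId);
    ObjectAddressSet(parent, RelationRelationId, parentRelId);

    /* AUTO: dropping the parent silently drops the partition too */
    recordDependencyOn(&partition, &parent, DEPENDENCY_AUTO);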

regards, tom lane


Re: [HACKERS] error handling in RegisterBackgroundWorker

2017-02-15 Thread Robert Haas
On Wed, Feb 15, 2017 at 11:30 AM, Peter Eisentraut
 wrote:
> If RegisterBackgroundWorker() (the non-dynamic kind that is only
> loadable from shared_preload_libraries) fails to register the worker, it
> writes a log message and proceeds, ignoring the registration request.  I
> think that is a mistake, it should be a hard error.  The only way in
> practice to fix the problem is to change shared_preload_libraries or
> max_worker_processes, both requiring a restart anyway, so proceeding
> without the worker is not useful.

I guess the question is whether people will prefer to have the
database start up and be missing the worker, or to have it not start.
As you point out, the former is likely to result in an eventual
restart, but the latter may lead to a longer period of downtime RIGHT
NOW.  People tend to really hate things that make the database not
start, so I'm not sure what's best here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Documentation improvements for partitioning

2017-02-15 Thread Robert Haas
On Wed, Feb 15, 2017 at 11:34 AM, Alvaro Herrera
 wrote:
> I think new-style partitioning is supposed to consider each partition as
> an implementation detail of the table; the fact that you can manipulate
> partitions separately does not really mean that they are their own
> independent object.  You don't stop to think "do I really want to drop
> the TOAST table attached to this main table?" and attach a CASCADE
> clause if so.  You just drop the main table, and the toast one is
> dropped automatically.  I think new-style partitions should behave
> equivalently.

That's a reasonable point of view.  I'd like to get some more opinions
on this topic.  I'm happy to have us do whatever most people want, but
I'm worried that having table inheritance and table partitioning work
differently will create confusion.  I'm also suspicious that there
may be some implementation difficulties.  On the other hand, it does seem a
little silly to say that DROP TABLE partitioned_table should always
fail except in the degenerate case where there are no partitions, so
maybe changing it is for the best.

> Now that partitions are declarative, the underlying implementation could
> change away from inheritance.  It's now just an implementation artifact.

I don't really agree with that.  It's true that, for example, we could
decide to store the inheritance information for partitions someplace
other than pg_inherits, but from a user perspective these are always
going to be closely-related features.  Also, there are quite a number
of ways in which it's very noticeable that the objects are separate.
They are dumped separately.  They have separate indexes, and even if
we provide some facility to create a common indexing scheme across all
partitions automatically, you'll still be able to REINDEX or CLUSTER
one of those tables individually.  They can have different storage
properties, like one can be UNLOGGED while another is not.  They show
up in EXPLAIN plans.  The partitioning structure affects what you can
and can't do with foreign keys.  All of those are user-visible things
that make this look and feel like a collection of tables, not just a
single table that happens to have several relfilenodes under the hood.
I think that's actually a really important feature, not a design
defect.

As you may or may not know, EDB has had a proprietary implementation
of table partitioning since Advanced Server 9.1, and one of the things
we've learned from that implementation is that users really like to be
able to fiddle with the individual partitions.  They want to do things
like make them individually unlogged, rename them, vacuum them, add
constraints, add triggers, put them in different tablespaces, vary
indexing strategies, all the stuff that you normally do with
standalone tables.  Any time one of those things didn't work quite
like it would for a standalone table, we got complaints.  One of the
things that has become really clear over the five years that feature
has been out in the field is that users value the ability to do
different things with different child tables.  Now that of course does
not mean that they don't want to be able to operate on the hierarchy
as a whole; we have numerous feature requests in that direction, too.
But, at least among the base of customers that have talked to us about
the proprietary partitioning feature in Advanced Server, wanting to
treat a partition as a table and do some arbitrary thing to it that
can be done to a table is extremely common.  Of course, I can't
promise (and am not trying to argue) that the reaction among
PostgreSQL users generally will necessarily be similar to the
experience we've had with our Advanced Server customers, but this
experience has definitely caused me to lean in the direction of
wanting partitions to be first-class objects.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


[HACKERS] Help text for pg_basebackup -R

2017-02-15 Thread Magnus Hagander
The current help text for pg_basebackup -R is "write recovery.conf after
backup".

This says nothing about what it actually does. I've had a number of people
ask me now why that's not default "because you need a recovery.conf to
restore from backup". The point being that it doesn't say anything about
the fact that it writes the file *for replication*. The help page does, but
not the message.

I propose a new message per the attached patch.

Comments?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 0c82bc9..9020fb1 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -337,7 +337,7 @@ usage(void)
 	printf(_("  -r, --max-rate=RATEmaximum transfer rate to transfer data directory\n"
 	  " (in kB/s, or use suffix \"k\" or \"M\")\n"));
 	printf(_("  -R, --write-recovery-conf\n"
-			 " write recovery.conf after backup\n"));
+			 " write recovery.conf for replication\n"));
 	printf(_("  -S, --slot=SLOTNAMEreplication slot to use\n"));
 	printf(_("  --no-slot  prevent creation of temporary replication slot\n"));
 	printf(_("  -T, --tablespace-mapping=OLDDIR=NEWDIR\n"

Re: [HACKERS] Should we cacheline align PGXACT?

2017-02-15 Thread Robert Haas
On Mon, Feb 13, 2017 at 11:07 AM, Alvaro Herrera
 wrote:
> It seems to me that Andres comments here were largely ignored:
> https://www.postgresql.org/message-id/20160822021747.u5bqx2xwwjzac...@alap3.anarazel.de
> He was suggesting to increase the struct size to 16 bytes rather than
> going all the way up to 128.  Did anybody test this?

So, I think that going up to 128 bytes can't really make sense.  If
that's the best-performing solution here, then maybe what we ought to
be doing is reverting the PGXACT/PGPROC separation, sticking these
critical members at the beginning, and padding the whole PGXACT out to
a multiple of the cache line size.  Something that only touches the
PGXACT fields will touch the exact same number of cache lines with
that design, but cases that touch more than just the PGXACT fields
should perform better.  The point of moving PGXACT into a separate
structure was to reduce the number of cache lines required to build a
snapshot.  If that's going to go back up to 1 per PGPROC anyway, we
might as well at least use the rest of that cache line to store
something else that might be useful instead of pad bytes.  I think,
anyway.

If padding to 16 bytes is sufficient to get the performance
improvement (or as much as we care about), that seems more reasonable.
But it's still possible we may be gaining on one hand (because now
you'll never touch multiple cache lines for your own PGXACT) and
losing on the other (because now you're scanning 4/3 as many cache
lines to construct a snapshot).  Which of those is better might be
workload-dependent or hardware-dependent.

> Re the coding of the padding computation, seems it'd be better to use
> our standard "offsetof(last-struct-member) + sizeof(last-struct-member)"
> rather than adding each of the members' sizes individually.

I think our now-standard way of doing this is to have a "Padded"
version that is a union between the unpadded struct and char[].  c.f.
BufferDescPadded, LWLockPadded.
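
That is, something along these lines (a sketch modeled on LWLockPadded;
the 16-byte figure is Andres's suggestion from upthread, not a settled
choice):

    #define PGXACT_PADDED_SIZE  16

    typedef union PGXACTPadded
    {
        PGXACT      pgxact;
        char        pad[PGXACT_PADDED_SIZE];
    } PGXACTPadded;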

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


[HACKERS] Does having a NULL column automatically exclude the table from the tupleDesc cache?

2017-02-15 Thread Ryan Murphy
Hi all,

I was looking through some of the implementation details of the
heap/tuples, specifically src/include/access/htup_details.h - and I came
across the big macro fastgetattr, and had a question about it.  I've
included the code here for clarity and convenience:

#define fastgetattr(tup, attnum, tupleDesc, isnull) \
( \
    AssertMacro((attnum) > 0), \
    (*(isnull) = false), \
    HeapTupleNoNulls(tup) ? \
    ( \
        (tupleDesc)->attrs[(attnum)-1]->attcacheoff >= 0 ? \
        ( \
            fetchatt((tupleDesc)->attrs[(attnum)-1], \
                     (char *) (tup)->t_data + (tup)->t_data->t_hoff + \
                     (tupleDesc)->attrs[(attnum)-1]->attcacheoff) \
        ) \
        : \
            nocachegetattr((tup), (attnum), (tupleDesc)) \
    ) \
    : \
    ( \
        att_isnull((attnum)-1, (tup)->t_data->t_bits) ? \
        ( \
            (*(isnull) = true), \
            (Datum)NULL \
        ) \
        : \
        ( \
            nocachegetattr((tup), (attnum), (tupleDesc)) \
        ) \
    ) \
)

My question is this:  HeapTupleNoNulls() is run first to see if there are
any columns that can be NULL.  It looks like the fetchatt() that uses the
cache in the tupleDesc can only be used if there are no NULLable columns in
the table.  Is my understanding correct?  Does this mean that there is
significant performance gain by never allowing any column to be null in
your table?

Thanks!
Ryan


Re: [HACKERS] Documentation improvements for partitioning

2017-02-15 Thread Alvaro Herrera
Robert Haas wrote:
> On Wed, Feb 15, 2017 at 9:37 AM, Alvaro Herrera
>  wrote:
> > Joshua D. Drake wrote:

> >> Because partitions may have data.
> >
> > So would the table, were it not partitioned.
> 
> True.  I think the question here is: do we want to view the dependency
> between a partitioned table and a partition of that table as
> DEPENDENCY_NORMAL or as DEPENDENCY_AUTO?  With table inheritance, it's
> always been "normal" and I'm not sure there's any good reason for
> partitioning to make the opposite decision.

I think new-style partitioning is supposed to consider each partition as
an implementation detail of the table; the fact that you can manipulate
partitions separately does not really mean that they are their own
independent object.  You don't stop to think "do I really want to drop
the TOAST table attached to this main table?" and attach a CASCADE
clause if so.  You just drop the main table, and the toast one is
dropped automatically.  I think new-style partitions should behave
equivalently.

You can make the partition an independent entity, but if you don't
explicitly take that step beforehand, I don't see why we should see it
that way implicitly.

> The new partitioning
> implementation provides a user experience that is overall smoother
> than doing the same thing with inheritance, but it's not as if you can
> ignore the fact that your partitioned tables have sub-objects that are
> also tables.

Now that partitions are declarative, the underlying implementation could
change away from inheritance.  It's now just an implementation artifact.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


[HACKERS] error handling in RegisterBackgroundWorker

2017-02-15 Thread Peter Eisentraut
If RegisterBackgroundWorker() (the non-dynamic kind that is only
loadable from shared_preload_libraries) fails to register the worker, it
writes a log message and proceeds, ignoring the registration request.  I
think that is a mistake, it should be a hard error.  The only way in
practice to fix the problem is to change shared_preload_libraries or
max_worker_processes, both requiring a restart anyway, so proceeding
without the worker is not useful.

Perhaps this kind of worker has not been widely used in practice, but we
now have the logical replication launcher registering that way, and the
auto-prewarm patch also proposes to add one.  If you run out of worker
slots before the launcher is supposed to start, it just logs a message
and doesn't start.  That seems prone to confuse.

Attached is a proposed patch (0002) to change the log level to ERROR.
The other patch (0001) gives some additional error context for the
log/error message that you get.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 38c89ce84e5e4eadf88f432ee0c416121a5fc33b Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 14 Feb 2017 10:39:06 -0500
Subject: [PATCH 1/3] Add errcontext to background worker registration

---
 src/backend/postmaster/bgworker.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index cd99b0b392..db25a7f68b 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -824,7 +824,8 @@ RegisterBackgroundWorker(BackgroundWorker *worker)
   "Up to %d background workers can be registered with the current settings.",
   max_worker_processes,
   max_worker_processes),
- errhint("Consider increasing the configuration parameter \"max_worker_processes\".")));
+ errhint("Consider increasing the configuration parameter \"max_worker_processes\"."),
+ errcontext("registration of background worker \"%s\"", worker->bgw_name)));
 		return;
 	}
 
-- 
2.11.1

From 6d2f6ec75f160dce622d77ac55bc50858c337527 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 14 Feb 2017 10:39:51 -0500
Subject: [PATCH 2/3] Change failures in RegisterBackgroundWorker() to hard
 errors

Previously, just a log message was written and the background worker
registration was ignored.
---
 src/backend/postmaster/bgworker.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index db25a7f68b..a42bd0f758 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -790,19 +790,19 @@ RegisterBackgroundWorker(BackgroundWorker *worker)
 	if (!process_shared_preload_libraries_in_progress && !internal)
 	{
 		if (!IsUnderPostmaster)
-			ereport(LOG,
+			ereport(ERROR,
 	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 	 errmsg("background worker \"%s\": must be registered in shared_preload_libraries",
 			worker->bgw_name)));
 		return;
 	}
 
-	if (!SanityCheckBackgroundWorker(worker, LOG))
+	if (!SanityCheckBackgroundWorker(worker, ERROR))
 		return;
 
 	if (worker->bgw_notify_pid != 0)
 	{
-		ereport(LOG,
+		ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  errmsg("background worker \"%s\": only dynamic background workers can request notification",
 		worker->bgw_name)));
@@ -817,7 +817,7 @@ RegisterBackgroundWorker(BackgroundWorker *worker)
 	 */
 	if (++numworkers > max_worker_processes)
 	{
-		ereport(LOG,
+		ereport(ERROR,
 (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
  errmsg("too many background workers"),
  errdetail_plural("Up to %d background worker can be registered with the current settings.",
@@ -835,7 +835,7 @@ RegisterBackgroundWorker(BackgroundWorker *worker)
 	rw = malloc(sizeof(RegisteredBgWorker));
 	if (rw == NULL)
 	{
-		ereport(LOG,
+		ereport(ERROR,
 (errcode(ERRCODE_OUT_OF_MEMORY),
  errmsg("out of memory")));
 		return;
-- 
2.11.1


Re: [HACKERS] Parallel bitmap heap scan

2017-02-15 Thread Robert Haas
On Wed, Feb 15, 2017 at 7:17 AM, Dilip Kumar  wrote:
> This is easily doable and it will look much cleaner. While doing this
> I am facing one problem related to
> relptr_store and relptr_access. The problem is that when I call
> "relptr_store  (base, rp, val)" with both base and val as a same
> address then it will store offset as 0 which is fine, but when we use
> relptr_access with 0 offset it is returning NULL instead of giving the
> actual address. I expect it should return directly base when offset is
> 0. (I am facing this problem because in this case (TBM_ONE_PAGE),
> there will be only one page and address of that will be same as
> base.).
>
> There can be multiple solutions to this but none of them looks cleaner to me.
> sol1: If relptr_access return NULL then directly use the base address
> as our PagetableEntry.
> sol2: Instead of using base as start of the element array, just try to
> use some modified base, e.g. base = base - some number.
> sol3: change relptr_access to not return NULL in this case. But, this
> will change the current behaviour of this interface and need to
> analyze the side effects.

Hmm, yeah, that's a problem.  How about not using relative pointers
here, and instead just using array indexes?
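
The NULL-for-zero behavior falls straight out of how relptr_access is
defined; stripped of its type-checking machinery it is roughly:

    /*
     * Offset 0 is reserved as the "null" relative pointer, so an entry
     * that lives exactly at "base" cannot be represented.
     */
    #define relptr_access(base, rp) \
        ((rp).relptr_off == 0 ? NULL : (void *) ((base) + (rp).relptr_off))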

>> I don't see why you think you need to add sizeof(dsa_pointer) to the
>> allocation and store an extra copy of the dsa_pointer in the
>> additional space.   You are already storing it in tbm->dsa_data and
>> that seems good enough.  pagetable_free() needs the value, but it has
>> a pointer to the TIDBitmap and could just pass tbm->dsa_data directly
>> instead of fishing the pointer out of the extra space.
>
> If you see the code of SH_GROW, first it needs to allocate the bigger
> chunk copy data from smaller chunk to the bigger chunk and then free
> the smaller chunk.

Oh, I see.

> So by the time it call the pagetable_free, it would have already
> called the pagetable_allocate for the newer bigger chunk of memory so
> now, tbm->dsa_data points to the latest memory, but pagetable_free
> wants to free older one.
>
> One solution to this could be that I keep two dsa_pointer in TBM, one
> point to old memory and another to new. (After this here we will get
> the same problem of relptr_access because now we will have first entry
> pointer is same as base pointer)

Yes, two dsa_pointers seems fine.  Maybe that's not totally beautiful,
but it seems better than what you had before.
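
That is, the TIDBitmap would carry something like the following; the
field names here are invented for illustration:

    dsa_pointer dsapagetable;       /* DSA chunk backing the page table */
    dsa_pointer dsapagetableold;    /* prior chunk, freed only after
                                     * SH_GROW finishes copying into the
                                     * new one */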

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] CONNECTION LIMIT and Parallel Query don't play well together

2017-02-15 Thread Robert Haas
On Wed, Feb 15, 2017 at 11:19 AM, Peter Eisentraut
 wrote:
> On 1/11/17 5:51 PM, David Rowley wrote:
>> Now, since background workers
>> don't consume anything from max_connections, then I don't really feel
>> that a background worker should count towards "CONNECTION LIMIT". I'd
>> assume any CONNECTION LIMITs that are set for a user would be
>> calculated based on what max_connections is set to. If we want to
>> limit background workers in the same manner, then perhaps we'd want to
>> invent something like "WORKER LIMIT N" in CREATE USER.
>
> This explanation makes sense, but it kind of upset my background
> sessions patch, which would previously have been limited by per-user
> connection settings.
>
> So I would like to have a background worker limit per user, as you
> allude to.  Attached is a patch that implements a GUC setting
> max_worker_processes_per_user.
>
> Besides the uses for background sessions, it can also be useful for
> parallel workers, logical replication apply workers, or things like
> third-party partitioning extensions.
>
> Thoughts?

This isn't going to deliver consistent results if it's set differently
in different sessions, although maybe you could weasel around that by
wording the documentation in just the right way.  It seems OK
otherwise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CONNECTION LIMIT and Parallel Query don't play well together

2017-02-15 Thread Peter Eisentraut
On 1/11/17 5:51 PM, David Rowley wrote:
> Now, since background workers
> don't consume anything from max_connections, then I don't really feel
> that a background worker should count towards "CONNECTION LIMIT". I'd
> assume any CONNECTION LIMITs that are set for a user would be
> calculated based on what max_connections is set to. If we want to
> limit background workers in the same manner, then perhaps we'd want to
> invent something like "WORKER LIMIT N" in CREATE USER.

This explanation makes sense, but it kind of upset my background
sessions patch, which would previously have been limited by per-user
connection settings.

So I would like to have a background worker limit per user, as you
allude to.  Attached is a patch that implements a GUC setting
max_worker_processes_per_user.

Besides the uses for background sessions, it can also be useful for
parallel workers, logical replication apply workers, or things like
third-party partitioning extensions.

Thoughts?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From eca26d5858fa0750427b15f4439d8936daff4d8b Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 14 Feb 2017 22:20:02 -0500
Subject: [PATCH] Add max_worker_processes_per_user setting

---
 doc/src/sgml/config.sgml          | 22 ++++++++++++++++++++++
 doc/src/sgml/ref/create_role.sgml |  3 ++-
 src/backend/postmaster/bgworker.c | 28 ++++++++++++++++++++++++++++
 src/backend/utils/init/globals.c  |  1 +
 src/backend/utils/misc/guc.c      | 12 ++++++++++++
 src/include/miscadmin.h           |  1 +
 6 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index dc63d7d5e4..ba74556444 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2013,6 +2013,28 @@ Asynchronous Behavior

   
 
+     <varlistentry id="guc-max-worker-processes-per-user" xreflabel="max_worker_processes_per_user">
+      <term><varname>max_worker_processes_per_user</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_worker_processes_per_user</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the maximum number of background processes allowed per user.  If
+        the setting is -1, then there is no limit.  That is also the default.
+       </para>
+
+       <para>
+        Only superusers can change this setting.
+        Unlike <varname>max_worker_processes</varname>, which controls the
+        overall instance limit, this setting can also be changed at run time
+        and can be set differently for different users by
+        using <literal>ALTER ROLE ... SET</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-parallel-workers-per-gather" xreflabel="max_parallel_workers_per_gather">
       <term><varname>max_parallel_workers_per_gather</varname> (<type>integer</type>)

diff --git a/doc/src/sgml/ref/create_role.sgml b/doc/src/sgml/ref/create_role.sgml
index 2ae576ede6..4d0b8127f9 100644
--- a/doc/src/sgml/ref/create_role.sgml
+++ b/doc/src/sgml/ref/create_role.sgml
@@ -201,7 +201,8 @@ Parameters
 the role can make.  -1 (the default) means no limit. Note that only
 normal connections are counted towards this limit. Neither prepared
 transactions nor background worker connections are counted towards
-this limit.
+this limit (see <xref linkend="guc-max-worker-processes-per-user">
+for the latter).

   
  
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index cd99b0b392..f1045ddcd2 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -77,6 +77,7 @@ typedef struct BackgroundWorkerSlot
 	bool		in_use;
 	bool		terminate;
 	pid_t		pid;			/* InvalidPid = not started yet; 0 = dead */
+	Oid			roleid;			/* user responsible for it */
 	uint64		generation;		/* incremented when slot is recycled */
 	BackgroundWorker worker;
 } BackgroundWorkerSlot;
@@ -905,6 +906,32 @@ RegisterDynamicBackgroundWorker(BackgroundWorker *worker,
 	}
 
 	/*
+	 * Check number of used slots for user
+	 */
+	if (max_worker_processes_per_user >= 0)
+	{
+		int		count = 0;
+
+		for (slotno = 0; slotno < BackgroundWorkerData->total_slots; ++slotno)
+		{
+			BackgroundWorkerSlot *slot = &BackgroundWorkerData->slot[slotno];
+
+			if (slot->in_use && slot->roleid == GetUserId())
+				count++;
+		}
+
+		if (count >= max_worker_processes_per_user)
+		{
+			ereport(LOG,
+					(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+					 errmsg("too many worker processes for role \"%s\"",
+							GetUserNameFromId(GetUserId(), false))));
+			LWLockRelease(BackgroundWorkerLock);
+			return false;
+		}
+	}
+
+	/*
 	 * Look for an unused slot.  If we find one, grab it.
 	 */
 	for (slotno = 0; slotno < BackgroundWorkerData->total_slots; ++slotno)
@@ -915,6 +942,7 @@ RegisterDynamicBackgroundWorker(BackgroundWorker *worker,
 		{
 			memcpy(&slot->worker, worker, sizeof(BackgroundWorker));
 			slot->pid = InvalidPid;		/* indicates not started yet */
+			slot->roleid = GetUserId();
 			slot->generation++;
 			slot->terminate = false;
 			generation = slot->generation;
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 08b6030a64..f92c8c196f 
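
For orientation, a guc.c entry for a setting of this shape would look
roughly like the sketch below. This is a sketch only, not the patch's
actual hunk; the group, limits, and hooks here are guesses:

	/* in the ConfigureNamesInt[] table of src/backend/utils/misc/guc.c */
	{
		{"max_worker_processes_per_user", PGC_SUSET, RESOURCES_ASYNCHRONOUS,
			gettext_noop("Maximum number of background worker processes per user."),
			NULL,
		},
		&max_worker_processes_per_user,
		-1, -1, MAX_BACKENDS,		/* boot value, minimum, maximum */
		NULL, NULL, NULL			/* check, assign, and show hooks */
	},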

Re: [HACKERS] Partitioned tables and relfilenode

2017-02-15 Thread Robert Haas
On Wed, Feb 15, 2017 at 2:14 AM, Michael Paquier wrote:
> This is a collection of checks on relkind == RELKIND_PARTITIONED_TABLE
> to avoid interactions with partition tables. Did you consider doing
> something in the executor instead?

That seems inferior, because the planner really ought to know that the
partitioning root doesn't need to be scanned.  That way, it can do a
better job getting the cost and selectivity estimates right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [Bug fix] PQsendQuery occurs error when target_session_attrs is set to read-write

2017-02-15 Thread Robert Haas
On Wed, Feb 15, 2017 at 1:31 AM, Michael Paquier wrote:
> On Wed, Feb 15, 2017 at 11:08 AM, Robert Haas  wrote:
>> I think the patch as presented probably isn't quite what we want,
>> because it waits synchronously for the second result to be ready.
>> Note that the wait for the first result is asynchronous: we check
>> PQisBusy and return without progressing the state machine until the
>> latter returns false; only then do we call PQgetResult().
>
> OK, I have been poking at this problem. I think that we need to
> introduce a new state in ConnStatusType telling "please consume all
> results until PQgetResult returns NULL" which is CONNECTION_CONSUME in
> the patch attached. And as long as all the results are not consumed,
> the loop just keeps going. If more messages are being waited for with
> PGisBusy returning true, then the loop requests for more data to read
> by switching back to PGRES_POLLING_READING.
>
> By the way, I am noticing as well that libpq.sgml is missing a
> reference to CONNECTION_CHECK_WRITABLE. This should be present as
> applications calling PQconnectPoll() could face such a status. I have
> fixed this issue as well in the patch attached.

Great, thanks!  This looks good to me, so committed.  Is there any
chance you can use your amazing TAP-test-creation powers to do some
automated testing of this feature?  For example, maybe we could at
least set up a master and a standby and somehow test that asking for a
read-only server picks the standby if it's first but asking for a
read-write server tries the standby and then switches to the master?
It would also be nice to test that probing a server that doesn't exist
fails, but I'm not sure how to create an IP/port combination that's
guaranteed to not work.
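
For anyone following along, the consume-until-NULL rule that the new
CONNECTION_CONSUME state enforces looks roughly like this when written
with the public libpq API. This is a client-side sketch, not the
patch's internal state machine:

#include <sys/select.h>
#include <libpq-fe.h>

/*
 * Consume every result of the current query without blocking inside
 * libpq: wait on the socket whenever PQisBusy() reports that more
 * input is needed, and stop once PQgetResult() returns NULL.
 * Returns 0 on success, -1 on error.
 */
static int
drain_results(PGconn *conn)
{
	for (;;)
	{
		PGresult   *res;

		while (PQisBusy(conn))
		{
			fd_set		readfds;
			int			sock = PQsocket(conn);

			if (sock < 0)
				return -1;
			FD_ZERO(&readfds);
			FD_SET(sock, &readfds);
			if (select(sock + 1, &readfds, NULL, NULL, NULL) < 0)
				return -1;
			if (!PQconsumeInput(conn))
				return -1;		/* connection trouble */
		}

		res = PQgetResult(conn);
		if (res == NULL)
			return 0;			/* all results consumed */
		PQclear(res);
	}
}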

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add doc advice about systemd RemoveIPC

2017-02-15 Thread Peter Eisentraut
On 2/14/17 11:49 AM, Magnus Hagander wrote:
> +   
> +If systemd is in use, some care must be
> taken
> +that IPC resources (shared memory and semaphores) are not prematurely
> +removed by the operating system.  This is especially of concern when
> +installing PostgreSQL from source.  Users of distribution packages of
> +PostgreSQL are less likely to be affected.
> +   
> 
> I would add "are less likely to be affected as the postgres user is
> normally created as a system user" or something like that -- to indicate
> *why* they are less likely to be affected (and it also tells people that
> if they change the user, then they might become affected again).

Committed with that addition, thanks!
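
(For anyone hitting this before reading the docs: the committed advice
boils down to turning off systemd's IPC cleanup for the affected user,
roughly:

# in /etc/systemd/logind.conf
[Login]
RemoveIPC=no

followed by restarting systemd-logind.)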

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup -R

2017-02-15 Thread Robert Haas
On Tue, Feb 14, 2017 at 11:36 PM, Michael Paquier wrote:
> On Wed, Feb 15, 2017 at 2:38 AM, Robert Haas  wrote:
>> On Thu, Feb 9, 2017 at 9:40 PM, Michael Paquier wrote:
>>> In short, I'd like to think that we should just filter out those two
>>> parameters by name and call it a day. Or introduce the notion of a
>>> value set from the environment by adding some kind of tracking flag
>>> in PQconninfoOption? Though I am not sure that it is worth
>>> complicating libpq just to generate recovery.conf in pg_basebackup.
>>
>> Yeah, I'm not sure what the best solution is.  I just thought it was strange.
>
> Thinking more about this, perhaps the correct answer is to do nothing?
> target_session_attrs being set is rather similar to sslmode or
> sslcompression, for example. They are there but don't hurt. The same
> applies to passfile: if the file is not there, the client will still
> ask for input; if it is there, things are helped a bit.

Let's wait and see if anybody else has an opinion.  I imagine that, as
further libpq parameters are added, eventually this is going to get
long and annoying enough that we want to do something about it.  But
we might not have reached that point just yet.
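
To make the name-based filtering idea concrete, it would amount to
something like this sketch against the public libpq API; the skip list
here is just the two parameters under discussion, and real code would
also need to quote the emitted values:

#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

/*
 * Write out the parameters of an established connection, skipping
 * client-only parameters by name.
 */
static void
emit_conninfo(PGconn *conn, FILE *out)
{
	static const char *const skip[] = {"passfile", "target_session_attrs"};
	PQconninfoOption *opts = PQconninfo(conn);
	PQconninfoOption *opt;

	for (opt = opts; opt != NULL && opt->keyword != NULL; opt++)
	{
		int			skipped = 0;
		int			i;

		for (i = 0; i < (int) (sizeof(skip) / sizeof(skip[0])); i++)
			if (strcmp(opt->keyword, skip[i]) == 0)
				skipped = 1;

		if (skipped || opt->val == NULL || opt->val[0] == '\0')
			continue;

		fprintf(out, "%s=%s ", opt->keyword, opt->val);
	}
	PQconninfoFreeOptions(opts);
}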

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add checklist item for psql completion to commitfest review

2017-02-15 Thread Robert Haas
On Tue, Feb 14, 2017 at 1:13 PM, Jim Nasby  wrote:
> After seeing Yet Another Missing Psql Tab Completion it occurred to me...
> why not add a checklist item to the commitfest review page? I realize most
> regular contributors don't use the form, but a fair number of people do. I
> don't see how it could hurt.
>
> Another possible idea is a git hook that checks to see if the psql
> completion code has been touched whenever the grammar has been. That could
> certainly trigger false positives so it'd need to be easy to override, but
> AFAIK that could be done via a special phrase in the commit message.

In the past, our usual policy has been that tab completion isn't a
hard requirement for a patch implementing a new feature.  It often
gets added after the fact.  I think that policy has worked fine, but
it's not a bad thing for people to include tab completion in the
original patch either, if they have the brain space for it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Documentation improvements for partitioning

2017-02-15 Thread Robert Haas
On Wed, Feb 15, 2017 at 9:10 AM, Simon Riggs  wrote:
>>> * "ERROR:  cannot create index on partitioned table 
>>> "measurement_year_month""
>>> is misleading because you can create indexes on partitions
>>
>> Do you mean that this should not cause an error at all, but create the
>> specified index on partitions as part of running the command?  Should the
>> code to handle that be part of this release?
>
> Sounds fairly basic to me. If you don't support this, then presumably
> every ORM, pgAdmin, etc. will be broken.

I don't see why that should be the case.

> And 1000 people will need to write a script that does what we could do
> easily in a loop internally.

Now that is probably true.

> At present you haven't even documented how you'd do this.

It's not that hard to figure it out, though.  A HINT wouldn't be a bad
idea, certainly.

There are some really thorny problems with making index creation
cascade to all of the partitions.  I think it's worth doing, but
there's a lot of stuff to think about before you go and start writing
code.  Most obviously, if you can use a single CREATE INDEX statement
to create indexes on all of the partitions, then probably you ought to
also be able to use DROP INDEX to get rid of all of those indexes.  In
other words, it should probably work a lot like what already happens
with constraints: constraints cascade from the parent down to the
children, but we still know which child object goes with which parent
object, so if the parent object is dropped we can get rid of all of
the children.  I think we need something similar here, although if we
restrict it to the partitioning case and don't make it work with table
inheritance then it can be simpler since table partitioning doesn't
allow multiple inheritance.  Presumably we'd want other index commands
like REINDEX to cascade similarly.

Also, it's not entirely clear what the semantics should be.  If the
partitioning key is (a) and you ask for an index on (a, b), you could
conceivably omit a from the indexes created on partitions that only
cover a single value of a.  (That case is easy to detect when list
partitioning is in use.)  Should we try to do that elimination, or just
do what the user asked for?  Will users be unhappy if we try to do
this sort of column elimination but it only works in simple cases?
Think about the possibility that there are partitioning expressions
rather than partitioning columns before concluding we can make it work
in all cases.  On the other hand, if you ask for a UNIQUE index on
(b), should we go ahead and create such an index on each partition,
ensuring uniqueness within each partition, or should we refuse to
proceed on the grounds that we can't be sure that such an index will
ensure global uniqueness?  If you do the former, someone might find
the behavior surprising, but if you do the latter, you might annoy
people who know what they're asking for and want that thing but can't
get it.  I suspect we want to eventually allow a user to ask for
either one, because eventually we'll probably have global indexes, and
then you really need a way to say whether you want a global index or a
partitioned non-global index.  But that requires agreeing on syntax,
which is complicated and will probably involve a lot of bikeshedding
(as well it should - these are big decisions).

I think it would be a bad idea to try to fix this problem for v10.
One of the earlier versions of the patch allowed indexes on the parent
table as if it were just a regular empty table, which did not seem
useful. I asked Amit to disallow that so as to keep our options open
for the future.  I see no reason why v11 or v12 can't fill in the
functionality in this area.  Right now we're 2 weeks away from the
start of the last CommitFest, and that's not the time to go start
writing a complex patch for a feature that isn't even particularly
well-defined.  If somebody really cared about this
make-an-index-for-everything-in-the-hierarchy problem, they could've
written a patch for that at any time in the last 5 years; it's not
strictly dependent on the new partitioning stuff.  Nobody's done that,
and trying to throw together something now in the last couple of weeks
could easily end with us getting it wrong and then having to face the
unpleasant prospect of either leaving it broken or breaking backward
compatibility to fix it.

> It leaves me asking what else is missing.

There is certainly a lot of room for improvement here but I don't
understand your persistent negativity about what's been done thus far.
I think it's pretty clearly a huge step forward, and I think Amit
deserves a ton of credit for making it happen.  The improvements in
bulk loading performance alone are stupendous.  You apparently have
the idea that somebody could have written an even larger patch that
solved even more problems at once, but this was already a really big
patch, and IMHO quite a good one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
