Re: [HACKERS] Extracting fields from 'infinity'::TIMESTAMP[TZ]

2015-11-18 Thread Torsten Zühlsdorff

On 17.11.2015 09:09, Vitaly Burovoy wrote:


I suppose the behavior of monotonic values (julian, century, decade,
isoyear, millennium and year) should be the same as for epoch (which
is obviously also a monotonic value).
Proposed patch has that behavior: +/-infinity for epoch, julian,
century, decade, isoyear, millennium and year; NULL for other fields.
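
For illustration, the proposed behavior would look roughly like this
(a sketch of the patched output, not what stock 9.5 prints):

postgres=# SELECT extract(epoch FROM 'infinity'::timestamptz);
 date_part
-----------
  Infinity
(1 row)

postgres=# SELECT extract(year FROM 'infinity'::timestamptz);
 date_part
-----------
  Infinity
(1 row)

postgres=# SELECT extract(hour FROM 'infinity'::timestamptz);
 date_part
-----------

(1 row)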


This works for me.

Greetings,
Torsten


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: multiple psql option -c

2015-11-18 Thread Catalin Iacob
On Tue, Nov 17, 2015 at 10:13 PM, Tom Lane  wrote:
> 1. -c no longer implies --no-psqlrc.  That's a backwards incompatibility,
> but very easy to explain and very easy to work around.
>
> 2. You can have multiple -c and/or -f.  Each -c is processed in
> the traditional way, ie, either it's a single backslash command
> or it's sent in a single PQexec.  That doesn't seem to me to have
> much impact on the behavior of adjacent -c or -f.
>
> 3. If you combine -1 with -c and/or -f, you get one BEGIN inserted
> at the beginning and one COMMIT at the end.  Nothing else changes.

And -v AUTOCOMMIT=off should do the same as now for -c: issue a BEGIN
before each single PQexec with the content of each -c.

I like it; it avoids what I didn't like about the -C semantics, since
the grouping now means something (a single PQexec per group) and you
even see the effects of the grouping in the result (only the last
result of the group is returned). If you don't like that grouping
(probably most people won't), the solution is intuitive: split the
group into multiple -c options.

Another incompatibility is that -1 is currently silently ignored by
-c, so for example "psql -1 -c VACUUM" works today but won't work with
the new semantics. That seems like a good thing, though, because it
reflects the fact that VACUUM can't run inside a transaction block, so
I don't think it should stop this proposal from going ahead.
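
To make the proposed semantics concrete, an invocation along these
lines (a hypothetical example, assuming the rules above) would behave
as follows:

psql -1 -c 'SELECT 1; SELECT 2' -c '\dt' -f setup.sql mydb

A BEGIN is issued first, "SELECT 1; SELECT 2" is sent as a single
PQexec (so only the result of SELECT 2 is shown), \dt runs as a single
backslash command, setup.sql is processed as with -f, and a COMMIT is
issued at the end; ~/.psqlrc is read unless -X is given.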

I'll try to write the documentation patch for these semantics sometime
next week.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2015-11-18 Thread Robert Haas
On Mon, Nov 16, 2015 at 9:03 PM, Kouhei Kaigai  wrote:
> I guess we will put a pointer to static ExtensibleNodeMethods structure
> on ForeignScan, CustomScan, CustomScanState and etc...

I think that makes it confusing.  What I'd prefer to do is have only
the name stored in the node, and then people who need methods can look
up those methods based on the name.

> Let me write down the steps for deserialization. Please correct me if it is
> not what you are thinking.
> First, stringToNode picks up a token that shall have a node label, like
> "SEQSCAN", "CUSTOMSCAN" and others. Once we can identify that the following
> tokens belong to a CustomScan node, it can advance the pg_strtok pointer
> until the "methods" field. The methods field consists of a pair of library_name
> and symbol_name, then it tries to load the library and resolve the symbol.

No.  It reads the "extnodename" field (or whatever we decide to call
it) and calls GetExtensibleNodeMethods() on that name.  That name
returns the appropriate structure.  C libraries that get loaded into
the server can register their extensible node types at _PG_init()
time.

> Even if the library was not loaded on the worker side, this step ensures
> the library gets loaded; thus, _PG_init() can call RegisterExtensibleNodeMethods.

A parallel worker will load the same loadable modules which the master
has got loaded.  If you use some other kind of background worker, of
course, you're on your own.

> I have two concerns on the above proposition.
> 1) The 'node_size' field implies the user-defined structure has a fixed length.
>An extension may want to use a variable-length structure. The example below
>shows that GpuJoinState has multiple innerState structures according to the
>number of inner relations joined at once.
>https://github.com/pg-strom/devel/blob/master/src/gpujoin.c#L240

OK, we can replace the node_size field with an allocator callback:
Node (*nodeAlloc)(void) or something like that.

> 2) It may be valuable if 'nodeRead(void)' can take an argument of the
>known/common portion of ForeignScan etc.  It allows an extension
>to adjust the number of private fields according to the known part.
>For example, I can expect flexible-length private fields according
>to the length of the custom_subplans list.

I don't understand this bit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-11-18 Thread Robert Haas
On Wed, Nov 18, 2015 at 12:48 AM, Amit Kapila  wrote:
>> I suggest that we instead fix ExecParallelFinish() to be idempotent.
>> Add a "bool finished" flag to ParallelExecutorInfo and return at once
>> if it's already set. Get rid of the exposed
>> ExecParallelReinitializeTupleQueues() interface and have
>> ExecParallelReinitialize(pei) instead.  Have that call
>> ReinitializeParallelDSM(), ExecParallelSetupTupleQueues(pei->pcxt,
>> true), and set pei->finished = false.  I think that would give us a
>> slightly cleaner separation of concerns between nodeGather.c and
>> execParallel.c.
>
> Okay, attached patch fixes the issue as per above suggestion.

Thanks, committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-18 Thread Robert Haas
On Tue, Nov 17, 2015 at 4:59 PM, Thom Brown  wrote:
> On 17 November 2015 at 20:08, Robert Haas  wrote:
>> On Tue, Nov 17, 2015 at 4:26 AM, Thom Brown  wrote:
>>
>>> However, the first parallel seq scan shows it getting 170314 rows.
>>> Another run shows it getting 194165 rows.  The final result is
>>> correct, but as you can see from the rows on the Append node (59094295
>>> rows), it doesn't match the number of rows on the Gather node
>>> (3000).
>>
>> Is this the same issue reported in
>> http://www.postgresql.org/message-id/CAFj8pRBF-i=qdg9b5nzrxyfchzbezwmthxyphidqvwomojh...@mail.gmail.com
>> and not yet fixed?  I am inclined to think it probably is.
>
> Yes, that seems to be the same issue.

I've committed a fix for that issue now, so you shouldn't see it any
more if you retest this patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in numeric multiplication

2015-11-18 Thread Tom Lane
Dean Rasheed  writes:
> On 17 November 2015 at 23:57, Tom Lane  wrote:
>> I'm not sure that it's provably okay though.  The loose ends are to show
>> that div[qi] can't overflow an int during the divisor-subtraction step
>> and that "outercarry" remains bounded.   Testing suggests that outercarry
>> can't exceed INT_MAX/NBASE, but I don't see how to prove that.

> At the top of the loop we know that
>   Abs(div[qi]) <= maxdiv * (NBASE-1)
>                <= INT_MAX - INT_MAX/NBASE - 1
> Then we add Abs(qdigit) to maxdiv which potentially triggers a
> normalisation step. That step adds carry to div[qi], and we know that
>   Abs(carry) <= INT_MAX/NBASE + 1
> so the result is that Abs(div[qi]) <= INT_MAX -- i.e., it can't
> overflow at that point, but it could be on the verge of doing so.

Right.

> Then the divisor-subtraction step does
>   div[qi] -= qdigit * var2digits[0]
> That looks as though it could overflow, although actually I would
> expect qdigit to have the same sign as div[qi], so that this would
> drive div[qi] towards zero.

But if outercarry is nonzero, then qdigit is going to be primarily driven
by outercarry not div[qi], so I'm afraid it's mistaken to rely on it
having the right sign to cause cancellation.

> If you didn't want to rely on that though,
> you could take advantage of the fact that this new value of div[qi] is
> only needed for outercarry, so you could modify the
> divisor-subtraction step to skip div[qi]:
> and fold the most significant digit of the divisor-subtraction step
> into the computation of outercarry

Cute idea.  Another thought I'd had is that we could change the carry
propagation loop limits so that div[qi] is normalized along with the other
digits.  Then we'd have a carry leftover at the end of the loop, but we
could add it to outercarry.  That makes the argument that div[qi] does
not overflow the same as for the other digits.

However, I kind of like your idea of postponing the div[qi] subtraction
to the outercarry update, because then it's very direct to see that
that update should result in near-total cancellation.

> so outercarry ought to be driven towards zero, as the comment says. It
> certainly doesn't look like it will grow without bounds, but I haven't
> attempted to work out what its bounds actually are.

I'm kind of stuck on that too.  I did some experimentation by tracking
maximum values of outercarry in the regression tests (including
numeric_big) and did not see it get larger than a couple hundred thousand,
ie more or less INT_MAX/NBASE.  But I don't have an argument as to why
that would be the limit.

Another issue here is that with outercarry added into the qdigit
computation, it's no longer clear what the bounds of qdigit itself are,
so is it really safe to assume that "div[qi + i] -= qdigit * var2digits[i]"
can't overflow?  Stated in other terms, why are we sure that the new
maxdiv value computed at the bottom of the carry propagation stanza is
within the safe range?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

2015-11-18 Thread Alvaro Herrera
Hi, I just started looking this over a bit.  The first thing I noticed
is that it adds a dependency on Archive::Tar which isn't already used
anywhere else.  Did anybody check whether this exists back in 5.8
installations?

Why is "recovery" added to ALWAYS_SUBDIRS in src/test/Makefile instead
of to SUBDIRS?  Seems a strange choice.

Instead of adding 
print "# Error output: $stderr\n" if $stderr ne "";
to sub psql, I think it would be better to add line separators, which
would be clearer if the error output ever turns into a multiline error
message.  It would still show as empty if no stderr is produced; so I
think something like
if ($stderr ne '')
{
print " Begin standard error\n";
print $stderr;
print " End standard error\n";
}
or something like that.

In my days of Perl, it was starting to become frowned upon to call
subroutines without parenthesizing arguments.  Is that no longer the
case?  Because I notice there are many places, in this patch and pre-
existing ones, that call psql with an argument list without parens.  And it's
a bit odd because I couldn't find any other subroutine that we're using
in that way.

In 005_replay_delay there's a 2s delay configured; then we test whether
something is replayed in 1s.  I hate tests that run for a long time, but
is 2s good enough considering that some of our buildfarm test animals
are really slow?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

2015-11-18 Thread Erik Rijkers

On 2015-11-18 16:21, Alvaro Herrera wrote:

Hi, I just started looking this over a bit.  The first thing I noticed
is that it adds a dependency on Archive::Tar which isn't already used
anywhere else.  Did anybody check whether this exists back in 5.8
installations?


Apparently it did not yet exist in core then, Module::CoreList says 
5.9.3:


$ perl -MModule::CoreList -e ' print Module::CoreList->first_release('Archive::Tar'), "\n";'

5.009003





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-18 Thread Robert Haas
On Wed, Nov 18, 2015 at 7:25 AM, Amit Kapila  wrote:
> Don't we need the startup cost in case we need to build partial paths for
> joinpaths like mergepath?
> Also, I think there are other cases for a single relation scan where startup
> cost can matter, like when there are pseudoconstants in the qualification
> (refer cost_qual_eval_walker()) or, let us say, if someone has disabled
> seq scan (disable_cost is considered as startup cost).

I'm not saying that we don't need to compute it.  I'm saying we don't
need to take it into consideration when deciding which paths have
merit.   Note that consider_startup is set this way:

rel->consider_startup = (root->tuple_fraction > 0);

And for a joinrel:

joinrel->consider_startup = (root->tuple_fraction > 0);

root->tuple_fraction is 0 when we expect all tuples to be retrieved,
and parallel query can currently only be used when all tuples will be
retrieved.

> B. I think partial path is an important concept and deserves some
> explanation in src/backend/optimizer/README.
> There is already a good explanation about Paths, so it seems
> better to add an explanation about partial paths as well.

Good idea.  In the attached, revised version of the patch, I've added
a large new section to that README.

> 2.
> + *  costs: parallelism is only used for plans that will be run to completion.
> + *Therefore, this routine is much simpler than add_path: it needs to
> + *consider only pathkeys and total cost.
>
> There seems to be some spacing issue in last two lines.

Fixed.

> A.
> This means that for inheritance child relations for which rel pages are
> less than parallel_threshold, it will always consider the cost shared
> between 1 worker and leader as per below calc in cost_seqscan:
> if (path->parallel_degree > 0)
> run_cost = run_cost / (path->parallel_degree + 0.5);
>
> I think this might not be the appropriate cost model even for
> non-inheritance relations which have more pages than parallel_threshold,
> but it seems to be even worse for inheritance children which have
> fewer pages than parallel_threshold.

Why?  I'm certainly open to patches to improve the cost model, but I
don't see why this patch needs to do that.

> B.
> If none of the inheritance child rels (or very few of them) are big
> enough for a parallel scan, is considering an Append node for
> parallelism of any use?  In other words, won't it be better to
> generate the plan as it is done now, without this patch, for such
> cases, considering the current execution model of the Gather node?

I think we should instead extend the Append node as suggested by
KaiGai, so that it can have both partial and non-partial children.  I
think we can leave that to another patch, though.  Aside from
questions of which plan is fastest, it's very bad to have Gather
nodes under the Append, because the Append could have many children
and we could end up with a ton of Gather nodes, each using a DSM and a
bunch of workers.  That's important to avoid.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 012c14b..fe07176 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1588,6 +1588,7 @@ _outPathInfo(StringInfo str, const Path *node)
 	else
 		_outBitmapset(str, NULL);
 	WRITE_BOOL_FIELD(parallel_aware);
+	WRITE_INT_FIELD(parallel_degree);
 	WRITE_FLOAT_FIELD(rows, "%.0f");
 	WRITE_FLOAT_FIELD(startup_cost, "%.2f");
 	WRITE_FLOAT_FIELD(total_cost, "%.2f");
@@ -1764,7 +1765,6 @@ _outGatherPath(StringInfo str, const GatherPath *node)
 	_outPathInfo(str, (const Path *) node);
 
 	WRITE_NODE_FIELD(subpath);
-	WRITE_INT_FIELD(num_workers);
 	WRITE_BOOL_FIELD(single_copy);
 }
 
@@ -1887,6 +1887,7 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 	WRITE_NODE_FIELD(reltargetlist);
 	WRITE_NODE_FIELD(pathlist);
 	WRITE_NODE_FIELD(ppilist);
+	WRITE_NODE_FIELD(partial_pathlist);
 	WRITE_NODE_FIELD(cheapest_startup_path);
 	WRITE_NODE_FIELD(cheapest_total_path);
 	WRITE_NODE_FIELD(cheapest_unique_path);
diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index 916a518..5019804 100644
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
@@ -851,4 +851,57 @@ lateral reference.  (Perhaps now that that stuff works, we could relax the
 pullup restriction?)
 
 
--- bjm & tgl
+Parallel Query and Partial Paths
+
+
+Parallel query involves dividing up the work that needs to be performed
+either by an entire query or some portion of the query in such a way that
+some of that work can be done by one or more worker processes, which are
+called parallel workers.  Parallel workers are a subtype of dynamic
+background workers; see src/backend/access/transam/README.parallel for a
+fuller description.  Academic literature on parallel query suggests that
+that parallel 

Re: [HACKERS] Using quicksort for every external sort run

2015-11-18 Thread Jeff Janes
On Wed, Aug 19, 2015 at 7:24 PM, Peter Geoghegan  wrote:

Hi Peter,

Your most recent versions of this patch series (not the ones on the
email I am replying to) give a compiler warning:

tuplesort.c: In function 'mergeruns':
tuplesort.c:2741: warning: unused variable 'memNowUsed'


> Multi-pass sorts
> -
>
> I believe, in general, that we should consider a multi-pass sort to be
> a kind of inherently suspect thing these days, in the same way that
> checkpoints occurring 5 seconds apart are: not actually abnormal, but
> something that we should regard suspiciously. Can you really not
> afford enough work_mem to only do one pass?

I don't think it is really about the cost of RAM.  What people can't
afford is spending all of their time personally supervising all the
sorts on the system. It is pretty easy for a transient excursion in
workload to make a server swap itself to death and fall over. Not just
the PostgreSQL server, but the entire OS. Since we can't let that
happen, we have to be defensive about work_mem. Yes, we have far more
RAM than we used to. We also have far more things demanding access to
it at the same time.

I agree we don't want to optimize for low memory, but I don't think we
should throw it under the bus, either.  Right now we are effectively
saying the CPU-cache problems with the heap start exceeding the larger
run size benefits at 64kb (the smallest allowed setting for work_mem).
While any number we pick is going to be a guess that won't apply to
all hardware, surely we can come up with a guess better than 64kb.
Like, 8 MB, say.  If available memory for the sort is 8MB or smaller
and the predicted size anticipates a multipass merge, then we can use
the heap method rather than the quicksort method.  Would a rule like
that complicate things much?

It doesn't matter to me personally at the moment, because the smallest
work_mem I run on a production system is 24MB.  But if for some reason
I had to increase max_connections, or had to worry about plans with
many more possible concurrent work_mem allocations (like some
partitioning), then I might need to rethink that setting downward.

>
> In theory, the answer could be "yes", but it seems highly unlikely.
> Not only is very little memory required to avoid a multi-pass merge
> step, but as described above the amount required grows very slowly
> relative to linear growth in input. I propose to add a
> checkpoint_warning style warning (with a checkpoint_warning style GUC
> to control it).

I'm skeptical about a warning for this.  I think it is rather unlike
checkpointing, because checkpointing is done in a background process,
which greatly limits its visibility, while sorting is a foreground
thing.  I know if my sorts are slow, without having to go look in the
log file.  If we do have the warning, shouldn't it use a log-level
that gets sent to the front end where the person running the sort can
see it and locally change work_mem?  And if we have a GUC, I think it
should be a dial, not a binary.  If I have a sort that takes a 2-way
merge and then a final 29-way merge, I don't think that that is worth
reporting.  So maybe "the maximum number of runs on a tape exceeds 2"
(rather than "exceeds 1", which is the current behavior with the
patch) would be the setting I would want to use, if I were to use it
at all.

...

> This patch continues to have tuplesort determine run size based on the
> availability of work_mem only. It does not entirely fix the problem of
> having work_mem sizing impact performance in counter-intuitive ways.
> In other words, smaller work_mem sizes can still be faster. It does
> make that general situation much better, though, because quicksort is
> a cache oblivious algorithm. Smaller work_mem sizes are sometimes a
> bit faster, but never dramatically faster.

Yes, that is what I found as well.   I think the main reason it is
even that small bit slower at large memory is because writing and
sorting are not finely interleaved, like they are with heap selection.
Once you sit down to qsort 3GB of data, you are not going to write any
more tuples until that qsort is entirely done.  I didn't do any
testing beyond 3GB of maintenance_work_mem, but I imagine this could
get more important if people used dozens or hundreds of GB.

One idea would be to stop and write out a just-sorted partition
whenever that partition is contiguous to the already-written portion.
If the qsort is tweaked to recurse preferentially into the left
partition first, this would result in tuples being written out at a
pretty study pace.  If the qsort was unbalanced and the left partition
was always the larger of the two, then that approach would have to be
abandoned at some point.  But I think there are already defenses
against that, and at worst you would give up and revert to the
sort-them-all then write-them-all behavior.



Overall this is very nice.  Doing some real world index builds of
short text (~20 bytes ascii) 

Re: [HACKERS] LLVM miscompiles numeric.c access to short numeric var headers

2015-11-18 Thread Greg Stark
Fwiw it looks like the LLVM folk think this is an asan bug.

https://llvm.org/bugs/show_bug.cgi?id=25550

It looks like the fix is that the compiler should avoid this
optimization if the code is being compiled with instrumentation.  This
worries me a bit, but I think our code is safe, as the Datum will always
be either palloc'd (which will be fully aligned and therefore can't
overrun a page) or on the stack (in which case the whole struct will be
allocated regardless of how many digits we need).


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] SQL function to report log message

2015-11-18 Thread Robert Haas
On Mon, Nov 16, 2015 at 5:41 PM, Peter Eisentraut  wrote:
> On 11/16/15 5:10 PM, Tom Lane wrote:
>> Peter Eisentraut  writes:
>>> On 11/15/15 9:50 PM, Craig Ringer wrote:
 I'd prefer to omit fields if explicitly assigned to NULL. You can always
 use coalesce if you want the string 'NULL'; I agree with others here
 that the vast majority of users will want the field just omitted.
>>
>>> I think the problem was that you can't omit the primary message.
>>
>> If you ask me, that would be a feature not a bug.
>
> Right.
>
> Frankly, I have lost track of what the problem here is.

I am still of the opinion that this patch is a solution in search of a problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pglogical_output - a general purpose logical decoding output plugin

2015-11-18 Thread Tomasz Rybak
On Thu, 12.11.2015 at 22:23 +0800, Craig Ringer wrote:
> Hi all
> 
> Here's an updated pglogical_output patch.
> 
> Selected changes since v1:
> 
>     - add json protocol output support
>     - fix encoding comparisons to use parsed encoding not string name
>     - import protocol documentation
>     - significantly expand pg_regress tests
>     - move pglogical_output_plhooks to a top-level contrib
>     - remove 9.4 compatibility

I've just skimmed through the patch.
As you removed 9.4 compatibility, are the mentions of 9.4 and 9.5
compatibility in README.md and README.plhooks still needed?
It's not much text, but I'm not sure whether they shouldn't be
removed for a 9.6-targeting change.

-- 
Tomasz Rybak  GPG/PGP key ID: 2AD5 9860
Fingerprint A481 824E 7DD3 9C0E C40A  488E C654 FB33 2AD5 9860
http://member.acm.org/~tomaszrybak






Re: [HACKERS] Freeze avoidance of very large table.

2015-11-18 Thread Jeff Janes
On Tue, Nov 17, 2015 at 10:32 PM, Masahiko Sawada  wrote:
>
> Attached latest v24 patch.
> I've changed patch so that just adding frozen bit into visibility map.
> So the size of patch is almost half of previous one.
>

Should there be an Assert in visibilitymap_get_status (or elsewhere)
against the impossible state of being all frozen but not all visible?

I get an error when running pg_upgrade from 9.4 to 9.6 with this patch:

error while copying relation "mediawiki.archive"
("/tmp/data/base/16414/21043_vm" to
"/tmp/data_fm/base/16400/21043_vm"): No such file or directory


Cheers,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: multiple psql option -c

2015-11-18 Thread Pavel Stehule
> 1. -c no longer implies --no-psqlrc.  That's a backwards incompatibility,
> but very easy to explain and very easy to work around.
>
>
This can be a very surprising change. Can we disable it temporarily by
some environment variable, like NOPSQLRC?



> 2. You can have multiple -c and/or -f.  Each -c is processed in
> the traditional way, ie, either it's a single backslash command
> or it's sent in a single PQexec.  That doesn't seem to me to have
> much impact on the behavior of adjacent -c or -f.
>
> 3. If you combine -1 with -c and/or -f, you get one BEGIN inserted
> at the beginning and one COMMIT at the end.  Nothing else changes.
>
> As long as you put only one SQL command per -c, I don't think that
> this definition has any real surprises.  And we can discourage
> people from putting more than one, saying that that will invoke
> legacy behaviors you probably don't want.
>

+1

Pavel


>
> regards, tom lane
>


Re: [HACKERS] proposal: multiple psql option -c

2015-11-18 Thread Tom Lane
Pavel Stehule  writes:
>> 1. -c no longer implies --no-psqlrc.  That's a backwards incompatibility,
>> but very easy to explain and very easy to work around.

> This can be very surprising change. Can we disable it temporary by some
> environment variable? like NOPSQLRC ?

Why would that be better than "add -X to the psql call"?

I think generally the idea here is to have fewer warts, not more.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: multiple psql option -c

2015-11-18 Thread Pavel Stehule
2015-11-18 21:17 GMT+01:00 Tom Lane :

> Pavel Stehule  writes:
> >> 1. -c no longer implies --no-psqlrc.  That's a backwards
> incompatibility,
> >> but very easy to explain and very easy to work around.
>
> > This can be very surprising change. Can we disable it temporary by some
> > environment variable? like NOPSQLRC ?
>
> Why would that be better than "add -X to the psql call"?
>
> I think generally the idea here is to have fewer warts, not more.
>

I can set an environment variable once and all scripts will run without
psqlrc, without searching for and editing all the scripts. If we are
breaking compatibility, then we can be a little bit friendly to admins,
users, ...

It isn't necessary, but it can decrease the possible impact.

Pavel


>
> regards, tom lane
>


Re: [HACKERS] Minor comment improvement to create_foreignscan_plan

2015-11-18 Thread Robert Haas
On Tue, Nov 17, 2015 at 9:29 PM, Etsuro Fujita
 wrote:
> On 2015/11/18 2:57, Robert Haas wrote:
>> On Sun, Nov 15, 2015 at 9:25 PM, Etsuro Fujita
>>  wrote:
>>>
>>> Oops, I've found another one.  I think we should update a comment in
>>> postgresGetForeignPlan, too; add remote filtering expressions to the list
>>> of
>>> information needed to create a ForeignScan node.
>
>> Instead of saying "remote/local", how about saying "remote and local"
>> or just leaving it out altogether as in the attached?
>
> +1 for your patch.

OK, committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-18 Thread Jeff Janes
On Wed, Nov 18, 2015 at 11:18 AM, Jeff Janes  wrote:
>
> I get an error when running pg_upgrade from 9.4 to 9.6-this
>
> error while copying relation "mediawiki.archive"
> ("/tmp/data/base/16414/21043_vm" to
> "/tmp/data_fm/base/16400/21043_vm"): No such file or directory

OK, so the problem seems to be that rewriteVisibilitymap can get
called with errno already set to a nonzero value.

It never clears it, and then fails at the end even though no error
has actually occurred.

Just setting it to 0 at the top of the function seems to be the correct
thing to do.  Or does it need to save the old value and restore it?

But now when I want to do the upgrade faster, I run into this:

"This utility cannot upgrade from PostgreSQL version from 9.5 or
before to 9.6 or later with link mode."

Is this really an acceptable tradeoff?  Surely we can arrange to
link everything else and rewrite just the _vm, which is a tiny portion
of the data directory.  I don't think that -k promises to link
everything it possibly can.

Cheers,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Implement failover on libpq connect level.

2015-11-18 Thread Robert Haas
On Tue, Nov 17, 2015 at 3:20 PM, Peter Eisentraut  wrote:
> On 11/6/15 3:59 PM, Robert Haas wrote:
>> So, I really wonder why we're not happy with the ability to substitute
>> out just the host and IP.
>
> One of my concerns is that the proposed URLs are not valid URLs anymore
> that can be parsed or composed with a URL library, which would be sad.

I agree that's sad.  But it buys compatibility with JDBC, which is
something to think about, at least.

> The other issue is that I think it's taking the implementation down the
> wrong path.

Well, I don't object to the idea of having some kind of a function
that can take a set of connection strings or URLs and try them all.
But then it's not really a connection string any more.  I mean, with
the proposal where we allow the host option to be multiply specified,
you can just plug this feature into any utility whatever that
understands connection strings, and regardless of whether you are
using the classic connection string format or the URL format, it will
work.  And it's a pretty simple extension of what we support already.

psql 'host=foo host=bar failovertime=1'

If we do what you're proposing, then I'm really not clear on how this
works.  Certainly, there can be new libpq functions that take a list
of connection strings or URLs as an argument... but what would this
look like if you wanted to invoke psql this way?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: multiple psql option -c

2015-11-18 Thread Robert Haas
On Wed, Nov 18, 2015 at 3:17 PM, Tom Lane  wrote:
> Pavel Stehule  writes:
>>> 1. -c no longer implies --no-psqlrc.  That's a backwards incompatibility,
>>> but very easy to explain and very easy to work around.
>
>> This can be very surprising change. Can we disable it temporary by some
>> environment variable? like NOPSQLRC ?
>
> Why would that be better than "add -X to the psql call"?
>
> I think generally the idea here is to have fewer warts, not more.

Amen to that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Additional role attributes && superuser review

2015-11-18 Thread Michael Paquier
On Wed, Nov 18, 2015 at 10:06 PM, Michael Paquier  wrote:

>
>
> On Wed, Sep 30, 2015 at 8:11 PM, Stephen Frost  wrote:
> > * Heikki Linnakangas (hlinn...@iki.fi) wrote:
> >> I agree with Robert's earlier point that this needs to be split into
> >> multiple patches, which can then be reviewed and discussed
> >> separately. Pending that, I'm going to mark this as "Waiting on
> >> author" in the commitfest.
> >
> > Attached is an initial split which divides up reserving the role names
> > from actually adding the initial set of default roles.
>
> I have begun looking at this patch, and here are some comments after
> screening it.
>
> Patch needs a rebase, some catalog OIDs and there was a conflict in misc.c
> (see attached for the rebase. none of the comments mentioning issues are
> fixed by it).
>
> =# grant pg_replay to pg_backup  ;
> GRANT ROLE
> =# \du pg_backup
>            List of roles
>  Role name |  Attributes  |  Member of
> -----------+--------------+-------------
>  pg_backup | Cannot login | {pg_replay}
> Perhaps we should restrict granting a default role to another default role?
>
> +  
> +   Also note that changing the permissions on objects in the system
> +   catalogs, while possible, is unlikely to have the desired effect as
> +   the internal lookup functions use a cache and do not check the
> +   permissions nor policies of tables in the system catalog.  Further,
> +   permission changes to objects in the system catalogs are not
> +   preserved by pg_dump or across upgrades.
> +  
> This bit could be perhaps applied as a separate patch.
>
> +  
> +   pg_backup
> +   Start and stop backups, switch xlogs, and create restore
> points.
> +  
> s/switch xlogs/switch to a new transaction log file/
> It seems weird to not have a dedicated role for pg_switch_xlog.
>
> In func.sgml, the descriptions of pg_switch_xlog, pg_xlog_replay_pause and
> pg_xlog_replay_resume still mention that those functions are restricted
> only to superusers. The documentation needs an update. It would be good to
> double-check if there are similar mistakes for the other functions.
>
> +   pg_montior
> Typo, pg_monitor.
>
> +   Pause and resume xlog replay on replicas.
> Those descriptions should be complete sentences or not? Both styles are
> mixed in user-manag.sgml.
>
> +  
> +   pg_replication
> +   Create, destroy, and work with replication slots.
> +  
> pg_replication_slots would be more adapted?
>
> +  
> +   pg_file_settings
> +   Allowed to view config settings from all config
> files
> +  
> I would say "configuration files" instead. An abbreviation is not adapted.
>
> +   pg_admin
> +   
> +Granted pg_backup, pg_monitor, pg_reply, pg_replication,
> +pg_rotate_logfile, pg_signal_backend and pg_file_settings roles.
> +   
> Typo, s/pg_reply/pg_replay and I think that there should be 
> markups around those role names. I am actually not convinced that we
> actually need it.
>
> +   if (IsReservedName(stmt->role))
> +   ereport(ERROR,
> +   (errcode(ERRCODE_RESERVED_NAME),
> +errmsg("role name \"%s\" is reserved",
> +stmt->role),
> +errdetail("Role names starting with
> \"pg_\" are reserved.")));
> Perhaps this could be a separate change? It seems that restricting roles
> with pg_ on the cluster is not a bad restriction in any case. You may want
> to add regression tests to trigger those errors as well.
>
> Shouldn't pg_current_xlog_location, pg_current_xlog_insert_location,
> pg_last_xlog_receive_location and pg_last_xlog_replay_location fall under a
> restriction category like pg_monitor? I recall of course that we discussed
> that some months ago and that a lot of people were reluctant to harden this
> area to not break any existing monitoring tool, still this patch may be the
> occasion to restrict a bit those functions, particularly on servers where
> wal_compression is enabled.
>
> It would be nice to have regression tests as well to check that role
> restrictions are applied where they should.
>

Bonus:
-static void
-check_permissions(void)
-{
-   if (!superuser() && !has_rolreplication(GetUserId()))
-   ereport(ERROR,
-   (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-(errmsg("must be superuser or replication
role to use replication slots";
-}
I would have kept check_permissions() and done the checks on
has_rolreplication and has_privs_of_role directly in it, rather than
duplicating the same code three times.
-- 
Michael


Re: [HACKERS] Additional role attributes && superuser review

2015-11-18 Thread Michael Paquier
On Wed, Sep 30, 2015 at 8:11 PM, Stephen Frost  wrote:
> * Heikki Linnakangas (hlinn...@iki.fi) wrote:
>> I agree with Robert's earlier point that this needs to be split into
>> multiple patches, which can then be reviewed and discussed
>> separately. Pending that, I'm going to mark this as "Waiting on
>> author" in the commitfest.
>
> Attached is an initial split which divides up reserving the role names
> from actually adding the initial set of default roles.

I have begun looking at this patch, and here are some comments after
screening it.

Patch needs a rebase, some catalog OIDs and there was a conflict in misc.c
(see attached for the rebase. none of the comments mentioning issues are
fixed by it).

=# grant pg_replay to pg_backup  ;
GRANT ROLE
=# \du pg_backup
           List of roles
 Role name |  Attributes  |  Member of
-----------+--------------+-------------
 pg_backup | Cannot login | {pg_replay}
Perhaps we should restrict granting a default role to another default role?

+  
+   Also note that changing the permissions on objects in the system
+   catalogs, while possible, is unlikely to have the desired effect as
+   the internal lookup functions use a cache and do not check the
+   permissions nor policies of tables in the system catalog.  Further,
+   permission changes to objects in the system catalogs are not
+   preserved by pg_dump or across upgrades.
+  
This bit could be perhaps applied as a separate patch.

+  
+   pg_backup
+   Start and stop backups, switch xlogs, and create restore
points.
+  
s/switch xlogs/switch to a new transaction log file/
It seems weird to not have a dedicated role for pg_switch_xlog.

In func.sgml, the descriptions of pg_switch_xlog, pg_xlog_replay_pause and
pg_xlog_replay_resume still mention that those functions are restricted
only to superusers. The documentation needs an update. It would be good to
double-check if there are similar mistakes for the other functions.

+   pg_montior
Typo, pg_monitor.

+   Pause and resume xlog replay on replicas.
Those descriptions should be complete sentences or not? Both styles are
mixed in user-manag.sgml.

+  
+   pg_replication
+   Create, destroy, and work with replication slots.
+  
pg_replication_slots would be more adapted?

+  
+   pg_file_settings
+   Allowed to view config settings from all config files
+  
I would say "configuration files" instead. An abbreviation is not adapted.

+   pg_admin
+   
+Granted pg_backup, pg_monitor, pg_reply, pg_replication,
+pg_rotate_logfile, pg_signal_backend and pg_file_settings roles.
+   
Typo, s/pg_reply/pg_replay and I think that there should be 
markups around those role names. I am actually not convinced that we
actually need it.

+   if (IsReservedName(stmt->role))
+   ereport(ERROR,
+   (errcode(ERRCODE_RESERVED_NAME),
+errmsg("role name \"%s\" is reserved",
+stmt->role),
+errdetail("Role names starting with
\"pg_\" are reserved.")));
Perhaps this could be a separate change? It seems that restricting roles
with pg_ on the cluster is not a bad restriction in any case. You may want
to add regression tests to trigger those errors as well.
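For instance (a hypothetical sketch based on the ereport above, with an
illustrative role name), the reserved-prefix error could be exercised
with something as simple as:

CREATE ROLE pg_abc;
ERROR:  role name "pg_abc" is reserved
DETAIL:  Role names starting with "pg_" are reserved.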

Shouldn't pg_current_xlog_location, pg_current_xlog_insert_location,
pg_last_xlog_receive_location and pg_last_xlog_replay_location fall under a
restriction category like pg_monitor? I recall of course that we discussed
that some months ago and that a lot of people were reluctant to harden this
area to not break any existing monitoring tool, still this patch may be the
occasion to restrict a bit those functions, particularly on servers where
wal_compression is enabled.

It would be nice to have regression tests as well to check that role
restrictions are applied where they should.
Regards,
-- 
Michael
diff --git a/contrib/test_decoding/expected/permissions.out b/contrib/test_decoding/expected/permissions.out
index 212fd1d..79a7f86 100644
--- a/contrib/test_decoding/expected/permissions.out
+++ b/contrib/test_decoding/expected/permissions.out
@@ -54,13 +54,13 @@ RESET ROLE;
 -- plain user *can't* can control replication
 SET ROLE lr_normal;
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
-ERROR:  must be superuser or replication role to use replication slots
+ERROR:  must be superuser or member of pg_replication to use replication slots
 INSERT INTO lr_test VALUES('lr_superuser_init');
 ERROR:  permission denied for relation lr_test
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-ERROR:  must be superuser or replication role to use replication slots
+ERROR:  must be superuser or member of pg_replication to use replication slots
 SELECT 

Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-18 Thread Amit Kapila
On Sat, Nov 14, 2015 at 3:39 AM, Robert Haas  wrote:
>
> On Thu, Nov 12, 2015 at 12:09 AM, Kouhei Kaigai 
wrote:
> > I'm now designing the parallel feature of Append...
> >
> > Here is one challenge. How do we determine whether each sub-plan
> > allows execution in the background worker context?
>
> I've been thinking about these questions for a bit now, and I think we
> can work on improving Append in multiple phases.  The attached patch
> shows what I have in mind for phase 1.
>

Couple of comments and questions regarding this patch:

1.
+/*
+ * add_partial_path
..
+ *  produce the same number of rows.  Neither do we need to consider startup
+ *  costs: parallelism is only used for plans that will be run to completion.

A.
Don't we need the startup cost in case we need to build partial paths for
joinpaths like mergepath?
Also, I think there are other cases for a single relation scan where startup
cost can matter, like when there are pseudoconstants in the qualification
(refer cost_qual_eval_walker()) or, let us say, if someone has disabled
seq scan (disable_cost is considered as startup cost).

B. I think partial path is an important concept and deserves some
explanation in src/backend/optimizer/README.
There is already a good explanation about Paths, so it seems
better to add an explanation about partial paths as well.

2.
+ *  costs: parallelism is only used for plans that will be run to completion.
+ *Therefore, this routine is much simpler than add_path: it needs to
+ *consider only pathkeys and total cost.

There seems to be some spacing issue in last two lines.

3.
+static void
+create_parallel_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+ int parallel_threshold = 1000;
+ int parallel_degree = 1;
+
+ /*
+ * If this relation is too small to be worth a parallel scan, just return
+ * without doing anything ... unless it's an inheritance child.  In that case,
+ * we want to generate a parallel path here anyway.  It might not be worthwhile
+ * just for this relation, but when combined with all of its inheritance siblings
+ * it may well pay off.
+ */
+ if (rel->pages < parallel_threshold && rel->reloptkind == RELOPT_BASEREL)
+ return;

A.
This means that for inheritance child relations for which rel pages are
less than parallel_threshold, it will always consider the cost shared
between 1 worker and leader as per below calc in cost_seqscan:
if (path->parallel_degree > 0)
run_cost = run_cost / (path->parallel_degree + 0.5);

I think this might not be the appropriate cost model even for
non-inheritance relations which have more pages than parallel_threshold,
but it seems to be even worse for inheritance children which have
fewer pages than parallel_threshold.

B.
If none of the inheritance child rels (or very few of them) are big
enough for a parallel scan, is considering an Append node for
parallelism of any use?  In other words, won't it be better to
generate the plan as it is done now, without this patch, for such
cases, considering the current execution model of the Gather node?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Bug in numeric multiplication

2015-11-18 Thread Dean Rasheed
On 17 November 2015 at 23:57, Tom Lane  wrote:
> After thinking some more about what this is doing, it seems to me that
> we could avoid changing div[qi + 1] at all here, and instead deal with
> leftover dividend digits by shoving them into the floating-point part of
> the calculation.  All that we really need to have happen as a consequence
> of the above is to inject an additional "div[qi] * NBASE" term into future
> calculations of fdividend.  So we can do that out-of-band and avoid
> mucking up the maxdiv invariant.
>

That makes sense.

> Also, I realized that this bit:
>
> /*
>  * All the div[] digits except possibly div[qi] are now in the
>  * range 0..NBASE-1.
>  */
> maxdiv = Abs(newdig) / (NBASE - 1);
> maxdiv = Max(maxdiv, 1);
>
> is unduly conservative, because, while indeed div[qi] might be out of
> range, there is no need to consider it in maxdiv anymore, because the new
> maxdiv value only determines what will happen in future loop iterations
> where the current div[qi] will be irrelevant.  So we should just reset
> maxdiv to 1 unconditionally here.
>

OK, makes sense too.

> So that led me to the attached patch, which passes regression tests fine.

I'd feel better about that if the regression tests actually did
something that stressed this, but coming up with such a test is
non-trivial.

> I'm not sure that it's provably okay though.  The loose ends are to show
> that div[qi] can't overflow an int during the divisor-subtraction step
> and that "outercarry" remains bounded.   Testing suggests that outercarry
> can't exceed INT_MAX/NBASE, but I don't see how to prove that.
>

At the top of the loop we know that

  Abs(div[qi]) <= maxdiv * (NBASE-1)
               <= INT_MAX - INT_MAX/NBASE - 1

Then we add Abs(qdigit) to maxdiv which potentially triggers a
normalisation step. That step adds carry to div[qi], and we know that

  Abs(carry) <= INT_MAX/NBASE + 1

so the result is that Abs(div[qi]) <= INT_MAX -- i.e., it can't
overflow at that point, but it could be on the verge of doing so.

Then the divisor-subtraction step does

  div[qi] -= qdigit * var2digits[0]

That looks as though it could overflow, although actually I would
expect qdigit to have the same sign as div[qi], so that this would
drive div[qi] towards zero. If you didn't want to rely on that though,
you could take advantage of the fact that this new value of div[qi] is
only needed for outercarry, so you could modify the
divisor-subtraction step to skip div[qi]:

  for (i = 1; i < istop; i++)
div[qi + i] -= qdigit * var2digits[i];

and fold the most significant digit of the divisor-subtraction step
into the computation of outercarry so that there is definitely no
possibility of integer overflow:

  outercarry = outercarry * NBASE + (double) div[qi]
 - (double) qdigit * var2digits[0];

It's still difficult to prove that outercarry remains bounded, but to
a first approximation

  qdigit ~= ( outercarry * NBASE + (double) div[qi] ) / var2digits[0]

so outercarry ought to be driven towards zero, as the comment says. It
certainly doesn't look like it will grow without bounds, but I haven't
attempted to work out what its bounds actually are.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-18 Thread Thomas Munro
Hi,

Here is a new version of the patch with a few small improvements:

1.  Adopted the term '[read] lease', replacing various hand-wavy language
in the comments and code.  That seems to be the established term for this
approach[1].

2.  Reduced the stalling time on failure.  When things go wrong with a
standby (such as losing contact with it), instead of stalling for a
conservative amount of time longer than any lease that might have been
granted, the primary now stalls only until the expiry of the last lease
that actually was granted to a given dropped standby, which should be
sooner.

3.  Fixed a couple of bugs that showed up in testing and review (some bad
flow control in the signal handling, and a bug in a circular buffer), and
changed the recovery->walreceiver wakeup signal handling to block the
signal except while waiting in walrcv_receive (it didn't seem a good idea
to interrupt arbitrary syscalls in walreceiver so I thought that would be a
improvement; but of course that area's going to be reworked by Simon's
patch anyway, as discussed elsewhere).

Restating the central idea using the new terminology:  So long as they are
replaying fast enough, the primary grants a series of causal reads leases
to standbys allowing them to handle causal reads queries locally without
any inter-node communication for a limited time.  Leases are promises that
the primary will wait for the standby to apply commit records OR be dropped
from the set of available causal reads standbys and know that it has been
dropped, before the primary returns from commit, in order to uphold the
causal reads guarantee.  In the worst case it can do that by waiting for
the most recently granted lease to expire.

I've also attached a couple of things which might be useful when trying the
patch out: test-causal-reads.c which can be used to test performance and
causality under various conditions, and test-causal-reads.sh which can be
used to bring up a primary and a bunch of local hot standbys to talk to.
 (In the hope of encouraging people to take the patch for a spin...)

[1] Originally from a well known 1989 paper on caching, but in the context
of databases and synchronous replication see for example the recent papers
on "Niobe" and "Paxos Quorum Leases" (especially the reference to Google
Megastore).  Of course a *lot* more is going on in those very different
algorithms, but at some level "read leases" are being used to allow
local-node-only reads for a limited time while upholding some kind of
global consistency guarantee, in some of those consensus database systems.
I spent a bit of time talking about consistency levels to database guru and
former colleague Alex Scotti who works on a Paxos-based system, and he gave
me the initial idea to try out a lease-based consistency system for
Postgres streaming rep.  It seems like a very useful point in the space of
trade-offs to me.

-- 
Thomas Munro
http://www.enterprisedb.com


test-causal-reads.sh
Description: Bourne shell script
/*
 * A simple test program to test performance and visibility with the causal
 * reads patch.
 *
 * Each test loop updates a row on the primary, and then optionally checks if
 * it can see that change immediately on a standby.  If you do this with
 * standard async replication, you should occasionally see an assertion fail
 * if run with --check (depending on the vagaries of timing -- I can
 * reproduce this very reliably on my system).  If you do it with traditional
 * sync rep, it becomes a little bit less likely (but it's still reliably
 * reproducible on my system).  If you do it with traditional sync rep set up,
 * and "--synchronous-commit apply" then it should no longer be possible to
 * trigger that assertion, but that's just a straw-man mode.  If you do it
 * with --causal-reads then you should not be able to reproduce it, no matter
 * which standby you connect to.  If you're using --check and the standby gets
 * dropped (perhaps because you break/disconnect/pause it etc) you should
 * never see that assertion fail (= SELECT running but seeing stale data),
 * instead you should see an error when running the SELECT.
 *
 * Arguments:
 *
 *  --primary  how to connect to the primary
 *  --standby  how to connect to the standby to check
 *  --check   check that the update is visible on standby
 *  --causal-readsenable causal reads
 *  --synchronous-commit LEVELset synchronous_commit to LEVEL
 *  --loops COUNT how many loops to run through
 *  --verbose chatter
 */

#include 

#include 
#include 
#include 
#include 
#include 

int
main(int argc, char *argv[])
{
	PGconn *primary;
	PGconn *standby;
	PGresult *result;
	int i;
	int loops = 1;
	char buffer[1024];
	const char *synchronous_commit = "on";
	bool causal_reads = false;
	const char *primary_connstr = "dbname=postgres port=5432";
	const char *standby_connstr = "dbname=postgres port=5442";
	bool 

Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-11-18 Thread Masahiko Sawada
On Tue, Nov 17, 2015 at 7:52 PM, Kyotaro HORIGUCHI
 wrote:
> Oops.
>
> At Tue, 17 Nov 2015 19:40:10 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>  wrote in 
> <20151117.194010.17198448.horiguchi.kyot...@lab.ntt.co.jp>
>> Hello,
>>
>> At Tue, 17 Nov 2015 18:13:11 +0900, Masahiko Sawada  
>> wrote in 

[HACKERS] Feature or bug: getting "Inf"::timestamp[tz] by "regular" value

2015-11-18 Thread Vitaly Burovoy
Hackers!

I'm writing another patch for timestamps, and while trying to cover
corner cases I found out that it is possible to get
"Infinity"::timestamptz by specifying a particular finite (but not
"Infinity") value:

postgres=# SELECT '294277-01-09 04:00:54.775806+00'::timestamptz; -- OK
           timestamptz
----------------------------------
 294277-01-09 04:00:54.775806+00
(1 row)

postgres=# SELECT '294277-01-09 04:00:54.775807+00'::timestamptz; -- Inf???
 timestamptz
-------------
 infinity
(1 row)

postgres=# SELECT '294277-01-09 04:00:54.775808+00'::timestamptz; --
Higher values give an error
ERROR:  timestamp out of range: "294277-01-09 04:00:54.775808+00"
LINE 1: SELECT '294277-01-09 04:00:54.775808+00'::timestamptz;
               ^
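
(For reference, this is presumably because '294277-01-09 04:00:54.775807+00'
is exactly INT64_MAX microseconds after the 2000-01-01 epoch, i.e. the same
value timestamptz uses internally as its "infinity" sentinel; a quick check
is consistent with that:)

postgres=# SELECT '294277-01-09 04:00:54.775807+00'::timestamptz = 'infinity';
 ?column?
----------
 t
(1 row)
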
I could not find a way to get "-Infinity" in a similar way.
Is it a feature or a bug? Is it worth inserting a check for that
special case to raise a "timestamp out of range" error?
-- 
Best regards,
Vitaly Burovoy


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PostgreSQL super HA (High Availability) conception for 9.5+

2015-11-18 Thread Maeldron T.
Hello,

Foreword:

Unfortunately, I have no time to read the mailing lists and attend events
like PostgreSQL and NoSQL. Some of the ideas came from MongoDB and
Cassandra. The inspiration was the pg_rewind.

There is little new here, it’s a wish-list put together, considering what
could be possible in the foreseeable future. It’s likely that people worked
on a similar or a better concept. But let me try.


Reasons:

Downtime is bad. PostgreSQL failover requires manual intervention (client
configuration or host or DNS editing). Third party tools (in my experience)
don’t offer the same stability and quality as PostgreSQL itself. Also, this
concept wouldn’t work without pg_rewind.

Less software means less bugs.


Goals:

Providing close to 100% HA with minimal manual intervention. Minimizing
possible human errors during failover. Letting startup founders sleep well
at night. Automatic client configuration. Avoiding split-brain situations.


Extras:

Automatic streaming chain configuration.


No-goals:

Multi-master replication. Sharding. Proxying. Load balancing.


Why these:

It’s better to have a working technology now than a futuristic solution in
the future. For many applications, stability and HA are more important than
sharding or multi-master.


The concept:

You can set up a single-master PostgreSQL cluster with two or more nodes
that can failover several times without manual re-configuration. Restarting
the client isn’t needed if it’s smart enough to reconnect. Third party
software isn’t needed. Proxying isn’t needed.


Cases:


Running the cluster:

The cluster is running. There is one master. Every other nodes are
hot-standby slaves.

The client-driver accepts several hostname(:port) values in the connection
parameters. They must belong to the same cluster. (The cluster’s name might
be provided too).

The rest of the options (username, database name) are the same and are needed
only once. It’s not necessary to list every host. (Even listing one host
is enough, but not recommended.)

The client connects to one of the given hosts. If the node is running and
it’s a slave, it tells the client which host the master is. The client
connects to the master, even if the master was not listed in the connection
parameters.

It should be possible for the client to stay connected to the slave for
read-only queries if the application wants to do that.

If the node the client tried to connect to isn’t working, the client tries
another node, and so on.
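
To make the intended client behavior concrete, here is a minimal sketch of
how such host fallback could be approximated today with plain libpq. Using
pg_is_in_recovery() to detect a standby is only my assumption for
illustration; the proposal itself envisions the standby telling the client
where the master is.

/*
 * Sketch only: pick the primary from a list of candidate conninfo
 * strings.  pg_is_in_recovery() is used here as a stand-in for the
 * proposed "slave tells the client who the master is" protocol.
 */
#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

static PGconn *
connect_to_primary(const char *conninfos[], int n)
{
	int i;

	for (i = 0; i < n; i++)
	{
		PGconn	   *conn = PQconnectdb(conninfos[i]);

		if (PQstatus(conn) == CONNECTION_OK)
		{
			PGresult   *res = PQexec(conn, "SELECT pg_is_in_recovery()");

			if (PQresultStatus(res) == PGRES_TUPLES_OK &&
				strcmp(PQgetvalue(res, 0, 0), "f") == 0)
			{
				PQclear(res);
				return conn;	/* this one is the master */
			}
			PQclear(res);
		}
		PQfinish(conn);			/* standby or unreachable: try the next host */
	}
	return NULL;				/* no master found */
}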


Manually promoting a new master:

The administrator promotes any of the slaves. The slave tells the master to
gracefully stop. The master stops executing queries. It waits until the
slave (the new master) receives all the replication log. The new master is
promoted. The old master becomes a slave. (It might use pg_rewind).

The old master asks the connected clients to reconnect to the new master.
Then it drops the existing connections. It accepts new connections though
and tells them who the master is.


Manual step-down of the master:

The administrator kindly asks the master to stop being the master. The
cluster elects a new master. Then it’s the same as promoting a new master.


Manual shutdown of the master:

It’s the same as a step-down, but the master won’t run as a slave until it’s
started up again.


Automatic failover:

The master stops responding for a given period. The majority of the cluster
elects a new master. Then the process is the same as manual promotion.

When the old master starts up, the cluster tells it that it is not a master
anymore. It does pg_rewind and acts as a slave.

Automatic failover can happen again without human intervention. The clients
are reconnected to the new master each time.


Automatic failover without majority:

It’s possible to specify in the config which server may act as a master when
there is no majority for a vote.


Replication chain:

There are two cases. 1: All the slaves connect to the master. 2: One slave
connects to the master and the rest of the nodes replicate from this slave.


Configuration:

Every node should have a “recovery.conf” that is not renamed on promotion.

cluster_name: an identifier for the cluster. Why not.

hosts: list of the hosts. It is recommended, but not required, to include
every host in every file. It could work like the driver does, discovering
the rest of the cluster.

master_priority: integer. How likely this node becomes the new master on
failover (except manual promotion). A working cluster should not elect a
new master just because it has higher priority than the current one.
Election happens only for the described reasons above.

slave_priority: integer. If any running node has this value larger than 0,
the replication node is also elected, and the rest of the slaves replicate
from the elected slave. Otherwise, they replicate from the master.

primary_master: boolean. The node may run as master without being elected by the
majority. (This is not needed on manual promotion or shutdown. See
bookkeeping.)

safe: boolean. If this is 

[HACKERS] SPI and transactions

2015-11-18 Thread Konstantin Knizhnik

Hello,

SPI was originally developed for executing SQL statements from C 
user-defined functions in the context of an existing transaction.
This is why it is not possible to execute any transaction-control 
statement (BEGIN, COMMIT, PREPARE, ...) using 
SPI_execute: SPI_ERROR_TRANSACTION is returned.


But now SPI is used not only inside UDFs. It is also used in background 
workers, for example in receiver_raw, written by Michael Paquier (many 
thanks, Michael; understanding logical replication without these examples 
would be much more difficult).
Right now transactions have to be started by the background worker itself 
using StartTransactionCommand().
So receiver_raw is not able to preserve the master's transaction semantics 
(certainly that could be implemented).
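
For context, here is a minimal sketch (modeled on the usual worker_spi-style
boilerplate; connection setup, error handling and result processing omitted)
of how a background worker currently has to wrap SPI calls in a transaction
it manages itself:

/*
 * Sketch only: the background worker, not SPI, owns the transaction.
 * Trying to run BEGIN/COMMIT through SPI_execute() here would fail
 * with SPI_ERROR_TRANSACTION.
 */
#include "postgres.h"
#include "access/xact.h"
#include "executor/spi.h"
#include "utils/snapmgr.h"

static void
run_one_statement(const char *sql)
{
	StartTransactionCommand();
	SPI_connect();
	PushActiveSnapshot(GetTransactionSnapshot());

	SPI_execute(sql, false, 0);		/* return value checking omitted */

	PopActiveSnapshot();
	SPI_finish();
	CommitTransactionCommand();
}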


I wonder whether SPI can be extended now to support transaction-control 
statements when it is called outside a transaction context? Or is there 
some fundamental problem with that?


Thanks in advance,
Konstantin





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Add pg_audit, an auditing extension

2015-11-18 Thread Thom Brown
On 10 June 2015 at 14:41, Noah Misch  wrote:
> On Tue, Jun 09, 2015 at 03:54:59PM -0400, David Steele wrote:
>> I've certainly had quite the experience as a first-time contributor
>> working on this patch.  Perhaps I bit off more than I should have and I
>> definitely managed to ruffle a few feathers along the way.  I learned a
>> lot about how the community works, both the good and the bad.  Fear not,
>> though, I'm not so easily discouraged and you'll definitely be hearing
>> more from me.
>
> Glad to hear it.
>
>> The stated purpose of contrib is: "include porting tools, analysis
>> utilities, and plug-in features that are not part of the core PostgreSQL
>> system, mainly because they address a limited audience or are too
>> experimental to be part of the main source tree. This does not preclude
>> their usefulness."
>>
>> Perhaps we should consider modifying that language, because from my
>> perspective pg_audit fit the description perfectly.
>
> "What is contrib?" attracts enduring controversy; see recent thread "RFC:
> Remove contrib entirely" for the latest episode.  However that discussion
> concludes, that documentation passage is not too helpful as a guide to
> predicting contrib patch reception.  (Most recent contrib additions had an
> obvious analogy to an existing module, sidestepping the question.)

Is pg_audit being resubmitted for 9.6 at all?

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Feature or bug: getting "Inf"::timestamp[tz] by "regular" value

2015-11-18 Thread Tom Lane
Vitaly Burovoy  writes:
> I'm writing another patch for timestamps and while I tried to cover
> corner-cases I found out there is an ability to get
> "Infinity"::timestamptz via defining it by a specific (but not
> "Infinity") value:
> postgres=# SELECT '294277-01-09 04:00:54.775807+00'::timestamptz; -- Inf???
>  timestamptz
> -
>  infinity
> (1 row)

> I could not find a way to get "-Infinity" by similar way.
> Is it feature or a bug?

It's a bug.  Probably related to the fact that "Infinity" is represented
as INT64_MAX in the case of integer timestamps.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2015-11-18 Thread Greg Stark
On Wed, Nov 18, 2015 at 11:29 PM, Peter Geoghegan  wrote:
> Other systems expose this explicitly, and, as I said, say in an
> unqualified way that a multi-pass merge should be avoided. Maybe the
> warning isn't the right way of communicating that message to the DBA
> in detail, but I am confident that it ought to be communicated to the
> DBA fairly clearly.

I'm pretty convinced warnings from DML are a categorically bad idea.
In any OLTP load they're effectively fatal errors since they'll fill
up log files or client output or cause other havoc. Or they'll cause
no problem because nothing is reading them. Neither behaviour is
useful.

Perhaps the right thing to do is report a statistic to pg_stats so
DBAs can see how often sorts are in memory, how often they're on disk,
and how often the on disk sort requires n passes. That would put them
in the same category as "sequential scans" for DBAs who expect the
application to only run index-based OLTP queries, for example. The
problem with this is that sorts are not tied to a particular relation,
and without something to group on the stat will be pretty hard to act
on.


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY (INSERT/UPDATE/DELETE .. RETURNING ..)

2015-11-18 Thread Marko Tiikkaja

On 2015-11-16 08:24, Michael Paquier wrote:

On Sun, Nov 1, 2015 at 2:49 AM, Marko Tiikkaja  wrote:

Attached is a patch for being able to do $SUBJECT without a CTE.  The
reasons this is better than a CTE version are:

   1) It's not obvious why a CTE version works but a plain one doesn't
   2) This one has less overhead (I measured a ~12% improvement on a
not-too-unreasonable test case)


Would you mind sharing this test case as well as numbers?


I'll try and re-puzzle it together tomorrow.  It looks like I wasn't 
smart enough to actually save the test case anywhere.



Regression tests are broken when copyselect is run in parallel because
test3 is a table used there as well. A simple idea to fix this is to rename
the table test3 to something else or to use a schema dedicated to this
test, I just renamed it to copydml_test in the patch attached.


Seems reasonable.


-   | COPY select_with_parens TO opt_program
copy_file_name opt_with copy_options
+   | COPY '(' PreparableStmt ')' TO opt_program
copy_file_name opt_with copy_options
This does not look right, PreparableStmt is used for statements that are
part of PREPARE queries, any modifications happening there would impact
COPY with this implementation. I think that we had better have a new option
query_with_parens. Please note that the patch attached has not changed that.


This was discussed in 2010 when CTEs got the same treatment, and I agree 
with what was decided back then.  If someone needs to make 
PreparableStmt different from what COPY and CTEs support, we can split 
them up.  But they're completely identical after this patch, so 
splitting them up right now is somewhat pointless.


Further, if someone's going to add new stuff to PreparableStmt, she 
should probably think about whether it would make sense to add it to 
COPY and CTEs from day one, too, and in that case not splitting them up 
is actually a win.



.m


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2015-11-18 Thread Robert Haas
On Wed, Nov 18, 2015 at 6:29 PM, Peter Geoghegan  wrote:
> In principle, I have no problem with doing that. Through testing, I
> cannot see any actual upside, though. Perhaps I just missed something.
> Even 8MB is enough to avoid the multipass merge in the event of a
> surprisingly high volume of data (my work laptop is elsewhere, so I
> don't have my notes on this in front of me, but I figured out the
> crossover point for a couple of cases).

I'd be interested in seeing this analysis in some detail.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

2015-11-18 Thread Noah Misch
On Wed, Nov 18, 2015 at 01:21:45PM -0200, Alvaro Herrera wrote:
> In my days of Perl, it was starting to become frowned upon to call
> subroutines without parenthesizing arguments.  Is that no longer the
> case?

I've not witnessed those frowns.

> Because I notice there are many places in this patch and pre-
> existing that call psql with an argument list without parens.  And it's
> a bit odd because I couldn't find any other subroutine that we're using
> in that way.

TestLib.pm has unparenthesized calls to "standard_initdb", "start" and "run".
070_dropuser.pl has such calls to "start_test_server" and "psql".

> In 005_replay_delay there's a 2s delay configured; then we test whether
> something is replayed in 1s.  I hate tests that run for a long time, but
> is 2s good enough considering that some of our test animals in buildfarm
> are really slow?

That test will be unreliable, agreed.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Add numeric_trim(numeric)

2015-11-18 Thread Marko Tiikkaja

Hi,

Here's a patch for the second function suggested in 
5643125e.1030...@joh.to.  This is my first patch trying to do anything 
with numerics, so please be gentle.  I'm sure it's full of stupid.


January's commit fest, feedback welcome, yada yada..


.m
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***
*** 782,787 
--- 782,800 

 
  
+  numeric_trim
+ 
+ 
numeric_trim(numeric)
+
+numeric
+remove trailing decimal zeroes after the decimal point
+numeric_trim(8.4100)
+8.41
+   
+ 
+   
+
+ 
   pi
  
  pi()
*** a/src/backend/utils/adt/numeric.c
--- b/src/backend/utils/adt/numeric.c
***
*** 2825,2830  numeric_power(PG_FUNCTION_ARGS)
--- 2825,2904 
PG_RETURN_NUMERIC(res);
  }
  
+ /*
+  * numeric_trim() -
+  *
+  *Remove trailing decimal zeroes after the decimal point
+  */
+ Datum
+ numeric_trim(PG_FUNCTION_ARGS)
+ {
+   Numeric num = PG_GETARG_NUMERIC(0);
+   NumericVar  arg;
+   Numeric res;
+   int dscale;
+   int ndigits;
+ 
+   if (NUMERIC_IS_NAN(num))
+   PG_RETURN_NUMERIC(make_result(_nan));
+ 
+   init_var_from_num(num, );
+ 
+   ndigits = arg.ndigits;
+   /* for simplicity in the loop below, do full NBASE digits at a time */
+   dscale = ((arg.dscale + DEC_DIGITS - 1) / DEC_DIGITS) * DEC_DIGITS;
+   /* trim unstored significant trailing zeroes right away */
+   if (dscale > (ndigits - arg.weight - 1) * DEC_DIGITS)
+   dscale = (ndigits - arg.weight - 1) * DEC_DIGITS;
+   while (dscale > 0 && ndigits > 0)
+   {
+   NumericDigit dig = arg.digits[ndigits - 1];
+ 
+ #if DEC_DIGITS == 4
+   if ((dig % 10) != 0)
+   break;
+   if (--dscale == 0)
+   break;
+   if ((dig % 100) != 0)
+   break;
+   if (--dscale == 0)
+   break;
+   if ((dig % 1000) != 0)
+   break;
+   if (--dscale == 0)
+   break;
+ #elif DEC_DIGITS == 2
+   if ((dig % 10) != 0)
+   break;
+   if (--dscale == 0)
+   break;
+ #elif DEC_DIGITS == 1
+   /* nothing to do */
+ #else
+ #error unsupported NBASE
+ #endif
+   if (dig != 0)
+   break;
+   --dscale;
+   --ndigits;
+   }
+   arg.dscale = dscale;
+   arg.ndigits = ndigits;
+ 
+   /* If it's zero, normalize the sign and weight */
+   if (ndigits == 0)
+   {
+   arg.sign = NUMERIC_POS;
+   arg.weight = 0;
+   arg.dscale = 0;
+   }
+ 
+   res = make_result();
+   free_var();
+ 
+   PG_RETURN_NUMERIC(res);
+ }
+ 
  
  /* --
   *
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***
*** 2361,2366  DESCR("exponentiation");
--- 2361,2368 
  DATA(insert OID = 2169 ( powerPGNSP 
PGUID 12 1 0 0 0 f f f f t f i s 2 0 1700 "1700 1700" _null_ _null_ _null_ 
_null_ _null_  numeric_power _null_ _null_ _null_ ));
  DESCR("exponentiation");
  DATA(insert OID = 1739 ( numeric_powerPGNSP PGUID 12 
1 0 0 0 f f f f t f i s 2 0 1700 "1700 1700" _null_ _null_ _null_ _null_ _null_ 
 numeric_power _null_ _null_ _null_ ));
+ DATA(insert OID =  ( numeric_trim PGNSP PGUID 12 1 0 0 0 
f f f f t f i s 1 0 1700 "1700" _null_ _null_ _null_ _null_ _null_   
numeric_trim _null_ _null_ _null_ ));
+ DESCR("remove trailing decimal zeroes after the decimal point");
  DATA(insert OID = 1740 ( numeric  PGNSP PGUID 12 
1 0 0 0 f f f f t f i s 1 0 1700 "23" _null_ _null_ _null_ _null_ _null_ 
int4_numeric _null_ _null_ _null_ ));
  DESCR("convert int4 to numeric");
  DATA(insert OID = 1741 ( log  PGNSP PGUID 14 
1 0 0 0 f f f f t f i s 1 0 1700 "1700" _null_ _null_ _null_ _null_ _null_ 
"select pg_catalog.log(10, $1)" _null_ _null_ _null_ ));
*** a/src/include/utils/builtins.h
--- b/src/include/utils/builtins.h
***
*** 1022,1027  extern Datum numeric_exp(PG_FUNCTION_ARGS);
--- 1022,1028 
  extern Datum numeric_ln(PG_FUNCTION_ARGS);
  extern Datum numeric_log(PG_FUNCTION_ARGS);
  extern Datum numeric_power(PG_FUNCTION_ARGS);
+ extern Datum numeric_trim(PG_FUNCTION_ARGS);
  extern Datum int4_numeric(PG_FUNCTION_ARGS);
  extern Datum numeric_int4(PG_FUNCTION_ARGS);
  extern Datum int8_numeric(PG_FUNCTION_ARGS);
*** a/src/test/regress/expected/numeric.out
--- b/src/test/regress/expected/numeric.out
***
*** 1852,1854  select log(3.1954752e47, 

Re: [HACKERS] INSERT ... ON CONFLICT documentation clean-up patch

2015-11-18 Thread Andres Freund
On 2015-11-10 14:28:16 -0800, Peter Geoghegan wrote:
> Attached revision (rebase) of your modified version of my patch (the
> modification you provided privately) has the following notable
> changes:

We might want to do a bit of further word-smithing, but it does indeed seem
better to me this way. As it's hard to chat about incremental
improvements to a patch that moves this much content around, I've
committed this.

Thanks


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Getting sorted data from foreign server for merge join

2015-11-18 Thread Robert Haas
On Tue, Nov 17, 2015 at 8:33 AM, Ashutosh Bapat
 wrote:
>> Although I'm usually on the side of marking things as extern whenever
>> we find it convenient, I'm nervous about doing that to
>> make_canonical_pathkey(), because it has side effects.  Searching the
>> list of canonical pathkeys for the one we need is reasonable, but is
>> it really right to ever think that we might create a new one at this
>> stage?  Maybe it is, but if so I'd like to hear a good explanation as
>> to why.
>
> For a foreign table no pathkeys are created before creating paths for
> individual foreign table since there are no indexes. For a regular table,
> the pathkeys get created for useful indexes.

OK.

Could some portion of the second foreach loop inside
get_useful_pathkeys_for_relation be replaced by a call to
eclass_useful_for_merging?  The logic looks very similar.

More broadly, I'm wondering about the asymmetries between this code
and pathkeys_useful_for_merging().  The latter has a
right_merge_direction() test and a case for non-EC-derivable join
clauses that aren't present in your code.  I wonder if we could just
generate pathkeys speculatively and then test which ones survive
truncate_useless_pathkeys(), or something like that.  This isn't an
enormously complicated patch, but it duplicates more of the logic in
pathkeys.c than I'd like.

I'm inclined to think that we shouldn't consider pathkeys that might
be useful for merge-joining unless we're using remote estimates.  It
seems more speculative than pushing down a toplevel sort.

This patch seems rather light on test cases.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Add scale(numeric)

2015-11-18 Thread Marko Tiikkaja

Hi,

As suggested in 5643125e.1030...@joh.to, here's a patch for extracting 
the scale out of a numeric.


This is 2016-01 CF material, but if someone wants to bas^H^H^Hsay nice 
things in the meanwhile, feel free.



.m
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***
*** 852,857 
--- 852,870 

 
  
+  scale
+ 
+ scale(numeric)
+
+numeric
+scale of the argument (the number of decimal digits in the 
fractional part)
+scale(8.41)
+2
+   
+ 
+   
+
+ 
   sign
  
  sign(dp or 
numeric)
*** a/src/backend/utils/adt/numeric.c
--- b/src/backend/utils/adt/numeric.c
***
*** 2825,2830  numeric_power(PG_FUNCTION_ARGS)
--- 2825,2847 
PG_RETURN_NUMERIC(res);
  }
  
+ /*
+  * numeric_scale() -
+  *
+  *Returns the scale, i.e. the count of decimal digits in the fractional 
part
+  */
+ Datum
+ numeric_scale(PG_FUNCTION_ARGS)
+ {
+   Numeric num = PG_GETARG_NUMERIC(0);
+ 
+   if (NUMERIC_IS_NAN(num))
+   PG_RETURN_NULL();
+ 
+   PG_RETURN_INT32(NUMERIC_DSCALE(num));
+ }
+ 
+ 
  
  /* --
   *
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***
*** 2361,2366  DESCR("exponentiation");
--- 2361,2368 
  DATA(insert OID = 2169 ( powerPGNSP 
PGUID 12 1 0 0 0 f f f f t f i s 2 0 1700 "1700 1700" _null_ _null_ _null_ 
_null_ _null_  numeric_power _null_ _null_ _null_ ));
  DESCR("exponentiation");
  DATA(insert OID = 1739 ( numeric_powerPGNSP PGUID 12 
1 0 0 0 f f f f t f i s 2 0 1700 "1700 1700" _null_ _null_ _null_ _null_ _null_ 
 numeric_power _null_ _null_ _null_ ));
+ DATA(insert OID =  ( scale  PGNSP PGUID 12 1 0 0 0 f f f f t f i 
s 1 0 23 "1700" _null_ _null_ _null_ _null_ _null_   numeric_scale _null_ 
_null_ _null_ ));
+ DESCR("returns the number of decimal digits in the fractional part");
  DATA(insert OID = 1740 ( numeric  PGNSP PGUID 12 
1 0 0 0 f f f f t f i s 1 0 1700 "23" _null_ _null_ _null_ _null_ _null_ 
int4_numeric _null_ _null_ _null_ ));
  DESCR("convert int4 to numeric");
  DATA(insert OID = 1741 ( log  PGNSP PGUID 14 
1 0 0 0 f f f f t f i s 1 0 1700 "1700" _null_ _null_ _null_ _null_ _null_ 
"select pg_catalog.log(10, $1)" _null_ _null_ _null_ ));
*** a/src/include/utils/builtins.h
--- b/src/include/utils/builtins.h
***
*** 1022,1027  extern Datum numeric_exp(PG_FUNCTION_ARGS);
--- 1022,1028 
  extern Datum numeric_ln(PG_FUNCTION_ARGS);
  extern Datum numeric_log(PG_FUNCTION_ARGS);
  extern Datum numeric_power(PG_FUNCTION_ARGS);
+ extern Datum numeric_scale(PG_FUNCTION_ARGS);
  extern Datum int4_numeric(PG_FUNCTION_ARGS);
  extern Datum numeric_int4(PG_FUNCTION_ARGS);
  extern Datum int8_numeric(PG_FUNCTION_ARGS);
*** a/src/test/regress/expected/numeric.out
--- b/src/test/regress/expected/numeric.out
***
*** 1852,1854  select log(3.1954752e47, 9.4792021e-73);
--- 1852,1911 
   
-1.51613372350688302142917386143459361608600157692779164475351842333265418126982165
  (1 row)
  
+ --
+ -- Tests for scale()
+ --
+ select scale(numeric 'NaN');
+  scale 
+ ---
+   
+ (1 row)
+ 
+ select scale(NULL::numeric);
+  scale 
+ ---
+   
+ (1 row)
+ 
+ select scale(1.12);
+  scale 
+ ---
+  2
+ (1 row)
+ 
+ select scale(0);
+  scale 
+ ---
+  0
+ (1 row)
+ 
+ select scale(0.00);
+  scale 
+ ---
+  2
+ (1 row)
+ 
+ select scale(1.12345);
+  scale 
+ ---
+  5
+ (1 row)
+ 
+ select scale(110123.12475871856128);
+  scale 
+ ---
+ 14
+ (1 row)
+ 
+ select scale(-1123.12471856128);
+  scale 
+ ---
+ 11
+ (1 row)
+ 
+ select scale(-13.000);
+  scale 
+ ---
+ 15
+ (1 row)
+ 
*** a/src/test/regress/sql/numeric.sql
--- b/src/test/regress/sql/numeric.sql
***
*** 983,985  select log(1.23e-89, 6.4689e45);
--- 983,999 
  select log(0.99923, 4.58934e34);
  select log(1.16, 8.452010e18);
  select log(3.1954752e47, 9.4792021e-73);
+ 
+ --
+ -- Tests for scale()
+ --
+ 
+ select scale(numeric 'NaN');
+ select scale(NULL::numeric);
+ select scale(1.12);
+ select scale(0);
+ select scale(0.00);
+ select scale(1.12345);
+ select scale(110123.12475871856128);
+ select scale(-1123.12471856128);
+ select scale(-13.000);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Trivial heap_finish_speculative() error message inaccuracy

2015-11-18 Thread Andres Freund
On 2015-11-18 23:50:30 +0100, Andres Freund wrote:
> Let's rather rip those function names out. Unless somebody protests I'm
> going to do so in 9.5/master.

Done.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] warning: HS_KEY redefined (9.5 beta2)

2015-11-18 Thread Mike Blackwell
Google says this was present in beta1. (
http://www.postgresql.org/message-id/5596a162.30...@dunslane.net)

Still seems broken, at least for me.

Built with Perl 5.22.
uname -m = x86_64
uname -r = 2.6.32-504.12.2.el6.x86_64

__
*Mike Blackwell | Technical Analyst, Distribution Services/Rollout
Management | RR Donnelley*
1750 Wallace Ave | St Charles, IL 60174-3401
Office: 630.313.7818
mike.blackw...@rrd.com
http://www.rrdonnelley.com




Re: [HACKERS] Additional role attributes && superuser review

2015-11-18 Thread Stephen Frost
Michael,

Thanks for the review!

* Michael Paquier (michael.paqu...@gmail.com) wrote:
> Patch needs a rebase, some catalog OIDs and there was a conflict in misc.c
> (see attached for the rebase. none of the comments mentioning issues are
> fixed by it).

Done (did it a bit differently from what you had, to hopefully avoid
future OID conflicts and also to allow us a bit of room to add
additional default roles later, if we choose, in nearby OID space).

> =# grant pg_replay to pg_backup  ;
> GRANT ROLE
> =# \du pg_backup
>  List of roles
>  Role name |  Attributes  |  Member of
> ---+--+-
>  pg_backup | Cannot login | {pg_replay}
> Perhaps we should restrict granting a default role to another default role?

Done (in the second patch in the series, the one reserving 'pg_').

> +  
> +   Also note that changing the permissions on objects in the system
> +   catalogs, while possible, is unlikely to have the desired effect as
> +   the internal lookup functions use a cache and do not check the
> +   permissions nor policies of tables in the system catalog.  Further,
> +   permission changes to objects in the system catalogs are not
> +   preserved by pg_dump or across upgrades.
> +  
> This bit could be perhaps applied as a separate patch.

Done as a separate patch (first one in the series).

> +  
> +   pg_backup
> +   Start and stop backups, switch xlogs, and create restore
> points.
> +  
> s/switch xlogs/switch to a new transaction log file/

Fixed.

> It seems weird to not have a dedicated role for pg_switch_xlog.

I didn't add a pg_switch_xlog default role in this patch series, but
would be happy to do so if that's the consensus.  It's quite easy to do.

> In func.sgml, the descriptions of pg_switch_xlog, pg_xlog_replay_pause and
> pg_xlog_replay_resume still mention that those functions are restricted
> only to superusers. The documentation needs an update. It would be good to
> double-check if there are similar mistakes for the other functions.

Fixed.  Also did a review and found a number of other places which
required updating, so those have also been fixed.

> +   pg_montior
> Typo, pg_monitor.

Fixed.

> +   Pause and resume xlog replay on replicas.
> Those descriptions should be complete sentences or not? Both styles are
> mixed in user-manag.sgml.

I didn't take any action on this.

> +  
> +   pg_replication
> +   Create, destroy, and work with replication slots.
> +  
> pg_replication_slots would be more adapted?

Perhaps...  I didn't make this change, but if others agree that adding
"_slots" would help, I'd be happy to do so.  Personally, I'd like to
keep these names shorter, if possible, but I don't want it to be
confusing either.

> +  
> +   pg_file_settings
> +   Allowed to view config settings from all config files
> +  
> I would say "configuration files" instead. An abbreviation is not adapted.

Done.

> +   pg_admin
> +   
> +Granted pg_backup, pg_monitor, pg_reply, pg_replication,
> +pg_rotate_logfile, pg_signal_backend and pg_file_settings roles.
> +   
> Typo, s/pg_reply/pg_replay and I think that there should be 
> markups around those role names. I am actually not convinced that we
> actually need it.

I added  markups around the role names, where used.

I'm guessing you were referring to pg_admin with your comment that you
were "not convinced that we actually need it."  After thinking about
it for a bit, I tend to agree.  A user could certainly create their own
role which combines these all together if they wanted to (or do any
other combinations, based on their particular needs).  I've gone ahead
and removed pg_admin from the set of default roles.

> +   if (IsReservedName(stmt->role))
> +   ereport(ERROR,
> +   (errcode(ERRCODE_RESERVED_NAME),
> +errmsg("role name \"%s\" is reserved",
> +stmt->role),
> +errdetail("Role names starting with
> \"pg_\" are reserved.")));
> Perhaps this could be a separate change? It seems that restricting roles
> with pg_ on the cluster is not a bad restriction in any case. You may want
> to add regression tests to trigger those errors as well.

I'm a bit confused; this is a separate change.  Note that the previous
patch was actually a git patchset which had two patches: one to do the
reservation and a second to add the default roles.  The attached
patchset is actually three patches:

1- Update to catalog documentation regarding permissions
2- Reserve 'pg_' role namespace
3- Add default roles

> Shouldn't pg_current_xlog_location, pg_current_xlog_insert_location,
> pg_last_xlog_receive_location and pg_last_xlog_replay_location fall under a
> restriction category like pg_monitor? I recall of course that we discussed
> that some months ago and that a lot of people were reluctant 

Re: [HACKERS] Trivial heap_finish_speculative() error message inaccuracy

2015-11-18 Thread Andres Freund
On 2015-11-03 19:14:44 -0800, Peter Geoghegan wrote:
> On Tue, Nov 3, 2015 at 7:10 PM, Tom Lane  wrote:
> > This seems like a fine teaching moment in which to point out our
> > longstanding error message style guideline that says not to put
> > names of C functions into error messages in the first place.
> 
> I don't ordinarily do that, of course, but I thought it was important
> to be consistent with other such elog() calls within heapam.c.

Internal, super unlikely, elog()s don't seem to need much consistency
among themselves.

> I think that there at least 10 that look like this.

Let's rather rip those function names out. Unless somebody protests I'm
going to do so in 9.5/master.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in numeric multiplication

2015-11-18 Thread Tom Lane
I wrote:
> I'm kind of stuck on that too.  I did some experimentation by tracking
> maximum values of outercarry in the regression tests (including
> numeric_big) and did not see it get larger than a couple hundred thousand,
> ie more or less INT_MAX/NBASE.  But I don't have an argument as to why
> that would be the limit.

A bit of progress on this: I've found a test case, namely

select 
sqrt(.00);

If one inserts no-overflow Asserts into div_var_fast, this case will trip
them, specifically this:

/*
 * The dividend digit we are about to replace might still be nonzero.
 * Fold it into the next digit position.  We don't need to worry about
 * overflow here since this should nearly cancel with the subtraction
 * of the divisor.
 */
+   Assert(Abs(div[qi]) <= INT_MAX/NBASE);
div[qi + 1] += div[qi] * NBASE;

Unfortunately, there isn't any SQL-visible misbehavior in this example,
because the loop in sqrt_var is more or less self-correcting for minor
errors (and this example produces bogus results only in very low-order
digits).  Most of the other calls of div_var_fast give it inputs that are
even harder to control than sqrt_var's, so finding something that did
produce visibly wrong results might take a whole lot of trial and error.

Still, this proves that we are onto a real problem.

> Another issue here is that with outercarry added into the qdigit
> computation, it's no longer clear what the bounds of qdigit itself are,

I concluded that that particular issue is a red herring: qdigit should
always be a fairly accurate estimate of the next output digit, so it
cannot fall very far outside the range 0..NBASE-1.  Testing confirms that
the range seen in the regression tests is exactly -1 to 1, which is
what you'd expect since there can be some roundoff error.

Also, after further thought I've been able to construct an argument why
outercarry stays bounded.  See what you think of the comments in the
attached patch, which incorporates your ideas about postponing the div[qi]
calculation.

regards, tom lane

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index dcdc5cf..2c6a3f9 100644
*** a/src/backend/utils/adt/numeric.c
--- b/src/backend/utils/adt/numeric.c
*** div_var_fast(NumericVar *var1, NumericVa
*** 6186,6192 
  	double		fdividend,
  fdivisor,
  fdivisorinverse,
! fquotient;
  	int			qi;
  	int			i;
  
--- 6186,6193 
  	double		fdividend,
  fdivisor,
  fdivisorinverse,
! fquotient,
! outercarry;
  	int			qi;
  	int			i;
  
*** div_var_fast(NumericVar *var1, NumericVa
*** 6274,6289 
  	 * To avoid overflow in maxdiv itself, it represents the max absolute
  	 * value divided by NBASE-1, ie, at the top of the loop it is known that
  	 * no div[] entry has an absolute value exceeding maxdiv * (NBASE-1).
  	 */
  	maxdiv = 1;
  
  	/*
! 	 * Outer loop computes next quotient digit, which will go into div[qi]
  	 */
  	for (qi = 0; qi < div_ndigits; qi++)
  	{
  		/* Approximate the current dividend value */
! 		fdividend = (double) div[qi];
  		for (i = 1; i < 4; i++)
  		{
  			fdividend *= NBASE;
--- 6275,6297 
  	 * To avoid overflow in maxdiv itself, it represents the max absolute
  	 * value divided by NBASE-1, ie, at the top of the loop it is known that
  	 * no div[] entry has an absolute value exceeding maxdiv * (NBASE-1).
+ 	 *
+ 	 * Note that maxdiv only includes digit positions that are still part of
+ 	 * the dividend, ie, strictly speaking the above holds only for div[i]
+ 	 * where i >= qi, at the top of the loop.
  	 */
  	maxdiv = 1;
  
  	/*
! 	 * Outer loop computes next quotient digit, which will go into div[qi].
! 	 * "outercarry" represents high-order dividend digits carried across
! 	 * iterations.
  	 */
+ 	outercarry = 0;
  	for (qi = 0; qi < div_ndigits; qi++)
  	{
  		/* Approximate the current dividend value */
! 		fdividend = outercarry * NBASE + (double) div[qi];
  		for (i = 1; i < 4; i++)
  		{
  			fdividend *= NBASE;
*** 

Re: [HACKERS] Parallel Seq Scan

2015-11-18 Thread Amit Kapila
On Tue, Nov 17, 2015 at 1:21 AM, Bert  wrote:

> Hey,
>
> I've just pulled and compiled the new code.
> I'm running a TPC-DS like test on different PostgreSQL installations, but
> running (max) 12queries in parallel on a server with 12cores.
> I've configured max_parallel_degree to 2, and I get messages that backend
> processes crash.
> I am running the same test now with 6queries in parallel, and parallel
> degree to 2, and they seem to work. for now. :)
>
> This is the output I get in /var/log/messages
> Nov 16 20:40:05 woludwha02 kernel: postgres[22918]: segfault at
> 7fa3437bf104 ip 00490b56 sp 7ffdf2f083a0 error 6 in
> postgres[40+5b5000]
>
>
Thanks for reporting the issue.

I think what's going on here is that when one of the sessions doesn't
get any workers, we shut down the Gather node, which internally destroys
the dynamic shared memory segment as well.  However, the same segment is
needed, per the current design, for the scan done by the master backend
as well.  So I think the fix would be to shut down only the workers, which
actually won't do anything in this scenario.  I have tried to reproduce
this issue with a simpler test case as below:

Create two tables with large data:
CREATE TABLE t1(c1, c2) AS SELECT g, repeat('x', 5) FROM
generate_series(1, 1000) g;

CREATE TABLE t2(c1, c2) AS SELECT g, repeat('x', 5) FROM
generate_series(1, 100) g;

Set max_worker_processes = 2 in postgresql.conf

Session-1
set max_parallel_degree=4;
set parallel_tuple_cost=0;
set parallel_setup_cost=0;
Explain analyze select count(*) from t1 where c1 > 1;

Session-2
set max_parallel_degree=4;
set parallel_tuple_cost=0;
set parallel_setup_cost=0;
Explain analyze select count(*) from t2 where c1 > 1;

The trick to reproduce is that the Explain statement in Session-2
needs to be executed immediately after Explain statement in
Session-1.

Attached patch fixes the issue for me.

I think we could also go for a somewhat more invasive fix here: if the
statement didn't find any workers, then reset the DSM and also reset the
execution tree (which in the case of a seq scan means clearing the
parallel scan descriptor and maybe a few more fields in the scan
descriptor) such that it performs a plain seq scan.  I am not sure how
future-proof such a change would be, because resetting some of the fields
in the execution tree and expecting it to work in all cases might not be
feasible for all nodes.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


fix_early_dsm_destroy_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: LISTEN *

2015-11-18 Thread Pavel Stehule
Hi

2015-11-19 4:40 GMT+01:00 Marko Tiikkaja :

> Hi,
>
> I've in the past wanted to listen on all notification channels in the
> current database for debugging purposes.  But recently I came across
> another use case.  Since having multiple postgres backends listening for
> notifications is very inefficient (one Thursday I found out 30% of our CPU
> time was spent spinning on s_locks around the notification code), it makes
> sense to implement a notification server of sorts which then passes on
> notifications from postgres to interested clients.  A server like this
> would be a lot more simple to implement if there was a way to announce that
> the client wants to see all notifications, regardless of the name of the
> channel.
>
> Attached is a very crude proof-of-concept patch implementing this.  Any
> thoughts on the idea?
>

This looks much more like a workaround. The proposed feature isn't bad, but
optimizing the current implementation would be better.

Isn't it possible to fix the internal implementation?

Pavel


>
>
> .m
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-18 Thread Etsuro Fujita

On 2015/11/19 12:34, Robert Haas wrote:

On Tue, Nov 17, 2015 at 9:30 PM, Etsuro Fujita
 wrote:

I suppose you (and KaiGai-san) are probably right, but I really fail to see
it actually doing that.



Noted, but let's do it that way and move on.  It would be a shame if
we didn't end up with a working FDW join pushdown system in 9.6
because of a disagreement on this point.


Another idea would be to consider join pushdown as unsupported for now 
when select-for-update is involved in 9.5, as described in [1], and 
revisit this issue when adding join pushdown to postgres_fdw in 9.6.


Best regards,
Etsuro Fujita

[1] https://wiki.postgresql.org/wiki/Open_Items



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2015-11-18 Thread Peter Geoghegan
Hi Jeff,

On Wed, Nov 18, 2015 at 10:31 AM, Jeff Janes  wrote:
> tuplesort.c: In function 'mergeruns':
> tuplesort.c:2741: warning: unused variable 'memNowUsed'

That was caused by a last-minute change to the mulitpass warning
message. I forgot to build at -O2, and missed this.

>> I believe, in general, that we should consider a multi-pass sort to be
>> a kind of inherently suspect thing these days, in the same way that
>> checkpoints occurring 5 seconds apart are: not actually abnormal, but
>> something that we should regard suspiciously. Can you really not
>> afford enough work_mem to only do one pass?
>
> I don't think it is really about the cost of RAM.  What people can't
> afford is spending all of their time personally supervising all the
> sorts on the system. It is pretty easy for a transient excursion in
> workload to make a server swap itself to death and fall over. Not just
> the PostgreSQL server, but the entire OS. Since we can't let that
> happen, we have to be defensive about work_mem. Yes, we have far more
> RAM than we used to. We also have far more things demanding access to
> it at the same time.

I agree with you, but I'm not sure that I've been completely clear on
what I mean. Even as the demand on memory has grown, the competitive
advantage of replacement selection in avoiding a multi-pass merge has
diminished far faster. You should simply not allow it to happen as a
DBA -- that's the advice that other systems' documentation gives.
Avoiding a multi-pass merge was always the appeal of replacement
selection, even in the 1970s, but it will rarely if ever make that
critical difference these days.

As I said, as the volume of data to be sorted in memory increases
linearly, the point at which a multi-pass merge phase happens
increases quadratically with my patch. The advantage of replacement
selection is therefore almost irrelevant. That is why, in general,
interest in replacement selection is far far lower today than it was
in the past.

The poor CPU cache characteristics of the heap (priority queue) are
only half the story about why replacement selection is more or less
obsolete these days.

> I agree we don't want to optimize for low memory, but I don't think we
> should throw it under the bus, either.  Right now we are effectively
> saying the CPU-cache problems with the heap start exceeding the larger
> run size benefits at 64kb (the smallest allowed setting for work_mem).
> While any number we pick is going to be a guess that won't apply to
> all hardware, surely we can come up with a guess better than 64kb.
> Like, 8 MB, say.  If available memory for the sort is 8MB or smaller
> and the predicted size anticipates a multipass merge, then we can use
> the heap method rather than the quicksort method.  Would a rule like
> that complicate things much?

I'm already using replacement selection for the first run when it is
predicted by my new ad-hoc cost model that we can get away with a
"quicksort with spillover", avoiding almost all I/O. We only
incrementally spill as many tuples as needed right now, but it would
be pretty easy to not quicksort the remaining tuples, but continue to
incrementally spill everything. So no, it wouldn't be too hard to hang
on to the old behavior sometimes, if it looked worthwhile.

In principle, I have no problem with doing that. Through testing, I
cannot see any actual upside, though. Perhaps I just missed something.
Even 8MB is enough to avoid the multipass merge in the event of a
surprisingly high volume of data (my work laptop is elsewhere, so I
don't have my notes on this in front of me, but I figured out the
crossover point for a couple of cases).

>> In theory, the answer could be "yes", but it seems highly unlikely.
>> Not only is very little memory required to avoid a multi-pass merge
>> step, but as described above the amount required grows very slowly
>> relative to linear growth in input. I propose to add a
>> checkpoint_warning style warning (with a checkpoint_warning style GUC
>> to control it).
>
> I'm skeptical about a warning for this.

Other systems expose this explicitly, and, as I said, say in an
unqualified way that a multi-pass merge should be avoided. Maybe the
warning isn't the right way of communicating that message to the DBA
in detail, but I am confident that it ought to be communicated to the
DBA fairly clearly.

> One idea would be to stop and write out a just-sorted partition
> whenever that partition is contiguous to the already-written portion.
> If the qsort is tweaked to recurse preferentially into the left
> partition first, this would result in tuples being written out at a
> pretty study pace.  If the qsort was unbalanced and the left partition
> was always the larger of the two, then that approach would have to be
> abandoned at some point.  But I think there are already defenses
> against that, and at worst you would give up and revert to the
> sort-them-all then write-them-all behavior.

Seems 

[HACKERS] proposal: LISTEN *

2015-11-18 Thread Marko Tiikkaja

Hi,

I've in the past wanted to listen on all notification channels in the 
current database for debugging purposes.  But recently I came across 
another use case.  Since having multiple postgres backends listening for 
notifications is very inefficient (one Thursday I found out 30% of our 
CPU time was spent spinning on s_locks around the notification code), it 
makes sense to implement a notification server of sorts which then 
passes on notifications from postgres to interested clients.  A server 
like this would be a lot simpler to implement if there were a way to 
announce that the client wants to see all notifications, regardless of 
the name of the channel.


Attached is a very crude proof-of-concept patch implementing this.  Any 
thoughts on the idea?



.m
*** a/src/backend/commands/async.c
--- b/src/backend/commands/async.c
***
*** 294,299  static SlruCtlData AsyncCtlData;
--- 294,304 
  static List *listenChannels = NIL;/* list of C strings */
  
  /*
+  * If listenWildcard is set, we're listening on all channels.
+  */
+ static bool listenWildcard = false;
+ 
+ /*
   * State for pending LISTEN/UNLISTEN actions consists of an ordered list of
   * all actions requested in the current transaction.  As explained above,
   * we don't actually change listenChannels until we reach transaction commit.
***
*** 306,311  static List *listenChannels = NIL; /* list of C 
strings */
--- 311,317 
  typedef enum
  {
LISTEN_LISTEN,
+   LISTEN_LISTEN_ALL,
LISTEN_UNLISTEN,
LISTEN_UNLISTEN_ALL
  } ListenActionKind;
***
*** 373,378  static void queue_listen(ListenActionKind action, const char 
*channel);
--- 379,385 
  static void Async_UnlistenOnExit(int code, Datum arg);
  static void Exec_ListenPreCommit(void);
  static void Exec_ListenCommit(const char *channel);
+ static void Exec_ListenAllCommit(void);
  static void Exec_UnlistenCommit(const char *channel);
  static void Exec_UnlistenAllCommit(void);
  static bool IsListeningOn(const char *channel);
***
*** 598,604  Async_Notify(const char *channel, const char *payload)
  
  /*
   * queue_listen
!  *Common code for listen, unlisten, unlisten all commands.
   *
   *Adds the request to the list of pending actions.
   *Actual update of the listenChannels list happens during 
transaction
--- 605,611 
  
  /*
   * queue_listen
!  *Common code for listen, listen all, unlisten, unlisten all 
commands.
   *
   *Adds the request to the list of pending actions.
   *Actual update of the listenChannels list happens during 
transaction
***
*** 613,620  queue_listen(ListenActionKind action, const char *channel)
/*
 * Unlike Async_Notify, we don't try to collapse out duplicates. It 
would
 * be too complicated to ensure we get the right interactions of
!* conflicting LISTEN/UNLISTEN/UNLISTEN_ALL, and it's unlikely that 
there
!* would be any performance benefit anyway in sane applications.
 */
oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
--- 620,627 
/*
 * Unlike Async_Notify, we don't try to collapse out duplicates. It 
would
 * be too complicated to ensure we get the right interactions of
!* conflicting LISTEN/LISTEN_ALL/UNLISTEN/UNLISTEN_ALL, and it's 
unlikely
!* that there would be any performance benefit anyway in sane 
applications.
 */
oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
***
*** 644,649  Async_Listen(const char *channel)
--- 651,671 
  }
  
  /*
+  * Async_ListenAll
+  *
+  *This is executed by the LISTEN * command.
+  */
+ void
+ Async_ListenAll(void)
+ {
+   if (Trace_notify)
+   elog(DEBUG1, "Async_ListenAll(%d)", MyProcPid);
+ 
+   queue_listen(LISTEN_LISTEN_ALL, "");
+ }
+ 
+ 
+ /*
   * Async_Unlisten
   *
   *This is executed by the SQL unlisten command.
***
*** 790,795  PreCommit_Notify(void)
--- 812,818 
switch (actrec->action)
{
case LISTEN_LISTEN:
+   case LISTEN_LISTEN_ALL:
Exec_ListenPreCommit();
break;
case LISTEN_UNLISTEN:
***
*** 895,900  AtCommit_Notify(void)
--- 918,926 
case LISTEN_LISTEN:
Exec_ListenCommit(actrec->channel);
break;
+   case LISTEN_LISTEN_ALL:
+   Exec_ListenAllCommit();
+   break;
case LISTEN_UNLISTEN:
Exec_UnlistenCommit(actrec->channel);

[HACKERS] Re: [COMMITTERS] pgsql: Improve vcregress.pl's handling of tap tests for client programs

2015-11-18 Thread Michael Paquier
On Thu, Nov 19, 2015 at 12:51 PM, Andrew Dunstan 
wrote:

> Improve vcregress.pl's handling of tap tests for client programs
>
> The target is now named 'bincheck' rather than 'tapcheck' so that it
> reflects what is checked instead of the test mechanism. Some of the
> logic is improved, making it easier to add further sets of TAP based
> tests in future. Also, the environment setting logic is improved.
>
> As discussed on -hackers a couple of months ago.
>

Seems like you forgot to update doc/src/sgml/install-windows.sgml, this one
needs a refresh from tapcheck to bincheck. A backpatch to 9.5 and 9.4 would
also be nice.
Regards,
-- 
Michael


Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

2015-11-18 Thread Michael Paquier
On Thu, Nov 19, 2015 at 12:21 AM, Alvaro Herrera
 wrote:
>
> Hi, I just started looking this over a bit.  The first thing I noticed
> is that it adds a dependency on Archive::Tar which isn't already used
> anywhere else.  Did anybody check whether this exists back in 5.8
> installations?

Actually I didn't, and that's a good point; we have decided to support
TAP down to 5.8.9. The only reason I introduced this dependency is
that there is no easy native way to copy an entire folder in Perl, and
that's needed for handling base backups. There are modules like File::NCopy
or File::Copy::Recursive; however, it does not seem like a good idea to
depend on modules other than IPC::Run. Would it be better to have an
in-core module dedicated to that, similar to SimpleTee.pm? Or are you
fine with accepting a dependency on another module?

> Why is "recovery" added to ALWAYS_SUBDIRS in src/test/Makefile instead
> of to SUBDIRS?  Seems a strange choice.

Because I thought that it should not be part of the main regression
suite, like ssl/. Feel free to correct me if my feeling is wrong.

> Instead of adding
> print "# Error output: $stderr\n" if $stderr ne "";
> to sub psql, I think it would be better to add line separators, which
> would be clearer if the error output ever turns into a multiline error
> messages.  It would still show as empty if no stderr is produced; so I
> think something like
> if ($stderr ne '')
> {
> print " Begin standard error\n"
> print $stderr;
> print " End standard error\n";
> }
> or something like that.

Yes, that would be better.

> In my days of Perl, it was starting to become frowned upon to call
> subroutines without parenthesizing arguments.  Is that no longer the
> case?  Because I notice there are many places in this patch and pre-
> existing that call psql with an argument list without parens.  And it's
> a bit odd because I couldn't find any other subroutine that we're using
> in that way.

Hm, yeah. If we decide on a Perl coding policy I would be happy to
follow it. Personally I usually prefer using parentheses; however, if we
decide to make the calls consistent we had better address that as a
separate patch.

> In 005_replay_delay there's a 2s delay configured; then we test whether
> something is replayed in 1s.  I hate tests that run for a long time, but
> is 2s good enough considering that some of our test animals in buildfarm
> are really slow?

A call to poll_query_until ensures that we wait for the standby to
replay once the minimum replay threshold is reached. Even with a slow
machine the first query would still see only 10 rows at the first try,
and then wait for the standby to replay before checking if 20 rows are
visible. Or maybe I am not following your point.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-18 Thread Robert Haas
On Tue, Nov 17, 2015 at 9:30 PM, Etsuro Fujita
 wrote:
> I suppose you (and KaiGai-san) are probably right, but I really fail to see
> it actually doing that.

Noted, but let's do it that way and move on.  It would be a shame if
we didn't end up with a working FDW join pushdown system in 9.6
because of a disagreement on this point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-18 Thread Robert Haas
On Tue, Nov 17, 2015 at 8:47 PM, Kouhei Kaigai  wrote:
> The attached patch is the portion cut from the previous EPQ recheck
> patch.

Thanks, committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-18 Thread Robert Haas
On Tue, Nov 17, 2015 at 10:22 PM, Kouhei Kaigai  wrote:
> Apart from EPQ rechecks, the above aggressive push-down idea allows to send
> contents of multiple relations to the remote side. In this case, ForeignScan
> needs to have multiple sub-plans.
>
> For example, please assume here is three relations; tbl_A and tbl_B are
> local and small, tbl_F is remote and large.
> In the case where both (tbl_A JOIN tbl_F) and (tbl_B JOIN tbl_F) produce a
> large number of rows, and thus consume a sizable amount of network traffic, but
> (tbl_A JOIN tbl_B JOIN tbl_F) produces a small number of rows, the optimal
> strategy is to send the local contents to the remote side once, then run
> a remote query there to produce relatively few rows.
> In the implementation level, ForeignScan shall represent (tbl_A JOIN tbl_B
> JOIN tbl_F), then it returns a bunch of joined tuples. Its remote query
> contains VALUES(...) clauses to pack contents of the tbl_A and tbl_B, thus,
> it needs to be capable to execute underlying multiple scan plans and fetch
> tuples prior to remote query execution.

Hmm, maybe.  I'm not entirely sure multiple subplans is the best way
to implement that, but let's argue about that another day.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [COMMITTERS] pgsql: Improve vcregress.pl's handling of tap tests for client programs

2015-11-18 Thread Andrew Dunstan



On 11/18/2015 10:58 PM, Michael Paquier wrote:



On Thu, Nov 19, 2015 at 12:51 PM, Andrew Dunstan wrote:


Improve vcregress.pl's handling of tap tests for client programs

The target is now named 'bincheck' rather than 'tapcheck' so that it
reflects what is checked instead of the test mechanism. Some of the
logic is improved, making it easier to add further sets of TAP based
tests in future. Also, the environment setting logic is improved.

As discussed on -hackers a couple of months ago.


Seems like you forgot to update doc/src/sgml/install-windows.sgml, 
this one needs a refresh from tapcheck to bincheck. A backpatch to 9.5 
and 9.4 would also be nice.

Regards,



Yep, thanks for the reminder, done.

cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-11-18 Thread Amit Langote
On 2015/11/10 17:02, Amit Langote wrote:
> On 2015/10/29 23:22, Syed, Rahila wrote:
>> Please find attached an updated patch.
> 
> A few more comments on v6:

I backed up a little and studied the proposal and the patch in a little
more detail. Here are still more comments -

Going through the thread, it seems the following problems are being solved -

1) A general-purpose interface for (maintenance?) commands to report a set
of internal values of different types using shared memory as IPC; the
values are exposed to users as-is and/or as derived values like
percent_done through a view/functions

2) For a start, instrumenting lazy vacuum to report such internal values

And maybe,

3) Estimating the amount of work to be done and time required based on
historical statistics like n_dead_tup, visibility map and run-time
resources available like maintenance_work_mem

The latest version of the patch (v6) implements 1 and 2. The code is
starting to look good, though see some comments below.

* Regarding (2): Some random thoughts on the patch and in general -

For lazy vacuum, lazy_scan_heap() seems like the best place to provide a
granular progress report, in terms of the heap block number (out of the
total number of heap blocks in the relation) currently undergoing
per-block pass 1 processing. As for pass 2, ie, lazy_vacuum_index() and
lazy_vacuum_heap(), I don't see how we can do better than reporting
progress only after finishing all of it, without any finer-grained
instrumentation. They are essentially black boxes as far as the proposed
instrumentation approach is concerned. Being able to report progress per
index seems good, but as a whole, a user would have to wait arbitrarily
long before the numbers move forward. We might as well just report a bool
saying we're about to enter a potentially time-consuming index vacuum
round with possibly multiple indexes followed by lazy_vacuum_heap()
processing. Additionally, we can report the incremented count of the
vacuuming round (pass 2) once we are through it. So, we'd report two
values viz. waiting_vacuum_pass (bool) and num_vacuum_pass (int). The
former is reported twice - 'true' as we are about to begin the round and
'false' once done. We can keep the total_index_pages (counting all
indexes) and index_pages_done as the patch currently reports. The latter
moves forward for every index we finish processing, and should also be
reset for every pass 2 round. Note that we can leave them out of the
percent_done of overall vacuum progress. Until we have a good solution for
number (3) above, it seems too difficult to incorporate index pages into
overall progress.

As someone pointed out upthread, the final heap truncate phase can take
arbitrarily long and is outside the scope of lazy_scan_heap() to
instrument. Perhaps a bool, say, waiting_heap_trunc could be reported for
the same. Note that it would have to be reported from lazy_vacuum_rel().
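
To make the reporting side concrete, a progress view built on these
parameters could expose something like the query below; the view and
column names are only a sketch following the parameters suggested above,
not an actual proposal of the catalog shape.

-- Hypothetical view; column names follow the parameters discussed above.
SELECT pid,
       command_target,                   -- relation being vacuumed
       heap_blks_total, heap_blks_done,  -- pass 1 progress from lazy_scan_heap()
       waiting_vacuum_pass,              -- true while an index/heap vacuum round runs
       num_vacuum_pass,                  -- number of completed pass 2 rounds
       total_index_pages, index_pages_done,
       waiting_heap_trunc                -- true during the final heap truncate phase
  FROM pg_stat_vacuum_progress;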

I spotted a potential oversight regarding the report of scanned_pages. It
seems pages that are skipped because of not getting a pin, being new, or
being empty could be left out of the progress equation.

* Regarding (1): These are mostly code comments -

IMHO, the float progress parameters (st_progress_param_float[]) can be
taken out. They are currently unused and it's unlikely that any command
would want to report them. OTOH, as suggested in the above paragraph, why
not have bool parameters? In addition to the few I mentioned in the context
of lazy vacuum instrumentation, it seems likely that they would be useful
for other commands, too.

Instead of st_activity_flag, how about st_command and calling
ACTIVITY_IS_VACUUM, say, COMMAND_LAZY_VACUUM?
pgstat_report_activity_flag() then would become pgstat_report_command().

Like the floats, I would think we could take out st_progress_message[][].
I see that it is currently used to report the table name. For that, we
might as well add a single st_command_target[NAMEDATALEN] string which is
set at the beginning of command processing using, say,
pgstat_report_command_target(). It stores the name of the relation/object
that the command is going to work on.

Maybe we don't need each command to proactively call pgstat_reset_command().
That would be similar to how st_activity is not proactively cleared but is
rather reset by the next query/command or when some other backend uses the
shared memory slot. Also, we could have a SQL function
pg_stat_reset_local_progress() which clears the st_command, after which the
backend is no longer shown in the progress view.

I think it would be better to report only the changed parameter arrays when
performing pgstat_report_progress(). So, suppose we have the following
shared memory progress parameters and the reporting function signature:

typedef struct PgBackendStatus
{
...
    uint16    st_command;                       /* command being run, e.g. COMMAND_LAZY_VACUUM */
    char      st_command_target[NAMEDATALEN];   /* relation/object the command works on */
    uint32    st_progress_uint32_param[N_PROGRESS_PARAM];  /* integer progress counters */
    bool      st_progress_bool_param[N_PROGRESS_PARAM];    /* boolean progress flags */
} PgBackendStatus;

void pgstat_report_progress(uint32 

Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-11-18 Thread Amit Langote
On 2015/11/19 16:18, Amit Langote wrote:
> I'm marking this as "Waiting on author" in the commitfest app. Also, let's
> hear from more people.

Well, it seems Michael beat me to it.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Additional role attributes && superuser review

2015-11-18 Thread Michael Paquier
On Thu, Nov 19, 2015 at 7:10 AM, Stephen Frost wrote:
> * Michael Paquier (michael.paqu...@gmail.com) wrote:
>> It seems weird to not have a dedicated role for pg_switch_xlog.
>
> I didn't add a pg_switch_xlog default role in this patch series, but
> would be happy to do so if that's the consensus.  It's quite easy to do.

Agreed. I am not really getting why that's part of backup; it seems more
related to archiving, the two being distinct concepts. But at this point I
guess that's mainly a philosophical split.

> I'm guessing you were referring to pg_admin with your comment that you
> were "not convinced that we actually need it."  After thinking about
> it for a bit, I tend to agree.  A user could certainly create their own
> role which combines these all together if they wanted to (or do any
> other combinations, based on their particular needs).  I've gone ahead
> and removed pg_admin from the set of default roles.

Yeah, that's what I meant. Thanks, I should have been more precise
with my words.
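
For what it's worth, a site that still wants an all-in-one bundle can
simply build it from whichever default roles the final patch ends up
providing, for example (role names other than the proposed defaults are
made up):

CREATE ROLE local_admin NOLOGIN;
GRANT pg_monitor, pg_signal_backend TO local_admin;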

>> +   if (IsReservedName(stmt->role))
>> +   ereport(ERROR,
>> +   (errcode(ERRCODE_RESERVED_NAME),
>> +errmsg("role name \"%s\" is reserved",
>> +stmt->role),
>> +errdetail("Role names starting with
>> \"pg_\" are reserved.")));
>> Perhaps this could be a separate change? It seems that restricting roles
>> with pg_ on the cluster is not a bad restriction in any case. You may want
>> to add regression tests to trigger those errors as well.
>
> I'm a bit confused- this is a separate change.  Note that the previous
> patch was actually a git patchset which had two patches- one to do the
> reservation and a second to add the default roles.  The attached
> patchset is actually three patches:
> 1- Update to catalog documentation regarding permissions
> 2- Reserve 'pg_' role namespace
> 3- Add default roles

I see, that's why I got confused. I just downloaded your file and
applied it blindly with git apply or patch (don't recall which). Other
folks tend to publish that as a separate set of files generated by
git-format-patch.

>> Shouldn't pg_current_xlog_location, pg_current_xlog_insert_location,
>> pg_last_xlog_receive_location and pg_last_xlog_replay_location fall under a
>> restriction category like pg_monitor? I recall of course that we discussed
>> that some months ago and that a lot of people were reluctant to harden this
>> area so as not to break any existing monitoring tool; still, this patch may
>> be the occasion to restrict those functions a bit, particularly on servers
>> where wal_compression is enabled.
>
> For my 2c, I believe they should.  I've not modified them in that way in
> this patchset, but I can certainly do so if others agree.

My vote goes into hardening them a bit this way.

> I've added regression tests for the default roles where it was
> relatively straight-forward to do so.  I don't see a trivial way to test
> that the pg_signal_backend role works though, as an example.

It is at least possible to check that the error code paths are
working. For the code paths where a backend would be actually running,
you could use the following:
SET client_min_messages TO 'error';
-- This should return "1" and not an ERROR nor a WARNING if PID does not exist.
select count(pg_terminate_backend(0));
But that's ugly depending on your taste.
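
The pg_ name reservation itself, on the other hand, is trivial to cover
with a simple negative test, something like (exact error wording as per
the patch's errmsg):

CREATE ROLE pg_testrole;  -- should fail
ERROR:  role name "pg_testrole" is reserved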

>> Bonus:
>> -static void
>> -check_permissions(void)
>> -{
>> -   if (!superuser() && !has_rolreplication(GetUserId()))
>> -   ereport(ERROR,
>> -   (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
>> -(errmsg("must be superuser or replication role to use replication slots"))));
>> -}
>> I would have kept check_permissions and done the checks on
>> has_rolreplication and has_privs_of_role directly in it, since the same
>> code is duplicated three times.
>
> For my 2c, I dislike the notion of "check_permissions()" functions being
> added throughout the codebase as I'm afraid it'd get confusing, which is
> why I got rid of it.  I'm much happier seeing the actual permissions
> check as it should be happening early on in the primary function which
> is being called into.  If there is really a push to go back to having a
> 'check_permissions()' like function, I'd argue that we should make the
> function name more descriptive of what's actually being done and have
> them be distinct from each other.  As I recall, I discussed this change
> with Andres and he didn't feel particularly strongly about this one way
> or the other, therefore, I've not made this change for this patchset.

Fine by me. Usually people point out that such code duplication is a bad
thing, and here we just have a static routine used by a set of functions
interacting with the same kind of object, namely a replication slot. But
I'm fine either way.

Do you think that it 

Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-11-18 Thread Michael Paquier
On Thu, Nov 19, 2015 at 4:30 PM, Amit Langote
 wrote:
> On 2015/11/19 16:18, Amit Langote wrote:
>> I'm marking this as "Waiting on author" in the commitfest app. Also, let's
>> hear from more people.
>
> Well, it seems Michael beat me to it.

Yeah, some other folks provided feedback; that's why I did it.
@Rahila: are you planning more work on this patch? It seems that at
this point we're waiting for you. If not, then, because I have waited a
couple of times for long VACUUM jobs to finish on some relations without
having much information about their progress, I would be fine with
spending time hacking on this. That's up to you.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minor comment improvement to create_foreignscan_plan

2015-11-18 Thread Etsuro Fujita

On 2015/11/19 5:29, Robert Haas wrote:

On Tue, Nov 17, 2015 at 9:29 PM, Etsuro Fujita
 wrote:

On 2015/11/18 2:57, Robert Haas wrote:

On Sun, Nov 15, 2015 at 9:25 PM, Etsuro Fujita
 wrote:

Oops, I've found another one.  I think we should update a comment in
postgresGetForeignPlan, too; add remote filtering expressions to the
list of information needed to create a ForeignScan node.



Instead of saying "remote/local", how about saying "remote and local"
or just leaving it out altogether as in the attached?



+1 for your patch.



OK, committed.


Thanks!

Best regards,
Etsuro Fujita



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers