Re: [HACKERS] Request more documentation for incompatibility of parallelism and plpgsql exec_run_select

2017-07-03 Thread Amit Kapila
On Mon, Jul 3, 2017 at 8:57 PM, Simon Riggs  wrote:
> On 30 June 2017 at 05:14, Amit Kapila  wrote:
>
>> This is explained in section 15.2 [1], refer below para:
>> "The query might be suspended during execution. In any situation in
>> which the system thinks that partial or incremental execution might
>> occur, no parallel plan is generated. For example, a cursor created
>> using DECLARE CURSOR will never use a parallel plan. Similarly, a
>> PL/pgSQL loop of the form FOR x IN query LOOP .. END LOOP will never
>> use a parallel plan, because the parallel query system is unable to
>> verify that the code in the loop is safe to execute while parallel
>> query is active."
>
> Can you explain "unable to verify that the code in the loop is safe to
> execute while parallel query is active". Surely we aren't pushing code
> in the loop into the actual query, so the safety of command in the FOR
> loop has nothing to do with the parallel safety of the query.
>
> Please give an example of something that would be unsafe? Is that
> documented anywhere, README etc?
>
> FOR x IN query LOOP .. END LOOP
> seems like a case that would be just fine, since we're going to loop
> thru every row or break early.
>

It is not fine because we don't have support for partial execution.  In
the above case, if the loop breaks early, we can't stop the parallel query
execution cleanly.  Now, I don't think it would be impossible to support
that, but as of now the parallel subsystem doesn't have such support.
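For illustration, here is a minimal PL/pgSQL sketch (the table name
big_table is just a placeholder) of the loop form in question; because
the loop may exit after a handful of rows, the executor has to be able
to abandon the query mid-stream, which the parallel machinery cannot do
today:

    DO $$
    DECLARE
        r record;
        n integer := 0;
    BEGIN
        FOR r IN SELECT * FROM big_table LOOP
            n := n + 1;
            EXIT WHEN n >= 10;  -- partial execution: most rows are never fetched
        END LOOP;
    END
    $$;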

> Surely NO SCROLL and WITH HOLD cursors would work fine?
>

No, as of now they are not supported, for the same reason (no support for partial execution).
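As a concrete example (big_table again being a placeholder), a WITH HOLD
cursor is fetched incrementally and only materialized at commit, so the
underlying query must be able to stop and resume at arbitrary points:

    BEGIN;
    DECLARE c NO SCROLL CURSOR WITH HOLD FOR SELECT * FROM big_table;
    FETCH 100 FROM c;   -- only part of the result set has been pulled so far
    COMMIT;             -- remaining rows are materialized here
    FETCH 100 FROM c;   -- the cursor survives the transaction
    CLOSE c;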

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] More race conditions in logical replication

2017-07-03 Thread Noah Misch
On Sun, Jul 02, 2017 at 07:54:48PM -0400, Tom Lane wrote:
> I noticed a recent failure that looked suspiciously like a race condition:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2017-07-02%2018%3A02%3A07
> 
> The critical bit in the log file is
> 
> error running SQL: 'psql::1: ERROR:  could not drop the replication 
> slot "tap_sub" on publisher
> DETAIL:  The error was: ERROR:  replication slot "tap_sub" is active for PID 
> 3866790'
> while running 'psql -XAtq -d port=59543 host=/tmp/QpCJtafT7R 
> dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'DROP SUBSCRIPTION 
> tap_sub' at 
> /home/nm/farm/xlc64/HEAD/pgsql.build/src/test/subscription/../../../src/test/perl/PostgresNode.pm
>  line 1198.
> 
> After poking at it a bit, I found that I can cause several different
> failures of this ilk in the subscription tests by injecting delays at
> the points where a slot's active_pid is about to be cleared, as in the
> attached patch (which also adds some extra printouts for debugging
> purposes; none of that is meant for commit).  It seems clear that there
> is inadequate interlocking going on when we kill and restart a logical
> rep worker: we're trying to start a new one before the old one has
> gotten out of the slot.
> 
> I'm not particularly interested in fixing this myself, so I'm just
> going to add it to the open items list.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Peter,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] 
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com




Re: [HACKERS] Race conditions with WAL sender PID lookups

2017-07-03 Thread Noah Misch
On Mon, Jul 03, 2017 at 12:14:55PM -0400, Alvaro Herrera wrote:
> Michael Paquier wrote:

> > > In passing, clean up some leftover braces which were used to create
> > > unconditional blocks.  Once upon a time these were used for
> > > volatile-izing accesses to those shmem structs, which is no longer
> > > required.  Many other occurrences of this pattern remain.
> > 
> > Here are the places where a cleanup can happen:
> > - WalSndSetState
> > - ProcessStandbyReplyMessage
> > - XLogRead, 2 places
> > - XLogSendLogical
> > - WalRcvWaitForStartPosition
> > - WalRcvDie
> > - XLogWalRcvFlush
> > - ProcessWalSndrMessage
> > In most of the places of the WAL sender, braces could be removed to
> > improve the style. For the WAL receiver, declarations are not
> > necessary. As a matter of style, why not cleaning up just the WAL
> > sender stuff? Changing the WAL receiver code just to remove some
> > declarations would not improve readability, and would make back-patch
> > more difficult.
> 
> I think we should clean this up whenever we're modifying the surrounding
> code, but otherwise we can leave well enough alone.  It's not a high
> priority item at any rate.

Bundling code cleanup into commits that also do something else is strictly
worse than bundling whitespace cleanup, which is itself bad:
https://postgr.es/m/flat/20160113144826.gb3379...@tornado.leadboat.com

Deferring cleanup and pushing cleanup-only commits are each good options.




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2017-07-03 Thread Amit Kapila
On Mon, Jul 3, 2017 at 6:15 PM, Amit Kapila  wrote:
> On Thu, Mar 23, 2017 at 1:18 PM, Ashutosh Sharma 
> wrote:
>>
>> Conclusion:
>> As seen from the test results mentioned above, there is some performance
>> improvement with 3 SP(s); with 5 SP(s) the results with the patch are slightly
>> better than HEAD; with 7 and 10 SP(s) we do see a regression with the patch.
>> Therefore, I think the threshold value of 4 for the number of subtransactions
>> considered in the patch looks fine.
>>
>
> Thanks for the tests.  Attached find the rebased patch on HEAD.  I have run
> the latest pgindent on the patch.  I have yet to add a wait event for group lock waits
> in this patch as is done by Robert in commit
> d4116a771925379c33cf4c6634ca620ed08b551d for ProcArrayGroupUpdate.
>

I have updated the patch to support wait events and moved it to the upcoming CF.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


group_update_clog_v13.patch
Description: Binary data



Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-07-03 Thread Ashutosh Bapat
On Fri, Jun 30, 2017 at 2:53 PM, Rafia Sabih
 wrote:
>
>
> On Mon, May 22, 2017 at 12:02 PM, Ashutosh Bapat
>  wrote:
>>
>>
>> Here's set of patches rebased on latest head.
>
>
> In an attempt to test this set of patches, I found that not all of the
> patches could be applied on latest head-- commit
> 08aed6604de2e6a9f4d499818d7c641cbf5eb9f7
> Might be in need of rebasing.

Thanks Rafia for your interest. I have started rebasing the patches on
the latest head. I am expecting it to take some time. Will update the
thread with the patches once I am done rebasing them.



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] A bug in mapping attributes in ATExecAttachPartition()

2017-07-03 Thread Amit Langote
Thanks for the review.

On 2017/07/03 20:13, Ashutosh Bapat wrote:
> Thanks for working on the previous comments. The code really looks good now.
> On Fri, Jun 23, 2017 at 2:29 PM, Amit Langote
>  wrote:
>>
>>> Don't we need an exclusive lock to
>>> make sure that the constraints are not changed while we are validating 
>>> those?
>>
>> If I understand your question correctly, you meant to ask if we don't need
>> the strongest lock on individual partitions while looking at their
>> constraints to prove that we don't need to scan them.  We do and we do
>> take the strongest lock on individual partitions even today in the second
>> call to find_all_inheritors().  We're trying to eliminate the second call
>> here.
> 
> The comment seems to imply that we need strongest lock only when we
> "scan" the table/s.
> 
> Some more comments on 0001
> - * Prevent circularity by seeing if rel is a partition of attachRel. (In
> + * Prevent circularity by seeing if rel is a partition of attachRel, (In
>   * particular, this disallows making a rel a partition of itself.)
> The sentence outside () doesn't have a full-stop. I think the original
> construct was better.

Yep, fixed.

> + * We want to avoid having to construct this list again, so we request 
> the
>
> "this list" is confusing here since the earlier sentence doesn't mention any
> list at all. Instead we may reword it as "We will need the list of children
> later to check whether any of those have a row which would not fit the
> partition constraints. So, take the strongest lock ..."

It was confusing for sure, so rewrote.

>  * XXX - Do we need to lock the partitions here if we already have the
>  * strongest lock on attachRel?  The information we need here to check
>  * for circularity cannot change without taking a lock on attachRel.
>
> I wondered about this. Do we really need an exclusive lock to check whether
> partition constraint is valid? May be we can compare this condition with ALTER
> TABLE ... ADD CONSTRAINT since the children will all get a new constraint
> effectively. So, exclusive lock it is.

Actually, the XXX comment is about whether we need to lock the children at
all when checking the circularity of inheritance; it is not about whether we
need a lock to check that the partition constraint is valid.

> Assert(linitial_oid(attachRel_children) ==
> RelationGetRelid(attachRel));
> if (attachRel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> attachRel_children = list_delete_first(attachRel_children);
>
> Is it necessary for this code to have OID of the relation being attached as 
> the
> first one? You could simply call list_delete_oid() instead of
> list_delete_first(). If for any reason find_all_inheritors() changes the 
> output
> order, this assertion and code would need a change.

I concluded that removing attachRel's OID from attachRel_children is
pointless, considering that we have to check for attachRel in the loop
below anyway.  The step of removing the OID wasn't helping much.

>>> The name skipPartConstraintValidation() looks very specific to the case at
>>> hand. The function is really checking whether existing constraints on the 
>>> table
>>> can imply the given constraints (which happen to be partition constraints). 
>>> How
>>> about PartConstraintImpliedByRelConstraint()? The idea is to pick a general
>>> name so that the function can be used for purposes other than skipping
>>> validation scan in future.
>>
>> I liked this idea, so done.
> 
> + * skipPartConstraintValidation
> +PartConstraintImpliedByRelConstraint(Relation partrel, List *partConstraint)
> Different function names in prologue and the definition.

Oops, fixed.

> +if (predicate_implied_by(partConstraint, existConstraint, true))
> +return true;
> +
> +/* Tough luck. */
> +return false;
>
> why not to just return the return value of predicate_implied_by() as is?

Sigh.  Done. :)

> With this we can actually handle constr == NULL a bit differently.
> +if (constr == NULL)
> +return false;
>
> To be on safer side, return false when partConstraint is not NULL. If both the
> relation constraint and partConstraint are both NULL, the first does imply the
> other. Or may be leave that decision to predicate_implied_by(), which takes
> care of it right at the beginning of the function.

Rearranged code considering this comment.  Let's let
predicate_implied_by() make ultimate decisions about logical implication. :)

> 
> + * For each leaf partition, check if it we can skip the validation
>
> An extra "it".

Fixed.

> 
> + * Note that attachRel's OID is in this list.  Since we already
> + * determined above that its validation scan cannot be skipped, we
> + * need not check that again in the loop below.  If it's partitioned,
>
> I don't see code to skip checking whether scan can be 

Re: [HACKERS] asynchronous execution

2017-07-03 Thread Kyotaro HORIGUCHI
Thank you for the thought.

This is at the PoC level, so I'd be grateful for this kind of
fundamental comment.

At Wed, 28 Jun 2017 20:22:24 +0200, Antonin Houska  wrote in 
<392.1498674144@localhost>
> Kyotaro HORIGUCHI  wrote:
> 
> > The patch got conflicted. This is a new version just rebased to
> > the current master. Furtuer amendment will be taken later.
> 
> Just one idea that I had while reading the code.
> 
> In ExecAsyncEventLoop you iterate estate->es_pending_async, then move the
> complete requests to the end and finaly adjust estate->es_num_pending_async so
> that the array no longer contains the complete requests. I think the point is
> that then you can add new requests to the end of the array.
> 
> I wonder if a set (Bitmapset) of incomplete requests would make the code more
> efficient. The set would contain position of each incomplete request in
> estate->es_num_pending_async (I think it's the myindex field of
> PendingAsyncRequest). If ExecAsyncEventLoop used this set to retrieve the
> requests subject to ExecAsyncNotify etc, then the compaction of
> estate->es_pending_async wouldn't be necessary.
> 
> ExecAsyncRequest would use the set to look for space for new requests by
> iterating it and trying to find the first gap (which corresponds to completed
> request).
> 
> And finally, item would be removed from the set at the moment the request
> state is being set to ASYNCREQ_COMPLETE.

Effectively it is a waiting-queue followed by a completed-list.
The point of the compaction is keeping the order of waiting or
not-yet-completed requests, which is crucial to avoid a kind of
precedence inversion. We cannot keep that order by using a
bitmapset in that way.

The current code waits on all waiters at once and processes all
fired events at once. The order in the waiting-queue is
inessential in that case. On the other hand, I suppose that waiting
on several tens to nearly a hundred remote hosts is a realistic
target range. Keeping the order could be crucial if we process
only a part of the queue at a time.

If we put significance on the variation in response times of the
remotes, process-all-at-once is effective. In turn, we should
consider the effectiveness of the lifecycle of the larger wait
event set.

Sorry for the discursive discussion, but in short, I have noticed
that I have a lot to consider on this. :p Thanks!

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





[HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-07-03 Thread Tom Lane
I was reminded by bug #14729 that we are not very good on corner cases
involving inequality operators, ie < <= > >=.  Historically, the
selectivity functions have simply not distinguished < from <= or >
from >=, arguing that the fraction of the population that satisfies
the "=" aspect can be considered to be vanishingly small, if the
comparison value isn't any of the most-common-values for the variable.
(If it is, the code path that executes the operator against each MCV
will take care of things properly.)  But that isn't really true unless
we're dealing with a continuum of variable values, and in practice
we seldom are.  If "x = const" would estimate a nonzero number of
rows for a given const value, then it follows that we ought to estimate
different numbers of rows for "x < const" and "x <= const", whether or
not the const is one of the MCVs.

Hence, the attached patch, which splits the scalar inequality selectivity
functions into four sets instead of two.  This demonstrably produces
better estimates, for example using the "tenk1" table in the regression
database:

regression=# explain analyze select * from tenk1 where thousand < 10;
before:
 Bitmap Heap Scan on tenk1  (cost=5.14..241.38 rows=110 width=244) (actual 
time=0.121..0.623 rows=100 loops=1)
with patch:
 Bitmap Heap Scan on tenk1  (cost=5.06..227.42 rows=100 width=244) (actual 
time=0.054..0.300 rows=100 loops=1)

regression=# explain analyze select * from tenk1 where thousand <= 10;
before:
 Bitmap Heap Scan on tenk1  (cost=5.14..241.38 rows=110 width=244) (actual 
time=0.120..0.383 rows=110 loops=1)
with patch:
 Bitmap Heap Scan on tenk1  (cost=5.14..241.38 rows=110 width=244) (actual 
time=0.062..0.288 rows=110 loops=1)

regression=# explain analyze select * from tenk1 where thousand > 10;
before:
 Seq Scan on tenk1  (cost=0.00..483.00 rows=9890 width=244) (actual 
time=0.030..6.276 rows=9890 loops=1)
with patch:
 Seq Scan on tenk1  (cost=0.00..483.00 rows=9890 width=244) (actual 
time=0.019..4.881 rows=9890 loops=1)

regression=# explain analyze select * from tenk1 where thousand >= 10;
before:
 Seq Scan on tenk1  (cost=0.00..483.00 rows=9890 width=244) (actual 
time=0.022..5.371 rows=9900 loops=1)
with patch:
 Seq Scan on tenk1  (cost=0.00..483.00 rows=9900 width=244) (actual 
time=0.014..3.783 rows=9900 loops=1)

regression=# explain analyze select * from tenk1 where thousand between 10 and 
11;
before:
 Bitmap Heap Scan on tenk1  (cost=4.39..39.52 rows=10 width=244) (actual 
time=0.082..0.215 rows=20 loops=1)
with patch:
 Bitmap Heap Scan on tenk1  (cost=4.49..70.61 rows=20 width=244) (actual 
time=0.080..0.207 rows=20 loops=1)
   Recheck Cond: ((thousand >= 10) AND (thousand <= 11))

regression=# explain analyze select * from tenk1 where thousand between 10 and 
10;
before:
 Index Scan using tenk1_thous_tenthous on tenk1  (cost=0.29..8.30 rows=1 
width=244) (actual time=0.041..0.112 rows=10 loops=1)
with patch:
 Bitmap Heap Scan on tenk1  (cost=4.39..39.52 rows=10 width=244) (actual 
time=0.074..0.142 rows=10 loops=1)

As these examples show, it's cases with very tight range constraints where
this really makes enough difference to be worth doing.  I believe the
complaint in bug #14729 basically corresponds to the last example, where
the erroneous rowcount estimate is driving a bad choice of join plan.

Aside from the mind-bendingly-tedious changes in pg_operator.h, the meat
of the patch is in selfuncs.c's ineq_histogram_selectivity(), which now
applies a correction for equal values in the cases where we were getting
it wrong before.  While this logic seems experimentally correct (see
above), I have to admit that I've failed to wrap my brain around exactly
why it's correct.  The arguments that I've constructed so far seem to
point in the direction of applying the opposite correction, which is
demonstrably wrong.  Perhaps someone whose college statistics class
wasn't quite so long ago can explain this satisfactorily?

Aside from the missing/inadequate comment about why to apply this
correction, there remains work to update several contrib modules that
reference scalarltsel/scalargtsel.  That could be done separately though.
An extension that doesn't change its <= or >= operators to reference
scalarlesel/scalargesel isn't worse off than before, it's just failing
to benefit from this improvement.
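Just as a sketch of what that update might look like (the type mytype and
support function mytype_le are invented here; scalarlesel is the new
restriction function name proposed above):

    CREATE OPERATOR <= (
        LEFTARG    = mytype,
        RIGHTARG   = mytype,
        PROCEDURE  = mytype_le,
        COMMUTATOR = >=,
        NEGATOR    = >,
        RESTRICT   = scalarlesel,     -- was scalarltsel
        JOIN       = scalarltjoinsel
    );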

Obviously this is too late for v10; I'll stick it into the next
commitfest.

regards, tom lane

diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml
index 333a36c..e9ff110 100644
--- a/doc/src/sgml/xindex.sgml
+++ b/doc/src/sgml/xindex.sgml
@@ -792,8 +792,7 @@ CREATE OPERATOR  (
It is important to specify the correct commutator and negator operators,
as well as suitable restriction and join selectivity
functions, otherwise the optimizer will be unable to make effective
-   use of the index.  Note that the less-than, equal, and
-   greater-than cases should use different selectivity functions.
+   

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-07-03 Thread Masahiko Sawada
On Tue, Jul 4, 2017 at 12:21 PM, Peter Eisentraut
 wrote:
> On 6/25/17 06:35, Petr Jelinek wrote:
>> - Do LockSharedObject in ALTER SUBSCRIPTION, DROP SUBSCRIPTION (this one
>> is preexisting but mentioning it for context), SetSubscriptionRelState,
>> AddSubscriptionRelState, and in the logicalrep_worker_launch. This means
>> we use granular per object locks to deal with concurrency.
>
> I have committed those locking changes, as we had already discussed them
> previously.  This should address the open item.

Thank you for committing the patch!

>
> Maybe we can start new threads for the other parts of the patch and
> maybe split the patch up a bit.

Yeah, let's discuss reworking the locking and management on a new thread.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] [BUGS] BUG #14699: Statement trigger and logical replication

2017-07-03 Thread Peter Eisentraut
On 6/17/17 08:29, Euler Taveira wrote:
> 2017-06-16 22:08 GMT-03:00 Peter Eisentraut:
> 
> 
> The issue is that the logical replication initial data copy fires a
> statement trigger for INSERT, because it's implemented as a COPY
> internally.
> 
> We should document such behavior. AFAICS we discuss later if we should
> provide an option to fire statement triggers during initial copy.

I have added the documentation.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] subscription worker signalling wal writer too much

2017-07-03 Thread Jeff Janes
On Sat, Jun 24, 2017 at 5:09 PM, Andres Freund  wrote:

> Hi,
>
> On 2017-06-15 15:06:43 -0700, Jeff Janes wrote:
>
> > It looks like only limited consolidation was happening, the number of
> kills
> > invoked was more than half of the number of SetLatch.  I think the reason
> > is that when kill(owner_pid, SIGUSR1) is called, the kernel immediately
> > suspends the calling process and gives the signalled process a chance to
> > run in that time slice.  The Wal Writer gets woken up, sees that it only
> > has a partial page to write and decides not to do anything, but resets
> the
> > latch.  It does this fast enough that the subscription worker hasn't
> > migrated CPUs yet.
>
> I think part of the problem here is that latches signal the owning
> process even if the owning process isn't actually sleeping - that's
> obviously quite pointless.  In cases where walwriter is busy, that
> actually causes a lot of superflous interrupted syscalls, because it
> actually never ends up waiting on the latch. There's also a lot of
> superflous context switches.  I think we can avoid doing so quite
> easily, as e.g. with the attached patch.  Could you check how much that
> helps your benchmark?
>

That patch doesn't make any improvement to the pgbench --unlogged-tables
benchmark.  I expected that, because this benchmark exposes the opposite
problem: the WAL writer isn't busy; it is constantly being woken up but
then immediately goes back to sleep.



> > The first change made the WALWriter wake up and do a write and flush
> > whenever an async commit occurred and there was a completed WAL page to
> be
> > written.  This way the hint bits could be set while the heap page was
> still
> > hot, because they couldn't get set until the WAL covering the hinted-at
> > transaction commit is flushed.
> >
> > The second change said we can set hint bits before the WAL covering the
> > hinted-at transaction is flushed, as long the page LSN is newer than that
> > (so the wal covering the hinted-at transaction commit must be flushed
> > before the page containing the hint bit can be written).
> >
> > Then the third change makes the wal writer only write the full pages most
> > of the times it is woken up, not flush them.  It only flushes them once
> > every wal_writer_delay or wal_writer_flush_after, regardless of how often
> > it is woken up.
> >
> > It seems like the third change rendered the first one useless.
>
> I don't think so. Isn't the walwriter writing out the contents of the
> WAL is quite important because it makes room in wal_buffers for new WAL?
>

If we wanted to do that, I think the asynchronous committer should just
issue the write directly (unless wal_sync_method is open_datasync
or open_sync).  Writes are usually extremely fast and it is not worth trying
to offload them to a background process, as your half of the signalling
takes longer than just doing the write yourself.  When that is not true and
writes start blocking due to extreme I/O congestion, then everything is
going to start blocking anyway (the wal_buffers head will run into the tail,
and the tail can't advance because the writes are blocking), so while
offloading writes to a background process is then good in theory, it isn't
very effective.  If everyone is stuck, it doesn't matter which process is
directly stuck and which ones are indirectly stuck.


>
> I suspect we actually should wake up wal-writer *more* aggressively, to
> offload wal fsyncs from individual backends, even when they're not
> committing.  Right now we fsync whenever a segment is finished - we
> really don't want to do that in backends that could do other useful
> work.  I suspect it'd be a good idea to remove that logic from
> XLogWrite() and replace it with
> if (ProcGlobal->walwriterLatch)
> SetLatch(ProcGlobal->walwriterLatch);
>
>
My understanding is that you can't start on a new file until the old file
is completely synced, because the bookkeeping can currently only handle
one file at a time.  So if you signal the wal writer to do the sync, you
would just end up immediately blocking on it to finish.  Getting around
that would require substantially more changes; we would need to implement a
queue of full log files not yet synced so that we could move on without
waiting.  If we had such a queue, then sure, this would be a good idea to
just wake the wal writer.  But currently, it only matters for large
transactions (\copy, insert ... select, bulk updates).  With small
transactions, you can't fill up log files quickly enough for it to make
much of a difference.  So I think this is something to do, but it is
independent of what I am currently going on about.



>
> > Wouldn't it
> > better, and much simpler, just to have reverted the first change once the
> > second one was done?
>
> I think both can actually happen independently, no? It's pretty common
> for the page lsn to be *older* than the corresponding commit.  In fact
> you'll always hit 

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-07-03 Thread Peter Eisentraut
On 6/25/17 06:35, Petr Jelinek wrote:
> - Do LockSharedObject in ALTER SUBSCRIPTION, DROP SUBSCRIPTION (this one
> is preexisting but mentioning it for context), SetSubscriptionRelState,
> AddSubscriptionRelState, and in the logicalrep_worker_launch. This means
> we use granular per object locks to deal with concurrency.

I have committed those locking changes, as we had already discussed them
previously.  This should address the open item.

Maybe we can start new threads for the other parts of the patch and
maybe split the patch up a bit.  At this point I don't want to commit
major code rearrangements without specific reasons and detailed analysis.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Error while copying a large file in pg_rewind

2017-07-03 Thread Michael Paquier
On Tue, Jul 4, 2017 at 4:27 AM, Tom Lane  wrote:
> Peter Eisentraut  writes:
>> On 7/3/17 09:53, Tom Lane wrote:
>>> Hm.  Before we add a bunch of code to deal with that, are we sure we
>>> *want* it to copy such files?  Seems like that's expending a lot of
>>> data-transfer work for zero added value --- consider e.g. a server
>>> with a bunch of old core files laying about in $PGDATA.  Given that
>>> it's already excluded all database-data-containing files, maybe we
>>> should just set a cap on the plausible size of auxiliary files.
>
>> It seems kind of lame to fail on large files these days, even if they
>> are not often useful in the particular case.
>
> True.  But copying useless data is also lame.

We don't want to complicate the pg_rewind code with filtering
capabilities, so if the fix is simple I think that we should include
it and be done. That will avoid future complications as well.

>> Also, most of the segment and file sizes are configurable, and we have
>> had reports of people venturing into much larger file sizes.
>
> But if I understand the context correctly, we're not transferring relation
> data files this way anyway.  If we do transfer WAL files this way, we
> could make sure to set the cutoff larger than the WAL segment size.

WAL segments are not transferred. Only the WAL data of the target
data folder is scanned to find all the blocks that have been
touched since the last checkpoint before WAL forked.

Now, I think that this is broken for relation files larger than 2GB;
see fetch_file_range, where the begin location is an int32.
-- 
Michael




Re: [HACKERS] WIP patch for avoiding duplicate initdb runs during "make check"

2017-07-03 Thread Michael Paquier
On Mon, Jul 3, 2017 at 11:00 PM, Tom Lane  wrote:
> Michael Paquier  writes:
>> On Mon, Jul 3, 2017 at 9:25 PM, Greg Stark  wrote:
>>> On 2 July 2017 at 18:33, Tom Lane  wrote:
 system("cp -a ...") call in favor of something more portable.
>
>>> If we're ok with using Perl there's File::Copy::Recursive::dircopy()
>>> which does exactly that.
>
>> This stuff needs to support perl down to 5.8.0, and that's a reason
>> behind having src/test/perl/RecursiveCopy.pm. So I would suggest just
>> to use that. cp is not portable on Windows as well, that's a recipe
>> for non-portable code there.

This was under the assumption of "if we use perl" :)

> I can't see going this path in pg_regress, because then you would have
> exactly zero test functionality in a non-Perl build.

Indeed, release tarballs don't need perl to work. So that's a no-go.

> What I had in
> mind was a frontend-friendly version of backend/storage/file/copydir.c,
> either just dropped into pg_regress.c or put in src/common/.

+1 for src/common/.
-- 
Michael




Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2017-07-03 Thread Alvaro Herrera
Mark Rofail wrote:
> On Mon, Jun 26, 2017 at 6:44 PM, Alexander Korotkov 
> wrote:
> 
> > Have you met any particular problem here?  Or is it just a lot of
> > mechanical work?
> >
> 
> Just A LOT of mechanical work, thankfully. The patch is now rebased and
> all regression tests have passed (even the element_foreign_key). Please find
> the patch below!

Great!

> *What I plan to do next *
> 
>- study ri_triggers.c (src/backend/utils/adt/ri_triggers.c) since this
>is where the new RI code will reside

Any news?


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Revisiting NAMEDATALEN

2017-07-03 Thread Tom Lane
Emrul  writes:
> Is this something that can be revisited for an upcoming release? Also, are
> there any technical problems that would be created by increasing this
> attribute?

This has been discussed before, eg here:
https://www.postgresql.org/message-id/23546.1353223...@sss.pgh.pa.us

You're free to build your own copy with whatever NAMEDATALEN you want,
but it seems unlikely to me that we'll change the default.

regards, tom lane




Re: [HACKERS] Error while copying a large file in pg_rewind

2017-07-03 Thread Tom Lane
Peter Eisentraut  writes:
> On 7/3/17 09:53, Tom Lane wrote:
>> Hm.  Before we add a bunch of code to deal with that, are we sure we
>> *want* it to copy such files?  Seems like that's expending a lot of
>> data-transfer work for zero added value --- consider e.g. a server
>> with a bunch of old core files laying about in $PGDATA.  Given that
>> it's already excluded all database-data-containing files, maybe we
>> should just set a cap on the plausible size of auxiliary files.

> It seems kind of lame to fail on large files these days, even if they
> are not often useful in the particular case.

True.  But copying useless data is also lame.

> Also, most of the segment and file sizes are configurable, and we have
> had reports of people venturing into much larger file sizes.

But if I understand the context correctly, we're not transferring relation
data files this way anyway.  If we do transfer WAL files this way, we
could make sure to set the cutoff larger than the WAL segment size.

regards, tom lane




Re: [HACKERS] Error while copying a large file in pg_rewind

2017-07-03 Thread Peter Eisentraut
On 7/3/17 09:53, Tom Lane wrote:
> Hm.  Before we add a bunch of code to deal with that, are we sure we
> *want* it to copy such files?  Seems like that's expending a lot of
> data-transfer work for zero added value --- consider e.g. a server
> with a bunch of old core files laying about in $PGDATA.  Given that
> it's already excluded all database-data-containing files, maybe we
> should just set a cap on the plausible size of auxiliary files.

It seems kind of lame to fail on large files these days, even if they
are not often useful in the particular case.

Also, most of the segment and file sizes are configurable, and we have
had reports of people venturing into much larger file sizes.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Revisiting NAMEDATALEN

2017-07-03 Thread Emrul
Yes, for the example given in the Reddit post I would tend to agree.  This is
one of those issues where, for the most part, the solution is better naming
conventions, but for the few instances where this isn't possible it is a
right pain.








Re: [HACKERS] Revisiting NAMEDATALEN

2017-07-03 Thread Joshua D. Drake

On 07/03/2017 11:31 AM, Emrul wrote:

Hi hackers,

This question came up again on Reddit:
https://www.reddit.com/r/PostgreSQL/comments/6kyyev/i_have_hit_the_table_name_length_limit_a_number/
and I thought I'd echo it here.

I am totally on board with short, descriptive names and a good convention.
However, there are just so many cases where 63 characters aren't enough to
describe a column adequately.  I've been on projects where we have
one table, maybe with only a few thousand records, but hundreds of columns,
each uniquely describing an attribute of the record.  It is a challenge
bordering on impossible to fit them into a consistently named field of <63
characters that someone can later refer to and know what piece of
information it actually refers to.

Is this something that can be revisited for an upcoming release? Also, are
there any technical problems that would be created by increasing this
attribute?


Although I appreciate the sentiment, this seems over the top:

datasystem_adjustmentmanagement_mm_datasystem_adjustmentmanagement_products

You can always use COMMENT ON to explode the actual meaning.
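For example (names invented here), a shorter identifier plus a comment keeps
the full meaning queryable via psql's \d+ or the pg_description catalog:

    CREATE TABLE adjmgmt_mm_products (
        adjustment_id bigint NOT NULL,
        product_id    bigint NOT NULL
    );
    COMMENT ON TABLE adjmgmt_mm_products IS
        'datasystem adjustment-management <-> datasystem adjustment-management products (m:m link)';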

JD


--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc

PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
* Unless otherwise stated, opinions are my own.   *




[HACKERS] Revisiting NAMEDATALEN

2017-07-03 Thread Emrul
Hi hackers,

This question came up again on Reddit:
https://www.reddit.com/r/PostgreSQL/comments/6kyyev/i_have_hit_the_table_name_length_limit_a_number/
and I thought I'd echo it here.

I am totally on board with short, descriptive names and a good convention.
However, there are just so many cases where 63 characters aren't enough to
describe a column adequately.  I've been on projects where we have
one table, maybe with only a few thousand records, but hundreds of columns,
each uniquely describing an attribute of the record.  It is a challenge
bordering on impossible to fit them into a consistently named field of <63
characters that someone can later refer to and know what piece of
information it actually refers to.

Is this something that can be revisited for an upcoming release? Also, are
there any technical problems that would be created by increasing this
attribute?
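As a quick way to see the limit in action (the table name below is the one
quoted earlier in the thread), a default build truncates any identifier
longer than 63 bytes and raises a NOTICE rather than an error:

    -- Expect a NOTICE that the identifier will be truncated to its first 63 bytes.
    CREATE TABLE datasystem_adjustmentmanagement_mm_datasystem_adjustmentmanagement_products (
        id integer
    );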







Re: [HACKERS] Race conditions with WAL sender PID lookups

2017-07-03 Thread Alvaro Herrera
Michael Paquier wrote:

> Thanks Álvaro for pushing the patch. I had a second look and the
> result looks good to me.
> 
> -   SpinLockAcquire(&walsnd->mutex);
> +   }
> +   pid = walsnd->pid;
> The WAL receiver code used a cast to (int) in
> pg_stat_get_wal_receiver(). I don't recall adding it.

I added it because it made me a bit nervous to pass a pid_t to
DatumGetInt32.  This one is assigning to a variable of type pid_t, so it
doesn't need a cast.

I'm not too clear on using pid_t variables as int32 Datum-packed
variables.  We don't do it a lot in the backend code (I found some
occurrences in contrib, but these don't inspire me a lot of confidence.)

> Why not being consistent for both by removing the one of the WAL
> receiver code?

I can't think of any reason to remove that cast.  It serves as
documentation if nothing else -- it alerts the reader that something is
going on.

> > In passing, clean up some leftover braces which were used to create
> > unconditional blocks.  Once upon a time these were used for
> > volatile-izing accesses to those shmem structs, which is no longer
> > required.  Many other occurrences of this pattern remain.
> 
> Here are the places where a cleanup can happen:
> - WalSndSetState
> - ProcessStandbyReplyMessage
> - XLogRead, 2 places
> - XLogSendLogical
> - WalRcvWaitForStartPosition
> - WalRcvDie
> - XLogWalRcvFlush
> - ProcessWalSndrMessage
> In most of the places of the WAL sender, braces could be removed to
> improve the style. For the WAL receiver, declarations are not
> necessary. As a matter of style, why not cleaning up just the WAL
> sender stuff? Changing the WAL receiver code just to remove some
> declarations would not improve readability, and would make back-patch
> more difficult.

I think we should clean this up whenever we're modifying the surrounding
code, but otherwise we can leave well enough alone.  It's not a high
priority item at any rate.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

2017-07-03 Thread Chapman Flack
On 07/03/2017 09:39 AM, Heikki Linnakangas wrote:

> Hmm. That's not the problem, though. Imagine that instead of the loop
> above, you do just:
> 
> WALInsertLockUpdateInsertingAt(CurrPos);
> AdvanceXLInsertBuffer(EndPos, false);
> 
> AdvanceXLInsertBuffer() will call XLogWrite(), to flush out any pages
> before EndPos, to make room in the wal_buffers for the new pages. Before
> doing that, it will call WaitXLogInsertionsToFinish()

Although it's moot in the straightforward approach of re-zeroing in
the loop, it would still help my understanding of the system to know
if there is some subtlety that would have broken what I proposed
earlier, which was an extra flag to AdvanceXLInsertBuffer() that
would tell it not only to skip initializing headers, but also to
skip the WaitXLogInsertionsToFinish() check ... because I have
the entire region reserved and I hold all the writer slots
at that moment, it seems safe to assure AdvanceXLInsertBuffer()
that there are no outstanding writes to wait for.

I suppose it's true there's not much performance to gain; it would
save a few pairs of lock operations per empty page to be written,
but then, the more empty pages there are at the time of a log switch,
the less busy the system is, so the less it matters.

-Chap




Re: [HACKERS] Request more documentation for incompatibility of parallelism and plpgsql exec_run_select

2017-07-03 Thread Simon Riggs
On 30 June 2017 at 05:14, Amit Kapila  wrote:

> This is explained in section 15.2 [1], refer below para:
> "The query might be suspended during execution. In any situation in
> which the system thinks that partial or incremental execution might
> occur, no parallel plan is generated. For example, a cursor created
> using DECLARE CURSOR will never use a parallel plan. Similarly, a
> PL/pgSQL loop of the form FOR x IN query LOOP .. END LOOP will never
> use a parallel plan, because the parallel query system is unable to
> verify that the code in the loop is safe to execute while parallel
> query is active."

Can you explain "unable to verify that the code in the loop is safe to
execute while parallel query is active". Surely we aren't pushing code
in the loop into the actual query, so the safety of command in the FOR
loop has nothing to do with the parallel safety of the query.

Please give an example of something that would be unsafe? Is that
documented anywhere, README etc?

FOR x IN query LOOP .. END LOOP
seems like a case that would be just fine, since we're going to loop
thru every row or break early.

Surely NO SCROLL and WITH HOLD cursors would work fine?

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Dumping database creation options and ACLs

2017-07-03 Thread Rafael Martinez


On 06/29/2017 06:30 PM, Adrien Nayrat wrote:

> As reported by Ronan there's no other option than using pg_dumpall to restore
> database options and ACLs.
> 
> So, we use this trick to stop pg_dumpall before \connect and then use 
> pg_restore:
> 
> pg_dumpall -s | sed -rn '/^\\connect/{q}; p' > database+grants.sql
> 
> 
> Of course, it is not graceful as we just need results of pg_dumpall -g and 
> what
> the dumpCreateDB() function outputs.
> 
> What do you think about adding an option like --createdb-only (as suggested by
> Ronan) for this?  I'm not fully satisfied with this name though, I'll be happy
> if you have a better suggestion.
> 

Hello

We had a discussion about this some time ago, and we created a wiki page
where we tried to write down some ideas/proposals and links to threads
discussing the subject:

https://wiki.postgresql.org/wiki/Pg_dump_improvements

regards,
-- 
 Rafael Martinez Guerrero
 Center for Information Technology
 University of Oslo, Norway

 PGP Public Key: http://folk.uio.no/rafael/




Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

2017-07-03 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hi Craig,

Thanks for the response.

The scenario tried here is restarting the system multiple times. The sh-QUIT
core is generated when Postgres invokes the shell to exit, and may not be due
to kernel or file system issues. I will try to reproduce the issue with the
dmesg output captured.

However, is there any instance in Postgres where 'sh -c exit 1' will be invoked?

Regards,
Sandhya

-Original Message-
From: Craig Ringer [mailto:cr...@2ndquadrant.com] 
Sent: Friday, June 30, 2017 5:40 PM
To: K S, Sandhya (Nokia - IN/Bangalore) 
Cc: pgsql-hackers@postgresql.org; pgsql-b...@postgresql.org; T, Rasna (Nokia - 
IN/Bangalore) ; Itnal, Prakash (Nokia - IN/Bangalore) 

Subject: Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

On 30 June 2017 at 17:41, K S, Sandhya (Nokia - IN/Bangalore)
 wrote:

> When we checked the process listing during the time of core generation, we
> found Postgres startup process is invoking “sh -c exit 1”:
> 4518  9249  0.1  0.0 155964  2036 ?Ss   15:05   0:00 postgres:
> startup process   waiting for 000102EB

Looks like an archive_command or restore_command .

If 'sh' is dumping core, you probably have issues at a low level in
the kernel, file system, etc. Check dmesg.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Request more documentation for incompatibility of parallelism and plpgsql exec_run_select

2017-07-03 Thread Robert Haas
On Fri, Jun 30, 2017 at 9:32 AM, Mark Dilger  wrote:
>> On Jun 29, 2017, at 8:55 PM, Mark Dilger  wrote:
>> Changing myfunc to create a temporary table, to execute the sql to populate
>> that temporary table, and to then loop through the temporary table's rows
>> fixes the problem.  For the real-world example where I hit this, that single
>> change decreases the runtime from 13.5 seconds to 2.5 seconds.
>
> Actually, this is wrong.  On further review, by the time I had changed the
> function definition, the data in the test server I was querying had likely 
> changed
> enough for the performance to change.  That duped me into thinking I had
> found a work-around.  I can no longer reproduce any performance improvement
> in this manner.

Also note that temporary tables are yet another thing that doesn't
work with parallel query yet.
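A quick way to confirm that on a scratch database (the table and setting
values here are only for illustration) is to look at the plan of a query on
a temporary table; no Gather node should appear even with parallelism
enabled:

    CREATE TEMP TABLE t AS SELECT g AS x FROM generate_series(1, 1000000) g;
    SET max_parallel_workers_per_gather = 4;
    EXPLAIN SELECT count(*) FROM t;   -- expected: a plain, non-parallel aggregate plan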

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Dumping database creation options and ACLs

2017-07-03 Thread Robert Haas
On Thu, Jun 29, 2017 at 12:30 PM, Adrien Nayrat
 wrote:
> As reported by Ronan there's no other option than using pg_dumpall to restore
> database options and ACLs.
>
> So, we use this trick to stop pg_dumpall before \connect and then use 
> pg_restore:
>
> pg_dumpall -s | sed -rn '/^\\connect/{q}; p' > database+grants.sql
>
>
> Of course, it is not graceful as we just need results of pg_dumpall -g and 
> what
> the dumpCreateDB() function outputs.
>
> What do you think about adding an option like --createdb-only (as suggested by
> Ronan) for this?  I'm not fully satisfied with this name though, I'll be happy
> if you have a better suggestion.
>
> Attached a naive patch.

Note that some progress has been made on the CURRENT_DATABASE thing:

https://www.postgresql.org/message-id/caf3+xm+xsswcwqzmp1cjj12gpz8dxhcm9_ft1y-0fvzxi9p...@mail.gmail.com

I tend to favor that approach myself, although one point in favor of
your suggestion is that adding another flag to pg_dumpall is a heck of
a lot less work to get to some kind of solution to this issue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Default Partition for Range

2017-07-03 Thread Dilip Kumar
On Fri, Jun 30, 2017 at 11:56 AM, Beena Emerson  wrote:
> Hello Dilip,
>
> On Wed, Jun 21, 2017 at 6:27 PM, Dilip Kumar  wrote:
>> On Tue, Jun 20, 2017 at 6:57 PM, Dilip Kumar  wrote:
>>> This is basically crashing in RelationBuildPartitionDesc, so I think

>
> Thank you for your review and analysis.
>
> I have updated the patch.
> - bound->content is set to RANGE_DATUM_DEFAULT for each of the keys
> and not just the first one.
> - Improve the way of handling DEFAULT bounds in RelationBuildPartitionDesc,
>
> There is a test for multiple column range in alter_table.sql

Thanks for the updated patch.  After this fix, the segmentation fault is solved.

I have some more comments.

1.

lower = make_one_range_bound(key, -1, spec->lowerdatums, true);
  upper = make_one_range_bound(key, -1, spec->upperdatums, false);

+ /* Default case is not handled here */
+ if (spec->is_default)
+ break;
+

I think we can move the default check to just above the make_one_range_bound
calls; that will avoid unnecessary processing.


2.
RelationBuildPartitionDesc
{


   /*
* If either of them has infinite element, we can't equate
* them.  Even when both are infinite, they'd have
* opposite signs, because only one of cur and prev is a
* lower bound).
*/
if (cur->content[j] != RANGE_DATUM_FINITE ||
  prev->content[j] != RANGE_DATUM_FINITE)
{
is_distinct = true;
break;
}
After the default range partitioning patch, the above comment doesn't sound
very accurate and can lead to confusion: here we are not only considering
infinite bounds, but there is also the possibility that the prev bound is
DEFAULT, so the comment should clearly mention that.

3.

+ bound->datums = datums ? (Datum *) palloc0(key->partnatts * sizeof(Datum))
+ : NULL;
  bound->content = (RangeDatumContent *) palloc0(key->partnatts *
sizeof(RangeDatumContent));
  bound->lower = lower;

  i = 0;
+
+ /* datums are NULL for default */
+ if (datums == NULL)
+ for (i = 0; i < key->partnatts; i++)
+ bound->content[i] = RANGE_DATUM_DEFAULT;

To me, it will look cleaner if we keep bound->content = NULL for the
default, instead of allocating memory and initializing each element.  But
it's just a suggestion and you can decide whichever way seems better to
you.  Then the other places, e.g.
+ if (cur->content[i] == RANGE_DATUM_DEFAULT)
+ {
+ /*

will change like

+ if (cur->content == NULL)
+ {
+ /*

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

2017-07-03 Thread Chapman Flack
On 07/03/2017 09:39 AM, Heikki Linnakangas wrote:

> The most straightforward solution would be to just clear each page with
> memset() in the loop. It's a bit wasteful to clear the page again, just
> after AdvanceXLInsertBuffer() has initialized it, but this isn't
> performance-critical.

And in that straightforward approach, I imagine it would suffice to
memset just the length of a (short) page header; the page content
is already zeroed, and there isn't going to be a switch at the very
start of a segment, so a long header won't be encountered ... will it?

-Chap




Re: [HACKERS] WIP patch for avoiding duplicate initdb runs during "make check"

2017-07-03 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Jul 3, 2017 at 9:25 PM, Greg Stark  wrote:
>> On 2 July 2017 at 18:33, Tom Lane  wrote:
>>> system("cp -a ...") call in favor of something more portable.

>> If we're ok with using Perl there's File::Copy::Recursive::dircopy()
>> which does exactly that.

> This stuff needs to support perl down to 5.8.0, and that's a reason
> behind having src/test/perl/RecursiveCopy.pm. So I would suggest just
> to use that. cp is not portable on Windows as well, that's a recipe
> for non-portable code there.

I can't see going this path in pg_regress, because then you would have
exactly zero test functionality in a non-Perl build.  What I had in
mind was a frontend-friendly version of backend/storage/file/copydir.c,
either just dropped into pg_regress.c or put in src/common/.

regards, tom lane




Re: [HACKERS] Error while copying a large file in pg_rewind

2017-07-03 Thread Tom Lane
Kuntal Ghosh  writes:
> pg_rewind throws the following error when there is a file of large
> size available in the Slave server's data directory.

Hm.  Before we add a bunch of code to deal with that, are we sure we
*want* it to copy such files?  Seems like that's expending a lot of
data-transfer work for zero added value --- consider e.g. a server
with a bunch of old core files laying about in $PGDATA.  Given that
it's already excluded all database-data-containing files, maybe we
should just set a cap on the plausible size of auxiliary files.

regards, tom lane




Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

2017-07-03 Thread Heikki Linnakangas

On 06/26/2017 04:20 AM, Chapman Flack wrote:

I notice CopyXLogRecordToWAL contains this loop (in the case where
the record being copied is a switch):

while (CurrPos < EndPos)
{
/* initialize the next page (if not initialized already) */
WALInsertLockUpdateInsertingAt(CurrPos);
AdvanceXLInsertBuffer(CurrPos, false);
CurrPos += XLOG_BLCKSZ;
}

in which it calls, one page at a time, AdvanceXLInsertBuffer, which contains
its own loop able to do a sequence of pages. A comment explains why:

/*
 * We do this one page at a time, to make sure we don't deadlock
 * against ourselves if wal_buffers < XLOG_SEG_SIZE.
 */

I want to make sure I understand what the deadlock potential is
in this case. AdvanceXLInsertBuffer will call WaitXLogInsertionsToFinish
before writing any dirty buffer, and we do hold insertion slot locks
(all of 'em, in the case of a log switch, because that makes
XlogInsertRecord call WALInsertLockAcquireExclusive instead of just
WALInsertLockAcquire for other record types).

Does not the fact we hold all the insertion slots exclude the possibility
that any dirty buffer (preceding the one we're touching) needs to be checked
for in-flight insertions?


Hmm. That's not the problem, though. Imagine that instead of the loop 
above, you do just:


WALInsertLockUpdateInsertingAt(CurrPos);
AdvanceXLInsertBuffer(EndPos, false);

AdvanceXLInsertBuffer() will call XLogWrite(), to flush out any pages 
before EndPos, to make room in the wal_buffers for the new pages. Before 
doing that, it will call WaitXLogInsertionsToFinish() to wait for any 
insertions to those pages to be completed. But the backend itself is 
advertising the insertion position CurrPos, and it will therefore wait 
for itself, forever.



I've been thinking along the lines of another parameter to
AdvanceXLInsertBuffer to indicate when the caller is exactly this loop
filling out the tail after a log switch (originally, to avoid filling
in page headers). It now seems to me that, if AdvanceXLInsertBuffer
has that information, it could also be safe for it to skip the
WaitXLogInsertionsToFinish in that case. Would that eliminate the
deadlock potential, and allow the loop in CopyXLogRecordToWAL to be
replaced with a single call to AdvanceXLInsertBuffer and a single
WALInsertLockUpdateInsertingAt ?

Or have I overlooked some other subtlety?


The most straightforward solution would be to just clear each page with 
memset() in the loop. It's a bit wasteful to clear the page again, just 
after AdvanceXLInsertBuffer() has initialized it, but this isn't 
performance-critical.
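
In code form, that straightforward approach would look roughly like this
(a sketch only; GetXLogBuffer() is assumed to return the address of CurrPos
within the WAL buffers, as it does on the normal insertion path):

while (CurrPos < EndPos)
{
    char   *currpos;

    /* initialize the next page (if not initialized already) */
    WALInsertLockUpdateInsertingAt(CurrPos);
    AdvanceXLInsertBuffer(CurrPos, false);

    /* re-zero the whole page, wiping the header that was just written */
    currpos = GetXLogBuffer(CurrPos);
    MemSet(currpos, 0, XLOG_BLCKSZ);

    CurrPos += XLOG_BLCKSZ;
}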


- Heikki





Re: [HACKERS] user-based query white list

2017-07-03 Thread Euler Taveira
2017-07-03 3:11 GMT-03:00 Tim Burgan :

>
> Since then, is it now possible to configure a user to only be able to
> execute a limited white-listing of queries? Is this something that could
> now be implemented through extensions?
>

Since the pg_stat_statements infrastructure was added, it has been possible to
create extensions that prohibit query execution for certain users (see
sql_firewall [1] as an example).


[1] https://github.com/uptimejp/sql_firewall


-- 
   Euler Taveira   Timbira -
http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento



Re: [HACKERS] WIP patch for avoiding duplicate initdb runs during "make check"

2017-07-03 Thread Michael Paquier
On Mon, Jul 3, 2017 at 10:02 PM, Alvaro Herrera
 wrote:
> Tom Lane wrote:
>
>> (Other unfinished work: teaching the MSVC scripts to use this,
>> and teaching pg_upgrade's test script to use it.)
>
> Maybe it'd be simpler to rewrite pg_upgrade tests using PostgresNode
> instead?

You are looking for this patch:
https://commitfest.postgresql.org/14/1101/
And for this thread:
https://www.postgresql.org/message-id/flat/cab7npqrdan1a1ynjxnl9t1juewct8ttqq29dnv8w_o37+e8...@mail.gmail.com
-- 
Michael




Re: [HACKERS] Error while copying a large file in pg_rewind

2017-07-03 Thread Michael Paquier
On Mon, Jul 3, 2017 at 8:22 PM, Kuntal Ghosh  wrote:
> pg_rewind throws the following error when there is a file of large
> size available in the Slave server's data directory.

Oops.

> I guess we've to change the data type to bigint. Also, we need some
> implementation of ntohl() for 8-byte data types. I've attached a
> script to reproduce the error and a draft patch.

pg_basebackup/ with fe_recvint64() has its own way to do things, as
do the large-object routines in libpq. I would think that at least on
HEAD these could be gathered into a small set of routines. It is
annoying to have a third copy of the same thing. OK, that's not
invasive, but src/common/ would be a nice place to put them.

-if (PQgetlength(res, 0, 1) != sizeof(int32))
+if (PQgetlength(res, 0, 1) != sizeof(long long int))
This had better be int64.
-- 
Michael




Re: [HACKERS] WIP patch for avoiding duplicate initdb runs during "make check"

2017-07-03 Thread Alvaro Herrera
Tom Lane wrote:

> (Other unfinished work: teaching the MSVC scripts to use this,
> and teaching pg_upgrade's test script to use it.)

Maybe it'd be simpler to rewrite pg_upgrade tests using PostgresNode
instead?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2017-07-03 Thread Fabien COELHO



The number of retries and maybe failures should be counted, maybe with
some adjustable maximum, as suggested.


If we fix the maximum number of attempts, the maximum number of failures
for one script execution will be bounded above by
(number_of_transactions_in_script * maximum_number_of_attempts). Do you
think we should add an option to the program to limit this number further?


Probably not. I think that there should be a configurable maximum of
retries on a transaction, which may be 0 by default if we want to be
upward compatible with the current behavior, or maybe something else.


I propose the option --max-attempts-number=NUM, where NUM cannot be less than
1. I propose it because I think that, for example, --max-attempts-number=100
is better than --max-retries-number=99. And maybe it's better to set its
default value to 1 too, because retrying shell commands can produce new
errors.


Personnaly, I like counting retries because it also counts the number of 
time the transaction actually failed for some reason. But this is a 
marginal preference, and one can be switchted to the other easily.


--
Fabien.




Re: [HACKERS] WIP patch for avoiding duplicate initdb runs during "make check"

2017-07-03 Thread Michael Paquier
On Mon, Jul 3, 2017 at 9:25 PM, Greg Stark  wrote:
> On 2 July 2017 at 18:33, Tom Lane  wrote:
>> system("cp -a ...") call in favor of something more portable.
>
> If we're ok with using Perl there's File::Copy::Recursive::dircopy()
> which does exactly that.

This stuff needs to support Perl down to 5.8.0, and that's a reason
behind having src/test/perl/RecursiveCopy.pm. So I would suggest just
using that. cp is not portable on Windows either, so that's a recipe
for non-portable code there.
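
For concreteness, a minimal sketch of using the in-tree module (the paths
here are made up for illustration only):

use lib 'src/test/perl';
use RecursiveCopy;

# Recursively copy a whole directory tree, portably.
RecursiveCopy::copypath('tmp_check/data_template', 'tmp_check/data_copy');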
-- 
Michael




Re: [HACKERS] Error-like LOG when connecting with SSL for password authentication

2017-07-03 Thread Michael Paquier
On Mon, Jul 3, 2017 at 9:02 PM, Heikki Linnakangas  wrote:
> Thanks, I've pushed the backend read part of this patch.

Thanks.
-- 
Michael




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2017-07-03 Thread Amit Kapila
On Thu, Mar 23, 2017 at 1:18 PM, Ashutosh Sharma 
wrote:
>
> *Conclusion:*
> As seen from the test results mentioned above, there is some performance
> improvement with 3 SP(s), with 5 SP(s) the results with patch is slightly
> better than HEAD, with 7 and 10 SP(s) we do see regression with patch.
> Therefore, I think the threshold value of 4 for number of subtransactions
> considered in the patch looks fine to me.
>
>
Thanks for the tests.  Attached find the rebased patch on HEAD.  I have run
the latest pgindent on the patch.  I have yet to add a wait event for group lock
waits in this patch, as is done by Robert in commit
d4116a771925379c33cf4c6634ca620ed08b551d for ProcArrayGroupUpdate.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


group_update_clog_v12.patch
Description: Binary data



Re: [HACKERS] WIP patch for avoiding duplicate initdb runs during "make check"

2017-07-03 Thread Greg Stark
On 2 July 2017 at 18:33, Tom Lane  wrote:
> system("cp -a ...") call in favor of something more portable.

If we're ok with using Perl there's File::Copy::Recursive::dircopy()
which does exactly that.

-- 
greg




Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2017-07-03 Thread Marina Polyakova

The current error handling is either "close connection" or maybe in
some cases even "exit". If this is changed, then the client may
continue execution in some unforseen state and behave unexpectedly.
We'll see.


Thanks, now I understand this.


ISTM that the retry implementation should be implemented somehow in
the automaton, restarting the same script for the beginning.


If there are several transactions in this script - don't you think 
that we should restart only the failed transaction?..


On some transaction failures based on their status. My point is that
the retry process must be implemented clearly with a new state in the
client automaton. Exactly when the transition to this new state must
be taken is another issue.


About it, I agree with you that it should be done in this way.

The number of retries and maybe failures should be counted, maybe with
some adjustable maximum, as suggested.


If we fix the maximum number of attempts, the maximum number of
failures for one script execution will be bounded above by
(number_of_transactions_in_script * maximum_number_of_attempts). Do
you think we should add an option to the program to limit this number
further?


Probably not. I think that there should be a configurable maximum of
retries on a transaction, which may be 0 by default if we want to be
upward compatible with the current behavior, or maybe something else.


I propose the option --max-attempts-number=NUM, where NUM cannot be less
than 1. I propose it because I think that, for example,
--max-attempts-number=100 is better than --max-retries-number=99. And
maybe it's better to set its default value to 1 too, because retrying
shell commands can produce new errors.



In doLog, added columns should be at the end of the format.


I have inserted it earlier because these columns are not optional. Do 
you think they should be optional?


I think that new non-optional columns should be at the end of the
existing non-optional columns so that existing scripts which may
process the output may not need to be updated.


Thanks, I agree with you :)


I'm not sure that there should be an new option to report failures,
the information when relevant should be integrated in a clean format
into the existing reports... Maybe the "per command latency"
report/option should be renamed if it becomes more general.


I have tried not to change other parts of the program as much as possible.
But if you think that it will be more useful to change the option, I'll
do it.


I think that the option should change if its naming becomes less
relevant, which is to be determined. AFAICS, ISTM that new measures
should be added to the various existing reports unconditionally (i.e.
without a new option), so maybe no new option would be needed.


Thanks! I didn't think about it in this way..

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Error-like LOG when connecting with SSL for password authentication

2017-07-03 Thread Heikki Linnakangas

On 05/24/2017 05:29 PM, Michael Paquier wrote:

Attached is an updated patch.


Thanks, I've pushed the backend read part of this patch. That's enough 
to fix the original complaint with password authentication. I think the 
rest was a bit dubious, and I'm hesitant to commit that (or at least to 
backport):


* Backend write: you wouldn't really expect a client to disconnect, 
while the backend is trying to send something to it. You'll get an error 
in the non-SSL case too, although I guess the error message would be 
different.


* Frontend read: pqReadData has this, after calling pqsecure_read:


/*
 * A return value of 0 could mean just that no data is now available, or
 * it could mean EOF --- that is, the server has closed the connection.
 * Since we have the socket in nonblock mode, the only way to tell the
 * difference is to see if select() is saying that the file is ready.
 * Grumble.  Fortunately, we don't expect this path to be taken much,
 * since in normal practice we should not be trying to read data unless
 * the file selected for reading already.
 *
 * In SSL mode it's even worse: SSL_read() could say WANT_READ and then
 * data could arrive before we make the pqReadReady() test, but the 
second
 * SSL_read() could still say WANT_READ because the data received was 
not
 * a complete SSL record.  So we must play dumb and assume there is more
 * data, relying on the SSL layer to detect true EOF.
 */

#ifdef USE_SSL
if (conn->ssl_in_use)
return 0;
#endif


* Frontend write: Same as in the backend, I think having the server 
disconnect while you're trying to send to it is not expected. Not sure 
if the error message is here again different, but seems best to just 
leave it alone.


- Heikki




Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

2017-07-03 Thread Ashutosh Sharma
Hi,

On Mon, Jul 3, 2017 at 4:10 PM, Amit Kapila  wrote:
>
> On Mon, Jul 3, 2017 at 3:24 PM, Ashutosh Sharma  wrote:
> > On Mon, Jul 3, 2017 at 9:12 AM, Amit Kapila  wrote:
> >> The basic issue was that the WAL logging for Create Index operation
> >> was oblivious to the fact that for unlogged tables only INIT forks need
> >> to be logged.  Another point which we need to consider is that while
> >> replaying the WAL for the create index operation, we need to flush the
> >> buffer if it is for init fork.  This needs to be done only for pages
> >> that can be part of init fork file (like metapage, bitmappage).
> >> Attached patch fixes the issue.
> >>
> >> Another approach to fix the issue could be that always log complete
> >> pages for the create index operation on unlogged tables (in
> >> hashbuildempty).  However, the logic for initial hash index pages
> >> needs us to perform certain other actions (like update metapage after
> >> the creation of bitmappage) which make it difficult to follow that
> >> approach.
> >>
> >
> > Thanks for sharing the steps to reproduce the issue in detail. I could
> > easily reproduce this issue. I also had a quick look into the patch and the
> > fix looks fine to me. However, it would be good if we can add at least one
> > test for unlogged table with hash index in the regression test suite.
> >
>
> There is already such a test, see create_index.sql.
> CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
> CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id
> int4_ops);
>
> Do you have something else in mind?

Yes, that's what i had in my mind. But, good to know that we already
have a test-case with hash index created on an unlogged table.

>
>
> I think the problem mentioned in this thread can be caught only if we
> promote the standby and restart it.  We might be able to write it
> using TAP framework, but I think such a test should exist for other
> index types as well or in general for unlogged tables.  I am not
> opposed to writing such a test if you and/or others feel so.

I think, it would be a good thing to add such test cases using TAP
framework but as you said, it would be good to take others opinion as
well on this. Thanks.

>
> > Apart from that i have tested the patch and the patch seems to be working
> > fine.
>
> Thanks.
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2017-07-03 Thread Fabien COELHO


Hello Marina,


I agree that improving the error handling ability of pgbench is a good
thing, although I'm not sure about the implications...


Could you tell a little bit more exactly.. What implications are you worried 
about?


The current error handling is either "close connection" or maybe in some 
cases even "exit". If this is changed, then the client may continue 
execution in some unforeseen state and behave unexpectedly. We'll see.



ISTM that the retry implementation should be implemented somehow in
the automaton, restarting the same script for the beginning.


If there are several transactions in this script - don't you think that we 
should restart only the failed transaction?..


On some transaction failures based on their status. My point is that the 
retry process must be implemented clearly with a new state in the client 
automaton. Exactly when the transition to this new state must be taken is 
another issue.



The number of retries and maybe failures should be counted, maybe with
some adjustable maximum, as suggested.


If we fix the maximum number of attempts, the maximum number of failures for
one script execution will be bounded above by (number_of_transactions_in_script
* maximum_number_of_attempts). Do you think we should add an option to the
program to limit this number further?


Probably not. I think that there should be a configurable maximum of 
retries on a transaction, which may be 0 by default if we want to be 
upward compatible with the current behavior, or maybe something else.



In doLog, added columns should be at the end of the format.


I have inserted it earlier because these columns are not optional. Do you 
think they should be optional?


I think that new non-optional columns should be at the end of the
existing non-optional columns so that existing scripts which may process
the output may not need to be updated.



I'm not sure that there should be an new option to report failures,
the information when relevant should be integrated in a clean format
into the existing reports... Maybe the "per command latency"
report/option should be renamed if it becomes more general.


I have tried not to change other parts of the program as much as possible. But if
you think that it will be more useful to change the option, I'll do it.


I think that the option should change if its naming becomes less relevant, 
which is to be determined. AFAICS, ISTM that new measures should be added 
to the various existing reports unconditionally (i.e. without a new
option), so maybe no new option would be needed.


--
Fabien.




[HACKERS] Error while copying a large file in pg_rewind

2017-07-03 Thread Kuntal Ghosh
Hello all,

pg_rewind throws the following error when there is a file of large
size available in the Slave server's data directory.

unexpected result while sending file list: ERROR:  value "214800"
is out of range for type integer
CONTEXT:  COPY fetchchunks, line 2402, column begin: "214800"

How to reproduce

1. Set up replication between Server A(master) and Server B(slave)

2. Promote the slave server(Server B )

3. Stop the old master (Server A)

4. Create a large file in the newly promoted master's (Server B) data
directory using the below command

 dd if=/dev/zero of=large.file bs=1024 count=400

 [root@localhost data]# dd if=/dev/zero of=large.file bs=1024 count=400
400+0 records in
400+0 records out
409600 bytes (4.1 GB) copied, 8.32263 s, 492 MB/s

5. Execute pg_rewind command from old master(server A)

 ./pg_rewind -D /home/enterprisedb/master/ --debug --progress
--source-server="port=5661 user=enterprisedb dbname=edb"

IMHO, it seems to be a bug in pg_rewind.

As mentioned in the pg_rewind documentation, there are a few files which are
copied in whole.

"Copy all other files such as pg_xact and configuration files from the
source cluster to the target cluster (everything except the relation
files)." -- https://www.postgresql.org/docs/devel/static/app-pgrewind.html

Those files are copied in max CHUNKSIZE(default 100) bytes at a
time. In the process, pg_rewind creates a table with the following
schema and loads information about blocks that need to be copied.

CREATE TEMPORARY TABLE fetchchunks(path text, begin int4, len int4);

postgres=# select * from fetchchunks where begin != 0;
  path   |  begin   |   len
-+--+-
 pg_wal/00010002 |  100 | 100
 pg_wal/00010002 |  200 | 100
 pg_wal/00010002 |  300 | 100
 pg_wal/00010002 |  400 | 100
..
and so on.

The range for begin is -2147483648 to +2147483647. For a 4GB
file, begin definitely goes beyond 2147483647 and it throws the
following error:
unexpected result while sending file list: ERROR:  value "214800"
is out of range for type integer
CONTEXT:  COPY fetchchunks, line 2659, column begin: "214800"

I guess we've to change the data type to bigint. Also, we need some
implementation of ntohl() for 8-byte data types. I've attached a
script to reproduce the error and a draft patch.
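
For what it's worth, the usual way to build the missing 8-byte ntohl() is to
convert the two 4-byte halves and recombine them; a self-contained sketch
follows (the name recv_int64 is made up here, and this is not the attached
draft patch), with the fetchchunks "begin" column presumably becoming int8:

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Convert an 8-byte big-endian value, as sent over the wire, to host order. */
static int64_t
recv_int64(const char *buf)
{
    uint32_t    hi;
    uint32_t    lo;

    memcpy(&hi, buf, 4);
    memcpy(&lo, buf + 4, 4);
    return ((int64_t) ntohl(hi) << 32) | ntohl(lo);
}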

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


standby-server-setup.sh
Description: Bourne shell script


fix_copying_large_file_pg_rewind_v1.patch
Description: application/download



Re: [BUGS] [HACKERS] Segmentation fault in libpq

2017-07-03 Thread Michal Novotny



On 07/03/2017 04:58 AM, Craig Ringer wrote:

On 3 July 2017 at 03:12, Andres Freund  wrote:

Hi,

On 2017-07-02 20:58:52 +0200, Michal Novotný wrote:

thank you all for your advice. I've been investigating this a little more
and finally it turned out it's not a bug in libpq, although I got confused
by going deep into several libpq functions. The bug was really on our side:
we were trying to use the connection pointer after calling PQfinish(). The code
is pretty complex so it took some time to investigate; however, I would like
to apologize for "blaming" libpq instead of our code.

Usually using a tool like valgrind is quite helpful to find issues like
that, because it'll show you the call-stack accessing the memory and
*also* the call-stack that lead to the memory being freed.

Yep, huge help.

BTW, on Windows, the free tool DrMemory (now 64-bit too, yay) or
commercial Purify work great.


Well, good to know about the Windows stuff; however, we use Linux so that's
not a big deal. Unfortunately it's easy to miss something in valgrind if
you have one multi-threaded library linked to libpq and this
multi-threaded library is used in conjunction with other libraries
sharing some of the data among them.


Thanks once again,
Michal

--
Michal Novotny
System Development Lead
michal.novo...@greycortex.com

GREYCORTEX s.r.o.
Purkynova 127, 61200 Brno
Czech Republic
www.greycortex.com





Re: [BUGS] [HACKERS] Segmentation fault in libpq

2017-07-03 Thread Michal Novotny



On 07/02/2017 09:12 PM, Andres Freund wrote:

Hi,

On 2017-07-02 20:58:52 +0200, Michal Novotný wrote:

thank you all for your advice. I've been investigating this a little more
and finally it turned out it's not a bug in libpq, although I got confused
by going deep into several libpq functions. The bug was really on our side:
we were trying to use the connection pointer after calling PQfinish(). The code
is pretty complex so it took some time to investigate; however, I would like
to apologize for "blaming" libpq instead of our code.

Usually using a tool like valgrind is quite helpful to find issues like
that, because it'll show you the call-stack accessing the memory and
*also* the call-stack that lead to the memory being freed.

- Andres


Well, I've tried, but I was unable to locate the issue, so I had to
investigate the code a little further and finally I've been able to
find the issue.


Thanks again,
Michal


--
Michal Novotny
System Development Lead
michal.novo...@greycortex.com

GREYCORTEX s.r.o.
Purkynova 127, 61200 Brno
Czech Republic
www.greycortex.com





[HACKERS] user-based query white list

2017-07-03 Thread Tim Burgan
This old thread on "user-based query white list" is now nearly 10 years old!
http://grokbase.com/t/postgresql/pgsql-hackers/08c6zh42fa/user-based-query-white-list

Since then, is it now possible to configure a user to only be able to
execute a limited white-listing of queries? Is this something that could
now be implemented through extensions?


Re: [HACKERS] A bug in mapping attributes in ATExecAttachPartition()

2017-07-03 Thread Ashutosh Bapat
Thanks for working on the previous comments. The code really looks good now.

On Fri, Jun 23, 2017 at 2:29 PM, Amit Langote
 wrote:
>
>> Don't we need an exclusive lock to
>> make sure that the constraints are not changed while we are validating those?
>
> If I understand your question correctly, you meant to ask if we don't need
> the strongest lock on individual partitions while looking at their
> constraints to prove that we don't need to scan them.  We do and we do
> take the strongest lock on individual partitions even today in the second
> call to find_all_inheritors().  We're trying to eliminate the second call
> here.

The comment seems to imply that we need strongest lock only when we
"scan" the table/s.

Some more comments on 0001
- * Prevent circularity by seeing if rel is a partition of attachRel. (In
+ * Prevent circularity by seeing if rel is a partition of attachRel, (In
  * particular, this disallows making a rel a partition of itself.)
The sentence outside () doesn't have a full-stop. I think the original
construct was better.

+ * We want to avoid having to construct this list again, so we request the
"this list" is confusing here since the earlier sentence doesn't mention any
list at all. Instead we may reword it as "We will need the list of children
later to check whether any of those have a row which would not fit the
partition constraints. So, take the strongest lock ..."


 * XXX - Do we need to lock the partitions here if we already have the
 * strongest lock on attachRel?  The information we need here to check
 * for circularity cannot change without taking a lock on attachRel.
I wondered about this. Do we really need an exclusive lock to check whether
the partition constraint is valid? Maybe we can compare this condition with ALTER
TABLE ... ADD CONSTRAINT, since the children will all effectively get a new
constraint. So, exclusive lock it is.

Assert(linitial_oid(attachRel_children) ==
RelationGetRelid(attachRel));
if (attachRel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
attachRel_children = list_delete_first(attachRel_children);
Is it necessary for this code to have the OID of the relation being attached as the
first one? You could simply call list_delete_oid() instead of
list_delete_first(). If for any reason find_all_inheritors() changes the output
order, this assertion and code would need a change.

>
>> Comments on 0002 patch.
>> Thanks for the refactoring. The code looks really good now.
>
> Thanks.
>
>> The name skipPartConstraintValidation() looks very specific to the case at
>> hand. The function is really checking whether existing constraints on the 
>> table
>> can imply the given constraints (which happen to be partition constraints). 
>> How
>> about PartConstraintImpliedByRelConstraint()? The idea is to pick a general
>> name so that the function can be used for purposes other than skipping
>> validation scan in future.
>
> I liked this idea, so done.

+ * skipPartConstraintValidation
+PartConstraintImpliedByRelConstraint(Relation partrel, List *partConstraint)
Different function names in prologue and the definition.

>
>>  * This basically returns if the partrel's existing constraints, which
>> returns "true". Add "otherwise returns false".
>>
>> if (constr != NULL)
>> {
>> ...
>> }
>> return false;
>> May be you should return false when constr == NULL (I prefer !constr, but
>> others may have different preferences.). That should save us one level of
>> indentation. At the end just return whatever predicate_implied_by() returns.
>
> Good suggestion, done.

+if (predicate_implied_by(partConstraint, existConstraint, true))
+return true;
+
+/* Tough luck. */
+return false;
why not just return the return value of predicate_implied_by() as is?

With this we can actually handle constr == NULL a bit differently.
+if (constr == NULL)
+return false;
To be on the safer side, return false when partConstraint is not NULL. If the
relation constraint and partConstraint are both NULL, the first does imply the
other. Or maybe leave that decision to predicate_implied_by(), which takes
care of it right at the beginning of the function.
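
Put together, the shape being suggested would be roughly the following
(a hypothetical sketch, not the actual patch; building existConstraint from
the relation's CHECK constraints is elided):

static bool
PartConstraintImpliedByRelConstraint(Relation partrel, List *partConstraint)
{
    TupleConstr *constr = RelationGetDescr(partrel)->constr;
    List        *existConstraint = NIL;

    if (constr == NULL)
        return false;

    /* ... expand constr->check into existConstraint here ... */

    return predicate_implied_by(partConstraint, existConstraint, true);
}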

+ * For each leaf partition, check if it we can skip the validation
An extra "it".

+ * Note that attachRel's OID is in this list.  Since we already
+ * determined above that its validation scan cannot be skipped, we
+ * need not check that again in the loop below.  If it's partitioned,
I don't see code to skip checking whether the scan can be skipped for the relation
being attached. The loop below this comment executes for every unpartitioned
table in the list of OIDs returned. Thus for an unpartitioned relation being
attached, it will try to compare the constraints again. Am I correct?

+ * comparing it to similarly-processed 

Re: [HACKERS] [POC] hash partitioning

2017-07-03 Thread amul sul
On Fri, Jun 23, 2017 at 11:19 AM, Yugo Nagata  wrote:
> On Fri, 23 Jun 2017 13:41:15 +0900
> Yugo Nagata  wrote:
>
>> On Tue, 6 Jun 2017 13:03:58 +0530
>> amul sul  wrote:
>>
>>
>> > Updated patch attached.
>>
>> I looked into the latest patch (v13) and have some comments
>> although they might be trivial.
>
> One more comment:
>
> +   if (spec->remainder < 0)
> +   ereport(ERROR,
> +   (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
> +errmsg("remainder for hash partition must be a 
> non-negative integer")));
>
> The value of remainder is defined as Iconst in gram.y, so it can never be
> negative.
> Hence, I think this check is not necessary, or an Assert is enough.
>
Make sense, fixed this as well in the v14 patch. Thanks again.

Regards,
Amul




Re: [HACKERS] [POC] hash partitioning

2017-07-03 Thread amul sul
On Fri, Jun 23, 2017 at 10:11 AM, Yugo Nagata  wrote:
> On Tue, 6 Jun 2017 13:03:58 +0530
> amul sul  wrote:
>
>
>> Updated patch attached.
>
> I looked into the latest patch (v13) and have some comments
> although they might be trivial.
>
Thanks for your review.

> First, I couldn't apply this patch to the latest HEAD due to
> a documentation fix and pgindent updates. It needs a rebase.
>
> $ git apply /tmp/0002-hash-partitioning_another_design-v13.patch
> error: patch failed: doc/src/sgml/ref/create_table.sgml:87
> error: doc/src/sgml/ref/create_table.sgml: patch does not apply
> error: patch failed: src/backend/catalog/partition.c:76
> error: src/backend/catalog/partition.c: patch does not apply
> error: patch failed: src/backend/commands/tablecmds.c:13371
> error: src/backend/commands/tablecmds.c: patch does not apply
>
Fixed.

>
>
> +   Hash Partitioning
> +
> +   
> +
> + The table is partitioned by specifying modulus and remainder for 
> each
> + partition. Each partition holds rows for which the hash value of
> + partition keys when divided by specified modulus produces specified
> + remainder. For more clarification on modulus and remainder please 
> refer
> + .
> +
> +   
> +  
> +
> +  
> Range Partitioning
>
> I think this section should be inserted after List Partitioning section 
> because
> the order of the descriptions is Range, List, then Hash in other places of
> the documentation. At least,
>
Fixed in the attached version.

>
> -partition bounds.  Currently supported
> -partitioning methods include range and list, where each partition is
> -assigned a range of keys and a list of keys, respectively.
> +partition bounds.  The currently supported
> +partitioning methods are list, range, and hash.
> 
>
> Also in this hunk. I think "The currently supported partitioning methods are
> range, list, and hash." is better. We don't need to change the order of
> the original description.
>
Fixed in the attached version.

>
>
> 
> -Declarative partitioning only supports list and range partitioning,
> -whereas table inheritance allows data to be divided in a manner of
> -the user's choosing.  (Note, however, that if constraint exclusion is
> -unable to prune partitions effectively, query performance will be 
> very
> -poor.)
> +Declarative partitioning only supports hash, list and range
> +partitioning, whereas table inheritance allows data to be divided in 
> a
> +manner of the user's choosing.  (Note, however, that if constraint
> +exclusion is unable to prune partitions effectively, query 
> performance
> +will be very poor.)
>
> Similarly, I think "Declarative partitioning only supports range, list and 
> hash
> partitioning," is better.
>
Fixed in the attached version.

>
> +
> +  
> +   Create a hash partitioned table:
> +
> +CREATE TABLE orders (
> +order_id bigint not null,
> +cust_id  bigint not null,
> +status   text
> +) PARTITION BY HASH (order_id);
> +
> +
>
> This paragraph should be inserted between "Create a list partitioned table:"
> paragraph and "Create partition of a range partitioned table:" paragraph
> as well as range and list.
>
Fixed in the attached version.
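
(For readers following the syntax: partitions of such a table are then
declared with a modulus and a remainder; rows go to the partition whose
remainder matches the hash of order_id modulo the modulus.  A made-up
example, using the spelling this patch appears to propose:

CREATE TABLE orders_p0 PARTITION OF orders
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE orders_p1 PARTITION OF orders
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
)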

>
> *strategy = PARTITION_STRATEGY_LIST;
> else if (pg_strcasecmp(partspec->strategy, "range") == 0)
> *strategy = PARTITION_STRATEGY_RANGE;
> +   else if (pg_strcasecmp(partspec->strategy, "hash") == 0)
> +   *strategy = PARTITION_STRATEGY_HASH;
> else
> ereport(ERROR,
>
> In most of the code, the order is hash, range, then list, but only
> in transformPartitionSpec(), the order is list, range, then hash,
> as above. Maybe it is better to be uniform.
>
Make sense, fixed in the attached version.

>
> +   {
> +   if (strategy == PARTITION_STRATEGY_HASH)
> +   ereport(ERROR,
> +   
> (errcode(ERRCODE_UNDEFINED_OBJECT),
> +errmsg("data type %s 
> has no default hash operator class",
> +   
> format_type_be(atttype)),
> +errhint("You must 
> specify a hash operator class or define a default hash operator class for the 
> data type.")));
> +   else
> +   ereport(ERROR,
> +   
> (errcode(ERRCODE_UNDEFINED_OBJECT),
> +errmsg("data type %s 
> has no default btree operator class",
> + 

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

2017-07-03 Thread Amit Kapila
On Mon, Jul 3, 2017 at 3:24 PM, Ashutosh Sharma  wrote:
> On Mon, Jul 3, 2017 at 9:12 AM, Amit Kapila  wrote:
>> The basic issue was that the WAL logging for Create Index operation
> >> was oblivious to the fact that for unlogged tables only INIT forks need
>> to be logged.  Another point which we need to consider is that while
>> replaying the WAL for the create index operation, we need to flush the
>> buffer if it is for init fork.  This needs to be done only for pages
>> that can be part of init fork file (like metapage, bitmappage).
>> Attached patch fixes the issue.
>>
>> Another approach to fix the issue could be that always log complete
>> pages for the create index operation on unlogged tables (in
>> hashbuildempty).  However, the logic for initial hash index pages
>> needs us to perform certain other actions (like update metapage after
>> the creation of bitmappage) which make it difficult to follow that
>> approach.
>>
>
> Thanks for sharing the steps to reproduce the issue in detail. I could
> easily reproduce this issue. I also had a quick look into the patch and the
> fix looks fine to me. However, it would be good if we can add at least one
> test for unlogged table with hash index in the regression test suite.
>

There is already such a test, see create_index.sql.
CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id
int4_ops);

Do you have something else in mind?

I think the problem mentioned in this thread can be caught only if we
promote the standby and restart it.  We might be able to write it
using TAP framework, but I think such a test should exist for other
index types as well or in general for unlogged tables.  I am not
opposed to writing such a test if you and/or others feel so.

> Apart from that i have tested the patch and the patch seems to be working
> fine.

Thanks.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-07-03 Thread Amit Kapila
On Fri, Jun 23, 2017 at 3:22 AM, Robert Haas  wrote:
> On Thu, Jun 15, 2017 at 12:35 AM, Mithun Cy  
> wrote:
>
> * Instead of creating our own buffering system via buffer_file_write()
> and buffer_file_flush(), why not just use the facilities provided by
> the operating system?  fopen() et. al. provide buffering, and we have
> AllocateFile() to provide a FILE *; it's just like
> OpenTransientFile(), which you are using, but you'll get the buffering
> stuff for free.  Maybe there's some reason why this won't work out
> nicely, but off-hand it seems like it might.  It looks like you are
> already using AllocateFile() to read the dump, so using it to write
> the dump as well seems like it would be logical.
>

One thing that is worth considering is that AllocateFile is recommended to
be used for short operations.  Refer to the text atop AllocateFile().  If the
number of blocks to be dumped is large, then the file can remain open
for a significant period of time.
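
For illustration only, a sketch of the AllocateFile() route; the path and
the record format here are made up, and block_info_array/num_blocks are
simply the names used in the patch:

FILE   *file;
int     i;

file = AllocateFile("autoprewarm.blocks.tmp", PG_BINARY_W);
if (file == NULL)
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not open file \"%s\": %m",
                    "autoprewarm.blocks.tmp")));

/* stdio buffers these writes for us */
for (i = 0; i < num_blocks; i++)
    fprintf(file, "%u,%u,%u,%u,%u\n",
            block_info_array[i].database,
            block_info_array[i].tablespace,
            block_info_array[i].filenode,
            (uint32) block_info_array[i].forknum,
            block_info_array[i].blocknum);

FreeFile(file);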

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-07-03 Thread Amit Kapila
On Sun, Jul 2, 2017 at 10:32 PM, Mithun Cy  wrote:
> On Tue, Jun 27, 2017 at 11:41 AM, Mithun Cy  
> wrote:
>> On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown  wrote:
>>>
>>> Also, I find it a bit messy that launch_autoprewarm_dump() doesn't
>>> detect an autoprewarm process already running.  I'd want this to
>>> return NULL or an error if called for a 2nd time.
>>
>> We log instead of erroring, as we only check after launching the
>> worker and inside the worker. One solution could be that, similar to
>> autoprewarm_dump_now(), autoprewarm_start_worker() can init shared
>> memory and check whether we can launch the worker in the backend itself.
>> I will try to fix the same.
>
> I have fixed it now as follows

Few comments on the latest patch:

1.
+ LWLockRelease(&apw_state->lock);
+ if (!is_bgworker)
+ ereport(ERROR,
+ (errmsg("could not perform block dump because dump file is being used by PID %d",
+ apw_state->pid_using_dumpfile)));
+ ereport(LOG,
+ (errmsg("skipping block dump because it is already being performed by PID %d",
+ apw_state->pid_using_dumpfile)));


The above code looks confusing as both the messages are saying the
same thing in different words.  I think you can keep one message (probably
the first one) and decide the error level based on whether this is invoked
from the bgworker.  Also, you can move LWLockRelease after the error message,
because if there is any error, then it will automatically release all
lwlocks.

2.
+autoprewarm_dump_now(PG_FUNCTION_ARGS)
+{
+ uint32 num_blocks = 0;
+
..
+ PG_RETURN_INT64(num_blocks);
..
}

Is there any reason for using PG_RETURN_INT64 instead of PG_RETURN_UINT32?

3.
+dump_now(bool is_bgworker)
{
..
+ if (buf_state & BM_TAG_VALID)
+ {
+ block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode;
+ block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.spcNode;
+ block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode;
+ block_info_array[num_blocks].forknum = bufHdr->tag.forkNum;
+ block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum;
+ ++num_blocks;
+ }
..
}

I think there is no use in writing unlogged buffers unless the dump is
for shutdown.  You might want to use BufferIsPermanent to detect
the same.

4.
+static uint32
+dump_now(bool is_bgworker)
{
..
+ for (num_blocks = 0, i = 0; i < NBuffers; i++)
+ {
+ uint32 buf_state;
+
+ if (!is_bgworker)
+ CHECK_FOR_INTERRUPTS();
..
}

Why is checking for interrupts done only for non-bgworker cases?

5.
+ * Each entry of BlockRecordInfo consists of database, tablespace, filenode,
+ * forknum, blocknum. Note that this is in the text form so that the dump
+ * information is readable and can be edited, if required.
+ */

In the above comment, you are saying that the dump file is in text
form, whereas in the code you are using binary form.  I think the code
should match the comments. Is there a reason for preferring binary over
text or vice versa?

6.
+dump_now(bool is_bgworker)
{
..
+ (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR);
+ apw_state->pid_using_dumpfile = InvalidPid;
..
}

How will pid_using_dumpfile be set to InvalidPid in the case of error
for non-bgworker cases?

7.
+dump_now(bool is_bgworker)
{
..
+ (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR);

..
}

How will transient_dump_file_path be unlinked in the case of error in
durable_rename?  I think you need to use PG_TRY..PG_CATCH to ensure
same?
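
One hedged way to arrange that (a sketch, not taken from the patch):

PG_TRY();
{
    (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR);
}
PG_CATCH();
{
    /* don't leave the transient file behind on failure */
    unlink(transient_dump_file_path);
    PG_RE_THROW();
}
PG_END_TRY();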

8.
+ file = AllocateFile(transient_dump_file_path, PG_BINARY_W);
+ if (!file)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m",
+ transient_dump_file_path)));
+
+ ret = fprintf(file, "<<%u>>\n", num_blocks);
+ if (ret < 0)
+ {
+ int save_errno = errno;
+
+ FreeFile(file);

I think you don't need to close the file in case of error; it will be
automatically taken care of (via the transaction abort
path).

9.
+ /* Register autoprewarm load. */
+ setup_autoprewarm(_worker, "autoprewarm", "autoprewarm_main",
+  Int32GetDatum(TASK_PREWARM_BUFFERPOOL), 0, 0);

What does "load" signify in the above comment?  Do you want to say worker instead?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-07-03 Thread Amit Langote
On 2017/07/02 20:10, Robert Haas wrote:
> On Fri, Jun 30, 2017 at 4:20 PM, Robert Haas  wrote:
>> I don't think the approach of building a hash table to figure out
>> which result rels have already been created is a good one.  That too
>> feels like something that the planner should be figuring out and the
>> executor should just be implementing what the planner decided.  I
>> haven't figured out exactly how that should work yet, but it seems
>> like it ought to be doable.
> 
> I was imagining when I wrote the above that the planner should somehow
> compute a list of relations that it has excluded so that the executor
> can skip building ResultRelInfos for exactly those relations, but on
> further study, that's not particularly easy to achieve and wouldn't
> really save anything anyway, because the list of OIDs is coming
> straight out of the partition descriptor, so it's pretty much free.
> However, I still think it would be a nifty idea if we could avoid
> needing the hash table to deduplicate.  The reason we need that is, I
> think, that expand_inherited_rtentry() is going to expand the
> inheritance hierarchy in whatever order the scan(s) of pg_inherits
> return the descendant tables, whereas the partition descriptor is
> going to put them in a canonical order.
> 
> But that seems like it wouldn't be too hard to fix: let's have
> expand_inherited_rtentry() expand the partitioned table in the same
> order that will be used by ExecSetupPartitionTupleRouting().  That
> seems pretty easy to do - just have expand_inherited_rtentry() notice
> that it's got a partitioned table and call
> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
> produce the list of OIDs.  Then - I think -
> ExecSetupPartitionTupleRouting() doesn't need the hash table; it can
> just scan through the return value of ExecSetupPartitionTupleRouting()
> and the list of already-created ResultRelInfo structures in parallel -
> the order must be the same, but the latter can be missing some
> elements, so it can just create the missing ones.

Interesting idea.

If we are going to do this, I think we may need to modify
RelationGetPartitionDispatchInfo() a bit or invent an alternative that
does not do as much work.  Currently, it assumes that it's only ever
called by ExecSetupPartitionTupleRouting() and hence also generates
PartitionDispatchInfo objects for partitioned child tables.  We don't need
that if called from within the planner.

Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
with its usage within the executor, because there is this comment:

/*
 * We keep the partitioned ones open until we're done using the
 * information being collected here (for example, see
 * ExecEndModifyTable).
 */

Thanks,
Amit





Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

2017-07-03 Thread Ashutosh Sharma
Hi,

On Mon, Jul 3, 2017 at 9:12 AM, Amit Kapila  wrote:
> While discussing the behavior of hash indexes with Bruce in the nearby
> thread [1], it has been noticed that hash index on unlogged tables
> doesn't behave as expected.  Prior to 10, it has the different set of
> problems (mainly because hash indexes are not WAL-logged) which were
> discussed on that thread [1], however when I checked, it doesn't work
> even for 10.  Below are steps to reproduce the problem.
>
> 1. Setup master and standby
> 2. On the master, create unlogged table and hash index.
>2A.  Create unlogged table t1(c1 int);
>2B.  Create hash index idx_t1_hash on t1 using hash(c1);
> 3. On Standby, try selecting data,
>select * from t1;
>ERROR:  cannot access temporary or unlogged relations during recovery
> ---Till here everything works as expected
> 4. Promote standby to master (I have just stopped the standby and
> master and removed recovery.conf file from the standby database
> location) and try starting the new master, it
> gives below error and doesn't get started.
> FATAL:  could not create file "base/12700/16387": File exists
>
> The basic issue was that the WAL logging for Create Index operation
> was oblivious to the fact that for unlogged tables only INIT forks need
> to be logged.  Another point which we need to consider is that while
> replaying the WAL for the create index operation, we need to flush the
> buffer if it is for init fork.  This needs to be done only for pages
> that can be part of init fork file (like metapage, bitmappage).
> Attached patch fixes the issue.
>
> Another approach to fix the issue could be that always log complete
> pages for the create index operation on unlogged tables (in
> hashbuildempty).  However, the logic for initial hash index pages
> needs us to perform certain other actions (like update metapage after
> the creation of bitmappage) which make it difficult to follow that
> approach.
>

Thanks for sharing the steps to reproduce the issue in detail. I could
easily reproduce this issue. I also had a quick look into the patch and the
fix looks fine to me. However, it would be good if we can add at least one
test for unlogged table with hash index in the regression test suite.

Apart from that i have tested the patch and the patch seems to be working
fine. Here are the steps that i have followed to verify the fix,

With Patch:

postgres=# SELECT pg_relation_filepath('t1');
 pg_relation_filepath
--
 base/14915/16384
(1 row)

postgres=# SELECT pg_relation_filepath('idx_t1_hash');
 pg_relation_filepath
--
 base/14915/16387
(1 row)

Master:
=
[ashu@localhost bin]$ ls -l ../master/base/14915/1638*
-rw-------. 1 ashu ashu 18128896 Jul  3 14:29 ../master/base/14915/16384
-rw-------. 1 ashu ashu    24576 Jul  3 14:29 ../master/base/14915/16384_fsm
-rw-------. 1 ashu ashu        0 Jul  3 14:28 ../master/base/14915/16384_init
-rw-------. 1 ashu ashu 22339584 Jul  3 14:29 ../master/base/14915/16387
-rw-------. 1 ashu ashu    32768 Jul  3 14:28 ../master/base/14915/16387_init


Standby:
==
[ashu@localhost bin]$ ls -l ../standby/base/14915/1638*
-rw-------. 1 ashu ashu     0 Jul  3 14:28 ../standby/base/14915/16384_init
-rw-------. 1 ashu ashu 32768 Jul  3 14:28 ../standby/base/14915/16387_init


Without patch:
==
postgres=# SELECT pg_relation_filepath('t1');
 pg_relation_filepath
--
 base/14915/16384
(1 row)

postgres=# SELECT pg_relation_filepath('idx_t1_hash');
 pg_relation_filepath
--
 base/14915/16387
(1 row)

Master:
=
[ashu@localhost bin]$ ls -l ../master/base/14915/1638*
-rw-------. 1 ashu ashu 18128896 Jul  3 14:36 ../master/base/14915/16384
-rw-------. 1 ashu ashu    24576 Jul  3 14:36 ../master/base/14915/16384_fsm
-rw-------. 1 ashu ashu        0 Jul  3 14:35 ../master/base/14915/16384_init
-rw-------. 1 ashu ashu 22339584 Jul  3 14:36 ../master/base/14915/16387
-rw-------. 1 ashu ashu    32768 Jul  3 14:35 ../master/base/14915/16387_init

Standby:
==
[ashu@localhost bin]$ ls -l ../standby/base/14915/1638*
-rw-------. 1 ashu ashu     0 Jul  3 14:35 ../standby/base/14915/16384_init
*-rw-------. 1 ashu ashu 73728 Jul  3 14:35 ../standby/base/14915/16387*
-rw-------. 1 ashu ashu 24576 Jul  3 14:35 ../standby/base/14915/16387_init


As seen from the test results above, it is clear that without the patch, both
the main fork and the init fork were getting WAL-logged, as the create hash index
WAL logging was being done without checking the fork number.

_hash_init()
{
    ...
    log_newpage(&rel->rd_node,
                forkNum,
                blkno,
                BufferGetPage(buf),
                true);
    _hash_relbuf(rel, buf);
    ...
}
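
To make the shape of the fix concrete, the WAL logging presumably has to
become conditional on the fork being built, along these lines (a sketch of
the idea only, not the actual patch):

if (RelationNeedsWAL(rel) || forkNum == INIT_FORKNUM)
{
    /*
     * Only permanent relations, or the INIT fork of an unlogged index
     * (the only fork that must survive a crash), get WAL-logged here.
     */
    log_newpage(&rel->rd_node, forkNum, blkno,
                BufferGetPage(buf), true);
}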

I have also run the WAL consistency check tool to verify things and
didn't find any issues. Thanks.

--
With Regards,
Ashutosh Sharma

Re: [HACKERS] Multi column range partition table

2017-07-03 Thread Amit Langote
Hi Dean,

On 2017/07/03 17:36, Dean Rasheed wrote:
> On 3 July 2017 at 06:00, Amit Langote  wrote:
>> On 2017/07/03 2:15, Dean Rasheed wrote:
>>> My first thought was UNBOUNDED ABOVE/BELOW, because that matches the
>>> terminology already in use of upper and lower bounds.
>>
>> I was starting to like the Ashutosh's suggested UNBOUNDED MIN/MAX syntax,
>> but could you clarify your comment that ABOVE/BELOW is the terminology
>> already in use of upper and lower bounds?  I couldn't find ABOVE/BELOW in
>> our existing syntax anywhere that uses the upper/lower bound notion, so
>> was confused a little bit.
>>
> 
> I just meant that the words "above" and "below" more closely match the
> already-used terms "upper" and "lower" for the bounds, so that
> terminology seemed more consistent, e.g. "UNBOUNDED ABOVE" => no upper
> bound.
>
>> Also, I assume UNBOUNDED ABOVE signifies positive infinity and vice versa.
>>
> 
> Right.

I see, thanks for clarifying.

> I'm not particularly wedded to that terminology. I always find naming
> things hard, so if anyone can think of anything better, let's hear it.
> 
> The bigger question is do we want this for PG10? If so, time is
> getting tight. My feeling is that we do, because otherwise we'd be
> changing the syntax in PG11 of a feature only just released in PG10,
> and I think the current syntax is flawed, so it would be better not to
> have it in any public release. I'd feel better hearing from the
> original committer though.

The way I have extended the syntax in the posted patch, ABOVE/BELOW (or
whatever we decide instead) are optional.  UNBOUNDED without the
ABOVE/BELOW specifications implicitly means UNBOUNDED ABOVE if in FROM and
vice versa, which seems to me like sensible default behavior and what's
already present in PG 10.

Do you think ABOVE/BELOW shouldn't really be optional?
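
For illustration, a made-up example of how the proposed spelling might look
for a two-column range partition (my reading of the proposal, not committed
syntax): a partition holding every row whose first column equals 10 would
leave the second column open at both ends.

CREATE TABLE measurements_10 PARTITION OF measurements
    FOR VALUES FROM (10, UNBOUNDED BELOW) TO (10, UNBOUNDED ABOVE);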

> Meanwhile, I'll continue trying to review the latest patches...

I had forgotten to update the CREATE TABLE documentation in 0002 to
reflect the syntax extension.  Fixed in the attached latest patch.

Thanks,
Amit
From e99ee071125b0026e69c5a49cee1865bf380883b Mon Sep 17 00:00:00 2001
From: amit 
Date: Mon, 3 Jul 2017 10:52:45 +0900
Subject: [PATCH 1/2] Dean's patch to simply range partition overlap check

---
 src/backend/catalog/partition.c| 90 --
 src/test/regress/expected/create_table.out |  2 +-
 2 files changed, 38 insertions(+), 54 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 7da2058f15..96760a0f05 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -745,78 +745,62 @@ check_new_partition_bound(char *relname, Relation parent,
if (partdesc->nparts > 0)
{
PartitionBoundInfo boundinfo = 
partdesc->boundinfo;
-   int off1,
-   off2;
-   boolequal = false;
+   int offset;
+   boolequal;
 
Assert(boundinfo && boundinfo->ndatums 
> 0 &&
   boundinfo->strategy == 
PARTITION_STRATEGY_RANGE);
 
/*
-* Firstly, find the greatest range 
bound that is less
-* than or equal to the new lower bound.
+* Test whether the new lower bound 
(which is treated
+* inclusively as part of the new 
partition) lies inside an
+* existing partition, or in a gap.
+*
+* If it's in a gap, the next index 
value will be -1 (the
+* lower bound of the next partition).  
This is also true
+* if there is no next partition, since 
the index array is
+* initialised with an extra -1 at the 
end.
+*
+* Note that this also allows for the 
possibility that the
+* new lower bound equals an existing 
upper bound.
 */
-   off1 = partition_bound_bsearch(key, 
boundinfo, lower, true,
-   
   );
+   offset = partition_bound_bsearch(key, 
boundinfo, 

Re: [HACKERS] asynchronous execution

2017-07-03 Thread Kyotaro HORIGUCHI
Hi, I've returned.

At Thu, 29 Jun 2017 14:08:27 +0900, Amit Langote 
 wrote in 
<63a5a01c-2967-83e0-8bbf-c981404f5...@lab.ntt.co.jp>
> Hi,
> 
> On 2017/06/29 13:45, Kyotaro HORIGUCHI wrote:
> > Thank you for looking this.
> > 
> > At Wed, 28 Jun 2017 10:23:54 +0200, Antonin Houska wrote:
> >> Can you please explain this part of make_append() ?
> >>
> >> /* Currently async on partitioned tables is not available */
> >> Assert(nasyncplans == 0 || partitioned_rels == NIL);
> >>
> >> I don't think the output of Append plan is supposed to be ordered even if 
> >> the
> >> underlying relation is partitioned. Besides ordering, is there any other
> >> reason not to use the asynchronous execution?
> 
> When making an Append for a partitioned table, among the arguments passed
> to make_append(), 'partitioned_rels' is a list of RT indexes of
> partitioned tables in the inheritance tree of which the aforementioned
> partitioned table is the root.  'appendplans' is a list of subplans for
> scanning the leaf partitions in the tree.  Note that the 'appendplans'
> list contains no members corresponding to the partitioned tables, because
> we don't need to scan them (only leaf relations contain any data).
> 
> The point of having the 'partitioned_rels' list in the resulting Append
> plan is so that the executor can identify those relations and take the
> appropriate locks on them.

Amit, thank you for the detailed explanation. I understand what
it is and that just ignoring it is enough; I confirmed that it
actually works as before.
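
Just to restate my understanding in code, the executor side would
essentially walk that list and lock each non-leaf table, roughly like
the sketch below (hypothetical function name and a simplified lock-mode
choice; this is not the actual executor code):

#include "postgres.h"
#include "nodes/execnodes.h"
#include "parser/parsetree.h"
#include "storage/lmgr.h"

/*
 * Lock every non-leaf partitioned table recorded in an Append node.
 * Leaf partitions are locked by their own scan nodes, so they do not
 * appear in partitioned_rels.
 */
static void
lock_nonleaf_append_tables(List *partitioned_rels, EState *estate)
{
	ListCell   *lc;

	foreach(lc, partitioned_rels)
	{
		int			rti = lfirst_int(lc);
		RangeTblEntry *rte = rt_fetch(rti, estate->es_range_table);

		/* the real lock mode depends on the operation; simplified here */
		LockRelationOid(rte->relid, AccessShareLock);
	}
}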

I'll then address Antonin's comments tomorrow.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2017-07-03 Thread Marina Polyakova

Hello Marina,


Hello, Fabien!


A few comments about the submitted patches.


Thank you very much for them!


I agree that improving the error handling ability of pgbench is a good
thing, although I'm not sure about the implications...


Could you be a bit more specific? What implications are you
worried about?



About the "retry" discussion: I agree that retry is the relevant
option from an application point of view.


I'm glad to hear it!


ISTM that the retry implementation should be implemented somehow in
the automaton, restarting the same script from the beginning.


If there are several transactions in this script - don't you think that 
we should restart only the failed transaction?..



As pointed out in the discussion, the same values/commands should be
executed, which suggests that random generated values should be the
same on the retry runs, so that for a simple script the same
operations are attempted. This means that the random generator state
must be kept & reinstated for a client on retries. Currently the
random state is in the thread, which is not convenient for this
purpose, so it should be moved in the client so that it can be saved
at transaction start and reinstated on retries.


I think about it in the same way =)
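
To sketch what I have in mind (only an illustration with made-up names,
not the actual pgbench structures): each client would own its erand48
seed, snapshot it when a transaction starts, and copy the snapshot back
before a retry, so the retried transaction sees the same random values.

#include <string.h>

typedef struct ClientRandomState	/* hypothetical per-client state */
{
	unsigned short random_state[3];	/* seed used by pg_erand48() & friends */
	unsigned short retry_state[3];	/* snapshot taken at transaction start */
} ClientRandomState;

/* called just before the first command of a transaction is sent */
static void
save_random_state(ClientRandomState *cs)
{
	memcpy(cs->retry_state, cs->random_state, sizeof(cs->random_state));
}

/* called when a serialization/deadlock failure triggers a retry */
static void
restore_random_state(ClientRandomState *cs)
{
	memcpy(cs->random_state, cs->retry_state, sizeof(cs->random_state));
}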


The number of retries and maybe failures should be counted, maybe with
some adjustable maximum, as suggested.


If we fix the maximum number of attempts, the maximum number of failures
for one script execution will be bounded above by
(number_of_transactions_in_script * maximum_number_of_attempts); for
example, a script with 3 transactions and at most 5 attempts each cannot
produce more than 15 failures. Do you think we should add an option to
the program to limit this number further?



About 0001:

In accumStats, just use one level of if; the two levels bring nothing.


Thanks, I agree =[


In doLog, added columns should be at the end of the format.


I have inserted them earlier because these columns are not optional. Do 
you think they should be optional?



The number
of columns MUST NOT change when different issues arise, so that it
works well with cut/... unix commands; inserting a sentence such as
"serialization and deadlock failures" is a bad idea.


Thanks, I agree again.


threadRun: the point of the progress format is to fit on one not too
wide line on a terminal and to allow some simple automatic processing.
Adding a verbose sentence in the middle of it is not the way to go.


I was thinking about it.. Thanks, I'll try to make it shorter.


About tests: I do not understand why test 003 includes 2 transactions.
It would seem more logical to have two scripts.


Ok!


About 0003:

I'm not sure that there should be a new option to report failures;
the information, when relevant, should be integrated in a clean format
into the existing reports... Maybe the "per command latency"
report/option should be renamed if it becomes more general.


I have tried not to change other parts of the program as much as 
possible. But if you think that it will be more useful to change the 
option, I'll do it.



About 0004:

The documentation must not be in a separate patch, but in the same
patch as its corresponding code.


Ok!

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Multi column range partition table

2017-07-03 Thread Ashutosh Bapat
On Mon, Jul 3, 2017 at 2:06 PM, Dean Rasheed  wrote:
> On 3 July 2017 at 06:00, Amit Langote  wrote:
>> On 2017/07/03 2:15, Dean Rasheed wrote:
>>> My first thought was UNBOUNDED ABOVE/BELOW, because that matches the
>>> terminology already in use of upper and lower bounds.
>>
>> I was starting to like Ashutosh's suggested UNBOUNDED MIN/MAX syntax,
>> but could you clarify your comment that ABOVE/BELOW is the terminology
>> already in use of upper and lower bounds?  I couldn't find ABOVE/BELOW in
>> our existing syntax anywhere that uses the upper/lower bound notion, so
>> was confused a little bit.
>>
>
> I just meant that the words "above" and "below" more closely match the
> already-used terms "upper" and "lower" for the bounds, so that
> terminology seemed more consistent, e.g. "UNBOUNDED ABOVE" => no upper
> bound.
>
>
>> Also, I assume UNBOUNDED ABOVE signifies positive infinity and vice versa.
>>
>
> Right.
>
> I'm not particularly wedded to that terminology. I always find naming
> things hard, so if anyone can think of anything better, let's hear it.

Yet another option: UNBOUNDED UPPER/LOWER.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Multi column range partition table

2017-07-03 Thread Dean Rasheed
On 3 July 2017 at 06:00, Amit Langote  wrote:
> On 2017/07/03 2:15, Dean Rasheed wrote:
>> My first thought was UNBOUNDED ABOVE/BELOW, because that matches the
>> terminology already in use of upper and lower bounds.
>
> I was starting to like Ashutosh's suggested UNBOUNDED MIN/MAX syntax,
> but could you clarify your comment that ABOVE/BELOW is the terminology
> already in use of upper and lower bounds?  I couldn't find ABOVE/BELOW in
> our existing syntax anywhere that uses the upper/lower bound notion, so
> was confused a little bit.
>

I just meant that the words "above" and "below" more closely match the
already-used terms "upper" and "lower" for the bounds, so that
terminology seemed more consistent, e.g. "UNBOUNDED ABOVE" => no upper
bound.


> Also, I assume UNBOUNDED ABOVE signifies positive infinity and vice versa.
>

Right.

I'm not particularly wedded to that terminology. I always find naming
things hard, so if anyone can think of anything better, let's hear it.

The bigger question is do we want this for PG10? If so, time is
getting tight. My feeling is that we do, because otherwise we'd be
changing the syntax in PG11 of a feature only just released in PG10,
and I think the current syntax is flawed, so it would be better not to
have it in any public release. I'd feel better hearing from the
original committer though.

Meanwhile, I'll continue trying to review the latest patches...

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Macros bundling RELKIND_* conditions

2017-07-03 Thread Ashutosh Bapat
Forgot to attach the patch with the earlier mail.

On Mon, Jul 3, 2017 at 1:22 PM, Ashutosh Bapat
 wrote:
> On Mon, May 29, 2017 at 12:55 PM, Ashutosh Bapat
>  wrote:
>>
>> --
>> I noticed that
>> after we introduced RELKIND_PARTITIONED_TABLE, we needed to change a
>> number of conditions to include this relkind. We missed some places in
>> the initial commits and fixed those later. I am wondering whether we
>> should create macros clubbing relevant relkinds together based on the
>> purpose of the tests, e.g. IS_RELKIND_HAS_STORAGE(). When a new relkind
>> is added, one can examine these macros to check whether the new
>> relkind fits in the given macro. If all those macros are placed
>> together, there is a high chance that we will not miss any place in
>> the initial commit itself.
>>
>> For example, if we had a macro IS_RELKIND_WITH_STATS defined as
>> #define IS_RELKIND_WITH_STATS(relkind) \
>> ((relkind) == RELKIND_RELATION || \
>> (relkind) == RELKIND_MATVIEW)
>>
>> and CreateStatistics() and getExtendedStatistics() had the following conditions
>> if (!IS_RELKIND_WITH_STATS(rel->rd_rel->relkind)) and if
>> (!IS_RELKIND_WITH_STATS(tabinfo->relkind)) resp. The patch would be
>> just adding
>> (relkind) == RELKIND_FOREIGN_TABLE || \
>> (relkind) == RELKIND_PARTITIONED_TABLE)
>>
>> to that macro without requiring to find out where all we need to add
>> those two relkinds for statistics purposes.
>> -- excerpt ends
>>
>> Peter Eisentraut thought that idea is worth a try. I gave it a try on
>> my way back from PGCon. Attached is a series of patches, one per
>> macro. This isn't a complete series but will give an idea of what's
>> involved. It might be possible to change switch cases at some places
>> to use if else with these macros. But I haven't done any changes
>> towards that.
>>
>
> On the thread [1] Joe and Dean expressed that it would be good if we
> could also keep the related error messages at a central place. In the
> attached patches, I have tried to do that by defining corresponding
> ERRMSG macros encapsulating errmsg() macros in ereport() calls. Please
> let me know if this looks good.
>
> With this approach the macro which tests relkinds and the macro which
> reports the error are placed together in pg_class.h. If somebody adds a
> new relkind, s/he will notice the comment there and update the macros
> below, also keeping the error message in sync with the test. Please
> note that partitioned tables are not explicitly mentioned in the error
> messages when the corresponding test has RELKIND_PARTITIONED_TABLE. I
> think we don't need to differentiate between a regular table and
> partitioned table in those error messages; a "table" implies both a
> regular table and a partitioned table.
>
> With this approach, a developer may still fail to update the error
> message when the test is updated. We can further tighten this with the
> following approach.
> 1. For every test declare an array of relkinds that the test accepts e.g.
> int relkinds_with_vm[] = {RELKIND_RELATION, RELKIND_MATVIEW,
> RELKIND_TOASTVALUE};
> 2. Write a function is_relkind_in_array(int *relkinds_array, int
> num_relkinds, int relkind) to check whether the given relkind is in
> the array.
> 3. Each test macro now calls this function passing the appropriate array,
> e.g. #define RELKIND_WITH_VISIBILITY_MAP(relkind) \
>  is_relkind_in_array(relkinds_with_vm,
> sizeof(relkinds_with_vm)/sizeof(relkinds_with_vm[0]), (relkind))
> 4. Declare an array of relkinds and their readable strings, e.g.
> {{RELKIND_RELATION, "table"}, {RELKIND_MATVIEW, "materialized view"}}
> 5. Write a function to collect the readable strings for all relkinds in a
> given array of relkinds, say char *relkind_names(int *relkinds, int
> num_relkinds)
> 6. Declare error message macros to call this function by passing
> appropriate array.
>
> Obviously with this approach we would lose a bit of readability. Also
> we will be forced to include all the relkind strings and won't be able
> to use "table" to mean both regular and partitioned table as we do
> today. I am not able to decide whether that's a good change or not.
> So, haven't coded it up.
>
> Let me know your opinion.
>
> The patches do not convert all the places that can be converted to use
> macros. Once we agree upon the approach, I will continue converting
> those.
>
> [1] 
> https://www.postgresql.org/message-id/803c7229-bac6-586a-165b-990d2e0aa...@joeconway.com
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


pg_relkind_macros_patches_v2.tar.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Macros bundling RELKIND_* conditions

2017-07-03 Thread Ashutosh Bapat
On Mon, May 29, 2017 at 12:55 PM, Ashutosh Bapat
 wrote:
>
> --
> I noticed that
> after we introduced RELKIND_PARTITIONED_TABLE, we needed to change a
> number of conditions to include this relkind. We missed some places in
> the initial commits and fixed those later. I am wondering whether we
> should create macros clubbing relevant relkinds together based on the
> purpose of the tests, e.g. IS_RELKIND_HAS_STORAGE(). When a new relkind
> is added, one can examine these macros to check whether the new
> relkind fits in the given macro. If all those macros are placed
> together, there is a high chance that we will not miss any place in
> the initial commit itself.
>
> For example, if we had a macro IS_RELKIND_WITH_STATS defined as
> #define IS_RELKIND_WITH_STATS(relkind) \
> ((relkind) == RELKIND_RELATION || \
> (relkind) == RELKIND_MATVIEW)
>
> and CreateStatistics() and getExtendedStatistics() had the following conditions
> if (!IS_RELKIND_WITH_STATS(rel->rd_rel->relkind)) and if
> (!IS_RELKIND_WITH_STATS(tabinfo->relkind)) resp. The patch would be
> just adding
> (relkind) == RELKIND_FOREIGN_TABLE || \
> (relkind) == RELKIND_PARTITIONED_TABLE)
>
> to that macro without requiring to find out where all we need to add
> those two relkinds for statistics purposes.
> -- excerpt ends
>
> Peter Eisentraut thought that idea is worth a try. I gave it a try on
> my way back from PGCon. Attached is a series of patches, one per
> macro. This isn't a complete series but will give an idea of what's
> involved. It might be possible to change switch cases at some places
> to use if else with these macros. But I haven't done any changes
> towards that.
>

On the thread [1] Joe and Dean expressed that it would be good if we
could also keep the related error messages at a central place. In the
attached patches, I have tried to do that by defining corresponding
ERRMSG macros encapsulating errmsg() macros in ereport() calls. Please
let me know if this looks good.

With this approach the macro which tests relkinds and the macro which
reports the error are placed together in pg_class.h. If somebody adds a
new relkind, s/he will notice the comment there and update the macros
below, also keeping the error message in sync with the test. Please
note that partitioned tables are not explicitly mentioned in the error
messages when the corresponding test has RELKIND_PARTITIONED_TABLE. I
think we don't need to differentiate between a regular table and
partitioned table in those error messages; a "table" implies both a
regular table and a partitioned table.

With this approach, a developer may still fail to update the error
message when the test is updated. We can further tighten this with the
following approach (a rough sketch in code follows the list below).
1. For every test declare an array of relkinds that the test accepts e.g.
int relkinds_with_vm[] = {RELKIND_RELATION, RELKIND_MATVIEW,
RELKIND_TOASTVALUE};
2. Write a function is_relkind_in_array(int *relkinds_array, int
num_relkinds, int relkind) to check whether the given relkind is in
the array.
3. Each test macro now calls this function passing the appropriate array,
e.g. #define RELKIND_WITH_VISIBILITY_MAP(relkind) \
 is_relkind_in_array(relkinds_with_vm,
sizeof(relkinds_with_vm)/sizeof(relkinds_with_vm[0]), (relkind))
4. Declare an array of relkinds and their readable strings, e.g.
{{RELKIND_RELATION, "table"}, {RELKIND_MATVIEW, "materialized view"}}
5. Write a function to collect the readable strings for all relkinds in a
given array of relkinds, say char *relkind_names(int *relkinds, int
num_relkinds)
6. Declare error message macros to call this function by passing
appropriate array.
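
Roughly, steps 1-3 could look like the following (a sketch only; the
names and the exact relkind list are placeholders, and I have used char
rather than int since relkind is a char -- the final patch need not look
like this):

#include "postgres.h"
#include "catalog/pg_class.h"

/* step 1: relkinds that have a visibility map (illustrative list only) */
static const char relkinds_with_vm[] = {
	RELKIND_RELATION,
	RELKIND_MATVIEW,
	RELKIND_TOASTVALUE
};

/* step 2: check whether the given relkind is in the array */
static bool
is_relkind_in_array(const char *relkinds, int num_relkinds, char relkind)
{
	int			i;

	for (i = 0; i < num_relkinds; i++)
	{
		if (relkinds[i] == relkind)
			return true;
	}
	return false;
}

/* step 3: the test macro just delegates to the function */
#define RELKIND_WITH_VISIBILITY_MAP(relkind) \
	is_relkind_in_array(relkinds_with_vm, lengthof(relkinds_with_vm), (relkind))

Steps 4-6 would add a parallel {relkind, readable name} mapping and a
relkind_names() helper that builds the string used by the error message
macros.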

Obviously with this approach we would lose a bit of readability. Also
we will be forced to include all the relkind strings and won't be able
to use "table" to mean both regular and partitioned table as we do
today. I am not able to decide whether that's a good change or not.
So, haven't coded it up.

Let me know your opinion.

The patches do not convert all the places that can be converted to use
macros. Once we agree upon the approach, I will continue converting
those.

[1] 
https://www.postgresql.org/message-id/803c7229-bac6-586a-165b-990d2e0aa...@joeconway.com

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-07-03 Thread Amit Kapila
On Sun, Jul 2, 2017 at 10:32 PM, Mithun Cy  wrote:
> On Tue, Jun 27, 2017 at 11:41 AM, Mithun Cy  
> wrote:
>> On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown  wrote:
>>>
>>> Also, I find it a bit messy that launch_autoprewarm_dump() doesn't
>>> detect an autoprewarm process already running.  I'd want this to
>>> return NULL or an error if called for a 2nd time.
>>
>> We log instead of raising an error because we only check after launching
>> the worker, and from inside the worker. One solution could be that,
>> similar to autoprewarm_dump_now(), autoprewarm_start_worker() can init
>> shared memory and check whether we can launch the worker in the backend
>> itself. I will try to fix the same.
>
> I have fixed it now as follows
>
> +Datum
> +autoprewarm_start_worker(PG_FUNCTION_ARGS)
> +{
> +   pid_t   pid;
> +
> +   init_apw_shmem();
> +   pid = apw_state->bgworker_pid;
> +   if (pid != InvalidPid)
> +   ereport(ERROR,
> +   (errmsg("autoprewarm worker is already running under PID %d",
> +   pid)));
> +
> +   autoprewarm_dump_launcher();
> +   PG_RETURN_VOID();
> +}
>
> In the backend itself, we check whether an autoprewarm worker is already
> running and only then start one. There is a possibility, if this function
> is executed concurrently when no worker is already running (which I
> think is not normal usage), that both calls will say they have
> successfully launched the worker even though only one could have
> actually done so (the other will log and silently die).

Why can't we close this remaining race condition?  Basically, if we
just perform all of the actions in this function under the lock and
autoprewarm_dump_launcher waits till the autoprewarm worker has
initialized the bgworker_pid, then there won't be any remaining race
condition.  I think if we don't close this race condition, it will be
unpredictable whether the user will get the error or there will be
only a server log for the same.
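
In other words, something along these lines (just a sketch; the lock
field in apw_state and the exact behaviour of autoprewarm_dump_launcher
are assumptions about the patch, not its actual code):

Datum
autoprewarm_start_worker(PG_FUNCTION_ARGS)
{
	pid_t		pid;

	init_apw_shmem();

	/*
	 * Hold the lock across both the check and the launch, so that two
	 * concurrent callers cannot both conclude that no worker is running.
	 * (An LWLock is released automatically if we error out below.)
	 */
	LWLockAcquire(apw_state->lock, LW_EXCLUSIVE);	/* assumed lock field */

	pid = apw_state->bgworker_pid;
	if (pid != InvalidPid)
		ereport(ERROR,
				(errmsg("autoprewarm worker is already running under PID %d",
						pid)));

	/*
	 * Assumed to register the worker, wait for it to start, and record its
	 * PID in apw_state->bgworker_pid before returning, without needing
	 * this lock itself.
	 */
	autoprewarm_dump_launcher();

	LWLockRelease(apw_state->lock);
	PG_RETURN_VOID();
}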

> I think that
> is okay as the objective was to get one worker up and running.
>

You are right that the objective will be met, but still, I feel the
behavior of this API will be unpredictable which in my opinion should
be fixed.  If it is really not possible or extremely difficult to fix
this behavior, then I think we should update the docs.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Multi column range partition table

2017-07-03 Thread Amit Langote
On 2017/07/03 14:00, Amit Langote wrote:
> On 2017/07/03 2:15, Dean Rasheed wrote:
>> On 30 June 2017 at 10:04, Ashutosh Bapat
>>  wrote:
>>> On Fri, Jun 30, 2017 at 1:36 PM, Amit Langote
>>>  wrote:

 Alright, I spent some time implementing a patch to allow specifying
 -infinity and +infinity in arbitrary ways.  Of course, it prevents
 nonsensical inputs with appropriate error messages.
>>>
>>> I don't think -infinity and +infinity are the right terms. For a
>>> string or character data type there is no -infinity and +infinity.
>>> Similarly for enums. We need to extend UNBOUNDED somehow to indicate
>>> the end of a given type in the given direction. I thought about
>>> UNBOUNDED LEFT/RIGHT but then whether LEFT indicates the -ve or +ve side
>>> would cause confusion. Also LEFT/RIGHT may work for a single
>>> dimensional datatype but not for multi-dimensional spaces. How about
>>> MINIMUM/MAXIMUM or UNBOUNDED MIN/MAX to indicate the extremities.
>>>
>>
>> Yes, I think you're right. Also, some datatypes include values that
>> are equal to +/-infinity, which would then behave differently from
>> unbounded as range bounds, so it wouldn't be a good idea to overload
>> that term.
> 
> Agree with you both that using (+/-) infinity may not be a good idea after
> all.
> 
>> My first thought was UNBOUNDED ABOVE/BELOW, because that matches the
>> terminology already in use of upper and lower bounds.
> 
> I was starting to like Ashutosh's suggested UNBOUNDED MIN/MAX syntax,
> but could you clarify your comment that ABOVE/BELOW is the terminology
> already in use of upper and lower bounds?  I couldn't find ABOVE/BELOW in
> our existing syntax anywhere that uses the upper/lower bound notion, so
> was confused a little bit.
> 
> Also, I assume UNBOUNDED ABOVE signifies positive infinity and vice versa.

Anyway, here's the revised version of the syntax patch that implements
the ABOVE/BELOW extension to the UNBOUNDED specification.

0001 is the patch that Dean posted [1] as a replacement for what I earlier
posted for simplifying range partition overlap check.

0002 is the UNBOUNDED syntax extension patch.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAEZATCVcBCBZsMcHj37TF%2BdcsjCtKZdZ_FAaJjaFMvfoXRqZMg%40mail.gmail.com
From e99ee071125b0026e69c5a49cee1865bf380883b Mon Sep 17 00:00:00 2001
From: amit 
Date: Mon, 3 Jul 2017 10:52:45 +0900
Subject: [PATCH 1/2] Dean's patch to simplify range partition overlap check

---
 src/backend/catalog/partition.c| 90 --
 src/test/regress/expected/create_table.out |  2 +-
 2 files changed, 38 insertions(+), 54 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 7da2058f15..96760a0f05 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -745,78 +745,62 @@ check_new_partition_bound(char *relname, Relation parent,
 			if (partdesc->nparts > 0)
 			{
 				PartitionBoundInfo boundinfo = partdesc->boundinfo;
-				int			off1,
-							off2;
-				bool		equal = false;
+				int			offset;
+				bool		equal;
 
 				Assert(boundinfo && boundinfo->ndatums > 0 &&
 					   boundinfo->strategy == PARTITION_STRATEGY_RANGE);
 
 				/*
-				 * Firstly, find the greatest range bound that is less
-				 * than or equal to the new lower bound.
+				 * Test whether the new lower bound (which is treated
+				 * inclusively as part of the new partition) lies inside an
+				 * existing partition, or in a gap.
+				 *
+				 * If it's in a gap, the next index value will be -1 (the
+				 * lower bound of the next partition).  This is also true
+				 * if there is no next partition, since the index array is
+				 * initialised with an extra -1 at the end.
+				 *
+				 * Note that this also allows for the possibility that the
+				 * new lower bound equals an existing upper bound.
 				 */
-				off1 =