Re: [HACKERS] Server crash due to SIGBUS(Bus Error) when trying to access the memory created using dsm_create().

2017-06-27 Thread Thomas Munro
On Wed, Aug 24, 2016 at 2:58 AM, Robert Haas  wrote:
> Now, for bigger segment sizes, I think there actually could be a
> little bit of a noticeable performance hit here, because it's not just
> about total elapsed time.  Even if the code eventually touches all of
> the memory, it might not touch it all before starting to fire up
> workers or whatever else it wants to do with the DSM segment.  But I'm
> thinking we still need to bite the bullet and pay the expense, because
> crash-and-restart cycles are *really* bad.

Here is a new rebased version of this patch, primarily to poke this
thread as an unresolved question.  This patch is not committable as-is,
though: I discovered that parallel query can cause fallocate to return
with errno == EINTR.  I haven't yet investigated whether fallocate is
supposed to be restartable, whether signals should be blocked, or
whether something else is wrong.  Another question is whether the call
to ftruncate() is actually necessary before the call to fallocate().
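
In case it helps: here is the shape a fix might take if fallocate turns
out to be restartable on EINTR (the helper name is made up, and whether
naive retrying is actually correct is exactly the open question):

#define _GNU_SOURCE				/* for fallocate() on Linux */
#include <errno.h>
#include <fcntl.h>

/*
 * Hypothetical helper: retry fallocate() when interrupted by a signal,
 * assuming the call is in fact restartable.  Returns 0 on success, or
 * -1 with errno set on error.
 */
static int
dsm_fallocate_retry(int fd, off_t offset, off_t len)
{
	int		rc;

	do
	{
		rc = fallocate(fd, 0, offset, len);
	} while (rc == -1 && errno == EINTR);

	return rc;
}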

Unfounded speculation: fallocate() might actually *improve*
performance of DSM segments if your access pattern involves random
access (just to pick an example out of the air, something like...
building a hash table), since it's surely easier to allocate a big
contiguous chunk than a squillion random pages most of which divide an
existing hole into two smaller holes...

-- 
Thomas Munro
http://www.enterprisedb.com


fallocate-v3.patch
Description: Binary data



Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-27 Thread Masahiko Sawada
On Wed, Jun 28, 2017 at 1:47 AM, Petr Jelinek
 wrote:
>
> On 27/06/17 10:51, Masahiko Sawada wrote:
>> On Mon, Jun 26, 2017 at 12:12 PM, Masahiko Sawada  
>> wrote:
>>
>> I've reviewed this patch briefly.
>
> Thanks!
>
>>
>> @@ -515,6 +533,31 @@ logicalrep_worker_stop(Oid subid, Oid relid)
>>  }
>>
>>  /*
>> + * Request worker to be stopped on commit.
>> + */
>> +void
>> +logicalrep_worker_stop_at_commit(Oid subid, Oid relid)
>> +{
>> +   LogicalRepWorkerId *wid;
>> +   MemoryContext old;
>> +
>> +   old = MemoryContextSwitchTo(TopTransactionContext);
>> +
>> +   wid = (LogicalRepWorkerId *) palloc(sizeof(LogicalRepWorkerId));
>> +
>> +   /*
>> +   wid = MemoryContextAlloc(TopTransactionContext,
>> +                            sizeof(LogicalRepWorkerId));
>> +   */
>> +   wid->subid = subid;
>> +   wid->relid = relid;
>> +
>> +   on_commit_stop_workers = lappend(on_commit_stop_workers, wid);
>> +
>> +   MemoryContextSwitchTo(old);
>> +}
>>
>> logicalrep_worker_stop_at_commit() has a problem: new_list(), called
>> by lappend(), allocates memory from the current memory context.  It
>> should switch the memory context first, and then allocate the memory
>> for wid and append it to the list.
>>

Thank you for updating the patch!

>
> Right, fixed (I see you did that locally as well based on the above
> excerpt ;) ).

Oops, yeah that's my test code.

>> 
>> @@ -754,9 +773,25 @@ ApplyLauncherShmemInit(void)
>>  void
>>  AtEOXact_ApplyLauncher(bool isCommit)
>>  {
>> -   if (isCommit && on_commit_launcher_wakeup)
>> -   ApplyLauncherWakeup();
>> +   ListCell *lc;
>> +
>> +   if (isCommit)
>> +   {
>> +   foreach (lc, on_commit_stop_workers)
>> +   {
>> +   LogicalRepWorkerId *wid = lfirst(lc);
>> +   logicalrep_worker_stop(wid->subid, wid->relid);
>> +   }
>> +
>> +   if (on_commit_launcher_wakeup)
>> +   ApplyLauncherWakeup();
>>
>> Stopping the logical rep worker in AtEOXact_ApplyLauncher doesn't
>> support prepared transactions.  Since we allocate the
>> on_commit_stop_workers list in TopTransactionContext, postgres crashes
>> if we execute any query after a prepared transaction that removes a
>> subscription relation state.  Also, even after fixing that issue, we
>> still need to do something: the on_commit_stop_workers list is not
>> associated with the prepared transaction, so a query following
>> "prepare transaction" tries to stop the workers at its own commit.
>> There was a similar discussion before.
>>
>
> Hmm, good point. I think for now it makes sense to simply not allow
> PREPARE for transactions that manipulate workers, similar to what we do
> when there are exported snapshots. I've done it that way in the attached.

Agreed. Should we note that in docs?

>
>> 
>> +
>> +   ensure_transaction();
>> +   UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
>> +                              rstate->relid, rstate->state,
>> +                              rstate->lsn);
>>
>>
>> Should we commit the transaction if we started a new one before
>> updating the subscription relation state?  Otherwise it could be a
>> deadlock risk.
>
> We only lock the whole subscription (and the only conflicting things are
> DROP and ALTER SUBSCRIPTION), not the individual subscription-relation
> mapping, so it doesn't seem to me like there is any higher risk of
> deadlocks than with anything else that works with multiple tables (or
> compared to the previous behavior).
>

I'm concerned that a lock on the whole subscription could cause conflicts
among ALTER SUBSCRIPTION, the table sync worker, and the apply worker:

Please imagine the following case.

1. The apply worker updates a subscription relation state R1 of
subscription S1.
-> It acquires AccessShareLock on S1, and keeps holding it.
2. ALTER SUBSCRIPTION ... SET PUBLICATION tries to acquire
AccessExclusiveLock on S1.
-> But it has to wait for the apply worker to release the lock.
3. The apply worker calls wait_for_relation_state_change(relid,
SUBREL_STATE_SYNCDONE) and waits for the table sync worker
for R2 to change its status.
-> Note that the apply worker is still holding AccessShareLock on S1.
4. The table sync worker tries to update its status to SYNCDONE.
-> In UpdateSubscriptionRelState(), it tries to acquire AccessShareLock
   on S1, but has to queue behind the pending exclusive lock. *deadlock*

To summarize: because the apply worker keeps holding AccessShareLock on
S1, ALTER SUBSCRIPTION's attempt to acquire AccessExclusiveLock waits
for the apply worker; the table sync worker in turn waits behind the
ALTER SUBSCRIPTION in order to change its status; and the apply worker
waits for the table sync worker to change its status.

I encountered a similar case once in my environment, but since it
depends entirely on timing I don't have concrete steps to reproduce it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE 

Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-27 Thread Noah Misch
On Fri, Jun 23, 2017 at 09:42:10PM -0400, Peter Eisentraut wrote:
> On 6/21/17 22:47, Peter Eisentraut wrote:
> > On 6/20/17 22:44, Noah Misch wrote:
> >>> A patch has been posted, and it's being reviewed.  Next update Monday.
> >>
> >> This PostgreSQL 10 open item is past due for your status update.  Kindly 
> >> send
> >> a status update within 24 hours, and include a date for your subsequent 
> >> status
> >> update.  Refer to the policy on open item ownership:
> >> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
> > 
> > I'm not sure how to proceed here.  Nobody else seems terribly
> > interested, and this is really a minor issue, so I don't want to muck
> > around with the locking endlessly.  Let's say, if there are no new
> > insights by Friday, I'll pick one of the proposed patches or just close
> > it without any patch.
> 
> After some comments, a new patch has been posted, and I'll give until
> Monday for review.

This PostgreSQL 10 open item is past due for your status update.  Kindly send
a status update within 24 hours, and include a date for your subsequent status
update.  Refer to the policy on open item ownership:
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com




Re: [HACKERS] Pluggable storage

2017-06-27 Thread Haribabu Kommi
On Tue, Jun 27, 2017 at 11:53 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Tue, Jun 27, 2017 at 4:08 PM, Amit Kapila 
> wrote:
>
>> On Thu, Jun 22, 2017 at 8:00 PM, Alexander Korotkov
>>  wrote:
>> > On Wed, Jun 21, 2017 at 10:47 PM, Robert Haas 
>> wrote:
>> >>
>> >> On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi
>> >>  wrote:
>> >> > Open Items:
>> >> >
>> >> > 1. The BitmapHeapScan and TableSampleScan are tightly coupled with
>> >> > HeapTuple and HeapScanDesc, So these scans are directly operating
>> >> > on those structures and providing the result.
>> >> >
>> >> > These scan types may not be applicable to different storage formats.
>> >> > So how to handle them?
>> >>
>> >> I think that BitmapHeapScan, at least, is applicable to any table AM
>> >> that has TIDs.   It seems to me that in general we can imagine three
>> >> kinds of table AMs:
>> >>
>> >> 1. Table AMs where a tuple can be efficiently located by a real TID.
>> >> By a real TID, I mean that the block number part is really a block
>> >> number and the item ID is really a location within the block.  These
>> >> are necessarily quite similar to our current heap, but they can change
>> >> the tuple format and page format to some degree, and it seems like in
>> >> many cases it should be possible to plug them into our existing index
>> >> AMs without too much heartache.  Both index scans and bitmap index
>> >> scans ought to work.
>> >
>> >
>> > If #1 is only about changing tuple and page formats, then it could be
>> > much simpler than the patch upthread?  We can implement "page format
>> > access methods" with routines for insertion, update, pruning and
>> > deletion of tuples *in a particular page*.  There is no need to
>> > redefine the high-level logic for scanning the heap, doing updates,
>> > and so on...
>>
>> If you are changing the tuple format then you do need to worry about
>> all the places where we use HeapTuple, like TupleTableSlots,
>> visibility routines, etc. (just think: if somebody changes the tuple
>> header, then all such places are susceptible to change).
>
>
> Agreed.  I think that we can consider a pluggable tuple format as an
> independent feature which is desirable to have before pluggable storage.
> For example, I believe some FDWs could already benefit from a pluggable
> tuple format.
>

Accepting multiple tuple formats is possible by completely replacing
HeapTuple with TupleTableSlot, or with something like a values/nulls
array instead of a single memory chunk of tuple data.

Currently I am working on replacing HeapTuple with TupleTableSlot in
the upper layers once the tuple is returned from the scan.  In most of
the places where a HeapTuple is present, either replace it with
TupleTableSlot or change it to StorageTuple (void *).

I have yet to evaluate whether it is possible to support this as an
independent feature without needing some heap-format function to
interpret the tuple format in every place.
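
To illustrate the kind of indirection I mean, here is a rough sketch
(the wrapper name and the StorageTuple typedef are illustrative only,
not from the patches):

#include "postgres.h"
#include "executor/tuptable.h"

/* Hypothetical: an opaque tuple whose format only the storage AM knows. */
typedef void *StorageTuple;

/*
 * Upper layers stop calling heap_getattr() on a HeapTuple and instead
 * ask the slot, which knows how to deform whatever tuple it holds.
 */
static Datum
storage_slot_getattr(TupleTableSlot *slot, int attnum, bool *isnull)
{
	/*
	 * slot_getattr() deforms lazily today, assuming heap format; a
	 * pluggable-storage version would dispatch through the slot's
	 * storage AM handler instead.
	 */
	return slot_getattr(slot, attnum, isnull);
}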

Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] Pluggable storage

2017-06-27 Thread Haribabu Kommi
On Wed, Jun 28, 2017 at 12:00 AM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Tue, Jun 27, 2017 at 4:19 PM, Amit Kapila 
> wrote:
>
>> On Thu, Jun 22, 2017 at 5:46 PM, Alexander Korotkov
>>  wrote:
>> > On Tue, Jun 13, 2017 at 4:50 AM, Haribabu Kommi <
>> kommi.harib...@gmail.com>
>> > wrote:
>> >>
>> >> On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera <
>> alvhe...@2ndquadrant.com>
>> >> wrote:
>> >>>
>> >>> I have sent the partial patch I have to Hari Babu Kommi.  We expect
>> that
>> >>> he will be able to further this goal some more.
>> >>
>> >>
>> >> Thanks Alvaro for sharing your development patch.
>> >>
>> >> Most of the patch design is same as described by Alvaro in the first
>> mail
>> >> [1].
>> >> I will detail the modifications, pending items and open items (needs
>> >> discussion)
>> >> to implement proper pluggable storage.
>> >>
>> >> Here I attached WIP patches to support pluggable storage.  The
>> >> patches in the series may not work individually; many things are
>> >> still under development.  These patches are just to share the
>> >> approach of the current development.
>> >>
>> >> Some notable changes that I did to make the patch work:
>> >>
>> >> 1. Added the storage AM handler to the slot, because the relation
>> >> is not readily available in all the places it is needed.
>> >> 2. Retained the minimal tuple in the slot, as this is used in hash
>> >> join.  As per the first version, I feel it is fine to allow
>> >> creating HeapTuple-format data.
>> >>
>> >> Thanks everyone for sharing their ideas in the developer's
>> >> unconference at PGCon Ottawa.
>> >>
>> >> Pending items:
>> >>
>> >> 1. Replacement of Tuple with slot in Trigger functionality
>> >> 2. Replacement of Tuple with Slot from storage handler functions.
>> >> 3. Remove/minimize the use of HeapTuple as a Datum.
>> >> 4. Replace all references of HeapScanDesc with StorageScanDesc
>> >> 5. Planner changes to consider the relation storage during the
>> planning.
>> >> 6. Any planner changes based on the discussion of open items?
>> >> 7. some Executor changes to consider the storage advantages?
>> >>
>> >> Open Items:
>> >>
>> >> 1. The BitmapHeapScan and TableSampleScan are tightly coupled with
>> >> HeapTuple and HeapScanDesc, So these scans are directly operating
>> >> on those structures and providing the result.
>> >
>> >
>> > What about vacuum?  I see vacuum is untouched in the patchset and
>> > it is not mentioned in this discussion.
>> > Do you plan to override low-level functions like heap_page_prune(),
>> > lazy_vacuum_page() etc., but preserve the high-level logic of
>> > vacuum?  Or do you plan to let pluggable storage implement its own
>> > high-level vacuum algorithm?
>> >
>>
>> Probably, some other algorithm for vacuum.  I am not sure the current
>> vacuum with its parameters can be used so easily.  One thing that
>> might need some thought: for a developer implementing a different
>> vacuum strategy, is it sufficient to keep autovacuum off and call some
>> different API in the places where vacuum can be invoked manually (like
>> the VACUUM command), or do we need something more as well?
>
>
> What kind of other vacuum algorithm do you expect?  It would be rather
> easier to understand if we would have examples.
>
> For me, changing of vacuum algorithm is not needed for just heap page
> format change.  Existing vacuum algorithm could just call page format API
> functions for manipulating individual pages.
>
> Changing the vacuum algorithm might be needed for more invasive changes
> than just heap page format.  However, we should first understand what
> these changes could be and how they would be consistent with the rest
> of the API design.
>

Yes, I agree that we need some changes in vacuum to handle pluggable
storage.  I haven't yet fully checked the changes that are needed in
vacuum, but I feel low-level changes to the functions are enough.
There should also be some option for the storage handler to decide
whether it needs a vacuum or not; based on that flag, vacuum could be
skipped for those tables, and such handlers would not need to register
those APIs.
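
Roughly what I have in mind (all the names below are invented for
illustration, not from the patch set):

#include "postgres.h"
#include "utils/rel.h"

/*
 * Hypothetical fragment of a storage AM handler: a NULL vacuum callback
 * could be the "flag" meaning the AM manages space reclamation itself
 * and vacuum should skip the table.
 */
typedef struct StorageAmRoutine
{
	/* ... scan/insert/update callbacks ... */
	void		(*relation_vacuum) (Relation rel, int options);
} StorageAmRoutine;

static void
vacuum_rel_if_supported(StorageAmRoutine *am, Relation rel, int options)
{
	if (am->relation_vacuum == NULL)
		return;					/* AM opted out of vacuum entirely */
	am->relation_vacuum(rel, options);
}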

Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] Pluggable storage

2017-06-27 Thread Haribabu Kommi
On Thu, Jun 22, 2017 at 5:47 AM, Robert Haas  wrote:

> On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi
>  wrote:
> > Open Items:
> >
> > 1. The BitmapHeapScan and TableSampleScan are tightly coupled with
> > HeapTuple and HeapScanDesc, So these scans are directly operating
> > on those structures and providing the result.
> >
> > These scan types may not be applicable to different storage formats.
> > So how to handle them?
>
> I think that BitmapHeapScan, at least, is applicable to any table AM
> that has TIDs.   It seems to me that in general we can imagine three
> kinds of table AMs:
>
> 1. Table AMs where a tuple can be efficiently located by a real TID.
> By a real TID, I mean that the block number part is really a block
> number and the item ID is really a location within the block.  These
> are necessarily quite similar to our current heap, but they can change
> the tuple format and page format to some degree, and it seems like in
> many cases it should be possible to plug them into our existing index
> AMs without too much heartache.  Both index scans and bitmap index
> scans ought to work.
>
> 2. Table AMs where a tuple has some other kind of locator.  For
> example, imagine an index-organized table where the locator is the
> primary key, which is a bit like what Alvaro had in mind for indirect
> indexes.  If the locator is 6 bytes or less, it could potentially be
> jammed into a TID, but I don't think that's a great idea.  For things
> like int8 or numeric, it won't work at all.  Even for other things,
> it's going to cause problems because the bit patterns won't be what
> the code is expecting; e.g. bitmap scans care about the structure of
> the TID, not just how many bits it is.  (Due credit: Somebody, maybe
> Alvaro, pointed out this problem before, at PGCon.)  For these kinds
> of tables, larger modifications to the index AMs are likely to be
> necessary, at least if we want a really general solution, or maybe we
> should have separate index AMs - e.g. btree for traditional TID-based
> heaps, and generic_btree or indirect_btree or key_btree or whatever
> for heaps with some other kind of locator.  It's not too hard to see
> how to make index scans work with this sort of structure but it's very
> unclear to me whether, or how, bitmap scans can be made to work.
>
> 3. Table AMs where a tuple doesn't really have a locator at all.  In
> these cases, we can't support any sort of index AM at all.  When the
> table is queried, there's really nothing the core system can do except
> ask the table AM for a full scan, supply the quals, and hope the table
> AM has some sort of smarts that enable it to optimize somehow.  For
> example, you can imagine converting cstore_fdw into a table AM of this
> sort - ORC has a sort of inbuilt BRIN-like indexing that allows whole
> chunks to be proven uninteresting and skipped.  (You could use chunk
> number + offset to turn this into a table AM of the previous type if
> you wanted to support secondary indexes; not sure if that'd be useful,
> but it'd certainly be harder.)
>
> I'm more interested in #1 than in #3, and more interested in #3 than
> #2, but other people may have different priorities.


Hi Robert,

Thanks for the details and your opinion.
I also agree that option #1 is better to do first.

Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] pg_serial early wraparound

2017-06-27 Thread Thomas Munro
On Sat, Mar 25, 2017 at 7:27 AM, Thomas Munro
 wrote:
> On Sat, Mar 25, 2017 at 3:11 AM, Anastasia Lubennikova
>  wrote:
>> You claim that SLRUs now support five digit segment name, while in slru.h
>> at current master I see the following:
>>
>>  * Note: slru.c currently assumes that segment file names will be four hex
>>  * digits.  This sets a lower bound on the segment size (64K transactions
>>  * for 32-bit TransactionIds).
>>  */

I've now complained about that comment in a separate thread.

> It's not urgent, it's just cleanup work, so I've now moved it to the
> next commitfest.  I will try to figure out a new way to demonstrate
> that it works correctly without having to ask a reviewer to disable
> any assertions.  Thanks again.

Here's a rebased patch.

-- 
Thomas Munro
http://www.enterprisedb.com


ssi-slru-wraparound-v2.patch
Description: Binary data



Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-06-27 Thread Thomas Munro
On Mon, May 22, 2017 at 6:39 AM, Thomas Munro
 wrote:
> On Thu, Apr 27, 2017 at 11:03 AM, Thomas Munro
>  wrote:
>> On Thu, Apr 27, 2017 at 5:13 AM, Oleg Golovanov  wrote:
>>> Can you bring your patch set up to date?  I got an error from
>>> 0010-hj-parallel-v12.patch.
>>
>> I really should get around to setting up a cron job to tell me about
>> that.  Here's a rebased version.
>
> Rebased.

Rebased for the recent re-indent and shm_toc API change; no functional
changes in this version.

(I have a new patch set in the pipeline adding the skew optimisation
and some other things, more on that soon.)

-- 
Thomas Munro
http://www.enterprisedb.com


parallel-shared-hash-v15.patchset.tgz
Description: GNU Zip compressed data



[HACKERS] Typo in comment in xlog.c: ReadRecord

2017-06-27 Thread Amit Langote
Attached fixes $SUBJECT.

s/fetch_ckpt/fetching_ckpt/g

Thanks,
Amit
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0a6314a642..5b6cec8dee 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4221,10 +4221,10 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 				 * pg_wal, so we are presumably now consistent.
 				 *
 				 * We require that there's at least some valid WAL present in
-				 * pg_wal, however (!fetch_ckpt). We could recover using the WAL
-				 * from the archive, even if pg_wal is completely empty, but we'd
-				 * have no idea how far we'd have to replay to reach consistency.
-				 * So err on the safe side and give up.
+				 * pg_wal, however (!fetching_ckpt).  We could recover using the
+				 * WAL from the archive, even if pg_wal is completely empty, but
+				 * we'd have no idea how far we'd have to replay to reach
+				 * consistency.  So err on the safe side and give up.
 				 */
 				if (!InArchiveRecovery && ArchiveRecoveryRequested &&
 					!fetching_ckpt)



Re: [HACKERS] transition table behavior with inheritance appears broken

2017-06-27 Thread Andrew Gierth
> "Noah" == Noah Misch  writes:

 Noah> IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is
 Noah> long past due for your status update.  Please reacquaint yourself
 Noah> with the policy on open item ownership[1] and then reply
 Noah> immediately.  If I do not hear from you by 2017-06-28 06:00 UTC,
 Noah> I will transfer this item to release management team ownership
 Noah> without further notice.

Sorry for the lack of updates. I need to sleep now, but I will send a
proper status update by 1800 UTC (1900 BST) today (28th).

-- 
Andrew.




Re: [HACKERS] SERIALIZABLE with parallel query

2017-06-27 Thread Thomas Munro
On Tue, Apr 4, 2017 at 8:25 AM, Thomas Munro
 wrote:
> ... but considering that these data structures may
> finish up being redesigned as part of the GSoC project[1], it may be
> best to wait and see where that goes before doing anything.  I'll
> follow developments there, and if this patch remains relevant I'll
> plan to do some more work on it including testing (possibly with the
> RUBiS benchmark from Kevin and Dan's paper since it seems the most
> likely to be able to really use parallelism) for PG11 CF1.

I've been keeping one eye on the GSoC project.  That patch changes the
inConflicts and outConflicts data structures, but not the locking
protocol.  This patch works by introducing per-SERIALIZABLEXACT
locking in the places where the code currently assumes that the
current backend is the only one that could modify a shared data
structure (namely MySerializableXact->predicateLocks), so that
MySerializableXact can be shared with workers.  There doesn't seem to
be any incompatibility or dependency so far, so here's a rebased
patch.  Testing needed.
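
To sketch the change in locking discipline (everything here is
illustrative; in particular, where the per-SERIALIZABLEXACT lock lives
and what it is called is invented):

#include "postgres.h"
#include "storage/lwlock.h"

/*
 * Code that previously assumed the owning backend was the sole writer
 * of MySerializableXact->predicateLocks now brackets those accesses
 * with a per-SERIALIZABLEXACT lock, so workers can share the leader's
 * SERIALIZABLEXACT.
 */
static void
with_sxact_lock(LWLock *sxact_lock, void (*callback) (void))
{
	LWLockAcquire(sxact_lock, LW_EXCLUSIVE);
	callback();					/* walk or modify the predicate lock list */
	LWLockRelease(sxact_lock);
}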

-- 
Thomas Munro
http://www.enterprisedb.com


ssi-parallel-v5.patch
Description: Binary data



Re: [HACKERS] Fast promotion not used when doing a recovery_target PITR restore?

2017-06-27 Thread Andres Freund
On 2017-06-28 06:04:23 +0900, Michael Paquier wrote:
> On Wed, Jun 28, 2017 at 3:44 AM, Andres Freund  wrote:
> > I'm far from convinced by this.  By now WAL replay with checkpointer,
> > bgwriter, etc. active is actually *more* tested than the cases without
> > it. The likelihood of bugs is higher in the less frequently exercised
> > paths, and given that replication exercises the situation with all those
> > processes active on a continuous basis, I'm fairly unconvinced by your
> > argument.
> 
> Crash recovery is the last place where failures should happen.
> Don't you think that it should remain as simple as it was originally
> designed?  It seems to me that the argument for keeping things simple
> has higher priority than the performance of being able to reconnect
> sooner by delaying the checkpoint.

You seem to be arguing entirely past my point that the replication path
is *more* robust by now.  And there are plenty of scenarios where a
faster startup is quite crucial for performance: the difference between
an immediate shutdown + recovery without a checkpoint and a fast
shutdown can be very large, and that matters a lot for faster postgres
updates etc.

Andres




Re: [HACKERS] Fast promotion not used when doing a recovery_target PITR restore?

2017-06-27 Thread Michael Paquier
On Wed, Jun 28, 2017 at 3:44 AM, Andres Freund  wrote:
> I'm far from convinced by this.  By now WAL replay with checkpointer,
> bgwriter, etc. active is actually *more* tested than the cases without
> it. The likelihood of bugs is higher in the less frequently exercised
> paths, and given that replication exercises the situation with all those
> processes active on a continuous basis, I'm fairly unconvinced by your
> argument.

Crash recovery is the last place where failures should happen.
Don't you think that it should remain as simple as it was originally
designed?  It seems to me that the argument for keeping things simple
has higher priority than the performance of being able to reconnect
sooner by delaying the checkpoint.
-- 
Michael




Re: [HACKERS] Reducing pg_ctl's reaction time

2017-06-27 Thread Tom Lane
Andres Freund  writes:
> On 2017-06-27 14:59:18 -0400, Tom Lane wrote:
>> If we decide that it has to wait for v11,
>> I'd address Jeff's complaint by hacking the loop behavior in
>> test_postmaster_connection, which'd be ugly but not many lines of code.

> Basically increasing the wait time over time?

I was thinking of just dropping back to once-per-second after a couple
of seconds, with something along the lines of this in place of the
existing sleep at the bottom of the loop:

	if (i >= 2 * WAITS_PER_SEC)
	{
		/* after the first couple of seconds, poll only once per second */
		pg_usleep(USEC_PER_SEC);
		i += WAITS_PER_SEC - 1;	/* count the long sleep as a full second */
	}
	else
		pg_usleep(USEC_PER_SEC / WAITS_PER_SEC);


> The 8-space thing in multiple places is a bit ugly.  How about having
> a bunch of constants declared in one place?  Alternatively make it
> something like "status: $c", where $c is a one-character code for the
> various states?

Yeah, we could add the string values as macros in miscadmin.h, perhaps.
I don't like the single-character idea --- if we do expand the number of
things reported this way, that could get restrictive.
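
Something like this, perhaps (macro names invented; the strings are the
ones the draft patch writes, all padded to 8 bytes per
AddToDataDirLockFile()'s requirements):

#define PM_STATUS_STARTING	"starting"
#define PM_STATUS_STOPPING	"stopping"
#define PM_STATUS_READY		"ready   "
#define PM_STATUS_STANDBY	"standby "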

regards, tom lane




Re: [HACKERS] Reducing pg_ctl's reaction time

2017-06-27 Thread Andres Freund
Hi,

On 2017-06-27 14:59:18 -0400, Tom Lane wrote:
> Here's a draft patch for that.  I quite like the results --- this seems
> way simpler and more reliable than what pg_ctl has done up to now.

Yea, I like that too.


> However, it's certainly arguable that this is too much change for an
> optional post-beta patch.

Yea, I think there's a valid case to be made for that. I'm still
inclined to go along with this, it seems we're otherwise just adding
complexity we're going to remove in a bit again.


> If we decide that it has to wait for v11,
> I'd address Jeff's complaint by hacking the loop behavior in
> test_postmaster_connection, which'd be ugly but not many lines of code.

Basically increasing the wait time over time?


> Note that I followed the USE_SYSTEMD patch's lead as to where to report
> postmaster state changes.  Arguably, in the standby-server case, we
> should not report the postmaster is ready until we've reached consistency.
> But that would require additional signaling from the startup process
> to the postmaster, so it seems like a separate change if we want it.

I suspect we're going to want to add more states to this over time, but
as you say, there's no need to hurry.


>  	/*
> +	 * Report postmaster status in the postmaster.pid file, to allow pg_ctl to
> +	 * see what's happening.  Note that all strings written to the status line
> +	 * must be the same length, per comments for AddToDataDirLockFile().  We
> +	 * currently make them all 8 bytes, padding with spaces.
> +	 */
> +	AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, "starting");

The 8-space thing in multiple places is a bit ugly.  How about having
a bunch of constants declared in one place?  Alternatively make it
something like "status: $c", where $c is a one-character code for the
various states?


> @@ -1149,8 +1149,9 @@ TouchSocketLockFiles(void)
>   *
>   * Note: because we don't truncate the file, if we were to rewrite a line
>   * with less data than it had before, there would be garbage after the last
> - * line.  We don't ever actually do that, so not worth adding another kernel
> - * call to cover the possibility.
> + * line.  While we could fix that by adding a truncate call, that would make
> + * the file update non-atomic, which we'd rather avoid.  Therefore, callers
> + * should endeavor never to shorten a line once it's been written.
>   */
>  void
>  AddToDataDirLockFile(int target_line, const char *str)
> @@ -1193,19 +1194,26 @@ AddToDataDirLockFile(int target_line, co
>  	srcptr = srcbuffer;
>  	for (lineno = 1; lineno < target_line; lineno++)
>  	{
> -		if ((srcptr = strchr(srcptr, '\n')) == NULL)
> -		{
> -			elog(LOG, "incomplete data in \"%s\": found only %d newlines while trying to add line %d",
> -				 DIRECTORY_LOCK_FILE, lineno - 1, target_line);
> -			close(fd);
> -			return;
> -		}
> -		srcptr++;
> +		char	   *eol = strchr(srcptr, '\n');
> +
> +		if (eol == NULL)
> +			break;				/* not enough lines in file yet */
> +		srcptr = eol + 1;
>  	}
>  	memcpy(destbuffer, srcbuffer, srcptr - srcbuffer);
>  	destptr = destbuffer + (srcptr - srcbuffer);
> 
>  	/*
> +	 * Fill in any missing lines before the target line, in case lines are
> +	 * added to the file out of order.
> +	 */
> +	for (; lineno < target_line; lineno++)
> +	{
> +		if (destptr < destbuffer + sizeof(destbuffer))
> +			*destptr++ = '\n';
> +	}
> +
> +	/*
>  	 * Write or rewrite the target line.
>  	 */
>  	snprintf(destptr, destbuffer + sizeof(destbuffer) - destptr, "%s\n", str);

Not this patch's fault, and it shouldn't be changed now, but this is a
fairly weird way to manage and update this file.

Greetings,

Andres Freund




[HACKERS] Abbreviated keys in nbtree internal pages

2017-06-27 Thread Peter Geoghegan
Over on the "memory layouts for binary search in nbtree" thread, I
described a plausible way of implementing abbreviated keys for nbtree
internal pages [1]. I've been going on about this technique for a long
time, but the insight that we can do it by reusing an itemId's lp_len
field is a new one. I'm encouraged by the fact that I got a POC patch
to work quite easily.

This optimization is all about reducing the number of cache misses
within _bt_binsrch(). We'd do this for internal pages only. The big
idea is that binary searches may only have to access the ItemId array
on internal pages, which is often less than 1/8 of the size of the
page, and sometimes only 1/16 (consider internal page fillfactor).
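
For concreteness, the 15 bits in question come from the existing line
pointer layout (quoted from src/include/storage/itemid.h), which we can
repurpose on internal pages; the accessor macros below are hypothetical:

/* Existing line pointer layout: */
typedef struct ItemIdData
{
	unsigned	lp_off:15,		/* offset to tuple (from start of page) */
				lp_flags:2,		/* state of line pointer, see below */
				lp_len:15;		/* byte length of tuple */
} ItemIdData;

/*
 * Hypothetical accessors: on internal pages the tuple header already
 * knows its own length, so lp_len can hold a 15-bit abbreviated key.
 */
#define ItemIdGetAbbrevKey(itemId)	((itemId)->lp_len)
#define ItemIdSetAbbrevKey(itemId, abbrev) \
	((itemId)->lp_len = (abbrev) & 0x7FFF)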

Areas of concern/interest include:

* How to implement an encoding scheme to make the best of the 15 bits
we'd have for this. That could involve hard trade-offs for something
like numeric. Note that I'm not interested in making any encoding
scheme that is adaptive -- simplicity demands that we implement
totally generic encoding schemes, I think. We'd also make the
comparisons totally generic unsigned integer comparisons, that are
hard coded within _bt_compare().

15 bits is more than it sounds. It would be useless for sorting, but
within B-Trees things are quite different. You can amortize the cost
of generating the key over a very long period of time, and abbreviated
keys are only needed for internal pages (typically perhaps 1 in 50 -
200 leaf tuples for text). You only need to generate a new abbreviated
key during a leaf page split (page splits of internal pages just reuse
the existing one). And, most importantly, internal pages naturally
represent dispersed values. You have an excellent chance of resolving
comparisons within the root page using only one-byte abbreviated key
comparisons. 15 bits gets you surprisingly far.

* What other code this optimization might break.

* Workloads helped by the optimization, that should be tested.

* Whether it's okay that some workloads will not be helped at all. The
hope here is that branch prediction, plus the fact that we never touch
anything that isn't already in a cache line, means those cases don't
get hurt either.

* We'll want to do this for text's default B-Tree opclass, with a real
collation, since that is where it would really matter. I think there
are concerns about the stability of something like a strxfrm() blob
over and above the concerns we have for sort support. You don't just
have to worry about the stability of the behavior of collations -- you
also have to worry about the stability of actual blobs on disk, and
that a technical enhancement to the collator algorithm could break
things that wouldn't break sort support.

ICU has facilities for this, and the ICU documentation discusses
storing strxfrm()-style blobs on disk for the long term, but the
details would need to be worked out.

[1] 
postgr.es/m/cah2-wzmw9ljtfzp7uy4chsf3nh0ym-_pow3lxrh18orlgeu...@mail.gmail.com
-- 
Peter Geoghegan




Re: [HACKERS] Reducing pg_ctl's reaction time

2017-06-27 Thread Tom Lane
I wrote:
> Andres Freund  writes:
>> On 2017-06-26 17:38:03 -0400, Tom Lane wrote:
>>> Hm.  Take that a bit further, and we could drop the connection probes
>>> altogether --- just put the whole responsibility on the postmaster to
>>> show in the pidfile whether it's ready for connections or not.

>> Yea, that seems quite appealing, both from an architectural, simplicity,
>> and log noise perspective. I wonder if there's some added reliability by
>> the connection probe though? Essentially wondering if it'd be worthwhile
>> to keep a single connection test at the end. I'm somewhat disinclined
>> though.

> I agree --- part of the appeal of this idea is that there could be a net
> subtraction of code from pg_ctl.  (I think it wouldn't have to link libpq
> anymore at all, though maybe I forgot something.)  And you get rid of a
> bunch of can't-connect failure modes, eg kernel packet filter in the way,
> which probably outweighs any hypothetical reliability gain from confirming
> the postmaster state the old way.

Here's a draft patch for that.  I quite like the results --- this seems
way simpler and more reliable than what pg_ctl has done up to now.
However, it's certainly arguable that this is too much change for an
optional post-beta patch.  If we decide that it has to wait for v11,
I'd address Jeff's complaint by hacking the loop behavior in
test_postmaster_connection, which'd be ugly but not many lines of code.

Note that I followed the USE_SYSTEMD patch's lead as to where to report
postmaster state changes.  Arguably, in the standby-server case, we
should not report the postmaster is ready until we've reached consistency.
But that would require additional signaling from the startup process
to the postmaster, so it seems like a separate change if we want it.

Thoughts?

regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 2874f63..38f534f 100644
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
*** PostmasterMain(int argc, char *argv[])
*** 1341,1346 
--- 1341,1354 
  #endif
  
  	/*
+ 	 * Report postmaster status in the postmaster.pid file, to allow pg_ctl to
+ 	 * see what's happening.  Note that all strings written to the status line
+ 	 * must be the same length, per comments for AddToDataDirLockFile().  We
+ 	 * currently make them all 8 bytes, padding with spaces.
+ 	 */
+ 	AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, "starting");
+ 
+ 	/*
  	 * We're ready to rock and roll...
  	 */
  	StartupPID = StartupDataBase();
*** pmdie(SIGNAL_ARGS)
*** 2608,2613 
--- 2616,2624 
  			Shutdown = SmartShutdown;
  			ereport(LOG,
  	(errmsg("received smart shutdown request")));
+ 
+ 			/* Report status */
+ 			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, "stopping");
  #ifdef USE_SYSTEMD
  			sd_notify(0, "STOPPING=1");
  #endif
*** pmdie(SIGNAL_ARGS)
*** 2663,2668 
--- 2674,2682 
  			Shutdown = FastShutdown;
  			ereport(LOG,
  	(errmsg("received fast shutdown request")));
+ 
+ 			/* Report status */
+ 			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, "stopping");
  #ifdef USE_SYSTEMD
  			sd_notify(0, "STOPPING=1");
  #endif
*** pmdie(SIGNAL_ARGS)
*** 2727,2732 
--- 2741,2749 
  			Shutdown = ImmediateShutdown;
  			ereport(LOG,
  	(errmsg("received immediate shutdown request")));
+ 
+ 			/* Report status */
+ 			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, "stopping");
  #ifdef USE_SYSTEMD
  			sd_notify(0, "STOPPING=1");
  #endif
*** reaper(SIGNAL_ARGS)
*** 2872,2877 
--- 2889,2896 
  			ereport(LOG,
  	(errmsg("database system is ready to accept connections")));
  
+ 			/* Report status */
+ 			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, "ready   ");
  #ifdef USE_SYSTEMD
  			sd_notify(0, "READY=1");
  #endif
*** sigusr1_handler(SIGNAL_ARGS)
*** 5005,5014 
  		if (XLogArchivingAlways())
  			PgArchPID = pgarch_start();
  
! #ifdef USE_SYSTEMD
  		if (!EnableHotStandby)
  			sd_notify(0, "READY=1");
  #endif
  
  		pmState = PM_RECOVERY;
  	}
--- 5024,5041 
  		if (XLogArchivingAlways())
  			PgArchPID = pgarch_start();
  
! 		/*
! 		 * If we aren't planning to enter hot standby mode later, treat
! 		 * RECOVERY_STARTED as meaning we're out of startup, and report status
! 		 * accordingly.
! 		 */
  		if (!EnableHotStandby)
+ 		{
+ 			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, "standby ");
+ #ifdef USE_SYSTEMD
  			sd_notify(0, "READY=1");
  #endif
+ 		}
  
  		pmState = PM_RECOVERY;
  	}
*** sigusr1_handler(SIGNAL_ARGS)
*** 5024,5029 
--- 5051,5058 
  		ereport(LOG,
  (errmsg("database system is ready to accept read only connections")));
  
+ 		/* Report status */
+ 		AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, "ready   ");
  #ifdef USE_SYSTEMD
  		sd_notify(0, "READY=1");
  #endif
diff --git 

Re: [HACKERS] memory layouts for binary search in nbtree

2017-06-27 Thread Andres Freund
On 2017-06-27 11:13:38 -0700, Peter Geoghegan wrote:
> On Tue, Jun 27, 2017 at 11:04 AM, Andres Freund  wrote:
> >
> > On 2017-06-27 10:57:15 -0700, Peter Geoghegan wrote:
> >> I looked at this again recently. I wrote a patch to prove to myself
> >> that we can fairly easily reclaim 15 bits from every nbtree internal
> >> page ItemId, and put an abbreviated key there instead.
> >
> > Interesting. Not sure however that really addresses my layout / cache
> > efficiency concern? Or is that just a largely independent optimization,
> > except it's affecting the same area of code?
> 
> Well, you'd only do this on internal pages, which are a tiny minority
> of the total, and yet are where the majority of binary searches for an
> index scan occur in practice. The optimization has the effect of
> making the binary search only need to access the much smaller ItemId
> array in that best case. Even then, once the binary search of an
> internal page finds a match, you still have to fetch the matching
> index tuple in order to follow its TID to the page on the next level
> down. But that's all.
> 
> In the best case, and maybe the average case, this could be highly
> effective, I think. There would definitely be cases where the
> optimization wouldn't help at all, but hopefully it would also not
> hurt.

In other words, it's an independent optimization.  That's cool, but I'd
rather talk about it in an independent thread, to avoid conflating the
issues.




Re: [HACKERS] Fast promotion not used when doing a recovery_target PITR restore?

2017-06-27 Thread Andres Freund
On 2017-06-23 10:56:07 +0900, Michael Paquier wrote:
> > And even there it might actually be
> > a pretty good idea to not force a full checkpoint - getting up fast
> > after a crash is kinda important..
> 
> But not that. Crash recovery is designed to be simple and robust, with
> only the postmaster and the startup process running while it happens.
> Having the startup process not do its own checkpoints would require
> the bgwriter as well, which increases the likelihood of bugs. In
> short, I don't think improving performance is what matters for crash
> recovery; robustness and simplicity are.

I'm far from convinced by this.  By now WAL replay with checkpointer,
bgwriter, etc. active is actually *more* tested than the cases without
it. The likelihood of bugs is higher in the less frequently exercised
paths, and given that replication exercises the situation with all those
processes active on a continuous basis, I'm fairly unconvinced by your
argument.

- Andres




Re: [HACKERS] memory layouts for binary search in nbtree

2017-06-27 Thread Peter Geoghegan
On Tue, Jun 27, 2017 at 11:05 AM, Andres Freund  wrote:
>> Did you ever try running a pgbench SELECT benchmark, having modified
>> things such that all PKs are on columns that are not of type
>> int4/int8, but rather are of type numeric? It's an interesting
>> experiment, that I've been meaning to re-run on a big box.
>
>> Obviously this will be slower than an equivalent plain pgbench SELECT,
>> but the difference may be smaller than you expect.
>
> I'm not sure what that has to do with the topic?

It suggests that cache misses are much more important than anything
else. Even with collated text the difference was not so large, IIRC,
though it was noticeably larger than with numeric.

-- 
Peter Geoghegan




Re: [HACKERS] memory layouts for binary search in nbtree

2017-06-27 Thread Peter Geoghegan
On Tue, Jun 27, 2017 at 11:04 AM, Andres Freund  wrote:
>
> On 2017-06-27 10:57:15 -0700, Peter Geoghegan wrote:
>> I looked at this again recently. I wrote a patch to prove to myself
>> that we can fairly easily reclaim 15 bits from every nbtree internal
>> page ItemId, and put an abbreviated key there instead.
>
> Interesting. Not sure however that really addresses my layout / cache
> efficiency concern? Or is that just a largely independent optimization,
> except it's affecting the same area of code?

Well, you'd only do this on internal pages, which are a tiny minority
of the total, and yet are where the majority of binary searches for an
index scan occur in practice. The optimization has the effect of
making the binary search only need to access the much smaller ItemId
array in that best case. Even then, once the binary search of an
internal page finds a match, you still have to fetch the matching
index tuple in order to follow its TID to the page on the next level
down. But that's all.

In the best case, and maybe the average case, this could be highly
effective, I think. There would definitely be cases where the
optimization wouldn't help at all, but hopefully it would also not
hurt.

-- 
Peter Geoghegan




Re: [HACKERS] memory layouts for binary search in nbtree

2017-06-27 Thread Andres Freund
On 2016-05-19 19:38:02 -0700, Peter Geoghegan wrote:
> On Wed, May 18, 2016 at 6:25 AM, Andres Freund  wrote:
> > currently we IIRC use linearly sorted datums for the search in
> > individual btree nodes.  Not surprisingly that's often one of the
> > dominant entries in profiles. We could probably improve upon that by
> > using an order more optimized for efficient binary search.
> 
> Did you ever try running a pgbench SELECT benchmark, having modified
> things such that all PKs are on columns that are not of type
> int4/int8, but rather are of type numeric? It's an interesting
> experiment, that I've been meaning to re-run on a big box.

> Obviously this will be slower than an equivalent plain pgbench SELECT,
> but the difference may be smaller than you expect.

I'm not sure what that has to do with the topic?

- Andres




Re: [HACKERS] memory layouts for binary search in nbtree

2017-06-27 Thread Andres Freund
Hi,

On 2017-06-27 10:57:15 -0700, Peter Geoghegan wrote:
> I looked at this again recently. I wrote a patch to prove to myself
> that we can fairly easily reclaim 15 bits from every nbtree internal
> page ItemId, and put an abbreviated key there instead.

Interesting. Not sure however that really addresses my layout / cache
efficiency concern? Or is that just a largely independent optimization,
except it's affecting the same area of code?


> Can you suggest a workload/hardware to assess the benefits of an
> optimization like this, Andres? Is there a profile available somewhere
> in the archives that shows many cache misses within _bt_binsrch()?

I don't quite remember what triggered my report, but it's quite commonly
visible in any workload that has btrees above toy sizes that still fit
in shared_buffers.  Obviously you need something where btree lookups are
a significant share of the time, but that's easy enough with index
nested loop joins and such.  I do recall seeing it recently-ish in a
number of tpc-h queries.

Greetings,

Andres Freund




Re: [HACKERS] memory layouts for binary search in nbtree

2017-06-27 Thread Peter Geoghegan
On Thu, May 19, 2016 at 7:28 PM, Peter Geoghegan  wrote:
> Abbreviated keys in indexes are supposed to help with this. Basically,
> the ItemId array is made to be interlaced with small abbreviated keys
> (say one or two bytes), only in the typically less than 1% of pages
> that are internal (leaf page binary searches don't change).

I looked at this again recently. I wrote a patch to prove to myself
that we can fairly easily reclaim 15 bits from every nbtree internal
page ItemId, and put an abbreviated key there instead. The patch has
nbtree tell PageAddItem() to store an abbreviated key within lp_len
(actually, we call PageAddItemAbbrev() now). I didn't realize that
stealing lp_len might be feasible until recently, but it appears to be
-- this is a lot simpler than other approaches might be. I already
implemented a rudimentary, POC encoding scheme for int4, but text is
the datatype that I'd expect to see a real benefit for.

While this POC patch of mine could easily have bugs, it is at least
true that "make check-world" passes for me. For nbtree, reclaiming the
lp_len space was just a matter of teaching a small number of existing
places to go to the IndexTuple to get a tuple's length, rather than
trusting an ItemId's lp_len field (that is, rather than calling
ItemIdGetLength()). Most nbtree related code already gets the length
from the index tuple header today, since it's pretty rare to care
about the length of a tuple but not its contents. This did require
updating some generic routines within bufpage.c, too, but that wasn't
so bad.
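
The comparison strategy is then simple. A sketch (with invented names,
and assuming an order-preserving encoding, so that unsigned comparison
of the abbreviated keys is authoritative whenever they differ):

#include "postgres.h"
#include "storage/itemid.h"

/* hypothetical: lp_len repurposed as abbreviated key on internal pages */
#define ItemIdGetAbbrevKey(itemId)	((itemId)->lp_len)

static inline int
abbrev_compare(uint16 probe_abbrev, ItemId itemid,
			   int (*full_cmp) (ItemId itemid))
{
	uint16		stored = ItemIdGetAbbrevKey(itemid);

	if (probe_abbrev < stored)
		return -1;
	if (probe_abbrev > stored)
		return 1;
	/* abbreviated keys tie: fall back to comparing the index tuples */
	return full_cmp(itemid);
}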

Can you suggest a workload/hardware to assess the benefits of an
optimization like this, Andres? Is there a profile available somewhere
in the archives that shows many cache misses within _bt_binsrch()?

-- 
Peter Geoghegan




Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-27 Thread Petr Jelinek

On 27/06/17 10:51, Masahiko Sawada wrote:
> On Mon, Jun 26, 2017 at 12:12 PM, Masahiko Sawada  
> wrote:
> 
> I've reviewed this patch briefly.

Thanks!

> 
> @@ -515,6 +533,31 @@ logicalrep_worker_stop(Oid subid, Oid relid)
>  }
> 
>  /*
> + * Request worker to be stopped on commit.
> + */
> +void
> +logicalrep_worker_stop_at_commit(Oid subid, Oid relid)
> +{
> +   LogicalRepWorkerId *wid;
> +   MemoryContext old;
> +
> +   old = MemoryContextSwitchTo(TopTransactionContext);
> +
> +   wid = (LogicalRepWorkerId *) palloc(sizeof(LogicalRepWorkerId));
> +
> +   /*
> +   wid = MemoryContextAlloc(TopTransactionContext,
> +                            sizeof(LogicalRepWorkerId));
> +   */
> +   wid->subid = subid;
> +   wid->relid = relid;
> +
> +   on_commit_stop_workers = lappend(on_commit_stop_workers, wid);
> +
> +   MemoryContextSwitchTo(old);
> +}
> 
> logicalrep_worker_stop_at_commit() has a problem: new_list(), called
> by lappend(), allocates memory from the current memory context.  It
> should switch the memory context first, and then allocate the memory
> for wid and append it to the list.
> 

Right, fixed (I see you did that locally as well based on the above
excerpt ;) ).

> 
> @@ -754,9 +773,25 @@ ApplyLauncherShmemInit(void)
>  void
>  AtEOXact_ApplyLauncher(bool isCommit)
>  {
> -   if (isCommit && on_commit_launcher_wakeup)
> -   ApplyLauncherWakeup();
> +   ListCell *lc;
> +
> +   if (isCommit)
> +   {
> +   foreach (lc, on_commit_stop_workers)
> +   {
> +   LogicalRepWorkerId *wid = lfirst(lc);
> +   logicalrep_worker_stop(wid->subid, wid->relid);
> +   }
> +
> +   if (on_commit_launcher_wakeup)
> +   ApplyLauncherWakeup();
> 
> Stopping the logical rep worker in AtEOXact_ApplyLauncher doesn't
> support prepared transactions.  Since we allocate the
> on_commit_stop_workers list in TopTransactionContext, postgres crashes
> if we execute any query after a prepared transaction that removes a
> subscription relation state.  Also, even after fixing that issue, we
> still need to do something: the on_commit_stop_workers list is not
> associated with the prepared transaction, so a query following
> "prepare transaction" tries to stop the workers at its own commit.
> There was a similar discussion before.
> 

Hmm, good point. I think for now it makes sense to simply not allow
PREPARE for transactions that manipulate workers, similar to what we do
when there are exported snapshots. I've done it that way in the attached.

> 
> +
> +   ensure_transaction();
> +   UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
> +                              rstate->relid, rstate->state,
> +                              rstate->lsn);
> 
> 
> Should we commit the transaction if we started a new one before
> updating the subscription relation state?  Otherwise it could be a
> deadlock risk.

We only lock the whole subscription (and the only conflicting things are
DROP and ALTER SUBSCRIPTION), not the individual subscription-relation
mapping, so it doesn't seem to me like there is any higher risk of
deadlocks than with anything else that works with multiple tables (or
compared to the previous behavior).

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
From 88da433110aa3bde3dcce33ebf62d41a08c191b9 Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Sat, 24 Jun 2017 19:38:21 +0200
Subject: [PATCH] Rework subscription worker and relation status handling

---
 src/backend/access/transam/xact.c   |   9 +
 src/backend/catalog/pg_subscription.c   | 137 ++--
 src/backend/commands/subscriptioncmds.c |  98 -
 src/backend/replication/logical/launcher.c  | 309 
 src/backend/replication/logical/tablesync.c |  97 +
 src/backend/replication/logical/worker.c|  23 ++-
 src/backend/utils/cache/catcache.c  |   6 +-
 src/include/catalog/pg_subscription_rel.h   |   6 +-
 src/include/replication/logicallauncher.h   |   1 +
 src/include/replication/worker_internal.h   |   6 +-
 10 files changed, 393 insertions(+), 299 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b0aa69f..322502d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2277,6 +2277,15 @@ PrepareTransaction(void)
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  errmsg("cannot PREPARE a transaction that has exported snapshots")));
 
+	/*
+	 * Similar to above, don't allow PREPARE for transactions that have
+	 * manipulated logical replication workers.
+	 */
+	if (XactManipulatesLogicalReplicationWorkers())
+		ereport(ERROR,
+(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot PREPARE a transaction that has manipulated logical replication workers")));
+
 	/* Prevent 

Re: [HACKERS] lag(bigint,int,int), etc?

2017-06-27 Thread Merlin Moncure
On Tue, Jun 27, 2017 at 10:12 AM, Colin 't Hart  wrote:
> On 27 Jun 2017, at 17:06, Merlin Moncure  wrote:
>>
>>> On Tue, Jun 27, 2017 at 10:01 AM, Colin 't Hart  
>>> wrote:
>>> Hi,
>>>
>>> The following rather contrived example illustrates that lag(), lead()
>>> (and probably other functions) can't automatically cast an integer to
>>> a bigint:
>>>
>>> select lag(sum,1,0) over () from (select sum(generate_series) over
>>> (order by generate_series) from generate_series(1,10)) x;
>>> ERROR:  function lag(bigint, integer, integer) does not exist
>>> LINE 1: select lag(sum,1,0) over () from (select sum(generate_series...
>>>   ^
>>> HINT:  No function matches the given name and argument types. You
>>> might need to add explicit type casts.
>>>
>>>
>>> I guess this is because the lag() and lead() functions take any type,
>>> and hence the default must be of the same type.
>>> This had me stumped for a while until I realised that the types
>>> were different.
>>>
>>> Would there be any way to implement an automatic conversion?
>>>
>>> On the off-chance that this is actually a bug, this is on 9.6.3, but
>>> it also occurs on 9.3.17
>>
>> Why not cast the arguments?  The first and the third argument have to
>> be the same, and the second argument is always int.
>>
>> merlin
>
> I know that I can cast. I'm wondering if it would be possible/desirable
> to implement automatic casting. Automatic casting already works for
> functions defined to take bigint when you pass in an integer, but not
> for these functions that take any type.

Right.  If you've got two or more types being passed for 'any', which
one should you get?  It's ambiguous, so the type rules for 'any'-taking
functions are stricter than for regular functions; an explicit cast
such as lag(sum, 1, 0::bigint) resolves it.  Casting behavior is more
complex than it looks on the surface, and changes to make it more
flexible are typically difficult to make work.

merlin




Re: [HACKERS] psql - add special variable to reflect the last query status

2017-06-27 Thread Fabien COELHO


Hello Pavel,


We can introduce a macro SetVariableBool(vars, varname, bool) instead of

 SetVariable(pset.vars, "ERROR", "FALSE");


I checked the source code, and it requires a little harder refactoring,
because we already have SetVariableBool - which is an unhappy name, because
it initializes the variable to the ON value. The question is what a better
name would be.


The boolean values (on/off 1/0 true/false...) accepted for pg settings are 
probably convenient but also somewhat fuzzy.


From a programming point of view, I like booleans to have either true or 
false values, and nothing else.

I agree that the existing "SetVariableBool" function is a misnomer, it 
should be "SetVariableOn" given what it does, and it is not what we need.


Here is a v4 which attempts to extend & reuse the function. People might 
be surprised that TRUE is used where ON was used before, so I'm not sure.



I found a more interesting issue - the code of SetResultVariables is
partially redundant with AcceptResult - maybe the switch there can be
shared.


I agree that there is some common structure, but ISTM that the 
AcceptResult function is called in a variety of situations where variables 
are not to be set (eg "internal" queries, not user provided queries), so I 
thought it best to keep the two apart.
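
As a quick illustration of what the patch is meant to enable (a sketch;
the exact output depends on the final patch, and 22012 is simply the
SQLSTATE for division by zero):

   SELECT 1 AS x;
   \echo error=:ERROR code=:ERROR_CODE rows=:ROW_COUNT
   error=FALSE code=0 rows=1
   SELECT 1/0;
   ERROR:  division by zero
   \echo error=:ERROR code=:ERROR_CODE rows=:ROW_COUNT
   error=TRUE code=22012 rows=0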


--
Fabien.

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 9faa365..bc9a2e4 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -3452,6 +3452,35 @@ bar
   
 
   
+   ERROR
+   
+
+ Whether the last query failed, as a boolean.
+
+   
+  
+
+  
+   ERROR_CODE
+   
+
 The error code associated with the last query, or
 0 if no error occurred.
+
+   
+  
+
+  
+   ERROR_MESSAGE
+   
+
+ If the last query failed, the associated error message,
+ otherwise an empty string.
+
+   
+  
+
+  
 FETCH_COUNT
 
 
@@ -3656,6 +3685,24 @@ bar
   
 
   
+   STATUS
+   
+
+ Text status of the last query.
+
+   
+  
+
+  
+   ROW_COUNT
+   
+
+ How many rows were returned or affected by the last query.
+
+   
+  
+
+  
 QUIET
 
 
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 044cdb8..3abb3e7 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -1213,6 +1213,62 @@ PrintQueryResults(PGresult *results)
 	return success;
 }
 
+/*
+ * Set special variables for "front door" queries
+ * - STATUS: last query status
+ * - ERROR: TRUE/FALSE, whether an error occurred
+ * - ERROR_CODE: code if an error occurred, or "0"
+ * - ERROR_MESSAGE: message if an error occurred, or ""
+ * - ROW_COUNT: how many rows were returned or affected, or "0"
+ */
+static void
+SetResultVariables(PGresult *results)
+{
+	bool			success;
+	ExecStatusType	restat = PQresultStatus(results);
+	char 		   *code = PQresultErrorField(results, PG_DIAG_SQLSTATE);
+	char 		   *mesg = PQresultErrorField(results, PG_DIAG_MESSAGE_PRIMARY);
+
+	SetVariable(pset.vars, "STATUS", PQresStatus(restat) + strlen("PGRES_"));
+	SetVariable(pset.vars, "ERROR_CODE", code ? code : "0");
+	SetVariable(pset.vars, "ERROR_MESSAGE", mesg ? mesg : "");
+
+	/* check all possible PGRES_ */
+	switch (restat)
+	{
+		case PGRES_EMPTY_QUERY:
+		case PGRES_TUPLES_OK:
+		case PGRES_COMMAND_OK:
+		case PGRES_COPY_OUT:
+		case PGRES_COPY_IN:
+		case PGRES_COPY_BOTH:
+		case PGRES_SINGLE_TUPLE:
+			success = true;
+			break;
+		case PGRES_BAD_RESPONSE:
+		case PGRES_NONFATAL_ERROR:
+		case PGRES_FATAL_ERROR:
+			success = false;
+			break;
+		default:
+			/* dead code */
+			success = false;
+			psql_error("unexpected PQresultStatus: %d\n", restat);
+			break;
+	}
+
+	if (success)
+	{
+		char   *ntuples = PQcmdTuples(results);
+		SetVariable(pset.vars, "ROW_COUNT", *ntuples ? ntuples : "0");
+	}
+	else
+	{
+		SetVariable(pset.vars, "ROW_COUNT", "0");
+	}
+
+	SetVariableBool(pset.vars, "ERROR", !success);
+}
 
 /*
  * SendQuery: send the query string to the backend
@@ -1346,6 +1402,9 @@ SendQuery(const char *query)
 			elapsed_msec = INSTR_TIME_GET_MILLISEC(after);
 		}
 
+		/* set special variables to reflect the result status */
+		SetResultVariables(results);
+
 		/* but printing results isn't: */
 		if (OK && results)
 			OK = PrintQueryResults(results);
diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
index 7f76797..4ee2855 100644
--- a/src/bin/psql/startup.c
+++ b/src/bin/psql/startup.c
@@ -163,7 +163,7 @@ main(int argc, char *argv[])
 	SetVariable(pset.vars, "VERSION", PG_VERSION_STR);
 
 	/* Default values for variables (that don't match the result of \unset) */
-	SetVariableBool(pset.vars, "AUTOCOMMIT");
+	SetVariableBool(pset.vars, "AUTOCOMMIT", true);
 	SetVariable(pset.vars, "PROMPT1", DEFAULT_PROMPT1);
 	

Re: [HACKERS] CREATE COLLATION definitional questions for ICU

2017-06-27 Thread Tom Lane
Peter Eisentraut  writes:
> On 6/25/17 11:45, Tom Lane wrote:
>> * Should not the FROM code path copy the old collation's version?
>> It seems a little bit weird that "cloning" a collation takes the
>> liberty of installing a new version.

> I think this is working correctly.  Specifying the version explicitly is
> really only useful for pg_upgrade, because it needs to reproduce the
> state that is on disk.  When you create a new collation, you want to
> work with the collation version that the currently running software
> provides.

Hm.  I certainly buy that for the "long form" of CREATE COLLATION,
but it seems to me that the charter of CREATE COLLATION FROM is to
clone the source collation's properties exactly.  The C.C. ref page
says that in so many words:

   The name of an existing collation to copy.  The new collation
   will have the same properties as the existing one, but it
   will be an independent object.

Moreover, if you insist on defining it this way, it's going to limit
our freedom of action in future.  It's possible that, either in some
future version of ICU or for some other provider, there could be more
than one version of a collation simultaneously available.  In that
scenario, if an existing collation object referred to one of those
available versions, it would be quite surprising for CREATE COLLATION
FROM to silently substitute another one.  But we'd be kind of stuck
with doing that if we allow this precedent to be set in v10.

Lastly, I'm not seeing the use-case for having CREATE COLLATION magically
make a working collation from a broken one.  Wouldn't you normally
expect to need to do ALTER COLLATION REFRESH on an obsolete collation
before doing anything with it?

regards, tom lane




Re: [HACKERS] lag(bigint,int,int), etc?

2017-06-27 Thread Colin 't Hart
On 27 Jun 2017, at 17:06, Merlin Moncure  wrote:
> 
>> On Tue, Jun 27, 2017 at 10:01 AM, Colin 't Hart  wrote:
>> Hi,
>> 
>> The following rather contrived example illustrates that lag(), lead()
>> (and probably other functions) can't automatically cast an integer to
>> a bigint:
>> 
>> select lag(sum,1,0) over () from (select sum(generate_series) over
>> (order by generate_series) from generate_series(1,10)) x;
>> ERROR:  function lag(bigint, integer, integer) does not exist
>> LINE 1: select lag(sum,1,0) over () from (select sum(generate_series...
>>   ^
>> HINT:  No function matches the given name and argument types. You
>> might need to add explicit type casts.
>> 
>> 
>> I guess this is because the lag() and lead() functions take any type,
>> and hence the default must be of the same type.
>> This had me stumped for a while until I realised that the types
>> were different.
>> 
>> Would there be any way to implement an automatic conversion?
>> 
>> On the off-chance that this is actually a bug, this is on 9.6.3, but
>> it also occurs on 9.3.17
> 
> Why not cast the arguments?  The first and the third argument have to
> be the same, and the second argument is always int.
> 
> merlin

I know that I can cast. I'm wondering if it would be possible/desirable to 
implement automatic casting. Automatic casting already works for functions 
defined to take bigint when you pass in an integer. But not for these functions 
that take any type.

/Colin



Re: [HACKERS] lag(bigint,int,int), etc?

2017-06-27 Thread Merlin Moncure
On Tue, Jun 27, 2017 at 10:01 AM, Colin 't Hart  wrote:
> Hi,
>
> The following rather contrived example illustrates that lag(), lead()
> (and probably other functions) can't automatically cast an integer to
> a bigint:
>
> select lag(sum,1,0) over () from (select sum(generate_series) over
> (order by generate_series) from generate_series(1,10)) x;
> ERROR:  function lag(bigint, integer, integer) does not exist
> LINE 1: select lag(sum,1,0) over () from (select sum(generate_series...
>^
> HINT:  No function matches the given name and argument types. You
> might need to add explicit type casts.
>
>
> I guess this is because the lag() and lead() functions take any type,
> and hence the default must be of the same type.
> This had me stumped for a while until I realised that the types
> were different.
>
> Would there be any way to implement an automatic conversion?
>
> On the off-chance that this is actually a bug, this is on 9.6.3, but
> it also occurs on 9.3.17

Why not cast the arguments?  The first and the third argument have to
be the same, and the second argument is always int.
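
For example (a sketch using the query from upthread), casting the default
so that both polymorphic arguments are bigint makes it resolve:

select lag(sum, 1, 0::bigint) over () from (select sum(generate_series) over
(order by generate_series) from generate_series(1,10)) x;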

merlin




[HACKERS] lag(bigint,int,int), etc?

2017-06-27 Thread Colin 't Hart
Hi,

The following rather contrived example illustrates that lag(), lead()
(and probably other functions) can't automatically cast an integer to
a bigint:

select lag(sum,1,0) over () from (select sum(generate_series) over
(order by generate_series) from generate_series(1,10)) x;
ERROR:  function lag(bigint, integer, integer) does not exist
LINE 1: select lag(sum,1,0) over () from (select sum(generate_series...
   ^
HINT:  No function matches the given name and argument types. You
might need to add explicit type casts.


I guess this is because the lag() and lead() functions take any type,
and hence the default must be of the same type.
This had me stumped for a while until I realised that the types
were different.

Would there be any way to implement an automatic conversion?

On the off-chance that this is actually a bug, this is on 9.6.3, but
it also occurs on 9.3.17

Thanks,

Colin




Re: [HACKERS] CREATE COLLATION definitional questions for ICU

2017-06-27 Thread Peter Eisentraut
On 6/25/17 11:45, Tom Lane wrote:
> * Should not the FROM code path copy the old collation's version?
> It seems a little bit weird that "cloning" a collation takes the
> liberty of installing a new version.

I think this is working correctly.  Specifying the version explicitly is
really only useful for pg_upgrade, because it needs to reproduce the
state that is on disk.  When you create a new collation, you want to
work with the collation version that the currently running software
provides.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Pluggable storage

2017-06-27 Thread Alexander Korotkov
On Tue, Jun 27, 2017 at 4:19 PM, Amit Kapila 
wrote:

> On Thu, Jun 22, 2017 at 5:46 PM, Alexander Korotkov
>  wrote:
> > On Tue, Jun 13, 2017 at 4:50 AM, Haribabu Kommi <
> kommi.harib...@gmail.com>
> > wrote:
> >>
> >> On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera <
> alvhe...@2ndquadrant.com>
> >> wrote:
> >>>
> >>> I have sent the partial patch I have to Hari Babu Kommi.  We expect
> that
> >>> he will be able to further this goal some more.
> >>
> >>
> >> Thanks Alvaro for sharing your development patch.
> >>
> >> Most of the patch design is same as described by Alvaro in the first
> mail
> >> [1].
> >> I will detail the modifications, pending items and open items (needs
> >> discussion)
> >> to implement proper pluggable storage.
> >>
> >> Here I attached WIP patches to support pluggable storage. The patches in
> >> the series may not work individually; many things are still under
> >> development. These patches are just to share the approach of the current
> >> development.
> >>
> >> Some notable changes that I did to make the patch work:
> >>
> >> 1. Added a storageam handler to the slot, because the relation is not
> >> readily available in all the places that need it.
> >> 2. Retained the minimal Tuple in the slot, as this is used in HASH join.
> >> As per the first version, I feel it is fine to allow creating HeapTuple
> >> format data.
> >>
> >> Thanks everyone for sharing their ideas in the developer's unconference
> at
> >> PGCon Ottawa.
> >>
> >> Pending items:
> >>
> >> 1. Replacement of Tuple with slot in Trigger functionality
> >> 2. Replacement of Tuple with Slot from storage handler functions.
> >> 3. Remove/minimize the use of HeapTuple as a Datum.
> >> 4. Replace all references of HeapScanDesc with StorageScanDesc
> >> 5. Planner changes to consider the relation storage during the planning.
> >> 6. Any planner changes based on the discussion of open items?
> >> 7. some Executor changes to consider the storage advantages?
> >>
> >> Open Items:
> >>
> >> 1. The BitmapHeapScan and TableSampleScan are tightly coupled with
> >> HeapTuple and HeapScanDesc, So these scans are directly operating
> >> on those structures and providing the result.
> >
> >
> > What about vacuum?  I see vacuum is untouched in the patchset and it is
> not
> > mentioned in this discussion.
> > Do you plan to override low-level function like heap_page_prune(),
> > lazy_vacuum_page() etc., but preserve high-level logic of vacuum?
> > Or do you plan to let pluggable storage implement its own high-level
> vacuum
> > algorithm?
> >
>
> Probably, some other algorithm for vacuum.  I am not sure the current
> vacuum with its parameters can be used so easily.  One thing that
> might need some thought is whether it is sufficient to keep autovacuum
> off and expose some different API for the places where vacuum can be
> invoked manually (like the VACUUM command) to a developer implementing
> a different vacuum strategy, or whether we need something more as well.


What kind of other vacuum algorithm do you expect?  It would be much
easier to understand if we had examples.

For me, changing the vacuum algorithm is not needed for just a heap page
format change.  The existing vacuum algorithm could just call page format
API functions for manipulating individual pages.

Changing the vacuum algorithm might be needed for changes more invasive
than just the heap page format.  However, we should first understand what
these changes could be and how they are consistent with the rest of the
API design.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Pluggable storage

2017-06-27 Thread Alexander Korotkov
On Tue, Jun 27, 2017 at 4:08 PM, Amit Kapila 
wrote:

> On Thu, Jun 22, 2017 at 8:00 PM, Alexander Korotkov
>  wrote:
> > On Wed, Jun 21, 2017 at 10:47 PM, Robert Haas 
> wrote:
> >>
> >> On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi
> >>  wrote:
> >> > Open Items:
> >> >
> >> > 1. The BitmapHeapScan and TableSampleScan are tightly coupled with
> >> > HeapTuple and HeapScanDesc, So these scans are directly operating
> >> > on those structures and providing the result.
> >> >
> >> > These scan types may not be applicable to different storage formats.
> >> > So how to handle them?
> >>
> >> I think that BitmapHeapScan, at least, is applicable to any table AM
> >> that has TIDs.   It seems to me that in general we can imagine three
> >> kinds of table AMs:
> >>
> >> 1. Table AMs where a tuple can be efficiently located by a real TID.
> >> By a real TID, I mean that the block number part is really a block
> >> number and the item ID is really a location within the block.  These
> >> are necessarily quite similar to our current heap, but they can change
> >> the tuple format and page format to some degree, and it seems like in
> >> many cases it should be possible to plug them into our existing index
> >> AMs without too much heartache.  Both index scans and bitmap index
> >> scans ought to work.
> >
> >
> > If #1 is only about changing tuple and page formats, then it could be much
> > simpler than the patch upthread?  We can implement "page format access
> > methods" with routines for insertion, update, pruning and deletion of
> > tuples *in a particular page*.  There is no need to redefine high-level
> > logic for scanning the heap, doing updates and so on...
>
> If you are changing tuple format then you do need to worry about
> places wherever we are using HeapTuple like TupleTableSlots,
> Visibility routines, etc. (just think if somebody changes tuple
> header, then all such places are susceptible to change).


Agree.  I think that we can consider a pluggable tuple format as an
independent feature which is desirable to have before pluggable storages.
For example, I believe some FDWs could already benefit from a pluggable
tuple format.


> Similarly,
> if the page format is changed you need to consider all page scan API's
> like heapgettup_pagemode/heapgetpage.


If the page format is changed, then heapgettup_pagemode/heapgetpage should
use appropriate API functions for manipulating page items.  I'm very afraid
of overriding the whole of heapgettup_pagemode/heapgetpage and monstrous
functions like heap_update without a clear understanding of the use case.
It's definitely not needed for changing the heap page format.
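
To make that concrete, here is a hypothetical sketch (names invented here,
not taken from the patchset) of what such per-page routines could look
like, letting the existing high-level scan/vacuum logic stay intact:

/*
 * Hypothetical "page format access method" -- generic code such as
 * heapgettup_pagemode() or vacuum would call through this table instead
 * of manipulating line pointers and tuple headers directly.
 */
typedef struct PageFormatRoutine
{
	/* fetch the item at offnum into a common (slot-like) representation */
	bool		(*page_get_item) (Page page, OffsetNumber offnum,
								  TupleTableSlot *slot);
	/* add a tuple to this page, or return InvalidOffsetNumber if full */
	OffsetNumber (*page_add_item) (Page page, TupleTableSlot *slot);
	/* prune dead items and defragment free space within this page only */
	void		(*page_prune) (Page page, TransactionId oldest_xmin);
} PageFormatRoutine;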

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] pgjdbc logical replication client throwing exception

2017-06-27 Thread sanyam jain
Hi,

I suspect it's happening because of a delay in the status updates sent by the 
client, but even after trying forceUpdateStatus it's quitting prematurely.


Thanks,

Sanyam Jain


From: sanyam jain 
Sent: Tuesday, June 27, 2017 6:47:55 AM
To: Peter Eisentraut; Pg Hackers
Subject: Re: [HACKERS] pgjdbc logical replication client throwing exception


Hi,

>What does the server log say?  If nothing interesting, turn up debugging.

I receive the following log on the server:
LOG:  could not send data to client: Broken pipe

Thanks,
Sanyam Jain


Re: [HACKERS] Same expression more than once in partition key

2017-06-27 Thread Robert Haas
On Fri, Jun 23, 2017 at 4:04 PM, Tom Lane  wrote:
> Peter Eisentraut  writes:
>> We also allow the same column more than once in an index.  We probably
>> don't have to be more strict here.
>
> There actually are valid uses for the same column more than once in
> an index, eg if you use a different operator class for each instance.
> I find it hard to envision a similar use-case in partitioning though.

Maybe you already realize this, but partitioning, like index creation,
allows an opclass to be specified:

rhaas=# create table foo (a text, b text) partition by range (a
text_ops, b text_pattern_ops);
CREATE TABLE

I don't really see anybody wanting to do that, but I don't really see
anyone wanting to do it with an index either.

My inclination is to reject this patch as not solving any actual problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Pluggable storage

2017-06-27 Thread Amit Kapila
On Thu, Jun 22, 2017 at 5:46 PM, Alexander Korotkov
 wrote:
> On Tue, Jun 13, 2017 at 4:50 AM, Haribabu Kommi 
> wrote:
>>
>> On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera 
>> wrote:
>>>
>>> I have sent the partial patch I have to Hari Babu Kommi.  We expect that
>>> he will be able to further this goal some more.
>>
>>
>> Thanks Alvaro for sharing your development patch.
>>
>> Most of the patch design is same as described by Alvaro in the first mail
>> [1].
>> I will detail the modifications, pending items and open items (needs
>> discussion)
>> to implement proper pluggable storage.
>>
>> Here I attached WIP patches to support pluggable storage. The patches in the
>> series may not work individually; many things are still under development.
>> These patches are just to share the approach of the current development.
>>
>> Some notable changes that I did to make the patch work:
>>
>> 1. Added a storageam handler to the slot, because the relation is not
>> readily available in all the places that need it.
>> 2. Retained the minimal Tuple in the slot, as this is used in HASH join.
>> As per the first version, I feel it is fine to allow creating HeapTuple
>> format data.
>>
>> Thanks everyone for sharing their ideas in the developer's unconference at
>> PGCon Ottawa.
>>
>> Pending items:
>>
>> 1. Replacement of Tuple with slot in Trigger functionality
>> 2. Replacement of Tuple with Slot from storage handler functions.
>> 3. Remove/minimize the use of HeapTuple as a Datum.
>> 4. Replace all references of HeapScanDesc with StorageScanDesc
>> 5. Planner changes to consider the relation storage during the planning.
>> 6. Any planner changes based on the discussion of open items?
>> 7. some Executor changes to consider the storage advantages?
>>
>> Open Items:
>>
>> 1. The BitmapHeapScan and TableSampleScan are tightly coupled with
>> HeapTuple and HeapScanDesc, So these scans are directly operating
>> on those structures and providing the result.
>
>
> What about vacuum?  I see vacuum is untouched in the patchset and it is not
> mentioned in this discussion.
> Do you plan to override low-level function like heap_page_prune(),
> lazy_vacuum_page() etc., but preserve high-level logic of vacuum?
> Or do you plan to let pluggable storage implement its own high-level vacuum
> algorithm?
>

Probably, some other algorithm for vacuum.  I am not sure the current
vacuum with its parameters can be used so easily.  One thing that might
need some thought is whether it is sufficient to keep autovacuum off and
expose some different API for the places where vacuum can be invoked
manually (like the VACUUM command) to a developer implementing a
different vacuum strategy, or whether we need something more as well.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Pluggable storage

2017-06-27 Thread Amit Kapila
On Thu, Jun 22, 2017 at 8:00 PM, Alexander Korotkov
 wrote:
> On Wed, Jun 21, 2017 at 10:47 PM, Robert Haas  wrote:
>>
>> On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi
>>  wrote:
>> > Open Items:
>> >
>> > 1. The BitmapHeapScan and TableSampleScan are tightly coupled with
>> > HeapTuple and HeapScanDesc, So these scans are directly operating
>> > on those structures and providing the result.
>> >
>> > These scan types may not be applicable to different storage formats.
>> > So how to handle them?
>>
>> I think that BitmapHeapScan, at least, is applicable to any table AM
>> that has TIDs.   It seems to me that in general we can imagine three
>> kinds of table AMs:
>>
>> 1. Table AMs where a tuple can be efficiently located by a real TID.
>> By a real TID, I mean that the block number part is really a block
>> number and the item ID is really a location within the block.  These
>> are necessarily quite similar to our current heap, but they can change
>> the tuple format and page format to some degree, and it seems like in
>> many cases it should be possible to plug them into our existing index
>> AMs without too much heartache.  Both index scans and bitmap index
>> scans ought to work.
>
>
> If #1 is only about changing tuple and page formats, then it could be much
> simpler than the patch upthread?  We can implement "page format access
> methods" with routines for insertion, update, pruning and deletion of tuples
> *in a particular page*.  There is no need to redefine high-level logic for
> scanning the heap, doing updates and so on...
>

If you are changing tuple format then you do need to worry about
places wherever we are using HeapTuple like TupleTableSlots,
Visibility routines, etc. (just think if somebody changes tuple
header, then all such places are susceptible to change).  Similarly,
if the page format is changed you need to consider all page scan API's
like heapgettup_pagemode/heapgetpage.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] pg_basebackup fails on Windows when using tablespace mapping

2017-06-27 Thread Michael Paquier
On Tue, Jun 27, 2017 at 7:46 PM, Ashutosh Sharma  wrote:
> I am still seeing the issue with the attached patch. I had a quick
> look into the patch. It seems to me like you have canonicalized the
> tablespace path to convert win32 backslashes to Unix-style slashes, but
> that value is not being passed to the strcmp() function, and that is
> probably why the issue still exists. Thanks.
>
> for (cell = tablespace_dirs.head; cell; cell = cell->next)
> -   if (strcmp(dir, cell->old_dir) == 0)
> +   if (strcmp(canon_dir, cell->old_dir) == 0)

Thanks. I had the correct version on my Windows box actually, just
messed up the attachment.
-- 
Michael
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 3ad06995ec..f7c1cc826a 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -295,6 +295,11 @@ tablespace_list_append(const char *arg)
 		exit(1);
 	}
 
+	/*
+	 * Comparisons done with those values should involve similarly
+	 * canonicalized path values. This is particularly sensitive on
+	 * Windows where path values may not necessarily use Unix slashes.
+	 */
 	canonicalize_path(cell->old_dir);
 	canonicalize_path(cell->new_dir);
 
@@ -1279,9 +1284,14 @@ static const char *
 get_tablespace_mapping(const char *dir)
 {
 	TablespaceListCell *cell;
+	char		canon_dir[MAXPGPATH];
+
+	/* Canonicalize path for comparison consistency */
+	strcpy(canon_dir, dir);
+	canonicalize_path(canon_dir);
 
 	for (cell = tablespace_dirs.head; cell; cell = cell->next)
-		if (strcmp(dir, cell->old_dir) == 0)
+		if (strcmp(canon_dir, cell->old_dir) == 0)
 			return cell->new_dir;
 
 	return dir;



Re: [HACKERS] pg_terminate_backend can terminate background workers and autovacuum launchers

2017-06-27 Thread Yugo Nagata
On Fri, 23 Jun 2017 19:43:35 -0400
Stephen Frost  wrote:

> Alvaro, all,
> 
> * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
> > Yugo Nagata wrote:
> > 
> > > I tried to make it. Patch attached. 
> > > 
> > > It is easy and simple. Although I haven't tried for autovacuum worker,
> > > I confirmed I can change other process' parameters without affecting 
> > > others.
> > > Do you want this in PG?
> > 
> > One advantage of this gadget is that you can signal backends that you
> > own without being superuser, so +1 for the general idea.  (Please do
> > create a fresh thread so that this can be appropriately linked to in
> > commitfest app, though.)
> 
> Well, that wouldn't quite work with the proposed patch because the
> proposed patch REVOKE's execute from public...

Yes. It is intentional to revoke execute from public, because we need to
change the configuration before signaling another backend, and that
requires superuser privilege.
 
> I'm trying to figure out how it's actually useful to be able to signal a
> backend that you own about a config change that you *aren't* able to
> make without being a superuser..  Now, if you could tell the other
> backend to use a new value for some USERSET GUC, then *that* would be
> useful and interesting.
> 
> I haven't got any particularly great ideas for how to make that happen
> though.
> 
> > You need a much bigger patch for the autovacuum worker.  The use case I
> > had in mind is that you have a maintenance window and can run fast
> > vacuuming during it, but if the table is large and vacuum can't finish
> > within that window then you want vacuum to slow down, without stopping
> > it completely.  But implementing this requires juggling the
> > stdVacuumCostDelay/Limit values during the reload, which are currently
> > read at start of vacuuming only; the working values are overwritten from
> > those during a rebalance.
> 
> Being able to change those values during a vacuuming run would certainly
> be useful, I've had cases where I would have liked to have been able to
> do just exactly that.  That's largely independent of this though.
> 
> Thanks!
> 
> Stephen


-- 
Yugo Nagata 




Re: [HACKERS] pg_terminate_backend can terminate background workers and autovacuum launchers

2017-06-27 Thread Yugo Nagata
On Sat, 24 Jun 2017 08:09:52 +0900
Michael Paquier  wrote:

> On Sat, Jun 24, 2017 at 5:07 AM, Alvaro Herrera
>  wrote:
> > Yugo Nagata wrote:
> >
> >> I tried to make it. Patch attached.
> >>
> >> It is easy and simple. Although I haven't tried for autovacuum worker,
> >> I confirmed I can change other process' parameters without affecting 
> >> others.
> >> Do you want this in PG?
> 
> Just browsing the patch...
> 
> +if (r == SIGNAL_BACKEND_NOSUPERUSER)
> +ereport(WARNING,
> +(errmsg("must be a superuser to terminate superuser
> process")));
> +
> +if (r == SIGNAL_BACKEND_NOPERMISSION)
> +ereport(WARNING,
> + (errmsg("must be a member of the role whose process
> is being terminated or member of pg_signal_backend")));
> Both messages are incorrect. This is not trying to terminate a process.

I'll fix this.

> 
> +Datum
> +pg_reload_conf_pid(PG_FUNCTION_ARGS)
> I think that the naming is closer to pg_reload_backend.
> 
> Documentation is needed as well.

Agree with pg_reload_backend, and I'll write the documentation.

> 
> > One advantage of this gadget is that you can signal backends that you
> > own without being superuser, so +1 for the general idea.  (Please do
> > create a fresh thread so that this can be appropriately linked to in
> > commitfest app, though.)
> 
> That would be nice.

Sure. I'll create a new thread and attach the new patch to it.

> 
> > You need a much bigger patch for the autovacuum worker.  The use case I
> > had in mind is that you have a maintenance window and can run fast
> > vacuuming during it, but if the table is large and vacuum can't finish
> > within that window then you want vacuum to slow down, without stopping
> > it completely.  But implementing this requires juggling the
> > stdVacuumCostDelay/Limit values during the reload, which are currently
> > read at start of vacuuming only; the working values are overwritten from
> > those during a rebalance.
> 
> Yeah, that's independent from the patch above.

Agree. It would be a separate feature from pg_reload_backend, and
I think it would be implemented as another patch.



> -- 
> Michael
> 
> 


-- 
Yugo Nagata 




Re: [HACKERS] pg_basebackup fails on Windows when using tablespace mapping

2017-06-27 Thread Ashutosh Sharma
Hi,

On Tue, Jun 27, 2017 at 10:13 AM, Michael Paquier
 wrote:
> On Tue, Jun 27, 2017 at 12:13 PM, Michael Paquier
>  wrote:
>> At quick glance, I think that this should definitely be a client-only
>> fix. One reason is that pg_basebackup should be able to work with past
>> servers. A second is that this would impact the backend code as well, and
>> readlink may not return an absolute path. At least that's true for the
>> POSIX version; Postgres uses its own wrapper with junction points...
>
> The problem is in pg_basebackup.c's get_tablespace_mapping(), which
> fails to provide a correct comparison for the directory given by
> caller. In your case the caller of get_tablespace_mapping() uses
> backslashes as directory value (path value received from backend), and
> the tablespace mapping uses slashes as canonicalize_path has been
> called once on it. Because of this inconsistency, the directory of the
> original tablespace is used, causing the backup to fail as it should.
> A simple fix is to make sure that the comparison between both things
> is consistent by using canonicalize_path on the directory value given
> by caller.
>
> Attached is a patch. I am parking that in the next CF so as this does
> not get forgotten as a bug fix. Perhaps a committer will show up
> before. Or not.

I am still seeing the issue with the attached patch. I had a quick
look into the patch. It seems to me like you have canonicalized the
tablespace path to convert win32 backslashes to Unix-style slashes, but
that value is not being passed to the strcmp() function, and that is
probably why the issue still exists. Thanks.

for (cell = tablespace_dirs.head; cell; cell = cell->next)
-   if (strcmp(dir, cell->old_dir) == 0)
+   if (strcmp(canon_dir, cell->old_dir) == 0)
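
To make the mismatch concrete (paths invented for illustration): on Windows
the backend can report a tablespace location like "C:\tblspc", while the -T
mapping has already been canonicalized to "C:/tblspc", so the comparison in
get_tablespace_mapping() never matches:

/* sketch of the failing comparison before the fix */
strcmp("C:\\tblspc", "C:/tblspc");   /* != 0, so the mapping is skipped */

Canonicalizing the caller-supplied directory first makes both sides use
Unix-style slashes, which is what the patch does.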

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




[HACKERS] GSoC 2017: weekly progress reports (week 4) and patch for hash index

2017-06-27 Thread Shubham Barai
Project: Explicitly support predicate locks in index AMs besides b-tree

Hi,

During this week, I continued my work on predicate locking in the hash
index and created a patch for it. As I have written in my proposal for the
hash index, every scan operation acquires a predicate lock on the primary
page of the bucket.
And hash indexes are used for equality comparison. So, this can still
generate false positives when a scan operation and an insert operation are
trying to access the same bucket but are looking for different tuples.
Let's see that with an example.

setup:

create table hash_tbl(id int4, p integer);

create index hash_pointidx on hash_tbl using hash(p);

insert into hash_tbl (id, p)
select g, g*2 from generate_series(1, 1000) g;

read operation:
select * from hash_tbl where p=78988658;

insert operation:
insert into hash_tbl values(, 546789888);

If these two hash keys (78988658 and 546789888) map to the same bucket,
this will result in a false serialization failure.
Please correct me if this assumption about false positives is wrong.
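
To spell out the interleaving in question (a sketch; an actual abort
additionally requires PostgreSQL's usual dangerous-structure conditions
to be met):

-- S1: BEGIN ISOLATION LEVEL SERIALIZABLE;
-- S1: SELECT * FROM hash_tbl WHERE p = 78988658;      -- predicate-locks the bucket's primary page
-- S2: BEGIN ISOLATION LEVEL SERIALIZABLE;
-- S2: INSERT INTO hash_tbl VALUES (1001, 546789888);  -- writes into the same bucket
-- an rw-conflict is recorded even though the keys differ: the false positive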


In summary, I have done the following things this week.

1)  found appropriate places in the hash index AM to insert calls to
existing functions (PredicateLockPage(),  CheckForSerializableConflictIn()
...etc)

2) created tests to check serialization failures and false positives

3) tested my patch on the current head



Regards,

Shubham







Predicate-Locking-in-hash-index_3.patch
Description: Binary data



Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-06-27 Thread Masahiko Sawada
On Mon, Jun 26, 2017 at 12:12 PM, Masahiko Sawada  wrote:
> On Sun, Jun 25, 2017 at 7:35 PM, Petr Jelinek
>  wrote:
>> (was away for a while, got some time now for this again)
>>
>> On 22/06/17 04:43, Peter Eisentraut wrote:
>>> The alternative is that we use the LockSharedObject() approach that was
>>> already alluded to, like in the attached patch.  Both approaches would
>>> work equally fine AFAICT.
>>
>> I agree, but I think we need bigger overhaul of the locking/management
>> in general. So here is patch which does much more changes.
>>
>> The patch does several important things (in no particular order):
>> - Split SetSubscriptionRelState into AddSubscriptionRelState and
>> UpdateSubscriptionRelState for the reasons said upstream, it's cleaner,
>> there is no half-broken upsert logic and it has proper error checking
>> for each action.
>>
>> - Do LockSharedObject in ALTER SUBSCRIPTION, DROP SUBSCRIPTION (this one
>> is preexisting but mentioning it for context), SetSubscriptionRelState,
>> AddSubscriptionRelState, and in the logicalrep_worker_launch. This means
>> we use granular per object locks to deal with concurrency.
>>
>> - Because of above, the AccessExclusiveLock on pg_subscription is no
>> longer needed, just normal RowExlusiveLock is used now.
>>
>> - logicalrep_worker_stop is also simplified due to the proper locking
>>
>> - There is new interface logicalrep_worker_stop_at_commit which is used
>> by ALTER SUBSCRIPTION ... REFRESH PUBLICATION and by transactional
>> variant of DROP SUBSCRIPTION to only kill workers at the end of transaction.
>>
>> - Locking/reading of subscription info is unified between DROP and ALTER
>> SUBSCRIPTION commands.
>>
>> - DROP SUBSCRIPTION will kill all workers associated with subscription,
>> not just apply.
>>
>> - The sync worker checks during startup if the relation is still subscribed.
>>
>> - The sync worker will exit when waiting for apply and apply has shut-down.
>>
>> - The log messages around workers and removed or disabled subscription
>> are now more consistent between startup and normal runtime of the worker.
>>
>> - Some code deduplication and stylistic changes/simplification in
>> related areas.
>>
>> - Fixed catcache's IndexScanOK() handling of the subscription catalog.
>>
>> It's bit bigger patch but solves issues from multiple threads around
>> handling of ALTER/DROP subscription.
>>
>> A lot of the locking that I added is normally done transparently by
>> dependency handling, but subscriptions and subscription relation status
>> do not use that much as it was deemed to bloat pg_depend needlessly
>> during the original patch review (it's also probably why this has
>> slipped through).
>>
>

I've reviewed this patch briefly.

@@ -515,6 +533,31 @@ logicalrep_worker_stop(Oid subid, Oid relid)
 }

 /*
+ * Request worker to be stopped on commit.
+ */
+void
+logicalrep_worker_stop_at_commit(Oid subid, Oid relid)
+{
+   LogicalRepWorkerId *wid;
+   MemoryContext old;
+
+   old = MemoryContextSwitchTo(TopTransactionContext);
+
+   wid = (LogicalRepWorkerId *) palloc(sizeof(LogicalRepWorkerId));
+
+   /*
+   wid = MemoryContextAlloc(TopTransactionContext,
+
sizeof(LogicalRepWorkerId));
+   */
+   wid->subid = subid;
+   wid->relid = relid;
+
+   on_commit_stop_workers = lappend(on_commit_stop_workers, wid);
+
+   MemoryContextSwitchTo(old);
+}

logicalrep_worker_stop_at_commit() has a problem that new_list()
called by lappend() allocates the memory from current memory context.
It should switch the memory context and then allocate the memory for
wid and append it to the list.


@@ -754,9 +773,25 @@ ApplyLauncherShmemInit(void)
 void
 AtEOXact_ApplyLauncher(bool isCommit)
 {
-   if (isCommit && on_commit_launcher_wakeup)
-   ApplyLauncherWakeup();
+   ListCell *lc;
+
+   if (isCommit)
+   {
+   foreach (lc, on_commit_stop_workers)
+   {
+   LogicalRepWorkerId *wid = lfirst(lc);
+   logicalrep_worker_stop(wid->subid, wid->relid);
+   }
+
+   if (on_commit_launcher_wakeup)
+   ApplyLauncherWakeup();

Stopping the logical rep worker in AtEOXact_ApplyLauncher doesn't
support prepared transactions. Since we allocate the
on_commit_stop_workers list in TopTransactionContext, postgres crashes
if we execute any query after a prepared transaction that removes a
subscription relation state. Also, even after fixing this issue, we
still need to do something: the on_commit_stop_workers list is not
associated with the prepared transaction. A query following "prepare
transaction" tries to stop the workers at its own commit. There was a
similar discussion before.


+		ensure_transaction();
+		UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
+								   rstate->relid, rstate->state,
+								   rstate->lsn);


Re: [HACKERS] Setting pd_lower in GIN metapage

2017-06-27 Thread Amit Langote
On 2017/06/27 10:22, Michael Paquier wrote:
> On Mon, Jun 26, 2017 at 4:11 PM, Masahiko Sawada  
> wrote:
>> Thank you for the patches! I checked additional patches for brin and
>> spgist. They look good to me.
> 
> Last versions are still missing something: brin_mask() and spg_mask()
> can be updated so as mask_unused_space() is called for meta pages.
> Except that the patches look to be on the right track.

Thanks for the review.

I updated brin_mask() and spg_mask() in the attached updated patches so
that they consider meta pages as containing unused space.

> By the way, as this is an optimization and not an actual bug, could
> you add this patch to the next commit fest? I don't think that this
> should get into 10. The focus lately is on stabilizing things.

Sure, done.

Thanks,
Amit
From b05760739a36f3247019e75a9029ece7ab0bb09c Mon Sep 17 00:00:00 2001
From: amit 
Date: Fri, 23 Jun 2017 11:20:41 +0900
Subject: [PATCH 1/3] Set pd_lower correctly in the GIN metapage.

---
 src/backend/access/gin/ginutil.c |  7 +++
 src/backend/access/gin/ginxlog.c | 23 ---
 2 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 91e4a8cf70..52de599b29 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -372,6 +372,13 @@ GinInitMetabuffer(Buffer b)
metadata->nDataPages = 0;
metadata->nEntries = 0;
metadata->ginVersion = GIN_CURRENT_VERSION;
+
+	/*
+	 * Set pd_lower just past the end of the metadata.  This is not essential
+	 * but it makes the page look compressible to xlog.c.
+	 */
+	((PageHeader) page)->pd_lower =
+		((char *) metadata + sizeof(GinMetaPageData)) - (char *) page;
 }
 
 /*
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index 7ba04e324f..6d2e528852 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -514,7 +514,7 @@ ginRedoUpdateMetapage(XLogReaderState *record)
Assert(BufferGetBlockNumber(metabuffer) == GIN_METAPAGE_BLKNO);
metapage = BufferGetPage(metabuffer);
 
-   GinInitPage(metapage, GIN_META, BufferGetPageSize(metabuffer));
+   GinInitMetabuffer(metabuffer);
	memcpy(GinPageGetMeta(metapage), &data->metadata, sizeof(GinMetaPageData));
PageSetLSN(metapage, lsn);
MarkBufferDirty(metabuffer);
@@ -656,7 +656,7 @@ ginRedoDeleteListPages(XLogReaderState *record)
Assert(BufferGetBlockNumber(metabuffer) == GIN_METAPAGE_BLKNO);
metapage = BufferGetPage(metabuffer);
 
-   GinInitPage(metapage, GIN_META, BufferGetPageSize(metabuffer));
+   GinInitMetabuffer(metabuffer);
 
	memcpy(GinPageGetMeta(metapage), &data->metadata, sizeof(GinMetaPageData));
PageSetLSN(metapage, lsn);
@@ -776,18 +776,11 @@ gin_mask(char *pagedata, BlockNumber blkno)
mask_page_hint_bits(page);
 
 	/*
-	 * GIN metapage doesn't use pd_lower/pd_upper. Other page types do. Hence,
-	 * we need to apply masking for those pages.
+	 * For GIN_DELETED page, the page is initialized to empty. Hence, mask
+	 * the page content.
 	 */
-	if (opaque->flags != GIN_META)
-	{
-		/*
-		 * For GIN_DELETED page, the page is initialized to empty. Hence, mask
-		 * the page content.
-		 */
-		if (opaque->flags & GIN_DELETED)
-			mask_page_content(page);
-		else
-			mask_unused_space(page);
-	}
+	if (opaque->flags & GIN_DELETED)
+		mask_page_content(page);
+	else
+		mask_unused_space(page);
 }
-- 
2.11.0

From 3f197e13dcef968e8f0a5ff17cf0c328458cde34 Mon Sep 17 00:00:00 2001
From: amit 
Date: Mon, 26 Jun 2017 15:13:32 +0900
Subject: [PATCH 2/3] Set pd_lower correctly in the BRIN index metapage

---
 src/backend/access/brin/brin_pageops.c | 7 +++
 src/backend/access/brin/brin_xlog.c| 2 +-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 80f803e438..8762356433 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -491,6 +491,13 @@ brin_metapage_init(Page page, BlockNumber pagesPerRange, uint16 version)
 * revmap page to be created when the index is.
 */
metadata->lastRevmapPage = 0;
+
+   /*
+	/*
+	 * Set pd_lower just past the end of the metadata.  This is to log a full
+	 * page image of the metapage in xloginsert.c.
+	 */
+	((PageHeader) page)->pd_lower =
+		((char *) metadata + sizeof(BrinMetaPageData)) - (char *) page;
 }
 
 /*
diff --git a/src/backend/access/brin/brin_xlog.c 

Re: [HACKERS] Race between SELECT and ALTER TABLE NO INHERIT

2017-06-27 Thread Amit Langote
On 2017/06/26 18:44, Kyotaro HORIGUCHI wrote:
> At Mon, 26 Jun 2017 18:16:42 +0900, Amit Langote wrote:
>>
>> I recall I had proposed a fix for the same thing some time ago [1].
> 
Wow. About 1.5 years ago, and stuck? I have a concrete case where
this causes harm, so I'd like to fix it this time. How can we move on?

Agreed that this should be fixed.

Your proposed approach #1 to recheck the inheritance after obtaining the
lock on the child table seems to be a good way forward.

Approach #2 of reordering locking is a simpler solution, but does not
completely prevent the problem, because DROP TABLE child can still cause
it to occur, as you mentioned.
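
To restate approach #1 as a sketch (the helper name here is hypothetical;
the actual patch walks pg_inherits itself, per is_descendant_of_internal()
discussed below):

/* after acquiring the lock on the child, re-verify the relationship */
childrel = heap_open(childOID, lockmode);
if (!is_descendant_of(parentOID, childOID))
{
	/* concurrently detached by ALTER TABLE ... NO INHERIT: skip it */
	heap_close(childrel, lockmode);
	continue;
}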

>>> The cause is that NO INHERIT doesn't take an exlusive lock on the
>>> parent. This allows expand_inherited_rtentry to add the child
>>> relation into appendrel after removal from the inheritance but
>>> still exists.
>>
>> Right.
>>
>>> I see two ways to fix this.
>>>
>>> The first patch adds a recheck of inheritance relationship if the
>>> corresponding attribute is missing in the child in
>>> make_inh_translation_list(). The recheck is a bit complex, but it
>>> is not performed unless the sequence above happens. It checks for
>>> duplication of relid (or cycles in the inheritance) following
>>> find_all_inheritors (but doing it a bit differently), but I'm not sure
>>> that part is really useful.
>>
>> I had proposed adding a syscache on pg_inherits(inhrelid, inhparent), but
>> I guess your hash table based solution will do the job as far as
>> performance of this check is concerned, although I haven't checked the
>> code closely.
> 
> The hash table is not crucial in the patch. Substantially, it just
> recurses down looking up pg_inherits for the child. I considered
> the new index but abandoned it because I thought that such a case
> won't occur frequently.

Agreed.  BTW, the hash table in patch #1 does not seem to be really
helpful.  In the while loop in is_descendant_of_internal(), does
hash_search() ever return found = true?  AFAICS, it does not.

>>> The second patch lets ALTER TABLE NO INHERIT to acquire locks on
>>> the parent first.
>>>
>>> Since the latter has a larger impact on the current behavior and
>>> we already treat "DROP TABLE child" case in the similar way, I
>>> suppose that the first approach would be preferable.
>>
>> That makes sense.
>>
>> BTW, in the partitioned table case, the parent is always locked first
>> using an AccessExclusiveLock.  There are other considerations in that case
>> such as needing to recreate the partition descriptor upon termination of
>> inheritance (both the DETACH PARTITION and also DROP TABLE child cases).
> 
> Apart from the degree of concurrency, if we keep parent->children
> order of locking, such recreation does not seem to be
> needed. Maybe I'm missing something.

Sorry to have introduced that topic in this thread, but I will try to
explain anyway why things are the way they are currently:

Once a table is no longer a partition of the parent (detached or dropped),
we must make sure that the next commands in the transaction don't see it
as one.  That information is currently present in the relcache
(rd_partdesc), which is used by a few callers, most notably the
tuple-routing code.  Next commands must recreate the entry so that the
correct thing happens based on the updated information.

More precisely, we must invalidate the current entry.
RelationClearRelation() will either delete the entry or rebuild it.  If
it's being referenced somewhere, it will be rebuilt.  The place holding the
reference may also be looking at the content of rd_partdesc, which we don't
require them to make a copy of, so we must preserve its content while other
fields of RelationData are being read anew from the catalog.  We don't have
to preserve it if there has been any change (partition added/dropped), but
to make such a change one would need to take a strong enough lock on the
relation (parent).

We assume here that anyone who wants to reference rd_partdesc takes at
least AccessShareLock lock on the relation, and anyone who wants to change
its content must take a lock that will conflict with it, so
AccessExclusiveLock.  Note that in all of this, we are only talking about
one relation, that is the parent, so parent -> child ordering of taking
locks may be irrelevant.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql - add special variable to reflect the last query status

2017-06-27 Thread Pavel Stehule
Hi

2017-06-19 5:55 GMT+02:00 Pavel Stehule :

>
>
> 2017-06-17 7:58 GMT+02:00 Fabien COELHO :
>
>>
>>> I don't have any other comments. The implementation is trivial. I reran
>>> all tests and the tests passed.
>>>
>>> I'll mark this patch as ready for commiters.
>>>
>>
>> Oops, I just noticed a stupid confusion on my part which got through, I
>> was setting "ERROR" as "success", inverting the expected boolean value.
>>
>> Here is a fixed version.
>
>
> I missed it too.
>
> We can introduce macro SetVariableBool(vars, varname, bool) instead
>
>  SetVariable(pset.vars, "ERROR", "FALSE");
>

I checked the source code, and it requires a little harder refactoring,
because we already have SetVariableBool - which is an unhappy name, because
it initializes the variable to the ON value. The question is what a better
name would be.

I found a more interesting issue - the code of SetResultVariables is
partially redundant with AcceptResult - maybe the switch there can be
shared.

Regards

Pavel

>
> Regards
>
> Pavel
>
>>
>>
>> --
>> Fabien.
>
>
>


Re: [HACKERS] Modifing returning value of PQgetvalue.

2017-06-27 Thread Tom Lane
Peter Eisentraut  writes:
> On 6/24/17 06:31, Dmitry Igrishin wrote:
>> PQgetvalue returns a value of type char* (without const). But the
>> documentation says:
>> "The pointer returned by PQgetvalue points to storage that is part of
>> the PGresult structure. /One should not modify the data it points to/"
>> (my italics). Could someone please tell me what is wrong with modifying an
>> arbitrary character of the data pointed to by PQgetvalue's return value?
>> Or why is this restriction documented? Thanks.

> This is just how the API is defined.  It could be defined differently,
> but it is not.

To enlarge on that: it might well work today.  But if we change the
code in future so that it does not work, we will make no apologies.

There's at least one case where you could have a problem today.
All null entries in a PGresult share the same pointer to an empty
string, so that if you were to modify that string, it would affect
what's seen for other "null" values.  If the API allowed modification
of data values, that would be a bug.  But it doesn't, and so it isn't,
and we wouldn't look kindly on complaints about it.
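
A short sketch of why that would be a bug (illustrative only; client code
must never do this):

/* suppose rows 0 and 1 both have a NULL in column 0 */
char *a = PQgetvalue(res, 0, 0);   /* a NULL renders as "" */
char *b = PQgetvalue(res, 1, 0);   /* may be the very same pointer */
a[0] = 'x';                        /* clobbers the shared terminator, so what
                                    * b (and every other null) shows changes */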

Ideally, PQgetvalue should be redefined to return "const char *".
I don't really see us doing that anytime soon though.  It would
result in a lot of compile warnings for perfectly-safe client code.
I'm sure this API would look different if it weren't older than
universal availability of "const" :-(

regards, tom lane




Re: [HACKERS] pgjdbc logical replication client throwing exception

2017-06-27 Thread sanyam jain
Hi,

>What does the server log say?  If nothing interesting, turn up debugging.

I receive the following log on the server:
LOG:  could not send data to client: Broken pipe

Thanks,
Sanyam Jain


Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-06-27 Thread Mithun Cy
On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown  wrote:

>
> I also think pg_prewarm.dump_interval should be renamed to
> pg_prewarm.autoprewarm_interval.

Thanks, I have changed it to pg_prewarm.autoprewarm_interval.

>>
>> * In the documentation, don't say "This is a SQL callable function
>> to".  This is a list of SQL-callable functions, so each thing in
>> the list is one.  Just delete this from the beginning of each
>> sentence.


> One thing I couldn't quite make sense of is:
>
> "The autoprewarm process will start loading blocks recorded in
> $PGDATA/autoprewarm.blocks until there is a free buffer left in the
> buffer pool."
>
> Is this saying "until there is a single free buffer remaining in
> shared buffers"?  I haven't corrected or clarified this as I don't
> understand it.

Sorry, that was a typo. I wanted to say: until there is no free buffer
left. Fixed in autoprewarm_16.patch.

>
> Also, I find it a bit messy that launch_autoprewarm_dump() doesn't
> detect an autoprewarm process already running.  I'd want this to
> return NULL or an error if called for a 2nd time.

We log instead of raising an error because we only perform the check
after launching the worker, from inside the worker. One solution could
be that, similar to autoprewarm_dump_now(), autoprewarm_start_worker()
initializes shared memory and checks in the backend itself whether we
can launch the worker. I will try to fix this.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-06-27 Thread Mithun Cy
Thanks, Robert. I have tried to fix all of your comments and merged the
fixes suggested by Thom into patch 15.

On Fri, Jun 23, 2017 at 3:22 AM, Robert Haas  wrote:
>
> * I suggest renaming launch_autoprewarm_dump() to
> autoprewarm_start_worker().  I think that will be clearer.  Remember
> that user-visible names, internal names, and the documentation should
> all match.

-- Fixed as suggested.

>
> * I think the GUC could be pg_prewarm.autoprewarm rather than
> pg_prewarm.enable_autoprewarm.  It's shorter and, I think, no less
> clear.

-- I have made the GUC name autoprewarm.

>
> * In the documentation, don't say "This is a SQL callable function
> to".  This is a list of SQL-callable functions, so each thing in
> the list is one.  Just delete this from the beginning of each
> sentence.

-- Fixed, Thom has provided the fix and I have merged same to my patch.

> * The reason for the AT_PWARM_* naming is not very obvious.  Does AT
> mean "at" or "auto" or something else?  How about
> AUTOPREWARM_INTERVAL_DISABLED, AUTOPREWARM_INTERVAL_SHUTDOWN_ONLY,
> AUTOPREWARM_INTERVAL_DEFAULT?

-- Fixed as suggested. AUTOPREWARM_INTERVAL_DISABLED is removed
now, as suggested by the comments below.

>
> * Instead of defining apw_sigusr1_handler, I think you could just use
> procsignal_sigusr1_handler.  Instead of defining apw_sigterm_handler,
> perhaps you could just use die().  got_sigterm would go away, and
> you'd just CHECK_FOR_INTERRUPTS().

-- I have registered procsignal_sigusr1_handler instead of
apw_sigusr1_handler. But I have some doubts about using die() instead of
apw_sigterm_handler in the main autoprewarm worker. On shutdown (SIGTERM)
we should dump and then exit, so with a plain CHECK_FOR_INTERRUPTS() we
might miss dumping the buffer contents. I think I need to modify some
server code in ProcessInterrupts to handle this; please let me know if
I am wrong about this.
For the per-database prewarm worker this seems right, so I am registering
die() for SIGTERM and calling CHECK_FOR_INTERRUPTS(). The same goes for
autoprewarm_dump_now().
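
A sketch of the handler/loop split I have in mind for the main worker
(simplified and hypothetical, not the final patch):

static volatile sig_atomic_t got_sigterm = false;

static void
apw_sigterm_handler(SIGNAL_ARGS)
{
	int			save_errno = errno;

	got_sigterm = true;
	SetLatch(MyLatch);
	errno = save_errno;
}

/* main loop: on SIGTERM, dump one last time, then exit cleanly */
while (!got_sigterm)
	dump_block_info_periodically();
dump_now(true);
proc_exit(0);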

>
> * The PG_TRY()/PG_CATCH() block in autoprewarm_dump_now() could reuse
> reset_apw_state(), which might be better named detach_apw_shmem().
> Similarly, init_apw_state() could be init_apw_shmem().

-- Fixed.

> * Instead of load_one_database(), I suggest
> autoprewarm_database_main().  That is more parallel to
> autoprewarm_main(), which you have elsewhere, and makes it more
> obvious that it's the main entrypoint for some background worker.

-- Fixed.

> * Instead of launch_and_wait_for_per_database_worker(), I suggest
> autoprewarm_one_database(), and instead of prewarm_buffer_pool(), I
> suggest autoprewarm_buffers().   The motivation for changing prewarm
> to autoprewarm is that we want the names here to be clearly distinct
> from the other parts of pg_prewarm that are not related to
> autoprewarm.  The motivation for changing buffer_pool to buffers is
> just that it's a little shorter.  Personally I also like the sound of it
> better, but YMMV.

-- Fixed; I have renamed them as suggested.

> * prewarm_buffer_pool() ends with a useless return statement.  I
> suggest removing it.

-- Sorry Fixed.

>
> * Instead of creating our own buffering system via buffer_file_write()
> and buffer_file_flush(), why not just use the facilities provided by
> the operating system?  fopen() et. al. provide buffering, and we have
> AllocateFile() to provide a FILE *; it's just like
> OpenTransientFile(), which you are using, but you'll get the buffering
> stuff for free.  Maybe there's some reason why this won't work out
> nicely, but off-hand it seems like it might.  It looks like you are
> already using AllocateFile() to read the dump, so using it to write
> the dump as well seems like it would be logical.

-- Now using AllocateFile().

> * I think that it would be cool if, when autoprewarm completed, it
> printed a message at LOG rather than DEBUG1, and with a few more
> details, like "autoprewarm successfully prewarmed %d of %d
> previously-loaded blocks".  This would require some additional
> tracking that you haven't got right now; you'd have to keep track not
> only of the number of blocks read from the file but how many of those
> some worker actually loaded.  You could do that with an extra counter
> in the shared memory area that gets incremented by the per-database
> workers.
>
> * dump_block_info_periodically() calls ResetLatch() immediately before
> WaitLatch; that's backwards.  See the commit message for commit
> 887feefe87b9099c2967ec31ce20df4dfa9b and the comments it added to
> the top of latch.h for details on how to do this correctly.

-- Sorry Fixed.
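
For reference, a minimal sketch of the ordering that the latch.h comments
now document (hypothetical worker loop; flags and naptime_ms are
placeholders):

for (;;)
{
	int			rc;

	CHECK_FOR_INTERRUPTS();

	/* ... do the periodic work based on current state ... */

	rc = WaitLatch(MyLatch,
				   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
				   naptime_ms, WAIT_EVENT_EXTENSION);
	ResetLatch(MyLatch);	/* reset after waking, before re-checking state */

	if (rc & WL_POSTMASTER_DEATH)
		proc_exit(1);
}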

> * dump_block_info_periodically()'s main loop is a bit confusing.  I
> think that after calling dump_now(true) it should just "continue",
> which will automatically retest got_sigterm.  You could rightly object
> to that plan on the grounds that we then wouldn't recheck got_sighup
> promptly, but you can fix