Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-08-04 Thread Ildus Kurbangaliev

On 08/03/2015 04:25 PM, Ildus Kurbangaliev wrote:

On 07/28/2015 10:28 PM, Heikki Linnakangas wrote:

On 07/27/2015 01:20 PM, Ildus Kurbangaliev wrote:

Hello.
In the attached patch I've refactored the tranches.
Their prefix was extended, and I've split LWLockAssign into two
functions (one that takes a tranche, and a second for user-defined LWLocks).


This needs some work in order to be maintainable:

* The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept 
in sync with the list of individual locks in lwlock.h. Sooner or 
later someone will add an LWLock and forget to update the 
names-array. That needs to be made less error-prone, so that the 
names are maintained in the same place as the #defines. Perhaps 
something like rmgrlist.h.


* The base tranches are a bit funny. They all have the same 
array_base, pointing to MainLWLockArray. If there are e.g. 5 clog 
buffer locks, I would expect T_NAME() to return ClogBufferLocks 
for all of them, and T_ID() to return numbers between 0-4. But in 
reality, T_ID() will return something like 55-59.
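
For reference, T_ID() is defined in lwlock.c roughly like this 
(paraphrased), which is why every tranche that shares MainLWLockArray 
as its array_base reports main-array offsets:

#define T_ID(lock) \
	((int) ((((char *) lock) - \
		((char *) LWLockTrancheArray[(lock)->tranche]->array_base)) / \
		LWLockTrancheArray[(lock)->tranche]->array_stride))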


Instead of passing a tranche-id to LWLockAssign(), I think it would 
be more clear to have a new function to allocate a contiguous block 
of lwlocks as a new tranche. It could then set the base correctly.
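
A minimal sketch of what I mean, reusing the existing tranche API (the 
function name and details here are hypothetical, not from the patch):

LWLock *
LWLockAssignTrancheBlock(const char *name, int nlocks)
{
	int			tranche_id = LWLockNewTrancheId();
	LWLockTranche *tranche = ShmemAlloc(sizeof(LWLockTranche));
	LWLock	   *block = ShmemAlloc(mul_size(nlocks, sizeof(LWLock)));
	int			i;

	tranche->name = name;
	tranche->array_base = block;		/* T_ID() then yields 0..nlocks-1 */
	tranche->array_stride = sizeof(LWLock);
	LWLockRegisterTranche(tranche_id, tranche);

	for (i = 0; i < nlocks; i++)
		LWLockInitialize(&block[i], tranche_id);

	return block;
}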


* Instead of having LWLOCK_INDIVIDUAL_NAMES to name individual 
locks, how about just giving each one of them a separate tranche?


* User manual needs to be updated to explain the new column in 
pg_stat_activity.


- Heikki



Hello. Thanks for the review. I've attached a new version of the patch.

It adds a new field to pg_stat_activity that shows the current wait event in the backend.

I've done some refactoring of the LWLocks tranche mechanism. Now lwlock.c
creates only individual and user-defined LWLocks; other LWLocks are created
by the modules that need them. I think that is more logical (the user knows
about the module, not the module about all of its users). It also simplifies
LWLock acquisition.

Now each individual LWLock and each group of LWLocks has its own tranche,
and can be easily identified. If somebody adds a new individual LWLock and
forgets to add its name, postgres will show a message. Individual LWLocks
are still allocated in one array, but their tranches point to the
corresponding index in the main array.

Sample:

b1=# select pid, wait_event from pg_stat_activity; \watch 1

 pid  |       wait_event
------+--------------------------
 7722 | Storage: READ
 7653 |
 7723 | Network: WRITE
 7725 | Network: READ
 7727 | Locks: Transaction
 7731 | Storage: READ
 7734 | Network: READ
 7735 | Storage: READ
 7739 | LWLocks: WALInsertLocks
 7738 | Locks: Transaction
 7740 | LWLocks: BufferMgrLocks
 7741 | Network: READ
 7742 | Network: READ
 7743 | Locks: Transaction


Attached is a new version of the patch with some small fixes in the code.

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e64b7ef..2e4e67e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -630,6 +630,11 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
  <entry>True if this backend is currently waiting on a lock</entry>
 </row>
 <row>
+ <entry><structfield>wait_event</structfield></entry>
+ <entry><type>text</type></entry>
+ <entry>Name of a current wait event in backend</entry>
+</row>
+<row>
  <entry><structfield>state</structfield></entry>
  <entry><type>text</type></entry>
  <entry>Current overall state of this backend.
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 3a58f1e..10c25cf 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -457,7 +457,8 @@ CLOGShmemInit(void)
 {
 	ClogCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(ClogCtl, "CLOG Ctl", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
-				  CLogControlLock, "pg_clog");
+				  CLogControlLock, "pg_clog",
+				  "CLogBufferLocks");
 }
 
 /*
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 5ad35c0..dd7578f 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -466,7 +466,8 @@ CommitTsShmemInit(void)
 
 	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
 	SimpleLruInit(CommitTsCtl, "CommitTs Ctl", CommitTsShmemBuffers(), 0,
-				  CommitTsControlLock, "pg_commit_ts");
+				  CommitTsControlLock, "pg_commit_ts",
+				  "CommitTSBufferLocks");
 
 	commitTsShared = ShmemInitStruct("CommitTs shared",
 									 sizeof(CommitTimestampShared),
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1933a87..b905c59 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1842,10 +1842,12 @@ MultiXactShmemInit(void)
 
 	SimpleLruInit(MultiXactOffsetCtl,
 				  "MultiXactOffset Ctl", NUM_MXACTOFFSET_BUFFERS, 0,
-				  MultiXactOffsetControlLock, "pg_multixact/offsets");
+  

Re: [HACKERS] GROUP BY before JOIN

2015-08-04 Thread Ashutosh Bapat
On Tue, Aug 4, 2015 at 4:00 PM, David Rowley david.row...@2ndquadrant.com
wrote:

 On 4 August 2015 at 21:56, Ashutosh Bapat ashutosh.ba...@enterprisedb.com
  wrote:

 This looks like one example of the general problem of finding an optimal order
 for SQL operations. Consider a query defined as sql_op1(sql_op2(sql_op3(A,
 B), sql_op4(C, D), sql_op5(E, F))), where sql_op can be an SQL operation like
 join, grouping, aggregation, projection, union, intersection, limit,
 ordering etc. and A, B, C, D are tables. Given such a representation, how
 do we find an optimal order of operation?

 In your example the query which is defined like GROUPBY(JOIN(A, B)) might
 have an optimal order JOIN(GROUPBY(A), GROUPBY(B)) if we can prove GROUPBY
 is distributive over JOIN. Similarly GROUPBY(UNION(A, B)) might have an
 optimal order UNION(GROUPBY(A), GROUPBY(B)) if A and B produce disjoint
 groups i.e. UNION is distributive over GROUPBY.


 I'm not really sure that I understand why it's GROUPBY(A) and GROUPBY(B).
 If the group by contained columns from relations A and B, then it wouldn't
 be possible to group each one individually. We would have to wait until the
 join was performed in that case.


Sorry for not being clear enough. GROUPBY(A) implies GROUPBY on relation A,
using some columns (unspecified, but common to both A and B here). GROUPBY
is used as an operator over relations, in the classic sense of an operator.
This representation does not have any columns or join conditions specified,
for simplicity. It specifies SQL operators with relations as operands.



 Right now, we use dynamic programming to find the optimal order for join
 tree (make_one_rel). In the language above, we find the optimal order for
 evaluating an expression consisting of JOIN operators by exploiting their
 commutative and associative properties. If we could extend that to SQL
 expression tree representing the query, we will be able to solve many of
 such re-ordering problems using current path infrastructure. The dynamic
 program for this would look something like below

 root = top operator of the query
 child = the operands of the root

 cost of query = minimum of
 1. cost of query without any mutation
 2. cost of root and its child operator commuted, if there is only one
 child for root operator and root operator and child operator are commutable
 3. cost of root distributed over children operands if there are multiple
 operands and root operator is distributive over child operators
 4. ...
 5. ...

 Each operator root will have a RelOptInfo to which we add many paths
 possible for performing that operation using different orders of
 evaluation. Current add_path should eliminate the expensive ones.


 So in a query such as:

  select s.product_id from sale s inner join product p on s.product_id =
 p.product_id group by s.product_id;

 Do you mean that there would be 2 RelOptInfos for the sale relation, one
 without the GROUP BY, and one with? That would likely solve the issue I
 have with join row estimates, but which of the 2 RelOptInfo would all of
 the Vars in the parse tree be pointing to?


This query would be represented as GROUPBY(JOIN(sales, product)), grouping
happens on s.product_id and JOIN on s.product_id = p.product_id. The
dynamic programming would consider GROUPBY(JOIN(sales, product)) and
JOIN(GROUPBY(sales), GROUPBY(product)) as possible evaluation orders, since
GROUPBY is distributive over JOIN in this case and choose the order with
the least cost.



 This is going to explode the number of path (plan) trees considered by
 planner, so we need to be judicious when to apply this technique or have a
 rigorous pruning technique to discard costly paths early enough. This
 approach might be better than trying to solve each re-ordering problem
 separately and might go well with upper planner pathification being
 discussed at
 http://www.postgresql.org/message-id/ca+tgmoaqgzhux7frzddpghkqt3rcnx-f0df+zqzsefw-arx...@mail.gmail.com
 .


 It's certainly going to increase that, but I'm not sure if 'explode' is
 the best word as we at least require all of the rels seen in the GROUP BY
 expressions and aggregate function parameters to be joined before we can
 consider a GroupingPath. The patch
 uses bms_is_subset(root->minimumGroupingRels, rel->relids) to determine
 that.


 On Tue, Aug 4, 2015 at 1:57 PM, David Rowley 
 david.row...@2ndquadrant.com wrote:

 == Overview ==

 As of today we always perform GROUP BY at the final stage, after each
 relation in the query has been joined. This of course works, but it's not
 always the most optimal way of executing the query.

 Consider the following two relations:

 create table product (product_id int primary key, code varchar(32) not
 null);
 create table sale (sale_id int primary key, product_id int not null, qty
 int not null);
 create index on sale (product_id);

 Populated with:

 insert into product select x.x,'ABC' || x.x from generate_series(1,100)
 x(x);
 insert into sale select x.x,x.x%100+1, 10 from
 

Re: [HACKERS] [PATCH] postgres_fdw extension support

2015-08-04 Thread Paul Ramsey
Thanks so much Michael! Let me know when you have further feedback I should 
incorporate.

ATB,
P. 

-- 
http://postgis.net
http://cleverelephant.ca


On July 25, 2015 at 4:52:11 AM, Michael Paquier (michael.paqu...@gmail.com) 
wrote:

On Sat, Jul 25, 2015 at 2:19 AM, Paul Ramsey pram...@cleverelephant.ca wrote: 
 
 On Thu, Jul 23, 2015 at 7:48 AM, Paul Ramsey pram...@cleverelephant.ca 
 wrote:  
 Here's an updated patch that clears the cache on changes to foreign  
 wrappers and servers.  
  
 Any chance one of you folks could by my official commitfest reviewer?  
 Appreciate all the feedback so far!  
  
 https://commitfest.postgresql.org/6/304/  

That's an interesting topic, I will register as such.  
--  
Michael  


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-08-04 Thread Ildus Kurbangaliev

On 08/04/2015 03:15 PM, Robert Haas wrote:

On Tue, Jul 28, 2015 at 3:28 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

* The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync
with the list of individual locks in lwlock.h. Sooner or later someone will
add an LWLock and forget to update the names-array. That needs to be made
less error-prone, so that the names are maintained in the same place as the
#defines. Perhaps something like rmgrlist.h.

This is a good idea, but it's not easy to do in the style of
rmgrlist.h, because I don't believe there's any way to define a macro
that expands to a preprocessor directive.  Attached is a patch that
instead generates the list of macros from a text file, and also
generates an array inside lwlock.c with the lock names that gets used
by the Trace_lwlocks stuff where applicable.

Any objections to this solution to the problem?  If not, I'd like to
go ahead and push this much.  I can't test the Windows changes
locally, though, so it would be helpful if someone could check that
out.
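
To illustrate the idea (this is a sketch, not the actual script 
output): from a line like ShmemIndexLock 1 in the text file, the 
generator can emit both the #define, which no C macro could produce 
since a macro cannot expand to a preprocessor directive, and the 
matching name-array entry:

/* in the generated lwlocknames.h */
#define ShmemIndexLock (&MainLWLockArray[1].lock)

/* in the generated lwlocknames.c, used by the Trace_lwlocks code */
static const char *const MainLWLockNames[] = {
	...
	"ShmemIndexLock",
	...
};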

In my latest patch I still have an array with names, but postgres will 
show a message if somebody adds an individual LWLock and forgets to add 
its name. Code generation is also a solution, and if committers support 
it I'll merge it into the main patch.


--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Simon Riggs
On 4 August 2015 at 06:03, Jeff Janes jeff.ja...@gmail.com wrote:


 The attached proof of concept patch greatly improves the bloat for both
 the insert and the update cases.  You need to turn on both features: adding
 the pages to fsm, and vacuuming the fsm, to get the benefit (so JJ_GIN=3).
 The first of those two things could probably be adopted for real, but the
 second probably is not acceptable.  What is the right way to do this?
 Could a variant of RecordFreeIndexPage bubble the free space up the map
 immediately rather than waiting for a vacuum?  It would only have to move
 up until it found a page with freespace already recorded in it, which the
 vast majority of the time would mean observing up one level and then not
 writing to it, assuming the pending list pages remain well clustered.


You make a good case for action here since insert only tables with GIN
indexes on text are a common use case for GIN.

Why would vacuuming the FSM be unacceptable? With a
large gin_pending_list_limit it makes sense.

If it is unacceptable, perhaps we can avoid calling it every time, or
simply have FreeSpaceMapVacuum() terminate more quickly on some kind of
80/20 heuristic for this case.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2015-08-04 Thread Shigeru Hanada
Hi Ashutosh,

Sorry for leaving the thread.

2015-07-20 16:09 GMT+09:00 Ashutosh Bapat ashutosh.ba...@enterprisedb.com:
 In find_user_mapping(), if the first cache search returns a valid tuple, it
 is checked twice for validity, unnecessarily. Instead, if the first search
 returns a valid tuple, it should be returned immediately. I see that this
 code was copied from GetUserMapping(), which has the same problem as well.

Oops.  I changed find_user_mapping to exit immediately once a valid
cache entry is found.
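
A sketch of the new shape (assuming GetUserMapping()-style syscache 
lookups; not the exact patch code):

	tup = SearchSysCache2(USERMAPPINGUSERSERVER,
						  ObjectIdGetDatum(userid),
						  ObjectIdGetDatum(serverid));
	if (HeapTupleIsValid(tup))
		return tup;			/* found; no second validity check needed */

	/* Not found for the specific user -- try PUBLIC */
	tup = SearchSysCache2(USERMAPPINGUSERSERVER,
						  ObjectIdGetDatum(InvalidOid),
						  ObjectIdGetDatum(serverid));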

 In build_join_rel(), we check whether the user mappings of the two joining
 relations are the same. If the user mappings are the same, doesn't that imply
 that the foreign server and local user are the same too?

Yes, the validity of umid implies the validity of serverid.  We can
remove the serverid check to save some cycles.  One idea is to put an
Assert on serverid inside the if-statement block.

 Rest of the patch looks fine.

Thanks

-- 
Shigeru HANADA


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cost_agg() with AGG_HASHED does not account for startup costs

2015-08-04 Thread Tom Lane
David Rowley david.row...@2ndquadrant.com writes:
 While working on allowing the planner to perform GROUP BY before joining
 I've noticed that cost_agg() completely ignores input_startup_cost
 when aggstrategy == AGG_HASHED.

Isn't your proposed patch double-counting the input startup cost?
input_total_cost already includes that charge.  The calculation
reflects the fact that we have to read all of the input before we
can deliver any aggregated results, so the time to get the first
input row isn't really interesting.

If this were wrong, the PLAIN costing path would also be wrong, but
I don't think that either one is.
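
For reference, the AGG_HASHED branch of cost_agg() looks roughly like
this (paraphrased, not the verbatim costsize.c code):

	startup_cost = input_total_cost;	/* must read all input first */
	startup_cost += aggcosts->transCost.startup;
	startup_cost += aggcosts->transCost.per_tuple * input_tuples;
	startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;
	total_cost = startup_cost;
	total_cost += aggcosts->finalCost * numGroups;
	total_cost += cpu_tuple_cost * numGroups;

Since input_total_cost already includes the input's startup component,
adding input_startup_cost on top would charge for it twice.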

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-08-04 Thread Heikki Linnakangas

On 08/04/2015 03:15 PM, Robert Haas wrote:

On Tue, Jul 28, 2015 at 3:28 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

* The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync
with the list of individual locks in lwlock.h. Sooner or later someone will
add an LWLock and forget to update the names-array. That needs to be made
less error-prone, so that the names are maintained in the same place as the
#defines. Perhaps something like rmgrlist.h.


This is a good idea, but it's not easy to do in the style of
rmgrlist.h, because I don't believe there's any way to define a macro
that expands to a preprocessor directive.  Attached is a patch that
instead generates the list of macros from a text file, and also
generates an array inside lwlock.c with the lock names that gets used
by the Trace_lwlocks stuff where applicable.

Any objections to this solution to the problem?  If not, I'd like to
go ahead and push this much.  I can't test the Windows changes
locally, though, so it would be helpful if someone could check that
out.


A more low-tech solution would be to do something like this in lwlocknames.c:

static char *MainLWLockNames[NUM_INDIVIDUAL_LWLOCKS];

/* Turn a pointer to one of the LWLocks in the main array into an index 
number */

#define NAME_LWLOCK(l, name) MainLWLockNames[(l - MainLWLockArray)] = name

InitLWLockNames()
{
	NAME_LWLOCK(ShmemIndexLock, "ShmemIndexLock");
	NAME_LWLOCK(OidGenLock, "OidGenLock");
	...
}

That would not be auto-generated, so you'd need to keep that list in 
sync with lwlock.h, but it would be much better than the original patch 
because if you forgot to add an entry in the names-array, the numbering 
of all the other locks would not go wrong. And you could have a runtime 
check that complains if there's an entry missing, like Ildus did in his 
latest patch.
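
For instance (a sketch), somewhere after all the NAME_LWLOCK() calls:

	int			i;

	for (i = 0; i < NUM_INDIVIDUAL_LWLOCKS; i++)
		if (MainLWLockNames[i] == NULL)
			elog(ERROR, "no name assigned to individual LWLock %d", i);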


I have no particular objection to your perl script either, though. I'll 
leave it up to you.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Heikki Linnakangas

On 08/04/2015 04:35 PM, Simon Riggs wrote:

This and the OP seem like 9.5 open items to me.


Why? This is nothing new in 9.5.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Simon Riggs
On 4 August 2015 at 14:55, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 08/04/2015 04:35 PM, Simon Riggs wrote:

 This and the OP seem like 9.5 open items to me.


 Why? This is nothing new in 9.5.


gin_pending_list_limit is new in 9.5

We're in Alpha, so if something has been added and isn't very usable, we
should change it while we can.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Andres Freund
On 2015-08-04 14:59:11 +0100, Simon Riggs wrote:
 On 4 August 2015 at 14:55, Heikki Linnakangas hlinn...@iki.fi wrote:
 
  On 08/04/2015 04:35 PM, Simon Riggs wrote:
 
  This and the OP seem like 9.5 open items to me.
 
 
  Why? This is nothing new in 9.5.
 
 
 gin_pending_list_limit is new in 9.5
 
 We're in Alpha, so if something has been added and isn't very usable, we
 should change it while we can.

The only thing that variable does is change what the pending size limit
is determined by. Previously it was work_mem, now it's
gin_pending_list_limit. Imo that has pretty much nothing to do with not
registering pages as free.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-08-04 Thread Amit Kapila
On Sun, Aug 2, 2015 at 8:06 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

 Amit,

 Let me ask three more detailed questions.

 Why does Funnel have a valid qual of the subplan?
 The 2nd argument of make_funnel() is the qualifier of the subplan
 (PartialSeqScan); it is initialized at ExecInitFunnel,
 but never executed at run-time. Why does the Funnel node have a
 useless qualifier expression here (even though it is harmless)?


The idea is that if in some cases the qualification can't be
pushed down (consider the case where the qualification contains
parallel-restricted functions, i.e. functions that can only be
executed in the master backend) and needs to be executed only
in the master backend, then we need it in the Funnel node, so that it
can be executed for tuples passed by worker backends.  It is
currently not used, but I think we should retain it as it is
because it can be used in some cases either as part of this
patch itself or in future.  As of now, it is used in other
places in the patch (like during Explain) as well; we
might want to optimize that, but overall I think it is
required.

 Why is Funnel derived from Scan? Even though it constructs
 a target-list compatible with the underlying partial-scan node,
 that does not require the node to also be derived from Scan.

It needs its own target-list due to the reason mentioned above
for the qual, and yet another reason is that the same is required
for FunnelState, which in turn is required for the ScanSlot used to
retrieve tuples from workers.  Also it is not exactly the same
as PartialSeqScan, because for the case when the PartialSeqScan
node is executed by a worker, we modify the targetlist as well;
refer to create_parallel_worker_plannedstmt().


 Does ExecFunnel() need to have a special code path to handle
 EvalPlanQual()? Probably it just calls the underlying node in the
 local context. ExecScan() of PartialSeqScan will check its
 qualifier against estate->es_epqTuple[].


Isn't EvalPlanQual() called from the ModifyTable node, which
won't be allowed in parallel mode?  So I think EvalPlanQual()
handling is not required in the ExecFunnel path.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-08-04 Thread Beena Emerson
Michael Paquier wrote:
 And this is the case for any format as well. String format validation
 for a GUC occurs when the server is reloaded or restarted; one advantage
 of JSON is that the parser/validator is already there, so we don't need
 to reinvent new machinery for that.

IIUC, we would also have to add additional code to check that the
given JSON has the required keys and entries. For example, the group
mentioned in s_s_names should be defined in the groups section, etc.





-
Beena Emerson

--
View this message in context: 
http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5860758.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tablecmds.c and lock hierarchy

2015-08-04 Thread Simon Riggs
On 4 August 2015 at 05:56, Michael Paquier michael.paqu...@gmail.com
wrote:

 Hi all,

 As mentioned in the thread related to lowering locks of autovacuum
 reloptions in ALTER TABLE SET
 (
 http://www.postgresql.org/message-id/cafcns+ox7jvenc_3i54fdq3ibmogmknc2tmevdsmvojbsxg...@mail.gmail.com
 ),
 I have noticed the following code in
 AlterTableGetLockLevel@tablecmds.c:
 /*
  * Take the greatest lockmode from any subcommand
  */
  if (cmd_lockmode > lockmode)
 lockmode = cmd_lockmode;

 The thing is that, as mentioned by Alvaro and Andres on this thread,
 we have no guarantee that the different relation locks compared have a
 monotone hierarchy, and we may end up taking a lock that does not
 behave as intended. We are lucky enough for now that ALTER TABLE
 only uses ShareUpdateExclusiveLock, ShareRowExclusiveLock and
 AccessExclusiveLock, which actually do form a hierarchy, so this is not a
 problem yet.
 However, it may become a problem if we add more lock modes in the future
 that are used by ALTER TABLE.


Please provide the link to the discussion of this. I don't see a problem
here right now that can't be solved by saying

Assert(locklevel == ShareUpdateExclusiveLock ||
locklevel >= ShareRowExclusiveLock);

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-08-04 Thread Michael Paquier
On Tue, Aug 4, 2015 at 3:27 PM, Beena Emerson memissemer...@gmail.com wrote:
 Michael Paquier wrote:
 And this is the case for any format as well. String format validation
 for a GUC occurs when the server is reloaded or restarted; one advantage
 of JSON is that the parser/validator is already there, so we don't need
 to reinvent new machinery for that.

 IIUC, we would also have to add additional code to check that the
 given JSON has the required keys and entries. For example, the group
 mentioned in s_s_names should be defined in the groups section, etc.

Yep, true as well.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tablecmds.c and lock hierarchy

2015-08-04 Thread Michael Paquier
On Tue, Aug 4, 2015 at 2:43 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 Michael Paquier wrote:

 As mentioned in the thread related to lowering locks of autovacuum
 reloptions in ALTER TABLE SET
 (http://www.postgresql.org/message-id/cafcns+ox7jvenc_3i54fdq3ibmogmknc2tmevdsmvojbsxg...@mail.gmail.com),
 I have noticed the following code in
 AlterTableGetLockLevel@tablecmds.c:
 /*
  * Take the greatest lockmode from any subcommand
  */
  if (cmd_lockmode > lockmode)
 lockmode = cmd_lockmode;

 The thing is that, as mentioned by Alvaro and Andres on this thread,
 we have no guarantee that the different relation locks compared have a
 monotone hierarchy, and we may end up taking a lock that does not
 behave as intended.

 Maybe the solution to this is to add the concept of addition of two
 lock modes, where the result is another lock mode that conflicts with
 any lock that would conflict with either of the two operand lock modes.
 For instance, if you add a lock mode to itself, you get the same lock
 mode; if you add anything to AccessExclusive, you get AccessExclusive;
 if you add anything to AccessShare, you end up with that other lock
 mode.  The most interesting case in our current lock table is if you
 add ShareUpdateExclusive to Share, where you end up with
 ShareRowExclusive.  In essence, holding the result of that operation is
 the same as holding both locks, which AFAICS is the behavior we want.

That's commutative: this basically looks at the conflict table to get
the union of the bits indicating all the locks conflicting with lock A
or lock B, and then we select the lock mode whose conflict set includes
the whole union while conflicting with as few extra modes as possible.
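
As a sketch, that operation could look like this (a hypothetical
function; conflictTab would be the lock method's array of LOCKMASKs as
in lock.c, and ties go to the lowest-numbered mode):

static LOCKMODE
LockModeAdd(const LOCKMASK *conflictTab, int numLockModes,
			LOCKMODE a, LOCKMODE b)
{
	LOCKMASK	needed = conflictTab[a] | conflictTab[b];
	LOCKMODE	mode;

	/* the weakest mode whose conflict set covers the union wins */
	for (mode = 1; mode <= numLockModes; mode++)
	{
		if ((conflictTab[mode] & needed) == needed)
			return mode;
	}
	return AccessExclusiveLock;		/* conflicts with everything */
}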

Now, let's take for example this case with locks A, B, C, D:
- Lock A conflicts with ACD
- B with BCD
- C with itself
- D with itself
What would you choose as a result of add(C,D)? A or B? Or the super
lock conflicting with all of them?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Heikki Linnakangas

On 08/04/2015 08:03 AM, Jeff Janes wrote:

For a GIN index with fastupdate turned on, both the user backends and
autoanalyze routine will clear out the pending list, pushing the entries
into the normal index structure and deleting the pages used by the pending
list.  But those deleted pages will not get added to the freespace map
until a vacuum is done.  This leads to horrible bloat on insert only
tables, as it is never vacuumed and so the pending list space is never
reused.  And the pending list is very inefficient in space usage to start
with, even compared to the old style posting lists and especially compared
to the new compressed ones.  (If they were aggressively recycled, this
inefficient use wouldn't be much of a problem.)


Good point.


The attached proof of concept patch greatly improves the bloat for both the
insert and the update cases.  You need to turn on both features: adding the
pages to fsm, and vacuuming the fsm, to get the benefit (so JJ_GIN=3).  The
first of those two things could probably be adopted for real, but the
second probably is not acceptable.  What is the right way to do this?
Could a variant of RecordFreeIndexPage bubble the free space up the map
immediately rather than waiting for a vacuum?  It would only have to move
up until it found a page with freespace already recorded in it, which the
vast majority of the time would mean observing up one level and then not
writing to it, assuming the pending list pages remain well clustered.


Yep, that sounds good.
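
Something like this, perhaps, as it might look inside freespace.c (a 
rough sketch only: fsm_update_slot is a hypothetical helper that sets a 
slot's value and returns the previous one; the real per-page update 
functions there are static and differ in detail):

void
RecordFreeIndexPageBubble(Relation rel, BlockNumber freeBlock)
{
	uint16		slot;
	uint8		newcat = fsm_space_avail_to_cat(BLCKSZ - 1);
	FSMAddress	addr = fsm_get_location(freeBlock, &slot);

	for (;;)
	{
		uint8		oldcat = fsm_update_slot(rel, addr, slot, newcat);

		/* this level already advertised at least as much space: stop */
		if (oldcat >= newcat)
			break;
		if (addr.level == FSM_ROOT_LEVEL)
			break;
		addr = fsm_get_parent(addr, &slot);
	}
}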

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-08-04 Thread Michael Paquier
On Tue, Aug 4, 2015 at 2:57 PM, Masahiko Sawada wrote:
 On Thu, Jul 30, 2015 at 2:16 PM, Beena Emerson wrote:
 Since there will not be much nesting and grouping, I still prefer a new
 language to JSON.
 I understand one can easily modify/add groups in JSON using built-in
 functions, but I think changes will not be made too often.


 If we decide to use a dedicated language, a syntax checker for that
 language is needed, via SQL or something.

Well, sure, both approaches have downsides.

 Otherwise we will not be able to know whether that value parses
 correctly until reloading or restarting the server.

And this is the case for any format as well. String format validation
for a GUC occurs when the server is reloaded or restarted; one advantage
of JSON is that the parser/validator is already there, so we don't need
to reinvent new machinery for that.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: SCRAM authentication

2015-08-04 Thread Michael Paquier
On Mon, Mar 30, 2015 at 7:52 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

 There have been numerous threads on replacing our MD5 authentication
 method, so I started hacking on that to see what it might look like. Just
 to be clear, this is 9.6 material. Attached is a WIP patch series that adds
 support for SCRAM. There's no need to look at the details yet, but it
 demonstrates what the protocol changes and the code structure would be like.

 I'm not wedded to SCRAM - SRP or JPAKE or something else might be better.
 But replacing the algorithm, or adding more of them, should be
 straightforward with this.


Agreed. We need such a facility.


 There is no negotiation of the authentication mechanism. SCRAM is just
 added as a new one, alongside all the existing ones. If the server requests
 SCRAM authentication, but the client doesn't support it, the attempt will
 fail. We might want to do something about that, to make the transition
 easier, but it's an orthogonal feature and not absolutely required.

 There are four patches in the series. The first two are just refactoring:
 moving the SHA-1 implementation from pgcrypto to src/common, and some
 refactoring in src/backend/auth.c that IMHO would make sense anyway.


The two first patches of the series look good to me.


 Patches three and four are the interesting ones:


I have not looked in detail yet at the one implementing SCRAM.


 3. Allow storing multiple verifiers in pg_authid
 

 Replace the pg_authid.rolpassword text field with an array, and rename it
 to 'rolverifiers'. This allows storing multiple password hashes: an MD5
 hash for MD5 authentication, and a SCRAM salt and stored key for SCRAM
 authentication, etc. Each element in the array is a string that begins with
 the method's name. For example md5:<MD5 hash>, or password:<plaintext>.

 For dump/reload, and for clients that wish to create the hashes in the
 client-side, there is a new option to CREATE/ALTER USER commands: PASSWORD
 VERIFIERS '{ ... }', that allows replacing the array.

 The old ENCRYPTED/UNENCRYPTED PASSWORD 'foo' options are still supported
 for backwards-compatibility, but it's not clear what they should mean now.

 TODO:

 * Password-checking hook needs to be redesigned, to allow for more kinds
 of hashes.

 * With CREATE USER PASSWORD 'foo', which hashes/verifiers should be
 generated by default? We currently have a boolean password_encryption
 setting for that. Needs to be a list.


I have been looking more in depths at this one, which adds essential
infrastructure to support multiple authentication hashes for more
protocols. Here are some comments:
- Docs are missing (not a big issue for a WIP)
- Instead of an array that has an identifier embedded, let's add a new
catalog pg_authid_hashes that stores all the hashes for a user (idea by
Heikki):
-- hashrol, role Oid associated with the hash
-- hashmet, hash method
-- hashval, value of the hash
- New password-checking hook (contrib/passwordcheck will need a refresh).
As of now, we have this:
void (*check_password_hook_type)
	(const char *username,
	 const char *password,
	 int password_type,
	 Datum validuntil_time,
	 bool validuntil_null);
We need to switch to something that checks a list of hashes:
void (*check_password_hook_type)
	(const char *username,
	 List *passwords,
	 Datum validuntil_time,
	 bool validuntil_null);
Each list entry is a structure containing the password type and the hash
value. The password type can then be "plain" (or "password", to match
pg_hba.conf) or "md5" for now.
- When password_encryption is switched to a list, "true" means "md5", and
"false" means "plain". At the addition of SCRAM, we should think harder
about the default value; "true" may be worth meaning "md5,scram".
- For CREATE ROLE/ALTER ROLE, it is necessary to be able to define the list
of hashes that need to be generated, with something like this for example:
[ ENCRYPTED [(md5[, scram])] | UNENCRYPTED ] PASSWORD 'password'
When UNENCRYPTED is used, we could simply store the password as plain. When
only ENCRYPTED is used, we store it for all the methods available, except
plain. ENCRYPTED and plain are not an allowed combination.
- Also, do we really want an option at SQL level to allow storing custom
hashes generated on client side as a first step? We could have something
like WITH (md5 = 'blah', scram = 'blah2') appended after PASSWORD for
example.
- rolpassword is removed from pg_authid.

I am willing to write a patch for the next CF following more or less those
lines, depending of course on the outcome of the discussion we can have
here, so feel free to comment.

I'll have a look more in-depth at the scram patch as well.
Regards,
-- 
Michael


Re: [HACKERS] tablecmds.c and lock hierarchy

2015-08-04 Thread Michael Paquier
On Tue, Aug 4, 2015 at 3:35 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Please provide the link to the discussion of this. I don't see a problem
 here right now that can't be solved by saying

Thread:
http://www.postgresql.org/message-id/cafcns+ox7jvenc_3i54fdq3ibmogmknc2tmevdsmvojbsxg...@mail.gmail.com

Particularly those messages:
http://www.postgresql.org/message-id/20150731022857.gc11...@alap3.anarazel.de
http://www.postgresql.org/message-id/20150731200012.gc2...@postgresql.org
http://www.postgresql.org/message-id/CAB7nPqSK-hSZG7T1tAJ_=HEYsi6P1ejgX2x5LW3LYXJ7=9c...@mail.gmail.com

 Assert(locklevel == ShareUpdateExclusiveLock ||
 locklevel >= ShareRowExclusiveLock);

Yep, true as things stand now. But this would get broken if we add a
new lock level between ShareRowExclusiveLock and AccessExclusiveLock
that does not respect the current monotone hierarchy between those.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );

2015-08-04 Thread Andres Freund
On 2015-08-03 14:15:27 +0900, Michael Paquier wrote:
 On Sat, Aug 1, 2015 at 9:20 PM, Andres Freund wrote:
  On August 1, 2015 2:17:24 PM GMT+02:00, Michael Paquier wrote:
  For instance, if you told me to choose between ShareLock and
  ShareUpdateExclusiveLock I wouldn't know which one is strongest.  I
  don't think it's sensible to have the lock mode compare primitive,
  honestly.
  I don't have any great ideas to offer ATM sadly.
 
 Yes, the thing is that lowering the lock levels is good for
 concurrency, but the non-monotonicity of the lock levels makes it
 impossible to choose an intermediate state correctly.
 
  How about simply acquiring all the locks individually if they're different 
  types? These few acquisitions won't matter.
 
 As long as this only applies on master, this may be fine... We could
 basically pass a LOCKMASK to the multiple layers of tablecmds.c
 instead of LOCKMODE to track all the locks that need to be taken, and
 all the relations open during operations.

This sounds far too complicated to me. Just LockRelationOid() the
relation with the appropriate level every time you pass through the
function?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] GROUP BY before JOIN

2015-08-04 Thread David Rowley
== Overview ==

As of today we always perform GROUP BY at the final stage, after each
relation in the query has been joined. This of course works, but it's not
always the most optimal way of executing the query.

Consider the following two relations:

create table product (product_id int primary key, code varchar(32) not
null);
create table sale (sale_id int primary key, product_id int not null, qty
int not null);
create index on sale (product_id);

Populated with:

insert into product select x.x,'ABC' || x.x from generate_series(1,100)
x(x);
insert into sale select x.x,x.x%100+1, 10 from generate_series(1,1000000)
x(x);

Here we have a table with 100 products and another with 1 million sales
records.

If we wanted to see the total sales for each product we may write a query
such as:

select s.product_id,sum(qty) as qty from sale s inner join product p on
s.product_id = p.product_id group by s.product_id;

If we look at the explain for this query we can see that the grouping
happened after the join took place:

                       QUERY PLAN
--------------------------------------------------
 HashAggregate
   Group Key: s.product_id
   ->  Hash Join
         Hash Cond: (s.product_id = p.product_id)
         ->  Seq Scan on sale s
         ->  Hash
               ->  Seq Scan on product p

The join here would produce exactly 1 million rows, then HashAggregate will
group 1 million rows.

If it were possible for PostgreSQL to perform the GROUP BY before the join
we could change that to Grouping 1 million rows, then joining 100 rows.

Of course we could have written the query as:

select s.product_id,qty from (select product_id,sum(qty) as qty from sale
group by product_id) s inner join product p on s.product_id = p.product_id;

Which would do exactly that, and in this particular case it is much faster,
but it's not all rainbows and unicorns, as if the join happened to filter
out many of the groups, then it might not be a win at all.

Consider:

select s.product_id from sale s inner join product p on s.product_id =
p.product_id where p.code = 'ABC1' group by s.product_id;

Since the product.code has no equivalence member in the sale table the
predicate cannot be applied to the sale table, so in this case if we'd
written the query as:

select s.product_id,qty from (select product_id,sum(qty) as qty from sale
group by product_id) s inner join product p on s.product_id = p.product_id
where p.code = 'ABC1';

We'd have thrown away 99% of the groups created in the subquery.

== Proposal ==

I've been working on allowing the planner to properly cost which option is
better and choose to perform the GROUP BY at estimated most efficient join
level.

So far I've not gotten as far as supporting queries which contain aggregate
functions, as I need the ability to combine aggregate states (Simon and I
proposed a patch for that here https://commitfest.postgresql.org/5/131/).

Performance of the above test case is looking quite good so far:

select s.product_id from sale s inner join product p on s.product_id =
p.product_id group by s.product_id;

-- Master = 312.294 ms
-- Patched = 95.328 ms

So roughly a 3.3x speedup in this particular case.

== Implementation ==

I've had to make some choices about how exactly to implement this feature.
As I explain above, we can't just always do the grouping at the first
possible opportunity due to the possibility of joins eliminating groups and
needless groups being created.

To support this in the planner I've invented a GroupingPath type and I've
also added 'cheapest_sorted_group_path' and 'cheapest_group_path' to
RelOptInfo. This part I wasn't too sure about and I wondered if these
GroupingPaths should just be added to the normal list with add_path(), but
since these paths perform more work than normal paths that would give them
an unfair disadvantage and they could be thrown out. These paths are only
possibly cheaper after subsequent JOIN has taken place.

There's also an issue with row estimates being wrong on joinrels, and the
row estimate is using the base rel's row count rather than the number of
groups. As yet I'm unsure the best way to fix this.

I wanted to post this early so as to register the fact that I'm working on
this, but I'm also posting in hope to get some early feedback on what I
have so far.

Of course there's lots more work to do here, aggregates need to be
supported, functional dependencies and transitive closures will also need
to be detected in a more complete implementation.

Here's what I have so far (attached)

Comments are most welcome. Thanks.

Regards

David Rowley

--
 David Rowley                http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


groupbeforejoin_v0.1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types

2015-08-04 Thread Amit Langote
On 2015-08-04 PM 05:58, Geoff Winkless wrote:
 
 Although it seems Amit has defined the problem better than I could, so
 this is a bit late to the party (!), yes, the table had been ALTERed after
 it was created (looking back through the history, that modification
 included at least one DROP COLUMN).
 

It seems using any column that comes after a dropped column causes the
EXCLUDED pseudo-relation to misbehave. For example, I observed another symptom:

test=# CREATE TABLE upsert_fail_test(a int, b int, c int, d smallint);
CREATE TABLE

test=# ALTER TABLE upsert_fail_test DROP b;
ALTER TABLE

test=# ALTER TABLE upsert_fail_test ADD PRIMARY KEY (a, c, d);
ALTER TABLE

test=# INSERT INTO  upsert_fail_test(a, c, d) VALUES (1, 2, 3) ON CONFLICT
(a, c, d) DO UPDATE SET c = EXCLUDED.c;
INSERT 0 1

test=# INSERT INTO  upsert_fail_test(a, c, d) VALUES (1, 2, 3) ON CONFLICT
(a, c, d) DO UPDATE SET c = EXCLUDED.c;
ERROR:  null value in column "c" violates not-null constraint
DETAIL:  Failing row contains (1, null, 3).

That is, the EXCLUDED pseudo-rel failed to deliver the '2' produced by the
subplan and instead produced a null, which I guess was caused by the dropped
column 'b'.

Perhaps it has to do with how the EXCLUDED pseudo-rel's targetlist is
manipulated through the parse/plan stages?

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-08-04 Thread Beena Emerson
Robert Haas wrote:
Maybe shoehorning this into the GUC mechanism is the wrong thing, and
what we really need is a new config file for this.  The information
we're proposing to store seems complex enough to justify that.


I think the consensus is that JSON is better.
And using a new file with multi line support would be good.

Name of the file: how about pg_syncinfo.conf? 


Backward compatibility: synchronous_standby_names will be supported.
synchronous_standby_names='pg_syncinfo' indicates use of the new file.


JSON format:
It would contain 2 main keys: "sync_info" and "groups".
The sync_info would consist of quorum/priority with the count, and
nodes/group with the group name or node list.
The optional groups key would list out all the groups mentioned within
sync_info along with their node lists.


Ex:
1.
{
    "sync_info":
    {
        "quorum": 2,
        "nodes":
        [
            "node1", "node2", "node3"
        ]
    }
}

2.
{
    "sync_info":
    {
        "quorum": 2,
        "nodes":
        [
            {"priority": 1, "group": "cluster1"},
            {"quorum": 2, "group": "cluster2"},
            "node99"
        ]
    },
    "groups":
    {
        "cluster1": ["node11", "node12"],
        "cluster2": ["node21", "node22", "node23"]
    }
}

Thoughts?



-
Beena Emerson

--
View this message in context: 
http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5860791.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort

2015-08-04 Thread Peter Geoghegan
On Tue, Aug 4, 2015 at 1:24 AM, Heikki Linnakangas hlinn...@iki.fi wrote:
 Yeah, something like that. To paraphrase, if I'm now understanding it
 correctly, Peter's idea is:

 When all the tuples have been fed to tuplesort, and it's time to perform the
 sort, quicksort all the tuples currently in the heap, ignoring the run
 numbers, and turn the resulting array into another tape. That tape is
 special: it's actually stored completely in memory. It is merged with the
 real tapes when tuples are returned from the tuplesort, just like regular
 tapes in TSS_FINALMERGE.

Yeah. I imagine that we'll want to put in memory prefetch hints for the
new case, since I've independently shown that that works well for the
in-memory case, which this can be very close to.

My next patch will also include quicksorting of runs after we give up
on heapification (after there is more than one run and it is
established that we cannot use my "quicksort with spillover"
optimization, so there are two or more real runs on tape). Once
there is clearly not going to be one huge run (which can happen due to
everything largely being in order, even when work_mem is small), and
once incremental spilling does not end in time to do a quicksort
with spillover, then the replacement selection thing isn't too
valuable. Especially with large memory sizes but memory bandwidth +
latency as a bottleneck, which is the norm these days.

This seems simpler than my earlier idea of reusing half the memtuples
array only, and resorting the entire array each time, to have
something that consistently approximates replacement selection but
with quicksorting + batching, which I discussed before.

I have this working, and it takes a good chunk of the runtime
off a sort that merges 3 runs in one reasonable case tested, where
work_mem was 300MB. It went from about 56.6 seconds with master to
35.8 seconds with this new approach when tested just now (this
approach saves no writing of tuples, so it's not as effective as the
original "quicksort with spillover" patch can be, but covers a
fundamentally different case). I just need to clean up the patch, and
see if I missed any further optimizations, but this feels like the way
forward multi-run wise. I think it's worth giving up on replacement
selection style runs after the first run is produced, because that's
where the benefits are, if anywhere.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Autonomous Transaction is back

2015-08-04 Thread Rajeev rastogi
On 03 August 2015 18:40, Merlin Moncure [mailto:mmonc...@gmail.com] Wrote:
On Sun, Aug 2, 2015 at 11:37 PM, Rajeev rastogi
rajeev.rast...@huawei.com wrote:
 On 31 July 2015 23:10, Robert Haas Wrote:
I think we're going entirely down the wrong path here.  Why is it ever
useful for a backend's lock requests to conflict with themselves, even
with autonomous transactions?  That seems like an artifact of somebody
else's implementation that we should be happy we don't need to copy.

 IMHO, since most of the locking is managed at the transaction level, not
the backend level, and we consider the main & autonomous transactions to be
independent transactions, then practically they may conflict, right?
 It is also right, as you said, that there are no really useful use-cases
where an autonomous transaction conflicts with the main (parent) transaction.
But we cannot take it for granted, as a user might make a mistake. So at
least we should have some mechanism to handle this rare case, for which
as of now I think throwing an error from the autonomous transaction is one
solution. Once the error is thrown from the autonomous transaction, the main
transaction may continue as it is (or should the main transaction abort too?).

hm.  OK, what's the behavior of:

BEGIN
  UPDATE foo SET x = x + 1 WHERE foo_id = 1;

  BEGIN WITH AUTONOMOUS TRANSACTION
UPDATE foo SET x = x + 1 WHERE foo_id = 1;
  END;

  RAISE EXCEPTION ...;
EXCEPTION ...

END;

It should throw an error (or something equivalent), as the second update will 
wait for the record lock to be released, which will not happen until the main 
transaction completes, while the main transaction in turn cannot proceed until 
the autonomous transaction finishes. So, catch-22.

Also,
*) What do the other candidate implementations do?  IMO, compatibility
should be the underlying design principle.


Oracle throws an error in such cases. But we can decide what behavior we 
want to keep.

*) What will the SQL only feature look like?

Similar to the PL form mentioned in your example, we can provide the SQL-only 
feature also.

*) Is the SPI interface going to be extended to expose AT?

I don't think at this point that there is any need to expose the SPI 
interface for this.

Thanks and Regards,
Kumar Rajeev Rastogi

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GROUP BY before JOIN

2015-08-04 Thread Ashutosh Bapat
This looks like one example of the general problem of finding an optimal order
for SQL operations. Consider a query defined as sql_op1(sql_op2(sql_op3(A, B),
sql_op4(C, D), sql_op5(E, F))), where sql_op can be an SQL operation like
join, grouping, aggregation, projection, union, intersection, limit,
ordering etc. and A, B, C, D are tables. Given such a representation, how
do we find an optimal order of operation?

In your example the query which is defined like GROUPBY(JOIN(A, B)) might
have an optimal order JOIN(GROUPBY(A), GROUPBY(B)) if we can prove GROUPBY
is distributive over JOIN. Similarly GROUPBY(UNION(A, B)) might have an
optimal order UNION(GROUPBY(A), GROUPBY(B)) if A and B produce disjoint
groups i.e. UNION is distributive over GROUPBY.

Right now, we use dynamic programming to find the optimal order for join
tree (make_one_rel). In the language above, we find the optimal order for
evaluating an expression consisting of JOIN operators by exploiting their
commutative and associative properties. If we could extend that to SQL
expression tree representing the query, we will be able to solve many of
such re-ordering problems using current path infrastructure. The dynamic
program for this would look something like below

root = top operator of the query
child = the operands of the root

cost of query = minimum of
1. cost of query without any mutation
2. cost of root and its child operator commuted, if there is only one child
for root operator and root operator and child operator are commutable
3. cost of root distributed over children operands if there are multiple
operands and root operator is distributive over child operators
4. ...
5. ...

Each operator root will have a RelOptInfo to which we add many paths
possible for performing that operation using different orders of
evaluation. Current add_path should eliminate the expensive ones.

This is going to explode the number of path (plan) trees considered by
planner, so we need to be judicious when to apply this technique or have a
rigorous pruning technique to discard costly paths early enough. This
approach might be better than trying to solve each re-ordering problem
separately and might go well with upper planner pathification being
discussed at
http://www.postgresql.org/message-id/ca+tgmoaqgzhux7frzddpghkqt3rcnx-f0df+zqzsefw-arx...@mail.gmail.com
.

On Tue, Aug 4, 2015 at 1:57 PM, David Rowley david.row...@2ndquadrant.com
wrote:

 == Overview ==

 As of today we always perform GROUP BY at the final stage, after each
 relation in the query has been joined. This of course works, but it's not
 always the most optimal way of executing the query.

 Consider the following two relations:

 create table product (product_id int primary key, code varchar(32) not
 null);
 create table sale (sale_id int primary key, product_id int not null, qty
 int not null);
 create index on sale (product_id);

 Populated with:

 insert into product select x.x,'ABC' || x.x from generate_series(1,100)
 x(x);
 insert into sale select x.x,x.x%100+1, 10 from generate_series(1,1000000)
 x(x);

 Here we have a table with 100 products and another with 1 million sales
 records.

 If we wanted to see the total sales for each product we may write a query
 such as:

 select s.product_id,sum(qty) as qty from sale s inner join product p on
 s.product_id = p.product_id group by s.product_id;

 If we look at the explain for this query we can see that the grouping
 happened after the join took place:

                        QUERY PLAN
 --------------------------------------------------
  HashAggregate
    Group Key: s.product_id
    ->  Hash Join
          Hash Cond: (s.product_id = p.product_id)
          ->  Seq Scan on sale s
          ->  Hash
                ->  Seq Scan on product p

 The join here would produce exactly 1 million rows, then HashAggregate
 will group 1 million rows.

 If it were possible for PostgreSQL to perform the GROUP BY before the join
 we could change that to Grouping 1 million rows, then joining 100 rows.

 Of course we could have written the query as:

 select s.product_id,qty from (select product_id,sum(qty) as qty from sale
 group by product_id) s inner join product p on s.product_id = p.product_id;

 Which would do exactly that, and in this particular case it is much
 faster, but it's not all rainbows and unicorns, as if the join happened to
 filter out many of the groups, then it might not be a win at all.

 Consider:

 select s.product_id from sale s inner join product p on s.product_id =
 p.product_id where p.code = 'ABC1' group by s.product_id;

 Since the product.code has no equivalence member in the sale table the
 predicate cannot be applied to the sale table, so in this case if we'd
 written the query as:

 select s.product_id,qty from (select product_id,sum(qty) as qty from sale
 group by product_id) s inner join product p on s.product_id = p.product_id
 where p.code = 'ABC1';

 We'd have thrown away 99% of the groups created in the subquery.

 == Proposal ==

Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Simon Riggs
On 4 August 2015 at 09:39, Simon Riggs si...@2ndquadrant.com wrote:

 On 4 August 2015 at 06:03, Jeff Janes jeff.ja...@gmail.com wrote:


 The attached proof of concept patch greatly improves the bloat for both
 the insert and the update cases.  You need to turn on both features: adding
 the pages to fsm, and vacuuming the fsm, to get the benefit (so JJ_GIN=3).
 The first of those two things could probably be adopted for real, but the
 second probably is not acceptable.  What is the right way to do this?
 Could a variant of RecordFreeIndexPage bubble the free space up the map
 immediately rather than waiting for a vacuum?  It would only have to move
 up until it found a page with freespace already recorded in it, which the
 vast majority of the time would mean observing up one level and then not
 writing to it, assuming the pending list pages remain well clustered.
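
For illustration, here is a toy sketch of the "bubble up" idea (standalone and
purely illustrative; none of these names are PostgreSQL APIs -- the real FSM is
a tree of pages under src/backend/storage/freespace/):

#include <stdint.h>

#define FSM_NODES 1024
static uint8_t fsm[FSM_NODES];	/* heap-ordered max-tree: children of i are 2i+1, 2i+2 */

static void
fsm_record_and_bubble(int leaf, uint8_t avail)
{
	fsm[leaf] = avail;
	while (leaf > 0)
	{
		int		parent = (leaf - 1) / 2;

		if (fsm[parent] >= avail)
			break;				/* an ancestor already advertises enough space: stop */
		fsm[parent] = avail;	/* otherwise propagate one level and continue */
		leaf = parent;
	}
}

As in the paragraph above, the common case touches at most one extra level
before finding a parent that already records the free space.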


 You make a good case for action here since insert only tables with GIN
 indexes on text are a common use case for GIN.

 Why would vacuuming the FSM be unacceptable? With a
 large gin_pending_list_limit it makes sense.

 If it is unacceptable, perhaps we can avoid calling it every time, or
 simply have FreeSpaceMapVacuum() terminate more quickly on some kind of
 80/20 heuristic for this case.


Couple of questions here...

* the docs say it's desirable to have pending-list cleanup occur in the
background, but there is no way to invoke that, except via VACUUM. I think
we need a separate function to be able to call this as a background action.
If we had that, we wouldn't need much else, would we?

* why do we have two parameters: gin_pending_list_limit and fastupdate?
What happens if we set gin_pending_list_limit but don't set fastupdate?

* how do we know how to set that parameter? Is there a way of knowing
gin_pending_list_limit has been reached?

This and the OP seem like 9.5 open items to me.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] patch: prevent user from setting wal_buffers over 2GB bytes

2015-08-04 Thread Tom Lane
Takashi Horikawa t-horik...@aj.jp.nec.com writes:
 Why does this cause a core dump?  We could consider fixing whatever
 the problem is rather than capping the value.

 As far as I can tell from experiments in my own evaluation environment, using 
 PostgreSQL 9.4.4 on x86_64 Linux, this problem can be fixed with the patch 
 attached.

I'm unsure whether this represents a complete fix ... but even if it does,
it would be awfully easy to re-introduce similar bugs in future code
changes, and who would notice?  Josh's approach of restricting the buffer
size seems a lot more robust.

If there were any practical use-case for such large WAL buffers then it
might be worth spending some effort/risk here.  But AFAICS, there is not.
Indeed, capping wal_buffers might be argued to be a good thing in itself
because it would prevent users from wasting shared memory foolishly.
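
To illustrate the capping approach being voted for (a sketch under assumed
names, not Josh's actual patch): wal_buffers is counted in XLOG_BLCKSZ-sized
pages, so bounding the page count keeps the total allocation under 2GB and
avoids overflowing 32-bit offset arithmetic.

#define WAL_BUFFERS_MAX_PAGES	((int) (0x7fffffff / XLOG_BLCKSZ))

static int
clamp_wal_buffers(int requested_pages)
{
	/* hypothetical clamp; a GUC maximum would achieve the same effect */
	if (requested_pages > WAL_BUFFERS_MAX_PAGES)
		return WAL_BUFFERS_MAX_PAGES;
	return requested_pages;
}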

So my vote is for the original approach.  (I've not read Josh's patch,
so there might be something wrong with it in detail, but I like the
basic approach.)

regards, tom lane




Re: [HACKERS] patch: prevent user from setting wal_buffers over 2GB bytes

2015-08-04 Thread Andres Freund
On 2015-08-04 09:49:58 -0400, Tom Lane wrote:
 Takashi Horikawa t-horik...@aj.jp.nec.com writes:
  Why does this cause a core dump?  We could consider fixing whatever
  the problem is rather than capping the value.
 
  As far as I can tell from experiments in my own evaluation environment, 
  using PostgreSQL 9.4.4 on x86_64 Linux, this problem can be fixed with the 
  patch 
  attached.
 
 I'm unsure whether this represents a complete fix ... but even if it does,
 it would be awfully easy to re-introduce similar bugs in future code
 changes, and who would notice?  Josh's approach of restricting the buffer
 size seems a lot more robust.
 
 If there were any practical use-case for such large WAL buffers then it
 might be worth spending some effort/risk here.  But AFAICS, there is not.
 Indeed, capping wal_buffers might be argued to be a good thing in itself
 because it would prevent users from wasting shared memory foolishly.
 
 So my vote is for the original approach.  (I've not read Josh's patch,
 so there might be something wrong with it in detail, but I like the
 basic approach.)

+1




Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );

2015-08-04 Thread Fabrízio de Royes Mello
On Tue, Aug 4, 2015 at 5:55 AM, Andres Freund and...@anarazel.de wrote:

 On 2015-08-03 14:15:27 +0900, Michael Paquier wrote:
  On Sat, Aug 1, 2015 at 9:20 PM, Andres Freund wrote:
   On August 1, 2015 2:17:24 PM GMT+02:00, Michael Paquier wrote:
   For instance, if you told me to choose between ShareLock and
   ShareUpdateExclusiveLock I wouldn't know which one is strongest.  I
   don't think it's sensible to have the lock mode compare primitive,
  honestly.
   I don't have any great ideas to offer ATM sadly.
  
  Yes, the thing is that lowering the lock levels is good for
  concurrency, but the non-monotony of the lock levels makes it
  impossible to choose an intermediate state correctly.
  
   How about simply acquiring all the locks individually if they're
different types? These few acquisitions won't matter.
 
  As long as this only applies on master, this may be fine... We could
  basically pass a LOCKMASK to the multiple layers of tablecmds.c
  instead of LOCKMODE to track all the locks that need to be taken, and
  all the relations open during operations.

 This sounds far too complicated to me. Just LockRelationOid() the
 relation with the appropriate level every time you pass through the
 function?

Hi all,

IMHO it is simpler if we just fall back to AccessExclusiveLock when there are
different lockmodes in the reloptions, as Michael suggested before.
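
For illustration, the fallback rule could look something like this sketch
(reloption_lockmode() is a hypothetical lookup, not part of the patch):

static LOCKMODE
GetReloptionsLockLevel(List *defList)
{
	LOCKMODE	lockmode = NoLock;
	ListCell   *cell;

	foreach(cell, defList)
	{
		DefElem    *def = (DefElem *) lfirst(cell);
		LOCKMODE	optmode = reloption_lockmode(def->defname);

		if (lockmode == NoLock)
			lockmode = optmode;
		else if (lockmode != optmode)
			return AccessExclusiveLock;		/* mixed levels: fall back */
	}
	return lockmode;
}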

Look at the new version attached.

Regards,


*** This work is funded by Zenvia Mobile Results (http://www.zenvia.com.br)

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
 Timbira: http://www.timbira.com.br
 Blog: http://fabriziomello.github.io
 Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello
 Github: http://github.com/fabriziomello
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 1c1c181..ad985cd 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -543,6 +543,10 @@ ALTER TABLE ALL IN TABLESPACE <replaceable class="PARAMETER">name</replaceable>
   of <command>ALTER TABLE</> that forces a table rewrite.
  </para>
 
+ <para>
+  Changing autovacuum storage parameters acquires a <literal>SHARE UPDATE EXCLUSIVE</literal> lock.
+ </para>
+
  <note>
   <para>
    While <command>CREATE TABLE</> allows <literal>OIDS</> to be specified
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 180f529..e0f9f09 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -57,7 +57,8 @@ static relopt_bool boolRelOpts[] =
 		{
 			"autovacuum_enabled",
 			"Enables autovacuum in this relation",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
 		},
 		true
 	},
@@ -65,7 +66,8 @@ static relopt_bool boolRelOpts[] =
 		{
 			"user_catalog_table",
 			"Declare a table as an additional catalog table, e.g. for the purpose of logical replication",
-			RELOPT_KIND_HEAP
+			RELOPT_KIND_HEAP,
+			AccessExclusiveLock
 		},
 		false
 	},
@@ -73,7 +75,8 @@ static relopt_bool boolRelOpts[] =
 		{
 			"fastupdate",
 			"Enables \"fast update\" feature for this GIN index",
-			RELOPT_KIND_GIN
+			RELOPT_KIND_GIN,
+			AccessExclusiveLock
 		},
 		true
 	},
@@ -81,7 +84,8 @@ static relopt_bool boolRelOpts[] =
 		{
 			"security_barrier",
 			"View acts as a row security barrier",
-			RELOPT_KIND_VIEW
+			RELOPT_KIND_VIEW,
+			AccessExclusiveLock
 		},
 		false
 	},
@@ -95,7 +99,8 @@ static relopt_int intRelOpts[] =
 		{
 			"fillfactor",
 			"Packs table pages only to this percentage",
-			RELOPT_KIND_HEAP
+			RELOPT_KIND_HEAP,
+			AccessExclusiveLock
 		},
 		HEAP_DEFAULT_FILLFACTOR, HEAP_MIN_FILLFACTOR, 100
 	},
@@ -103,7 +108,8 @@ static relopt_int intRelOpts[] =
 		{
 			"fillfactor",
 			"Packs btree index pages only to this percentage",
-			RELOPT_KIND_BTREE
+			RELOPT_KIND_BTREE,
+			AccessExclusiveLock
 		},
 		BTREE_DEFAULT_FILLFACTOR, BTREE_MIN_FILLFACTOR, 100
 	},
@@ -111,7 +117,8 @@ static relopt_int intRelOpts[] =
 		{
 			"fillfactor",
 			"Packs hash index pages only to this percentage",
-			RELOPT_KIND_HASH
+			RELOPT_KIND_HASH,
+			AccessExclusiveLock
 		},
 		HASH_DEFAULT_FILLFACTOR, HASH_MIN_FILLFACTOR, 100
 	},
@@ -119,7 +126,8 @@ static relopt_int intRelOpts[] =
 		{
 			"fillfactor",
 			"Packs gist index pages only to this percentage",
-			RELOPT_KIND_GIST
+			RELOPT_KIND_GIST,
+			AccessExclusiveLock
 		},
 		GIST_DEFAULT_FILLFACTOR, GIST_MIN_FILLFACTOR, 100
 	},
@@ -127,7 +135,8 @@ static relopt_int intRelOpts[] =
 		{
 			"fillfactor",
 			"Packs spgist index pages only to this percentage",
-			RELOPT_KIND_SPGIST
+			RELOPT_KIND_SPGIST,
+			AccessExclusiveLock
 		},
 		SPGIST_DEFAULT_FILLFACTOR, SPGIST_MIN_FILLFACTOR, 100
 	},
@@ -135,7 +144,8 @@ static relopt_int intRelOpts[] =
 		{
 			"autovacuum_vacuum_threshold",
 			"Minimum number of tuple updates or deletes prior to vacuum",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST
+			

Re: [HACKERS] Sharing aggregate states between different aggregate functions

2015-08-04 Thread Heikki Linnakangas

On 08/03/2015 08:53 AM, David Rowley wrote:

Attached is a delta patched which is based
on sharing_aggstate-heikki-2.patch to fix up the out-dated comments and
also a few more test scenarios which test the sharing works with matching
INITCOND and that it does not when they don't match.

What do you think?


I committed this, after some more cleanup of the comments. Thanks!

- Heikki





Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort

2015-08-04 Thread Heikki Linnakangas

On 08/03/2015 11:36 PM, Robert Haas wrote:

On Mon, Aug 3, 2015 at 3:33 PM, Peter Geoghegan p...@heroku.com wrote:

When it's time to drain the heap, in performsort, divide the array into two
arrays, based on the run number of each tuple, and then quicksort the arrays
separately. The first array becomes the in-memory tail of the current tape,
and the second array becomes the in-memory tail of the next tape.

You wouldn't want to actually allocate two arrays and copy SortTuples
around, but keep using the single large array, just logically divided into
two. So the bookkeeping isn't trivial, but seems doable.


You're talking about a new thing here, that happens when it is
necessary to dump everything and do a conventional merging of on-tape
runs. IOW, we cannot fit a significant fraction of overall tuples in
memory, and we need much of the memtuples array for the next run
(usually this ends as a TSS_FINALMERGE). That being the case
(...right?),


I don't think that's what Heikki is talking about.  He can correct me
if I'm wrong, but what I think he's saying is that we should try to
exploit the fact that we've already determined which in-memory tuples
can be part of the current run and which in-memory tuples must become
part of the next run.  Suppose half the tuples in memory can become
part of the current run and the other half must become part of the
next run.  If we throw all of those tuples into a single bucket and
quicksort it, we're losing the benefit of the comparisons we did to
figure out which tuples go in which runs.  Instead, we could quicksort
the current-run tuples and, separately, quick-sort the next-run
tuples.  Ignore the current-run tuples completely until the tape is
empty, and then merge them with any next-run tuples that remain.


Yeah, something like that. To paraphrase, if I'm now understanding it 
correctly, Peter's idea is:


When all the tuples have been fed to tuplesort, and it's time to perform 
the sort, quicksort all the tuples currently in the heap, ignoring the 
run numbers, and turn the resulting array into another tape. That tape 
is special: it's actually stored completely in memory. It is merged with 
the real tapes when tuples are returned from the tuplesort, just like 
regular tapes in TSS_FINALMERGE.


And my idea is:

When all the tuples have been fed to tuplesort, and it's time to perform 
the sort, take all the tuples in the heap belonging to currentRun, 
quicksort them, and make them part of the current tape. They're not just 
pushed to the tape as usual, however, but attached as in-memory tail of 
the current tape. The logical tape abstraction will return them after 
all the tuples already in the tape, as if they were pushed to the tape 
as usual. Then take all the remaining tuples in the heap (if any), 
belonging to next tape, and do the same for them. They become an 
in-memory tail of the next tape.
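
To make that concrete, a rough sketch (not from any patch; purely
illustrative) of logically dividing the memtuples array by run number before
quicksorting each part; SortTuple.tupindex is assumed to carry the run number:

static int
split_runs(SortTuple *memtuples, int memtupcount, int currentRun)
{
	int			lo = 0;
	int			hi = memtupcount - 1;

	/* in-place partition: currentRun tuples to the front, next run after */
	while (lo <= hi)
	{
		if (memtuples[lo].tupindex == currentRun)
			lo++;
		else
		{
			SortTuple	swap = memtuples[lo];

			memtuples[lo] = memtuples[hi];
			memtuples[hi--] = swap;
		}
	}
	return lo;			/* quicksort [0, lo) and [lo, memtupcount) separately */
}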



I'm not sure if there's any reason to believe that would be faster
than your approach.  In general, sorting is O(n lg n) so sorting two
arrays that are each half as large figures to be slightly faster than
sorting one big array.  But the difference may not amount to much.


Yeah, I don't think there's a big performance difference between the two 
approaches. I'm not wedded to either approach. Whichever approach we 
use, my main point was that it would be better to handle this in the 
logical tape abstraction. In my approach, you would have those 
in-memory tails attached to the last two tapes. In Peter's approach, 
you would have one tape that's completely in memory, backed by the 
array. In either case, the tapes would look like normal tapes to most of 
tuplesort.c. There would be no extra TSS state, it would be 
TSS_SORTEDONTAPE or TSS_FINALMERGE as usual.


The logical tape abstraction is currently too low-level for that. It's 
just a tape of bytes, and tuplesort.c determines where a tuple begins 
and ends. That needs to be changed so that the logical tape abstraction 
works tuple-at-a-time instead. For example, instead of 
LogicalTapeRead(N) which reads N bytes, you would have 
LogicalTapeReadTuple(), which reads next tuple, and returns its length 
and the tuple itself. But that would be quite sensible anyway.
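
For illustration, the existing byte-oriented call next to one possible shape
of the proposed tuple-oriented one (the second signature is invented here,
not from any patch):

/* today: logtape.c only moves raw bytes */
extern size_t LogicalTapeRead(LogicalTapeSet *lts, int tapenum,
							  void *ptr, size_t size);

/* proposed: the tape layer knows tuple boundaries, so an in-memory tail
 * could transparently satisfy reads once the on-disk part is exhausted */
extern bool LogicalTapeReadTuple(LogicalTapeSet *lts, int tapenum,
								 void **tuple, size_t *len);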


- Heikki





Re: [HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types

2015-08-04 Thread Geoff Winkless
On 4 August 2015 at 09:30, Amit Langote langote_amit...@lab.ntt.co.jp
wrote:

 On 2015-08-04 AM 02:57, Peter Geoghegan wrote:
  On Mon, Aug 3, 2015 at 8:53 AM, Geoff Winkless pgsqlad...@geoff.dj
 wrote:
  If I create a copy of the table using
 
  CREATE TABLE mytab (LIKE brokentab INCLUDING ALL);
  INSERT INTO mytab SELECT * FROM brokentab;
 
  Also, did you drop any columns from the original brokentab table
  where the bug can be reproduced?
 

  This seems to be the case. I could reproduce the reported problem:


Although it seems Amit has defined the problem better than I could (so this
is a bit late to the party!): yes, the table had been ALTERed after
it was created (looking back through the history, that modification
included at least one DROP COLUMN).

Thanks

Geoff


Re: [HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types

2015-08-04 Thread Amit Langote
On 2015-08-04 AM 02:57, Peter Geoghegan wrote:
 On Mon, Aug 3, 2015 at 8:53 AM, Geoff Winkless pgsqlad...@geoff.dj wrote:
 If I create a copy of the table using

 CREATE TABLE mytab (LIKE brokentab INCLUDING ALL);
 INSERT INTO mytab SELECT * FROM brokentab;
 
 Also, did you drop any columns from the original brokentab table
 where the bug can be reproduced?
 

This seems to be the case. I could reproduce the reported problem:

test=# CREATE TABLE upsert_fail_test(a int, b int, c smallint);
CREATE TABLE

test=# ALTER TABLE upsert_fail_test DROP c;
ALTER TABLE

test=# ALTER TABLE upsert_fail_test ADD c smallint;
ALTER TABLE

test=# ALTER TABLE upsert_fail_test ADD PRIMARY KEY (a, b, c);
ALTER TABLE

test=# INSERT INTO  upsert_fail_test(a, b, c) VALUES (1, 2, 0) ON CONFLICT
(a, b, c) DO UPDATE SET c = EXCLUDED.c;
INSERT 0 1

test=# INSERT INTO  upsert_fail_test(a, b, c) VALUES (1, 2, 0) ON CONFLICT
(a, b, c) DO UPDATE SET b = EXCLUDED.b;
INSERT 0 1

test=# INSERT INTO  upsert_fail_test(a, b, c) VALUES (1, 2, 0) ON CONFLICT
(a, b, c) DO UPDATE SET c = EXCLUDED.c;
ERROR:  attribute 3 has wrong type
DETAIL:  Table has type integer, but query expects smallint.

FWIW, I tried to look into why that happens. It seems that during
set_plan_refs(), fix_join_expr on the splan->onConflictSet targetlist, using
the EXCLUDED pseudo-rel's targetlist as the inner targetlist, causes column
c's varattno to be changed to 3, whereas in the actual tuple descriptor it's
4 (dropped and re-added). Eventually, ExecEvalScalarVar() complains when it
finds attno 3 is a dropped attribute.

Thanks,
Amit





Re: [HACKERS] GROUP BY before JOIN

2015-08-04 Thread David Rowley
On 4 August 2015 at 21:56, Ashutosh Bapat ashutosh.ba...@enterprisedb.com
wrote:

 This looks like one example of the general problem of finding an optimal
 order for SQL operations. Consider a query defined as
 sql_op1(sql_op2(sql_op3(A, B), sql_op4(C, D), sql_op5(E, F))), where each
 sql_op can be an SQL operation like join, grouping, aggregation, projection,
 union, intersection, limit, ordering etc. and A, B, C, D, E, F are tables.
 Given such a representation, how do we find an optimal order of operation?

 In your example the query, which is defined like GROUPBY(JOIN(A, B)), might
 have an optimal order JOIN(GROUPBY(A), GROUPBY(B)) if we can prove GROUPBY
 is distributive over JOIN. Similarly GROUPBY(UNION(A, B)) might have an
 optimal order UNION(GROUPBY(A), GROUPBY(B)) if A and B produce disjoint
 groups i.e. UNION is distributive over GROUPBY.


I'm not really sure that I understand why it's GROUPBY(A) and GROUPBY(B).
If the group by contained columns from relations A and B, then it wouldn't
be possible to group each one individually. We would have to wait until the
join was performed in that case.

Right now, we use dynamic programming to find the optimal order for join
 tree (make_one_rel). In the language above, we find the optimal order for
 evaluating an expression consisting of JOIN operators by exploiting their
 commutative and associative properties. If we could extend that to SQL
 expression tree representing the query, we will be able to solve many of
 such re-ordering problems using current path infrastructure. The dynamic
 program for this would look something like below

 root = top operator of the query
 child = the operands of the root

 cost of query = minimum of
 1. cost of query without any mutation
 2. cost of root and its child operator commuted, if there is only one
 child for root operator and root operator and child operator are commutable
 3. cost of root distributed over children operands if there are multiple
 operands and root operator is distributive over child operators
 4. ...
 5. ...

 Each operator root will have a RelOptInfo to which we add many paths
 possible for performing that operation using different orders of
 evaluation. Current add_path should eliminate the expensive ones.


So in a query such as:

 select s.product_id from sale s inner join product p on s.product_id =
p.product_id group by s.product_id;

Do you mean that there would be 2 RelOptInfos for the sale relation, one
without the GROUP BY, and one with? That would likely solve the issue I
have with join row estimates, but which of the 2 RelOptInfos would all of
the Vars in the parse tree be pointing to?


 This is going to explode the number of path (plan) trees considered by
 planner, so we need to be judicious when to apply this technique or have a
 rigorous pruning technique to discard costly paths early enough. This
 approach might be better than trying to solve each re-ordering problem
 separately and might go well with upper planner pathification being
 discussed at
 http://www.postgresql.org/message-id/ca+tgmoaqgzhux7frzddpghkqt3rcnx-f0df+zqzsefw-arx...@mail.gmail.com
 .


It's certainly going to increase that, but I'm not sure if 'explode' is the
best word as we at least require all of the rels seen in the GROUP BY
expressions and aggregate function parameters to be joined before we can
consider a GroupingPath. The patch
uses bms_is_subset(root->minimumGroupingRels, rel->relids) to determine
that.
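
A sketch of that guard (minimumGroupingRels is the patch's field; the
surrounding helper is invented here for illustration):

static bool
rel_can_group(PlannerInfo *root, RelOptInfo *rel)
{
	/* every rel referenced by GROUP BY exprs / aggregate args is joined in */
	return bms_is_subset(root->minimumGroupingRels, rel->relids);
}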


 On Tue, Aug 4, 2015 at 1:57 PM, David Rowley david.row...@2ndquadrant.com
  wrote:

 == Overview ==

 As of today we always perform GROUP BY at the final stage, after each
 relation in the query has been joined. This of course works, but it's not
 always the most optimal way of executing the query.

 Consider the following two relations:

 create table product (product_id int primary key, code varchar(32) not
 null);
 create table sale (sale_id int primary key, product_id int not null, qty
 int not null);
 create index on sale (product_id);

 Populated with:

 insert into product select x.x,'ABC' || x.x from generate_series(1,100)
 x(x);
  insert into sale select x.x,x.x%100+1, 10 from generate_series(1,1000000)
 x(x);

 Here we have a table with 100 products and another with 1 million sales
 records.

 If we wanted to see the total sales for each product we may write a query
 such as:

 select s.product_id,sum(qty) as qty from sale s inner join product p on
 s.product_id = p.product_id group by s.product_id;

 If we look at the explain for this query we can see that the grouping
 happened after the join took place:

 QUERY PLAN
  -----------------------------------------------------
   HashAggregate
     Group Key: s.product_id
     ->  Hash Join
           Hash Cond: (s.product_id = p.product_id)
           ->  Seq Scan on sale s
           ->  Hash
                 ->  Seq Scan on product p

  The join here would produce exactly 1 million rows, then hashaggregate
 will group 1 million rows.

 If it were possible 

Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-08-04 Thread Robert Haas
On Tue, Jul 28, 2015 at 3:28 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
 * The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync
 with the list of individual locks in lwlock.h. Sooner or later someone will
 add an LWLock and forget to update the names-array. That needs to be made
 less error-prone, so that the names are maintained in the same place as the
 #defines. Perhaps something like rmgrlist.h.

This is a good idea, but it's not easy to do in the style of
rmgrlist.h, because I don't believe there's any way to define a macro
that expands to a preprocessor directive.  Attached is a patch that
instead generates the list of macros from a text file, and also
generates an array inside lwlock.c with the lock names that gets used
by the Trace_lwlocks stuff where applicable.

Any objections to this solution to the problem?  If not, I'd like to
go ahead and push this much.  I can't test the Windows changes
locally, though, so it would be helpful if someone could check that
out.

 * Instead of having LWLOCK_INDIVIDUAL_NAMES to name individual locks, how
 about just giving each one of them a separate tranche?

I don't think it's good to split things up to that degree;
standardizing on one name per fixed lwlock and one per tranche
otherwise seems like a good compromise to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 98b978f..d16ab10 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -106,7 +106,7 @@ endif
 endif # aix
 
 # Update the commonly used headers before building the subdirectories
-$(SUBDIRS:%=%-recursive): $(top_builddir)/src/include/parser/gram.h $(top_builddir)/src/include/catalog/schemapg.h $(top_builddir)/src/include/utils/fmgroids.h $(top_builddir)/src/include/utils/errcodes.h $(top_builddir)/src/include/utils/probes.h
+$(SUBDIRS:%=%-recursive): $(top_builddir)/src/include/parser/gram.h $(top_builddir)/src/include/catalog/schemapg.h $(top_builddir)/src/include/storage/lwlocknames.h $(top_builddir)/src/include/utils/fmgroids.h $(top_builddir)/src/include/utils/errcodes.h $(top_builddir)/src/include/utils/probes.h
 
 # run this unconditionally to avoid needing to know its dependencies here:
 submake-schemapg:
@@ -135,6 +135,9 @@ postgres.o: $(OBJS)
 parser/gram.h: parser/gram.y
 	$(MAKE) -C parser gram.h
 
+storage/lmgr/lwlocknames.h: storage/lmgr/generate-lwlocknames.pl storage/lmgr/lwlocknames.txt
+	$(MAKE) -C storage/lmgr lwlocknames.h
+
 utils/fmgroids.h: utils/Gen_fmgrtab.pl catalog/Catalog.pm $(top_srcdir)/src/include/catalog/pg_proc.h
 	$(MAKE) -C utils fmgroids.h
 
@@ -165,6 +168,11 @@ $(top_builddir)/src/include/catalog/schemapg.h: catalog/schemapg.h
 	  cd '$(dir $@)' && rm -f $(notdir $@) && \
 	  $(LN_S) $$prereqdir/$(notdir $<) .
 
+$(top_builddir)/src/include/storage/lwlocknames.h: storage/lmgr/lwlocknames.h
+	prereqdir=`cd '$(dir $<)' >/dev/null && pwd` && \
+	  cd '$(dir $@)' && rm -f $(notdir $@) && \
+	  $(LN_S) $$prereqdir/$(notdir $<) .
+
 $(top_builddir)/src/include/utils/errcodes.h: utils/errcodes.h
 	prereqdir=`cd '$(dir $<)' >/dev/null && pwd` && \
 	  cd '$(dir $@)' && rm -f $(notdir $@) && \
@@ -192,6 +200,7 @@ distprep:
 	$(MAKE) -C bootstrap	bootparse.c bootscanner.c
 	$(MAKE) -C catalog	schemapg.h postgres.bki postgres.description postgres.shdescription
 	$(MAKE) -C replication	repl_gram.c repl_scanner.c
+	$(MAKE) -C storage	lwlocknames.h
 	$(MAKE) -C utils	fmgrtab.c fmgroids.h errcodes.h
 	$(MAKE) -C utils/misc	guc-file.c
 	$(MAKE) -C utils/sort	qsort_tuple.c
@@ -307,6 +316,7 @@ maintainer-clean: distclean
 	  catalog/postgres.shdescription \
 	  replication/repl_gram.c \
 	  replication/repl_scanner.c \
+	  storage/lmgr/lwlocknames.h \
 	  utils/fmgroids.h \
 	  utils/fmgrtab.c \
 	  utils/errcodes.h \
diff --git a/src/backend/storage/lmgr/.gitignore b/src/backend/storage/lmgr/.gitignore
new file mode 100644
index 000..9355cae
--- /dev/null
+++ b/src/backend/storage/lmgr/.gitignore
@@ -0,0 +1,2 @@
+/lwlocknames.c
+/lwlocknames.h
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index e12a854..e941f2d 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -24,8 +24,17 @@ s_lock_test: s_lock.c $(top_builddir)/src/port/libpgport.a
 	$(CC) $(CPPFLAGS) $(CFLAGS) -DS_LOCK_TEST=1 $(srcdir)/s_lock.c \
 		$(TASPATH) -L $(top_builddir)/src/port -lpgport -o s_lock_test
 
+# see explanation in ../../parser/Makefile
+lwlocknames.c: lwlocknames.h ;
+
+lwlocknames.h: $(top_srcdir)/src/backend/storage/lmgr/lwlocknames.txt generate-lwlocknames.pl
	$(PERL) $(srcdir)/generate-lwlocknames.pl $<
+
 check: s_lock_test
 	./s_lock_test
 
-clean distclean maintainer-clean:
+clean distclean:
 	rm -f s_lock_test
+
+maintainer-clean: clean
+	rm -f lwlocknames.h
diff --git a/src/backend/storage/lmgr/generate-lwlocknames.pl b/src/backend/storage/lmgr/generate-lwlocknames.pl
new 

Re: [HACKERS] Reduce ProcArrayLock contention

2015-08-04 Thread Andres Freund
On 2015-08-04 11:29:39 -0400, Robert Haas wrote:
 On Mon, Aug 3, 2015 at 8:39 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 1. I got rid of all of the typecasts.  You're supposed to treat
 pg_atomic_u32 as a magic data type that is only manipulated via the
 primitives provided, not just cast back and forth between that and
 u32.

Absolutely. Otherwise no fallbacks can work.

 2. I got rid of the memory barriers.  System calls are full barriers,
 and so are compare-and-exchange operations.  Between those two facts,
 we should be fine without these.

Actually by far not all system calls are full barriers?

Regards,

Andres




Re: [HACKERS] brin index vacuum versus transaction snapshots

2015-08-04 Thread Robert Haas
On Fri, Jul 31, 2015 at 3:45 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 I think the real solution to this problem is to avoid use of
 GetTransactionSnapshot(), and instead use GetLatestSnapshot().  As far
 as I can see, that should completely close the hole.  This requires
 patching IndexBuildHeapRangeScan() to allow for that.

 Actually I think there's another problem: if a transaction starts and
 inserts a tuple into the page range, then goes to sleep, and then
 another session does the summarization of the page range, session 1 is
 seen as in progress by session 2 (so the scan does not see the new
 tuple), but the placeholder tuple was not modified either because it was
 inserted later than the snapshot.  So the update is lost.

 I think the only way to close this hole is to have summarize_range()
 sleep until all open snapshots are gone after inserting the placeholder
 tuple and before acquiring the snapshot, similarly to how CREATE INDEX
 CONCURRENTLY does it.

That's gonna be really slow, though, right?  Even if you rework things
so that vacuum inserts all the placeholder tuples first, then waits,
then does all the summarization, that could easily turn a vacuum that
would have finished in a second into one that instead takes multiple
hours.  During that time an AV worker is pinned down, and all sorts of
badness can ensue.

Maybe I'm misunderstanding, but this whole thing seems like a pretty
serious problem for BRIN.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Reduce ProcArrayLock contention

2015-08-04 Thread Andres Freund
On 2015-08-04 11:43:45 -0400, Robert Haas wrote:
 On Tue, Aug 4, 2015 at 11:33 AM, Andres Freund and...@anarazel.de wrote:
  Actually by far not all system calls are full barriers?
 
 How do we know which ones are and which ones are not?

Good question. Reading the source code of all implementations I suppose
:(

E.g. gettimeofday()/clock_gettime(), getpid() on linux aren't
barriers.

 I can't believe PGSemaphoreUnlock isn't a barrier.  That would be cruel.

Yea, I think that's a pretty safe bet. I mean even if you'd implement it
locklessly in the kernel, that'd still employ significant enough
barriers/atomic ops itself.





Re: [HACKERS] upgrade failure from 9.5 to head

2015-08-04 Thread Robert Haas
On Sun, Aug 2, 2015 at 8:20 PM, Stephen Frost sfr...@snowman.net wrote:
 +1.

 I was doing testing the other day and ran into the "pg_dump doesn't
 support shell types" issue, and it was annoyingly confusing.

Is anyone working on this?  Should it be added to the open items list?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Simon Riggs
On 4 August 2015 at 15:18, Andres Freund and...@anarazel.de wrote:

 On 2015-08-04 14:59:11 +0100, Simon Riggs wrote:
  On 4 August 2015 at 14:55, Heikki Linnakangas hlinn...@iki.fi wrote:
 
   On 08/04/2015 04:35 PM, Simon Riggs wrote:
  
   This and the OP seem like 9.5 open items to me.
  
  
   Why? This is nothing new in 9.5.
 
 
  gin_pending_list_limit is new in 9.5
 
  We're in Alpha, so if something has been added and isn't very usable, we
  should change it while we can.

 The only thing that variable does is change what the pending size limit
 is determined by. Previously it was work_mem, now it's
 gin_pending_list_limit. Imo that has pretty much nothing to do with not
 registering pages as free.


We've made a change to the way GIN fast update works and that needs to be
an effective change, not one that immediately triggers a secondary
annoyance for the user. I've asked some questions; they may turn into
actions, or not, but they are open items.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Autonomous Transaction is back

2015-08-04 Thread Merlin Moncure
On Tue, Aug 4, 2015 at 4:12 AM, Rajeev rastogi
rajeev.rast...@huawei.com wrote:
 On 03 August 2015 18:40, Merlin Moncure [mailto:mmonc...@gmail.com] Wrote:
On Sun, Aug 2, 2015 at 11:37 PM, Rajeev rastogi
rajeev.rast...@huawei.com wrote:
 On 31 July 2015 23:10, Robert Haas Wrote:
I think we're going entirely down the wrong path here.  Why is it ever
useful for a backend's lock requests to conflict with themselves, even
with autonomous transactions?  That seems like an artifact of somebody
else's implementation that we should be happy we don't need to copy.

 IMHO, since most locking is managed at the transaction level, not the
backend level, and we consider the main & autonomous transactions to be
independent transactions, practically they may conflict, right?
 It is also right, as you said, that there are no really useful use-cases
where an autonomous transaction conflicts with the main (parent) transaction.
But we cannot take that for granted, as a user might make a mistake. So at
least we should have some mechanism to handle this rare case, for which
as of now I think throwing an error from the autonomous transaction is one
solution. Once the error is thrown from the autonomous transaction, the main
transaction may continue as it is (or should the main transaction also abort?).

hm.  OK, what's the behavior of:

BEGIN
  UPDATE foo SET x = x + 1 WHERE foo_id = 1;

  BEGIN WITH AUTONOMOUS TRANSACTION
UPDATE foo SET x = x + 1 WHERE foo_id = 1;
  END;

  RAISE EXCEPTION ...;
EXCEPTION ...

END;

 It should throw an error (or something equivalent), as the second update will 
 wait for the record lock to be released, which in this case will not happen till 
 the second update finishes. So, catch-22.

Yeah. Point being, from my point of view autonomous transactions have
to conflict with the master transaction (or any transaction really).
I agree the right course of action is to error out immediately...what
else could you do?  There isn't even a deadlock in the classic sense
and allowing control to continue would result in indeterminate
behavior FWICT.

Also,
*) What do the other candidate implementations do?  IMO, 
compatibility should be the underlying design principle.

 Oracle throws an error in such cases. But we can decide what behavior we want 
 to keep.

gotcha.  makes sense.

*) What will the SQL only feature look like?

 Similar to the PL case mentioned in your example, we can provide the SQL-only 
 feature also.

*) Is the SPI interface going to be extended to expose AT?

 I don't think at this point that there is any need to expose the SPI interface 
 for this.

Ok, how do AT work in a non-plpgsql (SQL only) scenario?  Are you
going to similarly extend BEGIN TRANSACTION?

merlin




Re: [HACKERS] Reduce ProcArrayLock contention

2015-08-04 Thread Robert Haas
On Mon, Aug 3, 2015 at 8:39 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 I agree and modified the patch to use 32-bit atomics based on idea
 suggested by Robert and didn't modify lwlock.c.

 While looking at patch, I found that the way it was initialising the list
 to be empty was wrong, it was using pgprocno as 0 to initialise the
 list, as 0 is a valid pgprocno.  I think we should use a number greater
 than PROCARRAY_MAXPROC (the maximum number of procs in the proc
 array).

 Apart from above fix, I have modified src/backend/access/transam/README
 to include the information about the improvement this patch brings to
 reduce ProcArrayLock contention.

I spent some time looking at this patch today and found that a good
deal of cleanup work seemed to be needed.  Attached is a cleaned-up
version which makes a number of changes:

1. I got rid of all of the typecasts.  You're supposed to treat
pg_atomic_u32 as a magic data type that is only manipulated via the
primitives provided, not just cast back and forth between that and
u32.

2. I got rid of the memory barriers.  System calls are full barriers,
and so are compare-and-exchange operations.  Between those two facts,
we should be fine without these.

3. I removed the clearXid field you added to PGPROC.  Since that's
just a sentinel, I used nextClearXidElem for that purpose. There
doesn't seem to be a need for two fields.

4. I factored out the actual XID-clearing logic into a new function
ProcArrayEndTransactionInternal instead of repeating it twice. On the
flip side, I merged PushProcAndWaitForXidClear with
PopProcsAndClearXids and renamed the result to ProcArrayGroupClearXid,
since there seemed to be no need to separate them.

5. I renamed PROC_NOT_IN_PGPROCARRAY to INVALID_PGPROCNO, which I
think is more consistent with what we do elsewhere, and made it a
compile-time constant instead of a value that must be computed every
time it's used.

6. I overhauled the comments and README updates.
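
To make the mechanism concrete, here is a rough sketch of the lock-free push
that lets one backend clear the XIDs for a whole group, using the names from
this thread (simplified; the leader's clearing loop and the wakeup of
followers are omitted):

static void
group_clear_xid_push(PGPROC *proc)
{
	for (;;)
	{
		uint32		nextidx = pg_atomic_read_u32(&ProcGlobal->nextClearXidElem);

		/* link ourselves in front of the current list head */
		pg_atomic_write_u32(&proc->nextClearXidElem, nextidx);

		if (pg_atomic_compare_exchange_u32(&ProcGlobal->nextClearXidElem,
										   &nextidx,
										   (uint32) proc->pgprocno))
		{
			/* pushed; if nextidx == INVALID_PGPROCNO the list was empty,
			 * so we became the leader and must take ProcArrayLock once
			 * and clear the XIDs of everyone on the list */
			break;
		}
	}
}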

I'm not entirely happy with the name nextClearXidElem but apart from
that I'm fairly happy with this version.  We should probably test it
to make sure I haven't broken anything; on a quick 3-minute pgbench
test on cthulhu (128-way EDB server) with 128 clients I got 2778 tps
with master and 4330 tps with this version of the patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index bc68b47..f6db580 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -252,6 +252,9 @@ implementation of this is that GetSnapshotData takes the ProcArrayLock in
 shared mode (so that multiple backends can take snapshots in parallel),
 but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode
 while clearing MyPgXact->xid at transaction end (either commit or abort).
+(To reduce context switching, when multiple transactions commit nearly
+simultaneously, we have one backend take ProcArrayLock and clear the XIDs
+of multiple processes at once.)
 
 ProcArrayEndTransaction also holds the lock while advancing the shared
 latestCompletedXid variable.  This allows GetSnapshotData to use
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 4f3c5c9..14d1219 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -167,6 +167,9 @@ static int KnownAssignedXidsGetAndSetXmin(TransactionId *xarray,
 static TransactionId KnownAssignedXidsGetOldestXmin(void);
 static void KnownAssignedXidsDisplay(int trace_level);
 static void KnownAssignedXidsReset(void);
+static void ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
+TransactionId latestXid);
+static void ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid);
 
 /*
  * Report shared-memory space needed by CreateSharedProcArray.
@@ -370,7 +373,6 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 	elog(LOG, "failed to find proc %p in ProcArray", proc);
 }
 
-
 /*
  * ProcArrayEndTransaction -- mark a transaction as no longer running
  *
@@ -399,26 +401,18 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 */
 		Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
 
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-		pgxact->xid = InvalidTransactionId;
-		proc->lxid = InvalidLocalTransactionId;
-		pgxact->xmin = InvalidTransactionId;
-		/* must be cleared with xid/xmin: */
-		pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-		pgxact->delayChkpt = false;		/* be sure this is cleared in abort */
-		proc->recoveryConflictPending = false;
-
-		/* Clear the subtransaction-XID cache too while holding the lock */
-		pgxact->nxids = 0;
-		pgxact->overflowed = false;
-
-		/* Also advance global latestCompletedXid while holding the lock */
-		if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-								  latestXid))
-			ShmemVariableCache->latestCompletedXid = 

Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 4 August 2015 at 15:18, Andres Freund and...@anarazel.de wrote:
 The only thing that variable does is change what the pending size limit
 is determined by. Previously it was work_mem, now it's
 gin_pending_list_limit. Imo that has pretty much nothing to do with not
 registering pages as free.

 We've made a change to the way GIN fast update works and that needs to be
 an effective change, not one that immediately triggers a secondary
 annoyance for the user.

Said annoyance has been there since day one, no?

 I've asked some questions; they may turn into
 actions, or not, but they are open items.

If it's not a new problem introduced by 9.5, I don't think it's
appropriate to insist that 9.5 must solve it.

regards, tom lane




Re: [HACKERS] Reduce ProcArrayLock contention

2015-08-04 Thread Robert Haas
On Tue, Aug 4, 2015 at 11:33 AM, Andres Freund and...@anarazel.de wrote:
 On 2015-08-04 11:29:39 -0400, Robert Haas wrote:
 On Mon, Aug 3, 2015 at 8:39 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 1. I got rid of all of the typecasts.  You're supposed to treat
 pg_atomic_u32 as a magic data type that is only manipulated via the
 primitives provided, not just cast back and forth between that and
 u32.

 Absolutely. Otherwise no fallbacks can work.

 2. I got rid of the memory barriers.  System calls are full barriers,
 and so are compare-and-exchange operations.  Between those two facts,
 we should be fine without these.

 Actually by far not all system calls are full barriers?

How do we know which ones are and which ones are not?

I can't believe PGSemaphoreUnlock isn't a barrier.  That would be cruel.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Reduce ProcArrayLock contention

2015-08-04 Thread Amit Kapila
On Tue, Aug 4, 2015 at 8:59 PM, Robert Haas robertmh...@gmail.com wrote:

 On Mon, Aug 3, 2015 at 8:39 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
  I agree and modified the patch to use 32-bit atomics based on idea
  suggested by Robert and didn't modify lwlock.c.
 
  While looking at patch, I found that the way it was initialising the
list
  to be empty was wrong, it was using pgprocno as 0 to initialise the
  list, as 0 is a valid pgprocno.  I think we should use a number greater
  than PROCARRAY_MAXPROC (the maximum number of procs in the proc
  array).
 
  Apart from above fix, I have modified src/backend/access/transam/README
  to include the information about the improvement this patch brings to
  reduce ProcArrayLock contention.

 I spent some time looking at this patch today and found that a good
 deal of cleanup work seemed to be needed.  Attached is a cleaned-up
 version which makes a number of changes:

 1. I got rid of all of the typecasts.  You're supposed to treat
 pg_atomic_u32 as a magic data type that is only manipulated via the
 primitives provided, not just cast back and forth between that and
 u32.

 2. I got rid of the memory barriers.  System calls are full barriers,
 and so are compare-and-exchange operations.  Between those two facts,
 we should be fine without these.


I have kept the barriers based on the comments on top of the atomic read;
see the code below:

 * No barrier semantics.
 */
STATIC_IF_INLINE uint32
pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr)

Note - The function header comments on pg_atomic_read_u32 and
corresponding write call seem to be reversed, but that is something
separate.


 I'm not entirely happy with the name nextClearXidElem but apart from
 that I'm fairly happy with this version.  We should probably test it
 to make sure I haven't broken anything;

Okay, will look into it tomorrow.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Incorrect comment about abbreviated keys

2015-08-04 Thread Robert Haas
On Sun, Aug 2, 2015 at 5:11 AM, Peter Geoghegan p...@heroku.com wrote:
 Attached patch fixes this issue. This was missed by
 78efd5c1edb59017f06ef96773e64e6539bfbc86

Committed and back-patched to 9.5.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Reduce ProcArrayLock contention

2015-08-04 Thread Andres Freund
On 2015-08-04 21:20:20 +0530, Amit Kapila wrote:
 I have kept barriers based on comments on top of atomic read, refer
 below code:


  * No barrier semantics.
  */
 STATIC_IF_INLINE uint32
 pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr)
 
 Note - The function header comments on pg_atomic_read_u32 and
 corresponding write call seems to be reversed, but that is something
 separate.

Well, the question is whether you *need* barrier semantics in that
place. If you just have a retry loop around a compare/exchange there's
no point in having one, it'll just cause needless slowdown due to
another bus-lock.
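
For illustration, the sort of retry loop being described, built on the
existing pg_atomic primitives (an invented example, not code from the patch;
pg_atomic_fetch_add_u32 already exists for this exact job):

static void
atomic_add_u32(volatile pg_atomic_uint32 *ptr, uint32 add)
{
	uint32		old = pg_atomic_read_u32(ptr);	/* plain read, no barrier */

	/* on failure, 'old' is refreshed with the current value; just retry.
	 * The compare-and-exchange is itself a full barrier, so no explicit
	 * pg_memory_barrier() is needed around the loop. */
	while (!pg_atomic_compare_exchange_u32(ptr, &old, old + add))
		;
}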




Re: [HACKERS] tablecmds.c and lock hierarchy

2015-08-04 Thread Michael Paquier
On Wed, Aug 5, 2015 at 3:05 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Aug 4, 2015 at 2:41 AM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 Yep, true as things stand now. But this would get broken if we add a
 new lock level between ShareRowExclusiveLock and AccessExclusiveLock
 that does not respect the current monotone hierarchy between those.

 But we're probably not going to do that, so it doesn't matter; and if
 we do do it, we can worry about it then.  I don't think this is worth
 getting concerned about now.

OK. Then let me suggest the attached Assert safeguard. It ensures
that all the lock levels used follow a monotone hierarchy, per the definition
of what is in the conflicts table. That looks like cheap insurance...

In any case, this means as well that we should move on with the
current logic presented by Fabrizio on the other thread.
-- 
Michael
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 970abd4..3aff387 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -3057,8 +3057,13 @@ AlterTableGetLockLevel(List *cmds)
 		}
 
 		/*
-		 * Take the greatest lockmode from any subcommand
+		 * Take the greatest lockmode from any subcommand following a monotone
+		 * hierarchy.
 		 */
+		Assert(cmd_lockmode == ShareUpdateExclusiveLock ||
+			   cmd_lockmode == ShareRowExclusiveLock ||
+			   cmd_lockmode == ExclusiveLock ||
+			   cmd_lockmode == AccessExclusiveLock);
 		if (cmd_lockmode > lockmode)
 			lockmode = cmd_lockmode;
 	}



Re: [HACKERS] WIP: Make timestamptz_out less slow.

2015-08-04 Thread David Rowley
On 29 July 2015 at 03:25, Andres Freund and...@anarazel.de wrote:

 On 2015-07-29 03:10:41 +1200, David Rowley wrote:
  timestamp_out() = 2015-07-29 02:24:33.34 in 3.506000
  timestamp_out_old() = 2015-07-29 02:24:33.034 in 64.518000
  timestamp_out_af() = 2015-07-29 02:24:33.034 in 2.981000
 
  timestamp_out_old is master's version, the timestamp_out_af() is yours,
 and
  timestamp_out() is my one. times are in seconds to perform 100 million
  calls.

 That looks good.

  So it appears your version is a bit faster than mine, but we're both
 about
  20 times faster than the current one.
  Also mine needs fixed up as the fractional part is not padded the same as
  yours, but I doubt that'll affect the performance by much.

 Worthwhile to finish that bit and try ;)

  My view: It's probably not worth going quite as far as you've gone for a
  handful of nanoseconds per call, but perhaps something along the lines of
  mine can be fixed up.

 Yes, I agree that yours is probably going to be fast enough.

  Have you thought about what to do when HAVE_INT64_TIMESTAMP is not
 defined?

 I don't think it's actually important. The only difference vs float
 timestamps is that in the latter case we set fsecs to zero BC.

 Unless we want to slow down the common case it seems not unlikely that
 we're going to end up with a separate slow path anyway. E.g. neither
 your version nor mine handles 5 digit years (which is why I fell back to
 the slow path in that case in my patch).


It occurred to me that handling the 5 digit year is quite a simple change
to my code:

static char *
pg_uint2str_padding(char *str, unsigned int value, unsigned int padding)
{
	char	   *start = str;
	char	   *end = &str[padding];
	unsigned int num = value;

	//Assert(padding > 0);
	*end = '\0';
	while (padding--)
	{
		str[padding] = num % 10 + '0';
		num /= 10;
	}

	/*
	 * if value was too big for the specified padding then rebuild the whole
	 * number again without padding. Conveniently pg_uint2str() does exactly
	 * this.
	 */
	if (num > 0)
		return pg_uint2str(str, value);

	return end;
}

We simply need to check whether any 'num' is left after consuming the
given space; if there is, then just use pg_uint2str().
This keeps things very fast for the most common case, where the year
is 4 digits long.
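
For reference, a minimal sketch of what the pg_uint2str() relied on above
might look like (the real helper is in the patch, not shown in this mail):
write the digits of 'value' at 'str' and return a pointer to the terminating
NUL, so callers can keep appending.

static char *
pg_uint2str(char *str, unsigned int value)
{
	char	   *start = str;
	char	   *end;

	/* emit digits least-significant first */
	do
	{
		*str++ = value % 10 + '0';
		value /= 10;
	} while (value > 0);

	end = str;
	*end = '\0';

	/* now reverse the digits into the correct order */
	while (start < --str)
	{
		char		swap = *start;

		*start++ = *str;
		*str = swap;
	}
	return end;
}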

I've not thought about negative years. The whole function should perhaps
take signed ints instead of unsigned.

Regards

David Rowley

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Transactions involving multiple postgres foreign servers

2015-08-04 Thread Amit Langote
On 2015-08-05 AM 06:11, Robert Haas wrote:
 On Mon, Aug 3, 2015 at 8:19 PM, Amit Langote
 langote_amit...@lab.ntt.co.jp wrote:
 On 2015-08-03 PM 09:24, Ashutosh Bapat wrote:
 For postgres_fdw it's a boolean server-level option 'twophase_compliant'
 (suggestions for name welcome).


 How about just 'twophase'?
 
 How about two_phase_commit?
 

Much cleaner, +1

Thanks,
Amit





Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );

2015-08-04 Thread Michael Paquier
On Wed, Aug 5, 2015 at 2:15 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 Andres Freund wrote:
 On 2015-08-03 14:15:27 +0900, Michael Paquier wrote:

  As long as this only applies on master, this may be fine... We could
  basically pass a LOCKMASK to the multiple layers of tablecmds.c
  instead of LOCKMODE to track all the locks that need to be taken, and
  all the relations open during operations.

 This sounds far too complicated to me. Just LockRelationOid() the
 relation with the appropriate level everytime you pass through the
 function?

 That opens up for lock escalation and deadlocks, doesn't it?  You are
 probably thinking that it's okay to ignore those but I don't necessarily
 agree with that.

Yes, I think so as long as we would try to take multiple locks... And
we go back to the lock hierarchy then, because there is no way to
define in some cases which lock is strictly stronger than the other.
Honestly I think that we should go with the version from Fabrizio doing the
lock comparison, with an assert safeguard in AlterTableGetLockLevel.
The logic would get broken if ALTER TABLE begins to use a lock level
that cannot be ordered according to the others, but that's not the
case now.
-- 
Michael




Re: [HACKERS] tablecmds.c and lock hierarchy

2015-08-04 Thread Michael Paquier
On Wed, Aug 5, 2015 at 11:47 AM, Noah Misch n...@leadboat.com wrote:
 On Tue, Aug 04, 2015 at 07:35:43AM +0100, Simon Riggs wrote:
 On 4 August 2015 at 05:56, Michael Paquier michael.paqu...@gmail.com wrote:
  The thing is that, as mentioned by Alvaro and Andres on this thread,
  we have no guarantee that the different relation locks compared have a
  monotone hierarchy and we may finish by taking a lock that does not
  behave as you would like to. We are now lucky enough that ALTER TABLE
  only uses ShareUpdateExclusiveLock, ShareRowExclusiveLock and
  AccessExclusiveLock that actually have a hierarchy so this is not a
  problem yet.
  However it may become a problem if we add in the future more lock
  modes and that are used by ALTER TABLE.
 

 Please provide the link to the discussion of this. I don't see a problem
 here right now that can't be solved by saying

 Assert(locklevel == ShareUpdateExclusiveLock ||
 locklevel > ShareRowExclusiveLock);

 Agreed; that addresses the foreseeable future of this threat.

Some sub-commands are using ShareRowExclusiveLock... So this one is better :)
Assert(locklevel == ShareUpdateExclusiveLock || locklevel >=
ShareRowExclusiveLock);
Or we simply list all the locks allowed individually... But that's a
minor point.
-- 
Michael




Re: [HACKERS] Sharing aggregate states between different aggregate functions

2015-08-04 Thread David Rowley
On 5 August 2015 at 03:03, Heikki Linnakangas hlinn...@iki.fi wrote:

 On 08/03/2015 08:53 AM, David Rowley wrote:

 Attached is a delta patched which is based
 on sharing_aggstate-heikki-2.patch to fix up the out-dated comments and
 also a few more test scenarios which test the sharing works with matching
 INITCOND and that it does not when they don't match.

 What do you think?


 I committed this, after some more cleanup of the comments. Thanks!


Great! Thanks for doing the cleanups and committing it.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Transactions involving multiple postgres foreign servers

2015-08-04 Thread Robert Haas
On Mon, Aug 3, 2015 at 8:19 PM, Amit Langote
langote_amit...@lab.ntt.co.jp wrote:
 On 2015-08-03 PM 09:24, Ashutosh Bapat wrote:
 On Sat, Aug 1, 2015 at 12:18 AM, Robert Haas robertmh...@gmail.com wrote:

 OK, sure.  But let's make sure postgres_fdw gets a server-level option
 to control this.


 For postgres_fdw it's a boolean server-level option 'twophase_compliant'
 (suggestions for name welcome).


 How about just 'twophase'?

How about two_phase_commit?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] brin index vacuum versus transaction snapshots

2015-08-04 Thread Alvaro Herrera
Alvaro Herrera wrote:

 Thankfully I found
 another way to solve it, which is to forgo usage of MVCC here and
 instead use SnapshotAny.  There's already a mode in
 IndexBuildHeapRangeScan that uses SnapshotAny, but it needs some tweaks
 to do what we need.  I'm going to propose a patch along those lines
 shortly.

Here's the promised patch.  Includes a new isolation test (which fails
with the old code in two ways: first when using VACUUM it causes the
error message Kevin didn't like to be raised; second when using
brin_summarize_new_values it causes the inserted tuple to not be part of
the summary, which is wrong).  This test leaves the pageinspect
extension installed in the isolationtest database, but that seems fine
to me.  (No other isolation test installs an extension.)

The error message Kevin complained about is gone, as is some related
commentary.  This is replaced by tweaks in IndexBuildHeapRangeScan that
know to do a SnapshotAny scan without being noisy about in-progress
insertions/deletions.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 360b26e..de32cca 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -951,9 +951,13 @@ summarize_range(IndexInfo *indexInfo, BrinBuildState *state, Relation heapRel,
 	 * Execute the partial heap scan covering the heap blocks in the specified
 	 * page range, summarizing the heap tuples in it.  This scan stops just
 	 * short of brinbuildCallback creating the new index entry.
+	 *
+	 * Note that it is critical we use the any visible mode of
+	 * IndexBuildHeapRangeScan here: otherwise, we would miss tuples inserted
+	 * by transactions that are still in progress, among other corner cases.
 	 */
 	state->bs_currRangeStart = heapBlk;
-	IndexBuildHeapRangeScan(heapRel, state->bs_irel, indexInfo, false,
+	IndexBuildHeapRangeScan(heapRel, state->bs_irel, indexInfo, false, true,
			heapBlk, state->bs_pagesPerRange,
 			brinbuildCallback, (void *) state);
 
@@ -1058,36 +1062,6 @@ brinsummarize(Relation index, Relation heapRel, double *numSummarized,
 state = initialize_brin_buildstate(index, revmap,
    pagesPerRange);
 indexInfo = BuildIndexInfo(index);
-
-/*
- * We only have ShareUpdateExclusiveLock on the table, and
- * therefore other sessions may insert tuples into the range
- * we're going to scan.  This is okay, because we take
- * additional precautions to avoid losing the additional
- * tuples; see comments in summarize_range.  Set the
- * concurrent flag, which causes IndexBuildHeapRangeScan to
- * use a snapshot other than SnapshotAny, and silences
- * warnings emitted there.
- */
-indexInfo->ii_Concurrent = true;
-
-/*
- * If using transaction-snapshot mode, it would be possible
- * for another transaction to insert a tuple that's not
- * visible to our snapshot if we have already acquired one,
- * when in snapshot-isolation mode; therefore, disallow this
- * from running in such a transaction unless a snapshot hasn't
- * been acquired yet.
- *
- * This code is called by VACUUM and
- * brin_summarize_new_values. Have the error message mention
- * the latter because VACUUM cannot run in a transaction and
- * thus cannot cause this issue.
- */
-if (IsolationUsesXactSnapshot() && FirstSnapshotSet)
-	ereport(ERROR,
-			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
-			 errmsg("brin_summarize_new_values() cannot run in a transaction that has already obtained a snapshot")));
 			}
 			summarize_range(indexInfo, state, heapRel, heapBlk);
 
@@ -1111,7 +1085,10 @@ brinsummarize(Relation index, Relation heapRel, double *numSummarized,
 	/* free resources */
 	brinRevmapTerminate(revmap);
 	if (state)
+	{
 		terminate_brin_buildstate(state);
+		pfree(indexInfo);
+	}
 }
 
 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 69f35c9..e59b163 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2161,6 +2161,7 @@ IndexBuildHeapScan(Relation heapRelation,
 {
 	return IndexBuildHeapRangeScan(heapRelation, indexRelation,
    indexInfo, allow_sync,
+   false,
    0, InvalidBlockNumber,
    callback, callback_state);
 }
@@ -2170,12 +2171,17 @@ IndexBuildHeapScan(Relation heapRelation,
  * number of blocks are scanned.  Scan to end-of-rel can be signalled by
  * passing InvalidBlockNumber as numblocks.  Note that restricting the range
  * to scan cannot be done when requesting syncscan.
+ *
+ * When anyvisible mode is requested, all tuples visible to any transaction
+ * are considered, including those inserted or deleted by transactions that are
+ * still in progress.
  */
 double
 IndexBuildHeapRangeScan(Relation heapRelation,
 		Relation indexRelation,
 		

Re: [HACKERS] [DOCS] max_worker_processes on the standby

2015-08-04 Thread Alvaro Herrera
Adding CC to hackers, since this is clearly not just a docs issue.  Also
CCing Petr and Craig since they are the ones that know how this is used
in BDR.

Robert Haas wrote:
 On Tue, Aug 4, 2015 at 12:41 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:

  The alternative is to turn the feature on automatically if it sees that
  the master also has it on, i.e. the value would not be what the config
  file says it is.  Doing this would be a bit surprising IMO, but given
  the behavior above maybe it's better than the current behavior.
 
 I think it's totally reasonable for the standby to follow the master's
 behavior rather than the config file.  That should be documented, but
 otherwise, no problem.  If it were technologically possible for the
 standby to follow the config file rather than the master in all cases,
 that would be fine, too.  But the current behavior is somewhere in the
 middle, and that doesn't seem like a good plan.

So I discussed this with Petr.  He points out that if we make the
standby follow the master, then the problem would be the misbehavior
that results once the standby is promoted: at that point the standby
would no longer follow the master and would start with the feature
turned off, which could be disastrous (depending on what you are using
the commit timestamps for).

To solve that problem, you could suggest that if we see the setting
turned on in pg_control then we should follow that instead of the config
file; but then the problem is that there's no way to turn the feature
off.  And things are real crazy by then.

Given this, we're leaning towards the idea that the standby should not
try to follow the master at all.  Instead, an extension that wants to
use this stuff can check the value for itself, and raise a fatal error
if it's not already turned on in the config file.  That way, a lot of the
strange corner cases disappear.
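
A minimal sketch of such a check, as it might appear in an extension's
startup code (the error wording here is hypothetical):

if (!track_commit_timestamp)
	ereport(FATAL,
			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
			 errmsg("this extension requires track_commit_timestamp to be enabled")));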


-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c

2015-08-04 Thread Tom Lane
Piotr Stefaniak postg...@piotr-stefaniak.me writes:
 On 08/03/2015 09:18 PM, Tom Lane wrote:
 ... but I can't reproduce it on HEAD with either of these queries.
 Not clear why you're getting different results.

 I'm terribly sorry, but I didn't notice that postgresql.conf was modified...
 Set join_collapse_limit = 32 and you should see the error.

Ah ... now I get that error on the smaller query, but the larger one
(that you put in an attachment) still doesn't show any problem.
Do you have any other nondefault settings?

regards, tom lane




Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c

2015-08-04 Thread Ewan Higgs
Hi there, I've been following the sqlsmith work and wanted to jump in and try
it out. I took Peter's idea and tried building postgres with the flags
suggested, but it was hard to get anything working.

I'm on commit 85e5e222b1dd02f135a8c3bf387d0d6d88e669bd (Tue Aug 4 14:55:32 2015 
-0400)
Configure arguments: ./configure --prefix=$HOME/pkg CC=clang CFLAGS='-O1 -g
-fsanitize=address -fno-omit-frame-pointer -fno-optimize-sibling-calls'
--enable-cassert

I had to make a simple leak suppression file:
$ cat leak.supp
leak:save_ps_display_args
leak:__GI___strdup
$ export LSAN_OPTIONS=suppressions=leak.supp
And then I could run postgres. After 50,000 queries, I'm left with the 
following report:

queries: 50514
AST stats (avg): height = 7.29877 nodes = 37.8156
296    ERROR:  canceling statement due to statement timeout
166    ERROR:  invalid regular expression: quantifier operand invalid
26     ERROR:  could not determine which collation to use for string comparison
23     ERROR:  cannot compare arrays of different element types
12     ERROR:  invalid regular expression: brackets [] not balanced
5      ERROR:  cache lookup failed for index 2619
2      ERROR:  invalid regular expression: parentheses () not balanced
error rate: 0.0104921

AddressSanitizer didn't fire except for the suppressed leaks. The suppressed 
leaks were only hit at the beginning:
-
Suppressions used:
  count  bytes template
  1    520 save_ps_display_args
  1 10 __GI___strdup
-


sqlsmith is a cool little piece of kit and I see a lot of room for ongoing
work (performance bumps for more queries per second; more db backends;
different fuzzers).

Yours,
Ewan Higgs

 

 From: Peter Geoghegan p...@heroku.com
 To: Andreas Seltenreich seltenre...@gmx.de 
Cc: Pg Hackers pgsql-hackers@postgresql.org 
 Sent: Sunday, 2 August 2015, 10:39
 Subject: Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
   
On Fri, Jul 31, 2015 at 5:56 PM, Andreas Seltenreich seltenre...@gmx.de wrote:
 sqlsmith triggered the following assertion in master (c188204).

Thanks for writing sqlsmith. It seems like a great tool.

I wonder, are you just running the tool with assertions enabled when
PostgreSQL is built? If so, it might make sense to make various
problems more readily detected. As you may know, Clang has a pretty
decent option called AddressSanitizer that can detect memory errors as
they occur with an overhead that is not excessive. One might use the
following configure arguments when building PostgreSQL to use
AddressSanitizer:

./configure CC=clang CFLAGS='-O1 -g -fsanitize=address
-fno-omit-frame-pointer -fno-optimize-sibling-calls' --enable-cassert

Of course, it remains to be seen if this pays for itself. Apparently
the tool has about a 2x overhead [1]. I'm really not sure that you'll
find any more bugs this way, but it's certainly possible that you'll
find a lot more. Given your success in finding bugs without using
AddressSanitizer, introducing it may be premature.

[1] http://clang.llvm.org/docs/AddressSanitizer.html#introduction
-- 
Peter Geoghegan






  

Re: [HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types

2015-08-04 Thread Peter Geoghegan
On Tue, Aug 4, 2015 at 2:29 AM, Amit Langote
langote_amit...@lab.ntt.co.jp wrote:
 Perhaps, it may have to do with how EXCLUDED pseudo-rel's targetlist is
 manipulated through parse-plan stage?

I think so, yes.

I'll look into writing a fix for this later in the week.

Thanks for the report, Geoff, and thanks for the analysis Amit.

-- 
Peter Geoghegan




Re: [HACKERS] tablecmds.c and lock hierarchy

2015-08-04 Thread Michael Paquier
On Wed, Aug 5, 2015 at 2:23 AM, Alvaro Herrera wrote:
 Michael Paquier wrote:
 On Tue, Aug 4, 2015 at 2:43 PM, Alvaro Herrera wrote:
 Now, let's take for example this case with locks A, B, C, D:
 - Lock A conflicts with ACD
 - B with BCD
 - C with itself
 - D with itself
 What would you choose as a result of add(C,D)? A or B? Or the super
 lock conflicting with all of them?

Actually the answer is the sum of all the potential candidates. This
converges to AccessExclusiveLock.

 This appears to me to be a hypothetical case that I don't think occurs in our
 conflicts table, so I wouldn't worry about it.

No it doesn't. I am using a huge hypothetical if here :)
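
To make the idea concrete, a rough sketch of the addition operation,
assuming direct access to a conflicts table like lock.c's LockConflicts[]
(which is static there, so this is hypothetical):

static LOCKMODE
add_lock_modes(LOCKMODE a, LOCKMODE b)
{
	LOCKMASK	conflicts = LockConflicts[a] | LockConflicts[b];
	LOCKMODE	mode;

	/* pick the weakest mode whose conflict set covers the whole union */
	for (mode = NoLock + 1; mode <= AccessExclusiveLock; mode++)
		if ((LockConflicts[mode] & conflicts) == conflicts)
			return mode;
	return AccessExclusiveLock;
}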
-- 
Michael




Re: [HACKERS] Reduce ProcArrayLock contention

2015-08-04 Thread Robert Haas
On Tue, Aug 4, 2015 at 11:50 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 I have kept the barriers based on the comments on top of the atomic read;
 see the code below:

  * No barrier semantics.
  */
 STATIC_IF_INLINE uint32
 pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr)

 Note - The function header comments on pg_atomic_read_u32 and the
 corresponding write call seem to be reversed, but that is something
 separate.

That doesn't matter, because the compare-and-exchange *is* a barrier.
Putting a barrier between a store and an operation that is already a
barrier doesn't do anything useful.
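
To spell that out with a sketch (the variable names are hypothetical):

pg_atomic_uint32 head;
uint32		expected;
uint32		newval = 1;

expected = pg_atomic_read_u32(&head);	/* no barrier semantics, none needed */
/* The compare-and-exchange below is itself a full barrier; putting a
 * pg_memory_barrier() in front of it would buy nothing. */
while (!pg_atomic_compare_exchange_u32(&head, &expected, newval))
	;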

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] upgrade failure from 9.5 to head

2015-08-04 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sun, Aug 2, 2015 at 8:20 PM, Stephen Frost sfr...@snowman.net wrote:
 +1.
 
 I was doing testing the other day and ran into the pg_dump doesn't
 support shell types issue and it was annoyingly confusing.

 Is anyone working on this?  Should it be added to the open items list?

I plan to work on it as soon as I'm done fixing the planner issue Andreas
found (which I'm almost done with).  Not sure whether we should consider
it a back-patchable bug fix or something to do only in HEAD, though ---
comments?

regards, tom lane




Re: [HACKERS] upgrade failure from 9.5 to head

2015-08-04 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Aug 4, 2015 at 1:54 PM, Andres Freund and...@anarazel.de wrote:
 On 2015-08-04 13:52:54 -0400, Tom Lane wrote:
 Not sure whether we should consider it a back-patchable bug fix or
 something to do only in HEAD, though --- comments?

 Tentatively I'd say it's a bug and should be back-patched.

 Agreed.  If investigation turns up reasons to worry about
 back-patching it, I'd possibly back-track on that position, but I
 think we should start with the notion that it is back-patchable and
 retreat from that position only at need.

OK.  Certainly we can fix 9.5 the same way as HEAD; the pg_dump code
hasn't diverged much yet.  I'll plan to push it as far back as convenient,
but I won't expend any great effort on making the older branches do it if
they turn out to be too different.

regards, tom lane




Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );

2015-08-04 Thread Alvaro Herrera
Andres Freund wrote:
 On 2015-08-03 14:15:27 +0900, Michael Paquier wrote:

  As long as this only applies on master, this may be fine... We could
  basically pass a LOCKMASK to the multiple layers of tablecmds.c
  instead of LOCKMODE to track all the locks that need to be taken, and
  all the relations open during operations.
 
 This sounds far too complicated to me. Just LockRelationOid() the
 relation with the appropriate level everytime you pass through the
 function?

That opens the door to lock escalation and deadlocks, doesn't it?  You are
probably thinking that it's okay to ignore those, but I don't necessarily
agree with that.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] tablecmds.c and lock hierarchy

2015-08-04 Thread Alvaro Herrera
Michael Paquier wrote:
 On Tue, Aug 4, 2015 at 2:43 PM, Alvaro Herrera alvhe...@2ndquadrant.com 
 wrote:

  Maybe the solution to this is to add the concept of addition of two
  lock modes, where the result is another lock mode that conflicts with
  any lock that would conflict with either of the two operand lock modes.

 That's commutative: it basically looks at the conflict table to take the
 union of the bits for all the locks conflicting with lock A and lock B,
 and then selects the weakest lock mode on the table whose conflict set
 includes that whole union.

Yes.

 Now, let's take for example this case with locks A, B, C, D:
 - Lock A conflicts with ACD
 - B with BCD
 - C with itself
 - D with itself
 What would you choose as a result of add(C,D)? A or B? Or the super
 lock conflicting with all of them?

This appears to me to be a hypothetical case that I don't think occurs in our
conflicts table, so I wouldn't worry about it.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Reduce ProcArrayLock contention

2015-08-04 Thread Robert Haas
On Tue, Aug 4, 2015 at 11:29 AM, Robert Haas robertmh...@gmail.com wrote:
 4. I factored out the actual XID-clearing logic into a new function
 ProcArrayEndTransactionInternal instead of repeating it twice. On the
 flip side, I merged PushProcAndWaitForXidClear with
 PopProcsAndClearXids and renamed the result to ProcArrayGroupClearXid,
 since there seemed to be no need to separate them.

Thinking about this a bit more, it's probably worth sticking an
inline designation on ProcArrayEndTransactionInternal.  Keeping the
time for which we hold ProcArrayLock in exclusive mode down to the
absolute minimum possible number of instructions seems like a good
plan.
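
I.e. roughly this shape (a sketch; the actual body is whatever the patch
settles on):

static inline void
ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
								TransactionId latestXid)
{
	pgxact->xid = InvalidTransactionId;
	proc->lxid = InvalidLocalTransactionId;
	pgxact->xmin = InvalidTransactionId;

	/* Advance the global latestCompletedXid while still holding the lock */
	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
							  latestXid))
		ShmemVariableCache->latestCompletedXid = latestXid;
}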

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Tab completion for CREATE SEQUENCE

2015-08-04 Thread Robert Haas
On Mon, Aug 3, 2015 at 2:17 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Tue, Jul 7, 2015 at 9:14 PM, Andres Freund and...@anarazel.de wrote:
 On 2015-06-19 06:41:19 +, Brendan Jurd wrote:
 I'm marking this Waiting on Author.  Once the problems have been
 corrected, it should be ready for a committer.

 Vik, are you going to update the patch?

 Seeing no activity on this thread and as it would be a waste not to do
 it, I completed the patch as attached. The following things are done:
 - Fixed code indentation
 - Removal of RESTART, SET SCHEMA, OWNER TO, and RENAME TO in
 CREATE SEQUENCE
 - Added START WITH.
 - Added TEMP/TEMPORARY in the set of keywords tracked.
 I am switching at the same time this patch as Ready for committer.

You didn't remove RESTART, so I did that.  And this needed some more
parentheses to make my compiler happy.

Committed with those changes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] track_commit_timestamp and COMMIT PREPARED

2015-08-04 Thread Andres Freund
On 2015-08-04 13:16:52 -0400, Robert Haas wrote:
 On Mon, Aug 3, 2015 at 10:31 AM, Fujii Masao masao.fu...@gmail.com wrote:
  track_commit_timestamp tracks COMMIT PREPARED as expected in the standby
  server, but not in the master server. Is this intentional? Shouldn't it
  track COMMIT PREPARED even in the master? Otherwise, we cannot use the
  commit_timestamp feature to check the replication lag properly while we
  use 2PC.
 
 That sounds like it must be a bug.  I think you should add it to the
 open items list.

Yea, completely agreed. It's probably because twophase.c duplicates a
good bit of xact.c. How about we also add a warning to the respective
places that one should always check the others?
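
Something along these lines in both files, say (the wording is just a
suggestion):

/*
 * NB: RecordTransactionCommitPrepared() in twophase.c duplicates much of
 * this logic; if you change anything here, check whether the same change
 * is needed there, and vice versa.
 */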




Re: [HACKERS] upgrade failure from 9.5 to head

2015-08-04 Thread Andres Freund
On 2015-08-04 13:52:54 -0400, Tom Lane wrote:
 Not sure whether we should consider it a back-patchable bug fix or
 something to do only in HEAD, though --- comments?

Tentatively I'd say it's a bug and should be back-patched.

Andres




Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Alvaro Herrera
Jeff Janes wrote:

 The attached proof of concept patch greatly improves the bloat for both the
 insert and the update cases.  You need to turn on both features: adding the
 pages to fsm, and vacuuming the fsm, to get the benefit (so JJ_GIN=3).  The
 first of those two things could probably be adopted for real, but the
 second probably is not acceptable.  What is the right way to do this?

I think that's the right way, only that I wouldn't call FSMVacuum()
inside the loop but only once you're out of it.

FWIW I had to add a FSMVacuum call somewhere in the BRIN code also,
which I didn't like at all.  If there's a way to optimize this bubbling
up of some individual page all the way to the root level, that would
probably be a lot better than what it's currently doing.
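
That is, roughly this (a sketch against the existing FSM API; the loop
variables are hypothetical):

/* release each deleted pending-list page to the FSM ... */
for (i = 0; i < ndeleted; i++)
	RecordFreeIndexPage(index, freepages[i]);

/* ... then propagate the freed space up the FSM tree just once */
FreeSpaceMapVacuum(index);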

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] track_commit_timestamp and COMMIT PREPARED

2015-08-04 Thread Robert Haas
On Tue, Aug 4, 2015 at 1:18 PM, Andres Freund and...@anarazel.de wrote:
 On 2015-08-04 13:16:52 -0400, Robert Haas wrote:
 On Mon, Aug 3, 2015 at 10:31 AM, Fujii Masao masao.fu...@gmail.com wrote:
  track_commit_timestamp tracks COMMIT PREPARED as expected in the standby
  server, but not in the master server. Is this intentional? Shouldn't it
  track COMMIT PREPARED even in the master? Otherwise, we cannot use the
  commit_timestamp feature to check the replication lag properly while we
  use 2PC.

 That sounds like it must be a bug.  I think you should add it to the
 open items list.

 Yea, completely agreed. It's probably because twophase.c duplicates a
 good bit of xact.c. How about we also add a warning to the respective
 places that one should always check the others?

Sounds like a good idea.  Or we could try to, heh heh, refactor away
the duplication.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] patch: prevent user from setting wal_buffers over 2GB bytes

2015-08-04 Thread Robert Haas
On Tue, Aug 4, 2015 at 9:52 AM, Andres Freund and...@anarazel.de wrote:
 So my vote is for the original approach.  (I've not read Josh's patch,
 so there might be something wrong with it in detail, but I like the
 basic approach.)

 +1

OK, committed and back-patched that all the way back to 9.0.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] tablecmds.c and lock hierarchy

2015-08-04 Thread Robert Haas
On Tue, Aug 4, 2015 at 2:41 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Yep, true as things stand now. But this would get broken if we add a
 new lock level between ShareRowExclusiveLock and AccessExclusiveLock
 that does not respect the current monotone hierarchy between those.

But we're probably not going to do that, so it doesn't matter; and if
we do do it, we can worry about it then.  I don't think this is worth
getting concerned about now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] track_commit_timestamp and COMMIT PREPARED

2015-08-04 Thread Robert Haas
On Mon, Aug 3, 2015 at 10:31 AM, Fujii Masao masao.fu...@gmail.com wrote:
 track_commit_timestamp tracks COMMIT PREPARED as expected in the standby
 server, but not in the master server. Is this intentional? Shouldn't it
 track COMMIT PREPARED even in the master? Otherwise, we cannot use the
 commit_timestamp feature to check the replication lag properly while we
 use 2PC.

That sounds like it must be a bug.  I think you should add it to the
open items list.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] upgrade failure from 9.5 to head

2015-08-04 Thread Robert Haas
On Tue, Aug 4, 2015 at 1:54 PM, Andres Freund and...@anarazel.de wrote:
 On 2015-08-04 13:52:54 -0400, Tom Lane wrote:
 Not sure whether we should consider it a back-patchable bug fix or
 something to do only in HEAD, though --- comments?

 Tentatively I'd say it's a bug and should be back-patched.

Agreed.  If investigation turns up reasons to worry about
back-patching it, I'd possibly back-track on that position, but I
think we should start with the notion that it is back-patchable and
retreat from that position only at need.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] brin index vacuum versus transaction snapshots

2015-08-04 Thread Alvaro Herrera
Robert Haas wrote:
 On Fri, Jul 31, 2015 at 3:45 PM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:

  I think the only way to close this hole is to have summarize_range()
  sleep until all open snapshots are gone after inserting the placeholder
  tuple and before acquiring the snapshot, similarly to how CREATE INDEX
  CONCURRENTLY does it.
 
 That's gonna be really slow, though, right?  Even if you rework things
 so that vacuum inserts all the placeholder tuples first, then waits,
 then does all the summarization, that could easily turn a vacuum that
 would have finished in a second into one that instead takes multiple
 hours.  During that time an AV worker is pinned down, and all sorts of
 badness can ensue.

Yeah, it is bad and I was concerned about that too.  Thankfully I found
another way to solve it, which is to forgo usage of MVCC here and
instead use SnapshotAny.  There's already a mode in
IndexBuildHeapRangeScan that uses SnapshotAny, but it needs some tweaks
to do what we need.  I'm going to propose a patch along those lines
shortly.

 Maybe I'm misunderstanding, but this whole thing seems like a pretty
 serious problem for BRIN.  :-(

With this new approach it shouldn't be.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [sqlsmith] subplan variable reference / unassigned NestLoopParams (was: [sqlsmith] Failed assertion in joinrels.c)

2015-08-04 Thread Tom Lane
Andreas Seltenreich seltenre...@gmx.de writes:
 Tom Lane writes:
 Well, I certainly think all of these represent bugs:
 3 | ERROR:  plan should not reference subplan's variable
 2 | ERROR:  failed to assign all NestLoopParams to plan nodes

 These appear to be related.  The following query produces the former,
 but if you replace the very last reference of provider with the literal
 'bar', it raises the latter error.

Fixed that, thanks for the test case!

 ,[ FWIW: git bisect run ]
 | first bad commit: [e83bb10d6dcf05a666d4ada00d9788c7974ad378]
 | Adjust definition of cheapest_total_path to work better with LATERAL.
 `

There's still something fishy about your git bisect results; they don't
have much to do with what seems to me to be the triggering condition.
I suspect the problem is that git bisect doesn't allow for the possibility
that the symptom might appear and disappear over time, ie it might have
been visible at some early stage of the LATERAL work but been fixed later,
and then reintroduced by still-later optimization efforts.

regards, tom lane




Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Jeff Janes
On Tue, Aug 4, 2015 at 1:39 AM, Simon Riggs si...@2ndquadrant.com wrote:

 On 4 August 2015 at 06:03, Jeff Janes jeff.ja...@gmail.com wrote:


 The attached proof of concept patch greatly improves the bloat for both
 the insert and the update cases.  You need to turn on both features: adding
 the pages to fsm, and vacuuming the fsm, to get the benefit (so JJ_GIN=3).
 The first of those two things could probably be adopted for real, but the
 second probably is not acceptable.  What is the right way to do this?
 Could a variant of RecordFreeIndexPage bubble the free space up the map
 immediately rather than waiting for a vacuum?  It would only have to move
 up until it found a page with freespace already recorded in it, which the
 vast majority of the time would mean observing up one level and then not
 writing to it, assuming the pending list pages remain well clustered.


 You make a good case for action here since insert only tables with GIN
 indexes on text are a common use case for GIN.

 Why would vacuuming the FSM be unacceptable? With a
 large gin_pending_list_limit it makes sense.


But with a smallish gin_pending_list_limit (like the default 4MB) this
could be called a lot (multiple times a second during some spurts), and
would read the entire fsm each time.



 If it is unacceptable, perhaps we can avoid calling it every time, or
 simply have FreeSpaceMapVacuum() terminate more quickly on some kind of
 80/20 heuristic for this case.


Or maybe it could be passed a range of blocks which need vacuuming, so it
concentrated on that range.

But from the README file, it sounds like it is already supposed to be
bubbling up.  I'll have to see just what's going on there when I get a
chance.
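
For instance, a hypothetical range-aware entry point (no such function
exists today; the signature is made up purely to illustrate the idea):

extern void FreeSpaceMapVacuumRange(Relation rel,
									BlockNumber start, BlockNumber end);

/* after recycling pending-list blocks in [first, last] ... */
FreeSpaceMapVacuumRange(index, first, last + 1);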

Cheers,

Jeff


Re: [HACKERS] more-helpful-izing a debug message

2015-08-04 Thread Robert Haas
On Wed, Jul 8, 2015 at 5:38 AM, Marko Tiikkaja ma...@joh.to wrote:
 One of the debug messages related to logical replication could be more
 helpful than it currently is.  The attached patch reorders the two
 operations to make it so.

 Please consider patching and back-patching.

Andres, this looks like a bug fix to me.  What do you think?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-08-04 Thread Robert Haas
On Tue, Aug 4, 2015 at 4:37 PM, Ildus Kurbangaliev
i.kurbangal...@postgrespro.ru wrote:
 A new version of the patch. I used your idea with macros, and with
 tranches, which allowed us to remove the array of names (they can be
 written directly to the corresponding tranche).

You seem not to have addressed a few of the points I brought up here:

http://www.postgresql.org/message-id/CA+TgmoaGqhah0VTamsfaOMaE9uOrCPYSXN8hCS9=wirupjs...@mail.gmail.com

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Raising our compiler requirements for 9.6

2015-08-04 Thread Robert Haas
On Tue, Aug 4, 2015 at 3:55 PM, Andres Freund and...@anarazel.de wrote:
 On 2015-08-04 15:45:44 -0400, Tom Lane wrote:
 I'm not sure that there's any great urgency about changing the instances
 that exist now; the real point of this discussion is that we will allow
 new code to use static inlines in headers.

 I agree that we don't have to (and probably shouldn't) make the required
 configure changes and change definitions.  But I do think some of the
 current macro usage deserves to be cleaned up at some point. I,
 somewhere during 9.4 or 9.5, redefined some of the larger macros into
 static inlines and it both reduced the binary size and gave minor
 performance benefits.

We typically recommend that people write their new code like the
existing code.  If we say that the standards for new code are now
different from old code in this one way, I don't think that's going to
be very helpful to anyone.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] More work on SortSupport for text - strcoll() and strxfrm() caching

2015-08-04 Thread Robert Haas
On Fri, Jul 3, 2015 at 8:33 PM, Peter Geoghegan p...@heroku.com wrote:
 Since apparently we're back to development work, I thought it was time
 to share a patch implementing a few additional simple tricks to make
 sorting text under a non-C locale even faster than in 9.5. These
 techniques are mostly effective when values are physically clustered
 together. This might be because there is a physical/logical
 correlation, but cases involving any kind of clustering of values are
 helped significantly.

Interesting work.

Some comments:

1. My biggest gripe with this patch is that the comments are not easy
to understand.  For example:

+ * Attempt to re-use buffers across calls.  Also, avoid work in the event
+ * of clustered together identical items by exploiting temporal locality.
+ * This works well with divide-and-conquer, comparison-based sorts like
+ * quicksort and mergesort.
+ *
+ * With quicksort, there is, in general, a pretty strong chance that the
+ * same buffer contents can be used repeatedly for pivot items -- early
+ * pivot items will account for large number of total comparisons, since
+ * they must be compared against many (possibly all other) items.

Well, what I would have written is something like: We're likely to be
asked to compare the same strings repeatedly, and memcmp() is so much
cheaper than memcpy() that it pays to attempt a memcmp() in the hopes
of avoiding a memcpy().  This doesn't seem to slow things down
measurably even if it doesn't work out very often.

+ * While it is worth going to the trouble of trying to re-use buffer
+ * contents across calls, ideally that will lead to entirely avoiding a
+ * strcoll() call by using a cached return value.
+ *
+ * This optimization can work well again and again for the same set of
+ * clustered together identical attributes;  as they're relocated to new
+ * subpartitions, only one strcoll() is required for each pivot (in respect
+ * of that clump of identical values, at least).  Similarly, the final
+ * N-way merge of a mergesort can be effectively accelerated if each run
+ * has its own locally clustered values.

And here I would have written something like: If we're comparing the
same two strings that we compared last time, we can return the same
answer without calling strcoll() again.  This is more likely than it
seems, because quicksort compares the same pivot against many values,
and some of those values might be duplicates.

Of course everybody may prefer something different here; I'm just
telling you what I think.
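
For what it's worth, the caching idea boils down to something like this
(the field and variable names here are hypothetical, not the patch's):

/* If both strings are bytewise identical to the pair we compared last
 * time, return the cached strcoll() result without calling it again. */
if (len1 == tss->last_len1 && len2 == tss->last_len2 &&
	memcmp(a1p, tss->buf1, len1) == 0 &&
	memcmp(a2p, tss->buf2, len2) == 0)
	return tss->last_returned;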

2. I believe the change to bttextcmp_abbrev() should be pulled out
into a separate patch and committed separately.  That part  seems like
a slam dunk.

3. What is the worst case for the strxfrm()-reuse stuff?  I suppose
it's the case where we have many strings that are long, all
equal-length, and all different, but only in the last few characters.
Then the memcmp() is as expensive as possible but never works out.
How does the patch hold up in that case?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Jeff Janes
On Tue, Aug 4, 2015 at 6:35 AM, Simon Riggs si...@2ndquadrant.com wrote:

 On 4 August 2015 at 09:39, Simon Riggs si...@2ndquadrant.com wrote:

 On 4 August 2015 at 06:03, Jeff Janes jeff.ja...@gmail.com wrote:


 The attached proof of concept patch greatly improves the bloat for both
 the insert and the update cases.  You need to turn on both features: adding
 the pages to fsm, and vacuuming the fsm, to get the benefit (so JJ_GIN=3).
 The first of those two things could probably be adopted for real, but the
 second probably is not acceptable.  What is the right way to do this?
 Could a variant of RecordFreeIndexPage bubble the free space up the map
 immediately rather than waiting for a vacuum?  It would only have to move
 up until it found a page with freespace already recorded in it, which the
 vast majority of the time would mean observing up one level and then not
 writing to it, assuming the pending list pages remain well clustered.


 You make a good case for action here since insert only tables with GIN
 indexes on text are a common use case for GIN.

 Why would vacuuming the FSM be unacceptable? With a
 large gin_pending_list_limit it makes sense.

 If it is unacceptable, perhaps we can avoid calling it every time, or
 simply have FreeSpaceMapVacuum() terminate more quickly on some kind of
 80/20 heuristic for this case.


 Couple of questions here...

 * the docs say it's desirable to have pending-list cleanup occur in the
 background, but there is no way to invoke that, except via VACUUM. I
 think we need a separate function to be able to call this as a background
 action. If we had that, we wouldn't need much else, would we?


I thought maybe the new bgworker framework would be a way to have a backend
signal a bgworker to do the cleanup when it notices the pending list is
getting large.  But that wouldn't directly fix this issue, because the
bgworker still wouldn't recycle that space (without further changes), only
vacuum workers do that currently.

But I don't think this could be implemented as an extension, because the
signalling code has to be in core, so (not having studied the matter at
all) I don't know if it is a good fit for bgworker.


 * why do we have two parameters: gin_pending_list_limit and fastupdate?
 What happens if we set gin_pending_list_limit but don't set fastupdate?


Fastupdate is on by default.  If it were turned off, then
gin_pending_list_limit would be mostly irrelevant for those tables.
Fastupdate could have been implemented as a magic value (0 or -1) for
gin_pending_list_limit but that would break backwards compatibility (and
arguably would not be a better way of doing things, anyway).


 * how do we know how to set that parameter? Is there a way of knowing
 gin_pending_list_limit has been reached?


I don't think there is an easy answer to that.  The trade-offs are
complex and depend on things like how well cached the parts of the index
needing insertions are, how many lexemes/array elements are in an average
document, and how many documents inserted near the same time as each other
share lexemes in common.  And of course what you need to optimize for:
latency or throughput, and if latency, search latency or insert latency.

This and the OP seem like 9.5 open items to me.


I don't think so.  Freeing gin_pending_list_limit from being forcibly tied
to work_mem is a good thing.  Even if I don't know exactly how to set
gin_pending_list_limit, I know I don't want it to be 4GB just because
work_mem was set there for some temporary reason.  I'm happy to leave it at
its default and let its fine tuning be a topic for people who really care
about every microsecond of performance.

Cheers,

Jeff


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-08-04 Thread Robert Haas
On Tue, Aug 4, 2015 at 4:47 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Aug 4, 2015 at 4:37 PM, Ildus Kurbangaliev
 i.kurbangal...@postgrespro.ru wrote:
 A new version of the patch. I used your idea with macros, and with
 tranches, which allowed us to remove the array of names (they can be
 written directly to the corresponding tranche).

 You seem not to have addressed a few of the points I brought up here:

 http://www.postgresql.org/message-id/CA+TgmoaGqhah0VTamsfaOMaE9uOrCPYSXN8hCS9=wirupjs...@mail.gmail.com

More generally, I'd like to stop smashing all the things that need to
be done here into one patch.  We need to make some changes, such as
the one I proposed earlier today, to make it easier to properly
identify locks.  Let's talk about how to do that and agree on the
details.  Then, once that's done, let's do the main part of the work
afterwards, in a separate commit.  We're running through patch
versions at light speed here, but I'm not sure we're really building
consensus around how to do things.  The actual technical work here
isn't really the problem; that part is easy.  The hard part is
agreeing on the details of how it should work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Raising our compiler requirements for 9.6

2015-08-04 Thread Tom Lane
Andres Freund and...@anarazel.de writes:
 On 2015-08-04 15:20:14 -0400, Robert Haas wrote:
 OK, so do we want to rip out all instances of the static inline dance
 in favor of more straightforward coding?  Do we then shut pandemelon
 and any other affected buildfarm members down as unsupported, or what?

 I think all that happens is that they'll log a couple more warnings
 about defined but unused static functions. configure already defines
 inline away if not supported.

Right.  We had already concluded that this would be safe to do, it's
just a matter of somebody being motivated to do it.

I'm not sure that there's any great urgency about changing the instances
that exist now; the real point of this discussion is that we will allow
new code to use static inlines in headers.

regards, tom lane




Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-08-04 Thread Alvaro Herrera
Ildus Kurbangaliev wrote:

 A new version of the patch. I used your idea with macros, and with
 tranches, which allowed us to remove the array of names (they can be
 written directly to the corresponding tranche).

Just a bystander here, I haven't reviewed this patch at all, but I have
two questions,

1. have you tested this under -DEXEC_BACKEND ?  I wonder if those
initializations are going to work on Windows.

2. why keep the SLRU control locks as individual locks?  Surely we could
put them in the SlruCtl struct and get rid of a few individual lwlocks?

Also, I wonder if we shouldn't be doing this in two parts, one that
changes the underlying lwlock structure and another one to change
pg_stat_activity.


We have CppAsString() in c.h IIRC, which we use instead of # (we do use
## in a few places though).  I wonder if that stuff has any value
anymore.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Raising our compiler requirements for 9.6

2015-08-04 Thread Andres Freund
On 2015-08-04 15:20:14 -0400, Robert Haas wrote:
 OK, so do we want to rip out all instances of the static inline dance
 in favor of more straightforward coding?  Do we then shut pandemelon
 and any other affected buildfarm members down as unsupported, or what?

I think all that happens is that they'll log a couple more warnings
about defined but unused static functions. configure already defines
inline away if not supported.




Re: [HACKERS] More work on SortSupport for text - strcoll() and strxfrm() caching

2015-08-04 Thread Peter Geoghegan
On Tue, Aug 4, 2015 at 12:41 PM, Robert Haas robertmh...@gmail.com wrote:
 Interesting work.

Thanks.

 1. My biggest gripe with this patch is that the comments are not easy
 to understand.

 Of course everybody may prefer something different here; I'm just
 telling you what I think.

I have struggled with trying to put just the right amount of
exposition on the theory behind a particular optimization in source
code comments, and things like that. Since no one is telling me that I
need to write more, clearly I don't have the balance right yet. To a
certain extent, it is a matter of personal style, but I'll try and be
more terse.

 2. I believe the change to bttextcmp_abbrev() should be pulled out
 into a separate patch and committed separately.  That part  seems like
 a slam dunk.

Makes sense.

 3. What is the worst case for the strxfrm()-reuse stuff?  I suppose
 it's the case where we have many strings that are long, all
 equal-length, and all different, but only in the last few characters.
 Then the memcmp() is as expensive as possible but never works out.
 How does the patch hold up in that case?

I haven't tested it. I'll get around to it at some point in the next
couple of weeks. I imagine that it's exactly the same as the memcmp()
equality thing because of factors like speculative execution, and the
fact that we need both strings in cache anyway. It's almost exactly
the same story, although unlike the memcmp() opportunistic equality
pre-check thing, this check happens only n times, not n log n times.

I'm quite sure that the cost needs to be virtually zero to go ahead
with the idea. I think it probably is. Note that like the memcmp()
thing, we check string length first, before a memcmp().

-- 
Peter Geoghegan




Re: [HACKERS] More work on SortSupport for text - strcoll() and strxfrm() caching

2015-08-04 Thread Peter Geoghegan
On Tue, Aug 4, 2015 at 1:30 PM, Peter Geoghegan p...@heroku.com wrote:
 2. I believe the change to bttextcmp_abbrev() should be pulled out
 into a separate patch and committed separately.  That part  seems like
 a slam dunk.

 Makes sense.

BTW, I want to put the string_uint() macro in a common header now. It
can be used for other types. I've written a SortSupport + abbreviated
keys patch for the UUID type which will use it, too, so that it too
can use simple unsigned integer comparisons within its abbreviated
comparator. I haven't posted the UUID patch yet only because everyone
is up to their ears in my sorting patches.

-- 
Peter Geoghegan




Re: [HACKERS] Raising our compiler requirements for 9.6

2015-08-04 Thread Andres Freund
On 2015-08-04 15:45:44 -0400, Tom Lane wrote:
 I'm not sure that there's any great urgency about changing the instances
 that exist now; the real point of this discussion is that we will allow
 new code to use static inlines in headers.

I agree that we don't have to (and probably shouldn't) make the required
configure changes and change definitions.  But I do think some of the
current macro usage deserves to be cleaned up at some point. I,
somewhere during 9.4 or 9.5, redefined some of the larger macros into
static inlines and it both reduced the binary size and gave minor
performance benefits.

- Andres




Re: [HACKERS] Raising our compiler requirements for 9.6

2015-08-04 Thread Robert Haas
On Thu, Jul 2, 2015 at 12:10 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 So far this thread is all about the costs of desupporting compilers
 that don't have these features, and you're making a good argument
 (that I think we all agree with) that the cost is small.  But you
 haven't really said what the benefits are.

 I made the same remark with respect to varargs macros, and I continue
 to think that the cost-benefit ratio there is pretty marginal.

 However, I do think that there's a case to be made for adopting static
 inline.  The INCLUDE_DEFINITIONS dance is very inconvenient, so it
 discourages people from using static inlines over macros, even in cases
 where the macro approach is pretty evil (usually, because of multiple-
 evaluation hazards).  If we allowed static inlines then we could work
 towards having a safer coding standard along the lines of thou shalt
 not write macros that evaluate their arguments more than once.
 So the benefits here are easier development and less risk of bugs.
 On the other side of the ledger, the costs are pretty minimal; we will
 not actually break any platforms, at worst we'll make them unpleasant
 to develop on because of lots of warning messages.  We have some platforms
 like that already, and it's not been a huge problem.

OK, so do we want to rip out all instances of the static inline dance
in favor of more straightforward coding?  Do we then shut pandemelon
and any other affected buildfarm members down as unsupported, or what?
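
For anyone following along, the hazard in question is the classic
multiple-evaluation bug (a textbook example, not code from our tree):

#define Square(x)  ((x) * (x))	/* evaluates its argument twice */
/* Square(i++) increments i twice -- exactly the kind of bug at issue */

static inline int
square(int x)					/* argument evaluated exactly once */
{
	return x * x;
}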

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Dependency between bgw_notify_pid and bgw_flags

2015-08-04 Thread Robert Haas
On Tue, Jul 7, 2015 at 4:34 AM, Ashutosh Bapat
ashutosh.ba...@enterprisedb.com wrote:
 With that notion of backend, to fix the original problem I reported,
 PostmasterMarkPIDForWorkerNotify() should also look at the
 BackgroundWorkerList. As per the comments in the prologue of this function,
 it assumes that only backends need to be notified about worker state change,
 which looks too restrictive. Any backend or background worker, which wants
 to be notified about a background worker it requested to be spawned, should
 be allowed to do so.

Yeah.  I'm wondering if we should fix this problem by just insisting
that all workers have an entry in BackendList.  At first look, this
seems like it would make the code a lot simpler and have virtually no
downside.  As it is, we're repeatedly reinventing new and different
ways for unconnected background workers to do things that work fine in
all other cases, and I don't see the point of that.

I haven't really tested the attached patch - sadly, we have no testing
of background workers without shared memory access in the tree - but
it shows what I have in mind.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 000524d..1818f7c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -155,8 +155,7 @@
  * they will never become live backends.  dead_end children are not assigned a
  * PMChildSlot.
  *
- * Background workers that request shared memory access during registration are
- * in this list, too.
+ * Background workers are in this list, too.
  */
 typedef struct bkend
 {
@@ -404,13 +403,11 @@ static long PostmasterRandom(void);
 static void RandomSalt(char *md5Salt);
 static void signal_child(pid_t pid, int signal);
 static bool SignalSomeChildren(int signal, int targets);
-static bool SignalUnconnectedWorkers(int signal);
 static void TerminateChildren(int signal);
 
 #define SignalChildren(sig)			   SignalSomeChildren(sig, BACKEND_TYPE_ALL)
 
 static int	CountChildren(int target);
-static int	CountUnconnectedWorkers(void);
 static void maybe_start_bgworker(void);
 static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
 static pid_t StartChildProcess(AuxProcType type);
@@ -2414,7 +2411,6 @@ SIGHUP_handler(SIGNAL_ARGS)
 (errmsg("received SIGHUP, reloading configuration files")));
 		ProcessConfigFile(PGC_SIGHUP);
 		SignalChildren(SIGHUP);
-		SignalUnconnectedWorkers(SIGHUP);
 		if (StartupPID != 0)
 			signal_child(StartupPID, SIGHUP);
 		if (BgWriterPID != 0)
@@ -2491,7 +2487,6 @@ pmdie(SIGNAL_ARGS)
 /* and bgworkers too; does this need tweaking? */
 SignalSomeChildren(SIGTERM,
 			   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER);
-SignalUnconnectedWorkers(SIGTERM);
 /* and the autovac launcher too */
 if (AutoVacPID != 0)
 	signal_child(AutoVacPID, SIGTERM);
@@ -2543,11 +2538,11 @@ pmdie(SIGNAL_ARGS)
 signal_child(BgWriterPID, SIGTERM);
 			if (WalReceiverPID != 0)
 signal_child(WalReceiverPID, SIGTERM);
-			SignalUnconnectedWorkers(SIGTERM);
 			if (pmState == PM_RECOVERY)
 			{
+SignalSomeChildren(SIGTERM, BACKEND_TYPE_BGWORKER);
 /*
- * Only startup, bgwriter, walreceiver, unconnected bgworkers,
+ * Only startup, bgwriter, walreceiver, possibly bgworkers,
  * and/or checkpointer should be active in this state; we just
  * signaled the first four, and we don't want to kill
  * checkpointer yet.
@@ -2999,25 +2994,21 @@ CleanupBackgroundWorker(int pid,
 		}
 
 		/* Get it out of the BackendList and clear out remaining data */
-		if (rw->rw_backend)
-		{
-			Assert(rw->rw_worker.bgw_flags & BGWORKER_BACKEND_DATABASE_CONNECTION);
-			dlist_delete(&rw->rw_backend->elem);
+		dlist_delete(&rw->rw_backend->elem);
 #ifdef EXEC_BACKEND
-			ShmemBackendArrayRemove(rw->rw_backend);
+		ShmemBackendArrayRemove(rw->rw_backend);
 #endif

-			/*
-			 * It's possible that this background worker started some OTHER
-			 * background worker and asked to be notified when that worker
-			 * started or stopped.  If so, cancel any notifications destined
-			 * for the now-dead backend.
-			 */
-			if (rw->rw_backend->bgworker_notify)
-				BackgroundWorkerStopNotifications(rw->rw_pid);
-			free(rw->rw_backend);
-			rw->rw_backend = NULL;
-		}
+		/*
+		 * It's possible that this background worker started some OTHER
+		 * background worker and asked to be notified when that worker
+		 * started or stopped.  If so, cancel any notifications destined
+		 * for the now-dead backend.
+		 */
+		if (rw->rw_backend->bgworker_notify)
+			BackgroundWorkerStopNotifications(rw->rw_pid);
+		free(rw->rw_backend);
+		rw->rw_backend = NULL;
 		rw->rw_pid = 0;
 		rw->rw_child_slot = 0;
 		ReportBackgroundWorkerPID(rw);	/* report child death */
@@ -3160,15 +3151,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 			 * Found entry for 

Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-08-04 Thread Ildus Kurbangaliev

 On Aug 4, 2015, at 4:54 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
 
 On 08/04/2015 03:15 PM, Robert Haas wrote:
 On Tue, Jul 28, 2015 at 3:28 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
 * The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync
 with the list of individual locks in lwlock.h. Sooner or later someone will
 add an LWLock and forget to update the names-array. That needs to be made
 less error-prone, so that the names are maintained in the same place as the
 #defines. Perhaps something like rmgrlist.h.
 
 This is a good idea, but it's not easy to do in the style of
 rmgrlist.h, because I don't believe there's any way to define a macro
 that expands to a preprocessor directive.  Attached is a patch that
 instead generates the list of macros from a text file, and also
 generates an array inside lwlock.c with the lock names that gets used
 by the Trace_lwlocks stuff where applicable.
 
 Any objections to this solution to the problem?  If not, I'd like to
 go ahead and push this much.  I can't test the Windows changes
 locally, though, so it would be helpful if someone could check that
 out.
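
For concreteness, the generated output of such a script might look roughly like
this; the array indexes and the treatment of slot 0 here are made up for
illustration, not taken from the actual script:

/* lwlocknames.h, generated from the text file */
#define ShmemIndexLock (&MainLWLockArray[1].lock)
#define OidGenLock (&MainLWLockArray[2].lock)

/* lwlocknames.c, generated alongside for the Trace_lwlocks output */
const char *MainLWLockNames[] = {
	"<unnamed>",
	"ShmemIndexLock",
	"OidGenLock"
};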
 
 A more low-tech solution would be to do something like this in lwlocknames.c:
 
 static char *MainLWLockNames[NUM_INDIVIDUAL_LWLOCKS];
 
 /* Turn a pointer to one of the LWLocks in the main array into an index number */
 #define NAME_LWLOCK(l, name) MainLWLockNames[(l) - MainLWLockArray] = (name)
 
 void
 InitLWLockNames(void)
 {
  NAME_LWLOCK(ShmemIndexLock, "ShmemIndexLock");
  NAME_LWLOCK(OidGenLock, "OidGenLock");
  ...
 }
 
 That would not be auto-generated, so you'd need to keep that list in sync 
 with lwlock.h, but it would be much better than the original patch because if 
 you forgot to add an entry in the names-array, the numbering of all the other 
 locks would not go wrong. And you could have a runtime check that complains 
 if there's an entry missing, like Ildus did in his latest patch.
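
A minimal sketch of such a runtime check inside lwlock.c, with an illustrative
function name and message wording:

static void
CheckLWLockNames(void)
{
	int		i;

	for (i = 0; i < NUM_INDIVIDUAL_LWLOCKS; i++)
	{
		/* complain about any individual lock whose name was never set */
		if (MainLWLockNames[i] == NULL)
			elog(WARNING, "individual LWLock %d has no name assigned", i);
	}
}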
 
 I have no particular objection to your perl script either, though. I'll leave 
 it up to you.
 
 - Heikki
 

A new version of the patch. I used your idea with macros, and with tranches,
which allowed us to remove the separate names array (each name can be written
directly into the corresponding tranche).
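
As a rough sketch of the idea, using the 9.5-era tranche API (the lock group
and function names below are illustrative, not taken from the patch):

static LWLockTranche MyLocksTranche;

void
RegisterMyLocksTranche(LWLock *locks, int tranche_id)
{
	/* the tranche itself carries the display name; no separate names array */
	MyLocksTranche.name = "MyLocks";
	MyLocksTranche.array_base = locks;
	MyLocksTranche.array_stride = sizeof(LWLock);
	LWLockRegisterTranche(tranche_id, &MyLocksTranche);
}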


Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



extend_pg_stat_activity_v7.patch
Description: Binary data



Re: [HACKERS] upgrade failure from 9.5 to head

2015-08-04 Thread Andrew Dunstan


On 08/04/2015 02:09 PM, Tom Lane wrote:

Robert Haas robertmh...@gmail.com writes:

On Tue, Aug 4, 2015 at 1:54 PM, Andres Freund and...@anarazel.de wrote:

On 2015-08-04 13:52:54 -0400, Tom Lane wrote:

Not sure whether we should consider it a back-patchable bug fix or
something to do only in HEAD, though --- comments?

Tentatively I'd say it's a bug and should be back-patched.

Agreed.  If investigation turns up reasons to worry about
back-patching it, I'd possibly back-track on that position, but I
think we should start with the notion that it is back-patchable and
retreat from that position only at need.

OK.  Certainly we can fix 9.5 the same way as HEAD; the pg_dump code
hasn't diverged much yet.  I'll plan to push it as far back as convenient,
but I won't expend any great effort on making the older branches do it if
they turn out to be too different.





From my POV 9.5 is the one that's most critical, because it's the one 
that introduced a regression test that leaves a shell type lying around. 
But as far back as convenient works for me.


cheers

andrew




Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Simon Riggs
On 4 August 2015 at 21:04, Jeff Janes jeff.ja...@gmail.com wrote:


 Couple of questions here...

 * the docs say it's desirable to have pending-list cleanup occur in the
 background, but there is no way to invoke that, except via VACUUM. I
 think we need a separate function to be able to call this as a background
 action. If we had that, we wouldn't need much else, would we?


 I thought maybe the new bgworker framework would be a way to have a
 backend signal a bgworker to do the cleanup when it notices the pending
 list is getting large.  But that wouldn't directly fix this issue, because
 the bgworker still wouldn't recycle that space (without further changes);
 only vacuum workers do that currently.

 But I don't think this could be implemented as an extension, because the
 signalling code has to be in core, so (not having studied the matter at
 all) I don't know if it is good fit for bgworker.


We need to expose 2 functions:

1. a function to perform the recycling directly (BRIN has an equivalent
function)

2. a function to see how big the pending list is for a particular index,
i.e. do we need to run function 1?

We can then build a bgworker that polls the pending list and issues a
recycle if and when needed - which is how autovac started.
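
To make the shape of those two functions concrete, here is a minimal sketch;
the names, argument type, and return values are illustrative only, not a
settled API:

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(gin_pending_list_size);
PG_FUNCTION_INFO_V1(gin_clean_pending_list);

/* function 2: how big is the pending list of the given GIN index? */
Datum
gin_pending_list_size(PG_FUNCTION_ARGS)
{
	Oid		indexoid = PG_GETARG_OID(0);

	/* a real version would open the index and read its metapage */
	elog(NOTICE, "would report pending-list size of index %u", indexoid);
	PG_RETURN_INT64(0);
}

/* function 1: flush and recycle the pending list directly */
Datum
gin_clean_pending_list(PG_FUNCTION_ARGS)
{
	Oid		indexoid = PG_GETARG_OID(0);

	/* a real version would run the internal pending-list cleanup here */
	elog(NOTICE, "would flush pending list of index %u", indexoid);
	PG_RETURN_VOID();
}

A bgworker could then poll function 2 and call function 1 when the size
crosses a threshold.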


 * why do we have two parameters: gin_pending_list_limit and fastupdate?
 What happens if we set gin_pending_list_limit but don't set fastupdate?


 Fastupdate is on by default.  If it were turned off, then
 gin_pending_list_limit would be mostly irrelevant for those tables.
 Fastupdate could have been implemented as a magic value (0 or -1) for
 gin_pending_list_limit but that would break backwards compatibility (and
 arguably would not be a better way of doing things, anyway).


 * how do we know how to set that parameter? Is there a way of knowing
 gin_pending_list_limit has been reached?


 I don't think there is an easy answer to that.  The trade-offs are
 complex and depend on things like how well cached the parts of the index
 needing insertions are, how many lexemes/array elements are in an average
 document, and how many documents inserted near the same time as each other
 share lexemes in common.  And of course what you need to optimize for,
 latency or throughput, and if latency, whether search latency or insert latency.


So we also need a way to count the number of times the pending list is
flushed. Perhaps record that on the metapage, so we can see how often it
has happened - and another function to view those stats.

This and the OP seem like 9.5 open items to me.


 I don't think so.  Freeing gin_pending_list_limit from being forcibly tied
 to work_mem is a good thing.  Even if I don't know exactly how to set
 gin_pending_list_limit, I know I don't want it to be 4GB just because work_mem
 was set there for some temporary reason.  I'm happy to leave it at its
 default and let its fine tuning be a topic for people who really care about
 every microsecond of performance.


OK, I accept this.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] tablecmds.c and lock hierarchy

2015-08-04 Thread Noah Misch
On Tue, Aug 04, 2015 at 07:35:43AM +0100, Simon Riggs wrote:
 On 4 August 2015 at 05:56, Michael Paquier michael.paqu...@gmail.com wrote:
  The thing is that, as mentioned by Alvaro and Andres on this thread,
  we have no guarantee that the different relation locks compared have a
  monotone hierarchy, and we may end up taking a lock that does not
  behave as intended. We are lucky for now that ALTER TABLE
  only uses ShareUpdateExclusiveLock, ShareRowExclusiveLock and
  AccessExclusiveLock, which do form a hierarchy, so this is not a
  problem yet.
  However, it may become a problem if more lock modes are added in the
  future and used by ALTER TABLE.
 
 
 Please provide the link to the discussion of this. I don't see a problem
 here right now that can't be solved by saying
 
 Assert(locklevel == ShareUpdateExclusiveLock ||
        locklevel >= ShareRowExclusiveLock);

Agreed; that addresses the foreseeable future of this threat.
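
For reference, the lock mode levels from src/include/storage/lock.h (as of
9.5), which is what the comparison in that Assert relies on:

#define NoLock					0
#define AccessShareLock			1	/* SELECT */
#define RowShareLock			2	/* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock		3	/* INSERT, UPDATE, DELETE */
#define ShareUpdateExclusiveLock 4	/* VACUUM (non-FULL), ANALYZE, CREATE INDEX
									 * CONCURRENTLY */
#define ShareLock				5	/* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock	6	/* like EXCLUSIVE MODE, but allows ROW SHARE */
#define ExclusiveLock			7	/* blocks ROW SHARE/SELECT...FOR UPDATE */
#define AccessExclusiveLock		8	/* ALTER TABLE, DROP TABLE, VACUUM FULL */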




Re: [HACKERS] patch: prevent user from setting wal_buffers over 2GB bytes

2015-08-04 Thread Takashi Horikawa
 ... Josh's approach of restricting the buffer size seems
 a lot more robust.
I understand that the capping approach of restricting the buffer size
is much more robust and is suitable in this case.

I, however, think that the change from
'page = &XLogCtl->pages[firstIdx * XLOG_BLCKSZ];'
to
'page = &XLogCtl->pages[firstIdx * (Size) XLOG_BLCKSZ];'
does no harm even when restricting the WAL buffer size.

It is in harmony with the usage of 'XLogCtl->pages' found in, for example,
'cachedPos = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;'
in GetXLogBuffer(XLogRecPtr ptr)
and
'NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);'
in AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic), etc.

The only exception is
'page = &XLogCtl->pages[firstIdx * XLOG_BLCKSZ];'
in StartupXLOG(void).
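
To see why the cast matters: with both operands of type int, the
multiplication happens in 32-bit signed arithmetic and can overflow before
the result is ever widened. A standalone illustration (the index value is
hypothetical, chosen only to exceed 2GB):

#include <stdio.h>
#include <stddef.h>

#define XLOG_BLCKSZ 8192

int
main(void)
{
	int		firstIdx = 300000;	/* ~2.4GB of WAL buffers at 8kB pages */

	/* int * int overflows int32 (undefined behavior, typically wraps) */
	long long	bad = firstIdx * XLOG_BLCKSZ;

	/* widening one operand first keeps the multiply in size_t (Size) */
	size_t	good = firstIdx * (size_t) XLOG_BLCKSZ;

	printf("int arithmetic:  %lld\n", bad);
	printf("Size arithmetic: %zu\n", good);
	return 0;
}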
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories


 -Original Message-
 From: pgsql-hackers-ow...@postgresql.org
 [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane
 Sent: Tuesday, August 04, 2015 10:50 PM
 To: Horikawa Takashi(堀川 隆)
 Cc: PostgreSQL-development
 Subject: Re: [HACKERS] patch: prevent user from setting wal_buffers over
 2GB bytes
 
 Takashi Horikawa t-horik...@aj.jp.nec.com writes:
  Why does this cause a core dump?  We could consider fixing whatever
  the problem is rather than capping the value.
 
  As far as I experiment with my own evaluation environment using
  PostgreSQL-9.4.4 on a x86_64 Linux, this problem can be fixed with the
  patch attached.
 
 I'm unsure whether this represents a complete fix ... but even if it does,
 it would be awfully easy to re-introduce similar bugs in future code
 changes, and who would notice?  Josh's approach of restricting the buffer
 size seems a lot more robust.
 
 If there were any practical use-case for such large WAL buffers then it
 might be worth spending some effort/risk here.  But AFAICS, there is not.
 Indeed, capping wal_buffers might be argued to be a good thing in itself
 because it would prevent users from wasting shared memory foolishly.
 
 So my vote is for the original approach.  (I've not read Josh's patch, so
 there might be something wrong with it in detail, but I like the basic
 approach.)
 
   regards, tom lane
 
 




Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-08-04 Thread Michael Paquier
On Tue, Aug 4, 2015 at 8:37 PM, Beena Emerson memissemer...@gmail.com wrote:
 Robert Haas wrote:
Maybe shoehorning this into the GUC mechanism is the wrong thing, and
what we really need is a new config file for this.  The information
we're proposing to store seems complex enough to justify that.


 I think the consensus is that JSON is better.

I guess so as well. Thanks for summarizing the whole thread in a single post.

 And using a new file with multi line support would be good.

This file just contains a JSON blob, hence we just need to fetch its
content entirely and then let the server parse it using the existing
facilities.

 Name of the file: how about pg_syncinfo.conf?
 Backward compatibility: synchronous_standby_names will be supported.
 synchronous_standby_names='pg_syncinfo' indicates use of new file.

This reinforces that parsing is done at SIGHUP, so that sounds fine to me.
We may still run into an application_name that uses 'pg_syncinfo', but
well, that's unlikely to happen...

 JSON format:
 It would contain 2 main keys: sync_info and groups.
 The sync_info would consist of quorum/priority with the count and
 nodes/group with the group name or node list.
 The optional groups key would list out all the groups mentioned within
 sync_info along with their node lists.

 [...]

 Thoughts?

Yes, I think that's the idea. I'll leave it a couple of days to give
people time to offer their opinions and objections regarding this
approach, though.
-- 
Michael



