Re: [HACKERS] RLS restrictive hook policies
On 3 August 2015 at 16:09, Stephen Frost sfr...@snowman.net wrote: Dean, all, * Stephen Frost (sfr...@snowman.net) wrote: Agreed. I'm happy to commit that change and back-patch it to 9.5, barring objections. Given that the only way to have restrictive policies currently is using hooks, I don't believe there's any documentation update required either; of course, I'll update the comment to reflect this discussion/decision and make any necessary changes to test_rls_hooks. Patch attached to make this change, with appropriate comment updates and fixes to the test_rls_hooks modules. Comments? Looks good to me. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minimum tuple threshold to decide last pass of VACUUM
Simon Riggs wrote: * For emergency anti-wraparound VACUUMs we shouldn't scan indexes at all, since they aren't critical path activities at that point It is not possible to skip scanning indexes completely, unless no tuples are to be removed from the heap. Otherwise, index tuples become lingering pointers (and when such heap addresses are later refilled, index scans through them become corrupted). But actually this is an interesting point and I don't think we do this: if in emergency mode, maybe we shouldn't try to remove any dead tuples at all, and instead only freeze very old tuples. That would make such vacuums go much quicker. (More accurately, if the updating xid is older than the freeze point, then remove the tuple, but otherwise keep it.) My point is that emergency vacuums are troublesome for various reasons and it would be better if they did only the minimum possible. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types
Hi We've come across a weirdness with ON CONFLICT, where UPSERTing a smallint value produces an error: db=# INSERT INTO brokentab(id, k1,k2,k3,k4,k5,k6,k7, smallval) VALUES (5,0,0,0,1,0,1,0, 0) ON CONFLICT (id, k1,k2,k3,k4,k5,k6,k7) DO UPDATE SET smallval=EXCLUDED.smallval; ERROR: attribute 29 has wrong type DETAIL: Table has type integer, but query expects smallint. If you change the SET to smallval=0 the problem goes away, although using SET smallval=CAST(EXCLUDED.smallval AS smallint) - or indeed AS int - doesn't help at all. If I create a copy of the table using CREATE mytab (LIKE brokentab INCLUDING ALL); INSERT INTO mytab SELECT * FROM brokentab; the new table does not exhibit the same problem (so I'm assuming it's not easily reproducible and giving you a creation script isn't going to help). VACUUM FULL on the table makes no difference. Is there anything you guys can suggest that I can do to help narrow down the problem? Linux Centos 6.5, kernel 2.6.32-431.el6.i686, pgsql alpha1, built from source using gcc 4.4.7. Thanks Geoff
Re: [HACKERS] [PATCH] Microvacuum for gist.
On Mon, Aug 3, 2015 at 12:27 PM, Anastasia Lubennikova a.lubennik...@postgrespro.ru wrote: 1) Test and results are in attachments. Everything seems to work as expected. 2) I dropped these notices. It was done only for debug purposes. Updated patch is attached. 3) fixed Good! Another couple of notes from me: 1) I think gistvacuumpage() and gistkillitems() need function-level comments. 2) ItemIdIsDead() can be used in index scan like it's done in _bt_checkkeys(). -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

I've added some comments. ItemIdIsDead() is used now (dead tuples are simply skipped as not matching the quals). And there is one more check of the LSN in gistkillitems() to make sure that the page was not changed between reads. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

*** a/src/backend/access/gist/gist.c
--- b/src/backend/access/gist/gist.c
***************
*** 36,42 ****
  static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
  				 bool unlockbuf, bool unlockleftchild);
  static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
  			   GISTSTATE *giststate, List *splitinfo, bool releasebuf);
! 
  #define ROTATEDIST(d) do { \
  	SplitedPageLayout *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \
--- 36,42 ----
  				 bool unlockbuf, bool unlockleftchild);
  static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
  			   GISTSTATE *giststate, List *splitinfo, bool releasebuf);
! static void gistvacuumpage(Relation rel, Page page, Buffer buffer);
  
  #define ROTATEDIST(d) do { \
  	SplitedPageLayout *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \
***************
*** 209,214 **** gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
--- 209,225 ----
  	 * because the tuple vector passed to gistSplit won't include this tuple.
  	 */
  	is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+ 
+ 	/*
+ 	 * If leaf page is full, try at first to delete dead tuples. And then
+ 	 * check again.
+ 	 */
+ 	if ((is_split) && GistPageIsLeaf(page) && GistPageHasGarbage(page))
+ 	{
+ 		gistvacuumpage(rel, page, buffer);
+ 		is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+ 	}
+ 
  	if (is_split)
  	{
  		/* no space for insertion */
***************
*** 1440,1442 **** freeGISTstate(GISTSTATE *giststate)
--- 1451,1522 ----
  	/* It's sufficient to delete the scanCxt */
  	MemoryContextDelete(giststate->scanCxt);
  }
+ 
+ /*
+  * gistvacuumpage() -- try to remove LP_DEAD items from the given page.
+  */
+ static void
+ gistvacuumpage(Relation rel, Page page, Buffer buffer)
+ {
+ 	OffsetNumber deletable[MaxOffsetNumber];
+ 	int			ndeletable = 0;
+ 	OffsetNumber offnum,
+ 				minoff,
+ 				maxoff;
+ 
+ 	/*
+ 	 * Scan over all items to see which ones need to be deleted according to
+ 	 * LP_DEAD flags.
+ 	 */
+ 	maxoff = PageGetMaxOffsetNumber(page);
+ 	for (offnum = FirstOffsetNumber;
+ 		 offnum <= maxoff;
+ 		 offnum = OffsetNumberNext(offnum))
+ 	{
+ 		ItemId		itemId = PageGetItemId(page, offnum);
+ 
+ 		if (ItemIdIsDead(itemId))
+ 			deletable[ndeletable++] = offnum;
+ 	}
+ 
+ 	if (ndeletable > 0)
+ 	{
+ 		START_CRIT_SECTION();
+ 
+ 		PageIndexMultiDelete(page, deletable, ndeletable);
+ 
+ 		/*
+ 		 * Mark the page as not containing any LP_DEAD items. This is not
+ 		 * certainly true (there might be some that have recently been marked,
+ 		 * but weren't included in our target-item list), but it will almost
+ 		 * always be true and it doesn't seem worth an additional page scan to
+ 		 * check it. Remember that F_HAS_GARBAGE is only a hint anyway.
+ 		 */
+ 		GistClearPageHasGarbage(page);
+ 
+ 		MarkBufferDirty(buffer);
+ 
+ 		/* XLOG stuff */
+ 		if (RelationNeedsWAL(rel))
+ 		{
+ 			XLogRecPtr	recptr;
+ 
+ 			recptr = gistXLogUpdate(rel->rd_node, buffer,
+ 									deletable, ndeletable,
+ 									NULL, 0, InvalidBuffer);
+ 
+ 			PageSetLSN(page, recptr);
+ 		}
+ 		else
Re: [HACKERS] Planner debug views
Qingqing Zhou wrote: On Thu, Jul 30, 2015 at 2:42 PM, Jim Nasby jim.na...@bluetreble.com wrote: I think a better option would be shoving it into a backend tuplestore and just leaving it there (maybe with a command to clear it for the paranoid). That gives a relation you can query against, insert into another table, etc. This is something I don't know how to do: in my understanding, a tuplestore is an internal store, which means it has no exposed name. Thus we can't reference it later. Yeah, that doesn't sound like the kind of project you should attempt here. EXPLAIN already knows how to return tuples, so I was assuming you would return your stuff using that mechanism. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file
On Mon, Aug 3, 2015 at 8:55 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Thu, Jul 16, 2015 at 9:58 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jul 16, 2015 at 9:41 AM, Fujii Masao masao.fu...@gmail.com wrote: Here are some minor comments:

+			ereport(LOG,
+					(errmsg("ignoring \"%s\" file because no \"%s\" file exists",
+							TABLESPACE_MAP, BACKUP_LABEL_FILE),
+					 errdetail("could not rename file \"%s\" to \"%s\": %m",
+							   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));

WARNING is better than LOG here because it indicates a problematic case? No, that's not the right distinction. Remember that, when sending messages to the client, WARNING > LOG, and when sending messages to the log, LOG > WARNING. So messages that a user is more likely to care about than the administrator should be logged at WARNING; those that the administrator is more likely to care about should be LOG. I think LOG is clearly the appropriate thing here. In the detail message, the first word of the sentence needs to be capitalized.

+					 errdetail("renamed file \"%s\" to \"%s\"",
+							   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));

In a detail message, basically we should use a complete sentence. So like other similar detail messages in xlog.c, I think that it's better to use "\"%s\" was renamed to \"%s\"." as the detail message here. Right, that's what the style guidelines say. I have changed the errdetail messages as per the above suggestion. Also, there was one issue (it was displaying 2 messages when rename fails) in the patch which I have fixed in the updated version. Thanks! Pushed. BTW, while reading the code related to tablespace_map, I found that CancelBackup() emits the WARNING message "online backup mode was not canceled" when rename() fails. Isn't this confusing (or incorrect)? ISTM that we can see that the online backup mode has already been canceled if the backup_label file is successfully removed, whether the tablespace_map file remains or not. No? Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] track_commit_timestamp and COMMIT PREPARED
Hi, track_commit_timestamp tracks COMMIT PREPARED as expected on a standby server, but not on the master. Is this intentional? Shouldn't it track COMMIT PREPARED on the master as well? Otherwise, we cannot use the commit timestamp feature to check the replication lag properly while we use 2PC. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types
On 3 August 2015 at 16:53, Geoff Winkless pgsqlad...@geoff.dj wrote: If I create a copy of the table using CREATE mytab (LIKE brokentab INCLUDING ALL); Of course I meant CREATE TABLE here... finger slippage :)
Re: [HACKERS] optimizing vacuum truncation scans
On Mon, Jul 27, 2015 at 1:40 PM, Simon Riggs si...@2ndquadrant.com wrote: On 22 July 2015 at 17:11, Jeff Janes jeff.ja...@gmail.com wrote: On Wed, Jul 22, 2015 at 6:59 AM, Robert Haas robertmh...@gmail.com wrote: On Mon, Jun 29, 2015 at 1:54 AM, Jeff Janes jeff.ja...@gmail.com wrote: Attached is a patch that implements the vm scan for truncation. It introduces a variable to hold the last blkno which was skipped during the forward portion. Any blocks after both this blkno and after the last inspected nonempty page (which the code is already tracking) must have been observed to be empty by the current vacuum. Any other process rendering the page nonempty is required to clear the vm bit, and no other process can set the bit again during the vacuum's lifetime. So if the bit is still set, the page is still empty without needing to inspect it. Urgh. So if we do this, that forever precludes having HOT pruning set the all-visible bit. I wouldn't say forever, as it would be easy to revert the change if something more important came along that conflicted with it. I think what is being said here is that someone is already using this technique, or if not, then we actively want to encourage them to do so as an extension or as a submission to core. In that case, I think the rely-on-VM technique sinks again, sorry Jim, Jeff. Probably needs code comments added. Sure, that sounds like the consensus. The VM method was very efficient, but I agree it is pretty fragile and restricting. That does still leave the prefetch technique, so all is not lost. Can we see a patch with just prefetch, probably with a simple choice of stride? Thanks. I probably won't get back to it this commit fest, so it can be set to returned with feedback. But if anyone has good ideas for how to set the stride (or detect that it is on SSD and so is pointless to try) I'd love to hear about them anytime. Cheers, Jeff
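[Editor's note: for readers picking this thread up later, here is a rough sketch of the prefetch-only approach being discussed. PrefetchBuffer() is the real bufmgr.h entry point; the stride constant, the helper name, and its placement in the backward truncation scan are illustrative assumptions, and choosing the stride (or skipping prefetch on SSDs) remains the open question raised above.]

#include "postgres.h"

#include "storage/bufmgr.h"

/* Straw-man stride value; tuning it is precisely the unsolved problem. */
#define PREFETCH_STRIDE 32

/*
 * Hypothetically called for each block visited by the backward scan that
 * decides how far the relation can be truncated.  Whenever the scan
 * crosses a stride boundary, request the next stride of lower-numbered
 * blocks, so the following ReadBuffer() calls find pages already in flight.
 */
static void
prefetch_for_truncation_scan(Relation onerel, BlockNumber blkno)
{
	if (blkno % PREFETCH_STRIDE == 0)
	{
		BlockNumber start = (blkno >= PREFETCH_STRIDE) ? blkno - PREFETCH_STRIDE : 0;
		BlockNumber pblkno;

		for (pblkno = start; pblkno < blkno; pblkno++)
			PrefetchBuffer(onerel, MAIN_FORKNUM, pblkno);
	}
}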
Re: [HACKERS] buffer locking fix for lazy_scan_heap
On Mon, Aug 3, 2015 at 12:03 AM, 高增琦 pgf...@gmail.com wrote: sorry for asking about this really old commit. http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=7ab9b2f3b79177e501a1ef90ed004cc68788abaf I could not understand why releasing the lock on the buffer is a problem, since vacuum will lock and check the page bit again before setting the vm bit. Did I miss something? Hmm, you might be right. It certainly seems safer to do all the work without releasing the buffer lock temporarily in the middle, but maybe the old way wasn't actually broken. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );
On Mon, Aug 3, 2015 at 2:15 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Sat, Aug 1, 2015 at 9:20 PM, Andres Freund wrote: On August 1, 2015 2:17:24 PM GMT+02:00, Michael Paquier wrote: For instance, if you told me to choose between ShareLock and ShareUpdateExclusiveLock I wouldn't know which one is strongest. I don't think it's sensible to have the lock mode compare primitive, honestly. I don't have any great ideas to offer ATM, sadly. Yes, the thing is that lowering the lock levels is good for concurrency, but the non-monotonicity of the lock levels makes it impossible to choose an intermediate state correctly. How about simply acquiring all the locks individually if they're different types? These few acquisitions won't matter. As long as this only applies on master, this may be fine... We could basically pass a LOCKMASK to the multiple layers of tablecmds.c instead of LOCKMODE to track all the locks that need to be taken, and all the relations open during operations. Would it be worth having some routines like relation_multi_[open|close]() as well? Like that: Relation[] relation_multi_open(relation Oid, LOCKMASK); void relation_multi_close(Relation[]); If we do something, we may consider patching 9.5 as well; it seems to me that tablecmds.c is broken by assuming that the lock hierarchy is monotone in AlterTableGetLockLevel(). Hey guys, Are you sure we need to do all these changes just to check the highest lock level based on the reloptions? Or have I misunderstood the whole thing? Maybe we just check DoLockModesConflict in the new method GetReloptionsLockLevel and if true then fall back to AccessExclusiveLock (as Michael suggested in a previous email). Regards, *** This work is funded by Zenvia Mobile Results (http://www.zenvia.com.br) -- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL Timbira: http://www.timbira.com.br Blog: http://fabriziomello.github.io Linkedin: http://br.linkedin.com/in/fabriziomello Twitter: http://twitter.com/fabriziomello Github: http://github.com/fabriziomello
Re: [HACKERS] Transactions involving multiple postgres foreign servers
On 2015-08-03 PM 09:24, Ashutosh Bapat wrote: On Sat, Aug 1, 2015 at 12:18 AM, Robert Haas robertmh...@gmail.com wrote: OK, sure. But let's make sure postgres_fdw gets a server-level option to control this. For postgres_fdw it's a boolean server-level option 'twophase_compliant' (suggestions for name welcome). How about just 'twophase'? Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Arguable RLS security bug, EvalPlanQual() paranoia
* Peter Geoghegan (p...@heroku.com) wrote: On Mon, Jun 1, 2015 at 12:29 AM, Peter Geoghegan p...@heroku.com wrote: If you're using another well known MVCC database system that has RLS, I imagine when this happens the attacker similarly waits on the conflicting (privileged) xact to finish (in my example in the patch, Bob's xact). However, unlike with the Postgres READ COMMITTED mode, Mallory would then have her malicious UPDATE statement entirely rolled back, and her statement would acquire an entirely new MVCC snapshot, to be used by the USING security barrier qual (and everything else) from scratch. This other system would then re-run her UPDATE with the new MVCC snapshot. This would repeat until Mallory's UPDATE statement completes without encountering any concurrent UPDATEs/DELETEs to her would-be affected rows. In general, with this other database system, an UPDATE must run to completion without violating MVCC, even in READ COMMITTED mode. For that reason, I think we can take no comfort from the presumption that this flexibility in USING security barrier quals (allowing subqueries, etc) works securely in this other system. (I actually didn't check this out, but I imagine it's true). Where are we on this? Discussion during pgCon with Heikki and Andres led me to believe that the issue is acceptable. The issue can be documented to help ensure that user expectation is in line with actual user-visible behavior. Unfortunately, I think that that will be a clunky documentation patch. It's important to realize that this is an issue beyond RLS and that it impacts Security Barrier Views also. One idea which I have for making the documentation patch a bit less clunky is to provide a simple way for users to address the issue. Along those lines, here's what I'd suggest (certainly open for discussion): --- When reducing the set of rows which a user has access to, through modifications to relations referenced by Row-Level Security Policies or Security Barrier Views, be aware that users with a currently open transaction might have a lock on an existing row and be able to see that row after the change. The best approach to avoid any possible leak of information during a reduction of a user's rights is to ensure that the user does not have any open transactions, perhaps by ensuring they have no concurrent sessions running. --- Thoughts? Trying to keep it straight-forward and provide a simple solution for users to be able to address the issue, if they're worried about it. Perhaps this, plus an additional paragraph which goes into more detail about exactly what's going on? Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] Arguable RLS security bug, EvalPlanQual() paranoia
On Mon, Aug 3, 2015 at 3:07 PM, Stephen Frost sfr...@snowman.net wrote: Thoughts? Trying to keep it straight-forward and provide a simple solution for users to be able to address the issue, if they're worried about it. Perhaps this, plus an additional paragraph which goes into more detail about exactly what's going on? I'm still thinking about it, but I think you have the right idea here. However, as the docs put it when talking about conventional access controls and SELECT: The use of FOR UPDATE or FOR SHARE requires UPDATE privilege as well [as SELECT privilege] (for at least one column of each table so selected). I wonder if RLS needs to consider this, too. If you can just say that you don't have to worry about this when the user has no right to UPDATE or DELETE the rows in the first place, that makes it more practical to manage the issue. ISTM that as things stand, you can't say that because RLS does not consider SELECT FOR UPDATE special in any way. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );
On Tue, Aug 4, 2015 at 9:12 AM, Fabrízio de Royes Mello fabriziome...@gmail.com wrote: Are you sure we need to do all these changes just to check the highest lock level based on the reloptions? Well, looking at the code, that's what it looks like. That's a large change, and not that straightforward. Or have I misunderstood the whole thing? No, I think we are on the same page. Maybe we just check DoLockModesConflict in the new method GetReloptionsLockLevel and if true then fall back to AccessExclusiveLock (as Michael suggested in a previous email). Honestly, that's what I would suggest for this patch, and also to fix the existing problem in tablecmds.c that makes the same assumption. It seems saner to me for now than adding a whole new level of routines and wrappers, and your patch has IMO great value when each ALTER command is kicked individually. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] FSM versus GIN pending list bloat
For a GIN index with fastupdate turned on, both the user backends and the autoanalyze routine will clear out the pending list, pushing the entries into the normal index structure and deleting the pages used by the pending list. But those deleted pages will not get added to the free space map until a vacuum is done. This leads to horrible bloat on insert-only tables, as they are never vacuumed and so the pending-list space is never reused. And the pending list is very inefficient in space usage to start with, even compared to the old-style posting lists and especially compared to the new compressed ones. (If the pages were aggressively recycled, this inefficient use wouldn't be much of a problem.) Even on a table receiving mostly updates after its initial population (and so being vacuumed regularly) with default autovacuum settings, there is a lot of bloat. The attached proof-of-concept patch greatly improves the bloat for both the insert and the update cases. You need to turn on both features: adding the pages to the FSM, and vacuuming the FSM, to get the benefit (so JJ_GIN=3). The first of those two things could probably be adopted for real, but the second probably is not acceptable. What is the right way to do this? Could a variant of RecordFreeIndexPage bubble the free space up the map immediately rather than waiting for a vacuum? It would only have to move up until it found a page with free space already recorded in it, which the vast majority of the time would mean looking up one level and then not writing to it, assuming the pending-list pages remain well clustered. Or would a completely different approach be better, like managing the vacated pending-list pages directly in the index without going to the FSM? Cheers, Jeff gin_fast_freespace.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
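[Editor's note: one possible shape of the "make the free space visible immediately" idea, sketched under stated assumptions. RecordFreeIndexPage() and FreeSpaceMapVacuum() are real freespace-map entry points; the helper name and the idea of vacuuming the FSM right after releasing the pending-list pages (rather than waiting for VACUUM) are the experimental part, roughly what the JJ_GIN=3 behavior of the proof-of-concept patch does.]

#include "postgres.h"

#include "storage/freespace.h"
#include "storage/indexfsm.h"

/*
 * Hypothetical hook into the pending-list cleanup: make the pages just
 * deleted by the list shrink reusable right away.
 */
static void
release_pending_list_pages(Relation index, BlockNumber *freed, int nfreed)
{
	int			i;

	/* Record each deleted pending-list page in the free space map... */
	for (i = 0; i < nfreed; i++)
		RecordFreeIndexPage(index, freed[i]);

	/*
	 * ...and propagate the new free space up through the FSM tree now,
	 * instead of waiting for the next VACUUM, so that inserters can
	 * actually find and reuse the pages immediately.
	 */
	FreeSpaceMapVacuum(index);
}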
[HACKERS] tablecmds.c and lock hierarchy
Hi all, As mentioned in the thread related to lowering locks of autovacuum reloptions in ALTER TABLE SET (http://www.postgresql.org/message-id/cafcns+ox7jvenc_3i54fdq3ibmogmknc2tmevdsmvojbsxg...@mail.gmail.com), I have noticed the following code in AlterTableGetLockLevel@tablecmds.c:

		/*
		 * Take the greatest lockmode from any subcommand
		 */
		if (cmd_lockmode > lockmode)
			lockmode = cmd_lockmode;

The thing is that, as mentioned by Alvaro and Andres on this thread, we have no guarantee that the different relation locks compared have a monotone hierarchy and we may finish by taking a lock that does not behave as you would like to. We are now lucky enough that ALTER TABLE only uses ShareUpdateExclusiveLock, ShareRowExclusiveLock and AccessExclusiveLock, which actually have a hierarchy, so this is not a problem yet. However it may become a problem if we add more lock modes in the future that are used by ALTER TABLE. One idea mentioned in the ALTER TABLE SET thread was to enforce the lock to AccessExclusiveLock should two locks of any subcommands differ. Another idea was to select all the locks of the subcommands found to be present, and open the different relations with all of them. Tracking all those locks is going to require some heavy rewriting, like for example the use of a LOCKMASK to track the locks that need to be taken, and more extended routines for relation_open. Other possibilities may be to add a comment on top of AlterTableGetLockLevel so that we do not use a lock in ALTER TABLE commands that may result in a choice conflict with the other ones, to simply do nothing because committers are usually very careful people, or to track all the locks taken in AlterTableGetLockLevel and then run DoLockModesConflict and ERROR if there are incompatible locks. I am attaching a patch that implements the first approach mentioned, to give a short idea of how things could be improved. Thoughts? -- Michael

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 970abd4..0bcdd76 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2807,7 +2807,8 @@ AlterTableInternal(Oid relid, List *cmds, bool recurse)
  * Function is called before and after parsing, so it must give same
  * answer each time it is called. Some subcommands are transformed
  * into other subcommand types, so the transform must never be made to a
- * lower lock level than previously assigned. All transforms are noted below.
+ * lock level different than previously assigned. All transforms are noted
+ * below.
  *
  * Since this is called before we lock the table we cannot use table metadata
  * to influence the type of lock we acquire.
@@ -2837,6 +2838,7 @@ AlterTableGetLockLevel(List *cmds)
 	 */
 	ListCell   *lcmd;
 	LOCKMODE	lockmode = ShareUpdateExclusiveLock;
+	bool		is_init = false;
 
 	foreach(lcmd, cmds)
 	{
@@ -3057,10 +3059,22 @@ AlterTableGetLockLevel(List *cmds)
 		}
 
 		/*
-		 * Take the greatest lockmode from any subcommand
+		 * At the first command evaluated, take the lock corresponding to
+		 * it. When looking at the next commands, check if the lock taken
+		 * is different than the first one taken and enforce it to
+		 * AccessExclusiveLock, which is the only lock superseding all
+		 * the others.
		 */
-		if (cmd_lockmode > lockmode)
+		if (!is_init)
+		{
 			lockmode = cmd_lockmode;
+			is_init = true;
+		}
+		else
+		{
+			if (cmd_lockmode != lockmode)
+				lockmode = AccessExclusiveLock;
+		}
 	}
 
 	return lockmode;

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_rewind tap test unstable
On Mon, Aug 3, 2015 at 5:35 PM, Christoph Berg m...@debian.org wrote: Re: Michael Paquier 2015-07-28 cab7npqqcpgy3u7cmfo8sqquobsfmeieohueslxwycc64j3g...@mail.gmail.com On Tue, Jul 28, 2015 at 5:57 PM, Christoph Berg m...@debian.org wrote: for something between 10% and 20% of the devel builds for apt.postgresql.org (which happen every 6h if there's a git change, so it happens every few days), I'm seeing this: Dubious, test returned 1 (wstat 256, 0x100) Failed 1/8 subtests I don't have the older logs available, but from memory, the subtest failing and the two numbers mentioned are always the same. There should be some output logs in src/bin/pg_rewind/tmp_check/log/*? Could you attach them here if you have them? That would be helpful to understand what is happening. It took me a few attempts to tell the build environment to save a copy on failure and not shred everything right away. So here we go: In test case 2, the failure is that the standby did not have the time to replicate the database "beforepromotion" that was created on the master. One possible explanation for this failure is that the standby was promoted before all the WAL needed for the tests had been replayed, hence we had better be sure that the replay_location of the standby matches pg_current_xlog_location() before promotion. On the buildfarm, hamster, the legendary slowest machine of the whole set, does not complain about that. Is your environment that heavily loaded when you run the tests? Perhaps the attached patch helps? -- Michael

diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm
index b66ff0d..fce725f 100644
--- a/src/bin/pg_rewind/RewindTest.pm
+++ b/src/bin/pg_rewind/RewindTest.pm
@@ -232,6 +232,13 @@ sub promote_standby
 	# Now run the test-specific parts to run after standby has been started
 	# up standby
 
+	# Wait until the standby has caught up with the primary, by polling
+	# pg_stat_replication.
+	my $caughtup_query =
+	  "SELECT pg_current_xlog_location() <= replay_location FROM pg_stat_replication WHERE application_name = 'rewind_standby';";
+	poll_query_until($caughtup_query, $connstr_master)
+	  or die "Timed out while waiting for standby to catch up";
+
 	# Now promote slave and insert some new data on master, this will put
 	# the master out-of-sync with the standby. Wait until the standby is
 	# out of recovery mode, and is ready to accept read-write connections.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Jul 31, 2015 4:22 AM, Jeremy Harris j...@wizmail.org wrote: On 30/07/15 02:05, Peter Geoghegan wrote: Since heapification is now a big fraction of the total cost of a sort sometimes, even where the heap invariant need not be maintained for any length of time afterwards, it might be worth revisiting the patch to make that an O(n) rather than an O(n log n) operation [3]. O(n log n)? Heapification is O(n) already, whether by siftup (existing) or siftdown. They are both linear on average, but the way we currently do it has an O(n log n) worst case, while the other way is linear even in the worst case. Cheers, Jeff
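[Editor's note: for reference, the linear-worst-case construction being discussed is bottom-up ("Floyd") heapification: sift each internal node down, starting from the last internal node and working back to the root. A self-contained sketch, with plain ints standing in for SortTuples:]

static void
sift_down(int *a, int n, int i)
{
	for (;;)
	{
		int		largest = i;
		int		l = 2 * i + 1;
		int		r = 2 * i + 2;

		if (l < n && a[l] > a[largest])
			largest = l;
		if (r < n && a[r] > a[largest])
			largest = r;
		if (largest == i)
			break;				/* heap property holds below this node */

		/* swap the node with its larger child and continue downward */
		{
			int		tmp = a[i];

			a[i] = a[largest];
			a[largest] = tmp;
		}
		i = largest;
	}
}

/* O(n) even in the worst case, unlike inserting n elements via siftup. */
static void
heapify(int *a, int n)
{
	int		i;

	for (i = n / 2 - 1; i >= 0; i--)
		sift_down(a, n, i);
}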
Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file
On Mon, Aug 3, 2015 at 7:44 PM, Fujii Masao masao.fu...@gmail.com wrote: Thanks! Pushed. Thanks to you as well for committing the patch. BTW, while reading the code related to tablespace_map, I found that CancelBackup() emits the WARNING message "online backup mode was not canceled" when rename() fails. Isn't this confusing (or incorrect)? Yes, it looks confusing. ISTM that we can see that the online backup mode has already been canceled if the backup_label file is successfully removed, whether the tablespace_map file remains or not. No? I think what we should do is display the successful-cancellation message only when both files are renamed. I have drafted a patch (I still need to verify/test it; I will do that if you think the fix is in the right direction) to show what I have in mind. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com fix_rename_backup_and_mapfile_cancel_v1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] cost_agg() with AGG_HASHED does not account for startup costs
While working on allowing the planner to perform GROUP BY before joining, I've noticed that cost_agg() completely ignores input_startup_cost when aggstrategy == AGG_HASHED. I can see at least 3 calls to cost_agg() which pass aggstrategy as AGG_HASHED while also passing a possibly non-zero startup cost. The attached patch changes cost_agg() to include the startup cost, but does nothing yet about the plans that change in the regression tests. Is this really intended? Regards David Rowley -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7069f60..06870a3 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1639,7 +1639,8 @@ cost_agg(Path *path, PlannerInfo *root,
 	else
 	{
 		/* must be AGG_HASHED */
-		startup_cost = input_total_cost;
+		startup_cost = input_startup_cost;
+		startup_cost += input_total_cost;
 		startup_cost += aggcosts->transCost.startup;
 		startup_cost += aggcosts->transCost.per_tuple * input_tuples;
 		startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minimum tuple threshold to decide last pass of VACUUM
On Tue, Aug 4, 2015 at 2:04 AM, Simon Riggs si...@2ndquadrant.com wrote: On 3 August 2015 at 17:36, Tom Lane t...@sss.pgh.pa.us wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: Simon Riggs wrote: * For emergency anti-wraparound VACUUMs we shouldn't scan indexes at all, since they aren't critical path activities at that point It is not possible to skip scanning indexes completely, unless no tuples are to be removed from the heap. Right. But actually this is an interesting point and I don't think we do this: if in emergency mode, maybe we shouldn't try to remove any dead tuples at all, and instead only freeze very old tuples. +1 ... not sure if that's what Simon had in mind exactly, but it seems like a correct statement of what he was getting at. Yes, that's what I was thinking, I just didn't say actually it. I'd been thinking about having VACUUM do just Phase 1 for some time, since its so much faster to do that. Will code. Interesting. I'll be happy to have a look at any patch produced, that's surely something we want to improve in emergency mode. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tablecmds.c and lock hierarchy
Michael Paquier wrote: As mentioned in the thread related to lowering locks of autovacuum reloptions in ALTER TABLE SET (http://www.postgresql.org/message-id/cafcns+ox7jvenc_3i54fdq3ibmogmknc2tmevdsmvojbsxg...@mail.gmail.com), I have noticed the following code in AlterTableGetLockLevel@tablecmds.c:

		/*
		 * Take the greatest lockmode from any subcommand
		 */
		if (cmd_lockmode > lockmode)
			lockmode = cmd_lockmode;

The thing is that, as mentioned by Alvaro and Andres on this thread, we have no guarantee that the different relation locks compared have a monotone hierarchy and we may finish by taking a lock that does not behave as you would like to. Maybe the solution to this is to add the concept of addition of two lock modes, where the result is another lock mode that conflicts with any lock that would conflict with either of the two operand lock modes. For instance, if you add a lock mode to itself, you get the same lock mode; if you add anything to AccessExclusive, you get AccessExclusive; if you add anything to AccessShare, you end up with that other lock mode. The most interesting case in our current lock table is if you add ShareUpdateExclusive to Share, where you end up with ShareRowExclusive. In essence, holding the result of that operation is the same as holding both locks, which AFAICS is the behavior we want. This can be implemented with a simple table and solves the problem for both ALTER TABLE SET and this existing usage you cite and is not as invasive as your first proposed solution. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
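[Editor's note: a sketch of how such an addition could even be derived mechanically from the conflict matrix instead of hand-maintaining a second table. The conflict masks below mirror lock.c's private LockConflicts[] table (bit i set in entry m means mode m conflicts with mode i); lock_mode_add() itself is a made-up name, not an existing routine.]

#define LOCKBIT(m) (1 << (m))

enum
{
	AccessShareLock = 1, RowShareLock, RowExclusiveLock,
	ShareUpdateExclusiveLock, ShareLock, ShareRowExclusiveLock,
	ExclusiveLock, AccessExclusiveLock
};

/* Mirror of lock.c's conflict table; entry 0 (NoLock) is unused. */
static const int lock_conflicts[] = {
	0,
	LOCKBIT(8),											/* AccessShare */
	LOCKBIT(7) | LOCKBIT(8),							/* RowShare */
	LOCKBIT(5) | LOCKBIT(6) | LOCKBIT(7) | LOCKBIT(8),	/* RowExclusive */
	LOCKBIT(4) | LOCKBIT(5) | LOCKBIT(6) | LOCKBIT(7) | LOCKBIT(8),	/* ShareUpdateExclusive */
	LOCKBIT(3) | LOCKBIT(4) | LOCKBIT(6) | LOCKBIT(7) | LOCKBIT(8),	/* Share */
	LOCKBIT(3) | LOCKBIT(4) | LOCKBIT(5) | LOCKBIT(6) | LOCKBIT(7) | LOCKBIT(8),	/* ShareRowExclusive */
	LOCKBIT(2) | LOCKBIT(3) | LOCKBIT(4) | LOCKBIT(5) | LOCKBIT(6) | LOCKBIT(7) | LOCKBIT(8),	/* Exclusive */
	LOCKBIT(1) | LOCKBIT(2) | LOCKBIT(3) | LOCKBIT(4) | LOCKBIT(5) | LOCKBIT(6) | LOCKBIT(7) | LOCKBIT(8)	/* AccessExclusive */
};

/* Return the weakest mode whose conflict set covers both operands'. */
static int
lock_mode_add(int a, int b)
{
	int			wanted = lock_conflicts[a] | lock_conflicts[b];
	int			m;

	for (m = AccessShareLock; m <= AccessExclusiveLock; m++)
		if ((lock_conflicts[m] & wanted) == wanted)
			return m;
	return AccessExclusiveLock; /* unreachable: it conflicts with all modes */
}

With this definition, lock_mode_add(ShareUpdateExclusiveLock, ShareLock) indeed returns ShareRowExclusiveLock, matching the interesting case above, and adding any mode to itself returns that same mode.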
Re: [HACKERS] Support for N synchronous standby servers - take 2
On Thu, Jul 30, 2015 at 2:16 PM, Beena Emerson memissemer...@gmail.com wrote: Hello, Just looking at how the 2 different methods can be used to set the s_s_names value. 1. For a simple case where quorum is required for a single group the JSON could be:

{"sync_standby_names":
    {"quorum": 2,
     "nodes": ["node1", "node2", "node3"]
    }
}

or

{"sync_standby_names":
    {"quorum": 2,
     "group": "cluster1"
    },
 "groups":
    {"cluster1": ["node1", "node2", "node3"]}
}

Language: 2(node1, node2, node3)

2. For having quorum between different groups and a node:

{"sync_standby_names":
    {"quorum": 2,
     "nodes":
        [{"priority": 1, "nodes": ["node0"]},
         {"quorum": 2, "group": "cluster1"}]
    },
 "groups":
    {"cluster1": ["node1", "node2", "node3"]}
}

or

{"sync_standby_names":
    {"quorum": 2,
     "nodes":
        [{"priority": 1, "group": "cluster2"},
         {"quorum": 2, "group": "cluster1"}]
    },
 "groups":
    {"cluster1": ["node1", "node2", "node3"],
     "cluster2": ["node0"]}
}

Language: 2 (node0, cluster1: 2(node1, node2, node3))

Since there will not be much nesting and grouping, I still prefer the new language to JSON. I understand one can easily modify/add groups in JSON using built-in functions, but I think changes will not be made too often. If we decide to use a dedicated language, a syntax checker for that language is needed, callable via SQL or something similar. Otherwise we will not be able to know whether parsing that value will succeed until reloading or restarting the server. Regards, -- Masahiko Sawada -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
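[Editor's note: the syntax check could also live in a GUC check hook, so that a bad value is rejected at SIGHUP time rather than at first use. A deliberately tiny sketch follows; the grammar it accepts (names, commas, and "N ( ... )" nesting) and the function name are stand-ins for whatever dedicated language is eventually chosen.]

#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical validator to be wired up as the GUC's check hook. */
static bool
check_sync_names(const char *s)
{
	int			depth = 0;

	for (; *s; s++)
	{
		if (*s == '(')
			depth++;
		else if (*s == ')')
		{
			if (--depth < 0)
				return false;	/* closing paren with no opener */
		}
		else if (!isalnum((unsigned char) *s) &&
				 strchr(" \t,_-:", *s) == NULL)
			return false;		/* character outside the tiny grammar */
	}
	return depth == 0;			/* every group must be closed */
}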
Re: [HACKERS] patch: prevent user from setting wal_buffers over 2GB bytes
Why does this cause a core dump? We could consider fixing whatever the problem is rather than capping the value. As far as I have experimented in my own evaluation environment using PostgreSQL 9.4.4 on x86_64 Linux, this problem can be fixed with the attached patch. I have confirmed that applying the patch makes 'wal_buffers = 4GB' work fine, while original PostgreSQL 9.4.4 results in a core dump (segfault). I'll be happy if anyone can reconfirm this. -- Takashi Horikawa NEC Corporation Knowledge Discovery Research Laboratories -----Original Message----- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas Sent: Tuesday, August 04, 2015 2:29 AM To: Josh Berkus Cc: PostgreSQL-development Subject: Re: [HACKERS] patch: prevent user from setting wal_buffers over 2GB bytes On Fri, Jul 31, 2015 at 8:09 PM, Josh Berkus j...@agliodbs.com wrote: On 07/31/2015 10:43 AM, Robert Haas wrote: On Thu, Jul 30, 2015 at 9:17 PM, Josh Berkus j...@agliodbs.com wrote: In guc.c, the maximum for wal_buffers is INT_MAX. However, wal_buffers is actually measured in 8KB buffers, not in bytes. This means that users are able to set wal_buffers > 2GB. When the database is started, this can cause a core dump if the WAL offset is > 2GB. Why does this cause a core dump? We could consider fixing whatever the problem is rather than capping the value. The underlying issue is that the byte position in wal_buffers is a 32-bit int, so as soon as you exceed that, core dump. OK. So capping it sounds like the right approach, then. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers allow_wal_buffer_over_2G.patch Description: Binary data smime.p7s Description: S/MIME cryptographic signature
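[Editor's note: to make the failure mode concrete, here is a minimal standalone sketch of the suspected arithmetic, assuming (as an illustration, not the actual xlog.c code) that a WAL-buffer byte offset is computed in a signed 32-bit variable.]

#include <stdint.h>
#include <stdio.h>

#define XLOG_BLCKSZ 8192

int
main(void)
{
	/* wal_buffers = 4GB worth of 8KB pages */
	int64_t		wal_buffers = ((int64_t) 4 << 30) / XLOG_BLCKSZ;
	int64_t		pageno = wal_buffers - 1;	/* last page of the ring */

	/* byte offset of that page, computed in 32 and in 64 bits */
	int32_t		offset32 = (int32_t) (pageno * XLOG_BLCKSZ);
	int64_t		offset64 = pageno * XLOG_BLCKSZ;

	printf("32-bit offset: %d\n", offset32);				/* prints -8192 */
	printf("64-bit offset: %lld\n", (long long) offset64);	/* prints 4294959104 */
	return 0;
}

A negative offset fed into pointer arithmetic against the shared WAL buffer pool is an out-of-bounds access, hence the segfault; either capping wal_buffers below 2GB (as proposed upthread) or widening the offset arithmetic (presumably what the attached patch does) avoids it.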
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Mon, Aug 3, 2015 at 1:36 PM, Robert Haas robertmh...@gmail.com wrote: I don't think that's what Heikki is talking about. He can correct me if I'm wrong, but what I think he's saying is that we should try to exploit the fact that we've already determined which in-memory tuples can be part of the current run and which in-memory tuples must become part of the next run. Suppose half the tuples in memory can become part of the current run and the other half must become part of the next run. If we throw all of those tuples into a single bucket and quicksort it, we're losing the benefit of the comparisons we did to figure out which tuples go in which runs. Instead, we could quicksort the current-run tuples and, separately, quick-sort the next-run tuples. Ignore the current-run tuples completely until the tape is empty, and then merge them with any next-run tuples that remain. Oh. Well, the benefit of the comparisons we did to figure out which tuples go in which runs is that we can determine the applicability of this optimization. Also, by keeping run 1 (if any) usually in memory, and run 0 partially on disk we avoid having to worry about run 1 as a thing that spoils the optimization (in the current single run optimization, dumping all tuples can make us acknowledge run 1 (i.e. currentRun++), preventing single run optimization, which we handily avoid in the patch). Finally, it saves us a bunch of real COMPARETUP() comparisons as HEAPCOMPARE() is called as tuples are inserted into the still-heapified memtuples array. I'm not sure if there's any reason to believe that would be faster than your approach. In general, sorting is O(n lg n) so sorting two arrays that are each half as large figures to be slightly faster than sorting one big array. But the difference may not amount to much. IMV, the smart way of avoiding wasting the comparisons we did to figure out which tuples go in which runs is to rig HEAPCOMPARE() to only do a COMPARETUP() for the currentRun, and make sure that we don't mess up and forget that if we don't end up quicksorting. The second run that is in memory can only consist of whatever tuples were added after heapification that were less than any of those previously observed tuples (a big majority, usually). So like you, I can't see any of these techniques helping much, even my smart technique. Maybe I should look at a case involving text or something to be sure. Thinking about it some more, I don't think it would be easy to maintain a clear separation between run 0 and run 1 in the memtuples array in terms of a cutoff point. It's still a heap at that stage, of course. You'd have to rig each tuple comparator so that COMPARETUP() cared about tupindex before comparing datum1 just for this, which seems rather unappealing. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
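[Editor's note: to make the "rig HEAPCOMPARE()" idea concrete, here is a sketch of a run-aware comparator. It is not the actual tuplesort.c code: SortTupleSketch and COMPARETUP are simplified stand-ins for SortTuple and the datatype-specific comparator dispatch.]

typedef struct SortTupleSketch
{
	void	   *tuple;			/* the tuple proper */
	int			tupindex;		/* run number during run building */
} SortTupleSketch;

/* Placeholder for the real datatype-specific comparison. */
extern int	real_comparetup(const void *a, const void *b);
#define COMPARETUP(a, b) real_comparetup((a)->tuple, (b)->tuple)

static int
heap_compare_sketch(const SortTupleSketch *a, const SortTupleSketch *b,
					int currentRun)
{
	/* Tuples of an earlier run always sort before those of later runs. */
	if (a->tupindex != b->tupindex)
		return (a->tupindex < b->tupindex) ? -1 : 1;

	/*
	 * Both tuples belong to the next, in-memory run: their relative order
	 * in the heap is irrelevant, because that run will be quicksorted
	 * before it is used.  Skip the expensive comparison.
	 */
	if (a->tupindex != currentRun)
		return 0;

	/* Current-run tuples must still be fully ordered for emission to tape. */
	return COMPARETUP(a, b);
}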
Re: [HACKERS] [PATCH] Microvacuum for gist.
On Mon, Aug 3, 2015 at 12:27 PM, Anastasia Lubennikova a.lubennik...@postgrespro.ru wrote: 1) Test and results are in attachments. Everything seems to work as expected. 2) I dropped these notices. It was done only for debug purposes. Updated patch is attached. 3) fixed Good! Another couple of notes from me: 1) I think gistvacuumpage() and gistkillitems() need function-level comments. 2) ItemIdIsDead() can be used in index scan like it's done in _bt_checkkeys(). -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] pageinspect patch, for showing tuple data
In a message of 3 August 2015 14:30:46, Michael Paquier wrote: On Mon, Aug 3, 2015 at 1:03 AM, Nikolay Shaplov n.shap...@postgrespro.ru wrote: Hi! I've created a patch for pageinspect that allows one to see the data stored in a tuple. This patch has two main purposes: 1. Practical: Make manual DB recovery more simple What are you referring to in this case? Manual manipulation of on-disk data? Yes, when the DB is broken, for example. b) I have plenty of sanity checks while reading/parsing that tuple; for this function I've changed the error level from ERROR to WARNING. This function will allow us to write proper tests that all these checks work as they were designed (I hope to write these tests sooner or later).

+			ereport(inter_call_data->error_level,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("Data corruption: Iterating over tuple data reached out of actual tuple size")));

I don't think that the possibility to raise a WARNING is a good thing in those code paths. If data is corrupted this may crash, and I am not sure that anybody would want that even for educational purposes. Hm... I consider _heap_page_items a really dangerous function, with a big red "Do not call it if you are not sure what you are doing" warning. Reading data with an improper attribute descriptor is dangerous anyway. But when I wrote that code, I did not have those checks at first, and it was really interesting to substitute one kind of data for another and see what would happen. And I thought that maybe other explorers would like to do the same. And that is really possible only in warning mode. So I left warnings only in _heap_page_items, and set errors for all other cases. -- Nikolay Shaplov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Microvacuum for gist.
On 30.07.2015 16:33, Alexander Korotkov wrote: Hi! On Thu, Jul 30, 2015 at 2:51 PM, Anastasia Lubennikova lubennikov...@gmail.com wrote: I have written microvacuum support for the gist access method. Briefly, microvacuum includes two steps: 1. When a search tells us that a tuple is invisible to all transactions, it is marked LP_DEAD and the page is marked as having dead tuples. 2. Then, when an insert touches a full page which has dead tuples, it calls microvacuum instead of splitting the page. You can find a kind of review here [1]. [1] http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/ivanitskiy_ilya/5629499534213120 Patch is in attachments. Please review it. Nice! Some notes about this patch. 1) Could you give some test case demonstrating that microvacuum really works with the patch? In the end, we should see the index growing less with microvacuum. 2) Generating notices for every dead tuple would be too noisy. I suggest to replace the notice with one of the debug levels.

+	elog(NOTICE, "gistkillitems. Mark Item Dead offnum %hd, blkno %d",
+		 offnum, BufferGetBlockNumber(buffer));

3) Please, recheck the coding style. For instance, this line needs more spaces and the open brace should be on the next line.

+	if ((scan->kill_prior_tuple) && (so->curPageData > 0) && (so->curPageData == so->nPageData)) {

-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

1) Test and results are in attachments. Everything seems to work as expected. 2) I dropped these notices. It was done only for debug purposes. Updated patch is attached. 3) fixed -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

*** a/src/backend/access/gist/gist.c
--- b/src/backend/access/gist/gist.c
***************
*** 36,42 ****
  static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
  				 bool unlockbuf, bool unlockleftchild);
  static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
  			   GISTSTATE *giststate, List *splitinfo, bool releasebuf);
! 
  #define ROTATEDIST(d) do { \
  	SplitedPageLayout *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \
--- 36,42 ----
  				 bool unlockbuf, bool unlockleftchild);
  static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
  			   GISTSTATE *giststate, List *splitinfo, bool releasebuf);
! static void gistvacuumpage(Relation rel, Page page, Buffer buffer);
  
  #define ROTATEDIST(d) do { \
  	SplitedPageLayout *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \
***************
*** 209,214 **** gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
--- 209,225 ----
  	 * because the tuple vector passed to gistSplit won't include this tuple.
  	 */
  	is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+ 
+ 	/*
+ 	 * If leaf page is full, try at first to delete dead tuples. And then
+ 	 * check again.
+ 	 */
+ 	if ((is_split) && GistPageIsLeaf(page) && GistPageHasGarbage(page))
+ 	{
+ 		gistvacuumpage(rel, page, buffer);
+ 		is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+ 	}
+ 
  	if (is_split)
  	{
  		/* no space for insertion */
***************
*** 1440,1442 **** freeGISTstate(GISTSTATE *giststate)
--- 1451,1519 ----
  	/* It's sufficient to delete the scanCxt */
  	MemoryContextDelete(giststate->scanCxt);
  }
+ 
+ static void
+ gistvacuumpage(Relation rel, Page page, Buffer buffer)
+ {
+ 	OffsetNumber deletable[MaxOffsetNumber];
+ 	int			ndeletable = 0;
+ 	OffsetNumber offnum,
+ 				minoff,
+ 				maxoff;
+ 
+ 	/*
+ 	 * Scan over all items to see which ones need to be deleted according to
+ 	 * LP_DEAD flags.
+ 	 */
+ 	maxoff = PageGetMaxOffsetNumber(page);
+ 	for (offnum = FirstOffsetNumber;
+ 		 offnum <= maxoff;
+ 		 offnum = OffsetNumberNext(offnum))
+ 	{
+ 		ItemId		itemId = PageGetItemId(page, offnum);
+ 
+ 		if (ItemIdIsDead(itemId))
+ 			deletable[ndeletable++] = offnum;
+ 	}
+ 
+ 	if (ndeletable > 0)
+ 	{
+ 		START_CRIT_SECTION();
+ 
+ 		PageIndexMultiDelete(page, deletable, ndeletable);
+ 
+ 		/*
+ 		 * Mark the page as not containing any LP_DEAD items. This is not
+ 		 * certainly true (there might be some that have recently been marked,
+ 		 * but weren't included in our target-item list), but it will almost
+ 		 * always be true and it doesn't seem worth an additional
Re: [HACKERS] pg_rewind tap test unstable
Re: Michael Paquier 2015-07-28 cab7npqqcpgy3u7cmfo8sqquobsfmeieohueslxwycc64j3g...@mail.gmail.com On Tue, Jul 28, 2015 at 5:57 PM, Christoph Berg m...@debian.org wrote: for something between 10% and 20% of the devel builds for apt.postgresql.org (which happen every 6h if there's a git change, so it happens every few days), I'm seeing this: Dubious, test returned 1 (wstat 256, 0x100) Failed 1/8 subtests I don't have the older logs available, but from memory, the subtest failing and the two numbers mentioned are always the same. There should be some output logs in src/bin/pg_rewind/tmp_check/log/*? Could you attach them here if you have them? That would be helpful to understand what is happening. It took me a few attempts to tell the build environment to save a copy on failure and not shred everything right away. So here we go: Christoph -- c...@df7cb.de | http://www.df7cb.de/ master.log.gz Description: application/gzip regress_log_001_basic.gz Description: application/gzip regress_log_002_databases.gz Description: application/gzip regress_log_003_extrafiles.gz Description: application/gzip standby.log.gz Description: application/gzip -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minimum tuple threshold to decide last pass of VACUUM
On 2 August 2015 at 13:13, Michael Paquier michael.paqu...@gmail.com wrote: Hi all, Commit 4046e58c (dating from 2001) has introduced the following comment in vacuumlazy.c:

+	/* If any tuples need to be deleted, perform final vacuum cycle */
+	/* XXX put a threshold on min number of tuples here? */
+	if (vacrelstats->num_dead_tuples > 0)

In short, we may want to have a reloption to decide whether or not we do the last pass of VACUUM depending on a given number of remaining tuples. Is this still something we would like to have? I don't think we want a new user parameter, but we should have an internal limit with a heuristic, similar to how we decide whether to truncate. I would suggest this internal logic... * If it's a VACUUM FREEZE then index_scan_threshold = 0, i.e. always scan if needed, since the user is requesting maximum vacuum * For emergency anti-wraparound VACUUMs we shouldn't scan indexes at all, since they aren't critical path activities at that point * For normal VACUUMs we should scan indexes only if (num_dead_tuples * 20) > (blocks to be scanned in any one index), which allows some index bloat but not much -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
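[Editor's note: a sketch of how that heuristic could look in C. The function and parameter names are illustrative, not existing vacuumlazy.c symbols, although LVRelStats and its num_dead_tuples field are real.]

static bool
lazy_should_scan_indexes(LVRelStats *vacrelstats,
						 BlockNumber index_blocks,	/* blocks to scan in any one index */
						 bool is_freeze,
						 bool is_emergency_wraparound)
{
	if (vacrelstats->num_dead_tuples == 0)
		return false;			/* nothing to remove from the indexes anyway */
	if (is_freeze)
		return true;			/* user is requesting maximum vacuum */
	if (is_emergency_wraparound)
		return false;			/* index cleanup is not critical path here */

	/* Normal VACUUM: tolerate a little index bloat, but not much. */
	return ((int64) vacrelstats->num_dead_tuples * 20) > (int64) index_blocks;
}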
Re: [HACKERS] Tab completion for CREATE SEQUENCE
On Tue, Jul 7, 2015 at 9:14 PM, Andres Freund and...@anarazel.de wrote: On 2015-06-19 06:41:19 +0000, Brendan Jurd wrote: I'm marking this Waiting on Author. Once the problems have been corrected, it should be ready for a committer. Vik, are you going to update the patch? Seeing no activity on this thread and as it would be a waste not to do it, I completed the patch as attached. The following things are done: - Fixed code indentation - Removal of RESTART, SET SCHEMA, OWNER TO, and RENAME TO in CREATE SEQUENCE - Added START WITH. - Added TEMP/TEMPORARY in the set of keywords tracked. I am switching at the same time this patch as Ready for committer. Regards, -- Michael

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index ece0515..0748284 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2467,6 +2467,35 @@ psql_completion(const char *text, int start, int end)
 			 pg_strcasecmp(prev_wd, "TO") == 0)
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_tables, NULL);
 
+/* CREATE TEMP/TEMPORARY SEQUENCE <name> */
+	else if ((pg_strcasecmp(prev3_wd, "CREATE") == 0 &&
+			  pg_strcasecmp(prev2_wd, "SEQUENCE") == 0) ||
+			 (pg_strcasecmp(prev4_wd, "CREATE") == 0 &&
+			  (pg_strcasecmp(prev3_wd, "TEMP") == 0 ||
+			   pg_strcasecmp(prev3_wd, "TEMPORARY") == 0) &&
+			  pg_strcasecmp(prev2_wd, "SEQUENCE") == 0))
+	{
+		static const char *const list_CREATESEQUENCE[] =
+		{"INCREMENT BY", "MINVALUE", "MAXVALUE", "RESTART", "NO", "CACHE",
+		"CYCLE", "OWNED BY", "START WITH", NULL};
+
+		COMPLETE_WITH_LIST(list_CREATESEQUENCE);
+	}
+/* CREATE TEMP/TEMPORARY SEQUENCE <name> NO */
+	else if ((pg_strcasecmp(prev4_wd, "CREATE") == 0 &&
+			  pg_strcasecmp(prev3_wd, "SEQUENCE") == 0) ||
+			 (pg_strcasecmp(prev5_wd, "CREATE") == 0 &&
+			  (pg_strcasecmp(prev4_wd, "TEMP") == 0 ||
+			   pg_strcasecmp(prev4_wd, "TEMPORARY") == 0) &&
+			  pg_strcasecmp(prev3_wd, "SEQUENCE") == 0) &&
+			 pg_strcasecmp(prev_wd, "NO") == 0)
+	{
+		static const char *const list_CREATESEQUENCE2[] =
+		{"MINVALUE", "MAXVALUE", "CYCLE", NULL};
+
+		COMPLETE_WITH_LIST(list_CREATESEQUENCE2);
+	}
+
 /* CREATE SERVER <name> */
 	else if (pg_strcasecmp(prev3_wd, "CREATE") == 0 &&
 			 pg_strcasecmp(prev2_wd, "SERVER") == 0)

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
Peter Geoghegan writes: On Fri, Jul 31, 2015 at 5:56 PM, Andreas Seltenreich seltenre...@gmx.de wrote: sqlsmith triggered the following assertion in master (c188204). Thanks for writing sqlsmith. It seems like a great tool. I wonder, are you just running the tool with assertions enabled when PostgreSQL is built? Right. I have to admit my testing setup is still more tailored towards testing sqlsmith than postgres. If so, it might make sense to make various problems more readily detected. As you may know, Clang has a pretty decent option called AddressSanitizer that can detect memory errors as they occur with an overhead that is not excessive. I didn't know about this clang feature yet, thanks for pointing it out. I had considered running some instances under valgrind to detect these, but the performance penalty seemed not worth it. One might use the following configure arguments when building PostgreSQL to use AddressSanitizer: ./configure CC=clang CFLAGS='-O1 -g -fsanitize=address -fno-omit-frame-pointer -fno-optimize-sibling-calls' --enable-cassert A quick attempt to sneak these in made my ansible playbooks unhappy due to "make check" failures and other generated noise. I'll try to have an instance with AddressSanitizer active soon though. Of course, it remains to be seen if this pays for itself. Apparently the tool has about a 2x overhead [1]. I'm really not sure that you'll find any more bugs this way, but it's certainly possible that you'll find a lot more. Given your success in finding bugs without using AddressSanitizer, introducing it may be premature. Piotr also suggested on IRC to run coverage tests with sqlsmith. This could yield valuable hints about which directions to extend sqlsmith's grammar in. Thanks, Andreas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] RLS restrictive hook policies
Dean, all, * Stephen Frost (sfr...@snowman.net) wrote: Agreed. I'm happy to commit that change and back-patch it to 9.5, barring objections. Given that the only way to have restrictive policies currently is using hooks, I don't believe there's any documentation update required either; of course, I'll update the comment to reflect this discussion/decision and make any necessary changes to test_rls_hooks. Patch attached to make this change, with appropriate comment updates and fixes to the test_rls_hooks modules. Comments? Thanks! Stephen

From 173db8f942636e92f20145844b7dadd04ccac765 Mon Sep 17 00:00:00 2001
From: Stephen Frost sfr...@snowman.net
Date: Mon, 3 Aug 2015 10:59:24 -0400
Subject: [PATCH] RLS: Keep deny policy when only restrictive exist

Only remove the default deny policy when a permissive policy exists
(either from the hook or defined by the user).  If only restrictive
policies exist then no rows will be visible, as restrictive policies
shouldn't make rows visible.  To address this requirement, a single
USING (true) permissive policy can be created.

Update the test_rls_hooks regression tests to create the necessary
USING (true) permissive policy.

Back-patch to 9.5 where RLS was added.  Per discussion with Dean.
---
 src/backend/rewrite/rowsecurity.c                      | 14 ++++++++++----
 .../modules/test_rls_hooks/expected/test_rls_hooks.out |  7 +++++++
 src/test/modules/test_rls_hooks/sql/test_rls_hooks.sql |  8 ++++++++
 src/test/modules/test_rls_hooks/test_rls_hooks.c       |  5 +++++
 4 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/src/backend/rewrite/rowsecurity.c b/src/backend/rewrite/rowsecurity.c
index 562dbc9..5a81db3 100644
--- a/src/backend/rewrite/rowsecurity.c
+++ b/src/backend/rewrite/rowsecurity.c
@@ -225,12 +225,18 @@ get_row_security_policies(Query *root, CmdType commandType, RangeTblEntry *rte,
 	}
 
 	/*
-	 * If the only built-in policy is the default-deny one, and hook policies
-	 * exist, then use the hook policies only and do not apply the
+	 * If the only built-in policy is the default-deny one, and permissive hook
+	 * policies exist, then use the hook policies only and do not apply the
 	 * default-deny policy. Otherwise, we will apply both sets below.
+	 *
+	 * Note that we do not remove the defaultDeny policy if only *restrictive*
+	 * policies exist as restrictive policies should only ever be reducing what
+	 * is visible. Therefore, at least one permissive policy must exist which
+	 * allows records to be seen before restrictive policies can remove rows
+	 * from that set. A single "true" policy can be created to address this
+	 * requirement, if necessary.
 	 */
-	if (defaultDeny &&
-		(hook_policies_restrictive != NIL || hook_policies_permissive != NIL))
+	if (defaultDeny && hook_policies_permissive != NIL)
 	{
 		rowsec_expr = NULL;
 		rowsec_with_check_expr = NULL;
diff --git a/src/test/modules/test_rls_hooks/expected/test_rls_hooks.out b/src/test/modules/test_rls_hooks/expected/test_rls_hooks.out
index 3a7a4c3..4587eb0 100644
--- a/src/test/modules/test_rls_hooks/expected/test_rls_hooks.out
+++ b/src/test/modules/test_rls_hooks/expected/test_rls_hooks.out
@@ -13,6 +13,11 @@ CREATE TABLE rls_test_restrictive (
     supervisor name,
     data       integer
 );
+-- At least one permissive policy must exist, otherwise
+-- the default deny policy will be applied. For
+-- testing the only-restrictive-policies from the hook,
+-- create a simple 'allow all' policy.
+CREATE POLICY p1 ON rls_test_restrictive USING (true);
 -- initial test data
 INSERT INTO rls_test_restrictive VALUES ('r1','s1',1);
 INSERT INTO rls_test_restrictive VALUES ('r2','s2',2);
@@ -109,6 +114,8 @@ RESET ROLE;
 -- Create internal policies, to check that the policies from
 -- the hooks are combined correctly.
 CREATE POLICY p1 ON rls_test_permissive USING (data % 2 = 0);
+-- Remove the original allow-all policy
+DROP POLICY p1 ON rls_test_restrictive;
 CREATE POLICY p1 ON rls_test_restrictive USING (data % 2 = 0);
 CREATE POLICY p1 ON rls_test_both USING (data % 2 = 0);
 SET ROLE r1;
diff --git a/src/test/modules/test_rls_hooks/sql/test_rls_hooks.sql b/src/test/modules/test_rls_hooks/sql/test_rls_hooks.sql
index ece4ab9..3071213 100644
--- a/src/test/modules/test_rls_hooks/sql/test_rls_hooks.sql
+++ b/src/test/modules/test_rls_hooks/sql/test_rls_hooks.sql
@@ -17,6 +17,12 @@ CREATE TABLE rls_test_restrictive (
     data       integer
 );
 
+-- At least one permissive policy must exist, otherwise
+-- the default deny policy will be applied. For
+-- testing the only-restrictive-policies from the hook,
+-- create a simple 'allow all' policy.
+CREATE POLICY p1 ON rls_test_restrictive USING (true);
+
 -- initial test data
 INSERT INTO rls_test_restrictive VALUES ('r1','s1',1);
 INSERT INTO rls_test_restrictive VALUES ('r2','s2',2);
@@ -101,6 +107,8 @@ RESET ROLE;
 -- the hooks are combined correctly.
 CREATE POLICY
Re: [HACKERS] nodes/*funcs.c inconsistencies
Peter Geoghegan wrote: On Sun, Aug 2, 2015 at 3:07 PM, Peter Geoghegan p...@heroku.com wrote: I'm surprised that this stuff was only ever used for logical decoding infrastructure so far. On second thought, having tried it, one reason is that that breaks things that are considered legitimate for historical reasons. For example, AttrNumber is often used with READ_INT_FIELD(), which is an int16. Whether or not it's worth fixing that by introducing a READ_ATTRNUM_FIELD() (and so on) is not obvious to me. If it allows us to introduce additional checking for new code, I'm all for it. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minimum tuple threshold to decide last pass of VACUUM
On Mon, Aug 3, 2015 at 11:52 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Simon Riggs wrote: * For emergency anti-wraparound VACUUMs we shouldn't scan indexes at all, since they aren't critical path activities at that point It is not possible to skip scanning indexes completely, unless no tuples are to be removed from the heap. Otherwise, index tuples become lingering pointers (and when such heap address are later refilled, they become corrupted indexscans). Well, if we skip the index scans, we can't do the second heap pass either, but that's OK. I think we're all talking about the same thing here, which is to do only the first heap pass in some cases. That will prune dead tuples to line pointers, freeze old XIDs, and mark pages all-visible where appropriate. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types
On Mon, Aug 3, 2015 at 8:53 AM, Geoff Winkless pgsqlad...@geoff.dj wrote: the new table does not exhibit the same problem (so I'm assuming it's not easily reproducible and giving you a creation script isn't going to help). VACUUM FULL on the table makes no difference. Is there anything you guys can suggest that I can do to help narrow down the problem? Yes. You should enable all of the following settings: debug_print_parse (boolean) debug_print_rewritten (boolean) debug_print_plan (boolean) debug_pretty_print (boolean) which you can do dynamically from psql. Then post the output for both the failing version (on that table where you can reproduce the issue), and the other table where it seems that you ought to be able to reproduce the problem, but you can't. Maybe that'll show where the problem is. Thanks -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
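For anyone following along, the settings Peter lists are ordinary GUCs that can be toggled per-session from psql; a minimal sketch of the capture procedure (the INSERT is the one from Geoff's report, and the printed trees land in the server log, so watch the log output):

    SET debug_print_parse = on;
    SET debug_print_rewritten = on;
    SET debug_print_plan = on;
    SET debug_pretty_print = on;

    -- Re-run the failing statement so its parse/rewrite/plan trees are logged.
    INSERT INTO brokentab(id, k1,k2,k3,k4,k5,k6,k7, smallval)
    VALUES (5,0,0,0,1,0,1,0, 0)
    ON CONFLICT (id, k1,k2,k3,k4,k5,k6,k7)
    DO UPDATE SET smallval = EXCLUDED.smallval;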
Re: [HACKERS] pageinspect patch, for showing tuple data
Nikolay Shaplov wrote: This patch adds several new functions, available from SQL queries. All these functions are based on heap_page_items, but accept slightly different arguments and has one additional column at the result set: heap_page_tuples - accepts relation name, and bulkno, and returns usual heap_page_items set with additional column that contain snapshot of tuple data area stored in bytea. I think the API around get_raw_page is more useful generally. You might have table blocks stored in a bytea column for instance, because you extracted from some remote server and inserted into a local table for examination. If you only accept relname/blkno, there's no way to examine data other than blocks directly in the database dir, which is limiting. There is also one strange function: _heap_page_items it is useless for practical purposes. As heap_page_items it accepts page data as bytea, but also get a relation name. It tries to apply tuple descriptor of that relation to that page data. For BRIN, I added something similar (brin_page_items), but it receives the index OID rather than name, and constructs a tuple descriptor to read the data. I think OID is better than name generally. (You can cast the relation name to regclass). It's easy to misuse, but these functions are superuser-only, so there should be no security issue at least. The possibility of a crash remains real, though, so maybe we should try to come up with a way to harden that. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
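Alvaro's point about the bytea-based API is visible in how the existing pageinspect functions already compose; a minimal illustration (table name invented):

    CREATE EXTENSION IF NOT EXISTS pageinspect;

    -- get_raw_page() returns bytea, so heap_page_items() can decode a block
    -- obtained from anywhere, not only one read from the local data directory.
    SELECT lp, lp_off, lp_len, t_xmin, t_xmax, t_ctid
    FROM heap_page_items(get_raw_page('some_table', 0));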
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
Piotr Stefaniak postg...@piotr-stefaniak.me writes: How about this one? 1 ERROR: could not find RelOptInfo for given relids That would be a bug, for sure ... It's triggered on 13bba02271dce865cd20b6f49224889c73fed4e7 by this query and the attached one: ... but I can't reproduce it on HEAD with either of these queries. Not clear why you're getting different results. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minimum tuple threshold to decide last pass of VACUUM
On Mon, Aug 3, 2015 at 4:13 AM, Simon Riggs si...@2ndquadrant.com wrote: * For normal VACUUMs we should scan indexes only if (num_dead_tuples * 20) > (blocks to be scanned in any one index), which allows some index bloat but not much. I think this kind of heuristic is good, but I think we should expose a setting for it. There's no way for us to know without testing whether the right value for that multiplier is 2 or 20 or 200 or 2000, and if we don't make it easy to tweak, we'll never find out. It may even be workload-dependent. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
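To make the proposed heuristic concrete, here is a rough way to eyeball it from SQL against the stats views, with relpages standing in for "blocks to be scanned" (illustrative only, with an invented table name; the real test would run inside vacuumlazy.c with exact counts):

    SELECT ci.relname AS index_name,
           s.n_dead_tup * 20 > ci.relpages AS would_scan_index
    FROM pg_stat_user_tables s
    JOIN pg_index i ON i.indrelid = s.relid
    JOIN pg_class ci ON ci.oid = i.indexrelid
    WHERE s.relname = 'some_table';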
Re: [HACKERS] nodes/*funcs.c inconsistencies
Peter Geoghegan wrote: Couldn't we adopt AssertVariableIsOfType()/AssertVariableIsOfTypeMacro() to macros like READ_UINT_FIELD()? I'm surprised that this stuff was only ever used for logical decoding infrastructure so far. The reason it's only used there is that Andres is the one who introduced those macros precisely for that code. We've not yet had time to adjust the rest of the code to have more sanity checks. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minimum tuple threshold to decide last pass of VACUUM
Alvaro Herrera alvhe...@2ndquadrant.com writes: Simon Riggs wrote: * For emergency anti-wraparound VACUUMs we shouldn't scan indexes at all, since they aren't critical path activities at that point It is not possible to skip scanning indexes completely, unless no tuples are to be removed from the heap. Right. But actually this is an interesting point and I don't think we do this: if in emergency mode, maybe we shouldn't try to remove any dead tuples at all, and instead only freeze very old tuples. +1 ... not sure if that's what Simon had in mind exactly, but it seems like a correct statement of what he was getting at. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: prevent user from setting wal_buffers over 2GB bytes
On Fri, Jul 31, 2015 at 8:09 PM, Josh Berkus j...@agliodbs.com wrote: On 07/31/2015 10:43 AM, Robert Haas wrote: On Thu, Jul 30, 2015 at 9:17 PM, Josh Berkus j...@agliodbs.com wrote: In guc.c, the maximum for wal_buffers is INT_MAX. However, wal_buffers is actually measured in 8KB buffers, not in bytes. This means that users are able to set wal_buffers > 2GB. When the database is started, this can cause a core dump if the WAL offset is > 2GB. Why does this cause a core dump? We could consider fixing whatever the problem is rather than capping the value. The underlying issue is that byte position in wal_buffers is a 32-bit INT, so as soon as you exceed that, core dump. OK. So capping it sounds like the right approach, then. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
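A quick back-of-envelope of the limit being discussed, assuming the default XLOG_BLCKSZ of 8192 bytes (standard arithmetic, not from the thread):

$$\frac{2^{31}\ \text{bytes}}{8192\ \text{bytes/buffer}} = 262144\ \text{buffers} = 2\,\text{GB},$$

while guc.c previously allowed up to INT_MAX = 2147483647 buffers, so any setting above 262144 buffers produced byte positions beyond the reach of a signed 32-bit int.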
Re: [HACKERS] nodes/*funcs.c inconsistencies
* Noah Misch (n...@leadboat.com) wrote: On Sun, Aug 02, 2015 at 11:31:16PM -0400, Tom Lane wrote: That being the case, it would probably be a good idea to get them done before alpha2, as there may not be a good opportunity afterwards. Freedom to bump catversion after alpha2 will be barely-distinguishable from freedom to do so now. I have planned to leave my usual comment period of a few days, though skipping that would be rather innocuous in this case. This is clearly necessary, of course, and I wouldn't be surprised if we have another necessary bump post-alpha2, but at the same time, it'd certainly be nice if we are able to put out alpha2 and then a beta1 in the future without needing to do a bump. In other words, +1 from me for going ahead and committing this for alpha2, if possible. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types
On Mon, Aug 3, 2015 at 8:53 AM, Geoff Winkless pgsqlad...@geoff.dj wrote: If I create a copy of the table using CREATE mytab (LIKE brokentab INCLUDING ALL); INSERT INTO mytab SELECT * FROM brokentab; Also, did you drop any columns from the original brokentab table where the bug can be reproduced? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
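A sketch of the kind of reproduction Peter's question is probing for -- the dropped-column theory -- with invented names, not the reporter's schema:

    CREATE TABLE repro (id int PRIMARY KEY, junk int, smallval smallint);
    ALTER TABLE repro DROP COLUMN junk;  -- leaves a hole in the attribute numbering
    INSERT INTO repro (id, smallval) VALUES (1, 0);
    INSERT INTO repro (id, smallval) VALUES (1, 0)
        ON CONFLICT (id) DO UPDATE SET smallval = EXCLUDED.smallval;

If the EXCLUDED pseudo-relation's descriptor were built without accounting for dropped columns, the attribute numbers of everything after the hole would shift, which could produce exactly this kind of "attribute N has wrong type" complaint.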
Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file
On Thu, Jul 16, 2015 at 9:58 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jul 16, 2015 at 9:41 AM, Fujii Masao masao.fu...@gmail.com wrote: Here are some minor comments: +ereport(LOG, +(errmsg(ignoring \%s\ file because no \%s\ file exists, +TABLESPACE_MAP, BACKUP_LABEL_FILE), + errdetail(could not rename file \%s\ to \%s\: %m, + TABLESPACE_MAP, TABLESPACE_MAP_OLD))); WARNING is better than LOG here because it indicates a problematic case? No, that's not the right distinction. Remember that, when sending messages to the client, WARNING > LOG, and when sending messages to the log, LOG > WARNING. So messages that a user is more likely to care about than the administrator should be logged at WARNING; those that the administrator is more likely to care about should be LOG. I think LOG is clearly the appropriate thing here. In the detail message, the first word of the sentence needs to be capitalized. + errdetail(renamed file \%s\ to \%s\, + TABLESPACE_MAP, TABLESPACE_MAP_OLD))); In the detail message, basically we should use a complete sentence. So like other similar detail messages in xlog.c, I think that it's better to use \%s\ was renamed to \%s\. as the detail message here. Right, that's what the style guidelines say. I have changed the errdetail messages as per the above suggestion. Also, there was one issue (it was displaying 2 messages when rename fails) in the patch which I have fixed in the updated version. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com rename_mapfile_if_backupfile_not_present_v4.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Transactions involving multiple postgres foreign servers
On Sat, Aug 1, 2015 at 12:18 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, Jul 31, 2015 at 6:33 AM, Ashutosh Bapat ashutosh.ba...@enterprisedb.com wrote: I'm not hung up on the table-level attribute, but I think having a server-level attribute rather than a global GUC is a good idea. However, I welcome other thoughts on that. The patch supports server level attribute. Let me repeat the relevant description from my earlier mail -- Every FDW needs to register the connection while starting new transaction on a foreign connection (RegisterXactForeignServer()). A foreign server connection is identified by foreign server oid and the local user oid (similar to the entry cached by postgres_fdw). While registering, FDW also tells whether the foreign server is capable of participating in two-phase commit protocol. How to decide that is left entirely to the FDW. An FDW like file_fdw may not have 2PC support at all, so all its foreign servers do not comply with 2PC. An FDW might have all its servers 2PC compliant. An FDW like postgres_fdw can have some of its servers compliant and some not, depending upon server version, configuration (max_prepared_transactions = 0) etc. -- Does that look good? OK, sure. But let's make sure postgres_fdw gets a server-level option to control this. For postgres_fdw it's a boolean server-level option 'twophase_compliant' (suggestions for name welcome). Done, there are three hooks now 1. For preparing a foreign transaction 2. For resolving a prepared foreign transaction 3. For committing/aborting a running foreign transaction (more explanation later) (2) and (3) seem like the same thing. I don't see any further explanation later in your email; what am I missing? In case of postgres_fdw, 2 always fires COMMIT/ROLLBACK PREPARED 'xyz' (fill the prepared transaction id) and 3 always fires COMMIT/ABORT TRANSACTION (notice absence of PREPARED and 'xyz'). Oh, OK. But then isn't #3 something we already have? i.e. pgfdw_xact_callback? While transactions are being prepared on the foreign connections, if any prepare fails, we have to abort transactions on the rest of the connections (and abort the prepared transactions). pgfdw_xact_callback wouldn't know, which connections have prepared transactions and which do not have. So, even in case of two-phase commit we need all the three hooks. Since we have to define these three hooks, we might as well centralize all the transaction processing and let the foreign transaction manager decide which of the hooks to invoke. So, the patch moves most of the code in pgfdw_xact_callback in the relevant hook and foreign transaction manager invokes appropriate hook. Only thing that remains in pgfdw_xact_callback now is end of transaction handling like resetting cursor numbering. That seems totally broken. Before PITR, the database might be inconsistent, in which case you can't call any functions at all. Also, you shouldn't be trying to resolve any transactions until the end of recovery, because you don't know when you see that the transaction was prepared whether, at some subsequent time, you will see it resolved. You need to finish recovery and, only after entering normal running, decide whether to resolve any transactions that are still sitting around. That's how it works in the patch for unresolved prepared foreign transactions belonging to xids within the known range. 
For those belonging to xids in future (beyond of known range of xids after PITR), we can not determine the status of that local transaction (as those do not appear in the xlog) and hence can not decide the fate of prepared foreign transaction. You seem to be suggesting that we should let the recovery finish and mark those prepared foreign transaction as can not be resolved or something like that. A DBA can remove those entries once s/he has dealt with them on the foreign server. There's little problem with that approach. Triplet (xid, serverid, userid) is used to identify the a foreign prepared transaction entry in memory and is used to create unique file name for storing it on the disk. If we allow a future xid after PITR, it might conflict with an xid of a transaction that might take place after PITR. It will cause problem if exactly same foreign server and user participate in the transaction with conflicting xid (rare but possible). Other problem is that the foreign server on which the transaction was prepared (or the user whose mapping was used to prepare the transaction), might have got added in a future time wrt PITR, in which case, we can not even know which foreign server this transaction was prepared on. There should be no situation (short of e.g. OS errors writing the state files) where this stuff makes recovery fail. During PITR, if we encounter a prepared (local) transaction with a future xid, we
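As a concrete shape for the server-level switch discussed above, the patch's 'twophase_compliant' option would presumably be declared like any other postgres_fdw server option. The option name comes from the email itself, but the patch is not committed, so this syntax is illustrative and the names may change:

    CREATE SERVER remote_db FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'remote.example.com', dbname 'appdb',
                 twophase_compliant 'true');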
[HACKERS] RLS restrictive hook policies
In get_row_security_policies(): /* * If the only built-in policy is the default-deny one, and hook policies * exist, then use the hook policies only and do not apply the * default-deny policy. Otherwise, we will apply both sets below. */ if (defaultDeny && (hook_policies_restrictive != NIL || hook_policies_permissive != NIL)) { rowsec_expr = NULL; rowsec_with_check_expr = NULL; } So if the only policies that exist are restrictive policies from an extension, the default-deny policy is not applied and the restrictive policies are applied instead, which is effectively a default-allow situation with each restrictive policy further limiting access to the table. I think this is potentially very dangerous when both kinds of policy exist. Consider for example a situation where initially multiple permissive policies are set up for different users allowing them access to different subsets of the table, and users not covered by any of those permissive policies have no access. Then suppose that later a restrictive policy is added, applying to all users -- the intention being to prevent any user from accessing some particularly sensitive data. The result of adding that restrictive policy would be that all the users who previously had no access at all would suddenly have access to everything not prohibited by the restrictive policy. That goes against everything I would expect from a restrictive policy -- adding more restrictive policies should only ever reduce the number of rows visible, not ever open up more. So I think the test above should only disable the default-deny policy if there are any permissive policies from the extension. Of course that will make life a little harder for people who only want to use restrictive policies, since they will be forced to first add a permissive policy, possibly just one that's always true, but I think it's the safest approach. Possibly if/when we add proper SQL support for defining restrictive policies, we might also add an option to ALTER TABLE ... ENABLE ROW LEVEL SECURITY to allow a table to have RLS enabled in default-allow mode, for users who know they only want to add restrictive policies. However, I think that should be something the user has to explicitly ask for, not an automatic decision based on the existence of restrictive policies. Thoughts? Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
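A minimal SQL sketch of the dangerous sequence Dean describes, with invented names (in 9.5 the restrictive policy itself can only come from the hook, so it appears here as a comment):

    CREATE TABLE t (owner name, sensitive boolean, payload text);
    ALTER TABLE t ENABLE ROW LEVEL SECURITY;

    -- Permissive policies for specific roles; role carol matches neither,
    -- so carol sees nothing (default-deny).
    CREATE POLICY p_alice ON t TO alice USING (owner = 'alice');
    CREATE POLICY p_bob   ON t TO bob   USING (owner = 'bob');

    -- Now an extension hook adds a restrictive policy: USING (NOT sensitive).
    -- Old behavior: the default-deny policy is dropped, so carol suddenly
    -- sees every non-sensitive row.  Proposed behavior: carol still sees
    -- nothing, because no permissive policy ever granted her anything.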
Re: [HACKERS] Reduce ProcArrayLock contention
On Fri, Jul 31, 2015 at 10:11 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Wed, Jul 29, 2015 at 11:48 PM, Andres Freund and...@anarazel.de wrote: On 2015-07-29 12:54:59 -0400, Robert Haas wrote: I would try to avoid changing lwlock.c. It's pretty easy when so doing to create mechanisms that work now but make further upgrades to the general lwlock mechanism difficult. I'd like to avoid that. I'm massively doubtful that re-implementing parts of lwlock.c is the better outcome. Then you have two different infrastructures you need to improve over time. I agree and modified the patch to use 32-bit atomics based on the idea suggested by Robert, without modifying lwlock.c. While looking at the patch, I found that the way it was initialising the list to be empty was wrong: it was using pgprocno 0 to mark the list as empty, but 0 is a valid pgprocno. I think we should use a number greater than PROCARRAY_MAXPROC (the maximum number of procs in the proc array). Apart from the above fix, I have modified src/backend/access/transam/README to include the information about the improvement this patch brings to reduce ProcArrayLock contention. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com group_xid_clearing_at_trans_end_v3.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_rewind failure by file deletion in source server
On 08/03/2015 07:01 AM, Michael Paquier wrote: On Sun, Aug 2, 2015 at 4:01 AM, Heikki Linnakangas wrote: Perhaps it's best if we copy all the WAL files from source in copy-mode, but not in libpq mode. Regarding old WAL files in the target, it's probably best to always leave them alone. They should do no harm, and as a general principle it's best to avoid destroying evidence. It'd be nice to get some fix for this for alpha2, so I'll commit a fix to do that on Monday, unless we come to a different conclusion before that. +1. Both things sound like a good plan to me. I had some trouble implementing that. Recovery seemed to get confused sometimes, when it didn't find some of the WAL files in pg_xlog directory, even though it could fetch them through streaming replication. I'll have to investigate that further, but in the meantime, to have some fix in place for alpha2, I committed an even simpler fix for the immediate issue that pg_xlog is a symlink: just pretend that pg_xlog is a normal directory, even when it's a symlink. I'll continue to investigate what was wrong with my initial attempt. And it would be nice to avoid copying the pre-allocated WAL files from the source, because it's really unnecessary. But this fixes the immediate problem that pg_rewind didn't work at all if pg_xlog was a symlink. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Autonomous Transaction is back
On Sun, Aug 2, 2015 at 11:37 PM, Rajeev rastogi rajeev.rast...@huawei.com wrote: On 31 July 2015 23:10, Robert Haas Wrote: I think we're going entirely down the wrong path here. Why is it ever useful for a backend's lock requests to conflict with themselves, even with autonomous transactions? That seems like an artifact of somebody else's implementation that we should be happy we don't need to copy. IMHO, since most of the locking are managed at transaction level not backend level and we consider main autonomous transaction to be independent transaction, then practically they may conflict right. It is also right as you said that there is no as such useful use-cases where autonomous transaction conflicts with main (parent) transaction. But we cannot take it for granted as user might make a mistake. So at-least we should have some mechanism to handle this rare case, for which as of now I think throwing error from autonomous transaction as one of the solution. Once error thrown from autonomous transaction, main transaction may continue as it is (or abort main transaction also??). hm. OK, what's the behavior of: BEGIN UPDATE foo SET x = x + 1 WHERE foo_id = 1; BEGIN WITH AUTONOMOUS TRANSACTION UPDATE foo SET x = x + 1 WHERE foo_id = 1; END; RAISE EXCEPTION ...; EXCEPTION ... END; Also, *) What do the other candidate implementations do? IMO, compatibility should be the underlying design principle. *) What will the SQL only feature look like? *) Is the SPI interface going to be extended to expose AT? merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Fri, Jul 17, 2015 at 12:28 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Wed, Jul 1, 2015 at 9:31 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Jul 1, 2015 at 2:21 AM, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/29/2015 09:44 AM, Michael Paquier wrote: On Mon, Jun 29, 2015 at 4:55 AM, Heikki Linnakangas wrote: But we'll still need to handle the pg_xlog symlink case somehow. Perhaps it would be enough to special-case pg_xlog for now. Well, sure, pg_rewind does not copy the soft links either way. Now it would be nice to have an option to be able to recreate the soft link of at least pg_xlog even if it can be scripted as well after a run. Hmm. I'm starting to think that pg_rewind should ignore pg_xlog entirely. In any non-trivial scenarios, just copying all the files from pg_xlog isn't enough anyway, and you need to set up a recovery.conf after running pg_rewind that contains a restore_command or primary_conninfo, to fetch the WAL. So you can argue that by not copying pg_xlog automatically, we're actually doing a favour to the DBA, by forcing him to set up the recovery.conf file correctly. Because if you just test simple scenarios where not much time has passed between the failover and running pg_rewind, it might be enough to just copy all the WAL currently in pg_xlog, but it would not be enough if more time had passed and not all the required WAL is present in pg_xlog anymore. And by not copying the WAL, we can avoid some copying, as restore_command or streaming replication will only copy what's needed, while pg_rewind would copy all WAL it can find the target's data directory. pg_basebackup also doesn't include any WAL, unless you pass the --xlog option. It would be nice to also add an optional --xlog option to pg_rewind, but with pg_rewind it's possible that all the required WAL isn't present in the pg_xlog directory anymore, so you wouldn't always achieve the same effect of making the backup self-contained. So, I propose the attached. It makes pg_rewind ignore the pg_xlog directory in both the source and the target. If pg_xlog is simply ignored, some old WAL files may remain in target server. Don't these old files cause the subsequent startup of target server as new standby to fail? That is, it's the case where the WAL file with the same name but different content exist both in target and source. If that's harmfull, pg_rewind also should remove the files in pg_xlog of target server. This would reduce usability. The rewound node will replay WAL from the previous checkpoint where WAL forked up to the minimum recovery point of source node where pg_rewind has been run. Hence if we remove completely the contents of pg_xlog we'd lose a portion of the logs that need to be replayed until timeline is switched on the rewound node when recovering it (while streaming from the promoted standby, whatever). Even if we remove the WAL files in *target server, we don't lose any files in *source server that we will need to replay later. I don't really see why recycled segments would be a problem, as that's perhaps what you are referring to, but perhaps I am missing something. Please imagine the case where the WAL files with the same name were created in both servers after the fork. Their contents may be different. After pg_rewind is executed successfully, the rewound server (i.e., target server) should retrieve and replay that WAL file from the *source* server. 
But the problem is that the rewound server tries to replay the WAL file from its local pg_xlog, since the file exists locally (even if primary_conninfo is specified). Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive
On 07/28/2015 10:28 PM, Heikki Linnakangas wrote: On 07/27/2015 01:20 PM, Ildus Kurbangaliev wrote: Hello. In the attached patch I've made a refactoring for tranches. The prefix for them was extended, and I've did a split of LWLockAssign to two functions (one with tranche and second for user defined LWLocks). This needs some work in order to be maintainable: * The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync with the list of individual locks in lwlock.h. Sooner or later someone will add an LWLock and forget to update the names-array. That needs to be made less error-prone, so that the names are maintained in the same place as the #defines. Perhaps something like rmgrlist.h. * The base tranches are a bit funny. They all have the same array_base, pointing to MainLWLockArray. If there are e.g. 5 clog buffer locks, I would expect the T_NAME() to return ClogBufferLocks for all of them, and T_ID() to return numbers between 0-4. But in reality, T_ID() will return something like 55-59. Instead of passing a tranche-id to LWLockAssign(), I think it would be more clear to have a new function to allocate a contiguous block of lwlocks as a new tranche. It could then set the base correctly. * Instead of having LWLOCK_INDIVIDUAL_NAMES to name individual locks, how about just giving each one of them a separate tranche? * User manual needs to be updated to explain the new column in pg_stat_activity. - Heikki Hello. Thanks for review. I attached new version of patch. It adds new field in pg_stat_activity that shows current wait in backend. I've did a some refactoring LWLocks tranche mechanism. In lwlock.c only invididual and user defined LWLocks is creating, other LWLocks created by modules who need them. I think that is more logical (user know about module, not module about of all users). It also simplifies LWLocks acquirement. Now each individual LWLock and other groups of LWLocks have their tranche, and can be easily identified. If somebody will add new individual LWLock and forget to add its name, postgres will show a message. Individual LWLocks still allocated in one array but tranches for them point to particular index in main array. Sample: b1=# select pid, wait_event from pg_stat_activity; \watch 1 pid | wait_event --+- 7722 | Storage: READ 7653 | 7723 | Network: WRITE 7725 | Network: READ 7727 | Locks: Transaction 7731 | Storage: READ 7734 | Network: READ 7735 | Storage: READ 7739 | LWLocks: WALInsertLocks 7738 | Locks: Transaction 7740 | LWLocks: BufferMgrLocks 7741 | Network: READ 7742 | Network: READ 7743 | Locks: Transaction -- Ildus Kurbangaliev Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index e64b7ef..2e4e67e 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -630,6 +630,11 @@ postgres 27093 0.0 0.0 30096 2752 ?Ss 11:34 0:00 postgres: ser entryTrue if this backend is currently waiting on a lock/entry /row row + entrystructfieldwait_event//entry + entrytypetext//entry + entryName of a current wait event in backend/entry +/row +row entrystructfieldstate//entry entrytypetext//entry entryCurrent overall state of this backend. 
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index 3a58f1e..10c25cf 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -457,7 +457,8 @@ CLOGShmemInit(void) { ClogCtl-PagePrecedes = CLOGPagePrecedes; SimpleLruInit(ClogCtl, CLOG Ctl, CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE, - CLogControlLock, pg_clog); + CLogControlLock, pg_clog, + CLogBufferLocks); } /* diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c index 5ad35c0..dd7578f 100644 --- a/src/backend/access/transam/commit_ts.c +++ b/src/backend/access/transam/commit_ts.c @@ -466,7 +466,8 @@ CommitTsShmemInit(void) CommitTsCtl-PagePrecedes = CommitTsPagePrecedes; SimpleLruInit(CommitTsCtl, CommitTs Ctl, CommitTsShmemBuffers(), 0, - CommitTsControlLock, pg_commit_ts); + CommitTsControlLock, pg_commit_ts, + CommitTSBufferLocks); commitTsShared = ShmemInitStruct(CommitTs shared, sizeof(CommitTimestampShared), diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index 1933a87..b905c59 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -1842,10 +1842,12 @@ MultiXactShmemInit(void) SimpleLruInit(MultiXactOffsetCtl, MultiXactOffset Ctl, NUM_MXACTOFFSET_BUFFERS, 0, - MultiXactOffsetControlLock, pg_multixact/offsets); + MultiXactOffsetControlLock, pg_multixact/offsets, + MultiXactOffsetBufferLocks); SimpleLruInit(MultiXactMemberCtl,
Re: [HACKERS] RLS restrictive hook policies
Dean, * Dean Rasheed (dean.a.rash...@gmail.com) wrote: In get_row_security_policies(): /* * If the only built-in policy is the default-deny one, and hook policies * exist, then use the hook policies only and do not apply the * default-deny policy. Otherwise, we will apply both sets below. */ if (defaultDeny && (hook_policies_restrictive != NIL || hook_policies_permissive != NIL)) { rowsec_expr = NULL; rowsec_with_check_expr = NULL; } So if the only policies that exist are restrictive policies from an extension, the default-deny policy is not applied and the restrictive policies are applied instead, which is effectively a default-allow situation with each restrictive policy further limiting access to the table. Right, the idea there being that applying the default-deny would render the restrictive policies pointless, since nothing would ever be visible, however.. I think this is potentially very dangerous when both kinds of policy exist. Consider for example a situation where initially multiple permissive policies are set up for different users allowing them access to different subsets of the table, and users not covered by any of those permissive policies have no access. Then suppose that later a restrictive policy is added, applying to all users -- the intention being to prevent any user from accessing some particularly sensitive data. The result of adding that restrictive policy would be that all the users who previously had no access at all would suddenly have access to everything not prohibited by the restrictive policy. That goes against everything I would expect from a restrictive policy -- adding more restrictive policies should only ever reduce the number of rows visible, not ever open up more. So I think the test above should only disable the default-deny policy if there are any permissive policies from the extension. Of course that will make life a little harder for people who only want to use restrictive policies, since they will be forced to first add a permissive policy, possibly just one that's always true, but I think it's the safest approach. Requiring a permissive policy which allows the records first, to avoid the default-deny, makes a ton of sense. Possibly if/when we add proper SQL support for defining restrictive policies, we might also add an option to ALTER TABLE ... ENABLE ROW LEVEL SECURITY to allow a table to have RLS enabled in default-allow mode, for users who know they only want to add restrictive policies. Perhaps... I'm not sure that's really better than simply requiring a 'create policy p1 on t1 using (true);' to be done. However, I think that should be something the user has to explicitly ask for, not an automatic decision based on the existence of restrictive policies. Agreed. I'm happy to commit that change and back-patch it to 9.5, barring objections. Given that the only way to have restrictive policies currently is using hooks, I don't believe there's any documentation update required either; of course, I'll update the comment to reflect this discussion/decision and make any necessary changes to test_rls_hooks. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] RLS restrictive hook policies
Dean, * Dean Rasheed (dean.a.rash...@gmail.com) wrote: On 3 August 2015 at 16:09, Stephen Frost sfr...@snowman.net wrote: * Stephen Frost (sfr...@snowman.net) wrote: Agreed. I'm happy to commit that change and back-patch it to 9.5, barring objections. Given that the only way to have restrictive policies currently is using hooks, I don't believe there's any documentation update required either; of course, I'll update the comment to reflect this discussion/decision and make any necessary changes to test_rls_hooks. Patch attached to make this change, with appropriate comment updates and fixes to the test_rls_hooks modules. Comments? Looks good to me. Thanks! Pushed to master and 9.5. Stephen signature.asc Description: Digital signature
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Fri, Jul 31, 2015 at 12:59 AM, Heikki Linnakangas hlinn...@iki.fi wrote: Oh, ok, I was confused on how the heap works. You could still abstract this as in-memory tails of the tapes, but it's more complicated than I thought at first: When it's time to drain the heap, in performsort, divide the array into two arrays, based on the run number of each tuple, and then quicksort the arrays separately. The first array becomes the in-memory tail of the current tape, and the second array becomes the in-memory tail of the next tape. You wouldn't want to actually allocate two arrays and copy SortTuples around, but keep using the single large array, just logically divided into two. So the bookkeeping isn't trivial, but seems doable. Since you're talking about the case where we must drain all tuples within tuplesort_performsort(), I think you're talking about a distinct idea here (since surely you're not suggesting giving up on my idea of avoiding dumping most tuples, which is a key strength of the patch). That's fine, but I just want to be clear on that. Importantly, my optimization does not care about the fact that the array may have two runs in it, because it quicksorts the array and forgets about it being a heap. It only cares whether or not more than one run made it out to tape, which makes it more widely applicable than it would otherwise be. Also, the fact that much I/O can be avoided is clearly something that can only happen when work_mem is at least ~50% of a work_mem setting that would have resulted in an (entirely) internal sort. You're talking about a new thing here, that happens when it is necessary to dump everything and do a conventional merging of on-tape runs. IOW, we cannot fit a significant fraction of overall tuples in memory, and we need much of the memtuples array for the next run (usually this ends as a TSS_FINALMERGE). That being the case (...right?), I'm confused that you're talking about doing something clever within tuplesort_performsort(). In the case you're targeting, won't the vast majority of tuple dumping (through calls to dumptuples()) occur within puttuple_common()? I think that my optimization should probably retain it's own state.status even if we do this (i.e. TSS_MEMTAPEMERGE should stay). Hmm, I can see another possible optimization here, in the way the heap is managed in TSS_BUILDRUNS state. Instead of keeping a single heap, with tupindex as the leading key, it would be more cache efficient to keep one heap for the currentRun, and an unsorted array of tuples belonging to currentRun + 1. When the heap becomes empty, and currentRun is incemented, quicksort the unsorted array to become the new heap. That's a completely separate idea from your patch, although if you did it that way, you wouldn't need the extra step to divide the large array into two, as you'd maintain that division all the time. This sounds to me like a refinement of your first idea (the idea that I just wrote about). I think the biggest problem with tuplesort after this patch of mine is committed is that it is still too attached to the idea of incrementally spilling and sifting. It makes sense to some degree where it makes my patch possible...if we hang on to the idea of incrementally spilling tuples on to tape in sorted order for a while, then maybe we can hang on for long enough to quicksort most tuples, *and* to avoid actually dumping most tuples (incremental spills make the latter possible). 
But when work_mem is only, say, 10% of the setting required for a fully internal sort, then the incremental spilling and sifting starts to look dubious, at least to me, because the TSS_MEMTAPEMERGE optimization added by my patch could not possibly apply, and dumping and merging many times is inevitable. What I think you're getting at here is that we still have a heap, but we don't use the heap to distinguish between tuples within a run. In other words, HEAPCOMPARE() often/always only cares about run number. We quicksort after a deferred period of time, probably just before dumping everything. Perhaps I've misunderstood, but I don't see much point in quicksorting a run before being sure that you're sorting as opposed to heapifying at that point (you're not clear on what we've decided on once we quicksort). I think it could make sense to make HEAPCOMPARE() not care about tuples within a run that is not currentRun, though. I think that anything that gives up on replacement selection's ability to generate large runs, particularly for already sorted inputs will be too hard a sell (not that I think that's what you proposed). That's way, way less of an advantage than it was in the past (back when external sorts took place using actual magnetic tape, it was a huge), but the fact remains that it is an advantage. And so, I've been prototyping an approach where we don't heapify once it is established that this TSS_MEMTAPEMERGE optimization of mine cannot possibly apply. We quicksort in batches rather
Re: [HACKERS] Re: Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Sat, Aug 1, 2015 at 9:49 AM, Jeremy Harris j...@wizmail.org wrote: On 31/07/15 18:31, Robert Haas wrote: On Fri, Jul 31, 2015 at 7:21 AM, Jeremy Harris j...@wizmail.org wrote: Heapification is O(n) already, whether siftup (existing) or down. That's not my impression, or what Wikipedia says. Source? Measurements done last year: http://www.postgresql.org/message-id/52f35462.3030...@wizmail.org (spreadsheet attachment) http://www.postgresql.org/message-id/52f40ce9.1070...@wizmail.org (measurement procedure and spreadsheet explanation) I don't think that running benchmarks is the right way to establish the asymptotic runtime of an algorithm. I mean, if you test quicksort, it will run in much less than O(n^2) time on almost any input. But that does not mean that the worst-case run time is anything other than O(n^2). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minimum tuple threshold to decide last pass of VACUUM
On 8/3/15 12:04 PM, Simon Riggs wrote: Yes, that's what I was thinking, I just didn't say actually it. I'd been thinking about having VACUUM do just Phase 1 for some time, since its so much faster to do that. Will code. I'd like to see that exposed as an option as well. There are certain situations where you'd really like to just freeze things as fast as possible, without waiting for a full vacuum. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
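As a reference point for what exists today versus what is being asked for here (the second statement is purely hypothetical syntax, not anything implemented):

    -- Existing: force aggressive freezing now, but this still scans
    -- indexes and makes the second heap pass when dead tuples exist.
    VACUUM (FREEZE, VERBOSE) big_table;

    -- Requested: a first-pass-only mode -- prune, freeze, set visibility
    -- bits, and skip the index/second-pass work.  Hypothetical:
    -- VACUUM (FREEZE, PRUNE_ONLY) big_table;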
Re: [HACKERS] Re: Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Sat, Aug 1, 2015 at 9:56 AM, Andres Freund and...@anarazel.de wrote: On 2015-07-31 13:31:54 -0400, Robert Haas wrote: On Fri, Jul 31, 2015 at 7:21 AM, Jeremy Harris j...@wizmail.org wrote: Heapification is O(n) already, whether siftup (existing) or down. That's not my impression, or what Wikipedia says. Source? Building a binary heap via successive insertions is O(n log n), but building it directly is O(n). Looks like wikipedia agrees too https://en.wikipedia.org/wiki/Binary_heap#Building_a_heap That doesn't really address the sift-up vs. sift-down question. Maybe I'm just confused about the terminology. I think that Wikipedia article is saying that if you iterate from the middle element of an unsorted array towards the beginning, establishing the heap invariant for every item as you reach it, you will take only O(n) time. But that is not what inittapes() does. It instead starts at the beginning of the array and inserts each element one after the other. If this is any different from building the heap via successive insertions, I don't understand how. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
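For reference, the textbook derivation that reconciles the two readings (a standard result, not from the thread): a bottom-up build -- iterating from the middle of the array toward the front, sifting down -- has at most $\lceil n/2^{h+1} \rceil$ nodes at height $h$, each costing $O(h)$ to sift, so the total is

$$\sum_{h=0}^{\lfloor \log_2 n \rfloor} \left\lceil \frac{n}{2^{h+1}} \right\rceil O(h) = O\!\left(n \sum_{h \ge 0} \frac{h}{2^{h+1}}\right) = O(n), \qquad \text{since } \sum_{h \ge 0} \frac{h}{2^h} = 2.$$

Successive insertion, as inittapes() does it, instead costs $\sum_{k=1}^{n} O(\log_2 k) = O(n \log n)$ in the worst case -- consistent with Robert's reading that the $O(n)$ bound applies only to the bottom-up variant.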
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Mon, Aug 3, 2015 at 3:33 PM, Peter Geoghegan p...@heroku.com wrote: When it's time to drain the heap, in performsort, divide the array into two arrays, based on the run number of each tuple, and then quicksort the arrays separately. The first array becomes the in-memory tail of the current tape, and the second array becomes the in-memory tail of the next tape. You wouldn't want to actually allocate two arrays and copy SortTuples around, but keep using the single large array, just logically divided into two. So the bookkeeping isn't trivial, but seems doable. You're talking about a new thing here, that happens when it is necessary to dump everything and do a conventional merging of on-tape runs. IOW, we cannot fit a significant fraction of overall tuples in memory, and we need much of the memtuples array for the next run (usually this ends as a TSS_FINALMERGE). That being the case (...right?), I don't think that's what Heikki is talking about. He can correct me if I'm wrong, but what I think he's saying is that we should try to exploit the fact that we've already determined which in-memory tuples can be part of the current run and which in-memory tuples must become part of the next run. Suppose half the tuples in memory can become part of the current run and the other half must become part of the next run. If we throw all of those tuples into a single bucket and quicksort it, we're losing the benefit of the comparisons we did to figure out which tuples go in which runs. Instead, we could quicksort the current-run tuples and, separately, quick-sort the next-run tuples. Ignore the current-run tuples completely until the tape is empty, and then merge them with any next-run tuples that remain. I'm not sure if there's any reason to believe that would be faster than your approach. In general, sorting is O(n lg n) so sorting two arrays that are each half as large figures to be slightly faster than sorting one big array. But the difference may not amount to much. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
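The comparison-count arithmetic behind that last paragraph, for what it's worth:

$$2 \cdot \frac{n}{2} \log_2 \frac{n}{2} = n \log_2 n - n,$$

so two half-size sorts save on the order of $n$ comparisons out of roughly $n \log_2 n$ total -- a shrinking fraction as $n$ grows, which matches the "may not amount to much" caveat.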