Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-11-10 Thread Michael Paquier
On Sun, Nov 9, 2014 at 10:32 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sun, Nov 9, 2014 at 6:41 AM, Rahila Syed rahilasye...@gmail.com
wrote:
 Hello,

The patch was not applied to the master cleanly. Could you update the
 patch?
 Please find attached updated and rebased patch to compress FPW. Review
 comments given above have been implemented.

 Thanks for updating the patch! Will review it.

 BTW, I got the following compiler warnings.

 xlogreader.c:755: warning: assignment from incompatible pointer type
 autovacuum.c:1412: warning: implicit declaration of function
 'CompressBackupBlocksPagesAlloc'
 xlogreader.c:755: warning: assignment from incompatible pointer type
I have been looking at this patch, here are some comments:
1) This documentation change is incorrect:
-  <term><varname>full_page_writes</varname> (<type>boolean</type>)
+  <term><varname>full_page_writes</varname> (<type>enum</type>)</term>
   <indexterm>
    <primary><varname>full_page_writes</> configuration parameter</primary>
   </indexterm>
-  </term>
The closing tag of the term block was correctly placed before this change.
2) This patch defines FullPageWritesStr and full_page_writes_str, but both
do more or less the same thing.
3) This patch is touching worker_spi.c and calling
CompressBackupBlocksPagesAlloc directly. Why is that necessary? Doesn't a
bgworker call InitXLOGAccess once it connects to a database?
4) Be careful as well about whitespace (code lines should have a
maximum of 80 characters):
+ * If compression is set on replace the rdata nodes of backup blocks
added in the loop
+ * above by single rdata node that contains compressed backup blocks
and their headers
+ * except the header of first block which is used to store the
information about compression.
+ */
5) GetFullPageWriteGUC or something similar is necessary, but I think that
for consistency with doPageWrites its value should be fetched in XLogInsert
and then passed as an extra argument to XLogRecordAssemble. Thinking more
about this, it would be cleaner to simply have a bool flag tracking whether
compression is active, something like doPageCompression, that could be
fetched using GetFullPageWriteInfo. We could even directly track
forcePageWrites and fullPageWrites, but that would make back-patching more
difficult for not much gain.
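To make that concrete, here is a minimal, hedged sketch of the shape I have
in mind; the enum value FULL_PAGE_WRITES_COMPRESS, the three-argument
GetFullPageWriteInfo() and the extra XLogRecordAssemble() argument are
assumptions for illustration, not code from the patch:

static bool doPageCompression;	/* backend-local, mirrors doPageWrites */

	/* In XLogInsert(): fetch the state once per record ... */
	GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites, &doPageCompression);

	/*
	 * ... and pass it down explicitly instead of consulting the GUC deeper
	 * in the stack.
	 */
	rdt = XLogRecordAssemble(rmid, info, rdata, doPageWrites,
							 doPageCompression, &fpw_lsn);
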
6) Not really a complaint, but note that this patch uses two bits that were
unused up to now to store the compression status of a backup block. This is
actually safe as long as the maximum page size is not higher than 32kB,
which is the limit enforced by --with-blocksize anyway. I think that this
deserves a comment at the top of the declaration of BkpBlock.
!   unsigned    hole_offset:15, /* number of bytes before hole */
!               flags:2,        /* state of a backup block, see below */
!               hole_length:15; /* number of bytes in hole */
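For instance, the kind of comment I have in mind at the top of BkpBlock
(the struct body below just reproduces the patch's bitfield layout; the
comment wording is only a suggestion):

/*
 * Header info for a backup block appended to an XLOG record.
 *
 * Note: hole_offset and hole_length are 15-bit fields, with the two bits in
 * between used for the compression state of the block.  This only works
 * because BLCKSZ cannot exceed 32768 bytes (the maximum --with-blocksize
 * accepts), so any offset or length within a page fits in 15 bits.
 */
typedef struct BkpBlock
{
	RelFileNode node;			/* relation containing block */
	ForkNumber	fork;			/* fork within the relation */
	BlockNumber block;			/* block number */
	unsigned	hole_offset:15, /* number of bytes before hole */
				flags:2,		/* state of a backup block, see below */
				hole_length:15; /* number of bytes in hole */
} BkpBlock;
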
7) Some code in RestoreBackupBlock:
+   char *uncompressedPages;
+
+   uncompressedPages = (char *)palloc(XLR_TOTAL_BLCKSZ);
[...]
+   /* Check if blocks in WAL record are compressed */
+   if (bkpb.flag_compress == BKPBLOCKS_COMPRESSED)
+   {
+   /* Checks to see if decompression is successful is made
inside the function */
+   pglz_decompress((PGLZ_Header *) blk, uncompressedPages);
+   blk = uncompressedPages;
+   }
uncompressedPages is palloc'd all the time, but you actually only need to
do that when the block is compressed.
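In other words, something along these lines (just a sketch of the
reshuffling, not a tested change; the names are the ones from your patch):

	/* Check whether the backup blocks of this WAL record are compressed */
	if (bkpb.flag_compress == BKPBLOCKS_COMPRESSED)
	{
		/* allocate the scratch buffer only when we actually need it */
		char	   *uncompressedPages = (char *) palloc(XLR_TOTAL_BLCKSZ);

		/* pglz_decompress() errors out internally on corrupt input */
		pglz_decompress((PGLZ_Header *) blk, uncompressedPages);
		blk = uncompressedPages;
	}
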
8) Arf, I don't much like the logic around CompressBackupBlocksPagesAlloc
using a malloc to allocate once the space necessary for the compressed and
uncompressed pages. You are right not to do that inside a critical section,
but PG tries to use palloc for allocations as much as possible. Now it is
true that if a palloc does not succeed, PG always ERRORs out (adding an
entry to the TODO list)... Hence I think that using a static variable for
those compressed and uncompressed pages makes more sense, and it greatly
simplifies the patch as well.
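The shape I have in mind is roughly the following (hedged sketch; the sizes
and names are placeholders taken from your patch):

/* Scratch space for backup-block compression, one set of buffers per backend. */
static char compressed_pages[PGLZ_MAX_OUTPUT(XLR_TOTAL_BLCKSZ)];
static char uncompressed_pages[XLR_TOTAL_BLCKSZ];

With static buffers there is nothing to allocate at SIGHUP time, so
CompressBackupBlocksPagesAlloc and the malloc/ERROR dance go away entirely.
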
9) Is avw_sighup_handler really necessary? What's wrong with allocating it
all the time by default? This avoids some potential caveats in error
handling as well as in value updates for full_page_writes.

So, note that I am not only complaining about the patch: I actually rewrote
it as attached while reviewing, with additional minor cleanups and
enhancements. I also ran a couple of tests like the attached script; the
compression numbers are more or less the same as with your previous patch,
with some noise creating small differences. I have also done some
regression test runs with a standby replaying behind.

I'll go through the patch once again a bit later, but feel free to comment.
Regards,
-- 
Michael


compress_test_2.sh
Description: Bourne shell script
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 47b1192..6756172 100644
--- a/doc/src/sgml/config.sgml
+++ 

Re: [HACKERS] pg_receivexlog --status-interval add fsync feedback

2014-11-10 Thread furuyao
 On Fri, Oct 31, 2014 at 5:46 PM,  furu...@pm.nttdata.co.jp wrote:
   We seem to be going in circles. You suggested having two options,
   --feedback, and --fsync, which is almost exactly what Furuya posted
   originally. I objected to that, because I think that user interface
  is
   too complicated. Instead, I suggested having just a single option
   called --synchronous, or even better, have no option at all and
   have the server tell the client if it's participating in
   synchronous replication, and have pg_receivexlog automatically
   fsync when it is, and not otherwise [1]. That way you don't need
 to
   expose any new options to the user. What did you think of that idea?
 
  I think it's pretty weird to make the fsync behavior of the client
 is
  controlled by the server.
 
  I also don't think that fsync() on the client side is useless in
  asynchronous replication.  Yeah, it's true that there are no
  *guarantees* with asynchronous replication, but the bound on how long
  the data can take to get out to disk is a heck of a lot shorter if
  you fsync frequently than if you don't.  And on the flip side, that
  has a performance impact.
 
  So I actually think the design you proposed is not as good as what
  was proposed by Furuya and Simon.  But I don't feel incredibly
  strongly about it.
 
  Thanks for lots of comments!!
 
  I fixed the patch.
  As a default, it behaves like the walreceiver:
  same as the walreceiver, it fsyncs and sends feedback after each fsync.
 
 On second thought, flipping the default behavior seems not worthwhile
 here; it might surprise the existing users and cause them some trouble.
 I'd like to go back to Heikki's original suggestion of just adding a
 --synchronous option. That is, only when --synchronous is specified, WAL
 is flushed and feedback is sent back as soon as WAL is received.
 
 I changed the patch heavily in that way. Please find the attached patch.
 By default, pg_receivexlog flushes WAL data only when the WAL file is closed.
 If --synchronous is specified then, like the standby's WAL receiver, sync
 commands are issued as soon as there is WAL data which has not been flushed
 yet. Also, status packets are sent back to the server just after WAL data
 is flushed, whatever --status-interval is set to. I added a description
 of this behavior to the docs.
 
 Thought?

I think it's as you pointed out.
Thank you for the patch.
I did a review of the patch.
There was no problem.
I confirmed the following.

1. applied cleanly, and compilation was without warnings and errors
2. all regression tests passed ok
3. when --synchronous is not specified, it does not fsync except when the WAL file is switched
4. a normal backup is taken when pg_basebackup is executed with '-X s'
5. when --synchronous is specified, it fsyncs every time it receives WAL data
6. when --synchronous is specified, feedback is sent back whenever it fsyncs, regardless of the -s option
7. when --slot is not specified, it does not report the fsync location in its feedback
8. no problem in the documentation

Regards,

--
Furuya Osamu




Re: [HACKERS] alter user/role CURRENT_USER

2014-11-10 Thread Kyotaro HORIGUCHI
Hello, This is the new version. (WIP v2)

The first attachment is the patch and the second is test sql
script.

- Behavior changing

Almost all syntax that takes a role now accepts CURRENT_USER and
SESSION_USER, and they are distinguished from the quoted identifiers
"current_user" and "session_user". The exceptions are as follows.

 - CREATE USER/GROUP roleid
 - ALTER ROLE/GROUP/USER roleid RENAME TO newname
 
 These forms still do not accept keywords like CURRENT_USER or special
 names like public at all, but they do accept the quoted identifier
 "current_user". The error message is changed as follows.

| postgres=# create user current_user;
| ERROR:  role name should not be a keyword nor reserved name.
| LINE 1: create user current_user;
| ^

# Some other messages may have changed...

USER and CURRENT_ROLE haven't been extended to other syntax. The
former can still be used only in CREATE/ALTER/DROP USER MAPPING,
and the latter cannot be used outside of function expressions.

- Storage for new information

The new struct NameId stores an identifier together with a tag telling
what it logically is, using the new enum NameIdTypes.

This is still somewhat hampered by the difference between CURRENT_USER
and PUBLIC, but it now distinguishes CURRENT_USER from the quoted
identifier "current_user". A user OID has no room to represent the
difference among PUBLIC, NONE and 'not found' as the result of
get_nameid_oid(), so the members of NameId are exposed in foreigncmds.c,
which gets a bit uglier.
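
For reference, the shape of the new node is roughly the following
(reconstructed sketch; the exact member names in the attached patch may
differ slightly):

/* What kind of name a role specification carries. */
typedef enum NameIdTypes
{
	NAMEID_ROLENAME,			/* an ordinary (possibly quoted) role name */
	NAMEID_CURRENT_USER,		/* the CURRENT_USER keyword */
	NAMEID_SESSION_USER			/* the SESSION_USER keyword */
} NameIdTypes;

typedef struct NameId
{
	NodeTag		type;
	NameIdTypes kind;			/* discriminates the cases above */
	char	   *name;			/* role name, used when kind == NAMEID_ROLENAME */
} NameId;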

-  Changes of related structs and grammar.

The type of the role member is changed to NameId in some of the parser
structs. AlterTableCmd.name has many other usages, so I added a new
member NameId *newowner for exclusive use.

The type of the OWNER clause of CREATE TABLESPACE is changed to RoleId.
I suppose there's no particular reason that the non-terminal was name.

The usage of public and none had been blocked for CREATE/RENAME USER
in user.c, but now it is blocked in gram.y.


- New function to resolve NameId

The new function get_nameid_oid() is added. It is an alternative to
get_role_oid() which can handle CURRENT_USER and the quoted identifier
"current_user" properly. get_role_oid() is still used in many places
having no direct relation to this syntax.

- Others

No doc provided for now.


regards,


  Adam Brightwell adam.brightw...@crunchydatasolutions.com writes:
   FWIW, I found the following in the archives:
  
   http://www.postgresql.org/message-id/15516.1038718...@sss.pgh.pa.us
  
   Now this is from 2002 and it appears it wasn't necessary to change at the
   time, but I haven't yet found anything else related (it's a big archive).
   Though, as I understand it, PUBLIC is now non-reserved as of SQL:2011 
   which
   might make a compelling argument to leave it as is?
  
  The current spec does list PUBLIC as a non-reserved keyword, but it also
  says (5.4 Names and identifiers syntax rules)
  
   20) No authorization identifier shall specify PUBLIC.
  
  which, oddly enough, seems to license us to handle PUBLIC the way
  we are doing.  OTOH, it lists CURRENT_USER as a reserved word, suggesting
  that they don't think the same type of hack should be used for that.
  
  I'd be inclined to leave the grammar as such alone (ie CURRENT_USER is
  a keyword, PUBLIC isn't).  Changing that has more downside than upside,
  and we do have justification in the spec for treating the two cases
  differently.  However, I agree that we should fix the subsequent
  processing so that "current_user" is not confused with CURRENT_USER.
 
 Sure, maybe.
 
  - PUBLIC should be left as it is, as a special identifier
which cannot be used even if quoted under some syntax.
 
  - CURRENT_USER should be a keyword as it is, but we shouldn't
inhibit it from being used as an identifier if
quoted. SESSION_USER and USER should be treated in the same way.
 
We don't want to use something other than a string (prefixed by a
zero byte) as a matter of course :). And resolving the name to a
role OID immediately in gram.y is not allowed for the reason shown
upthread.
 
So it is necessary to add a new member to the struct, say
int special_role; ... Let me find a saner name for it :(
 
  - USER and CURRENT_ROLE are not needed for syntax other than the
    forms which already use them.
 
 I will work on it this way. Let me know if something is not acceptable.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From f18d078d5e6e4005e803ecc954e59c899dbfd557 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi horiguchi.kyot...@lab.ntt.co.jp
Date: Mon, 10 Nov 2014 19:08:42 +0900
Subject: [PATCH] ALTER USER CURRENT_USER v2

---
 src/backend/catalog/aclchk.c   |  13 ++-
 src/backend/commands/alter.c   |   2 +-
 src/backend/commands/foreigncmds.c |  39 -
 src/backend/commands/schemacmds.c  |  26 +-
 src/backend/commands/tablecmds.c   |   2 +-
 src/backend/commands/tablespace.c  |   2 +-
 src/backend/commands/user.c|  70 +++
 src/backend/nodes/copyfuncs.c  |  37 +---
 src/backend/nodes/equalfuncs.c |  35 +---
 

Re: [HACKERS] [v9.5] Custom Plan API

2014-11-10 Thread Kouhei Kaigai
 On Sat, Nov 8, 2014 at 4:16 AM, Robert Haas robertmh...@gmail.com wrote:
 
  On Mon, Oct 27, 2014 at 2:35 AM, Kouhei Kaigai kai...@ak.jp.nec.com
 wrote:
   FYI, patch v12 part 2 no longer applies cleanly.
  
   Thanks. I rebased the patch set according to the latest master branch.
   The attached v13 can be applied to the master.
 
  I've committed parts 1 and 2 of this, without the documentation, and
  with some additional cleanup.
 
 Few observations/questions related to this commit:
 
 1.
 @@ -5546,6 +5568,29 @@ get_variable(Var *var, int levelsup, bool istoplevel,
  deparse_context *context)
   colinfo = deparse_columns_fetch(var->varno, dpns);
   attnum = var->varattno;
   }
 + else if (IS_SPECIAL_VARNO(var->varno) && IsA(dpns->planstate, CustomScanState) &&
 +		  (expr = GetSpecialCustomVar((CustomScanState *) dpns->planstate,
 +									  var, &child_ps)) != NULL)
 + {
 +	deparse_namespace save_dpns;
 +
 +	if (child_ps)
 +		push_child_plan(dpns, child_ps, &save_dpns);
 +	/*
 +	 * Force parentheses because our caller probably assumed a Var is a
 +	 * simple expression.
 +	 */
 +	if (!IsA(expr, Var))
 +		appendStringInfoChar(buf, '(');
 +	get_rule_expr((Node *) expr, context, true);
 +	if (!IsA(expr, Var))
 +		appendStringInfoChar(buf, ')');
 +
 +	if (child_ps)
 +		pop_child_plan(dpns, &save_dpns);
 +	return NULL;
 + }
 
 a. It seems an Assert on netlevelsup is missing in this block.

Indeed, this if-block does not have an assertion, unlike the other special-varno cases.

 b. Below comment in function get_variable can be improved w.r.t handling
 for CustomScanState.  The comment indicates that if varno is OUTER_VAR or
 INNER_VAR or INDEX_VAR, it handles all of them similarly which seems to
 be slightly changed for CustomScanState.
 
 /*
  * Try to find the relevant RTE in this rtable.  In a plan tree, it's
  * likely that varno is OUTER_VAR or INNER_VAR, in which case we must dig
  * down into the subplans, or INDEX_VAR, which is resolved similarly. Also
  * find the aliases previously assigned for this RTE.
  */
 
I added a small comment explaining that only the extension knows the
mapping between these special varnos and the underlying expressions, so
we query the provider for the expression tied to this special varnode.
Does that make sense?

 2.
 +void
 +register_custom_path_provider(CustomPathMethods *cpp_methods)
 {
 ..
 }
 
 Shouldn't there be unregister function corresponding to above register
 function?

Even though it is not difficult to implement, in what situation would it
make sense to unregister, rather than just turning off an enable_*_scan
style GUC parameter added by the extension itself?
I initially thought a prepared statement with a custom-scan node would be
problematic if the provider got unregistered / unloaded; however,
internal_unload_library() actually does nothing. So it is at least
harmless even if we implemented it.
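If we did add one, it would presumably be little more than the reverse list
operation. A hedged sketch, assuming the providers are kept in a simple
backend-local List (the actual storage in the committed code may differ):

static List *custom_path_providers = NIL;

void
register_custom_path_provider(CustomPathMethods *cpp_methods)
{
	MemoryContext oldcxt = MemoryContextSwitchTo(TopMemoryContext);

	custom_path_providers = lappend(custom_path_providers, cpp_methods);
	MemoryContextSwitchTo(oldcxt);
}

void
unregister_custom_path_provider(CustomPathMethods *cpp_methods)
{
	/* removes the pointer if present, does nothing otherwise */
	custom_path_providers = list_delete_ptr(custom_path_providers, cpp_methods);
}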

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com



pgsql-v9.5-get_variable-smallfix.patch
Description: pgsql-v9.5-get_variable-smallfix.patch



Re: [HACKERS] postgres_fdw behaves oddly

2014-11-10 Thread Ashutosh Bapat
Since two separate issues, 1. using reltargetlist instead of attr_needed
and 2. system columns usage in FDWs, are being tackled here, we should
separate the patch into two, one for each of the issues.

While I agree that the system columns shouldn't be sent to the remote node,
it doesn't look clear to me what they or their values would mean in the
context of foreign data. Some columns like tableoid would have a value
which is the OID of the foreign table; other system columns like
xmin/xmax/ctid will have different meanings (or no meaning) depending upon
the foreign data source. In the case of the latter columns, each FDW would
have its own way of defining that meaning (I guess). But in any case, I
agree that we shouldn't send the system columns to the remote side.

Is there a way we can enforce this rule across all the FDWs? Or do we want
to tackle it separately per FDW?

On Tue, Oct 14, 2014 at 8:05 AM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp
wrote:

 (2014/10/14 8:53), Robert Haas wrote:

 On Mon, Oct 13, 2014 at 3:30 PM, Bruce Momjian br...@momjian.us wrote:

 Uh, where are we on this patch?


 Probably it should be added to the upcoming CF.


 Done.

 https://commitfest.postgresql.org/action/patch_view?id=1599

 Thanks,

 Best regards,
 Etsuro Fujita






-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


[HACKERS] Allow signal handlers to optionally use SA_SIGINFO information?

2014-11-10 Thread Andres Freund
Hi,

During benchmarking/debugging I just had the problem that autovacuum was
being signalled at an insane rate - leading to more than one hundred
autovacuum workers being started per second, to an overall slowdown of
more than 90%, and to the anti-wraparound vacuum not finishing.

The problem is that I couldn't easily figure out where all the SIGUSR2s
to the autovacuum launcher were coming from. This isn't the first time
that I have had that kind of problem.

POSIX provides information about the source of the signal when using
SA_SIGINFO style handlers via si_code/si_pid. That information has been
available for a *long* while
(c.f. http://pubs.opengroup.org/onlinepubs/7908799/xsh/signal.h.html).

I've now hacked up my development instance to log something like
"autovacuum: invoked by pid 18175". I personally find that quite
helpful. I can also imagine it being rather helpful to log information
about the sender of SIGINT/TERM interrupts.

The existing abstractions are nearly sufficient to make it easy to
optionally use SA_SIGINFO style handlers, just by redefining SIGNAL_ARGS
and pqsigfunc. There are unfortunately two things making it harder:
SIG_IGN and SIG_DFL - those unfortunately can't be specified for
SA_SIGINFO style handlers (as they have a different signature), so we'd
need to use a different function for those two.
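
For illustration, here is a minimal standalone POSIX example of an
SA_SIGINFO-style handler that reports the sending PID (plain C, not
PostgreSQL code; in a backend this would of course go through the usual
signal/logging infrastructure instead):

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* SA_SIGINFO handlers receive a siginfo_t carrying si_pid/si_uid of the sender. */
static void
sigusr2_handler(int signo, siginfo_t *info, void *ucontext)
{
	char		buf[128];
	int			len;

	(void) ucontext;			/* unused */
	/* snprintf only for brevity; write() itself is async-signal-safe */
	len = snprintf(buf, sizeof(buf), "signal %d invoked by pid %d\n",
				   signo, (int) info->si_pid);
	(void) write(STDERR_FILENO, buf, len);
}

int
main(void)
{
	struct sigaction act;

	memset(&act, 0, sizeof(act));
	act.sa_sigaction = sigusr2_handler;	/* note: sa_sigaction, not sa_handler */
	act.sa_flags = SA_SIGINFO;
	sigemptyset(&act.sa_mask);
	sigaction(SIGUSR2, &act, NULL);

	for (;;)
		pause();				/* wait for signals */
	return 0;
}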

Comments, ideas?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] group locking: incomplete patch, just for discussion

2014-11-10 Thread Amit Kapila
On Fri, Nov 7, 2014 at 3:55 AM, Robert Haas robertmh...@gmail.com wrote:

 On Fri, Oct 31, 2014 at 9:07 AM, Andres Freund and...@2ndquadrant.com
wrote:
  Maybe we can, as a first step, make those edges in the lock graph
  visible to the deadlock detector? It's pretty clear that undetected
  deadlocks aren't ok, but detectable deadlocks in a couple corner cases
  might be acceptable.

 An interesting point came to mind this morning on this topic, while
 discussing this issue with Amit.  Suppose process A1 holds a lock on
 some object and process A2, a member of the same locking group, waits
 for a lock on that same object.  It follows that there must exist a
 process B which either *holds* or *awaits* a lock which conflicts with
 the one requested by A2; otherwise, A2 would have been granted the
 lock at once.  If B *holds* the conflicting lock, it may eventually
 release it; but if B *awaits* the conflicting lock, we have a soft
 deadlock, because A2 waits for B, which waits for A1, which we presume to
 wait for A2.[1]

 The logical consequence of this is that, if a process seeks to acquire
 a lock which is already held (in any mode) by a fellow group-member,
 it had better jump over the entire wait queue and acquire the lock
 immediately if no conflicting locks have already been granted.  If it
 fails to do this, it immediately creates a soft deadlock.  What if
 there *are* conflicting locks already granted?  In that case, things
 are more complicated.  We could, for example, have this situation: A1
 holds AccessShareLock, B holds ExclusiveLock, C1 awaits ShareLock, and
 then (following it in the wait queue) A2 awaits RowExclusiveLock.
 This is not a deadlock, because B can finish, then C can get the lock,
 then C1 can finish, then A2 can get the lock.  But if C2 now waits for
 some member of group A, then we have a soft deadlock between group A
 and group C, which can be resolved by moving A2's request ahead of C1.
 (This example is relatively realistic, too: a lock upgrade from
 AccessShareLock to RowExclusiveLock is probably the most common type
 of lock upgrade, for reasons that are hopefully obvious.)

 In practice, I suspect it's best to take a more aggressive approach
 and have A2 jump straight to the head of the line right out of the
 chute.  Letting some parallel workers make progress while holding up
 others out of a notion of locking fairness is probably stupid, because
 they're probably dependent on each other in some way that makes it
 useless for one to make progress while the others don't.  For the same
 reason, I'd argue that we ought to wait until all pending lock
 requests from a given group can be satisfied before granting any of
 them.

I think this logic can sometimes block processes from acquiring a lock
which no one is holding.  Assume Group-1 (G-1) is waiting on two locks
(lock L1 on table T1 and lock L2 on table T2) which are held by
unrelated processes P-2 and P-3 respectively; now P-3 releases the
lock on table T2, but it cannot be granted to G-1, which is waiting on
this lock, because lock L1 still cannot be granted to G-1.  At this
point the situation is that no one holds a lock on T2 and G-1 waits for
it, yet any new process won't be able to acquire a lock on T2 either,
because there is already a waiter for it in the wait queue.  OTOH, if we
had granted the lock on T2 to G-1, it might have finished its work and
released the lock (considering it is a short-lived lock on a catalog
relation).
If it is possible to have such a situation, then even though it is
beneficial in certain cases to view several pending requests from the
same group as a single request, it can at the same time be harmful to
some unrelated processes.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Column/type dependency recording inconsistencies

2014-11-10 Thread Petr Jelinek

On 09/11/14 23:36, Petr Jelinek wrote:

On 09/11/14 23:04, Tom Lane wrote:


I suspect this is actually a bug in the dependency search logic in DROP.
Don't have the energy to chase it right now though.



Yes, that's where I started searching also, and yes it is. The actual
situation is that the columns of a table that is about to be dropped
are normally not added to the object list that is used for dependency
checks (if the table is going to be dropped, who cares what its columns
depend on anyway).
But this logic relies on the columns that belong to that table being
processed after the table itself has already been processed. However,
as the new type was added after the table was added, it is processed
before the table, and because of the dependency between the type and the
columns, the columns are also processed before the table and so they are
added to the object list for further dependency checking.

See use and source of object_address_present_add_flags in dependency.c.



So here is what I came up with to fix this. It's quite simple and works
well enough; we don't really have any regression test that would hit
this, though.

I wonder if object_address_present_add_flags should be renamed, since
it does a bit more than what the name suggests at this point.

And I think this should be backpatched to all the versions with
extension support.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 256486c..e7ec7e2 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -2142,6 +2142,31 @@ object_address_present_add_flags(const ObjectAddress *object,
 				 */
 				return true;
 			}
+			if (object->objectSubId == 0)
+			{
+				/*
+				 * If we got here, it means that we are dropping table whose
+				 * columns are already in the object list, since we assume
+				 * later on that there are no subobjects of object present
+				 * in the list, we should remove those.
+				 *
+				 * Note that this is very rare occasion that normally happens
+				 * only when dropping extension which had some of its table
+				 * columns ALTERed after CREATE with custom types or
+				 * collations.
+				 */
+				if (i < addrs->numrefs - 1)
+				{
+					ObjectAddressExtra *thisextra = addrs->extras + i;
+
+					memmove(thisobj, thisobj + 1,
+							(addrs->numrefs - 1 - i) * sizeof(ObjectAddress));
+					memmove(thisextra, thisextra + 1,
+							(addrs->numrefs - 1 - i) * sizeof(ObjectAddressExtra));
+				}
+				addrs->numrefs--;
+				continue;
+			}
 		}
 	}
 



Re: [HACKERS] group locking: incomplete patch, just for discussion

2014-11-10 Thread Robert Haas
On Mon, Nov 10, 2014 at 6:33 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 I think this logic can sometimes block processes from acquiring a lock
 which no one is holding.  Assume Group-1 (G-1) is waiting on two locks
 (Lock L1 on Table T1 and Lock L2 on Table T2) which are held by
 unrelated processes P-2 and P-3 respectively; now P-3 releases the
 lock on table T-2 and cannot grant it to G-1 who is waiting on this lock
 because still L-1 lock cannot be granted to G-1.

That's not what I meant.  I meant we shouldn't grant a lock *on a
particular object* to any process on the group until all lock requests
*for that object* can be granted.  Waiting until every awaited lock in
the system can be granted in all requested modes would be really hard
to implement and probably have all kinds of problems (as you point out
here).
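
Expressed as a sketch (purely illustrative; the helper functions here are
hypothetical, nothing like them exists in lock.c today):

/* Hypothetical helpers, for illustration only. */
static bool
GroupRequestsGrantable(LOCK *lock, PGPROC *groupLeader)
{
	/*
	 * Grant the group's pending requests on this lock object only when all
	 * of them are compatible with the locks already granted to processes
	 * outside the group; otherwise keep the whole group waiting together.
	 */
	LOCKMASK	grantedOutsideGroup = GrantedModesExcludingGroup(lock, groupLeader);
	LOCKMASK	requestedByGroup = PendingModesForGroup(lock, groupLeader);

	return !LockModesConflict(requestedByGroup, grantedOutsideGroup);
}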

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] psql tab completion: \c [ dbname [ username ] ]

2014-11-10 Thread Robert Haas
On Sun, Nov 9, 2014 at 6:13 PM, Ian Barwick i...@2ndquadrant.com wrote:
 Attached is a mighty trivial patch to extend psql tab completion
 for \c / \connect to generate a list of role names, as lack thereof
 was annoying me recently and I can't see any downside to doing
 this.

Committed, thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] tracking commit timestamps

2014-11-10 Thread Robert Haas
On Sun, Nov 9, 2014 at 8:41 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 Robert Haas wrote:
 I think the key question here is the time for which the data needs to
 be retained.  2^32 of anything is a lot, but why keep around that
 number of records rather than more (after all, we have epochs to
 distinguish one use of a given txid from another) or fewer?

 The problem is not how much data we retain; is about how much data we
 can address.

I thought I was responding to a concern about disk space utilization.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] tracking commit timestamps

2014-11-10 Thread Petr Jelinek

On 10/11/14 08:01, Anssi Kääriäinen wrote:

On Sun, 2014-11-09 at 11:57 -0500, Steve Singer wrote:


The reason why Jim and myself are asking for the LSN and not just the
timestamp is that I want to be able to order the transactions. Jim
pointed out earlier in the thread that just ordering on timestamp allows
for multiple transactions with the same timestamp.

Maybe we don't need the entire LSN to solve that.  If you already have
the commit timestamp maybe you only need another byte or two of
granularity to order transactions that are within the same microsecond.


There is no guarantee that a commit with later LSN has a later
timestamp. There are cases where the clock could move significantly
backwards.

A robust solution to storing transaction commit information (including
commit order) in a way that can be referenced from other tables, can be
loaded to another cluster, and survives crashes would be a great
feature. But this feature doesn't have those properties.



It has the property of surviving crashes.

Not sure what you mean by referencing from other tables?

And about loading to another cluster: the txid does not really have any
meaning on another cluster, so the info about it does not have any either?

But anyway, this patch is targeting extensions, not DBAs - you could write
an extension that provides that feature on top of this patch (although,
given what I wrote above, I don't see how it would be useful).


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] tracking commit timestamps

2014-11-10 Thread Robert Haas
On Mon, Nov 10, 2014 at 2:01 AM, Anssi Kääriäinen
anssi.kaariai...@thl.fi wrote:
 There is no guarantee that a commit with later LSN has a later
 timestamp. There are cases where the clock could move significantly
 backwards.

Good point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] tracking commit timestamps

2014-11-10 Thread Petr Jelinek

On 09/11/14 17:57, Steve Singer wrote:

On 11/07/2014 07:07 PM, Petr Jelinek wrote:

The list of what is useful might be long, but we can't have everything
there as there are space constraints, and LSN is another 8 bytes and I
still want to have some bytes for storing the origin or whatever you
want to call it there, as that's the one I personally have biggest
use-case for.
So this would be ~24bytes per txid already, hmm I wonder if we can
pull some tricks to lower that a bit.



The reason why Jim and myself are asking for the LSN and not just the
timestamp is that I want to be able to order the transactions. Jim
pointed out earlier in the thread that just ordering on timestamp allows
for multiple transactions with the same timestamp.

Maybe we don't need the entire LSN to solve that.  If you already have
the commit timestamp maybe you only need another byte or two of
granularity to order transactions that are within the same microsecond.



Hmm, maybe just one part of the LSN, but I don't really like that either;
if we want to store the LSN we should probably store it as is, since
somebody might want to map it to a txid for other reasons.

I did the calculation above wrong, btw: it's actually 20 bytes, not 24
bytes, per record. I am inclined to just say we can live with that.

Since we agreed that the (B) case is not really feasible and we are
doing (C), I also wonder if extradata should be renamed to nodeid
(even if it's not used as a nodeid at this point). And then there is the
question of its size, since the nodeid itself can probably live with 2
bytes (64k nodes ought to be enough for everybody ;) ).
Or leave extradata as is, but treat it as reserved space for future use
and not expose it at the SQL level at all at this time?

I guess Andres could answer what suits him better here.

--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] tracking commit timestamps

2014-11-10 Thread Alvaro Herrera
Robert Haas wrote:
 On Sun, Nov 9, 2014 at 8:41 PM, Alvaro Herrera alvhe...@2ndquadrant.com 
 wrote:
  Robert Haas wrote:
  I think the key question here is the time for which the data needs to
  be retained.  2^32 of anything is a lot, but why keep around that
  number of records rather than more (after all, we have epochs to
  distinguish one use of a given txid from another) or fewer?
 
  The problem is not how much data we retain; is about how much data we
  can address.
 
 I thought I was responding to a concern about disk space utilization.

Ah, right.  So AFAIK we don't need to keep anything older than
RecentXmin or something like that -- which is not too old.  If I recall
correctly, Josh Berkus was saying in a thread about pg_multixact that it
used about 128kB or so in <= 9.2 for his customers; that one was also
limited to RecentXmin AFAIR.  I think a similar volume of commit_ts data
would be pretty acceptable, more so considering that it's turned off by
default.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] psql tab completion: \c [ dbname [ username ] ]

2014-11-10 Thread Ian Barwick

On 10/11/14 22:20, Robert Haas wrote:

On Sun, Nov 9, 2014 at 6:13 PM, Ian Barwick i...@2ndquadrant.com wrote:

Attached is a mighty trivial patch to extend psql tab completion
for \c / \connect to generate a list of role names, as lack thereof
was annoying me recently and I can't see any downside to doing
this.


Committed, thanks.


Many thanks!


Regards

Ian Barwick

--
 Ian Barwick   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] pg_multixact not getting truncated

2014-11-10 Thread Robert Haas
On Sat, Nov 8, 2014 at 3:10 PM, Josh Berkus j...@agliodbs.com wrote:
 I also think our
 defaults for multixact freezing should be tied to the ones for xid
 freezing, and should not by default be completely independent numbers;
 I'm still not convinced that it makes sense to have a separate multixact
 threshold at all **since the same amount of vacuuming needs to be done
 regardless of whether we're truncating xids or mxids**.

That doesn't make any sense at all.  For one thing, it's not as if
there is only ONE threshold here.  There are three different ones,
controlling three different aspects of the behavior: (a) the age at
which we begin trying to freeze the pages we are planning to vacuum
anyway, (b) the age at which we force a vacuum that we're planning to
do anyway to scan the entire table, and (c) the age at which we
trigger an autovacuum that we weren't otherwise planning to do.

Generally, the reason why I think we need different thresholds for
XIDs and MXIDs is that they may be getting consumed at vastly
different rates.  It may be useful to have a light that comes on in
your car when you only have one gallon of fuel left, but you'd want a
different threshold for an airplane because it burns fuel at a
different rate.  If you made that light come on when there's a gallon
of fuel left, it would be way too late to do any good.

I think a big part of the tuning problem here is that we don't have
any way of knowing what the real burn rates will be in a particular
customer environment.  If you're burning MXIDs very slowly, you
probably want threshold (a), the age at which we begin freezing
pages we are planning to vacuum anyway, quite low, so that the next
full-table vacuum triggered by XID consumption freezes all the MXIDs
also, and advances relminmxid, thus preventing freezing passes
specifically for MXIDs from ever happening.  But if the MXID
consumption rate is very high, that may result in unnecessary I/O
freezing tuples that would have been updated before MXID age became an
issue anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] tracking commit timestamps

2014-11-10 Thread Robert Haas
On Mon, Nov 10, 2014 at 8:39 AM, Petr Jelinek p...@2ndquadrant.com wrote:
 I did the calculation above wrong btw, it's actually 20 bytes not 24 bytes
 per record, I am inclined to just say we can live with that.

If you do it as 20 bytes, you'll have to do some work to squeeze out
the alignment padding.  I'm inclined to think it's fine to have a few
extra padding bytes here; someone might want to use those for
something in the future, and they probably don't cost much.
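
(To spell out the padding point: with the field set discussed upthread -
timestamp, LSN, node ID - a naively declared struct already rounds up to
24 bytes. Purely illustrative, not a proposal:)

#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-transaction entry: 8 + 8 + 4 = 20 bytes of payload. */
typedef struct CommitTsEntry
{
	int64_t		commit_time;	/* TimestampTz */
	uint64_t	commit_lsn;		/* XLogRecPtr */
	uint32_t	nodeid;			/* replication origin / node ID */
} CommitTsEntry;

int
main(void)
{
	/*
	 * On typical 64-bit platforms this prints 24: the struct is padded to a
	 * multiple of its 8-byte alignment, so the "20 byte" record costs 24
	 * bytes anyway unless the fields are packed or stored unaligned.
	 */
	printf("sizeof(CommitTsEntry) = %zu\n", sizeof(CommitTsEntry));
	return 0;
}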

 Since we agreed that the (B) case is not really feasible and we are doing
 the (C), I also wonder if extradata should be renamed to nodeid (even if
 it's not used at this point as nodeid). And then there is question about the
 size of it, since the nodeid itself can live with 2 bytes probably (64k of
 nodes ought to be enough for everybody ;) ).
 Or leave the extradata as is but use as reserved space for future use and
 not expose it at this time on SQL level at all?

I vote for calling it node-ID, and for allowing at least 4 bytes for
it.  Penny wise, pound foolish.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] tracking commit timestamps

2014-11-10 Thread Robert Haas
On Mon, Nov 10, 2014 at 8:40 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Ah, right.  So AFAIK we don't need to keep anything older than
 RecentXmin or something like that -- which is not too old.  If I recall
 correctly Josh Berkus was saying in a thread about pg_multixact that it
 used about 128kB or so in = 9.2 for his customers; that one was also
 limited to RecentXmin AFAIR.  I think a similar volume of commit_ts data
 would be pretty acceptable.  Moreso considering that it's turned off by
 default.

I'm not sure whether keeping it just back to RecentXmin will be enough
for everybody's needs.  But we certainly don't need to keep the last
2^32 records as someone-or-other was suggesting.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Allow signal handlers to optionally use SA_SIGINFO information?

2014-11-10 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 Posix provides information about the source of the signal when using
 SA_SIGINFO style handlers via si_code/si_pid. That information has been
 available for a *long* while
 (c.f. http://pubs.opengroup.org/onlinepubs/7908799/xsh/signal.h.html).

That does not mean that it exists on all the platforms we support;
Windows in particular is likely to be an issue.

I concur with Heikki that you're probably solving this at the wrong
end anyway, since the next question you'll be asking is *why* did
process X send that signal.

regards, tom lane




[HACKERS] Fwd: How parser works and to add keyword to keywords.h

2014-11-10 Thread Pankaj Bagul
Hello,
Can anyone explain how the parser works, and if we want to add a new
keyword to keywords.h, how to use that keyword? How does the whole
process work?

-- 
Pankaj B.


Re: [HACKERS] row_to_json bug with index only scans: empty keys!

2014-11-10 Thread Tom Lane
I wrote:
 Attached are patches meant for HEAD and 9.2-9.4 respectively.

BTW, has anyone got an opinion about whether to stick the full fix into
9.4?  The argument for, of course, is that we'd get the full fix out to
the public a year sooner.  The argument against is that someone might
have already done compatibility testing of their application against
9.4 betas, and then get blindsided if we change this before release.

I'm inclined to think that since we expect json/jsonb usage to really
take off with 9.4, it might be better if we get row_to_json behaving
unsurprisingly now.  But I'm not dead set on it.

regards, tom lane




Re: [HACKERS] row_to_json bug with index only scans: empty keys!

2014-11-10 Thread Robert Haas
On Mon, Nov 10, 2014 at 10:11 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wrote:
 Attached are patches meant for HEAD and 9.2-9.4 respectively.

 BTW, has anyone got an opinion about whether to stick the full fix into
 9.4?  The argument for, of course, is that we'd get the full fix out to
 the public a year sooner.  The argument against is that someone might
 have already done compatibility testing of their application against
 9.4 betas, and then get blindsided if we change this before release.

 I'm inclined to think that since we expect json/jsonb usage to really
 take off with 9.4, it might be better if we get row_to_json behaving
 unsurprisingly now.  But I'm not dead set on it.

I think we should put the full fix in 9.4.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] remove pg_standby?

2014-11-10 Thread Magnus Hagander
On Tue, Nov 4, 2014 at 10:36 PM, Peter Eisentraut pete...@gmx.net wrote:
 While we're talking about removing old things, is there any use left for
 pg_standby?

Was the original reason to keep it around anything other than
backwards compatibility? If not, then it was backwards compatibility
with a version that's not even supported anymore - so strong +1 for
getting rid of it.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] Fwd: How parser works and to add keyword to keywords.h

2014-11-10 Thread Tom Lane
Pankaj Bagul bagul@gmail.com writes:
 Can anyone tell about how parser works and if we want to add a new keyword
 to the keywords.h,
 and how to use that keyword ? How the all process works ?

The minimum requirements are as stated in gram.y: add the keyword to
kwlist.h, to the %token keyword list near the head of gram.y, and
to the appropriate one of the keyword category lists near the bottom
of gram.y.  Presumably you want to use this keyword in some production
or other, so you'll also need to add or modify parsetree node struct types
to represent the output from the production.  Best bet really is to
identify some existing SQL language feature that is somewhat related
to what you want to do, and then grep the PG source code for all mentions
of the related parse node type(s).  There's usually a pretty fair amount
of boilerplate support code that you'll need to add or modify, but copy
and paste is your friend for that.
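
As a hedged illustration (FOO is a made-up keyword and the surrounding
lists are abbreviated), the minimum touch points look roughly like this:

/* src/include/parser/kwlist.h -- keep the list alphabetical */
PG_KEYWORD("foo", FOO, UNRESERVED_KEYWORD)

/* src/backend/parser/gram.y -- add the token to the %token <keyword> list near the head */
%token <keyword> ... FOLLOWING FOO FOR FORCE ...

/* src/backend/parser/gram.y -- and to the matching category list near the bottom */
unreserved_keyword:
			...
			| FOLLOWING
			| FOO
			| FORCE
			...

After that, FOO can be referenced in whatever grammar production you add,
and the parse node that production builds is where the real work starts.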

regards, tom lane




Re: [HACKERS] row_to_json bug with index only scans: empty keys!

2014-11-10 Thread David G Johnston
Tom Lane-2 wrote
 I wrote:
 Attached are patches meant for HEAD and 9.2-9.4 respectively.
 
 BTW, has anyone got an opinion about whether to stick the full fix into
 9.4?  The argument for, of course, is that we'd get the full fix out to
 the public a year sooner.  The argument against is that someone might
 have already done compatibility testing of their application against
 9.4 betas, and then get blindsided if we change this before release.
 
 I'm inclined to think that since we expect json/jsonb usage to really
 take off with 9.4, it might be better if we get row_to_json behaving
 unsurprisingly now.  But I'm not dead set on it.

I am not one of those people who would be blindsided but I'd much rather
inconvenience our beta testers in order to get a better product in place for
the masses.

Beta testers read release notes and should be able to identify and fix any
problematic areas of their code while simultaneously being happy for the
improvements.

David J.








Re: [HACKERS] row_to_json bug with index only scans: empty keys!

2014-11-10 Thread Ross Reedstrom
Yes, it's late beta, but especially if we're pushing json/jsonb usage
as a major feature of this release, I'd say remove surprising cases
now. It's late beta, but that's what beta is for: the last chance for
bug fixes before we live with it for the support life of the release.

The affected class is people with an app that already uses json (so
9.2 or better), who have run acceptance tests against an early beta of
9.4, and have SQL that returns made-up field names instead of the
appropriate aliases - which they were probably annoyed at when they
wrote the code that consumes that output in the first place. I think an
item in the list of compatibility changes/gotchas should catch anyone
who is that on the ball as to be testing the betas anyway. Anyone
pushing that envelope is going to do the same acceptance testing against
the actual release before rolling to production.

Ross


On Mon, Nov 10, 2014 at 10:11:25AM -0500, Tom Lane wrote:
 I wrote:
  Attached are patches meant for HEAD and 9.2-9.4 respectively.
 
 BTW, has anyone got an opinion about whether to stick the full fix into
 9.4?  The argument for, of course, is that we'd get the full fix out to
 the public a year sooner.  The argument against is that someone might
 have already done compatibility testing of their application against
 9.4 betas, and then get blindsided if we change this before release.
 
 I'm inclined to think that since we expect json/jsonb usage to really
 take off with 9.4, it might be better if we get row_to_json behaving
 unsurprisingly now.  But I'm not dead set on it.
 
   regards, tom lane
 

-- 
Ross Reedstrom, Ph.D. reeds...@rice.edu
Systems Engineer  Admin, Research Scientistphone: 713-348-6166
Connexions  http://cnx.orgfax: 713-348-3665
Rice University MS-375, Houston, TX 77005
GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E  F888 D3AE 810E 88F0 BEDE




Re: [HACKERS] row_to_json bug with index only scans: empty keys!

2014-11-10 Thread Andrew Dunstan


On 11/10/2014 10:19 AM, Robert Haas wrote:

On Mon, Nov 10, 2014 at 10:11 AM, Tom Lane t...@sss.pgh.pa.us wrote:

I wrote:

Attached are patches meant for HEAD and 9.2-9.4 respectively.

BTW, has anyone got an opinion about whether to stick the full fix into
9.4?  The argument for, of course, is that we'd get the full fix out to
the public a year sooner.  The argument against is that someone might
have already done compatibility testing of their application against
9.4 betas, and then get blindsided if we change this before release.

I'm inclined to think that since we expect json/jsonb usage to really
take off with 9.4, it might be better if we get row_to_json behaving
unsurprisingly now.  But I'm not dead set on it.

I think we should put the full fix in 9.4.



+1

cheers

andrew




Re: [HACKERS] Compiler warning in master branch

2014-11-10 Thread Kevin Grittner
David Rowley dgrowle...@gmail.com wrote: On Mon, Nov 10, 2014 at 4:31 PM, 
Peter Geoghegan p...@heroku.com wrote:


 I see this when I build at -O2 from master's tip:

 brin_revmap.c: In function ‘brinRevmapExtend’:
 brin_revmap.c:113:14: error: variable ‘mapBlk’ set but not used
 [-Werror=unused-but-set-variable]
  BlockNumber mapBlk;
  ^

 It would appear just to need the attached.

Confirmed and pushed.

Thanks to both of you!

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-10 Thread Robert Haas
On Wed, Oct 15, 2014 at 2:55 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Something usable, with severe restrictions, is actually better than we
 have now. I understand the journey this work represents, so don't be
 embarrassed by submitting things with heuristics and good-enoughs in
 it. Our mentor, Mr.Lane, achieved much by spreading work over many
 releases, leaving others to join in the task.

It occurs to me that, now that the custom-scan stuff is committed, it
wouldn't be that hard to use that, plus the other infrastructure we
already have, to write a prototype of parallel sequential scan.  Given
where we are with the infrastructure, there would be a number of
unhandled problems, such as deadlock detection (needs group locking or
similar), assessment of quals as to parallel-safety (needs
proisparallel or similar), general waterproofing to make sure that
pushing down a qual we shouldn't doesn't do anything really dastardly
like crash the server (another written but yet-to-be-published patch
adds a bunch of relevant guards), and snapshot sharing (likewise).
But if you don't do anything weird, it should basically work.

I think this would be useful for a couple of reasons.  First, it would
be a demonstrable show of progress, illustrating how close we are to
actually having something you can really deploy.  Second, we could use
it to demonstrate how the remaining infrastructure patches close up
gaps in the initial prototype.  Third, it would let us start doing
real performance testing.  It seems pretty clear that a parallel
sequential scan of data that's in memory (whether the page cache or
the OS cache) can be accelerated by having multiple processes scan it
in parallel.  But it's much less clear what will happen when the data
is being read in from disk.  Does parallelism help at all?  What
degree of parallelism helps?  Do we break OS readahead so badly that
performance actually regresses?  These are things that are likely to
need a fair amount of tuning before this is ready for prime time, so
being able to start experimenting with them in advance of all of the
infrastructure being completely ready seems like it might help.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Teaching pg_dump to use NOT VALID constraints

2014-11-10 Thread Simon Riggs
Magnus and I discussed the need for pg_dump to offer the use of NOT
VALID constraints.
Here's the patch.


 pg_dump --no-revalidation

will add NOT VALID onto the recreation SQL for any FKs, but only for
ones that were already known to be valid.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


pg_dump_revalidation.v1.patch
Description: Binary data



Re: [HACKERS] SSL information view

2014-11-10 Thread Magnus Hagander
On Mon, Jul 21, 2014 at 5:24 PM, Bernd Helmle maili...@oopsware.de wrote:


 --On 12. Juli 2014 15:08:01 +0200 Magnus Hagander mag...@hagander.net
 wrote:

 Before doing that, however, I'd like to ask for opinions :) The hack
 currently exposes a separate view that you can join to
 pg_stat_activity (or pg_stat_replication) on the pid -- this is sort
 of the same way that pg_stat_replication works in the first place. Do
 we want something similar to that for a builtin SSL view as well, or
 do we want to include the fields directly in pg_stat_activity and
 pg_stat_replication?


 I've heard more than once the wish to get this information without
 contrib..especially for the SSL version used (client and server likewise).
 So ++1 for this feature.

 I'd vote for a special view, that will keep the information into a single
 place and someone can easily join extra information together.

Here's a patch that implements that.

Docs are currently not included because I'm waiting for the
restructuring of that section to be done (started by me in a separate
thread) first, but the code is there for review.

Right now it just truncates the DN at NAMEDATALEN - so treating it the
same way as we do with hostnames. My guess is this is not a big problem,
because in the case of long DNs, most of the time the important stuff
is at the beginning anyway... (And it's not like it's actually used
for authentication, in which case it would of course be a problem.)

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
*** a/src/backend/catalog/system_views.sql
--- b/src/backend/catalog/system_views.sql
***
*** 648,653  CREATE VIEW pg_stat_replication AS
--- 648,664 
  WHERE S.usesysid = U.oid AND
  S.pid = W.pid;
  
+ CREATE VIEW pg_stat_ssl AS
+ SELECT
+ I.pid,
+ I.ssl,
+ I.bits,
+ I.compression,
+ I.version,
+ I.cipher,
+ I.clientdn
+ FROM pg_stat_get_sslstatus() AS I;
+ 
  CREATE VIEW pg_replication_slots AS
  SELECT
  L.slot_name,
*** a/src/backend/libpq/be-secure-openssl.c
--- b/src/backend/libpq/be-secure-openssl.c
***
*** 88,93  static void info_cb(const SSL *ssl, int type, int args);
--- 88,95 
  static void initialize_ecdh(void);
  static const char *SSLerrmessage(void);
  
+ static char *X509_NAME_to_cstring(X509_NAME *name);
+ 
  /* are we in the middle of a renegotiation? */
  static bool in_ssl_renegotiation = false;
  
***
*** 1053,1055  SSLerrmessage(void)
--- 1055,1159 
  	snprintf(errbuf, sizeof(errbuf), _(SSL error code %lu), errcode);
  	return errbuf;
  }
+ 
+ /*
+  * Return information about the SSL connection
+  */
+ int
+ be_tls_get_cipher_bits(Port *port)
+ {
+ 	int bits;
+ 
+ 	if (port-ssl)
+ 	{
+ 		SSL_get_cipher_bits(port-ssl, bits);
+ 		return bits;
+ 	}
+ 	else
+ 		return 0;
+ }
+ 
+ bool
+ be_tls_get_compression(Port *port)
+ {
+ 	if (port-ssl)
+ 		return (SSL_get_current_compression(port-ssl) != NULL);
+ 	else
+ 		return false;
+ }
+ 
+ void
+ be_tls_get_version(Port *port, char *ptr, size_t len)
+ {
+ 	if (port->ssl)
+ 		strlcpy(ptr, SSL_get_version(port->ssl), len);
+ 	else
+ 		ptr[0] = '\0';
+ }
+ 
+ void
+ be_tls_get_cipher(Port *port, char *ptr, size_t len)
+ {
+ 	if (port->ssl)
+ 		strlcpy(ptr, SSL_get_cipher(port->ssl), NAMEDATALEN);
+ 	else
+ 		ptr[0] = '\0';
+ }
+ 
+ void
+ be_tls_get_peerdn_name(Port *port, char *ptr, size_t len)
+ {
+ 	if (port->peer)
+ 		strlcpy(ptr, X509_NAME_to_cstring(X509_get_subject_name(port->peer)), NAMEDATALEN);
+ 	else
+ 		ptr[0] = '\0';
+ }
+ 
+ /*
+  * Convert an X509 subject name to a cstring.
+  *
+  */
+ static char *
+ X509_NAME_to_cstring(X509_NAME *name)
+ {
+ 	BIO		   *membuf = BIO_new(BIO_s_mem());
+ 	int			i,
+ nid,
+ count = X509_NAME_entry_count(name);
+ 	X509_NAME_ENTRY *e;
+ 	ASN1_STRING *v;
+ 	const char *field_name;
+ 	size_t		size;
+ 	char		nullterm;
+ 	char	   *sp;
+ 	char	   *dp;
+ 	char	   *result;
+ 
+ 	(void) BIO_set_close(membuf, BIO_CLOSE);
+ 	for (i = 0; i < count; i++)
+ 	{
+ 		e = X509_NAME_get_entry(name, i);
+ 		nid = OBJ_obj2nid(X509_NAME_ENTRY_get_object(e));
+ 		v = X509_NAME_ENTRY_get_data(e);
+ 		field_name = OBJ_nid2sn(nid);
+ 		if (!field_name)
+ 			field_name = OBJ_nid2ln(nid);
+ 		BIO_printf(membuf, "/%s=", field_name);
+ 		ASN1_STRING_print_ex(membuf, v,
+ 			 ((ASN1_STRFLGS_RFC2253 & ~ASN1_STRFLGS_ESC_MSB)
+ 			  | ASN1_STRFLGS_UTF8_CONVERT));
+ 	}
+ 
+ 	/* ensure null termination of the BIO's content */
+ 	nullterm = '\0';
+ 	BIO_write(membuf, &nullterm, 1);
+ 	size = BIO_get_mem_data(membuf, &sp);
+ 	dp = pg_any_to_server(sp, size - 1, PG_UTF8);
+ 
+ 	result = pstrdup(dp);
+ 	if (dp != sp)
+ 		pfree(dp);
+ 	BIO_free(membuf);
+ 
+ 	return result;
+ }
*** a/src/backend/postmaster/pgstat.c
--- b/src/backend/postmaster/pgstat.c
***
*** 2386,2391  pgstat_fetch_global(void)
--- 2386,2394 
  

Re: [HACKERS] tracking commit timestamps

2014-11-10 Thread Steve Singer

On 11/10/2014 08:39 AM, Petr Jelinek wrote:

On 09/11/14 17:57, Steve Singer wrote:

On 11/07/2014 07:07 PM, Petr Jelinek wrote:

The list of what is useful might be long, but we can't have everything
there as there are space constraints, and LSN is another 8 bytes and I
still want to have some bytes for storing the origin or whatever you
want to call it there, as that's the one I personally have biggest
use-case for.
So this would be ~24bytes per txid already, hmm I wonder if we can
pull some tricks to lower that a bit.



The reason why Jim and myself are asking for the LSN and not just the
timestamp is that I want to be able to order the transactions. Jim
pointed out earlier in the thread that just ordering on timestamp allows
for multiple transactions with the same timestamp.

Maybe we don't need the entire LSN to solve that.  If you already have
the commit timestamp maybe you only need another byte or two of
granularity to order transactions that are within the same microsecond.



Hmm maybe just one part of LSN, but I don't really like that either, 
if we want to store LSN we should probably store it as is as somebody 
might want to map it to txid for other reasons.


I did the calculation above wrong btw, it's actually 20 bytes not 24 
bytes per record, I am inclined to just say we can live with that.


Since we agreed that the (B) case is not really feasible and we are 
doing the (C), I also wonder if extradata should be renamed to nodeid 
(even if it's not used at this point as nodeid). And then there is 
question about the size of it, since the nodeid itself can live with 2 
bytes probably (64k of nodes ought to be enough for everybody ;) ).
Or leave the extradata as is but use as reserved space for future use 
and not expose it at this time on SQL level at all?


I guess Andres could answer what suits him better here.



I am happy with renaming extradata to nodeid and not exposing it at this 
time.


If we feel that commit-order (ie LSN or something equivalent) is really 
a different patch/feature than commit-timestamp then I am okay with that 
also but we should make sure to warn users of the commit-timestamp in 
the documentation that two transactions might have the same timestamp 
and that the commit order might not be the same as ordering by the 
commit timestamp.
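
To make that documentation point concrete, ordering by the timestamp alone
can yield ties; a sketch (assuming a lookup function along the lines of
pg_xact_commit_timestamp(); the exact SQL interface is still being discussed
in this thread, and the table here is hypothetical):

    SELECT xmin, pg_xact_commit_timestamp(xmin) AS committed_at
      FROM some_table
     ORDER BY committed_at;  -- two rows may tie on committed_at even though
                             -- their transactions committed in a definite order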






--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

2014-11-10 Thread Jim Nasby

On 11/7/14, 8:21 PM, Robert Haas wrote:

On Thu, Nov 6, 2014 at 8:03 PM, Jim Nasby jim.na...@bluetreble.com wrote:

The problem right now is there's no way to actually obtain evidence that
this is (or isn't) something to worry about, because we just silently skip
pages. If we had any kind of tracking on this we could stop guessing. :(


I could see logging it, but I agree with Andres and Alvaro that the
odds are strongly against there being any actual problem here.



I'm fine with that. Any other objections? Andres?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-11-10 Thread Robert Haas
So, summarizing the state of this patch set:

- Patches 1, 2, 3, and 5 are committed.

- Patch 4 has had some review and really, as far as I can tell, the
only really major issue that has cropped up is the possibility that
the GUC settings that are legal in backend A aren't legal in backend B
for some reason; in that case, restoring the GUC settings there will
probably fail.  While this is a legitimate concern, I think what it
really boils down to is that there's a possibility that there are
libraries loaded in the user backend that are not loaded in the
postmaster and therefore won't get loaded into the background worker.
That's an issue to which we may need to engineer some solution, but
not in this patch.  Overall, the patch's architecture is modeled
closely on the way we synchronize GUCs to new backends when using
EXEC_BACKEND, and I don't think we're going to do any better than starting
with that and working to file down the rough edges as we go.  So I'd
like to go ahead and commit this.

- Patch 6, pg_background itself, has a number of outstanding issues.
Some people aren't sure it's useful enough to really be worth
bothering with; other people seem quite excited about it, even to the
point of wanting to push it all the way into core.  Whatever we
ultimately decide to do there, the patch as it stands today is clearly
not ready for commit.  The issues I know of are:

-- I still haven't written documentation for it.
-- There's some code duplication with exec_simple_query() that doesn't
feel great, but it's not clear that there's a superior alternative.  I
think we certainly don't want to complicate exec_simple_query() to
make pg_background happy.
-- We should add REVOKE to the SQL script that installs it so that
non-superusers can't use it unless the superuser decides to grant them
rights.
-- It doesn't handle types without send/receive functions.  I thought
that was tolerable since the types without such functions seem like
mostly edge cases, but Andres doesn't like it.  Apparently, BDR
makes provisions to either ship the tuple as one big blob - if all
built-in types - or else use binary format where supported and text
format otherwise.  Since this issue is common to BDR, parallelism, and
pg_background, we might want to introduce some common infrastructure
for it, but it's not too clear to me right now what that should look
like.
-- Evil things can happen if the background process changes client_encoding.
-- Even if I fix all that stuff, it's not entirely clear that there's
a consensus in favor of the patch at all.

I certainly think that as far as this CommitFest is concerned, patch 6
ought to be considered Returned with Feedback.  Many thanks to Andres
and all for reviewing.  What I am less certain about is whether it
makes sense to spend the energy fixing the above issues in the
short-term.  I wrote pg_background mostly as a demonstration that the
rest of these patches work and can be used to accomplish useful stuff,
whether as part of parallelism or otherwise; and there's a lot of work
left to be done on parallelism proper that won't necessarily benefit
from polishing pg_background first.

So, questions:

1. Any other opinions for or against pg_background as a concept?  I
thought the ability to kick off vacuums (or other stuff with
PreventTransactionChain) asynchronously was pretty cool, as was the
ability to use it as an authentication-free loopback dblink (a usage
sketch follows below).  But opinions obviously vary.

2. Is anyone sufficiently interested in pg_background as a concept
that they'd be willing to take over the patch and work on the TODO
list mentioned above?
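
For question 1, the usage sketch mentioned above: the function names follow
the pg_background patch as posted, and the exact signatures should be
treated as approximate.

    -- returns the background worker's PID immediately instead of blocking:
    SELECT pg_background_launch('VACUUM VERBOSE pgbench_accounts');
    -- later, collect whatever output or rows the command produced:
    SELECT * FROM pg_background_result(<pid>) AS (result text);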

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] remove pg_standby?

2014-11-10 Thread Fujii Masao
On Wed, Nov 5, 2014 at 6:36 AM, Peter Eisentraut pete...@gmx.net wrote:
 While we're talking about removing old things, is there any use left for
 pg_standby?

-1 for removing it. There is still the case where I'd like to use log-shipping
rather than replication. For example, it's the case where I need to
compress WAL files before streaming them via very thin network.
We can set up log-shipping using standby_mode and without using
pg_standby, but it keeps emitting the failure of restore_command
while there is no WAL activity, which is bothersome. So I still need
pg_standby for log-shipping.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] remove pg_standby?

2014-11-10 Thread Magnus Hagander
On Nov 10, 2014 6:16 PM, Fujii Masao masao.fu...@gmail.com wrote:

 On Wed, Nov 5, 2014 at 6:36 AM, Peter Eisentraut pete...@gmx.net wrote:
  While we're talking about removing old things, is there any use left for
  pg_standby?

 -1 for removing it. There is still the case where I'd like to use
log-shipping
 rather than replication. For example, it's the case where I need to
 compress WAL files before streaming them via very thin network.
 We can set up log-shipping using standby_mode and without using
 pg_standby, but it keeps emitting the failure of restore_command
 while there is no WAL activity, which is bothersome. So I still need
 pg_standby for log-shipping.

I didn't realize that part, but maybe we should fix that instead of keeping
pg_standby around?

(BTW, you can use streaming with compression as well using ssl of course,
but it won't get quite the same levels due to smaller block sizes. And
there are certainly still reasons for file based standbys so we should
definitely not remove that)

/Magnus


Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

2014-11-10 Thread Alvaro Herrera
Jim Nasby wrote:
 On 11/7/14, 8:21 PM, Robert Haas wrote:
 On Thu, Nov 6, 2014 at 8:03 PM, Jim Nasby jim.na...@bluetreble.com wrote:
 The problem right now is there's no way to actually obtain evidence that
 this is (or isn't) something to worry about, because we just silently skip
 pages. If we had any kind of tracking on this we could stop guessing. :(
 
 I could see logging it, but I agree with Andres and Alvaro that the
 odds are strongly against there being any actual problem here.
 
 I'm fine with that. Any other objections? Andres?

If what we want is to quantify the extent of the issue, would it be more
convenient to save counters to pgstat?  Vacuum already sends pgstat
messages, so there's no additional traffic there.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index

2014-11-10 Thread Fujii Masao
On Mon, Nov 10, 2014 at 4:15 PM, Etsuro Fujita
fujita.ets...@lab.ntt.co.jp wrote:
 (2014/11/06 23:38), Fujii Masao wrote:

 On Tue, Nov 4, 2014 at 12:04 PM, Etsuro Fujita
 fujita.ets...@lab.ntt.co.jp wrote:

 IIUC, I think that min = 0 disables fast update, so ISTM that it'd be
 appropriate to set min to some positive value.  And ISTM that the idea of
 using the min value of work_mem is not so bad.


 OK. I changed the min value to 64kB.

 *** 356,361  CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ replaceable
 class=parametername/
 --- 356,372 
   </listitem>
   </varlistentry>
   </variablelist>
 +<variablelist>
 +<varlistentry>
 + <term><literal>PENDING_LIST_CLEANUP_SIZE</></term>

 The above is still in uppercase.


 Fixed.

 Attached is the updated version of the patch. Thanks for the review!


 Thanks for updating the patch!

 The patch looks good to me except for the following point:

Thanks for the review again!


 *** a/src/backend/access/gin/ginfast.c
 --- b/src/backend/access/gin/ginfast.c
 ***
 *** 25,30 
 --- 25,32 
   #include "utils/memutils.h"
   #include "utils/rel.h"

 + /* GUC parameter */
 + int   pending_list_cleanup_size = 0;

 I think we need to initialize the GUC to boot_val, 4096 in this case.

No, IIUC basically the variable for GUC doesn't need to be initialized
to its default value. OTOH, it's also harmless to initialize it to the default.
I like the current code a bit because we don't need to change the initial
value again when we decide to change the default value of GUC.
I have no strong opinion about this, though.
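
For context, a usage sketch of the parameter being discussed, both as the
GUC and as the per-index storage parameter (names and units as in the patch;
the relation and column here are made up, and the committed spelling may
still change):

    SET pending_list_cleanup_size = '4MB';         -- GUC, default 4096 kB
    CREATE INDEX docs_body_gin ON docs USING gin (body_tsv)
        WITH (pending_list_cleanup_size = 1024);   -- per-index override, in kB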

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Teaching pg_dump to use NOT VALID constraints

2014-11-10 Thread Alvaro Herrera
Simon Riggs wrote:
 Magnus and I discussed the need for pg_dump to offer the use of NOT
 VALID constraints.
 Here's the patch.
 
 
  pg_dump --no-revalidation
 
 will add NOT VALID onto the recreation SQL for any FKs, but only for
 ones that were already known to be valid.

Well.  Constraints that haven't been validated already have a NOT VALID
emitted by ruleutils.c, yes?  So what this patch does is add such a
clause for all *other* constraints.  Right?  In other words what it aims
to do is speed up loading of data by skipping the validation step on
restore.  Is that right?

ISTM we could have the default pg_dump behavior emit NOT VALID
constraints, and add VALIDATE CONSTRAINT commands at the end; that way
the database is usable sooner but the constraints end up marked as
validated by the time the dump is finished.
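
That is, in addition to the NOT VALID form of the ADD CONSTRAINT, the dump
would finish with something like (object names hypothetical):

    -- emitted at the very end of the dump, once all data and FKs are in place:
    ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_id_fkey;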

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] remove pg_standby?

2014-11-10 Thread Fujii Masao
On Tue, Nov 11, 2014 at 2:19 AM, Magnus Hagander mag...@hagander.net wrote:

 On Nov 10, 2014 6:16 PM, Fujii Masao masao.fu...@gmail.com wrote:

 On Wed, Nov 5, 2014 at 6:36 AM, Peter Eisentraut pete...@gmx.net wrote:
  While we're talking about removing old things, is there any use left for
  pg_standby?

 -1 for removing it. There is still the case where I'd like to use
 log-shipping
 rather than replication. For example, it's the case where I need to
 compress WAL files before streaming them via very thin network.
 We can set up log-shipping using standby_mode and without using
 pg_standby, but it keeps emitting the failure of restore_command
 while there is no WAL activity, which is bothersome. So I still need
 pg_standby for log-shipping.

 I didn't realize that part,

The log-shipping standby using standby_mode tries to execute the
restore_command to restore a new WAL file, but if there is no new WAL file
it fails and a message reporting the failure is logged. Then
the standby tries to execute the same restore_command after five
seconds. But it fails and the message is logged again. This happens
continuously while no new WAL file is appearing in the standby's
archival area.

 but maybe we should fix that instead of keeping
 pg_standby around?

Yes. Even if we do that, we should announce pg_standby will be
removed and wait for several releases before removing it?

 (BTW, you can use streaming with compression as well using ssl of course,
 but it won't get quite the same levels due to smaller block sizes. And there
 are certainly still reasons for file based standbys so we should definitely
 not remove that)

Yes.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] remove pg_standby?

2014-11-10 Thread Joshua D. Drake


On 11/04/2014 01:36 PM, Peter Eisentraut wrote:


While we're talking about removing old things, is there any use left for
pg_standby?


-1.

A lot of people, a lot of customers use log shipping for various 
creative and business requirement setups.


JD


--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, @cmdpromptinc
If we send our children to Caesar for their education, we should
 not be surprised when they come back as Romans.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Teaching pg_dump to use NOT VALID constraints

2014-11-10 Thread Simon Riggs
On 10 November 2014 17:33, Alvaro Herrera alvhe...@2ndquadrant.com wrote:

   pg_dump --no-revalidation

 will add NOT VALID onto the recreation SQL for any FKs, but only for
 ones that were already known to be valid.

 Well.  Constraints that haven't been validated already have a NOT VALID
 emitted by ruleutils.c, yes?  So what this patch does is add such a
 clause for all *other* constraints.  Right?  In other words what it aims
 to do is speed up loading of data by skipping the validation step on
 restore.  Is that right?

Correct. CHECK constraints are added onto the main table so they validate at load.

 ISTM we could have the default pg_dump behavior emit NOT VALID
 constraints, and add VALIDATE CONSTRAINT commands at the end; that way
 the database is usable sooner but the constraints end up marked as
 validated by the time the dump is finished.

Yes, that may be an even better idea. We'd still want the --no-revalidation
option, AFAICS.

FKs are already at the end. Perhaps we should add another
validation section?

I like the idea, just not sure how long it would take.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

2014-11-10 Thread Jim Nasby

On 11/10/14, 11:28 AM, Alvaro Herrera wrote:

Jim Nasby wrote:

On 11/7/14, 8:21 PM, Robert Haas wrote:

On Thu, Nov 6, 2014 at 8:03 PM, Jim Nasby jim.na...@bluetreble.com wrote:

The problem right now is there's no way to actually obtain evidence that
this is (or isn't) something to worry about, because we just silently skip
pages. If we had any kind of tracking on this we could stop guessing. :(


I could see logging it, but I agree with Andres and Alvaro that the
odds are strongly against there being any actual problem here.


I'm fine with that. Any other objections? Andres?


If what we want is to quantify the extent of the issue, would it be more
convenient to save counters to pgstat?  Vacuum already sends pgstat
messages, so there's no additional traffic there.


IMHO that would be ideal, but I think Tom was leery of using more space for 
every table. If we go this route, I'm guessing we should only log pages we 
skip, and not log pages we had to wait for the lock on (in the case of a 
freeze). Also, should we still ereport this even if we are putting it in stats?

Is there a way to avoid duplicating the entire ereport call? I see I could call 
errstart & friends manually, but currently that's only done in elog.c.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow signal handlers to optionally use SA_SIGINFO information?

2014-11-10 Thread Andres Freund
On 2014-11-10 09:50:17 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  Posix provides information about the source of the signal when using
  SA_SIGINFO style handlers via si_code/si_pid. That information has been
  available for a *long* while
  (c.f. http://pubs.opengroup.org/onlinepubs/7908799/xsh/signal.h.html).
 
 That does not mean that it exists on all the platforms we support;
 Windows in particular is likely to be an issue.

Given that we essentially use a named pipe to transport signals, it
shouldn't be too hard if we decide we need it.

 I concur with Heikki that you're probably solving this at the wrong
 end anyway, since the next question you'll be asking is *why* did
 process X send that signal.

I agree that that's the most useful thing for any specific situation
fully under our control. But generally I'd have more than once found it
useful to have the sender of a signal available - even if it's just
during debugging.  I also previously would have liked to know where
INT/TERM are coming from - having the sender's pid can be useful to
diagnose.

If we add it, it's surely nothing we ever are going to rely on - as you
say, there are portability issues. But printing it optionally in a few
places at appropriate debug levels + having it available in signal
handlers during debugging sounds useful enough anyway to me.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

2014-11-10 Thread Andres Freund
On 2014-11-10 14:28:30 -0300, Alvaro Herrera wrote:
 Jim Nasby wrote:
  On 11/7/14, 8:21 PM, Robert Haas wrote:
  On Thu, Nov 6, 2014 at 8:03 PM, Jim Nasby jim.na...@bluetreble.com wrote:
  The problem right now is there's no way to actually obtain evidence that
  this is (or isn't) something to worry about, because we just silently skip
  pages. If we had any kind of tracking on this we could stop guessing. :(
  
  I could see logging it, but I agree with Andres and Alvaro that the
  odds are strongly against there being any actual problem here.
  
  I'm fine with that. Any other objections? Andres?

If you feel that strongly about the need for logging, go ahead.

 If what we want is to quantify the extent of the issue, would it be more
 convenient to save counters to pgstat?  Vacuum already sends pgstat
 messages, so there's no additional traffic there.

I'm pretty strongly against that one in isolation. They'd need to be
stored somewhere and they'd need to be queryable somewhere with enough
context to make sense.  To actually make sense of the numbers we'd also
need to report all the other datapoints of vacuum in some form. That's
quite a worthwhile project imo - but *much* *much* more work than this.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] REINDEX CONCURRENTLY 2.0

2014-11-10 Thread Jeff Janes
On Wed, Nov 5, 2014 at 8:49 PM, Michael Paquier michael.paqu...@gmail.com
wrote:

 On Thu, Oct 30, 2014 at 5:19 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
  Updated patch is attached.
 Please find attached an updated patch with the following things changed:
 - Addition of tab completion in psql for all new commands
 - Addition of a call to WaitForLockers in index_concurrent_swap to
 ensure that there are no running transactions on the parent table
 running before exclusive locks are taken on the index and its
 concurrent entry. Previous patch versions created deadlocks because of
 that, issue spotted by the isolation tests integrated in the patch.
 - Isolation tests for reindex concurrently are re-enabled by default.
 Regards,


It looks like this needs another rebase, I get failures
on index.c, toasting.c, indexcmds.c, and index.h

Thanks,

Jeff


Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-11-10 Thread Andres Freund
On 2014-11-10 12:10:49 -0500, Robert Haas wrote:
 - Patch 4 has had some review and really, as far as I can tell, the
 only really major issue that has cropped up is the possibility that
 the GUC settings that are legal in backend A aren't legal in backend B
 for some reason; in that case, restoring the GUC settings there will
 probably fail.  While this is a legitimate concern, I think what it
 really boils down to is that there's a possibility that there are
 libraries loaded in the user backend that are not loaded in the
 postmaster and therefore won't get loaded into the background worker.

I'm not too concerned about that one.

 That's an issue to which we may need to engineer some solution, but
 not in this patch.  Overall, the patch's architecture is modeled
 closely on the way we synchronize GUCs to new backends when using
 EXEC_BACKEND, and I don't think we're going to do any better than starting
 with that and working to file down the rough edges as we go.  So I'd
 like to go ahead and commit this.

Did you check out whether PGC_BACKEND GUCs work? Other than that I agree
that we can just solve further issues as we notice them. This isn't
particularly intrusive.

 -- There's some code duplication with exec_simple_query() that doesn't
 feel great, but it's not clear that there's a superior alternative.  I
 think we certainly don't want to complicate exec_simple_query() to
 make pg_background happy.

Yea, I think we can live with it if really necessary.

 -- We should add REVOKE to the SQL script that installs it so that
 non-superusers can't use it unless the superuser decides to grant them
 rights.

Right. That seems trivial enough.

 -- It doesn't handle types without send/receive functions.  I thought
  that was tolerable since the types without such functions seem like
  mostly edge cases, but Andres doesn't like it.  Apparently, BDR
  makes provisions to either ship the tuple as one big blob - if all
 built-in types - or else use binary format where supported and text
 format otherwise.  Since this issue is common to BDR, parallelism, and
 pg_background, we might want to introduce some common infrastructure
 for it, but it's not too clear to me right now what that should look
 like.

I think we definitely need to solve this - but I'm also not at all that
sure how.

For something like pg_background it's pretty trivial to fall back to
in/out when there's no send/recv. It's just a couple of lines, and the
portal code has the necessary provisions. So solving it for
pg_background itself is pretty trivial.  But, as you say, pg_background
itself isn't a primary goal (although a nice demonstration, with some
real world applications).  There's enough differences between the
parallelism and the replication cases that I'm not entirely sure how
much common ground there is. Namely replication has the additional
concerns of version differences, trust (don't use blobs if you don't
fully trust the other side), and differences in the allocated oids
(especially relevant for arrays which, very annoyingly, embed the oid in
the send/recv format).

 1. Any other opinions for or against pg_background as a concept?  I
  thought the ability to kick off vacuums (or other stuff with
  PreventTransactionChain) asynchronously was pretty cool, as was the
 ability to use it as an authentication-free loopback dblink.  But
 opinions obviously vary.

I think it's a cool concept, but I'm not sure if it's worth the work to
make it fully usable. Or rather, I think it's worthy enough, but I
personally would prioritize other stuff.

 2. Is anyone sufficiently interested in pg_background as a concept
 that they'd be willing to take over the patch and work on the TODO
 list mentioned above?

I personally won't. If we can come up with a sketch of how to deal with
the data transport encoding issue above, I'd be willing to to work on
that specific part. But not pg_background in itself.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

2014-11-10 Thread Alvaro Herrera
Andres Freund wrote:
 On 2014-11-10 14:28:30 -0300, Alvaro Herrera wrote:

  If what we want is to quantify the extent of the issue, would it be more
  convenient to save counters to pgstat?  Vacuum already sends pgstat
  messages, so there's no additional traffic there.
 
 I'm pretty strongly against that one in isolation. They'd need to be
 stored somewhere and they'd need to be queryable somewhere with enough
 context to make sense.  To actually make sense of the numbers we'd also
 need to report all the other datapoints of vacuum in some form. That's
  quite a worthwhile project imo - but *much* *much* more work than this.

We already have last_autovacuum columns and such in pg_stat_tables et
al, which only record the last value.  My thinking regarding such
numbers is that you would save histories and put them in a chart, see
how they evolve with time.  I doubt the individual numbers are worth
much, but the trends might show something interesting.  As far as I
know, this is already true for most other pgstat values, with exception
of things such as live_tuples which are absolute numbers rather than
running counters.
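
Concretely, the per-table vacuum counters we already keep can be pulled out
like this (all of these are existing pg_stat_user_tables columns; the query
is only an illustration of the kind of trend data I mean):

    SELECT relname, last_vacuum, last_autovacuum, vacuum_count, autovacuum_count
      FROM pg_stat_user_tables
     ORDER BY autovacuum_count DESC;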

I agree having more vacuuming data in general is a worthwhile project,
much larger than this one.  Wasn't Greg Smith working on that?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-11-10 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
  1. Any other opinions for or against pg_background as a concept?  I
   thought the ability to kick off vacuums (or other stuff with
   PreventTransactionChain) asynchronously was pretty cool, as was the
  ability to use it as an authentication-free loopback dblink.  But
  opinions obviously vary.
 
 I think it's a cool concept, but I'm not sure if it's worth the work to
 make it fully usable. Or rather, I think it's worthy enough, but I
 personally would prioritize other stuff.

I've not read through the whole thread/discussion, but I'd put myself in
more-or-less the same boat at this point.  I've got a number of other
things on my plate already that need to get done (more RLS cleanup /
consistency, back-patching the ereport() column-privs issue, reviewing
pgAudit, the less-than-superuser privileges work, actually helping out
with the in-progress commitfest..) and so I really doubt I'd be able to
seriously help with pg_background- at least for 9.5, which is coming up
awful fast at this point, if we're going to stick with the 'normal'
schedule and freeze in the spring.

That said, I love the concept and had really been hoping to see it in
9.5, and maybe some at or cron-like ability happening later (yes, I
absolutely feel we need this, though I know others have different
opinions..).

  2. Is anyone sufficiently interested in pg_background as a concept
  that they'd be willing to take over the patch and work on the TODO
  list mentioned above?
 
 I personally won't. If we can come up with a sketch of how to deal with
 the data transport encoding issue above, I'd be willing to to work on
 that specific part. But not pg_background in itself.

If other things get done or additional resources show up, I'd be
interested in helping, but I don't see either happening in time for 9.5.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

2014-11-10 Thread Jim Nasby

On 11/10/14, 12:15 PM, Andres Freund wrote:

If what we want is to quantify the extent of the issue, would it be more
convenient to save counters to pgstat?  Vacuum already sends pgstat
messages, so there's no additional traffic there.

I'm pretty strongly against that one in isolation. They'd need to be
stored somewhere and they'd need to be queryable somewhere with enough
context to make sense.  To actually make sense of the numbers we'd also
need to report all the other datapoints of vacuum in some form. That's
quite a worthwhile project imo - but *much* *much* more work than this.


We already report statistics on vacuums 
(lazy_vacuum_rel()/pgstat_report_vacuum), so this would just be adding 1 or 2 
counters to that. Should we add the other counters from vacuum? That would be 
significantly more data.

Semi-related, we should probably be reporting stats from heap truncation.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-11-10 Thread Alvaro Herrera
Stephen Frost wrote:

 at least for 9.5, which is coming up awful fast at this point, if
 we're going to stick with the 'normal' schedule and freeze in the
 spring.

https://wiki.postgresql.org/wiki/PgCon_2014_Developer_Meeting#9.5_Schedule

Doesn't look all that normal to me, with that commitfest in Feb.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] remove pg_standby?

2014-11-10 Thread Heikki Linnakangas

On 11/10/2014 07:50 PM, Joshua D. Drake wrote:


On 11/04/2014 01:36 PM, Peter Eisentraut wrote:


While we're talking about removing old things, is there any use left for
pg_standby?


-1.

A lot of people, a lot of customers use log shipping for various
creative and business requirement setups.


Yes, but do they use pg_standby to implement it? If they do, why?

pg_standby is more configurable than the built-in standby_mode=on. You 
can set the sleep time, for example, while standby_mode=on uses a 
hard-coded delay of 5 s. And pg_standby has a configurable maximum wait 
time. And as Fujii pointed out, the built-in system will print an 
annoying message to the log every time it attempts to restore a file. 
Nevertheless, 99% of users would probably be happy with the built-in thing.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

2014-11-10 Thread Andres Freund
On 2014-11-10 15:36:55 -0300, Alvaro Herrera wrote:
 Andres Freund wrote:
  On 2014-11-10 14:28:30 -0300, Alvaro Herrera wrote:
 
   If what we want is to quantify the extent of the issue, would it be more
   convenient to save counters to pgstat?  Vacuum already sends pgstat
   messages, so there's no additional traffic there.
  
  I'm pretty strongly against that one in isolation. They'd need to be
  stored somewhere and they'd need to be queryable somewhere with enough
  context to make sense.  To actually make sense of the numbers we'd also
  need to report all the other datapoints of vacuum in some form. That's
  quite a worthwile project imo - but *much* *much* more work than this.
 
 We already have last_autovacuum columns and such in pg_stat_tables et
 al, which only record the last value.  My thinking regarding such
 numbers is that you would save histories and put them in a chart, see
 how they evolve with time.  I doubt the individual numbers are worth
 much, but the trends might show something interesting.

I don't think they mean anything without also reporting the number of
buffers actually scanned and other related stats.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] remove pg_standby?

2014-11-10 Thread Magnus Hagander
On Mon, Nov 10, 2014 at 7:48 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 11/10/2014 07:50 PM, Joshua D. Drake wrote:


 On 11/04/2014 01:36 PM, Peter Eisentraut wrote:


 While we're talking about removing old things, is there any use left for
 pg_standby?


 -1.

 A lot of people, a lot of customers use log shipping for various
 creative and business requirement setups.


 Yes, but do they use pg_standby to implement it? If they do, why?

 pg_standby is more configurable than the built-in standby_mode=on. You can
 set the sleep time, for example, while standby_mode=on uses a hard-coded
 delay of 5 s. And pg_standby has a configurable maximum wait time. And as
 Fujii pointed out, the built-in system will print an annoying message to the
 log every time it attempts to restore a file. Nevertheless, 99% of users
 would probably be happy with the built-in thing.

As long as pg_standby has features that are actually useful and that
are not in the built-in system, we shouldn't remove it. We should,
however, try to fix those in the main system so we can get rid of it
after that :)

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

2014-11-10 Thread Andres Freund
On 2014-11-10 12:37:29 -0600, Jim Nasby wrote:
 On 11/10/14, 12:15 PM, Andres Freund wrote:
 If what we want is to quantify the extent of the issue, would it be more
 convenient to save counters to pgstat?  Vacuum already sends pgstat
 messages, so there's no additional traffic there.
 I'm pretty strongly against that one in isolation. They'd need to be
 stored somewhere and they'd need to be queryable somewhere with enough
 context to make sense.  To actually make sense of the numbers we'd also
 need to report all the other datapoints of vacuum in some form. That's
  quite a worthwhile project imo - but *much* *much* more work than this.
 
 We already report statistics on vacuums
 (lazy_vacuum_rel()/pgstat_report_vacuum), so this would just be adding
 1 or 2 counters to that. Should we add the other counters from vacuum?
 That would be significantly more data.

At the very least it'd require:
* The number of buffers skipped due to the vm
* The number of buffers actually scanned
* The number of full table in contrast to partial vacuums

I think it'd require a fair amount of thinking about which values are
required to make sense of the number of skipped buffers due to not being
able to acquire the cleanup lock.

If you want to do this - and I sure don't want to stop you from it - you
should look at it from a general perspective, not from the perspective
of how skipped cleanup locks are logged.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL format and API changes (9.5)

2014-11-10 Thread Heikki Linnakangas
Attached is a new version. It's rebased on current git master, including 
BRIN. I've also fixed the laundry list of small things you reported, as 
well as a bunch of bugs I uncovered during my own testing.


Alvaro: you still have the BRIN WAL-logging code fresh in your memory, 
so could you take a look at that part of this patch to check that I 
didn't break it? And also, how do you like the new way of writing that, 
over what you had to do in HEAD?


More below.

On 11/09/2014 11:47 PM, Andres Freund wrote:

On 2014-11-06 17:32:33 +0200, Heikki Linnakangas wrote:

Replying to some of your comments below. The rest were trivial issues that
I'll just fix.

On 10/30/2014 09:19 PM, Andres Freund wrote:

* Is it really a good idea to separate XLogRegisterBufData() from
   XLogRegisterBuffer() the way you've done it ? If we ever actually get
   a record with a large number of blocks touched this essentially is
   O(touched_buffers*data_entries).


Are you worried that the linear search in XLogRegisterBufData(), to find the
right registered_buffer struct, might be expensive? If that ever becomes a
problem, a simple fix would be to start the linear search from the end, and
make sure that when you touch a large number of blocks, you do all the
XLogRegisterBufData() calls right after the corresponding
XLogRegisterBuffer() call.


Yes, that was what I was (mildly) worried about. Since you specified a
high limit of buffers I'm sure someone will come up with a use case for
it ;)


I've also thought about having XLogRegisterBuffer() return the pointer to the
struct, and passing it as argument to XLogRegisterBufData. So the pattern in
WAL generating code would be like:

registered_buffer *buf0;

buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD);
XLogRegisterBufData(buf0, data, length);

registered_buffer would be opaque to the callers. That would have potential
to turn XLogRegisterBufData into a macro or inline function, to eliminate
the function call overhead. I played with that a little bit, but the
difference in performance was so small that it didn't seem worth it. But
passing the registered_buffer pointer, like above, might not be a bad thing
anyway.


Yes, that was roughly what I was thinking of as well. It's not all that
pretty, but it generally does seem like a good idea to me anyway.


I ended up doing something different: The block_id is now used as the 
index into the registered_buffers array. So looking up the right struct 
is now just registered_buffers[block_id]. That obviously gets rid of 
the linear search. It doesn't seem to make any difference in the quick 
performance testing I've been doing.



* There's lots of functions like XLogRecHasBlockRef() that dig through
   the wal record. A common pattern is something like:
if (XLogRecHasBlockRef(record, 1))
 XLogRecGetBlockTag(record, 1, NULL, NULL, oldblk);
else
 oldblk = newblk;

   I think doing that repeatedly is quite a bad idea. We should parse the
   record once and then use it in a sensible format. Not do it in pieces,
   over and over again. It's not like we ignore backup blocks - so doing
   this lazily doesn't make sense to me.
   Especially as ValidXLogRecord() *already* has parsed the whole damn
   thing once.


Hmm. Adding some kind of a parsed XLogRecord representation would need a
fair amount of new infrastructure.


True.


Vast majority of WAL records contain one,
maybe two, block references, so it's not that expensive to find the right
one, even if you do it several times.


I'm not convinced. It's not an infrequent thing these days to hear
people being bottlenecked by replay. And grovelling repeatedly through
larger records isn't *that* cheap.


I haven't done anything about this. We might want to do that, but I'd 
like to get this committed now, with all the changes to the AM's, and 
then spend some time trying to further optimize the WAL format. I might 
add the deparsed WAL record infrastructure as part of that 
optimization work.



I ran a quick performance test on WAL replay performance yesterday. I ran
pgbench for 100 transactions with WAL archiving enabled, and measured
the time it took to replay the generated WAL. I did that with and without
the patch, and I didn't see any big difference in replay times. I also ran
perf on the startup process, and the profiles looked identical. I'll do
more comprehensive testing later, with different index types, but I'm
convinced that there is no performance issue in replay that we'd need to
worry about.


Interesting. What checkpoint_segments/timeout and what scale did you
use? Since that heavily influences the average size of the record that's
quite relevant...


Scale factor 5, checkpoint_segments=10, checkpoint_timeout=30s. 
pg_xlogdump --stats says:



Type                           N      (%)  Record size      (%)  FPI size      (%)  Combined size      (%)
----                           -      ---  -----------      ---  --------      ---  -------------      ---

Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-11-10 Thread Robert Haas
On Mon, Nov 10, 2014 at 1:29 PM, Andres Freund and...@2ndquadrant.com wrote:
 That's an issue to which we may need to engineer some solution, but
 not in this patch.  Overall, the patch's architecture is modeled
 closely on the way we synchronize GUCs to new backends when using
 EXEC_BACKEND, and I don't think we're going do any better than stating
 with that and working to file down the rough edges as we go.  So I'd
 like to go ahead and commit this.

 Did you check out whether PGC_BACKEND GUCs work? Other than that I agree
 that we can just solve further issues as we notice them. This isn't
  particularly intrusive.

Is the scenario you're concerned about this one:

1. Start postmaster.
2. Start a user session.
3. Change a PGC_BACKEND GUC in postgresql.conf.
4. Reload.
5. Launch a background worker that uses this code.

It should end up with the same values there as the active session, not
the current one from postgresql.conf.  But we want to verify that's
the behavior we actually get.   Right?

 -- It doesn't handle types without send/receive functions.  I thought
  that was tolerable since the types without such functions seem like
  mostly edge cases, but Andres doesn't like it.  Apparently, BDR
  makes provisions to either ship the tuple as one big blob - if all
 built-in types - or else use binary format where supported and text
 format otherwise.  Since this issue is common to BDR, parallelism, and
 pg_background, we might want to introduce some common infrastructure
 for it, but it's not too clear to me right now what that should look
 like.

 I think we definitely need to solve this - but I'm also not at all that
 sure how.

 For something like pg_background it's pretty trivial to fall back to
 in/out when there's no send/recv. It's just a couple of lines, and the
 portal code has the necessary provisions. So solving it for
 pg_background itself is pretty trivial.

That's not really the interesting part of the problem, I think.  Yeah,
pg_background can be made to speak text format if we really care.  But
the real issue is that even with the binary format, converting a tuple
in on-disk format into a DataRow message so that the other side can
reconstruct the tuple and shove it into an executor slot (or wherever
it wants to shove it) seems like it might be expensive.  You might
have data on that; if it's not expensive, stop here.  If it is
expensive, what we really want to do is figure out some way to make it
safe to copy the tuple into the shared memory queue, or send it out
over the socket, and have the process on the other end use it without
having to revalidate the whole thing column-by-column.

 But, as you say, pg_background
 itself isn't a primary goal (although a nice demonstration, with some
 real world applications).  There's enough differences between the
 parallelism and the replication cases that I'm not entirely sure how
 much common ground there is. Namely replication has the additional
 concerns of version differences, trust (don't use blobs if you don't
 fully trust the other side), and differences in the allocated oids
 (especially relevant for arrays which, very annoyingly, embed the oid in
 the send/recv format).

Yeah.  It would be nice to use the same code for deciding what to do
in a particular case.  It seems like it ought to be possible to have
one rule for whether a tuple with a given set of data types is safe to
ship in the on-disk format.  For pg_background/parallelism, it's
enough if that function returns true; for BDR, there might be
additional criteria applied before deciding to do it that way.  But
they could use the same decision tree for part of it.

What would be even better is to find some way to MAKE IT SAFE to send
the undecoded tuple.  I'm not sure what that would look like.  Locking
the types against concurrent changes?  Marking dropped types as
dropped without actually dropping them, and garbage collecting them
later?  Providing safer ways to deconstruct tuples?  It might help to
start by enumerating what exactly can go wrong here.  As far as I can
see, the main risks for pg_background and parallelism are (1) that the
type might get concurrently dropped, leaving us unable to interpret
the received tuple and (2) that the type might get concurrently
modified in some way that leaves us confused about how to interpret
the tuple, as by having the type size change.  The chances of a
problem in practice seem remote, since you can't do either of those
things if a table column with that type exists, but I can't convince
myself that there's no way for the type to be modified under us in
such a way that two different backends have different ideas about
whether it exists or how wide it is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Access method extendability

2014-11-10 Thread Alexander Korotkov
Hi!

Thanks everybody for the discussion. Sorry for the delay. I'll try to address
most of the questions raised in this discussion.

In general, I agree with the concept of marking an index as invalid in
certain cases.
I see the following possible consequences of a buggy WAL-logged custom AM:
1) Server crash during insert/update.
2) Server crash during select.
3) Invalid query answers.
4) Server crash during vacuum.
5) Server crash in recovery.

#5 is totally unacceptable. I've tried to design the generic WAL record so
that it is always possible to purely mechanically apply the record. It's
always possible to move a piece of memory inside the page. It's always
possible to copy a piece of memory from the WAL record to the page. Buggy WAL
could for sure cause index corruption, as could any other bug in an AM. WAL
support doesn't bring anything special to this except the fact that WAL is
quite hard to debug.

However, in the current implementation I missed some hidden asserts about
page structure. A page could be so corrupted that we can't, for instance,
safely call XLogReadBufferForRedo(). All these cases must be worked out. If
we can't update a page during recovery, the index must be marked as invalid.
It's some amount of work, but it seems very feasible.

#4 seems problematic too. If autovacuum crashes on some index, then postgres
can enter a crash loop. So, if autovacuum crashes on an extension-provided
AM, that index should be marked as invalid.

Consequences #1, #2 and #3 could easily be caused by a buggy opclass as well.
The reason why we're not knee-deep in extension-provided bugs in GiST/GIN
opclasses is not the easiness of writing correct GiST/GIN opclasses. The
actual reason is that we have only a few GiST/GIN opclasses which weren't
written by the GiST/GIN authors.

So, I don't expect PostgreSQL to be flooded with buggy AMs once we add AM
extendability. Our motivation behind this proposal is that we want to be
able to develop custom extensions with AMs. We want to be able to provide
our AMs to our customers without having to push them into PostgreSQL core
or fork PostgreSQL. Bugs in those AMs are our responsibility to our
customers. Some commercial companies could implement patented AMs (for
instance, fractal trees), and those are their responsibility to their
customers.

Also, I think it's OK to put the burden of adapting custom AMs to changes in
the AM interface on the authors of those custom AMs. AM extensions should be
adapted to each new major release anyway. AFAIR, the interface of the
RelationOpen() function was changed not too long ago. A custom AM would use
many functions which we use to access relations, and their interfaces can
change in the next release.

The PostGIS GiST opclass has bugs which are reproducible, known and not
fixed. This is PostGIS's responsibility to their customers. We can feel sorry
for PostGIS that they are so slow in fixing this. But we shouldn't feel sorry
for GiST extendability; that would be ridiculous.

Some researchers could write their own extensions. We can hope that they are
smart enough not to recommend them for production use. We can back that hope
with a warning when installing an extension-provided AM. That warning could
say that any corruption caused by an extension-provided AM is up to the AM
developer. This warning could make users beware of using extension-provided
AMs in production unless they fully trust the extension developer (or have a
support subscription if it's commercial).

PostgreSQL doesn't have any kind of safe extensions. Every extension must
be trusted. Every extension can break (almost) everything. When one types
CREATE EXTENSION, one must completely trust the extension author. This
applies to every extension.

I would be very careful about discouraging commercial AM extensions. We
should always keep in mind how many PostgreSQL developers are working for
companies which own commercial PostgreSQL forks and how big their
contribution is. Patented extensions look scary for sure. But that's a
matter of software patents, not of PostgreSQL extendability...

The particular steps I'm going to take on these patches:
1) Make generic_redo never crash on broken pages.
2) Make the autovacuum launcher mark an index as invalid if a vacuum process
crashed on a custom AM index. Since we can't write anything into the postgres
cluster when one process has crashed, ISTM autovacuum should have some
separate place to put this information. Then, after restart, postgres can
read it and mark the index as invalid.

Not allowing the CREATE ACCESS METHOD command seems problematic to me. How
would that work with pg_upgrade? pg_dump wouldn't dump the extra pg_am
records, so pg_upgrade would break when creating operator classes on the new
cluster. So, I agree with dropping the CREATE ACCESS METHOD command only if
we let pg_dump dump the extra pg_am records...

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] row_to_json bug with index only scans: empty keys!

2014-11-10 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Nov 10, 2014 at 10:11 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 BTW, has anyone got an opinion about whether to stick the full fix into
 9.4?

 I think we should put the full fix in 9.4.

Since nobody spoke against that, I've put the full fix into HEAD and 9.4,
and the restricted version into 9.3 and 9.2.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BRIN indexes - TRAP: BadArgument

2014-11-10 Thread Alvaro Herrera
Fujii Masao wrote:

 I got the following PANIC error in the standby server when I set up
 the replication servers and ran make installcheck. Note that I was
 repeating the manual CHECKPOINT every second while installcheck
 was running. Without the checkpoints, I could not reproduce the
 problem. I'm not sure if CHECKPOINT really triggers this problem, though.
 Anyway BRIN seems to have a problem around its WAL replay.

Hm, I think I see what's happening.  The xl_brin_update record
references two buffers, one which is target for the updated tuple and
another which is the revmap buffer.  When the update target buffer is
being first used we set the INIT bit which removes the buffer reference
from the xlog record; in that case, if the revmap buffer is first being
modified after the prior checkpoint, that revmap buffer receives backup
block number 0; but the code hardcodes it as 1 on the expectation that
the buffer that's target for the update will receive 0.  The attached
patch should fix this.

I cannot reproduce the issue after applying this patch, can you please
confirm that it fixes the issue for you as well?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/access/brin/brin_xlog.c b/src/backend/access/brin/brin_xlog.c
index ebef984..2937068 100644
--- a/src/backend/access/brin/brin_xlog.c
+++ b/src/backend/access/brin/brin_xlog.c
@@ -60,9 +60,11 @@ brin_xlog_insert_update(XLogRecPtr lsn, XLogRecord *record,
 	 */
 	if (record->xl_info & XLOG_BRIN_INIT_PAGE)
 	{
-		XLogReadBufferForRedoExtended(lsn, record, 0,
-									  xlrec->node, MAIN_FORKNUM, blkno,
-									  RBM_ZERO, false, &buffer);
+		/*
+		 * No full-page image here.  Don't try to read it, because there
+		 * might be one for the revmap buffer, below.
+		 */
+		buffer = XLogReadBuffer(xlrec->node, blkno, true);
 		page = BufferGetPage(buffer);
 		brin_page_init(page, BRIN_PAGETYPE_REGULAR);
 		action = BLK_NEEDS_REDO;
@@ -97,7 +99,9 @@ brin_xlog_insert_update(XLogRecPtr lsn, XLogRecord *record,
 		UnlockReleaseBuffer(buffer);
 
 	/* update the revmap */
-	action = XLogReadBufferForRedo(lsn, record, 1, xlrec->node,
+	action = XLogReadBufferForRedo(lsn, record,
+								   record->xl_info & XLOG_BRIN_INIT_PAGE ? 0 : 1,
+								   xlrec->node,
 								   xlrec->revmapBlk, &buffer);
 	if (action == BLK_NEEDS_REDO)
 	{



Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-11-10 Thread Stephen Frost
* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
 Stephen Frost wrote:
  at least for 9.5, which is coming up awful fast at this point, if
  we're going to stick with the 'normal' schedule and freeze in the
  spring.
 
 https://wiki.postgresql.org/wiki/PgCon_2014_Developer_Meeting#9.5_Schedule
 
 Doesn't look all that normal to me, with that commitfest in Feb.

I'd be more inclined to expect the late release of 9.4 to impact things
than that schedule to really change when we freeze...  Just my 2c though.

Thanks,

Stephen




Re: [HACKERS] BRIN indexes - TRAP: BadArgument

2014-11-10 Thread Alvaro Herrera
Alvaro Herrera wrote:

 Hm, I think I see what's happening.  The xl_brin_update record
 references two buffers, one which is target for the updated tuple and
 another which is the revmap buffer.  When the update target buffer is
 being first used we set the INIT bit which removes the buffer reference
 from the xlog record; in that case, if the revmap buffer is first being
 modified after the prior checkpoint, that revmap buffer receives backup
 block number 0; but the code hardcodes it as 1 on the expectation that
 the buffer that's target for the update will receive 0.  The attached
 patch should fix this.

Pushed, thanks for the report.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] BRIN indexes - TRAP: BadArgument

2014-11-10 Thread Alvaro Herrera
Greg Stark wrote:

 1) The manual describes the extensibility API including the BrinOpcInfo
 struct -- but it doesn't define the BrinDesc struct that every API
 method takes. It's not clear what exactly that argument is for or how
 to make use of it.

Hm, I guess this could use some expansion.

 2) The mention about additional opclass operators and to number them
 from 11 up is fine -- but there's no explanation of how to decide what
 operators need to be explicitly added like that. Specifically I gather
 from reading minmax that = is handled internally by Brin and you only
 need to add any other operators aside from = ? Is that right?

I think I already replied to this in the other email.

 3) It's not entirely clear in the docs when each method will be
 invoked. Specifically it's not clear whether opcInfo is invoked once
 when the index is defined or every time the definition is loaded to be
 used. I gather it's the latter? Perhaps there needs to be a method
 that's invoked specifically when the index is defined? I'm wondering
 where I'm going to hook in the logic to determine the size and number
 of hash functions to use for the bloom filter which needs to be
 decided once when the index is created and then static for the index
 in the future.

Every time the index is accessed, yeah.  I'm not sure about figuring the
initial creation details.  Do you think we need another support
procedure to help with that?  We can add it if needed; minmax would just
define it to InvalidOid.

 4) It doesn't look like BRIN handles cross-type operators at all.

The idea here is that there is a separate opclass to handle cross-type
operators, which would be together in the same opfamily as the opclass
used to create the index.  I haven't actually tried this yet, mind you.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] BRIN indexes - TRAP: BadArgument

2014-11-10 Thread Alvaro Herrera
Greg Stark wrote:
 On Sun, Nov 9, 2014 at 5:57 PM, Greg Stark st...@mit.edu wrote:
  2) The mention about additional opclass operators and to number them
  from 11 up is fine -- but there's no explanation of how to decide what
  operators need to be explicitly added like that. Specifically I gather
  from reading minmax that = is handled internally by Brin and you only
  need to add any other operators aside from = ? Is that right?
 
 I see I totally misunderstood the use of the opclass procedure
 functions. I think I understand now but just to be sure -- If I can
 only handle BTEqualStrategyNumber keys then is it adequate to just
 define the opclass containing only the equality operator?

Yes.

I agree that this deserves some more documentation.  In a nutshell, the
opclass must provide three separate groups of items:

1. the mandatory support functions, opcInfo, addValue, Union,
Consistent.  opcInfo is invoked each time the index is accessed
(including during index creation).

2. the additional support functions; normally these are called from
within addValue, Consistent, Union.  For minmax, what we provide is the
functions that implement the inequality operators for the type, that is
< <= >= and >.  Since minmax tries to be generic and support a whole lot
of types, this is the way that the mandatory support functions know what
functions to call to compare two given values.  If the opclass is
specific to one data type, you might not need anything here; or perhaps
you have other ways to figure out a hash function to call, etc.

3. the operators.  We only use these so that the optimizer picks up the
index for queries.
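
As a rough illustration only (the brin_my_* support function names below are
made up; all that matters here is the grouping and the numbering convention),
an opclass definition would look something like:

CREATE OPERATOR CLASS int4_my_brin_ops
FOR TYPE int4 USING brin AS
    -- 1. mandatory support functions
    FUNCTION 1  brin_my_opcinfo(internal),
    FUNCTION 2  brin_my_add_value(internal, internal, internal, internal),
    FUNCTION 3  brin_my_consistent(internal, internal, internal),
    FUNCTION 4  brin_my_union(internal, internal, internal),
    -- 2. additional support functions, numbered from 11 up
    FUNCTION 11 brin_my_compare(int4, int4),
    -- 3. operators, so the planner considers the index
    OPERATOR 3  = (int4, int4);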


 Somehow I got confused between the amprocs that minmax uses to
 implement the consistency function and the amops that the brin index
 supports.

I think it is somewhat confusing, yeah.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-11-10 Thread Andres Freund
On 2014-11-10 14:54:15 -0500, Robert Haas wrote:
 On Mon, Nov 10, 2014 at 1:29 PM, Andres Freund and...@2ndquadrant.com wrote:
  That's an issue to which we may need to engineer some solution, but
  not in this patch.  Overall, the patch's architecture is modeled
  closely on the way we synchronize GUCs to new backends when using
  EXEC_BACKEND, and I don't think we're going to do any better than starting
  with that and working to file down the rough edges as we go.  So I'd
  like to go ahead and commit this.
 
  Did you check out whether PGC_BACKEND GUCs work? Other than that I agree
  that we can just solve further issues as we notice them. This isn't
  particularly intrusive.
 
 Is the scenario you're concerned about this one:
 
 1. Start postmaster.
 2. Start a user session.
 3. Change a PGC_BACKEND GUC in postgresql.conf.
 4. Reload.
 5. Launch a background worker that uses this code.

No, that's not what I was thinking of. Consider:

export pg_libdir=$(pg_config --libdir)
mkdir $pg_libdir/plugins
ln -s $pg_libdir/auto_explain.so $pg_libdir/plugins/
PGOPTIONS='-c local_preload_libraries=auto_explain' psql

and use pg_background() (or something else using this infrastructure) in
that context.
Without having reread your relevant code I'd expect a:
ERROR:  55P02: parameter local_preload_libraries cannot be set after 
connection start
because the code would try to import the user session's
local_preload_libraries value into the background process - after it
has actually already done its initialization.

 It should end up with the same values there as the active session, not
 the current one from postgresql.conf.  But we want to verify that's
 the behavior we actually get.   Right?

But that's also something worthwhile to check.

  -- It doesn't handle types without send/receive functions.  I thought
  that was tolerable since the types without such functions seem like
  mostly edge cases, but Andres doesn't like it.  Apparently, BDR
  makes provisions to either ship the tuple as one big blob - if all
  types are built-in - or else use binary format where supported and text
  format otherwise.  Since this issue is common to BDR, parallelism, and
  pg_background, we might want to introduce some common infrastructure
  for it, but it's not too clear to me right now what that should look
  like.
 
  I think we definitely need to solve this - but I'm also not at all that
  sure how.
 
  For something like pg_background it's pretty trivial to fall back to
  in/out when there's no send/recv. It's just a couple of lines, and the
  portal code has the necessary provisions. So solving it for
  pg_background itself is pretty trivial.
 
 That's not really the interesting part of the problem, I think.  Yeah,
 pg_background can be made to speak text format if we really care.  But
 the real issue is that even with the binary format, converting a tuple
 in on-disk format into a DataRow message so that the other side can
 reconstruct the tuple and shove it into an executor slot (or wherever
 it wants to shove it) seems like it might be expensive.  You might
 have data on that; if it's not expensive, stop here.  If it is
 expensive, what we really want to do is figure out some way to make it
 safe to copy the tuple into the shared memory queue, or send it out
 over the socket, and have the process on the other end use it without
 having to revalidate the whole thing column-by-column.

Yes, sending the tuples as-is is a noticeable win. I've not fully analyzed
where the difference is - I suspect it's to a large degree because it
requires less copying and fewer memory allocations. But some of the send/recv
functions also aren't exactly cheap in themselves.

For BDR we've sidestepped problems around changing type definitions and such
by saying that only built-in types get to do the 'as-is' stuff. That can be
expanded later, but that's it for now. For those there's no chance that
the type definition changes or anything.

A general problem is that you really can't allow this to happen outside
a controlled setting - manipulating data on the 'Datum' level allows you
to do bad things.

  But, as you say, pg_background
  itself isn't a primary goal (although a nice demonstration, with some
  real world applications).  There's enough differences between the
  parallelism and the replication cases that I'm not entirely sure how
  much common ground there is. Namely replication has the additional
  concerns of version differences, trust (don't use blobs if you don't
  fully trust the other side), and differences in the allocated oids
  (especially relevant for arrays which, very annoyingly, embed the oid in
  the send/recv format).
 
 Yeah.  It would be nice to use the same code for deciding what to do
 in a particular case.  It seems like it ought to be possible to have
 one rule for whether a tuple with a given set of data types is safe to
 ship in the on-disk format.  For pg_background/parallelism, it's
 enough if that function returns true; for BDR, 

Re: [HACKERS] Add CREATE support to event triggers

2014-11-10 Thread Christopher Browne
On 8 November 2014 17:49, Robert Haas robertmh...@gmail.com wrote:
  We could just integrate those parts, and be done with it. But would that
  actually be a good thing for the community? Then slony needs to do it
  and potentially others as well? Then auditing can't use it? Then
  potential schema tracking solutions can't use it?

 Do you think Slony is really going to use this?  I guess we can let
 the Slony guys speak for themselves, but I've been skeptical since day
 one that this is the best way to do DDL replication, and I still am.
 There are lots of ways that a replicated DDL statement can fail on the
 replicas, and what are you going to do then?  It's too late to show
 the user the error message, so you can throw it in a log someplace and
 hope that somebody notices, but that's it.  It makes a lot more sense
 to me to use some kind of a tool that applies the DDL in a coordinated
 fashion on all nodes - or even just do it manually, since it might
 very well be desirable to take the lock on different nodes at widely
 different times, separated by a switchover.  I certainly think there's
 a use-case for what you're trying to do here, but I don't think it'll
 be right for everyone.

 Certainly, if the Slony guys - or some other team building an
 out-of-core replication solutions says, hey, we really want this in
 core, that would considerably strengthen the argument for putting it
 there.  But I haven't heard anyone say that yet - unlike logical
 decoding, were we did have other people expressing clear interest in
 using it.

  There've been people for a long while asking about triggers on catalogs
  for that purpose. IIRC Jan was one of them.

 My impression, based on something Christopher Brown said a few years
 ago, is that Slony's DDL trigger needs are largely satisfied by the
 existing event trigger stuff.  It would be helpful to get confirmation
 as to whether that's the case.

I'm not sure that a replication system that intends to do partial
replication (e.g. being selective of what objects are to be replicated)
will necessarily want to use the CREATE event triggers to capture creates.

Several cases pop up with different answers:
a) I certainly don't want to replicate temporary tables
b) I almost certainly don't want to replicate unlogged tables
c) For more ordinary tables, I'm not sure I want to extend Slony
to detect them and add them automatically, because there
are annoying sub-cases

   c.1) If I'm working on data conversion, I may create not totally
 temporary tables that are nonetheless not worthy to replicate.
 (I'm working on such right now)

Long and short: it seems likely that I'd frequently NOT want all new tables
added to replication, at least not all of them, all the time.

What would seem valuable, to me, would be to have a CREATE event
trigger that lets me know the OID and/or fully qualified name of the new
object so that perhaps the replication system:

a) Has some kind of rule system to detect if it wants to replicate it,

b) Logs the change so a human might know later that there's new stuff
that probably ought to be replicated

c) Perhaps a human might put replication into a new suggestive
mode, a bit akin to Slony's EXECUTE SCRIPT, but where the human
essentially says, Here, I'm running DDL against this connection for a
while, and I'd be grateful if Postgres told Slony to capture all the new
tables and sequences and replicated them.

There are kind of two approaches:

a) Just capture the OIDs, and have replication go back later and grab
the table definition once the dust clears on the master

b) We need to capture ALL the DDL, whether CREATE or ALTER, and
forward it, altered to have fully qualified names on everything so that
we don't need to duplicate all the set search_path requests and
such.

I suppose there's also a third...

c) Have a capability to put an event trigger function in place that makes
DDL requests fail.

That's more useful than you'd think; if, by default, we make them fail,
and with an error message such as
  DDL request failed as it was not submitted using slonik DDL TOOL

then we have protection against uncontrolled application of DDL.

DDL TOOL would switch off the fail trigger, possibly trying to
capture the DDL, or perhaps just capturing the statements passed
to it so they get passed everywhere.   (That heads back to a) and b);
what should get captured...)

I'm not sure that all of that is totally internally coherent, but I hope
there are some ideas worth thinking about.
--
When confronted by a difficult problem, solve it by reducing it to the
question, How would the Lone Ranger handle this?


Re: [HACKERS] BRIN indexes - TRAP: BadArgument

2014-11-10 Thread Greg Stark
On Mon, Nov 10, 2014 at 9:31 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Every time the index is accessed, yeah.  I'm not sure about figuring the
 initial creation details.  Do you think we need another support
 procedure to help with that?  We can add it if needed; minmax would just
 define it to InvalidOid.

I have a working bloom filter with hard coded filter size and hard
coded number of hash functions. I need to think about how I'm going to
make it more general now. I think the answer is that I should have an
index option that specifies the false positive rate and calculates the
optimal filter size and number of hash functions. It might possibly
need to peek at the table statistics to determine the population size
though. Or perhaps I should bite the bullet and size the bloom filters
based on the actual number of rows in a chunk since the BRIN
infrastructure does allow each summary to be a different size.
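
For what it's worth, the textbook sizing rule (nothing to do with the patch
itself) is m = -n*ln(p)/(ln 2)^2 bits and k = (m/n)*ln 2 hash functions for n
values at a target false-positive rate p, which is easy enough to compute,
e.g.:

WITH params(n, p) AS (VALUES (1000::float8, 0.01::float8))
SELECT ceil(-n * ln(p) / ln(2)^2)                    AS filter_bits,
       round(ceil(-n * ln(p) / ln(2)^2) / n * ln(2)) AS num_hashes
FROM params;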

There's another API question I have. To implement Consistent I need to
call the hash function which in the case of functions like hashtext
could be fairly expensive and I even need to generate multiple hash
values (though currently I'm slicing them all from the integer hash
value so that's not too bad) and then test each of those bits. It
would be natural to call hashtext once at the start of the scan and
possibly build a bitmap and compare all of them in a single 
operation. But afaict there's no way to hook the beginning of the scan
and opaque is not associated with the specific scan so I don't think I
can cache the hash value of the scan key there safely. Is there a good
way to do it with the current API?

On a side note I'm curious about something: I was stepping through
my code in gdb and discovered that a single-row insert appeared to
construct a new summary and then union it into the existing summary
instead of just calling AddValue on the existing summary. Is that
intentional? What led to that?

-- 
greg




Re: [HACKERS] [GSoC2014] Patch ALTER TABLE ... SET LOGGED

2014-11-10 Thread Fabrízio de Royes Mello
On Thu, Nov 6, 2014 at 3:42 AM, Michael Paquier michael.paqu...@gmail.com
wrote:

 On Sat, Sep 13, 2014 at 11:02 PM, Fabrízio de Royes Mello
 fabriziome...@gmail.com wrote:
  Patch rebased and added to commitfest [1].
 It looks like a good thing to remove ATChangeIndexesPersistence, this
 puts the persistence switch directly into reindex process.

 A couple of minor comments about this patch:
 1) Reading it, I am wondering if it would not be finally time to
 switch to a macro to get a relation's persistence, something like
 RelationGetPersistence in rel.h... Not related directly to this patch.

Good idea... I'll provide a patch soon.


 2) reindex_index has as new argument a relpersistence value for the
 new index. reindex_relation has differently a new set of flags to
 enforce the relpersistence of all the underling indexes. Wouldn't it
 be better for API consistency to pass directly a relpersistence value
 through reindex_relation? In any case, the comment block of
 reindex_relation does not contain a description of the new flags.

I did it because ReindexDatabase builds a list of OIDs and runs
reindex_relation for each item of the list. I can change the list to store
the relpersistence as well, but that can lead to concurrency trouble:
if one session executes REINDEX DATABASE and another session runs
ALTER TABLE name SET {LOGGED|UNLOGGED} while the list is being built,
the reindex can leave us in an inconsistent state.

Added comments to the comment block of reindex_relation.


 3) Here you may as well just set the value and be done:
 +/*
 + * Check if need to set the new relpersistence
 + */
 +if (iRel->rd_rel->relpersistence != relpersistence)
 +iRel->rd_rel->relpersistence = relpersistence;

Hmmm... I really don't know why I did it... fixed.

Thanks!

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
 Timbira: http://www.timbira.com.br
 Blog: http://fabriziomello.github.io
 Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello
 Github: http://github.com/fabriziomello
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 912038a..50cf0ef 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1980,7 +1980,7 @@ index_build(Relation heapRelation,
 	 * created it, or truncated twice in a subsequent transaction, the
 	 * relfilenode won't change, and nothing needs to be done here.
 	 */
-	if (heapRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+	if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
 		!smgrexists(indexRelation->rd_smgr, INIT_FORKNUM))
 	{
 		RegProcedure ambuildempty = indexRelation->rd_am->ambuildempty;
@@ -3130,7 +3130,7 @@ IndexGetRelation(Oid indexId, bool missing_ok)
  * reindex_index - This routine is used to recreate a single index
  */
 void
-reindex_index(Oid indexId, bool skip_constraint_checks)
+reindex_index(Oid indexId, bool skip_constraint_checks, char relpersistence)
 {
 	Relation	iRel,
 heapRelation;
@@ -3191,6 +3191,9 @@ reindex_index(Oid indexId, bool skip_constraint_checks)
 			indexInfo->ii_ExclusionStrats = NULL;
 		}
 
+		/* Set the relpersistence of the new index */
+		iRel->rd_rel->relpersistence = relpersistence;
+
 		/* We'll build a new physical relation for the index */
 		RelationSetNewRelfilenode(iRel, InvalidTransactionId,
   InvalidMultiXactId);
@@ -3310,6 +3313,12 @@ reindex_index(Oid indexId, bool skip_constraint_checks)
  * performance, other callers should include the flag only after transforming
  * the data in a manner that risks a change in constraint validity.
  *
+ * REINDEX_REL_FORCE_UNLOGGED: if true, force the new rebuilt indexes to be
+ * unlogged.
+ *
+ * REINDEX_REL_FORCE_LOGGED: if true, force the new rebuilt indexes to be
+ * permanent.
+ *
  * Returns true if any indexes were rebuilt (including toast table's index
  * when relevant).  Note that a CommandCounterIncrement will occur after each
  * index rebuild.
@@ -3389,11 +3398,19 @@ reindex_relation(Oid relid, int flags)
 		foreach(indexId, indexIds)
 		{
 			Oid			indexOid = lfirst_oid(indexId);
+			char		relpersistence = rel->rd_rel->relpersistence;
 
 			if (is_pg_class)
 RelationSetIndexList(rel, doneIndexes, InvalidOid);
 
-			reindex_index(indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS));
+			/* Check for flags to force UNLOGGED or PERMANENT persistence */
+			if (flags & REINDEX_REL_FORCE_UNLOGGED)
+				relpersistence = RELPERSISTENCE_UNLOGGED;
+			else if (flags & REINDEX_REL_FORCE_PERMANENT)
+				relpersistence = RELPERSISTENCE_PERMANENT;
+
+			reindex_index(indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
+						  relpersistence);
 
 			CommandCounterIncrement();
 
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 6a578ec..95b9a4f 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -589,7 +589,8 @@ rebuild_relation(Relation OldHeap, Oid 

Re: [HACKERS] Add CREATE support to event triggers

2014-11-10 Thread Petr Jelinek

On 10/11/14 23:37, Christopher Browne wrote:

On 8 November 2014 17:49, Robert Haas robertmh...@gmail.com
 
  My impression, based on something Christopher Brown said a few years
  ago, is that Slony's DDL trigger needs are largely satisfied by the
  existing event trigger stuff.  It would be helpful to get confirmation
  as to whether that's the case.

I'm not sure that a replication system that intends to do partial
replication
(e.g. - being selective of what objects are to be replicated) will
necessarily
want to use the CREATE event triggers to capture creates.

Several cases pop up with different answers:
a) I certainly don't want to replicate temporary tables
b) I almost certainly don't want to replicate unlogged tables
c) For more ordinary tables, I'm not sure I want to extend Slony
 to detect them and add them automatically, because there
 are annoying sub-cases

c.1) If I'm working on data conversion, I may create not totally
  temporary tables that are nonetheless not worthy to replicate.
  (I'm working on such right now)

Long and short: it seems likely that I'd frequently NOT want all new tables
added to replication, at least not all of them, all the time.


I don't see how this is a problem with using CREATE event triggers; just
put logic in your trigger that handles it. You get the object identity
of the object that is being created/altered, so you can get any info
about it you wish and easily filter however you want.



There are kind of two approaches:

a) Just capture the OIDs, and have replication go back later and grab
the table definition once the dust clears on the master

b) We need to capture ALL the DDL, whether CREATE or ALTER, and
forward it, altered to have fully qualified names on everything so that
we don't need to duplicate all the set search_path requests and
such.



This is basically what this patch gives you (actually both the canonical
command and the identity)?



I suppose there's also a third...

c) Have a capability to put an event trigger function in place that makes
DDL requests fail.

That's more useful than you'd think; if, by default, we make them fail,
and with an error messages such as
   DDL request failed as it was not submitted using slonik DDL TOOL



You can do that already; it's even the example in the event trigger
documentation.
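
Roughly like this (more or less the documented example, typed from memory, so
the details may be slightly off):

CREATE OR REPLACE FUNCTION abort_any_command()
  RETURNS event_trigger
 LANGUAGE plpgsql
  AS $$
BEGIN
  RAISE EXCEPTION 'command % is disabled', tg_tag;
END;
$$;

CREATE EVENT TRIGGER abort_ddl ON ddl_command_start
   EXECUTE PROCEDURE abort_any_command();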


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Add CREATE support to event triggers

2014-11-10 Thread Andres Freund
On 2014-11-10 17:37:40 -0500, Christopher Browne wrote:
 On 8 November 2014 17:49, Robert Haas robertmh...@gmail.com wrote:
   We could just integrate those parts, and be done with it. But would that
   actually be a good thing for the community? Then slony needs to do it
   and potentially others as well? Then auditing can't use it? Then
   potential schema tracking solutions can't use it?
 
  Do you think Slony is really going to use this?  I guess we can let
  the Slony guys speak for themselves, but I've been skeptical since day
  one that this is the best way to do DDL replication, and I still am.
  There are lots of ways that a replicated DDL statement can fail on the
  replicas, and what are you going to do then?  It's too late to show
  the user the error message, so you can throw it in a log someplace and
  hope that somebody notices, but that's it.  It makes a lot more sense
  to me to use some kind of a tool that applies the DDL in a coordinated
  fashion on all nodes - or even just do it manually, since it might
  very well be desirable to take the lock on different nodes at widely
  different times, separated by a switchover.  I certainly think there's
  a use-case for what you're trying to do here, but I don't think it'll
  be right for everyone.
 
  Certainly, if the Slony guys - or some other team building an
  out-of-core replication solutions says, hey, we really want this in
  core, that would considerably strengthen the argument for putting it
  there.  But I haven't heard anyone say that yet - unlike logical
  decoding, were we did have other people expressing clear interest in
  using it.
 
   There've been people for a long while asking about triggers on catalogs
   for that purpose. IIRC Jan was one of them.
 
  My impression, based on something Christopher Brown said a few years
  ago, is that Slony's DDL trigger needs are largely satisfied by the
  existing event trigger stuff.  It would be helpful to get confirmation
  as to whether that's the case.
 
 I'm not sure that a replication system that intends to do partial
 replication
 (e.g. - being selective of what objects are to be replicated) will
 necessarily
 want to use the CREATE event triggers to capture creates.
 
 Several cases pop up with different answers:
 a) I certainly don't want to replicate temporary tables
 b) I almost certainly don't want to replicate unlogged tables

Those are quite easy to recognize and skip.

 c) For more ordinary tables, I'm not sure I want to extend Slony
 to detect them and add them automatically, because there
 are annoying sub-cases
 
c.1) If I'm working on data conversion, I may create not totally
  temporary tables that are nonetheless not worthy to replicate.
  (I'm working on such right now)

Sure, you might not want to do it automatically all the time - but I
think it's a very useful default mode. Once you can replicate CREATEs
per se, it's easy to add logic (in a couple of lines of plpgsql or
whatever) to only do so in a certain schema or similar.
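
E.g. something like the sketch below - note that the
pg_event_trigger_get_creation_commands() name and its result columns are just
what I remember of the patch's interface, and the replication_queue table is
made up, so don't take the details literally:

CREATE OR REPLACE FUNCTION queue_filtered_creates()
 RETURNS event_trigger
 LANGUAGE plpgsql AS $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT * FROM pg_event_trigger_get_creation_commands() LOOP
        -- only queue objects created in the 'public' schema
        CONTINUE WHEN r.schema IS DISTINCT FROM 'public';
        INSERT INTO replication_queue (identity, command)
            VALUES (r.identity, r.command);
    END LOOP;
END;
$$;

CREATE EVENT TRIGGER queue_creates ON ddl_command_end
    EXECUTE PROCEDURE queue_filtered_creates();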

But the main reason all this is interesting isn't so much CREATE
itself, but that it can be done (and Alvaro has mostly done it!) for ALTER
as well. And there it imo becomes really interesting: because you can quite
easily check whether the affected relation is being replicated, you can
just emit the DDL when that's the case. And that makes DDL in a
logically replicated setup *much* easier.

 Long and short: it seems likely that I'd frequently NOT want all new tables
 added to replication, at least not all of them, all the time.

Agreed. That's quite possible with the design here - you get the
creation commands and can decide whether you want to do anything with
them. You're not forced to insert them into your replication queue or
whatever you're using for that.

 What would seem valuable, to me, would be to have a CREATE event
 trigger that lets me know the OID and/or fully qualified name of the new
 object so that perhaps the replication system:
 
 a) Has some kind of rule system to detect if it wants to replicate it,

Sure.

 b) Logs the change so a human might know later that there's new stuff
 that probably ought to be replicated

Sure.

 c) Perhaps a human might put replication into a new suggestive
 mode, a bit akin to Slony's EXECUTE SCRIPT, but where the human
 essentially says, Here, I'm running DDL against this connection for a
 while, and I'd be grateful if Postgres told Slony to capture all the new
 tables and sequences and replicated them.

Sure.

Some of that already is possible with the current event triggers - and
all of it would be possible with the suggested functionality here.

An old version of bdr, employing the functionality presented here, had
the following (simplified) event trigger:

CREATE OR REPLACE FUNCTION bdr.queue_commands()
 RETURNS event_trigger
LANGUAGE plpgsql
AS $function$
DECLARE
r RECORD;
BEGIN
-- don't recursively log ddl commands
IF 

Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-10 Thread Andres Freund
Hi Robert, All,

On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
 On Wed, Oct 15, 2014 at 2:55 PM, Simon Riggs si...@2ndquadrant.com wrote:
  Something usable, with severe restrictions, is actually better than we
  have now. I understand the journey this work represents, so don't be
  embarrassed by submitting things with heuristics and good-enoughs in
  it. Our mentor, Mr.Lane, achieved much by spreading work over many
  releases, leaving others to join in the task.
 
 It occurs to me that, now that the custom-scan stuff is committed, it
 wouldn't be that hard to use that, plus the other infrastructure we
 already have, to write a prototype of parallel sequential scan.  Given
 where we are with the infrastructure, there would be a number of
 unhandled problems, such as deadlock detection (needs group locking or
 similar), assessment of quals as to parallel-safety (needs
 proisparallel or similar), general waterproofing to make sure that
  pushing down a qual we shouldn't doesn't do anything really dastardly
 like crash the server (another written but yet-to-be-published patch
 adds a bunch of relevant guards), and snapshot sharing (likewise).
 But if you don't do anything weird, it should basically work.
 
 I think this would be useful for a couple of reasons.  First, it would
 be a demonstrable show of progress, illustrating how close we are to
 actually having something you can really deploy.  Second, we could use
 it to demonstrate how the remaining infrastructure patches close up
 gaps in the initial prototype.  Third, it would let us start doing
 real performance testing.

I think it might be a useful experiment - as long as it's clear that
it's that. Which is, I think, what you're thinking about?

 It seems pretty clear that a parallel sequential scan of data that's
 in memory (whether the page cache or the OS cache) can be accelerated
 by having multiple processes scan it in parallel.

Right.

 But it's much less clear what will happen when the data is being read
 in from disk.

I think that *very* heavily depends on the IO subsystem.

 Does parallelism help at all?

I'm pretty damn sure. We can't even make a mildly powerful storage
fully busy right now. Heck, I can't make my workstation's storage with a
raid 10 out of four spinning disks fully busy.

I think some of that benefit also could be reaped by being better at
hinting the OS...

 What degree of parallelism helps?

That's quite a hard question. Generally the question about how much
parallelism for what is beneficial will be one of the most complicated
areas once the plumbing is in.

 Do we break OS readahead so badly that performance actually regresses?

I don't think it's likely that we break OS readahead - that works on a
per-task basis, at least on linux afaik. But it's nonetheless very easy
to have too many streams causing too many random reads.

 These are things that are likely to
 need a fair amount of tuning before this is ready for prime time, so
 being able to start experimenting with them in advance of all of the
 infrastructure being completely ready seems like it might help.

I'm not actually entirely sure how much that's going to help. I think
you could very well have a WIP patch ready reasonably quickly that doesn't
solve the issues you mention above, just by patching it in. For the kind of
testing we're talking about that seems likely to be sufficient - a git branch
somewhere is probably actually easier for people to compile than some
contrib module that needs to be loaded...
And I *do* think that you'll very quickly hit the limits of the custom
scan API. I'd much rather see you work on improving parallelism than on
the custom scan stuff, just so you can prototype further ahead.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-10 Thread Haribabu Kommi
On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
 Does parallelism help at all?

 I'm pretty damn sure. We can't even make a mildly powerful storage
 fully busy right now. Heck, I can't make my workstation's storage with a
 raid 10 out of four spinning disks fully busy.

 I think some of that benefit also could be reaped by being better at
 hinting the OS...

Yes, it definitely helps, and it's not limited to I/O-bound operations.
It gives a good gain for queries with CPU-intensive WHERE conditions.

One more point we may need to consider: is there any overhead in passing
the data rows from the workers to the backend? I feel this may play a major
role when the selectivity is high.

Regards,
Hari Babu
Fujitsu Australia




Re: [HACKERS] SSL information view

2014-11-10 Thread Michael Paquier
On Tue, Nov 11, 2014 at 1:43 AM, Magnus Hagander mag...@hagander.net
wrote:

 Right now it just truncates the dn at NAMEDATALEN - so treating it the
 same as we do with hostnames. My guess is this is not a big problem
 because in the case of long DNs, most of the time the important stuff
 is at the beginning anyway... (And it's not like it's actually used
 for authentication, in which case it would of course be a problem).

You should add it to the next CF for proper tracking; there are already
many patches in the queue waiting for review. :)
-- 
Michael


Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

2014-11-10 Thread Jim Nasby

On 11/10/14, 12:56 PM, Andres Freund wrote:

On 2014-11-10 12:37:29 -0600, Jim Nasby wrote:

On 11/10/14, 12:15 PM, Andres Freund wrote:

If what we want is to quantify the extent of the issue, would it be more
convenient to save counters to pgstat?  Vacuum already sends pgstat
messages, so there's no additional traffic there.

I'm pretty strongly against that one in isolation. They'd need to be
stored somewhere and they'd need to be queryable somewhere with enough
context to make sense.  To actually make sense of the numbers we'd also
need to report all the other datapoints of vacuum in some form. That's
quite a worthwhile project imo - but *much* *much* more work than this.


We already report statistics on vacuums
(lazy_vacuum_rel()/pgstat_report_vacuum), so this would just be adding
1 or 2 counters to that. Should we add the other counters from vacuum?
That would be significantly more data.


At the very least it'd require:
* The number of buffers skipped due to the vm
* The number of buffers actually scanned
* The number of full table in contrast to partial vacuums


If we're going to track full scan vacuums separately, I think we'd need two separate scan 
counters. I think (for pgstats) it'd make more sense to just count initial failure to 
acquire the lock in a full scan in the 'skipped page' counter. In terms of answering the 
question how common is it not to get the lock, it's really the same event.


I think it'd require a fair amount of thinking about which values are
required to make sense of the number of skipped buffers due to not being
able to acquire the cleanup lock.

If you want to do this - and I sure don't want to stop you from it - you
should look at it from a general perspective, not from the perspective
of how skipped cleanup locks are logged.


Honestly, my desire at this point is just to see if there's actually a problem. 
Many people are asserting that this should be a very rare occurrence, but 
there's no way to know.

Towards that simple end, I'm a bit torn. My preference would be to simply log, 
and throw a warning if it's over some threshold. I believe that would give the 
best odds of getting feedback from users if this isn't as uncommon as we think.

I agree that ideally this would be tracked as another stat, but from that
standpoint I think there are other, much more important metrics to track, and
AFAIK the only reason we don't have them is that busy systems already push
pgstats to its limits. We should try to fix that, but that's a much bigger
issue.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

2014-11-10 Thread Tom Lane
Jim Nasby jim.na...@bluetreble.com writes:
 On 11/10/14, 12:56 PM, Andres Freund wrote:
 If you want to do this - and I sure don't want to stop you from it - you
 should look at it from a general perspective, not from the perspective
 of how skipped cleanup locks are logged.

 Honestly, my desire at this point is just to see if there's actually a 
 problem. Many people are asserting that this should be a very rare 
 occurrence, but there's no way to know.

 Towards that simple end, I'm a bit torn. My preference would be to simply 
 log, and throw a warning if it's over some threshold. I believe that would 
 give the best odds of getting feedback from users if this isn't as uncommon 
 as we think.

 I agree that ideally this would be tracked as another stat, but from that
 standpoint I think there are other, much more important metrics to track, and
 AFAIK the only reason we don't have them is that busy systems already push
 pgstats to its limits. We should try to fix that, but that's a much bigger
 issue.

Yeah.  We know that per-table pgstat counters are a pretty expensive thing
in databases with many tables.  We should absolutely not be adding them on
mere speculation that the number might be interesting.

Now, that objection would not apply to a per *database* counter, but I'm
not sure if tracking the numbers at that granularity would help anyone.

On the whole, I'm +1 for just logging the events and seeing what we learn
that way.  That seems like an appropriate amount of effort for finding out
whether there is really an issue.

regards, tom lane




Re: [HACKERS] On partitioning

2014-11-10 Thread Amit Langote

Hi,

 From: Robert Haas [mailto:robertmh...@gmail.com]
 Sent: Saturday, November 08, 2014 5:41 AM

 I'd be in favor of that.

Thanks!

  I am not sure whether the code is close
 enough to what we need to be really useful, but that's for you to
 decide.

Hmm, I'm not entirely convinced about the patch as it stands either, but I will
try to restate below what the patch in its current state does anyway (just to
refresh):

The patch provides syntax to:
 * Specify partitioning key, optional partition definitions within CREATE TABLE,
 * A few ALTER TABLE commands that let you define a partitioning key 
(partitioning a table after the fact), attach/detach an existing table as a 
partition of a partitioned table,
 * CREATE PARTITION to create a new partition on a partitioned table.

The above commands are merely transformed into ALTER TABLE subcommands that
arrange the partitioned table and its partitions into an inheritance hierarchy,
but with extra information, that is, the allowed values for each partition in a
new anyarray column called 'pg_inherits.values'. A special case of
ATExecAddInherit(), namely ATExecAttachPartitionI(), also adds partition
constraints in the form of appropriate CHECK constraints as part of its
processing. So a few of the manual steps are automated and additional (IMHO
non-opaque) metadata (namely partition boundaries/list values) is added.

Additionally, defining a partitioning key (PARTITION BY) creates a pg_partition
entry that specifies, for a partitioned table, the partition kind (range/list),
an opclass for key value comparison, and a key 'expression'
(say, colname % 10).
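
For example, commands along these lines (the grammar here is invented purely
to illustrate the commands described above and need not match the patch):

CREATE TABLE measurement (logdate date, peaktemp int)
    PARTITION BY RANGE ON (logdate);

CREATE PARTITION measurement_y2014 ON measurement
    FOR VALUES LESS THAN ('2015-01-01');

-- attach/detach an existing table as a partition
ALTER TABLE measurement ATTACH PARTITION measurement_y2013
    FOR VALUES LESS THAN ('2014-01-01');
ALTER TABLE measurement DETACH PARTITION measurement_y2013;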

A few key things I can think of as needing improvement would be  (perhaps just 
reiterating a review of the patch):

 * partition pruning would still depend on constraint exclusion using the CHECK 
constraints (same old)
 * there is no tuple-routing at all (same can be said of partition pruning 
above)
 * partition pruning or tuple-routing would require a scan over pg_inherits 
(perhaps inefficient)
 * partitioning key is an expression which might not be a good idea in early 
stages of the implementation (might be better off with just the attnum of the 
column to partition on?)
 * there is no DROP PARTITION (in fact, it is suggested not to go CREATE/DROP 
PARTITION route at all) - ALTER TABLE ... ADD/DROP PARTITION?

Some other important ones:
 * dependency handling related oversights
 * constraint propagation related oversights

And then there are some oddities of behaviour that I am seeing while trying out
things that the patch does. Please feel free to point out others that I am not
seeing. I am sure these improvements need more than just tablecmds.c hacking,
which is what the current patch mostly does.

The first two points could use separate follow-on patches, as I feel they need
extensive changes unless I am missing something. I will try to post possible
solutions to these issues, provided the metadata in its current form is OK to
proceed with.

 In my view, the main problem we should be trying to solve
 here is avoid relying on constraint exclusion.  In other words, the
 syntax for adding a partition should put some metadata into the system
 catalogs that lets us do partitioning pruning very very quickly,
 without theorem-proving.  For example, for list or range partitioning,
 a list of partition bounds would be just right: you could
 binary-search it.  The same metadata should also be suitable for
 routing inserts to the proper partition, and handling partition motion
 when a tuple is updated.

 Now there's other stuff we might very well want to do, but I think
 making partition pruning and tuple routing fast would be a pretty big
 win by itself.


Those are definitely goals worth striving for.

Thanks for your time.

Regards,
Amit







Re: [HACKERS] Add CREATE support to event triggers

2014-11-10 Thread Jim Nasby

On 11/10/14, 4:58 PM, Andres Freund wrote:

But the main reason all this is interesting isn't so much CREATE
itself. But that it can be (and Alvaro has mostly done it!) for ALTER as
well. And there it imo becomes really interesting. Because you can quite
easily check whether the affected relation is being replicated you can
just emit the DDL when that's the case. And that makes DDL in a
logically replicated setup*much*  easier.


+1. Adding columns is a PITA, you have to manually ensure you do it on all 
slaves first.

Drop is somewhat worse, because you have to do it on the master first, opposite 
of the (more usual) case of adding a column.

RENAME is a complete disaster.

Handing scripts to your replication system to execute isn't a very good 
alternative either; it assumes that you actually have a script (bad assumption 
with ORMs), and that you have a reasonable way to get that script to wherever 
you run your replication system.

I will also weigh in that there are a LOT of cases that binary replication doesn't cover.
I find it interesting that prior to creating built-in replication, the community stance
was "we won't build that because there are too many different use cases", but now
some folks are saying that everyone should just use streaming rep and be done with it. :P
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] BRIN indexes - TRAP: BadArgument

2014-11-10 Thread Alvaro Herrera
Greg Stark wrote:

 There's another API question I have. To implement Consistent I need to
 call the hash function which in the case of functions like hashtext
 could be fairly expensive and I even need to generate multiple hash
 values(though currently I'm slicing them all from the integer hash
 value so that's not too bad) and then test each of those bits. It
 would be natural to call hashtext once at the start of the scan and
 possibly build a bitmap and compare all of them in a single 
 operation. But afaict there's no way to hook the beginning of the scan
 and opaque is not associated with the specific scan so I don't think I
 can cache the hash value of the scan key there safely. Is there a good
 way to do it with the current API?

I'm not sure why you say opaque is not associated with the specific
scan.  Are you thinking we could reuse opaque for a future scan?  I
think we could consider that opaque *is* the place to cache things such
as the hashed value of the qual constants or whatever.

 On a side note I'm curious about something, I was stepping through the
 my code in gdb and discovered that a single row insert appeared to
 construct a new summary then union it into the existing summary
 instead of just calling AddValue on the existing summary. Is that
 intentional? What led to that?

That's to test the Union procedure; if you look at the code, it's just
used in assert-enabled builds.  Now that I think about it, perhaps this
can turn out to be problematic for your bloom filter opclass.  I
considered the idea of allowing the opclass to disable this testing
procedure, but it isn't done (yet.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Convert query plan to sql query

2014-11-10 Thread mariem
If what you're wishing for is that you could also capture the effects 
of planner steps that are in the nature of source-to-source 
transformations, I think that's going to be harder because no great 
effort has been made to keep those at arm's length from steps that 
don't have results describable as pure SQL.  However, you could probably 
get pretty far if you applied ruleutils.c to the modified parse tree 
after the constant-folding and join tree simplification phases. 

I'm not sure I understand what you mean by source-to-source transformations.
But yes, what I'm aiming for is applying the simplification phases and
constant-folding before transforming the query tree back to an SQL text query.
Thank you for the suggestions.

Best,
Mariem







Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-10 Thread Amit Kapila
On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi kommi.harib...@gmail.com
wrote:

 On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund and...@2ndquadrant.com
wrote:
  On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
  Does parallelism help at all?
 
  I'm pretty damn sure. We can't even make a mildly powerfull storage
  fully busy right now. Heck, I can't make my workstation's storage with a
  raid 10 out of four spinning disks fully busy.
 
  I think some of that benefit also could be reaped by being better at
  hinting the OS...

 Yes, it definitely helps but not only limited to IO bound operations.
 It gives a good gain for the queries having CPU intensive where
conditions.

 One more point we may need to consider, is there any overhead in passing
 the data row from workers to backend?

I am not sure that overhead will be very visible if we improve the
use of the I/O subsystem by making parallel tasks work on it. However,
another idea here could be that instead of passing tuple data, we just
pass the tuple id; but in that case we have to retain the pin on the buffer
that contains the tuple until the master backend reads from it, which might
have its own kind of problems.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-10 Thread Haribabu Kommi
On Tue, Nov 11, 2014 at 2:35 PM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi kommi.harib...@gmail.com
 wrote:

 On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund and...@2ndquadrant.com
 wrote:
  On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
  Does parallelism help at all?
 
  I'm pretty damn sure. We can't even make a mildly powerful storage
  fully busy right now. Heck, I can't make my workstation's storage with a
  raid 10 out of four spinning disks fully busy.
 
  I think some of that benefit also could be reaped by being better at
  hinting the OS...

 Yes, it definitely helps, and it's not limited to I/O-bound operations.
 It gives a good gain for queries with CPU-intensive WHERE conditions.

 One more point we may need to consider: is there any overhead in passing
 the data rows from the workers to the backend?

 I am not sure that overhead will be very visible if we improve the
 use of the I/O subsystem by making parallel tasks work on it.

I feel there may be an overhead because the workers need to put the result
data in shared memory and the backend has to read it from there to process
it further. If the cost of transferring data from worker to backend is more
than that of fetching a tuple from the scan, then the overhead becomes visible
when the selectivity is high.

 However,
 another idea here could be that instead of passing tuple data, we just
 pass the tuple id; but in that case we have to retain the pin on the buffer
 that contains the tuple until the master backend reads from it, which might
 have its own kind of problems.

Transferring just the tuple id doesn't help in scenarios where the node needs
any projection.

Regards,
Hari Babu
Fujitsu Australia




Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-10 Thread Michael Paquier
On Mon, Nov 10, 2014 at 1:48 PM, Peter Geoghegan p...@heroku.com wrote:

 Reminder: I maintain a slight preference for only offering one
 suggestion per relation RTE, which is what this revision does (so no
 change there). If a committer who picks this up wants me to alter
 that, I don't mind doing so; since only Michael spoke up on this, I've
 kept things my way.

Hm. The last version of this patch has not really changed since my
first review, and I have no more feedback to provide about it except what I
already mentioned. I honestly don't think that this patch is ready for
committer as-is... If someone wants to review it further, well extra
opinions I am sure are welcome.
-- 
Michael


Re: [HACKERS] Race in tablespace test on Windows

2014-11-10 Thread Amit Kapila
On Sat, Nov 8, 2014 at 10:34 AM, Noah Misch n...@leadboat.com wrote:

 In my Windows development environment, the tablespace regression test
fails
 approximately half the time.  Buildfarm member frogmouth failed in the
same
 manner at least once:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=frogmouth&dt=2014-05-21%2014%3A30%3A01

 Here is a briefer command sequence exhibiting the same problem:

 CREATE TABLESPACE testspace LOCATION '...somewhere...';
 CREATE TABLE atable (c int) tablespace testspace;
 SELECT COUNT(*) FROM atable;-- open heap
 \c -
 ALTER TABLE atable SET TABLESPACE pg_default;
 DROP TABLESPACE testspace;  -- bug: fails sometimes
 DROP TABLESPACE testspace;  -- second one ~always works
 DROP TABLE atable;


For me, it doesn't succeed even the second time; I keep getting
the same error until I execute some command in the first session,
which means until the first session has processed the invalidation
messages.

postgres=# Drop tablespace tbs;
ERROR:  tablespace tbs is not empty
postgres=# Drop tablespace tbs;
ERROR:  tablespace tbs is not empty

I have tested this on Windows 7.

 When we unlink an open file, Windows retains it in the directory structure
 until all processes close it.  ALTER TABLE SET TABLESPACE sends
invalidation
 messages prompting backends to do so.  The backend running the ALTER TABLE
 always processes invalidations before processing another command.  The
other
 backend, the one serving commands before \c -, may have neither exited
nor
 processed the invalidation.  When it yet holds a file descriptor for
atable,
 the DROP TABLESPACE fails.  I suspect it's possible, though more
difficult, to
 see like trouble in dbcommands.c users of
RequestCheckpoint(CHECKPOINT_WAIT).

 To make this work as well on Windows as it does elsewhere, DROP TABLESPACE
 would need to wait for other backends to close relevant unlinked files.
 Perhaps implement wait_unlinked_files(const char *dirname) to poll
unlinked,
 open files until they disappear.  (An attempt to open an unlinked file
reports
 ERROR_ACCESS_DENIED.  It might be tricky to reliably distinguish this
cause
 from other causes of that error, but it should be possible.)

I think the proposed mechanism can work, but the wait can be very long
(until the backend holding the descriptor executes another command).
Can we think of some other solution, like having DROP TABLESPACE, instead of
checking whether the directory is empty, check that no object belonging to
the database/cluster is left in it, and then forcibly delete that directory
somehow?
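
For reference, a rough sketch of the polling helper Noah describes, assuming
ERROR_ACCESS_DENIED is taken to mean "unlinked but still open" (which, as he
notes, may need more careful disambiguation); wait_unlinked_file and its
timeout handling are invented for this sketch:

#include <stdbool.h>
#include <windows.h>

/*
 * Poll until an unlinked-but-still-open file has really disappeared.
 * Returns true once the file is gone, false on timeout or unexpected error.
 */
static bool
wait_unlinked_file(const char *path, int timeout_ms)
{
	int			waited_ms = 0;

	for (;;)
	{
		HANDLE		h = CreateFileA(path, GENERIC_READ,
									FILE_SHARE_READ | FILE_SHARE_WRITE |
									FILE_SHARE_DELETE,
									NULL, OPEN_EXISTING, 0, NULL);

		if (h != INVALID_HANDLE_VALUE)
		{
			/* The file is openable, so it was not unlinked; nothing to wait for. */
			CloseHandle(h);
			return true;
		}
		if (GetLastError() == ERROR_FILE_NOT_FOUND)
			return true;		/* fully gone */
		if (GetLastError() != ERROR_ACCESS_DENIED)
			return false;		/* some other failure; give up */

		/* Presumed pending delete: another backend still holds it open. */
		if (waited_ms >= timeout_ms)
			return false;
		Sleep(100);
		waited_ms += 100;
	}
}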


 I propose to add
 this as a TODO, then bandage the test case with s/^\\c -$/RESET ROLE;/.

Yeah, this makes sense.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] REINDEX CONCURRENTLY 2.0

2014-11-10 Thread Michael Paquier
On Tue, Nov 11, 2014 at 3:24 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 On Wed, Nov 5, 2014 at 8:49 PM, Michael Paquier michael.paqu...@gmail.com 
 wrote:

 On Thu, Oct 30, 2014 at 5:19 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
  Updated patch is attached.
 Please find attached an updated patch with the following things changed:
 - Addition of tab completion in psql for all new commands
 - Addition of a call to WaitForLockers in index_concurrent_swap to
 ensure that there are no running transactions on the parent table
 running before exclusive locks are taken on the index and its
 concurrent entry. Previous patch versions created deadlocks because of
 that, issue spotted by the isolation tests integrated in the patch.
 - Isolation tests for reindex concurrently are re-enabled by default.
 Regards,


 It looks like this needs another rebase, I get failures on index.c, 
 toasting.c, indexcmds.c, and index.h

Indeed. There are some conflicts created by the recent modification of
index_create. Here is a rebased patch.
-- 
Michael
diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml
index cd55be8..653b120 100644
--- a/doc/src/sgml/mvcc.sgml
+++ b/doc/src/sgml/mvcc.sgml
@@ -864,7 +864,8 @@ ERROR:  could not serialize access due to read/write dependencies among transact
 
 para
  Acquired by commandVACUUM/command (without optionFULL/option),
- commandANALYZE/, commandCREATE INDEX CONCURRENTLY/, and
+ commandANALYZE/, commandCREATE INDEX CONCURRENTLY/,
+ commandREINDEX CONCURRENTLY/,
  commandALTER TABLE VALIDATE/command and other
  commandALTER TABLE/command variants (for full details see
  xref linkend=SQL-ALTERTABLE).
@@ -1143,7 +1144,7 @@ ERROR:  could not serialize access due to read/write dependencies among transact
 
sect2 id=locking-pages
 titlePage-level Locks/title
-  
+
 para
  In addition to table and row locks, page-level share/exclusive locks are
  used to control read/write access to table pages in the shared buffer
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index cabae19..285f3ff 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  refsynopsisdiv
 synopsis
-REINDEX { INDEX | TABLE | DATABASE | SYSTEM } replaceable class=PARAMETERname/replaceable [ FORCE ]
+REINDEX { INDEX | TABLE | DATABASE | SYSTEM } [ CONCURRENTLY ] replaceable class=PARAMETERname/replaceable [ FORCE ]
 /synopsis
  /refsynopsisdiv
 
@@ -68,9 +68,12 @@ REINDEX { INDEX | TABLE | DATABASE | SYSTEM } replaceable class=PARAMETERnam
   An index build with the literalCONCURRENTLY/ option failed, leaving
   an quoteinvalid/ index. Such indexes are useless but it can be
   convenient to use commandREINDEX/ to rebuild them. Note that
-  commandREINDEX/ will not perform a concurrent build. To build the
-  index without interfering with production you should drop the index and
-  reissue the commandCREATE INDEX CONCURRENTLY/ command.
+  commandREINDEX/ will perform a concurrent build if literal
+  CONCURRENTLY/ is specified. To build the index without interfering
+  with production you should drop the index and reissue either the
+  commandCREATE INDEX CONCURRENTLY/ or commandREINDEX CONCURRENTLY/
+  command. Indexes of toast relations can be rebuilt with commandREINDEX
+  CONCURRENTLY/.
  /para
 /listitem
 
@@ -139,6 +142,21 @@ REINDEX { INDEX | TABLE | DATABASE | SYSTEM } replaceable class=PARAMETERnam
/varlistentry
 
varlistentry
+termliteralCONCURRENTLY/literal/term
+listitem
+ para
+  When this option is used, productnamePostgreSQL/ will rebuild the
+  index without taking any locks that prevent concurrent inserts,
+  updates, or deletes on the table; whereas a standard reindex build
+  locks out writes (but not reads) on the table until it's done.
+  There are several caveats to be aware of when using this option
+  mdash; see xref linkend=SQL-REINDEX-CONCURRENTLY
+  endterm=SQL-REINDEX-CONCURRENTLY-title.
+ /para
+/listitem
+   /varlistentry
+
+   varlistentry
 termliteralFORCE/literal/term
 listitem
  para
@@ -218,6 +236,194 @@ REINDEX { INDEX | TABLE | DATABASE | SYSTEM } replaceable class=PARAMETERnam
reindex anything.
   /para
 
+  refsect2 id=SQL-REINDEX-CONCURRENTLY
+   title id=SQL-REINDEX-CONCURRENTLY-titleRebuilding Indexes Concurrently/title
+
+   indexterm zone=SQL-REINDEX-CONCURRENTLY
+   primaryindex/primary
+   secondaryrebuilding concurrently/secondary
+   /indexterm
+
+   para
+Rebuilding an index can interfere with regular operation of a database.
+Normally productnamePostgreSQL/ locks the table whose index is rebuilt
+against writes and performs the entire index build with a single scan of the
+table. Other transactions can still read the table, but if they try to
+  

Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-10 Thread Amit Kapila
On Tue, Nov 11, 2014 at 9:42 AM, Haribabu Kommi kommi.harib...@gmail.com
wrote:

 On Tue, Nov 11, 2014 at 2:35 PM, Amit Kapila amit.kapil...@gmail.com
wrote:
  On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi 
kommi.harib...@gmail.com
  wrote:
 
  On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund and...@2ndquadrant.com

  wrote:
   On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
   Does parallelism help at all?
  
   I'm pretty damn sure. We can't even make a mildly powerfull storage
   fully busy right now. Heck, I can't make my workstation's storage
with a
   raid 10 out of four spinning disks fully busy.
  
   I think some of that benefit also could be reaped by being better at
   hinting the OS...
 
  Yes, it definitely helps but not only limited to IO bound operations.
  It gives a good gain for the queries having CPU intensive where
  conditions.
 
  One more point we may need to consider, is there any overhead in
passing
  the data row from workers to backend?
 
  I am not sure if that overhead will be too much visible if we improve
the
  use of I/O subsystem by making parallel tasks working on it.

 I feel there may be an overhead because of workers needs to put the result
 data in the shared memory and the backend has to read from there to
process
 it further. If the cost of transfering data from worker to backend is
more than
 fetching a tuple from the scan, then the overhead is visible when the
 selectivity is more.

  However
  another idea here could be that instead of passing tuple data, we just
  pass tuple id, but in that case we have to retain the pin on the buffer
  that contains tuple untill master backend reads from it that might have
  it's own kind of problems.

 Transfering tuple id doesn't solve the scenarios if the node needs any
 projection.

Hmm, that's why I said we need to retain the buffer pin, so that we can
get the tuple data.
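
A rough sketch of what the tuple-id approach might look like on the master
side, assuming the worker still holds a pin on the page so the line pointer
cannot be pruned away before the master reads it; fetch_tuple_by_tid is a
made-up helper, and note that copying the tuple out still costs one copy:

#include "postgres.h"
#include "access/heapam.h"
#include "access/htup_details.h"
#include "storage/bufmgr.h"

/*
 * Master side: fetch a tuple by the TID a worker handed over.  Returns a
 * palloc'd copy, or NULL if the tuple is not visible to our snapshot.
 */
static HeapTuple
fetch_tuple_by_tid(Relation rel, Snapshot snapshot, ItemPointer tid)
{
	HeapTupleData tuple;
	Buffer		buffer;
	HeapTuple	result = NULL;

	tuple.t_self = *tid;
	if (heap_fetch(rel, snapshot, &tuple, &buffer, false, NULL))
	{
		/* Copy the tuple out while we hold our own pin on the page. */
		result = heap_copytuple(&tuple);
		ReleaseBuffer(buffer);
	}
	return result;
}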


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [v9.5] Custom Plan API

2014-11-10 Thread Amit Kapila
On Mon, Nov 10, 2014 at 6:33 PM, Robert Haas robertmh...@gmail.com wrote:

 On Mon, Nov 10, 2014 at 6:55 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
  I thought that in general if user has the API to register the custom
path
  methods, it should have some way to unregister them and also user might
  need to register some different custom path methods after unregistering
  the previous one's.  I think we should see what Robert or others have to
  say about this point before trying to provide such an API.

 I wouldn't bother.  As KaiGai says, if you want to shut the
 functionality off, the provider itself can provide a GUC.  Also, we
 really have made no effort to ensure that loadable modules can be
 safely unloaded, or hooked functions safely-unhooked.
 ExecutorRun_hook is a good example.  Typical of hook installation is
 this:

 prev_ExecutorRun = ExecutorRun_hook;
 ExecutorRun_hook = pgss_ExecutorRun;


In this case, the extension takes care of registering and unregistering
the hook.  In _PG_init() it registers the hook, and in _PG_fini() it
unregisters it.  So if core PG provides an API to register the custom
scan methods, shouldn't it also provide an API to unregister them?
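
For context, a minimal sketch of the save/restore pattern being discussed,
extending the ExecutorRun_hook snippet quoted above with a matching
_PG_fini() (hook signature as of this release series; my_ExecutorRun is a
made-up name):

#include "postgres.h"
#include "fmgr.h"
#include "executor/executor.h"

PG_MODULE_MAGIC;

static ExecutorRun_hook_type prev_ExecutorRun = NULL;

static void
my_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, long count)
{
	/* ... do whatever the module is interested in here ... */
	if (prev_ExecutorRun)
		prev_ExecutorRun(queryDesc, direction, count);
	else
		standard_ExecutorRun(queryDesc, direction, count);
}

void
_PG_init(void)
{
	/* Register: remember whoever was installed before us. */
	prev_ExecutorRun = ExecutorRun_hook;
	ExecutorRun_hook = my_ExecutorRun;
}

void
_PG_fini(void)
{
	/* Unregister: only safe if nobody hooked in after us. */
	ExecutorRun_hook = prev_ExecutorRun;
}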


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [GSoC2014] Patch ALTER TABLE ... SET LOGGED

2014-11-10 Thread Michael Paquier
Thanks for the updated patch, Fabrizio.

On Tue, Nov 11, 2014 at 7:44 AM, Fabrízio de Royes Mello 
fabriziome...@gmail.com wrote:

  A couple of minor comments about this patch:
  1) Reading it, I am wondering if it would not be finally time to
  switch to a macro to get a relation's persistence, something like
  RelationGetPersistence in rel.h... Not related directly to this patch.

 Good idea... I'll provide a patch soon.

I created a thread dedicated to that:
http://www.postgresql.org/message-id/cab7npqseb+hwitxfwkqypa7+9bjolnxio47qsro3hcbsoq0...@mail.gmail.com
Few people have cared enough to comment so far, though :)

 2) reindex_index has as new argument a relpersislence value for the
  new index. reindex_relation has differently a new set of flags to
  enforce the relpersistence of all the underling indexes. Wouldn't it
  be better for API consistency to pass directly a relpersistence value
  through reindex_relation? In any case, the comment block of
  reindex_relation does not contain a description of the new flags.

 I did it because the ReindexDatabase build a list of oids to run
 reindex_relation for each item of the list. I can change the list to store
 the relpersistence also, but this can lead us to a concurrency trouble
 because if one session execute REINDEX DATABASE and other session run
 ALTER TABLE name SET {LOGGED|UNLOGGED} during the building of the list
 the reindex can lead us to a inconsistent state.

Fair point. I forgot this code path.
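
Purely as a usage illustration of the flag-based API kept here, a
hypothetical ALTER TABLE ... SET {LOGGED|UNLOGGED} call site (flag names
taken from the attached patch; the function and its caller are invented for
this sketch):

#include "postgres.h"
#include "catalog/index.h"
#include "catalog/pg_class.h"

/* Hypothetical call site in the ALTER TABLE ... SET {LOGGED|UNLOGGED} path. */
static void
rebuild_indexes_with_persistence(Oid tableOid, char newpersistence)
{
	int			flags = 0;

	/* Pick the forcing flag matching the table's new persistence. */
	if (newpersistence == RELPERSISTENCE_UNLOGGED)
		flags |= REINDEX_REL_FORCE_UNLOGGED;
	else
		flags |= REINDEX_REL_FORCE_PERMANENT;

	reindex_relation(tableOid, flags);
}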


Except for one typo, s/rebuilded/rebuilt/, corrected in the attached patch
with some extra comments bundled, it seems to me that this patch can be
passed to a committer, so I am marking it so.

Btw, perhaps this diff should be pushed as a separate patch, as it is a
rather different thing:
-   if (heapRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+   if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
        !smgrexists(indexRelation->rd_smgr, INIT_FORKNUM)
As this is a correctness fix, it does not seem necessary to back-patch it:
the parent relation always has the same persistence as its indexes.
Regards,
-- 
Michael
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 912038a..a0a81e8 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1980,7 +1980,7 @@ index_build(Relation heapRelation,
 	 * created it, or truncated twice in a subsequent transaction, the
 	 * relfilenode won't change, and nothing needs to be done here.
 	 */
-	if (heapRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+	if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
 		!smgrexists(indexRelation->rd_smgr, INIT_FORKNUM))
 	{
 		RegProcedure ambuildempty = indexRelation->rd_am->ambuildempty;
@@ -3130,7 +3130,7 @@ IndexGetRelation(Oid indexId, bool missing_ok)
  * reindex_index - This routine is used to recreate a single index
  */
 void
-reindex_index(Oid indexId, bool skip_constraint_checks)
+reindex_index(Oid indexId, bool skip_constraint_checks, char relpersistence)
 {
 	Relation	iRel,
 heapRelation;
@@ -3191,6 +3191,9 @@ reindex_index(Oid indexId, bool skip_constraint_checks)
 			indexInfo-ii_ExclusionStrats = NULL;
 		}
 
+		/* Set the relpersistence of the new index */
+		iRel->rd_rel->relpersistence = relpersistence;
+
 		/* We'll build a new physical relation for the index */
 		RelationSetNewRelfilenode(iRel, InvalidTransactionId,
   InvalidMultiXactId);
@@ -3310,6 +3313,12 @@ reindex_index(Oid indexId, bool skip_constraint_checks)
  * performance, other callers should include the flag only after transforming
  * the data in a manner that risks a change in constraint validity.
  *
+ * REINDEX_REL_FORCE_UNLOGGED: if true, set the persistence of the rebuilt
+ * indexes to unlogged.
+ *
+ * REINDEX_REL_FORCE_LOGGED: if true, set the persistence of the rebuilt
+ * indexes to permanent.
+ *
  * Returns true if any indexes were rebuilt (including toast table's index
  * when relevant).  Note that a CommandCounterIncrement will occur after each
  * index rebuild.
@@ -3389,11 +3398,19 @@ reindex_relation(Oid relid, int flags)
 		foreach(indexId, indexIds)
 		{
 			Oid			indexOid = lfirst_oid(indexId);
+			char		relpersistence = rel->rd_rel->relpersistence;
 
 			if (is_pg_class)
 				RelationSetIndexList(rel, doneIndexes, InvalidOid);
 
-			reindex_index(indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS));
+			/* Check for flags to enforce UNLOGGED or PERMANENT persistence */
+			if (flags & REINDEX_REL_FORCE_UNLOGGED)
+				relpersistence = RELPERSISTENCE_UNLOGGED;
+			else if (flags & REINDEX_REL_FORCE_PERMANENT)
+				relpersistence = RELPERSISTENCE_PERMANENT;
+
+			reindex_index(indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
+						  relpersistence);
 
 			CommandCounterIncrement();
 
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 6a578ec..e067abc 100644
--- a/src/backend/commands/cluster.c
+++ 

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-10 Thread Peter Geoghegan
On Mon, Nov 10, 2014 at 8:13 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Hm. The last version of this patch has not really changed since my
 first review, and I have no more feedback to provide about it except what I
 already mentioned. I honestly don't think that this patch is ready for
 committer as-is... If someone wants to review it further, well extra
 opinions I am sure are welcome.

Why not?

You've already said that you're happy to defer to whatever committer
picks this up with regard to whether or not more than a single
suggestion can come from an RTE. I agreed with this (i.e. I said I'd
defer to their opinion too), and once again drew particular attention
to this state of affairs alongside my most recent revision.

What does that leave?


-- 
Peter Geoghegan




[HACKERS] Missing line for postgresql.auto.conf?

2014-11-10 Thread Josh Berkus
Hackers,

I thought 9.4's postgresql.conf.sample was supposed to have a
commented-out line for postgresql.auto.conf? In the includes section?

It's not there.   If we don't give folks a commented-out line to
uncomment, it'll be pretty hard for them to figure out auto.conf ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-10 Thread Peter Geoghegan
On Mon, Nov 10, 2014 at 10:29 PM, Peter Geoghegan p...@heroku.com wrote:
 Why not?

 You've already said that you're happy to defer to whatever committer
 picks this up with regard to whether or not more than a single
 suggestion can come from an RTE. I agreed with this (i.e. I said I'd
 defer to their opinion too), and once again drew particular attention
 to this state of affairs alongside my most recent revision.

 What does that leave?

I see you've marked this Needs Review, even though you previously
marked it Ready for Committer a few months back (Robert marked it
Waiting on Author very recently because of the compiler warning, and
then I marked it back to Ready for Committer once that was
addressed, before you finally marked it back to Needs Review and
removed yourself as the reviewer just now).

I'm pretty puzzled by this. Other than our "agree to disagree and
defer to committer" position on the question of whether or not more
than one suggestion can come from a single RTE, which you were fine
with before [1], I have only restored the core/contrib separation to the
state recently suggested by Robert as the best and simplest all around
[2].

Did I miss something else?

[1] 
http://www.postgresql.org/message-id/CAB7nPqQObEeQ298F0Rb5+vrgex5_r=j-bvqzgp0qa1y_xdc...@mail.gmail.com
[2] 
http://www.postgresql.org/message-id/ca+tgmoykiiq8mc0uj5i5xfktybg1qqfn4yrckz60ydunumk...@mail.gmail.com
-- 
Peter Geoghegan




Re: [HACKERS] Missing line for postgresql.auto.conf?

2014-11-10 Thread Josh Berkus
On 11/10/2014 10:48 PM, Josh Berkus wrote:
 Hackers,
 
 I thought 9.4's postgresql.conf.sample was supposed to have a
 commented-out line for postgresql.auto.conf? In the includes section?
 
 It's not there.   If we don't give folks a commented-out line to
 uncomment, it'll be pretty hard for them to figure out auto.conf ...

Never mind.  In all of the changes, I lost track of the fact that we
dropped this and aren't requiring an include line.  I was comparing it
with an earlier snapshot ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] WAL format and API changes (9.5)

2014-11-10 Thread Michael Paquier
On Tue, Nov 11, 2014 at 4:29 AM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 Attached is a new version. It's rebased on current git master, including
 BRIN. I've also fixed the laundry list of small things you reported, as
 well as a bunch of bugs I uncovered during my own testing.

This patch needs a small rebase; it has been broken by a590f266, which fixed
WAL replay for BRIN indexes:
patching file src/backend/access/brin/brin_xlog.c
Hunk #2 FAILED at 42.
Hunk #3 FAILED at 91.
This will facilitate testing as well.
Regards
-- 
Michael


Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-11-10 Thread Kouhei Kaigai



 -Original Message-
 From: pgsql-hackers-ow...@postgresql.org
 [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Haribabu Kommi
 Sent: Tuesday, November 11, 2014 1:13 PM
 To: Amit Kapila
 Cc: Andres Freund; Robert Haas; Simon Riggs; Tom Lane;
 pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] using custom scan nodes to prototype parallel
 sequential scan
 
 On Tue, Nov 11, 2014 at 2:35 PM, Amit Kapila amit.kapil...@gmail.com
 wrote:
  On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi
  kommi.harib...@gmail.com
  wrote:
 
  On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund
  and...@2ndquadrant.com
  wrote:
   On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
   Does parallelism help at all?
  
   I'm pretty damn sure. We can't even make a mildly powerfull storage
   fully busy right now. Heck, I can't make my workstation's storage
   with a raid 10 out of four spinning disks fully busy.
  
   I think some of that benefit also could be reaped by being better
   at hinting the OS...
 
  Yes, it definitely helps but not only limited to IO bound operations.
  It gives a good gain for the queries having CPU intensive where
  conditions.
 
  One more point we may need to consider, is there any overhead in
  passing the data row from workers to backend?
 
  I am not sure if that overhead will be too much visible if we improve
  the use of I/O subsystem by making parallel tasks working on it.
 
 I feel there may be an overhead because of workers needs to put the result
 data in the shared memory and the backend has to read from there to process
 it further. If the cost of transfering data from worker to backend is more
 than fetching a tuple from the scan, then the overhead is visible when the
 selectivity is more.
 
In my experience, data copy and transformation to fit a TupleTableSlot is the
biggest overhead, rather than the scan or join itself...
Probably, a straightforward way is to construct an array of values/isnull
on a shared memory segment; the backend process then just switches the
tts_values/tts_isnull pointers, with no data copy. It gave us a performance gain.
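
A very rough sketch of that pointer switch, assuming the worker has filled
Datum/isnull arrays living in a shared memory segment mapped by the backend;
store_shared_row is a made-up helper, and ownership of the slot's original
arrays plus pass-by-reference datums would need real handling:

#include "postgres.h"
#include "executor/tuptable.h"

/*
 * Backend side: expose a worker-produced row without copying the datums.
 * Note that the slot's own arrays are abandoned here, and pass-by-reference
 * datums must themselves be readable by the backend at these addresses.
 */
static TupleTableSlot *
store_shared_row(TupleTableSlot *slot, Datum *shared_values, bool *shared_isnull)
{
	ExecClearTuple(slot);

	/* Point the slot at the arrays in shared memory instead of copying. */
	slot->tts_values = shared_values;
	slot->tts_isnull = shared_isnull;

	return ExecStoreVirtualTuple(slot);
}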

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com
