Re: [HACKERS] asynchronous and vectorized execution

2016-05-09 Thread Kouhei Kaigai
> -----Original Message-----
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of David Rowley
> Sent: Tuesday, May 10, 2016 2:01 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Robert Haas; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] asynchronous and vectorized execution
> 
> On 10 May 2016 at 13:38, Kouhei Kaigai  wrote:
> > My concern about ExecProcNode is that it is constructed with a large switch
> > ... case statement, which involves tons of comparison operations at run-time.
> > If we replaced this switch ... case with function pointers, it would probably
> > improve performance, especially for OLAP workloads that process large
> > amounts of rows.
> 
> I imagined that any decent compiler would have built the code to use
> jump tables for this. I have to say that I've never checked to make
> sure though.
>
Ah, indeed, you are right. Please forget the above part.

In GCC 4.8.5, the case labels between T_ResultState and T_LimitState were
handled using a jump table.

TupleTableSlot *
ExecProcNode(PlanState *node)
{
:
  
:
switch (nodeTag(node))
  5ad361:   8b 03   mov(%rbx),%eax
  5ad363:   2d c9 00 00 00  sub$0xc9,%eax
  5ad368:   83 f8 24cmp$0x24,%eax
  5ad36b:   0f 87 4f 02 00 00   ja 5ad5c0 
  5ad371:   ff 24 c5 68 48 8b 00jmpq   *0x8b4868(,%rax,8)
  5ad378:   0f 1f 84 00 00 00 00nopl   0x0(%rax,%rax,1)
  5ad37f:   00

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



Re: [HACKERS] asynchronous and vectorized execution

2016-05-09 Thread David Rowley
On 10 May 2016 at 13:38, Kouhei Kaigai  wrote:
> My concern about ExecProcNode is that it is constructed with a large switch
> ... case statement, which involves tons of comparison operations at run-time.
> If we replaced this switch ... case with function pointers, it would probably
> improve performance, especially for OLAP workloads that process large
> amounts of rows.

I imagined that any decent compiler would have built the code to use
jump tables for this. I have to say that I've never checked to make
sure though.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] asynchronous and vectorized execution

2016-05-09 Thread Greg Stark
On 9 May 2016 8:34 pm, "David Rowley"  wrote:
>
> This project does appear to require that we bloat the code with 100's
> of vector versions of each function. I'm not quite sure if there's a
> better way to handle this. The problem is that the fmgr is pretty much
> a barrier to SIMD operations, and this was the only idea that I've had
> so far about breaking through that barrier. So further ideas here are
> very welcome.

Well yes and no. In practice I think you only need to worry about
vectorised versions of integer and possibly float. For other data types
there either aren't vectorised operators or there's little using them.

And I'll make a bold claim here that the only operators I think really
matter are =

The reason is that using SIMD instructions is a minor win if you have any
further work to do per tuple. The only time it's a big win is if you're
eliminating entire tuples from consideration efficiently. = is going to do
that often; other btree operator classes might be somewhat useful, but
things like + would really only come up in odd examples.

But even that understates things. If you have column oriented storage then
= becomes even more important since every scan has a series of implied
equijoins to reconstruct the tuple. And the coup de grace is that in a
column oriented storage you try to store variable length data as integer
indexes into a dictionary of common values so *everything* is an integer =
operation.
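
To make that concrete, a minimal standalone sketch (invented names, not code
from any patch): once variable-length values are stored as int32 dictionary
codes, the equality predicate becomes a flat integer comparison over a dense
array, which compilers can auto-vectorize into SIMD compares.

#include <stddef.h>
#include <stdint.h>

/*
 * Mark the rows whose dictionary code equals target_code and return the
 * number of matches.  The branch-free body keeps the loop amenable to
 * auto-vectorization.
 */
static size_t
filter_eq_dictionary(const int32_t *codes, size_t nrows,
                     int32_t target_code, uint8_t *matches)
{
    size_t      nmatch = 0;

    for (size_t i = 0; i < nrows; i++)
    {
        uint8_t     hit = (codes[i] == target_code);

        matches[i] = hit;
        nmatch += hit;
    }
    return nmatch;
}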

How to do this without punching right through the executor as an
abstraction and still supporting extensible data types and operators was
puzzling me already. I do think it involves having these vector operators
in the catalogue and also some kind of compression mapping to integer
indexes. But I'm not sure that's all that would be needed.


Re: [HACKERS] between not propagated into a simple equality join

2016-05-09 Thread David G. Johnston
On Mon, May 9, 2016 at 8:53 AM, Benedikt Grundmann <
bgrundm...@janestreet.com> wrote:

> We just ran into a very simple query that the planner does much worse on
> than we thought it would (in production the table in question is ~ 100
> GB).  It surprised us given the planner is generally quite good, so I
> thought I'd share our surprise.
>
> Setup:
>
> postgres_prod@proddb_testing=# select version();[1]
> version
>
>
> 
>  PostgreSQL 9.2.16 on x86_64-unknown-linux-gnu, compiled by gcc (GCC)
> 4.4.7 20120313 (Red Hat 4.4.7-16), 64-bit
> (1 row)
>
> Time: 69.246 ms
>
> postgres_prod@proddb_testing=# create table toy_data3 (the_date date, i
> int);
> CREATE TABLE
> Time: 67.096 ms
> postgres_prod@proddb_testing=# insert into toy_data3
>
>   (select current_date-(s.idx/1000), s.idx from generate_series(1,1000000)
> as s(idx));
> INSERT 0 1000000
> Time: 1617.483 ms
> postgres_prod@proddb_testing=# create index toy_data_date3 on
> toy_data3(the_date);
> CREATE INDEX
> Time: 660.166 ms
> postgres_prod@proddb_testing=# analyze toy_data3;
> ANALYZE
> Time: 294.984 ms
>
> The bad behavior:
>
> postgres_prod@proddb_testing=# explain analyze
>   select * from (
>select td1.the_date, td1.i
> from toy_data3 td1, toy_data3 td2  where td1.the_date = td2.the_date
> and td1.i = td2.i
>   ) foo
>   where the_date between current_date and current_date;
>QUERY
> PLAN
>
> ───
>  Hash Join  (cost=55.49..21980.50 rows=1 width=8) (actual
> time=0.336..179.374 rows=999 loops=1)
>Hash Cond: ((td2.the_date = td1.the_date) AND (td2.i = td1.i))
>->  Seq Scan on toy_data3 td2  (cost=0.00..14425.00 rows=1000000
> width=8) (actual time=0.007..72.510 rows=1000000 loops=1)
>->  Hash  (cost=40.44..40.44 rows=1003 width=8) (actual
> time=0.321..0.321 rows=999 loops=1)
>  Buckets: 1024  Batches: 1  Memory Usage: 40kB
>  ->  Index Scan using toy_data_date3 on toy_data3 td1
>  (cost=0.01..40.44 rows=1003 width=8) (actual time=0.018.
>Index Cond: ((the_date >= ('now'::cstring)::date) AND
> (the_date <= ('now'::cstring)::date))
>  Total runtime: 179.440 ms
> (8 rows)
>
> Time: 246.094 ms
>
> Notice that td2 gets a full sequential scan.  Which is sad, because one
> would like the planner to realize that it could propagate the index
> constraint onto td2, i.e. use the index scan on the_date on both sides of
> the join.
>
>
FWIW

This is my plan result:
version
PostgreSQL 9.5.2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
4.8.2-19ubuntu1) 4.8.2, 64-bit
All default settings

using "BETWEEN"

 QUERY PLAN
Nested Loop  (cost=0.86..48.91 rows=1 width=8) (actual time=0.042..168.512
rows=999 loops=1)
  ->  Index Scan using toy_data_date3 on toy_data3 td1  (cost=0.43..8.46
rows=1 width=8) (actual time=0.022..1.388 rows=999 loops=1)
Index Cond: ((the_date >= ('now'::cstring)::date) AND (the_date <=
('now'::cstring)::date))
  ->  Index Scan using toy_data_date3 on toy_data3 td2  (cost=0.42..40.44
rows=1 width=8) (actual time=0.078..0.160 rows=1 loops=999)
Index Cond: (the_date = td1.the_date)
Filter: (td1.i = i)
Rows Removed by Filter: 998
Planning time: 0.353 ms
Execution time: 169.692 ms

using "="

QUERY PLAN
Hash Join  (cost=49.89..90.46 rows=1 width=8) (actual time=2.320..5.652
rows=999 loops=1)
  Hash Cond: (td1.i = td2.i)
  ->  Index Scan using toy_data_date3 on toy_data3 td1  (cost=0.43..37.37
rows=967 width=8) (actual time=0.014..1.168 rows=999 loops=1)
Index Cond: (the_date = ('now'::cstring)::date)
  ->  Hash  (cost=37.37..37.37 rows=967 width=8) (actual time=2.292..2.292
rows=999 loops=1)
Buckets: 1024  Batches: 1  Memory Usage: 48kB
->  Index Scan using toy_data_date3 on toy_data3 td2
 (cost=0.43..37.37 rows=967 width=8) (actual time=0.008..1.183 rows=999
loops=1)
  Index Cond: (the_date = ('now'::cstring)::date)
Planning time: 0.326 ms
Execution time: 6.673 ms

I was hoping to be able to say more but alas cannot find the words.

I'm surprised by the estimate of 1 row for the td1 index scan in my 9.5
query - and also by why the 9.2 query would choose the sequential-scan hash
join over what seems to be a superior index-scan nested loop on a
fraction of the table.

The fact that the BETWEEN doesn't get transitively applied to td2 through
the td1 = td2 condition surprises me less, though whether the limitation is
due to theory or implementation I do not know.

I do suspect that at least part of the issue is that the computation of
"the_date" makes the two columns highly correlated while the planner
assumes independence.
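
As a rough illustration of that effect (a simplified textbook 1/ndistinct
selectivity model, not the planner's exact eqjoinsel formula, with row counts
assumed from the toy_data3 example):

#include <stdio.h>

int
main(void)
{
    double      td1_rows = 1000.0;      /* td1 after the BETWEEN filter (assumed) */
    double      td2_rows = 1000000.0;   /* td2, unfiltered (assumed) */
    double      n_dates = 1000.0;       /* distinct the_date values (assumed) */
    double      n_is = 1000000.0;       /* distinct i values (assumed) */

    /* join clauses: td1.the_date = td2.the_date AND td1.i = td2.i */

    /* independence: multiply the per-clause selectivities (1/ndistinct) */
    double      est_independent = td1_rows * td2_rows * (1.0 / n_dates) * (1.0 / n_is);

    /* here i determines the_date, so the date clause adds no extra filtering */
    double      est_correlated = td1_rows * td2_rows * (1.0 / n_is);

    printf("independence assumption:    %.0f row(s)\n", est_independent);
    printf("accounting for correlation: %.0f row(s)\n", est_correlated);
    return 0;
}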

David J.


[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618

2016-05-09 Thread Michael Paquier
On Mon, May 9, 2016 at 4:17 PM, Michael Paquier
 wrote:
> On Tue, Mar 22, 2016 at 1:56 PM, Amit Kapila  wrote:
>> So as far as I can see there are two ways to resolve this issue: one is to
>> retry generation of the dsm name if CreateFileMapping returns EACCES, and the
>> second is to append the data_dir name to the dsm name, as is done for the main
>> shared memory, which will avoid the error.  The first approach has the minor
>> flaw that if CreateFileMapping returns EACCES for a reason other than a
>> duplicate dsm name (which I am not sure is possible to identify), then we
>> should report an error instead of trying to regenerate the name.
>>
>> Robert and or others, can you share your opinion on what is the best way to
>> proceed for this issue.
>
> For my 2c here, the approach using GetSharedMemName to identify the
> origin of a dynamic shared memory segment looks more solid in terms of
> robustness and collision handling. Retrying a segment is never going
> to be completely water-proof.

So, I have been hacking on that a bit more and finished with the
attached, which seems to address the issue here. Some of the code paths
of dsm_impl.c are written in such a way that a dsm allocation can fail
and we still continue to move on. I have taken that into account by using
palloc_extended with NO_OOM. palloc instead of malloc is a good fit
anyway, to prevent leaks in error code paths.

Thoughts?
-- 
Michael
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index 173b982..060d0cb 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -62,6 +62,7 @@
 #include 
 #endif
 
+#include "miscadmin.h"
 #include "portability/mem.h"
 #include "storage/dsm_impl.h"
 #include "storage/fd.h"
@@ -80,6 +81,7 @@ static bool dsm_impl_sysv(dsm_op op, dsm_handle handle, Size request_size,
 			  Size *mapped_size, int elevel);
 #endif
 #ifdef USE_DSM_WINDOWS
+static char *get_segment_name(dsm_handle handle);
 static bool dsm_impl_windows(dsm_op op, dsm_handle handle, Size request_size,
  void **impl_private, void **mapped_address,
  Size *mapped_size, int elevel);
@@ -590,6 +592,36 @@ dsm_impl_sysv(dsm_op op, dsm_handle handle, Size request_size,
 
 #ifdef USE_DSM_WINDOWS
 /*
+ * Generate shared memory segment name, using the data directory to generate
+ * a unique identifier. This segment is stored in Global\ namespace to allow
+ * any other process to have an access to it.
+ */
+static char *
+get_segment_name(dsm_handle handle)
+{
+	char	   *res;
+
+	/* let enough place for prefix and handle */
+	res = palloc_extended(strlen(DataDir) + 64, MCXT_ALLOC_NO_OOM);
+
+	if (res == NULL)
+		return NULL;
+
+	/*
+	 * Storing the shared memory segment in the Global\ namespace, can allow
+	 * any process running in any session to access that file mapping object
+	 * provided that the caller has the required access rights. But to avoid
+	 * issues faced in main shared memory, we are using the naming convention
+	 * similar to main shared memory. We can change here once issue mentioned
+	 * in GetSharedMemName is resolved.
+	 */
+	snprintf(res, strlen(DataDir) + 64, "%s.%s.%u",
+			 SEGMENT_NAME_PREFIX, DataDir, handle);
+
+	return res;
+}
+
+/*
  * Operating system primitives to support Windows shared memory.
  *
  * Windows shared memory implementation is done using file mapping
@@ -609,7 +641,7 @@ dsm_impl_windows(dsm_op op, dsm_handle handle, Size request_size,
 {
 	char	   *address;
 	HANDLE		hmap;
-	char		name[64];
+	char	   *name;
 	MEMORY_BASIC_INFORMATION info;
 
 	/* Resize is not supported for Windows shared memory. */
@@ -623,15 +655,13 @@ dsm_impl_windows(dsm_op op, dsm_handle handle, Size request_size,
 	if (op == DSM_OP_ATTACH && *mapped_address != NULL)
 		return true;
 
-	/*
-	 * Storing the shared memory segment in the Global\ namespace, can allow
-	 * any process running in any session to access that file mapping object
-	 * provided that the caller has the required access rights. But to avoid
-	 * issues faced in main shared memory, we are using the naming convention
-	 * similar to main shared memory. We can change here once issue mentioned
-	 * in GetSharedMemName is resolved.
-	 */
-	snprintf(name, 64, "%s.%u", SEGMENT_NAME_PREFIX, handle);
+	name = get_segment_name(handle);
+
+	if (name == NULL)
+	{
+		elog(elevel, "could not allocate memory for shared memory segment name");
+		return false;
+	}
 
 	/*
 	 * Handle teardown cases.  Since Windows automatically destroys the object
@@ -647,6 +677,7 @@ dsm_impl_windows(dsm_op op, dsm_handle handle, Size request_size,
 	(errcode_for_dynamic_shared_memory(),
    errmsg("could not unmap shared memory segment \"%s\": %m",
 		  name)));
+			pfree(name);
 			return false;
 		}
 		if (*impl_private != NULL
@@ -657,12 +688,14 @@ dsm_impl_windows(dsm_op op, dsm_handle handle, Size request_size,
 	(errcode_for_dynamic_shared_memory(),
   errmsg("could not remove 

Re: [HACKERS] Stopping logical replication protocol

2016-05-09 Thread Craig Ringer
On 10 May 2016 at 09:50, Craig Ringer  wrote:


> I outlined how I think WalSndWaitForWal() should be handled earlier. Test 
> streamingDoneReceiving
> and streamingDoneSending in logical_read_xlog_page and return -1.
>

OK, so thinking about this some more, I see why you've added the callback
within the reorder buffer code. You want to stop processing of a
transaction after we've decoded the commit and are looping over the changes
within ReorderBufferCommit(...), which doesn't know anything about the
walsender. So we could loop for a long time within WalSndLoop ->
XLogSendLogical -> LogicalDecodingProcessRecord if the record is a commit,
as we process each change and send it to the client.

So aborting processing and returning between individual changes in
ReorderBufferCommit(...) seems very reasonable. I agree that some kind of
callback is needed because the walsender uses static globals to control its
send/receive stop logic. I don't like the naming "is_active"; maybe reverse
the sense and call it "stop_decoding_cb"? Or "continue_decoding_cb"?
Unsure.
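
To illustrate the shape being discussed (all names here are hypothetical, not
existing reorderbuffer code): the check sits between changes, so decoding only
ever stops on a change boundary.

#include <stdbool.h>
#include <stddef.h>

typedef bool (*ContinueDecodingCB) (void *cb_private);

typedef struct DecodingHooks
{
    ContinueDecodingCB continue_decoding_cb;    /* may be NULL */
    void       *cb_private;
} DecodingHooks;

/* Sketch of a per-commit change loop honouring the callback. */
static bool
replay_commit_changes(DecodingHooks *hooks, int nchanges)
{
    for (int i = 0; i < nchanges; i++)
    {
        if (hooks->continue_decoding_cb != NULL &&
            !hooks->continue_decoding_cb(hooks->cb_private))
            return false;       /* stop between changes, never mid-change */

        /* ... decode and send change i here ... */
    }
    return true;                /* processed the whole transaction */
}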


Anyway, here's a draft patch that does the parts other than the reorder
buffer processing stop. It passes 'make check' and the src/test/recovery
tests, but I haven't written anything to verify the client-initiated abort
handling. You have test code for this and I'd be interested to see the
results.

This patch doesn't attempt to allow decoding to be aborted during
processing of an xlog record, including a commit. So it'll still attempt to
send whole transactions. But it should allow clients to send CopyDone when
we're waiting for new WAL to be generated, and then return to command mode.

I think it's worth making the next step, where you allow reorder buffer
commit processing to be interrupted, into a separate patch on top of this
one. They're two separate changes IMO.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 14c0ac1aacfacb3b6b42bf0c5e453ab4e989aa2c Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Tue, 10 May 2016 10:34:10 +0800
Subject: [PATCH] Respect client-initiated CopyDone in walsender

---
 src/backend/replication/walsender.c | 32 +++-
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 926a247..4e2164e 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -759,6 +759,13 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	/* make sure we have enough WAL available */
 	flushptr = WalSndWaitForWal(targetPagePtr + reqLen);
 
+	/*
+	 * If the client sent CopyDone while we were waiting,
+	 * bail out so we can wind up the decoding session.
+	 */
+	if (streamingDoneSending)
+		return -1;
+
 	/* more than one block available */
 	if (targetPagePtr + XLOG_BLCKSZ <= flushptr)
 		count = XLOG_BLCKSZ;
@@ -1220,8 +1227,11 @@ WalSndWaitForWal(XLogRecPtr loc)
 		 * It's important to do this check after the recomputation of
 		 * RecentFlushPtr, so we can send all remaining data before shutting
 		 * down.
+		 *
+		 * We'll also exit here if the client sent CopyDone because it wants
+		 * to return to command mode.
 		 */
-		if (walsender_ready_to_stop)
+		if (walsender_ready_to_stop || streamingDoneReceiving)
 			break;
 
 		/*
@@ -1789,7 +1799,14 @@ WalSndCheckTimeOut(TimestampTz now)
 	}
 }
 
-/* Main loop of walsender process that streams the WAL over Copy messages. */
+/*
+ * Main loop of walsender process that streams the WAL over Copy messages.
+ *
+ * The send_data callback must enqueue complete CopyData messages to libpq
+ * using pq_putmessage_noblock or similar, since the walsender loop may send
+ * CopyDone then exit and return to command mode in response to a client
+ * CopyDone between calls to send_data.
+ */
 static void
 WalSndLoop(WalSndSendDataCallback send_data)
 {
@@ -1852,10 +1869,15 @@ WalSndLoop(WalSndSendDataCallback send_data)
 		 * some more.  If there is some, we don't bother to call send_data
 		 * again until we've flushed it ... but we'd better assume we are not
 		 * caught up.
+		 *
+		 * If we're trying to finish sending and exit we shouldn't enqueue more
+		 * data to libpq. We need to finish writing out whatever we already
+		 * have in libpq's send buffer to maintain protocol sync so we still
+		 * need to loop until it's flushed.
 		 */
-		if (!pq_is_send_pending())
+		if (!pq_is_send_pending() && !streamingDoneSending)
 			send_data();
-		else
+		else if (!streamingDoneSending)
 			WalSndCaughtUp = false;
 
 		/* Try to flush pending output to the client */
@@ -2911,7 +2933,7 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
 	if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
 		return;
 
-	if (waiting_for_ping_response)
+	if (waiting_for_ping_response || streamingDoneSending)
 		return;
 
 	/*

Re: [HACKERS] Stopping logical replication protocol

2016-05-09 Thread Craig Ringer
>
> ProcessRepliesIfAny also now executes in WalSndWriteData, because while
> sending data we should also check for messages from the client (the client
> can send CopyDone, KeepAlive, Terminate).
>
>
Ah, I didn't spot that ProcessRepliesIfAny() is already called from
WalSndWriteData in the current codebase.

I agree that it would be useful to check for client-initiated CopyDone
there, clear the walsender write queue, close the decoding session and
return to command mode.

Nothing gets sent until we've formatted the data we want to send into a
StringInfo, though. The comments on WalSndPrepareWrite even note that
there's no guarantee anything will get done with the data. There should be
no need to test whether we're processing a Datum, since we won't call
ProcessRepliesIfAny() or WalSndWriteData() while we're doing that.

I think all you should need to do is test streamingDoneSending and
streamingDoneReceiving in WalSndWriteData() and WalSndWaitForWal().

I outlined how I think WalSndWaitForWal() should be handled earlier.
Test streamingDoneReceiving
and streamingDoneSending in logical_read_xlog_page and return -1.

For WalSndWriteData() it's more complicated because there's data waiting to
flush. I don't see any way to remove data from the send buffer in the libpq
async API, so you'll have to send whatever's already been passed to libpq
for sending using pq_putmessage_noblock(...). This is necessary for
protocol integrity too, since you need to send a complete CopyData packet.
The length is specified at the start of the CopyData, so once you start
sending it you are committed to finishing it. Otherwise the client has no
way to know when to stop expecting CopyData and look for the next protocol
message. It can wait until its socket read blocks, but that doesn't mean
there isn't more of the CopyData in the upstream server's send buffers,
buffered in a router, etc.
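
For reference, the CopyData frame is a one-byte 'd' type marker followed by a
4-byte big-endian length that counts the length field itself plus the payload.
A tiny sketch (illustrative helper, not libpq code) of why the receiver is
committed to reading the whole payload before looking for the next message:

#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Write the 5-byte CopyData header into out[] and return its size.  The
 * receiver will then read exactly payload_len more bytes before it looks
 * for the next message type, which is why a started CopyData message has
 * to be finished.
 */
static size_t
build_copydata_header(uint8_t *out, uint32_t payload_len)
{
    uint32_t    netlen = htonl(payload_len + 4);    /* length counts itself */

    out[0] = 'd';               /* CopyData */
    memcpy(out + 1, &netlen, sizeof(netlen));
    return 5;
}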

If you want a way to abort in the middle of sending a datum (or in fact an
individual logical change) you'd need a much bigger change to the
replication protocol, where we chunk up big Datum values into multiple
CopyData messages with continuations. That'd be quite nice to have,
especially if it let us stream the Datum out without assembling the whole
thing in memory in a StringInfo first, but it's a much bigger change.

If you just return from WalSndWriteData(), the data is still queued by
libpq, and libpq will still finish sending it when you try to send
something else. So I don't see any point taking much action in response to
CopyDone in WalSndWriteData(). Maybe you could check if the message we're
about to send is large and if it is, check for client input before we start
sending and bail out before the pq_putmessage_noblock() call. But once
we've done that we're committed to finishing sending that message and have
to wait until pq_is_send_pending() return false before we can send our own
CopyDone and exit.

Since WalSndWriteData() must write a whole message from the decoding plugin
before it can return, and when it returns we'll return
from LogicalDecodingProcessRecord(..) to XLogSendLogical(...), we can just
test for streamingDoneReceiving and streamingDoneSending there.


I guess if you add a flag or callback to the logical decoding context you
could have the logical decoding plugin interrupt processing of an
insert/update/delete row change message between processing of individual
datums within it and return without calling OutputPluginWrite(...) so the
data is never queued for WalSndWriteData(...). This would make a difference
if you have wide rows... but we don't currently process client writes during
output plugin callbacks. You'd have to add a logical decoding API function
clients could call to process client replies and set the flag. (They can't
call ProcessRepliesIfAny from walsender.c as it's static and not meant to
be called from within a decoding plugin anyway).


> I want to reuse the same connection without reopening it, because opening
> a new connection takes too long.

Fair enough. Though I don't understand why you'd be doing this often enough
that you'd care about reopening connections. What is the problem you are
trying to solve with this? The underlying reason you need this change?

> Is this a correct use case, or is CopyDone a side effect of the copy
> protocol, so that to finish replication one must always use a Terminate
> packet and reopen the connection?

Client-initiated CopyDone for COPY BOTH should be just fine in protocol
terms. There's partial support for it in the walsender already.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] asynchronous and vectorized execution

2016-05-09 Thread Kouhei Kaigai
> -----Original Message-----
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Tuesday, May 10, 2016 2:34 AM
> To: pgsql-hackers@postgresql.org
> Subject: [HACKERS] asynchronous and vectorized execution
> 
> Hi,
> 
> I realize that we haven't gotten 9.6beta1 out the door yet, but I
> think we can't really wait much longer to start having at least some
> discussion of 9.7 topics, so I'm going to go ahead and put this one
> out there.  I believe there are other people thinking about these
> topics as well, including Andres Freund, Kyotaro Horiguchi, and
> probably some folks at 2ndQuadrant (but I don't know exactly who).  To
> make a long story short, I think there are several different areas
> where we should consider major upgrades to our executor.  It's too
> slow and it doesn't do everything we want it to do.  The main things
> on my mind are:
> 
> 1. asynchronous execution, by which I mean the ability of a node to
> somehow say that it will generate a tuple eventually, but is not yet
> ready, so that the executor can go run some other part of the plan
> tree while it waits.  This case most obviously arises for foreign
> tables, where it makes little sense to block on I/O if some other part
> of the query tree could benefit from the CPU; consider SELECT * FROM
> lt WHERE qual UNION SELECT * FROM ft WHERE qual.  It is also a problem
> for parallel query: in a parallel sequential scan, the next worker can
> begin reading the next block even if the current block hasn't yet been
> received from the OS.  Whether or not this will be efficient is a
> research question, but it can be done.  However, imagine a parallel
> scan of a btree index: we don't know what page to scan next until we
> read the previous page and examine the next-pointer.  In the meantime,
> any worker that arrives at that scan node has no choice but to block.
> It would be better if the scan node could instead say "hey, thanks for
> coming but I'm really not ready to be on-CPU just at the moment" and
> potentially allow the worker to go work in some other part of the
> query tree.  For that worker to actually find useful work to do
> elsewhere, we'll probably need it to be the case either that the table
> is partitioned or the original query will need to involve UNION ALL,
> but those are not silly cases to worry about, particularly if we get
> native partitioning in 9.7.
>
Is the parallel-aware Append node sufficient to run multiple nodes
asynchronously? (Sorry, I didn't have enough time to code the feature,
even though we discussed it before.)
If some of the child nodes are blocked by I/O or other heavy stuff, they
cannot enqueue results into the shm_mq, so the Gather node naturally
skips nodes that are not ready.
In the above example, the scan on the foreign table has a longer lead time
than the local scan. If Append can map every child node onto an individual
worker, the local-scan worker begins to return tuples first, and mixed
tuples are returned eventually.

However, process-internal asynchronous execution may also be beneficial
in cases where the cost of shm_mq is not ignorable (e.g., when no scan
qualifiers are given to the worker process). I think it allows
pre-fetching to be implemented very naturally.
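
To make the in-process idea concrete, a purely hypothetical API shape (none of
these names exist in PostgreSQL): a child that is waiting returns "not ready"
plus a file descriptor, so an Append-like parent can move on to another child
instead of blocking.

#include <stdbool.h>
#include <stddef.h>

typedef struct AsyncResult
{
    bool        ready;          /* is a tuple available right now? */
    void       *tuple;          /* valid only when ready is true */
    int         wait_fd;        /* fd to wait on when not ready, or -1 */
} AsyncResult;

typedef struct AsyncPlanNode AsyncPlanNode;
struct AsyncPlanNode
{
    AsyncResult (*try_next) (AsyncPlanNode *self);
};

/* An Append-like parent skips children that are not ready. */
static void *
append_try_next(AsyncPlanNode **children, int nchildren)
{
    for (int i = 0; i < nchildren; i++)
    {
        AsyncResult r = children[i]->try_next(children[i]);

        if (r.ready)
            return r.tuple;
        /* otherwise remember r.wait_fd; once every child has reported
         * "not ready", the executor would wait on the collected fds */
    }
    return NULL;                /* nothing ready at the moment */
}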

> 2. vectorized execution, by which I mean the ability of a node to
> return tuples in batches rather than one by one.  Andres has opined
> more than once that repeated trips through ExecProcNode defeat the
> ability of the CPU to do branch prediction correctly, slowing the
> whole system down, and that they also result in poor CPU cache
> behavior,

My concern about ExecProcNode is that it is constructed with a large switch
... case statement, which involves tons of comparison operations at run-time.
If we replaced this switch ... case with function pointers, it would probably
improve performance, especially for OLAP workloads that process large
amounts of rows.

> since we jump all over the place executing a little bit of
> code from each node before moving onto the next rather than running
> one bit of code first, and then another later.  I think that's
> probably right.   For example, consider a 5-table join where all of
> the joins are implemented as hash tables.  If this query plan is going
> to be run to completion, it would make much more sense to fetch, say,
> 100 tuples from the driving scan and then probe for all of those in
> the first hash table, and then probe for all of those in the second
> hash table, and so on.  What we do instead is fetch one tuple and
> probe for it in all 5 hash tables, and then repeat.  If one of those
> hash tables would fit in the CPU cache but all five together will not,
> that seems likely to be a lot worse.
>
I can agree with the above concern from my experience. Each HashJoin
step needs to fill up a TupleTableSlot at each depth; mostly this is
just relocation of the attributes in the case of multi-table joins.

If HashJoin could gather five underlying 

[HACKERS] Please help update the "How to beta Test" page

2016-05-09 Thread Josh berkus
Folks,

Please help update the wiki page on how to beta test.  In particular,
please update it with specific things we'd like to see users test for,
like data corruption related to freezing (with some notes on how to test
for this).

https://wiki.postgresql.org/wiki/HowToBetaTest

Thanks!

-- 
--
Josh Berkus
Red Hat OSAS
(any opinions are my own)




Re: [HACKERS] asynchronous and vectorized execution

2016-05-09 Thread David Rowley
On 10 May 2016 at 05:33, Robert Haas  wrote:
> 2. vectorized execution, by which I mean the ability of a node to
> return tuples in batches rather than one by one.  Andres has opined
> more than once that repeated trips through ExecProcNode defeat the
> ability of the CPU to do branch prediction correctly, slowing the
> whole system down, and that they also result in poor CPU cache
> behavior, since we jump all over the place executing a little bit of
> code from each node before moving onto the next rather than running
> one bit of code first, and then another later.  I think that's
> probably right.   For example, consider a 5-table join where all of
> the joins are implemented as hash tables.  If this query plan is going
> to be run to completion, it would make much more sense to fetch, say,
> 100 tuples from the driving scan and then probe for all of those in
> the first hash table, and then probe for all of those in the second
> hash table, and so on.  What we do instead is fetch one tuple and
> probe for it in all 5 hash tables, and then repeat.  If one of those
> hash tables would fit in the CPU cache but all five together will not,
> that seems likely to be a lot worse.   But even just ignoring the CPU
> cache aspect of it for a minute, suppose you want to write a loop to
> perform a hash join.  The inner loop fetches the next tuple from the
> probe table and does a hash lookup.  Right now, fetching the next
> tuple from the probe table means calling a function which in turn
> calls another function which probably calls another function which
> probably calls another function and now about 4 layers down we
> actually get the next tuple.  If the scan returned a batch of tuples
> to the hash join, fetching the next tuple from the batch would
> probably be 0 or 1 function calls rather than ... more.  Admittedly,
> you've got to consider the cost of marshaling the batches but I'm
> optimistic that there are cycles to be squeezed out here.  We might
> also want to consider storing batches of tuples in a column-optimized
> rather than row-optimized format so that iterating through one or two
> attributes across every tuple in the batch touches the minimal number
> of cache lines.

It's interesting that you mention this. We identified this as a pain
point during our work on column stores last year. Simply passing
single tuples around the executor is really unfriendly towards L1
instruction cache, plus also the points you mention about L3 cache and
hash tables and tuple stores. I really think that we're likely to see
significant gains by processing >1 tuple at a time, so this topic very
much interests me.

On researching this we've found that other people's research does
indicate that there are gains to be had:
http://www.openlinksw.com/weblog/oerling/

In that blog there's a table that indicates that this row-store
database saw a 4.4x performance improvement from changing from a
tuple-at-a-time executor to a batch tuple executor.

Batch Size 1 tuple = 122 seconds
Batch Size 10k tuples = 27.7 seconds

When we start multiplying those increases with the gains from something
like parallel query, then we're starting to see very nice improvements
in performance.
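
As a standalone toy illustration of the batch-at-a-time probing Robert
describes (the "hash table" here is a deliberately trivial stand-in, not
executor code): each table is probed for the whole batch of keys before moving
to the next, so its buckets stay hot in cache.

#include <stddef.h>
#include <stdint.h>

/* Deliberately trivial "hash table": membership in a small array. */
typedef struct ToyHashTable
{
    const uint32_t *keys;
    size_t      nkeys;
} ToyHashTable;

static int
toy_lookup(const ToyHashTable *ht, uint32_t key)
{
    for (size_t i = 0; i < ht->nkeys; i++)
        if (ht->keys[i] == key)
            return 1;
    return 0;
}

/*
 * Probe a whole batch of keys against one table at a time, rather than
 * probing every table for each key as a tuple-at-a-time executor does.
 */
static void
probe_batch(const ToyHashTable *tables, int ntables,
            const uint32_t *keys, size_t nkeys, int *match_counts)
{
    for (int t = 0; t < ntables; t++)
    {
        int         matches = 0;

        for (size_t i = 0; i < nkeys; i++)
            matches += toy_lookup(&tables[t], keys[i]);
        match_counts[t] = matches;
    }
}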

Alvaro, Tomas and I had been discussing this and late last year I did
look into what would be required to allow this to happen in Postgres.
Basically there are 2 sub-projects; I'll describe what I've managed to
learn so far about each, and the rough plan that I have to implement
them:


1. Batch Execution:

a. Modify ScanAPI to allow batch tuple fetching in predefined batch sizes.
b. Modify TupleTableSlot to allow > 1 tuple to be stored. Add a flag to
indicate whether the struct contains a single tuple or multiple tuples.
Multiple tuples may need to be deformed in a non-lazy fashion in order
to prevent too many buffers from having to be pinned at once. Tuples
will be deformed into arrays for each column rather than arrays for
each tuple (this part is important to support the next sub-project; a
rough sketch follows this list).
c. Modify some nodes (perhaps start with nodeAgg.c) to allow them to
process a batch TupleTableSlot. This will require some tight loop to
aggregate the entire TupleTableSlot at once before returning.
d. Add function in execAmi.c which returns true or false depending on
if the node supports batch TupleTableSlots or not.
e. At executor startup determine if the entire plan tree supports
batch TupleTableSlots, if so enable batch scan mode.
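
A rough sketch of the kind of batch slot item (b) describes (hypothetical
layout, not a proposal for the actual TupleTableSlot struct): columns are
deformed into per-attribute arrays so a node like nodeAgg.c can run a tight
loop over a single attribute.

#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;        /* stand-in for the real Datum typedef */

typedef struct BatchSlot
{
    bool        is_batch;       /* single tuple or a batch? */
    int         natts;          /* number of attributes */
    int         ntuples;        /* tuples currently stored */
    int         capacity;       /* predefined batch size from the scan API */
    Datum     **values;         /* values[att][row] */
    bool      **isnull;         /* isnull[att][row] */
} BatchSlot;

/* e.g. an aggregate could sum one pass-by-value integer column in a
 * cache-friendly loop over just that attribute's array */
static int64_t
sum_int64_column(const BatchSlot *slot, int att)
{
    int64_t     total = 0;

    for (int row = 0; row < slot->ntuples; row++)
    {
        if (!slot->isnull[att][row])
            total += (int64_t) slot->values[att][row];
    }
    return total;
}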

Those at least are my ideas for stage 1. There's still more to work out,
e.g. should batch mode occur when the query has a LIMIT? We might not
want to waste time gathering up extra tuples when we're just going to
stop after the first one. So perhaps 'e' above should be up to the
planner instead. Further development work here might add a node type
that de-batches a TupleTableSlot to allow nodes which don't support
batching to be in the plan, i.e. "mixed execution mode". I'm less
excited about this as it may be difficult to cost that 

Re: [HACKERS] Reviewing freeze map code

2016-05-09 Thread Ants Aasma
On Mon, May 9, 2016 at 10:53 PM, Robert Haas  wrote:
> On Sun, May 8, 2016 at 10:42 PM, Masahiko Sawada  
> wrote:
>> Attached draft patch adds SCANALL option to VACUUM in order to scan
>> all pages forcibly while ignoring visibility map information.
>> The option name is SCANALL for now but we could change it after we get
>> consensus.
>
> If we're going to go that way, I'd say it should be scan_all rather
> than scanall.  Makes it clearer, at least IMHO.

Just to add some diversity to opinions, maybe there should be a
separate command for performing integrity checks. Currently the best
ways to actually verify database correctness do so as a side effect.
The question that I get pretty much every time after I explain why we
have data checksums is "how do I check that they are correct?", and we
don't have a nice answer for that now. We could also use some ways to
sniff out corrupted rows that don't involve crashing the server in a
loop. Vacuuming pages that supposedly don't need vacuuming just to
verify integrity seems very much in the same vein.

I know right now isn't exactly the best time to hastily slap on such a
feature, but I just wanted the thought to be out there for
consideration.

Regards,
Ants Aasma




Re: [HACKERS] Reviewing freeze map code

2016-05-09 Thread Robert Haas
On Sun, May 8, 2016 at 10:42 PM, Masahiko Sawada  wrote:
> Attached draft patch adds SCANALL option to VACUUM in order to scan
> all pages forcibly while ignoring visibility map information.
> The option name is SCANALL for now but we could change it after we get consensus.

If we're going to go that way, I'd say it should be scan_all rather
than scanall.  Makes it clearer, at least IMHO.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] GIN logging GIN_SEGMENT_UNMODIFIED actions?

2016-05-09 Thread Andres Freund
Hi,

trying to debug something I saw the following in pg_xlogdump output:

rmgr: Gin len (rec/tot):  0/   274, tx:  0, lsn: 
1C/DF28AEB0, prev 1C/DF289858, desc: VACUUM_DATA_LEAF_PAGE  3 segments: 5 
unknown action 0 ???, blkref #0: rel 1663/16384/16435 blk 310982

note the "segments: 5 unknown action 0 ???" bit.  That doesn't seem
right, given:
#define GIN_SEGMENT_UNMODIFIED  0   /* no action (not used in WAL 
records) */

Greetings,

Andres Freund




Re: [HACKERS] asynchronous and vectorized execution

2016-05-09 Thread Simon Riggs
On 9 May 2016 at 19:33, Robert Haas  wrote:


> I believe there are other people thinking about these
> topics as well, including Andres Freund, Kyotaro Horiguchi, and
> probably some folks at 2ndQuadrant (but I don't know exactly who).



> 1. asynchronous execution
>

Not looking at that.


> 2. vectorized execution...



> We might also want to consider storing batches of tuples in a
> column-optimized
> rather than row-optimized format so that iterating through one or two
> attributes across every tuple in the batch touches the minimal number
> of cache lines.
>

The team is about 2 years into research and prototype coding on those topics
at this point, with agreed time for further work over the next 2 years.

I'll let my colleagues chime in with details since I'm not involved at that
level any more.

-- 
Simon Riggs   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Change error code for hstore syntax error

2016-05-09 Thread Sherrylyn Branchaw
Finally returning to this...


> Hm, class 42 is generally meant for SQL-level syntax errors.  Are these
> errors not coming from subroutines of hstore_in()?  I think our usual
> convention is to use ERRCODE_INVALID_TEXT_REPRESENTATION for complaints
> that a data value does not meet its type's conventions.  In any case
> it seems closer to class 22 than 42.
>

I see, thanks for pointing that out. That seems like a useful distinction
to preserve. I've updated it to ERRCODE_INVALID_TEXT_REPRESENTATION
accordingly.

> These seem to have multiple problems, starting with incorrect
> capitalization and extending to failure to quote the whole string
> being complained of.


I've modified the messages to comply with the guidelines, with a little
help from a colleague who actually knows C.

I'm attaching a revised patch; please let me know if there are any other
issues before I submit to the commitfest.

Best,
Sherrylyn
diff --git a/contrib/hstore/hstore_io.c b/contrib/hstore/hstore_io.c
index 0c1d99a..4ee3d05 100644
--- a/contrib/hstore/hstore_io.c
+++ b/contrib/hstore/hstore_io.c
@@ -81,7 +81,10 @@ get_val(HSParser *state, bool ignoreeq, bool *escaped)
 			}
 			else if (*(state->ptr) == '=' && !ignoreeq)
 			{
-elog(ERROR, "Syntax error near '%c' at position %d", *(state->ptr), (int32) (state->ptr - state->begin));
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+		 (errmsg("syntax error at position %d: \"%s\"",
+	  (int32) (state->ptr - state->begin), state->begin))));
 			}
 			else if (*(state->ptr) == '\\')
 			{
@@ -138,7 +141,9 @@ get_val(HSParser *state, bool ignoreeq, bool *escaped)
 			}
 			else if (*(state->ptr) == '\0')
 			{
-elog(ERROR, "Unexpected end of string");
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+		 (errmsg("unexpected end of string: \"%s\"", state->begin;
 			}
 			else
 			{
@@ -150,7 +155,9 @@ get_val(HSParser *state, bool ignoreeq, bool *escaped)
 		else if (st == GV_WAITESCIN)
 		{
 			if (*(state->ptr) == '\0')
-elog(ERROR, "Unexpected end of string");
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+		 (errmsg("unexpected end of string: \"%s\"", state->begin;
 			RESIZEPRSBUF;
 			*(state->cur) = *(state->ptr);
 			state->cur++;
@@ -159,14 +166,16 @@ get_val(HSParser *state, bool ignoreeq, bool *escaped)
 		else if (st == GV_WAITESCESCIN)
 		{
 			if (*(state->ptr) == '\0')
-elog(ERROR, "Unexpected end of string");
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+		 (errmsg("unexpected end of string: \"%s\"", state->begin;
 			RESIZEPRSBUF;
 			*(state->cur) = *(state->ptr);
 			state->cur++;
 			st = GV_INESCVAL;
 		}
 		else
-			elog(ERROR, "Unknown state %d at position line %d in file '%s'", st, __LINE__, __FILE__);
+			elog(ERROR, "unknown state %d at position line %d in file \"%s\"", st, __LINE__, __FILE__);
 
 		state->ptr++;
 	}
@@ -216,11 +225,16 @@ parse_hstore(HSParser *state)
 			}
 			else if (*(state->ptr) == '\0')
 			{
-elog(ERROR, "Unexpected end of string");
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+		 (errmsg("unexpected end of string: \"%s\"", state->begin;
 			}
 			else if (!isspace((unsigned char) *(state->ptr)))
 			{
-elog(ERROR, "Syntax error near '%c' at position %d", *(state->ptr), (int32) (state->ptr - state->begin));
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+		 (errmsg("syntax error at position %d: \"%s\"",
+	  (int32) (state->ptr - state->begin), state->begin))));
 			}
 		}
 		else if (st == WGT)
@@ -231,17 +245,24 @@ parse_hstore(HSParser *state)
 			}
 			else if (*(state->ptr) == '\0')
 			{
-elog(ERROR, "Unexpected end of string");
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+		 (errmsg("unexpected end of string: \"%s\"", state->begin;
 			}
 			else
 			{
-elog(ERROR, "Syntax error near '%c' at position %d", *(state->ptr), (int32) (state->ptr - state->begin));
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+		 (errmsg("syntax error at position %d: \"%s\"",
+	  (int32) (state->ptr - state->begin), state->begin))));
 			}
 		}
 		else if (st == WVAL)
 		{
 			if (!get_val(state, true, &escaped))
-elog(ERROR, "Unexpected end of string");
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+		 (errmsg("unexpected end of string: \"%s\"", state->begin;
 			state->pairs[state->pcur].val = state->word;
 			state->pairs[state->pcur].vallen = hstoreCheckValLen(state->cur - state->word);
 			state->pairs[state->pcur].isnull = false;
@@ -268,11 +289,14 @@ parse_hstore(HSParser *state)
 			}
 			else if (!isspace((unsigned char) *(state->ptr)))
 			{
-elog(ERROR, "Syntax error near '%c' at position %d", *(state->ptr), (int32) (state->ptr - state->begin));
+ereport(ERROR,
+		

Re: [HACKERS] Patch for German translation

2016-05-09 Thread Peter Eisentraut

On 5/9/16 10:25 AM, Christian Ullrich wrote:

here is a patch for the German translation that removes (all) five
instances of *the* most annoying mistake ever.


fixed

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Add TAP tests for pg_dump

2016-05-09 Thread Stephen Frost
Catalin,

* Catalin Iacob (iacobcata...@gmail.com) wrote:
> On 5/9/16, Stephen Frost  wrote:
> >> And what if the tests are buggy? New test suites should go through a
> >> CF first I think for proper review. And this is clearly one.
> >
> > They still won't result in data loss, corruption, or other direct impact
> on our users, even if they are buggy.  They also won't destabilize the
> > code base or introduce new bugs into the code.
> 
> Yes, but they could make things worse long term. For example if they
> introduce intermittent failures or if they're hard to understand or if
> they test internals too deeply and don't let those internals evolve
> because the tests break. All of them reasons why it's good that
> they're reviewed.

I agree that such tests could be bad but I'm unconvinced that's a
justification for having to have all new testing go through the CF
process.

Further, this test framework was under discussion on-list and commented
on by at least one other committer prior to being committed.  It was not
entirely without review.

> > There is pretty clearly a lack of consensus on the question about adding
> > new tests.
> 
> > What *should* we be doing right now?  Reviewing code and testing the
> > system is what I had thought but perhaps I'm wrong.  To the extent that
> > we're testing the system, isn't it better to write repeatable tests,
> > particularly ones which add code coverage, than one-off tests that only
> > check that the current snapshot of the code is operating correctly?
> 
> I also think that testing *and* keeping those tests for the future is
> more valuable than just testing. But committers can just push their
> tests while non committers need to wait for the next CF to get their
> new tests committed.

I'd actually love to have non-committers reviewing and testing code and
submitting their tests for review and commit by a committer.  This
entire discussion is about when it's appropriate to do that and I'm
trying to argue that now is the right time for that review and testing
to happen.

If the consensus is that new tests are good and that we can add them
during this period of development, then I don't see why that would be
any different for committers vs. non-committers.  We certainly don't
require that non-committers go through the CF process for bug fixes
(although we encourage them to add it to the CF if it isn't picked up
immediately, to ensure it isn't forgotten).

Honestly, this argument appears to be primarily made on the basis of
"committers shouldn't get to do things differently from non-committers"
and, further, that the CF process should be an all-encompassing
bureaucratic process for every commit that goes into PG simply because
we have a CF system.  I don't agree with either of those.

> And committers pushing tests means they bypass
> the review process which, as far as I know, doesn't happen for regular
> patches: committers usually only directly commit small fixes not lots
> of new (test) code.

As mentioned, the test suite wasn't simply pushed without any review or
discussion.  That didn't happen through the CF process, but that doesn't
mean it didn't happen.

> So, crazy idea: what about test only CommitFests in between now and
> the release? The rules would be: only new tests and existing test
> improvements are allowed. For the rest, the CF would be the same as
> usual: have an end date, committers agree to go through each patch and
> commit or return it with feedback etc. Stephen would have added his
> tests to the upcoming June test CF, Michael would have reviewed them
> and maybe add his own and so on.

If having a process for adding tests would be beneficial then I'm happy
to support it.  I agree that having specific goals for both committers
and non-committers during this time in the development cycle is a good
thing and having a "test-only" or perhaps "review/test-only"
commitfest sounds like it would encourage those as goals for this
period.

> This does create work for committers (go through the submitted
> patches) but acts as an incentive for test writers. Want your test
> code committed? Well, there will be a CF very soon where it will get
> committer attention so better write that test. It also gives a clear
> date after which test churn also stops in order to not destabilize the
> buildfarm right before a release.

I don't have any issue with that and would be happy to support such an
approach, as a committer.

Thanks!

Stephen




Re: [HACKERS] parallel.c is not marked as test covered

2016-05-09 Thread Robert Haas
On Mon, May 9, 2016 at 11:11 AM, Tom Lane  wrote:
> Andres Freund  writes:
>> I think it's a good idea to run a force-parallel run on some buildfarm
>> members.
>
> Noah's already doing that on at least one of his critters.  But some more
> wouldn't hurt.

I agree.

>> But I'm rather convinced that the core tests run by all animals
>> need some minimal coverage of parallel queries. Both because otherwise
>> it'll be hard to get some coverage of unusual platforms, and because
>> it's imo something rather relevant to test during development.
>
> +1.  Experimenting with what we might do, it seems like it's harder to get
> the planner to use a parallel plan than you would think.
>
> regression=# explain select count(*) from tenk1;
> QUERY PLAN
>
> 
> ---
>  Aggregate  (cost=295.29..295.30 rows=1 width=8)
>->  Index Only Scan using tenk1_thous_tenthous on tenk1  (cost=0.29..270.29
> rows=10000 width=0)
> (2 rows)
>
> regression=# set enable_indexscan TO 0;
> SET
> regression=# explain select count(*) from tenk1;
>QUERY PLAN
> -
>  Aggregate  (cost=483.00..483.01 rows=1 width=8)
>->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=0)
> (2 rows)
>
> regression=# set force_parallel_mode TO on;
> SET
> regression=# explain select count(*) from tenk1;
>QUERY PLAN
> -
>  Aggregate  (cost=483.00..483.01 rows=1 width=8)
>->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=0)
> (2 rows)
>
> Methinks force_parallel_mode is a bit broken.

Hmm, that is strange.  I would have expected that to stuff a Gather on
top of the Aggregate.  I wonder why it's not doing that.

> Also, once you *do* get it to make a parallel plan:
>
> regression=# create table foo as select generate_series(1,1000000) g;
> SELECT 1000000
> regression=# analyze foo;
> ANALYZE
> regression=# explain select count(*) from foo;
>   QUERY PLAN
> --
>  Finalize Aggregate  (cost=10633.55..10633.56 rows=1 width=8)
>->  Gather  (cost=10633.33..10633.54 rows=2 width=8)
>  Workers Planned: 2
>  ->  Partial Aggregate  (cost=9633.33..9633.34 rows=1 width=8)
>->  Parallel Seq Scan on foo  (cost=0.00..8591.67 rows=416667 
> width=0)
> (5 rows)
>
> regression=# explain (costs off) select count(*) from foo;
>  QUERY PLAN
> 
>  Finalize Aggregate
>->  Gather
>  Workers Planned: 2
>  ->  Partial Aggregate
>->  Parallel Seq Scan on foo
> (5 rows)
>
> That's not going to do for a regression-test case because it will break
> anytime somebody changes the value of max_parallel_degree.  Maybe we
> should suppress the "Workers Planned" field in costs-off display mode?

That seems reasonable to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] between not propagated into a simple equality join

2016-05-09 Thread Benedikt Grundmann
We just ran into a very simple query that the planner does much worse on
than we thought it would (in production the table in question is ~ 100
GB).  It surprised us given the planner is generally quite good, so I
thought I'd share our surprise.

Setup:

postgres_prod@proddb_testing=# select version();[1]
version


 PostgreSQL 9.2.16 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7
20120313 (Red Hat 4.4.7-16), 64-bit
(1 row)

Time: 69.246 ms

postgres_prod@proddb_testing=# create table toy_data3 (the_date date, i
int);
CREATE TABLE
Time: 67.096 ms
postgres_prod@proddb_testing=# insert into toy_data3

  (select current_date-(s.idx/1000), s.idx from generate_series(1,1000000)
as s(idx));
INSERT 0 1000000
Time: 1617.483 ms
postgres_prod@proddb_testing=# create index toy_data_date3 on
toy_data3(the_date);
CREATE INDEX
Time: 660.166 ms
postgres_prod@proddb_testing=# analyze toy_data3;
ANALYZE
Time: 294.984 ms

The bad behavior:

postgres_prod@proddb_testing=# explain analyze
  select * from (
   select td1.the_date, td1.i
from toy_data3 td1, toy_data3 td2  where td1.the_date = td2.the_date
and td1.i = td2.i
  ) foo
  where the_date between current_date and current_date;
   QUERY
PLAN
───
 Hash Join  (cost=55.49..21980.50 rows=1 width=8) (actual
time=0.336..179.374 rows=999 loops=1)
   Hash Cond: ((td2.the_date = td1.the_date) AND (td2.i = td1.i))
   ->  Seq Scan on toy_data3 td2  (cost=0.00..14425.00 rows=1000000
width=8) (actual time=0.007..72.510 rows=1000000 loops=1)
   ->  Hash  (cost=40.44..40.44 rows=1003 width=8) (actual
time=0.321..0.321 rows=999 loops=1)
 Buckets: 1024  Batches: 1  Memory Usage: 40kB
 ->  Index Scan using toy_data_date3 on toy_data3 td1
 (cost=0.01..40.44 rows=1003 width=8) (actual time=0.018.
   Index Cond: ((the_date >= ('now'::cstring)::date) AND
(the_date <= ('now'::cstring)::date))
 Total runtime: 179.440 ms
(8 rows)

Time: 246.094 ms

Notice that td2 gets a full sequential scan.  Which is sad, because one
would like the planner to realize that it could propagate the index
constraint onto td2, i.e. use the index scan on the_date on both sides of
the join.

As it does correctly when one explicitly uses equality (bold below) (but of
course we sometimes have multiple day ranges in production and we only used
a single date range above to make it extra interesting for the planner to
NOT do a seqscan):

postgres_prod@proddb_testing=# explain analyze
  select * from (
   select td1.the_date, td1.i
from toy_data3 td1, toy_data3 td2  where td1.the_date = td2.the_date
and td1.i = td2.i) foo
  where *the_date = current_date*;
   QUERY
PLAN
───
 Hash Join  (cost=50.47..92.17 rows=1 width=8) (actual time=0.300..0.652
rows=999 loops=1)
   Hash Cond: (td1.i = td2.i)
   ->  Index Scan using toy_data_date3 on toy_data3 td1  (cost=0.00..37.93
rows=1003 width=8) (actual time=0.023..0.169
 Index Cond: (the_date = ('now'::cstring)::date)
   ->  Hash  (cost=37.93..37.93 rows=1003 width=8) (actual
time=0.270..0.270 rows=999 loops=1)
 Buckets: 1024  Batches: 1  Memory Usage: 40kB
 ->  Index Scan using toy_data_date3 on toy_data3 td2
 (cost=0.00..37.93 rows=1003 width=8) (actual time=0.007.
   Index Cond: (the_date = ('now'::cstring)::date)
 Total runtime: 0.713 ms
(9 rows)

Time: 66.904 ms

Cheers,

Bene


Re: [HACKERS] parallel.c is not marked as test covered

2016-05-09 Thread Tom Lane
Andres Freund  writes:
> I think it's a good idea to run a force-parallel run on some buildfarm
> members.

Noah's already doing that on at least one of his critters.  But some more
wouldn't hurt.

> But I'm rather convinced that the core tests run by all animals
> need some minimal coverage of parallel queries. Both because otherwise
> it'll be hard to get some coverage of unusual platforms, and because
> it's imo something rather relevant to test during development.

+1.  Experimenting with what we might do, it seems like it's harder to get
the planner to use a parallel plan than you would think.

regression=# explain select count(*) from tenk1;
QUERY PLAN  
   

---
 Aggregate  (cost=295.29..295.30 rows=1 width=8)
   ->  Index Only Scan using tenk1_thous_tenthous on tenk1  (cost=0.29..270.29
rows=10000 width=0)
(2 rows)

regression=# set enable_indexscan TO 0;
SET
regression=# explain select count(*) from tenk1;
   QUERY PLAN
-
 Aggregate  (cost=483.00..483.01 rows=1 width=8)
   ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=0)
(2 rows)

regression=# set force_parallel_mode TO on;
SET
regression=# explain select count(*) from tenk1;
   QUERY PLAN
-
 Aggregate  (cost=483.00..483.01 rows=1 width=8)
   ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=0)
(2 rows)

Methinks force_parallel_mode is a bit broken.

Also, once you *do* get it to make a parallel plan:

regression=# create table foo as select generate_series(1,1000000) g;
SELECT 1000000
regression=# analyze foo;
ANALYZE
regression=# explain select count(*) from foo;
  QUERY PLAN
  
--
 Finalize Aggregate  (cost=10633.55..10633.56 rows=1 width=8)
   ->  Gather  (cost=10633.33..10633.54 rows=2 width=8)
 Workers Planned: 2
 ->  Partial Aggregate  (cost=9633.33..9633.34 rows=1 width=8)
   ->  Parallel Seq Scan on foo  (cost=0.00..8591.67 rows=416667 
width=0)
(5 rows)

regression=# explain (costs off) select count(*) from foo;
 QUERY PLAN 

 Finalize Aggregate
   ->  Gather
 Workers Planned: 2
 ->  Partial Aggregate
   ->  Parallel Seq Scan on foo
(5 rows)

That's not going to do for a regression-test case because it will break
anytime somebody changes the value of max_parallel_degree.  Maybe we
should suppress the "Workers Planned" field in costs-off display mode?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Add TAP tests for pg_dump

2016-05-09 Thread Catalin Iacob
On 5/9/16, Stephen Frost  wrote:
>> And what if the tests are buggy? New test suites should go through a
>> CF first I think for proper review. And this is clearly one.
>
> They still won't result in data loss, corruption, or other direct impact
> on our users, even if they are buggy.  They also won't destabilize the
> code base or introduce new bugs into the code.

Yes, but they could make things worse long term, for example if they
introduce intermittent failures, or if they're hard to understand, or if
they test internals too deeply and don't let those internals evolve
because the tests break. All of these are reasons why it's good that
they're reviewed.

> There is pretty clearly a lack of consensus on the question about adding
> new tests.

> What *should* we be doing right now?  Reviewing code and testing the
> system is what I had thought but perhaps I'm wrong.  To the extent that
> we're testing the system, isn't it better to write repeatable tests,
> particularly ones which add code coverage, than one-off tests that only
> check that the current snapshot of the code is operating correctly?

I also think that testing *and* keeping those tests for the future is
more valuable than just testing. But committers can just push their
tests while non-committers need to wait for the next CF to get their
new tests committed. And committers pushing tests means they bypass
the review process which, as far as I know, doesn't happen for regular
patches: committers usually only directly commit small fixes, not lots
of new (test) code.

So, crazy idea: what about test-only CommitFests between now and
the release? The rules would be: only new tests and existing test
improvements are allowed. For the rest, the CF would be the same as
usual: have an end date, committers agree to go through each patch and
commit or return it with feedback, etc. Stephen would have added his
tests to the upcoming June test CF, Michael would have reviewed them
and maybe added his own, and so on.

This does create work for committers (go through the submitted
patches) but acts as an incentive for test writers. Want your test
code committed? Well, there will be a CF very soon where it will get
committer attention, so better write that test. It also gives a clear
date after which test churn stops, so as not to destabilize the
buildfarm right before a release.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] parallel.c is not marked as test covered

2016-05-09 Thread Andres Freund
On 2016-05-08 22:20:55 -0300, Alvaro Herrera wrote:
> David Rowley wrote:
> 
> > I'm not entirely sure which machine generates that coverage output,
> > but the problem with it is that it's just one machine. We do have at
> > least one buildfarm member which runs with force_parallel_mode =
> > regress.
> 
> It's not a buildfarm machine, but a machine setup specifically for
> coverage reports.  It runs "make check-world" only.  I can add some
> additional command(s) to run, if somebody can suggest something.

I think it's a good idea to run a force-parallel run on some buildfarm
members. But I'm rather convinced that the core tests run by all animals
need some minimal coverage of parallel queries. Both because otherwise
it'll be hard to get some coverage of unusual platforms, and because
it's imo something rather relevant to test during development.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch for German translation

2016-05-09 Thread Euler Taveira
On 09-05-2016 11:25, Christian Ullrich wrote:
> here is a patch for the German translation that removes (all) five
> instances of *the* most annoying mistake ever.
> 
We generally don't work with patches for translations. Instead, we send
all PO files. We have a separate repository [1] to deal with
translations that is merged with upstream before each release.

> By the way, is there any documentation anywhere about how and where to
> send translation patches? I'm just guessing here.
> 
Take a look at https://wiki.postgresql.org/wiki/NLS

Subscribe to pgsql-translators [2] to send corrections / translations.


[1] http://git.postgresql.org/gitweb/?p=pgtranslation/messages.git;a=summary
[2] http://www.postgresql.org/list/pgsql-translators/


-- 
   Euler Taveira   Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch for German translation

2016-05-09 Thread Tom Lane
Christian Ullrich  writes:
> By the way, is there any documentation anywhere about how and where to 
> send translation patches? I'm just guessing here.

https://wiki.postgresql.org/wiki/NLS

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Patch for German translation

2016-05-09 Thread Christian Ullrich

Hello,

here is a patch for the German translation that removes (all) five 
instances of *the* most annoying mistake ever.


By the way, is there any documentation anywhere about how and where to 
send translation patches? I'm just guessing here.


--
Christian

diff --git a/de/postgres.po b/de/postgres.po
index 6dd941d..f20cdc5 100644
--- a/de/postgres.po
+++ b/de/postgres.po
@@ -21038,7 +21038,7 @@ msgstr "zu viele Verbindungen von Rolle „%s“"
 #: utils/init/miscinit.c:618
 #, c-format
 msgid "permission denied to set session authorization"
-msgstr "keine Berechtigung, um Sitzungsauthorisierung zu setzen"
+msgstr "keine Berechtigung, um Sitzungsautorisierung zu setzen"
 
 #: utils/init/miscinit.c:701
 #, c-format
@@ -21174,7 +21174,7 @@ msgstr "Bibliothek „%s“ geladen"
 #: utils/init/postinit.c:252
 #, c-format
 msgid "replication connection authorized: user=%s SSL enabled (protocol=%s, 
cipher=%s, compression=%s)"
-msgstr "Replikationsverbindung authorisiert: Benutzer=%s SSL an (Protokoll=%s, 
Verschlüsselungsmethode=%s, Komprimierung=%s)"
+msgstr "Replikationsverbindung autorisiert: Benutzer=%s SSL an (Protokoll=%s, 
Verschlüsselungsmethode=%s, Komprimierung=%s)"
 
 #: utils/init/postinit.c:254 utils/init/postinit.c:268
 msgid "off"
@@ -21187,17 +21187,17 @@ msgstr "an"
 #: utils/init/postinit.c:258
 #, c-format
 msgid "replication connection authorized: user=%s"
-msgstr "Replikationsverbindung authorisiert: Benutzer=%s"
+msgstr "Replikationsverbindung autorisiert: Benutzer=%s"
 
 #: utils/init/postinit.c:266
 #, c-format
 msgid "connection authorized: user=%s database=%s SSL enabled (protocol=%s, 
cipher=%s, compression=%s)"
-msgstr "Verbindung authorisiert: Benutzer=%s Datenbank=%s SSL an 
(Protokoll=%s, Verschlüsselungsmethode=%s, Komprimierung=%s)"
+msgstr "Verbindung autorisiert: Benutzer=%s Datenbank=%s SSL an (Protokoll=%s, 
Verschlüsselungsmethode=%s, Komprimierung=%s)"
 
 #: utils/init/postinit.c:272
 #, c-format
 msgid "connection authorized: user=%s database=%s"
-msgstr "Verbindung authorisiert: Benutzer=%s Datenbank=%s"
+msgstr "Verbindung autorisiert: Benutzer=%s Datenbank=%s"
 
 #: utils/init/postinit.c:304
 #, c-format

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Add TAP tests for pg_dump

2016-05-09 Thread Stephen Frost
* Michael Paquier (michael.paqu...@gmail.com) wrote:
> On Sun, May 8, 2016 at 6:44 AM, Stephen Frost  wrote:
> > I do think that now is a good time for people to be reviewing what's
> > been committed, which includes writing tests (either automated ones,
> > which can be included in our test suite, or one-off's for testing
> > specific things but which don't make sense to include).
> 
> And what if the tests are buggy? New test suites should go through a
> CF first I think for proper review. And this is clearly one.

They still won't result in data loss, corruption, or other direct impact
on our users, even if they are buggy.  They also won't destabilize the
code base or introduce new bugs into the code.  Those are the things we
need to be concerned about when it comes to introducing new features
into the system and why we take a break from doing so before a release.

> > Regarding when we should stop adding new tests, I can't agree with the
> > notion that they should be treated as new features.  New tests may break
> > the buildfarm, but they do not impact the stability of committed code,
> > nor do they introduce new bugs into the code which has been committed
> > (if anything, they expose committed bugs, as the set of tests we're
> > discussing did, which should then be fixed).
> 
Yes, new tests do not impact the core code, but they could hide
existing bugs, which is what I think Peter is arguing about here, and
why it is not a good idea to push a complete new test suite just
before a beta release. The buildfarm is made so that a run stops once
one of the tests turns red, not after all of them have run.

I disagree that new tests are going to hide existing bugs.  The case
which I believe you're making for that happening is where the buildfarm
is red for an extended period of time, but I've agreed that we don't
want the buildfarm to be red and have said as much.  That's different
from the question about adding new tests, and it simply means the same
thing it's always meant: anything which makes the buildfarm red should
be addressed in short order.

> > As such, I'd certainly encourage you to propose new tests, even now, but
> > not changes to the PG code, except for bug fixes, or changes agreed to
> > by the RMT.  How you prioritise that with the release preparation work
> > is up to you, of course, though if I were in your shoes, I'd take care
> > of the release prep first, as we're quite close to doing a set of
> > releases.  I'm happy to try and help with that effort myself, though
> > I've not been involved very much in release prep and am not entirely
> > sure what I can do to help.
> 
> Adding new tests that are part of an existing bug fix is fine because
> the failure of such a test would mean that the bug fix is not working
> as intended. Based on my reading of the code this adds test coverage
> that is larger than the basic test for a bug fix.

Yes, which is an altogether good thing, I believe.  We need more and
better test coverage of a *lot* of the system and this is just the start
of adding that coverage for pg_dump.

There is pretty clearly a lack of consensus on the question about adding
new tests.  It seems like it'd be a good topic for the dev meeting, but
I don't think everyone involved in the discussion so far is going to be
there, unfortunately.  I'm open to still proposing it as a topic, though
I dislike having to leave people out of it.

I'm not quite sure what the best way to put this is, but I can't help
but feel the need to comment on it in at least some way:

What *should* we be doing right now?  Reviewing code and testing the
system is what I had thought but perhaps I'm wrong.  To the extent that
we're testing the system, isn't it better to write repeatable tests,
particularly ones which add code coverage, than one-off tests that only
check that the current snapshot of the code is operating correctly?  I
feel like that's what we should be encouraging, but this position is
suggesting that we should be doing that independently and holding all of
those new tests until the next commitfest, which is the same as we're
asked to do for new development at this point.  Perhaps it's not ideal,
but I feel it's necessary to point out that between writing code
coverage tests and doing new development, other things being equal, you
can guess which would be preferred.  I'd rather we encourage more of the
former rather than acting as if they should be treated the same during
this phase of the development cycle.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] force_parallel_mode uniqueness

2016-05-09 Thread Robert Haas
On Sun, May 8, 2016 at 2:23 PM, David G. Johnston
 wrote:
> I was thinking more along the lines that it should be called:
>
> parallel_mode (enum)
>
> It would default to "on"
>
> "off" would turn it off (instead of having to set parallel_degree to 0)
>
> And it would have additional enum values for:
>
> "always" - basically what on means in the current setup
> "regress" - the same as the current regress.

So, right now, most people can totally ignore force_parallel_mode.
Under your proposal, parallel query could be disabled either by
setting parallel_mode=off or by setting max_parallel_degree=0 (or 1,
after we do that renumbering).  That does not seem like a usability
improvement.

>> We could put max_parallel_degree under "other planner options" rather
>> than "asynchronous behavior".  However, I wonder what behavior people
>> will want for parallel operations that are not queries.  For example,
>> suppose we have parallel CREATE INDEX.  Should the number of workers
>> for that operation also be controlled by max_parallel_degree?  If yes,
>> then this shouldn't be a query planner option, because CREATE INDEX is
>> not a query.
>
> Like I said, it isn't clear-cut.  But at the moment it is just for queries -
> it could be moved if it gets dual purposed as you describe.

That's true.  But it could also be left where it is, and then we
wouldn't have to move it back.  I believe that at least some parallel
utility commands are going to arrive in 9.7 - for example, I think
Peter Geoghegan (whom my wife accurately dubbed the Sultan of Sort) is
interested in parallel CREATE INDEX and parallel CLUSTER.  Now I don't
know yet whether max_parallel_degree will affect those things or not,
and if it works out that we never use max_parallel_degree for anything
other than queries, then maybe I'll regret putting it where I did.
But I don't think it makes much sense to move it at this point.  It
isn't a clear improvement, and we've got plenty of things to tinker
with that are.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A population of population counts

2016-05-09 Thread Robert Haas
On Sun, May 8, 2016 at 7:46 PM, Thomas Munro
 wrote:
> My aim with this thread was mainly reducing code duplication and
> needless code: perhaps at least the other ideas in the attached
> sketch, namely using ffs instead of the rightmost_one_pos table loop
> and consolidation of popcount into a reusable API (without trying to
> get hardware support) could be worth polishing for the next CF?
> Annoyingly, it seems Windows doesn't have POSIX/SUSv2 ffs, though
> apparently it can reach that instruction with MSVC intrinsic
> _BitScanReverse or MingW __builtin_ffs.

I think my_log2() is the same thing as one of ffs() and fls() - I can
never keep those straight.  It seems like it wouldn't be hard to clean
things up at least that much.
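
For reference, a standalone sketch (my own names, GCC/Clang builtins assumed;
none of this is PostgreSQL code) contrasting an ffs-style lowest-set-bit
search, an fls-style highest-set-bit search, and a loop-based ceil(log2) in
the spirit of my_log2:

/* Standalone sketch: ffs vs. fls style bit searches plus a loop-based ceil(log2). */
#include <stdio.h>

/* ffs-style: 1-based position of the lowest set bit, 0 if x == 0. */
static int lowest_set_bit(unsigned int x)
{
    return __builtin_ffs((int) x);
}

/* fls-style: 1-based position of the highest set bit, 0 if x == 0. */
static int highest_set_bit(unsigned int x)
{
    return x ? 32 - __builtin_clz(x) : 0;
}

/* Loop-based ceil(log2(x)): smallest i such that 2^i >= x. */
static int ceil_log2(unsigned long x)
{
    int i;
    unsigned long limit;

    for (i = 0, limit = 1; limit < x; i++, limit <<= 1)
        ;
    return i;
}

int main(void)
{
    unsigned int vals[] = {1, 2, 3, 8, 12, 1000};
    int n = (int) (sizeof(vals) / sizeof(vals[0]));

    for (int i = 0; i < n; i++)
        printf("x=%4u  ffs=%d  fls=%d  ceil_log2=%d\n",
               vals[i], lowest_set_bit(vals[i]),
               highest_set_bit(vals[i]), ceil_log2(vals[i]));
    return 0;
}

For non-powers of two the highest-set-bit position and the ceiling log2
coincide, while for exact powers of two they differ by one, which is probably
why the three are so easy to mix up.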

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning

2016-05-09 Thread Ashutosh Bapat
Hi Amit,
I am trying multi-column/expression partitions.

create table t1_multi_col (a int, b int) partition by range (a, b);
create table t1_mc_p1 partition of t1_multi_col for values start (1, 200)
end (100, 300);
create table t1_mc_p2 partition of t1_multi_col for values start (200, 1)
end (300, 100);
insert into t1_multi_col values (1, 250);
insert into t1_multi_col values (250, 1);
insert into t1_multi_col values (100, 100);
select tableoid::regclass, * from t1_multi_col;
 tableoid |  a  |  b
----------+-----+-----
 t1_mc_p1 |   1 | 250
 t1_mc_p1 | 100 | 100
 t1_mc_p2 | 250 |   1
The row (100, 100) landed in t1_mc_p1, whose partition bounds are (1, 200)
and (100, 300), and which therefore should not accept a row with b = 100.
It looks like the binary search got confused by the reversed order of the
ranges (should that be allowed?).

The semantics of multiple columns for ranges (maybe for list as well) look
confusing. The current scheme doesn't allow overlapping ranges for one of
the partitioning keys even if the combined range is non-overlapping.
create table t1_multi_col (a int, b int) partition by range (a, b);
create table t1_mc_p1 partition of t1_multi_col for values start (1, 100)
end (100, 200);
create table t1_mc_p2 partition of t1_multi_col for values start (1, 200)
end (100, 300);
ERROR:  new partition's range overlaps with that of partition "t1_mc_p1" of
"t1_multi_col"
HINT:  Please specify a range that does not overlap with any existing
partition's range.
create table t1_mc_p2 partition of t1_multi_col for values start (1, 300)
end (100, 400);
ERROR:  new partition's range overlaps with that of partition "t1_mc_p1" of
"t1_multi_col"
HINT:  Please specify a range that does not overlap with any existing
partition's range.

That would be better realised using subpartitioning on b. The question is:
if one column's value is enough to identify the partition (since partitions
cannot contain overlapping values for that column), why do we need multiple
columns/expressions as partition keys? IIUC, all the other column does is
disallow a certain range of values for that column, which could better be
done by a CHECK constraint. It looks like Oracle looks at the combined range
and not just at one column.
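
For comparison, here is a small standalone sketch (mine, not the patch's code)
of what "combined range" (row-wise, lexicographic) comparison of (a, b)
against the bounds looks like; under that reading (100, 100) does fall between
(1, 200) and (100, 300), which may be what the binary search above is
effectively doing:

/* Standalone sketch: row-wise (lexicographic) comparison of two-column keys. */
#include <stdio.h>

typedef struct RowKey { int a; int b; } RowKey;

/* Compare keys column by column, like (a, b) < (c, d) row-wise in SQL. */
static int rowkey_cmp(RowKey x, RowKey y)
{
    if (x.a != y.a)
        return (x.a < y.a) ? -1 : 1;
    if (x.b != y.b)
        return (x.b < y.b) ? -1 : 1;
    return 0;
}

/* Half-open range check: lower <= key < upper, row-wise. */
static int in_range(RowKey key, RowKey lower, RowKey upper)
{
    return rowkey_cmp(lower, key) <= 0 && rowkey_cmp(key, upper) < 0;
}

int main(void)
{
    RowKey p1_lo = {1, 200}, p1_hi = {100, 300};   /* t1_mc_p1 bounds */
    RowKey p2_lo = {200, 1}, p2_hi = {300, 100};   /* t1_mc_p2 bounds */
    RowKey rows[] = {{1, 250}, {250, 1}, {100, 100}};
    int n = (int) (sizeof(rows) / sizeof(rows[0]));

    for (int i = 0; i < n; i++)
        printf("(%d, %d): in p1=%d  in p2=%d\n", rows[i].a, rows[i].b,
               in_range(rows[i], p1_lo, p1_hi),
               in_range(rows[i], p2_lo, p2_hi));
    return 0;
}

Whether that row-wise interpretation or per-column containment is the
semantics we want is exactly the question.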


On Thu, Apr 21, 2016 at 7:35 AM, Amit Langote  wrote:

>
> Hi Ildar,
>
> On 2016/04/21 1:06, Amit Langote wrote:
> > On Wed, Apr 20, 2016 at 11:46 PM, Ildar Musin 
> wrote:
> >> Crash occurs in get_check_expr_from_partbound(). It seems that function
> is
> >> not yet expecting an expression key and designed to handle only simple
> >> attributes keys. Backtrace:
> >
> > You're right, silly mistake. :-(
> >
> > Will fix
>
> Attached updated version fixes this.  I'll take time to send the next
> version but I'd very much appreciate it if you keep reporting anything
> that doesn't look/work right like you did so far.
>
> Thanks,
> Amit
>



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Stopping logical replication protocol

2016-05-09 Thread Vladimir Gordiychuk
>
> What's your PostgreSQL community username?

gordiychuk

It seems like what you're also trying to allow interruption deeper than
> that, when we're in the middle of processing a reorder buffer commit record
> and streaming that to the client. You're introducing an is_active member
> (actually a callback, though name suggests it's a flag) in struct
> ReorderBuffer to check whether a CopyDone is received, and you're skipping
> ReorderBuffer commit processing when the callback returns false. The
> callback returns "!streamingDoneReceiving && !streamingDoneSending" i.e.
> it's false if either end has sent CopyDone. streamingDoneSending and
> streamingDoneSending are only set in ProcessRepliesIfAny, called by
> WalSndLoop and WalSndWaitForWal. So the idea is, presumably, that if we're
> waiting for WAL from XLogSendLogical we skip processing of any commit
> records and exit.
>
> That seems overcomplicated.
>
> When WalSndWaitForWAL is called
> by logical_read_xlog_page, logical_read_xlog_page can just test
> streamingDoneReceiving and streamingDoneSending. If they're set it can skip
> the page read and return -1, which will cause the xlogreader to return a
> null record to XLogSendLogical. That'll skip the decoding calls and return
> to WalSndLoop, where we'll notice it's time to exit.
>

ProcessRepliesIfAny now also executes in WalSndWriteData, because while
sending data we should also check for messages from the client (the client
can send CopyDone, KeepAlive, Terminate).

@@ -1086,14 +1089,6 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
 	memcpy(&ctx->out->data[1 + sizeof(int64) + sizeof(int64)],
 		   tmpbuf.data, sizeof(int64));
 
-	/* fast path */
-	/* Try to flush pending output to the client */
-	if (pq_flush_if_writable() != 0)
-		WalSndShutdown();
-
-	if (!pq_is_send_pending())
-		return;
-


The main idea is that we can get CopyDone from the client in the following
functions: WalSndLoop, WalSndWaitForWal, WalSndWriteData. All of these
methods can take a long time: WalSndWaitForWal can wait for a new
transaction, which on an inactive database can take quite a while, and
WalSndWriteData can send a big transaction, which also leads to ignoring
messages from the client for a long time (in my example above, for 1 million
object changes, the walsender ignores messages for 13 seconds and does not
allow the connection to be reused). When the client sends CopyDone, it does
not want to receive any more messages for the current LSN. For example,
physical replication can be interrupted in the middle of a transaction that
affects more than one LSN.
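
To illustrate the long-send problem, a standalone simulation (invented names,
not walsender code): unless the sender polls for client input between the
messages of one big decoded transaction, a CopyDone sent mid-stream is only
noticed after everything has been sent.

/* Standalone simulation: polling for a client "CopyDone" between messages
 * of one large decoded transaction, instead of only before/after it. */
#include <stdbool.h>
#include <stdio.h>

static bool copy_done_received = false;

/* Stand-in for a non-blocking check of the client socket. */
static void poll_client_input(int msgno)
{
    if (msgno == 5)             /* pretend the client sent CopyDone here */
        copy_done_received = true;
}

static void stream_transaction(int nchanges)
{
    for (int i = 0; i < nchanges; i++)
    {
        poll_client_input(i);   /* check between messages, not only at the end */
        if (copy_done_received)
        {
            printf("CopyDone seen after %d of %d changes, stopping early\n",
                   i, nchanges);
            return;
        }
        /* ... send change i to the client ... */
    }
    printf("streamed all %d changes\n", nchanges);
}

int main(void)
{
    stream_transaction(1000000);
    return 0;
}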

Maybe I do not understand the documentation correctly, but I want to reuse
the same connection without reopening it, because opening a new connection
takes too long. Is that a correct use case, or is CopyDone just a side effect
of the copy protocol, and does completing replication always require a
Terminate packet and reopening the connection?


[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618

2016-05-09 Thread Michael Paquier
On Tue, Mar 22, 2016 at 1:56 PM, Amit Kapila  wrote:
> So as far as I can see there are two ways to resolve this issue, one is to
> retry generation of dsm name if CreateFileMapping returns EACCES and second
> is to append data_dir name to dsm name as the same is done for main shared
> memory, that will avoid the error to occur.  First approach has minor flaw
> that if CreateFileMapping returns EACCES due to reason other then duplicate
> dsm name which I am not sure is possible to identify, then we should report
> error instead try to regenerate the name
>
> Robert and or others, can you share your opinion on what is the best way to
> proceed for this issue.

For my 2c here, the approach using GetSharedMemName to identify the
origin of a dynamic shared memory segment looks more solid in terms of
robustness and collision handling. Retrying a segment is never going
to be completely water-proof.
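
As a rough illustration of the data-directory approach (the names and the
name format below are my assumptions, not the server's GetSharedMemName),
the idea is to make the segment name a function of the data directory as
well as the handle, so that two clusters on the same host cannot generate
the same name:

/* Standalone sketch: deriving a per-cluster segment name by mixing the
 * data directory path into the identifier, so clusters cannot collide. */
#include <stdint.h>
#include <stdio.h>

/* FNV-1a, just to get a stable number out of the data directory path. */
static uint32_t hash_string(const char *s)
{
    uint32_t h = 2166136261u;

    for (; *s; s++)
    {
        h ^= (unsigned char) *s;
        h *= 16777619u;
    }
    return h;
}

static void make_segment_name(char *buf, size_t len,
                              const char *datadir, uint32_t handle)
{
    snprintf(buf, len, "Global/PostgreSQL.%lu.%lu",
             (unsigned long) hash_string(datadir), (unsigned long) handle);
}

int main(void)
{
    char name1[64], name2[64];

    make_segment_name(name1, sizeof(name1), "C:/pgdata/cluster1", 851401618);
    make_segment_name(name2, sizeof(name2), "C:/pgdata/cluster2", 851401618);
    printf("%s\n%s\n", name1, name2);   /* same handle, different names */
    return 0;
}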
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_shmem_allocations view

2016-05-09 Thread Michael Paquier
On Fri, May 6, 2016 at 8:01 AM, Andres Freund  wrote:
> Here's a rebased version.  I remember why I didn't call the column
> "offset" (as Michael complained about upthread), it's a keyword...

Oh, an old patch resurrected from the dead... Reading the patch again:

+* Increase number of initilized slots if we didn't reuse a previously
+* used one.
s/initilized/initialized

+   Number of backends attached to this segment (1 signals a moribund
+   entry, 2 signals one user, ...).
moribund? Or target for removal.

REVOKE ALL .. FROM PUBLIC;
REVOKE EXECUTE .. ON FUNCTION FROM PUBLIC;
Revoking access to those views and execution of their underlying functions
would be a good thing I think; they can give hints about system
activity, particularly for DSM segments.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers