Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Fabien COELHO


Hello Josh,


So I think that you're confusing the roles of bgwriter vs. spread
checkpoint.  What you're experiencing above is pretty common for
nonspread checkpoints on slow storage (and RAID5 is slow for DB updates,
no matter how fast the disks are), or for attempts to do spread
checkpoint on filesystems which don't support it (e.g. Ext3, HFS+).  In
either case, what's happening is that the *OS* is freezing all logical
and physical IO while it works to write out all of RAM, which makes me
suspect you're using Ext3 or HFS+.


I'm using ext4 on debian wheezy with postgresql 9.4b2.

I agree that the OS may be able to help, but this aspect does not 
currently work for me at all out of the box. The "all of RAM" is really a 
few thousand 8 kB pages written randomly, a few dozen MB.


Also, if pg needs advanced OS tweaking to handle a small load, ISTM that 
it fails at simplicity:-(


As for checkpoint spreading, raising checkpoint_completion_target to 0.9 
degrades the situation (20% of transactions are more than 200 ms late 
instead of 10%, and bgwriter wrote less than 1 page per second, on a 500 s 
run). So maybe there is a bug here somewhere.



Making the bgwriter more aggressive adds a significant risk of writing
the same pages multiple times between checkpoints, so it's not a simple fix.


Hmmm... This must be balanced with the risk of being offline. Not all 
people are interested in throughput at the price of latency, so there 
could be settings that help latency, even at the price of reduced 
throughput (average tps). After that, it is the administrator's choice to 
set pg for higher throughput or lower latency.


Note that writing some least recently used page multiple times does not 
seem to be an issue at all for me under small/medium load, especially as 
the system has nothing else to do: if you have nothing else to do, there 
is no cost in writing a page, even if you may have to write it again some 
time later, and it helps prevent dirty-page accumulation. So it seems to 
me that pg can help; it is not only/merely an OS issue.


--
Fabien.




Re: [HACKERS] Escaping from blocked send() reprised.

2014-08-26 Thread Kyotaro HORIGUCHI
Sorry, I was absorbed by other tasks.

Thank you for reviewing this.

 On 07/01/2014 06:26 AM, Kyotaro HORIGUCHI wrote:
  At Mon, 30 Jun 2014 11:27:47 -0400, Robert Haas
  robertmh...@gmail.com wrote in
  CA+TgmoZfcGzAEmtbyoCe6VdHnq085x+ox752zuJ2AKN=wc8...@mail.gmail.com
  1. I think it's the case that there are platforms around where a
  signal won't cause send() to return EINTR and I'd be entirely
  unsurprised if SSL_write() doesn't necessarily return EINTR in that
  case.  I'm not sure what, if anything, we can do about that.
 
 We use a custom write routine with SSL_write, where we call send()
 ourselves, so that's not a problem as long as we put the check in the
 right place (in secure_raw_write(), after my recent SSL refactoring -
 the patch needs to be rebased).
 
  man 2 send on FreeBSD has no description of EINTR. And even
  on Linux, send() won't return EINTR in most cases; at least I
  haven't seen it. So send() = -1 with EINTR seems to me to be merely
  an equivalent of send() = 0. I have no idea what difference the
  implementer had in mind.
 
 As the patch stands, there's a race condition: if the SIGTERM arrives
 *before* the send() call, the send() won't return EINTR anyway. So
 there's a chance that you still block. Calling pq_terminate_backend()
 again will dislodge it (assuming send() returns with EINTR on signal),

Yes, that window wouldn't be eliminated without introducing
something more. EINTR is set only when nothing was sent by the
call, so AFAICS the chance of getting EINTR is far smaller than
expected.

 but I don't think we want to define the behavior as "usually,
 pq_terminate_backend() will kill a backend that's blocked on sending
 to the client, but sometimes you have to call it twice (or more!) to
 really kill it".

I agree that avoiding that would be desirable behavior, if there is
any measure to avoid it. But I think it's still better than doing
kill -9 and engulfing all the innocent backends.

 A more robust way is to set ImmediateInterruptOK before calling
 send(). That wouldn't let you send data that can be sent without
 blocking though. For that, you could put the socket to non-blocking
 mode, and sleep with select(), also waiting for the process' latch at
 the same time (die() sets the latch, so that will wake up the select()
 if a termination request arrives).

I considered it, but select() frequently (rather, in most cases when
send() blocks because of send-buffer exhaustion) fails to predict that
the following send() will block (if my memory is correct).  So
the final problem would be a blocked send()...

 Is it actually safe to process the die-interrupt where send() is
 called? ProcessInterrupts() does ereport(FATAL, ...), which will
 attempt to send a message to the client. If that happens in the middle
 of constructing some other message, that will violate the protocol.

So I strongly agree with you, if select() works the way the man
page gives the impression it does.

  2. I think it would be reasonable to try to kill off the connection
  without notifying the client if we're unable to send the data to the
  client in a reasonable period of time.  But I'm unsure what a
  reasonable period of time means.  This patch would basically do it
  after no delay at all, which seems like it might be too aggressive.
  However, I'm not sure.
 
  I think there's no such reasonable time.
 
 I agree it's pretty hard to define any reasonable timeout here. I
 think it would be fine to just cut the connection; even if you don't
 block while sending, you'll probably reach a CHECK_FOR_INTERRUPT()
 somewhere higher in the stack and kill the connection almost as
 abruptly anyway. (you can't violate the protocol, however)

Yes, closing the blocked connection seems one of the smarter
ways; checking which interrupt occurred could avoid a protocol
violation. But the problem with that is that there seems to be no
means to close the socket from anywhere other than the blocking
handle; a dup(2)'ed handle cannot release the resource by itself.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Jeff Janes
On Monday, August 25, 2014, Fabien COELHO coe...@cri.ensmp.fr wrote:


 I have not found any means to force bgwriter to send writes when it can.
 (Well, I have: create a process which sends CHECKPOINT every 0.2
 seconds... it works more or less, but this is not my point:-)


There is scan_whole_pool_milliseconds, which currently forces bgwriter to
circle the buffer pool at least once every 2 minutes.  It is currently
fixed, but it should be trivial to turn it into an experimental guc that
you could use to test your hypothesis.
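
(For reference, the minimum-progress computation that constant feeds is roughly
the following; a standalone sketch with made-up shared_buffers/bgwriter_delay
values, the real logic living in BgBufferSync() in bufmgr.c.)

/* rough illustration of the bgwriter's minimum-progress math; the
 * NBuffers/BgWriterDelay values below are assumptions, not anyone's
 * actual configuration. */
#include <stdio.h>

int
main(void)
{
	int		NBuffers = 16384;				/* shared_buffers = 128 MB */
	int		BgWriterDelay = 200;			/* bgwriter_delay, in ms */
	double	scan_whole_pool_ms = 120000.0;	/* hard-coded 2 minutes today */
	int		min_scan_buffers;

	/* slice the pool into one section per bgwriter round */
	min_scan_buffers = (int) (NBuffers / (scan_whole_pool_ms / BgWriterDelay));
	printf("bgwriter scans at least %d buffers per %d ms round\n",
		   min_scan_buffers, BgWriterDelay);

	/* what lowering the constant (e.g. via an experimental GUC) would do */
	scan_whole_pool_ms = 1000.0;
	min_scan_buffers = (int) (NBuffers / (scan_whole_pool_ms / BgWriterDelay));
	printf("with 1000 ms: at least %d buffers per round\n", min_scan_buffers);
	return 0;
}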

Cheers,

Jeff


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Fabien COELHO


[oops, wrong from, resent...]

Hello Jeff,


The culprit I found is bgwriter, which is basically doing nothing to
prevent the coming checkpoint IO storm, even though there would be ample
time to write the accumulating dirty pages so that checkpoint would find a
clean field and pass in a blink. Indeed, at the end of the 500 seconds
throttled test, pg_stat_bgwriter says:


Are you doing pg_stat_reset_shared('bgwriter') after running pgbench -i?


Yes, I did.


You don't want your steady state stats polluted by the bulk load.


Sure!


  buffers_checkpoint = 19046
  buffers_clean = 2995


Out of curiosity, what does buffers_backend show?


buffers_backend = 157


In any event, this almost certainly is a red herring.


Possibly. It is pretty easy to reproduce, though.

Whichever of the three ways is being used to write out the buffers, it 
is the checkpointer that is responsible for fsyncing them, and that is 
where your drama is almost certainly occurring. Writing out via one 
path rather than another isn't going to change things, unless you 
change the fsync.


Well, I agree partially. ISTM that the OS does not need to wait for fsync 
to start writing pages if it is receiving one minute of buffer writes at 
50 writes per second; I would have thought that some scheduler would 
start handling the flow before fsync... So I thought that if bgwriter were 
to write the buffers it would help, but maybe there is a better way.



Also, are you familiar with checkpoint_completion_target, and what is it
set to?


The default 0.5. Moving to 0.9 seems to worsen the situation.

--
Fabien.




Re: [HACKERS] Escaping from blocked send() reprised.

2014-08-26 Thread Heikki Linnakangas

On 08/26/2014 09:17 AM, Kyotaro HORIGUCHI wrote:

but I don't think we want to define the behavior as "usually,
pq_terminate_backend() will kill a backend that's blocked on sending
to the client, but sometimes you have to call it twice (or more!) to
really kill it".


I agree that avoiding that would be desirable behavior, if there is
any measure to avoid it. But I think it's still better than doing
kill -9 and engulfing all the innocent backends.


A more robust way is to set ImmediateInterruptOK before calling
send(). That wouldn't let you send data that can be sent without
blocking though. For that, you could put the socket to non-blocking
mode, and sleep with select(), also waiting for the process' latch at
the same time (die() sets the latch, so that will wake up the select()
if a termination request arrives).


I considered it, but select() frequently (rather, in most cases when
send() blocks because of send-buffer exhaustion) fails to predict that
the following send() will block (if my memory is correct).  So
the final problem would be a blocked send()...


My point was to put the socket in non-blocking mode, so that send() will 
return immediately with EAGAIN instead of blocking, if the send buffer 
is full. See WalSndWriteData for how that would work, it does something 
similar.
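
(For illustration only, a minimal POSIX-level sketch of that pattern, with
poll() and a hypothetical terminate_requested flag standing in for the latch
and die() machinery; the backend itself would use the latch and the
secure_raw_write path instead.)

/* Illustrative only: non-blocking send loop in plain POSIX terms.  The
 * socket is assumed to be in O_NONBLOCK mode; terminate_requested is a
 * hypothetical flag set by a signal handler. */
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <sys/socket.h>
#include <sys/types.h>

static volatile sig_atomic_t terminate_requested = 0;

ssize_t
send_interruptible(int sock, const char *buf, size_t len)
{
	size_t		sent = 0;

	while (sent < len)
	{
		ssize_t		rc = send(sock, buf + sent, len - sent, 0);

		if (rc > 0)
		{
			sent += rc;
			continue;
		}
		if (rc < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
		{
			struct pollfd pfd = {sock, POLLOUT, 0};

			/* send buffer full: re-check the flag before sleeping */
			if (terminate_requested)
				return -1;		/* caller decides if closing is protocol-safe */
			(void) poll(&pfd, 1, 1000);		/* wait until writable, or time out */
			continue;
		}
		if (rc < 0 && errno == EINTR)
			continue;			/* interrupted before anything was sent */
		return -1;				/* hard error */
	}
	return (ssize_t) sent;
}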



Is it actually safe to process the die-interrupt where send() is
called? ProcessInterrupts() does ereport(FATAL, ...), which will
attempt to send a message to the client. If that happens in the middle
of constructing some other message, that will violate the protocol.


So I strongly agree with you, if select() works the way the man
page gives the impression it does.


Not sure what you mean, but the above is a fatal problem with the patch 
right now, regardless of how you do the sleeping.



2. I think it would be reasonable to try to kill off the connection
without notifying the client if we're unable to send the data to the
client in a reasonable period of time.  But I'm unsure what a
reasonable period of time means.  This patch would basically do it
after no delay at all, which seems like it might be too aggressive.
However, I'm not sure.


I think there's no such reasonable time.


I agree it's pretty hard to define any reasonable timeout here. I
think it would be fine to just cut the connection; even if you don't
block while sending, you'll probably reach a CHECK_FOR_INTERRUPT()
somewhere higher in the stack and kill the connection almost as
abruptly anyway. (you can't violate the protocol, however)


Yes, closing the blocked connection seems one of the smarter
ways; checking which interrupt occurred could avoid a protocol
violation. But the problem with that is that there seems to be no
means to close the socket from anywhere other than the blocking
handle; a dup(2)'ed handle cannot release the resource by itself.


I didn't understand that, surely you can just close() the socket? There 
is no dup(2) involved. And we don't necessarily need to close the 
socket, we just need to avoid writing to it when we're already in the 
middle of sending a message.


I'm marking this as Waiting on Author in the commitfest app, because:
1. the protocol violation needs to be avoided one way or another, and
2. the behavior needs to be consistent so that a single 
pg_terminate_backend() is enough to always kill the connection.


- Heikki





Re: [HACKERS] Verbose output of pg_dump not show schema name

2014-08-26 Thread Michael Paquier
On Tue, Aug 26, 2014 at 3:48 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 AFAICS, the namespace can never be NULL in any of these. There is a
 selectSourceSchema(fout, tbinfo->dobj.namespace->dobj.name) call before or
 after printing the message, so if tbinfo->dobj.namespace is NULL, you'll
 crash anyway. Please double-check, and remove the dead code if you agree.
Ah right, this field is used in many places. Even for
pg_backup_archiver.c, the portion of code processing data always has
the namespace set. I am sure that Fabrizio would have done that
quickly, but as I was on this thread I simplified the patch as
attached.
Regards,
-- 
Michael
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 3aebac8..7c0616d 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -546,8 +546,13 @@ RestoreArchive(Archive *AHX)
 		/* Both schema and data objects might now have ownership/ACLs */
 		if ((te->reqs & (REQ_SCHEMA | REQ_DATA)) != 0)
 		{
-			ahlog(AH, 1, "setting owner and privileges for %s %s\n",
-				  te->desc, te->tag);
+			/* Show namespace if available */
+			if (te->namespace)
+				ahlog(AH, 1, "setting owner and privileges for %s \"%s\".\"%s\"\n",
+					  te->desc, te->namespace, te->tag);
+			else
+				ahlog(AH, 1, "setting owner and privileges for %s \"%s\"\n",
+					  te->desc, te->tag);
 			_printTocEntry(AH, te, ropt, false, true);
 		}
 	}
@@ -621,7 +626,13 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te,
 
 	if ((reqs & REQ_SCHEMA) != 0)		/* We want the schema */
 	{
-		ahlog(AH, 1, "creating %s %s\n", te->desc, te->tag);
+		/* Show namespace if available */
+		if (te->namespace)
+			ahlog(AH, 1, "creating %s \"%s\".\"%s\"\n",
+				  te->desc, te->namespace, te->tag);
+		else
+			ahlog(AH, 1, "creating %s \"%s\"\n", te->desc, te->tag);
+
 
 		_printTocEntry(AH, te, ropt, false, false);
 		defnDumped = true;
@@ -713,8 +724,9 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te,
 	_becomeOwner(AH, te);
 	_selectOutputSchema(AH, te->namespace);
 
-	ahlog(AH, 1, "processing data for table \"%s\"\n",
-		  te->tag);
+	/* Show namespace if available */
+	ahlog(AH, 1, "processing data for table \"%s\".\"%s\"\n",
+		  te->namespace, te->tag);
 
 	/*
 	 * In parallel restore, if we created the table earlier in
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 5c0f95f..c084ee9 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -1400,7 +1400,8 @@ dumpTableData_copy(Archive *fout, void *dcontext)
 	const char *column_list;
 
 	if (g_verbose)
-		write_msg(NULL, "dumping contents of table %s\n", classname);
+		write_msg(NULL, "dumping contents of table \"%s\".\"%s\"\n",
+				  tbinfo->dobj.namespace->dobj.name, classname);
 
 	/*
 	 * Make sure we are in proper schema.  We will qualify the table name
@@ -5019,7 +5020,8 @@ getIndexes(Archive *fout, TableInfo tblinfo[], int numTables)
 			continue;
 
 		if (g_verbose)
-			write_msg(NULL, "reading indexes for table \"%s\"\n",
+			write_msg(NULL, "reading indexes for table \"%s\".\"%s\"\n",
+					  tbinfo->dobj.namespace->dobj.name,
 					  tbinfo->dobj.name);
 
 		/* Make sure we are in proper schema so indexdef is right */
@@ -5385,7 +5387,8 @@ getConstraints(Archive *fout, TableInfo tblinfo[], int numTables)
 			continue;
 
 		if (g_verbose)
-			write_msg(NULL, "reading foreign key constraints for table \"%s\"\n",
+			write_msg(NULL, "reading foreign key constraints for table \"%s\".\"%s\"\n",
+					  tbinfo->dobj.namespace->dobj.name,
 					  tbinfo->dobj.name);
 
 		/*
@@ -5723,7 +5726,8 @@ getTriggers(Archive *fout, TableInfo tblinfo[], int numTables)
 			continue;
 
 		if (g_verbose)
-			write_msg(NULL, "reading triggers for table \"%s\"\n",
+			write_msg(NULL, "reading triggers for table \"%s\".\"%s\"\n",
+					  tbinfo->dobj.namespace->dobj.name,
 					  tbinfo->dobj.name);
 
 		/*
@@ -6336,8 +6340,9 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 		 * the output of an indexscan on pg_attribute_relid_attnum_index.
 		 */
 		if (g_verbose)
-			write_msg(NULL, "finding the columns and types of table \"%s\"\n",
-					  tbinfo->dobj.name);
+			write_msg(NULL, "finding the columns and types of table \"%s\".\"%s\"\n",
+					  tbinfo->dobj.namespace->dobj.name,
+					  tbinfo->dobj.name);
 
 		resetPQExpBuffer(q);
 
@@ -6548,8 +6553,9 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 			int			numDefaults;
 
 			if (g_verbose)
-				write_msg(NULL, "finding default expressions of table \"%s\"\n",
-						  tbinfo->dobj.name);
+				write_msg(NULL, "finding default expressions of table \"%s\".\"%s\"\n",
+						  tbinfo->dobj.namespace->dobj.name,
+						  tbinfo->dobj.name);
 
 			resetPQExpBuffer(q);
 			if (fout->remoteVersion >= 70300)
@@ -6672,7 +6678,8 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 			int			numConstrs;
 
 			if (g_verbose)
-				write_msg(NULL, "finding check constraints for table \"%s\"\n",
+				write_msg(NULL, "finding check constraints for table 

Re: [HACKERS] Escaping from blocked send() reprised.

2014-08-26 Thread Kyotaro HORIGUCHI
Hello,

  I considered it, but select() frequently (rather, in most cases when
  send() blocks because of send-buffer exhaustion) fails to predict that
  the following send() will block (if my memory is correct).  So
  the final problem would be a blocked send()...
 
 My point was to put the socket in non-blocking mode, so that send()
 will return immediately with EAGAIN instead of blocking, if the send
 buffer is full. See WalSndWriteData for how that would work, it does
 something similar.

I confused it with what I did while writing this patch: select()
-> *blocking* send(). Sorry for confusing the discussion. I now
understand correctly what you mean, and it sounds reasonable.

  I agree it's pretty hard to define any reasonable timeout here. I
  think it would be fine to just cut the connection; even if you don't
  block while sending, you'll probably reach a CHECK_FOR_INTERRUPT()
  somewhere higher in the stack and kill the connection almost as
  abruptly anyway. (you can't violate the protocol, however)
 
  Yes, closing the blocked connection seems one of the smarter
  ways; checking which interrupt occurred could avoid a protocol
  violation. But the problem with that is that there seems to be no
  means to close the socket from anywhere other than the blocking
  handle; a dup(2)'ed handle cannot release the resource by itself.
 
 I didn't understand that, surely you can just close() the socket?
 There is no dup(2) involved. And we don't necessarily need to close
 the socket, we just need to avoid writing to it when we're already in
 the middle of sending a message.

My assumption there was select() plus a *blocking* send(), so the
cited sentence was beside the point from the start.

 I'm marking this as Waiting on Author in the commitfest app, because:
 1. the protocol violation needs to be avoided one way or another, and
 2. the behavior needs to be consistent so that a single
 pg_terminate_backend() is enough to always kill the connection.

Thank you for the suggestion. I think I can go forward with that
and will come up with a new patch.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




[HACKERS] Commitfest status

2014-08-26 Thread Heikki Linnakangas

The first week of the commitfest is now behind us.

There are still 15 patches in Needs Review state, with no reviewer 
assigned. Please pick a patch and review!


There are 20 patches in Needs Review state, with a reviewer assigned. 
If you have signed up as the reviewer, please proceed with the review.


If you have submitted a patch for this commitfest, please check the 
status of your patch. If it is in Waiting for author state, you are 
expected to submit a new version of the patch, addressing any review 
comments you have received. If you don't have the time to update your 
patch in the next few days, please mark your patch as Returned with 
Feedback and resubmit for the next commitfest. If your patch is in 
Waiting on Author state, but you don't know what you should do to it, 
ask for clarification.


Committers: Please pick a patch that's been marked as Ready for 
Committer, verify that it has been adequately reviewed, and proceed to 
commit or bounce it back.


- Heikki




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Fabien COELHO


Hello Amit,


I think another thing to know here is why exactly checkpoint
storm is causing tps to drop so steeply.


Yep. Actually it is not strictly 0, but a few tps that I rounded to 0.

  progress: 63.0 s, 47.0 tps, lat 2.810 ms stddev 5.194, lag 0.354 ms
  progress: 64.1 s, 11.9 tps, lat 81.481 ms stddev 218.026, lag 74.172 ms
  progress: 65.2 s, 1.9 tps, lat 950.455 ms stddev 125.602, lag 1421.203 ms
  progress: 66.1 s, 4.5 tps, lat 604.790 ms stddev 440.073, lag 2418.128 ms
  progress: 67.2 s, 6.0 tps, lat 322.440 ms stddev 68.276, lag 3146.302 ms
  progress: 68.0 s, 2.4 tps, lat 759.509 ms stddev 62.141, lag 4229.239 ms
  progress: 69.4 s, 3.6 tps, lat 440.335 ms stddev 369.207, lag 4842.371 ms

Transactions are 4.8 seconds behind schedule at this point.

One reason could be that backends might need to write more WAL due to 
full_page_writes, another could be contention around the buffer 
content_lock. To dig more into the reason, the same tests can be tried 
with full_page_writes = off and/or synchronous_commit = off, to see 
if WAL writes are causing tps to go down.


Given the small flow of updates, I do not think that there should be any 
reason to get that big a write contention between WAL & checkpoint.


I tried with full_page_writes = off for 500 seconds: same overall 
behavior, 8.5% of transactions are stuck (instead of 10%). However, in 
detail, pg_stat_bgwriter is quite different:


  buffers_checkpoint = 13906
  buffers_clean = 20748
  buffers_backend = 472

That seems to suggest that bgwriter did some stuff for once, but that did 
not change the result much in the end. This would imply that my suggestion 
to make bgwriter write more would not fix the problem alone.


With synchronous_commit = off, the situation is much improved, with only 
0.3% of transactions stuck. Not a surprise. However, I would not recommend 
that as a solution:-)


Currently, the only way I was able to solve the issue while still 
writing to disk is to send CHECKPOINT every 0.2s, as if I had set 
checkpoint_timeout = 0.2s (although this is not currently allowed).
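
(For what it's worth, that workaround process is nothing more than a loop like
the following libpq sketch; the connection string is a placeholder.)

/* Crude workaround sketch: issue CHECKPOINT every 0.2 s through libpq. */
#include <stdio.h>
#include <unistd.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=pgbench");	/* placeholder conninfo */

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}
	for (;;)
	{
		PGresult   *res = PQexec(conn, "CHECKPOINT");

		if (PQresultStatus(res) != PGRES_COMMAND_OK)
			fprintf(stderr, "CHECKPOINT failed: %s", PQerrorMessage(conn));
		PQclear(res);
		usleep(200 * 1000);		/* 0.2 s, mimicking checkpoint_timeout = 0.2s */
	}
}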



Similarly for checkpoints, use  checkpoint_completion_target to
spread the checkpoint_writes as suggested by Jeff as well to see
if that can mitigate the problem.


I had already tried, and retried after Jeff suggestion. This does not seem 
to mitigate anything, on the contrary.


--
Fabien.




Re: [HACKERS] pg_receivexlog and replication slots

2014-08-26 Thread Michael Paquier
On Tue, Aug 19, 2014 at 2:49 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Mon, Aug 18, 2014 at 4:01 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Mon, Aug 18, 2014 at 3:48 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Mon, Aug 18, 2014 at 2:38 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 And now looking at your patch there is additional refactoring possible
 with IDENTIFY_SYSTEM and pg_basebackup as well...
 And attached is a rebased patch doing so.
I reworked this patch a bit to take into account the fact that the
number of columns to check after running IDENTIFY_SYSTEM is not always
4 as pointed out by Andres:
- pg_basebackup = 3
- pg_receivexlog = 3
- pg_recvlogical = 4

Regards,
-- 
Michael
From c54c987f34fbdc7d7cdf59af114a247029be532e Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Mon, 18 Aug 2014 14:32:44 +0900
Subject: [PATCH] Add support for physical slot creation/deletion in
 pg_receivexlog

Physical slot creation can be done with --create and drop with --drop.
In both cases --slot is needed. Code for replication slot creation and
drop is refactored with what was available in pg_recvlogical, the same
applied with IDENTIFY_SYSTEM to simplify the whole set.
---
 doc/src/sgml/ref/pg_receivexlog.sgml   |  29 +++
 src/bin/pg_basebackup/pg_basebackup.c  |  21 +
 src/bin/pg_basebackup/pg_receivexlog.c | 108 +++-
 src/bin/pg_basebackup/pg_recvlogical.c | 119 --
 src/bin/pg_basebackup/streamutil.c | 150 +
 src/bin/pg_basebackup/streamutil.h |  10 +++
 6 files changed, 279 insertions(+), 158 deletions(-)

diff --git a/doc/src/sgml/ref/pg_receivexlog.sgml b/doc/src/sgml/ref/pg_receivexlog.sgml
index 5916b8f..7e46005 100644
--- a/doc/src/sgml/ref/pg_receivexlog.sgml
+++ b/doc/src/sgml/ref/pg_receivexlog.sgml
@@ -72,6 +72,35 @@ PostgreSQL documentation
   <title>Options</title>
 
    <para>
+    <application>pg_receivexlog</application> can run in one of two following
+    modes, which control physical replication slot:
+
+    <variablelist>
+
+     <varlistentry>
+      <term><option>--create</option></term>
+      <listitem>
+       <para>
+        Create a new physical replication slot with the name specified in
+        <option>--slot</option>, then exit.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><option>--drop</option></term>
+      <listitem>
+       <para>
+        Drop the replication slot with the name specified
+        in <option>--slot</option>, then exit.
+       </para>
+      </listitem>
+     </varlistentry>
+    </variablelist>
+
+   </para>
+
+   <para>
 The following command-line options control the location and format of the
 output.
 
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 8b9acea..c5d9ec2 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -1569,8 +1569,8 @@ BaseBackup(void)
 {
 	PGresult   *res;
 	char	   *sysidentifier;
-	uint32		latesttli;
-	uint32		starttli;
+	TimeLineID	latesttli;
+	TimeLineID	starttli;
 	char	   *basebkp;
 	char		escaped_label[MAXPGPATH];
 	char	   *maxrate_clause = NULL;
@@ -1624,23 +1624,8 @@ BaseBackup(void)
 	/*
 	 * Run IDENTIFY_SYSTEM so we can get the timeline
 	 */
-	res = PQexec(conn, "IDENTIFY_SYSTEM");
-	if (PQresultStatus(res) != PGRES_TUPLES_OK)
-	{
-		fprintf(stderr, _("%s: could not send replication command \"%s\": %s"),
-				progname, "IDENTIFY_SYSTEM", PQerrorMessage(conn));
+	if (!RunIdentifySystem(conn, &sysidentifier, &latesttli, NULL, 3))
 		disconnect_and_exit(1);
-	}
-	if (PQntuples(res) != 1 || PQnfields(res) < 3)
-	{
-		fprintf(stderr,
-				_("%s: could not identify system: got %d rows and %d fields, expected %d rows and %d or more fields\n"),
-				progname, PQntuples(res), PQnfields(res), 1, 3);
-		disconnect_and_exit(1);
-	}
-	sysidentifier = pg_strdup(PQgetvalue(res, 0, 0));
-	latesttli = atoi(PQgetvalue(res, 0, 1));
-	PQclear(res);
 
 	/*
 	 * Start the actual backup
diff --git a/src/bin/pg_basebackup/pg_receivexlog.c b/src/bin/pg_basebackup/pg_receivexlog.c
index a8b9ad3..a191c29 100644
--- a/src/bin/pg_basebackup/pg_receivexlog.c
+++ b/src/bin/pg_basebackup/pg_receivexlog.c
@@ -38,6 +38,8 @@ static int	noloop = 0;
 static int	standby_message_timeout = 10 * 1000;		/* 10 sec = default */
 static int	fsync_interval = 0; /* 0 = default */
 static volatile bool time_to_abort = false;
+static bool do_create_slot = false;
+static bool do_drop_slot = false;
 
 
 static void usage(void);
@@ -78,6 +80,9 @@ usage(void)
 	printf(_("  -w, --no-password      never prompt for password\n"));
 	printf(_("  -W, --password         force password prompt (should happen automatically)\n"));
 	printf(_("  -S, --slot=SLOTNAME    replication slot to use\n"));
+	printf(_("\nOptional actions:\n"));
+	printf(_("      --create           create a new replication slot (for the slot's name see --slot)\n"));
+	printf(_(  --drop 

Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Fabien COELHO


Hello again,


I have not found any means to force bgwriter to send writes when it can.
(Well, I have: create a process which sends CHECKPOINT every 0.2
seconds... it works more or less, but this is not my point:-)


There is scan_whole_pool_milliseconds, which currently forces bgwriter to
circle the buffer pool at least once every 2 minutes.  It is currently
fixed, but it should be trivial to turn it into an experimental guc that
you could use to test your hypothesis.


I recompiled with the variable coldly set to 1000 instead of 120000. The 
situation is slightly degraded (15% of transactions were more than 200 ms 
late). However it seems that bgwriter did not write many more pages:


  buffers_checkpoint = 26065
  buffers_clean = 5263
  buffers_backend = 367

Or I may have a problem interpreting pg_stat_bgwriter.

It seems that changing this value is not enough to persuade bgwriter to 
write more pages. Or I may have done something wrong, but I do not know 
what.


--
Fabien.




Re: [HACKERS] Concurrently option for reindexdb

2014-08-26 Thread Andres Freund
On 2014-08-26 12:44:43 +0900, Michael Paquier wrote:
 On Tue, Aug 26, 2014 at 12:28 PM, Fujii Masao masao.fu...@gmail.com wrote:
  +many. Although I'm not sure if we managed to find a safe relation swap.
 
 Well we didn't AFAIK. With the latest patch provided I could not
 really find any hole in the logic, and Andres felt that something may
 be wrong miles away. If I revisit the patch now with a rebased
 version, maybe I will find something...

I don't think it was miles away, but I'll look into the rebased version.

  That safe relation swap is possible if an AccessExclusive lock is taken. 
  Right?
  That means that REINDEX CONCURRENTLY is not completely-concurrently, but
  I think that many users are satisfied with even this feature.
 
 This would also block isolation tests on this feature, something
 not that welcome for a feature calling itself concurrently,

Right. But it's much better than what we have now. Possibly we can
rename the feature... :/

 but it
 would drastically simplify the patch and reduce deadlock occurrences if
 done right with the exclusive locks (no need for the check of past
 snapshots that is necessary when using ShareUpdateExclusiveLock?).

I'm not sure if you really can get rid of the waiting for past snapshots
without making the feature much more heavyweight than necessary.

 Reading this thread, the consensus would be to use an exclusive lock
 for swap and be done. Well if there are enough votes for this approach
 I wouldn't mind resending an updated patch for the next CF.

I always was of the opinion that an exclusive lock is still *MUCH* better
than what we have today.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Andres Freund
On 2014-08-26 08:12:48 +0200, Fabien COELHO wrote:
 As for checkpoint spreading, raising checkpoint_completion_target to 0.9
 degrades the situation (20% of transactions are more than 200 ms late
 instead of 10%, and bgwriter wrote less than 1 page per second, on a 500s run).
 So maybe there is a bug here somewhere.

What are the other settings here? checkpoint_segments,
checkpoint_timeout, wal_buffers?

Could you show the output of log_checkpoints during that run? Checkpoint
spreading only works halfway efficiently if all checkpoints are
triggered by time and not by xlog.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347

2014-08-26 Thread Erik Rijkers
On Mon, August 25, 2014 07:21, Andrew Gierth wrote:
 Here is the new version of our grouping sets patch. This version
 supersedes the previous post.

The patches did not apply anymore so I applied at 73eba19aebe0.  There they 
applied OK, and make && make check was OK.


drop   table if exists items_sold;
create table   items_sold as
select * from (
  values
   ('Foo', 'L', 10)
 , ('Foo', 'M', 20)
 , ('Bar', 'M', 15)
 , ('Bar', 'L',  5)
) as f(brand, size, sales) ;

select brand, size, grouping(brand, size), sum(sales) from items_sold group by 
rollup(brand, size);
--  WARNING:  unrecognized node type: 347


I suppose that's not correct.


thanks,

Erik Rijkers





Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Fabien COELHO


Hello Andres,


checkpoint when the segments are full... the server is unresponsive about
10% of the time (one in ten transaction is late by more than 200 ms).


That's ext4 I guess?


Yes!


Did you check whether xfs yields a, err, more predictable performance?


No. I cannot test that easily without reinstalling the box. I did some 
quick tests with ZFS/FreeBSD which seemed to freeze the same, but not in 
the very same conditions. Maybe I could try again.


[...] Note that it would *not* be a good idea to make the bgwriter write 
out everything, as much as possible - that'd turn sequential write io 
into random write io.


Hmmm. I'm not sure it would necessarily be the case; it depends on how 
bgwriter would choose the pages to write. If they are chosen randomly then 
indeed that could be bad. If there is a big sequential write, should not 
the backend do the write directly anyway? ISTM that currently checkpoint 
is mostly random writes anyway, at least with the OLTP write load of 
pgbench. I'm just trying to be able to start them earlier so that they can 
be completed quickly.


So although bgwriter is not the solution, ISTM that pg has no reason to 
wait for minutes before starting to write dirty pages, if it has nothing 
else to do. If the OS does some retention later and cannot spread the 
load, as Josh suggests, this could also be a problem, but currently the OS 
seems not to have much to write (but WAL) till the checkpoint.


--
Fabien.




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Andres Freund
On 2014-08-26 10:25:29 +0200, Fabien COELHO wrote:
 Did you check whether xfs yields a, err, more predictable performance?
 
 No. I cannot test that easily without reinstalling the box. I did some quick
 tests with ZFS/FreeBSD which seemed to freeze the same, but not in the very
 same conditions. Maybe I could try again.

After Robert and I went to LSF/MM this spring I sent a test program for
precisely this problem and while it could *crash* machines when using
ext4, xfs yielded much more predictable performance. There's a problem
with priorization of write vs read IO that's apparently FS dependent.
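
(Not Andres' actual test program, but a stripped-down sketch of that kind of
test: dirty a lot of page cache and fsync in one process while timing small
reads in another. Paths and sizes are arbitrary.)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static double
now_ms(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int
main(void)
{
	char	   *chunk = malloc(1024 * 1024);
	int			rfd = open("read_probe.tmp", O_CREAT | O_RDWR, 0600);
	int			i;

	memset(chunk, 'x', 1024 * 1024);
	for (i = 0; i < 256; i++)			/* 256 MB file to read-probe */
		(void) write(rfd, chunk, 1024 * 1024);

	if (fork() == 0)
	{
		/* writer child: dirty lots of page cache, fsync now and then,
		 * which is roughly what a checkpoint ends up doing */
		int		wfd = open("dirty_data.tmp", O_CREAT | O_WRONLY | O_TRUNC, 0600);

		for (i = 0; i < 1024; i++)		/* 1 GB of buffered writes */
		{
			(void) write(wfd, chunk, 1024 * 1024);
			if (i % 128 == 127)
				fsync(wfd);
		}
		_exit(0);
	}

	/* parent: time small reads while the writer runs, report stalls */
	for (i = 0; i < 2000; i++)
	{
		char		buf[8192];
		double		start = now_ms();

		(void) pread(rfd, buf, sizeof(buf), (off_t) (i % 256) * 1024 * 1024);
		if (now_ms() - start > 100.0)
			printf("read %d stalled for %.0f ms\n", i, now_ms() - start);
		usleep(10000);					/* 10 ms between probes */
	}
	wait(NULL);
	return 0;
}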

 [...] Note that it would *not* be a good idea to make the bgwriter write
 out everything, as much as possible - that'd turn sequential write io into
 random write io.
 
 Hmmm. I'm not sure it would be necessary the case, it depends on how
 bgwriter would choose the pages to write? If they are chosen randomly then
 indeed that could be bad.

They essentially have to be random to fulfill their role of reducing the
likelihood of a backend having to write out a buffer itself. Consider
how the clock sweep algorithm (not that I am happy with it) works. When
looking for a new victim buffer all backends scan the buffer cache in
one continuous cycle. If they find a buffer with a usagecount==0 they'll
use that one and throw away its contents. Otherwise they reduce
usagecount by 1 and move on. What the bgwriter *tries* to do is to write
out buffers with usagecount==0 that are dirty and will soon be visited
in the clock cycle, to avoid having the backends do that.
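
(A toy sketch of that sweep, just to show the shape; the real code is
StrategyGetBuffer() in freelist.c plus BgBufferSync() in bufmgr.c.)

#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS 8

typedef struct
{
	int			usage_count;	/* bumped on access, capped in the real code */
	bool		dirty;
} ToyBuffer;

static ToyBuffer pool[NBUFFERS];
static int	next_victim = 0;	/* the clock hand, shared by all backends */

/* Find a buffer to evict: take the first usage_count==0 buffer, decrementing
 * the others as the hand passes over them. */
static int
clock_sweep(void)
{
	for (;;)
	{
		ToyBuffer  *buf = &pool[next_victim];
		int			victim = next_victim;

		next_victim = (next_victim + 1) % NBUFFERS;
		if (buf->usage_count == 0)
		{
			/* If it is dirty, the backend must write it out first -- this is
			 * the write the bgwriter tries to do ahead of the hand. */
			if (buf->dirty)
				printf("backend has to write buffer %d before reuse\n", victim);
			return victim;
		}
		buf->usage_count--;		/* give it another lap to prove its worth */
	}
}

int
main(void)
{
	pool[0].usage_count = 3;
	pool[1].usage_count = 1;
	pool[2].usage_count = 0;
	pool[2].dirty = true;
	printf("victim = %d\n", clock_sweep());
	return 0;
}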

 If there is a big sequential write, should not the
 backend do the write directly anyway? ISTM that currently checkpoint is
 mostly random writes anyway, at least with the OLTP write load of pgbench.
 I'm just trying to be able to start them ealier so that they can be
 completed quickly.

If the IO scheduling worked - which it really doesn't in many cases -
there'd really be no need to make it finish fast. I think you should try
to tune spread checkpoints to have less impact, not make bgwriter do
something it's not written for.

 So although bgwriter is not the solution, ISTM that pg has no reason to wait
 for minutes before starting to write dirty pages, if it has nothing else to
 do.

That precisely *IS* a spread checkpoint.

 If the OS does some retention later and cannot spread the load, as Josh
 suggests, this could also be a problem, but currently the OS seems not to
 have much to write (but WAL) till the checkpoint.

The actual problem is that the writes by the checkpointer - done in the
background - aren't flushed out eagerly enough out of the OS's page
cache. Then, when the final phase of the checkpoint comes, where
relation files need to be fsynced, some filesystems essentially stall
while trying to write out lots and lots of dirty buffers.
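
(On Linux, one illustration of "flushing eagerly" is to push each just-written
range out with sync_file_range() so the final fsync() has little left to do.
A sketch only, not what the checkpointer currently does.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const size_t chunk = 256 * 1024;	/* flush window, arbitrary */
	char	   *buf = malloc(chunk);
	int			fd = open("datafile.tmp", O_CREAT | O_WRONLY | O_TRUNC, 0600);
	off_t		off = 0;
	int			i;

	memset(buf, 'x', chunk);
	for (i = 0; i < 1024; i++)
	{
		if (write(fd, buf, chunk) != (ssize_t) chunk)
		{
			perror("write");
			return 1;
		}
		/* ask the kernel to start writeback of the just-written range now,
		 * instead of letting dirty pages pile up until fsync */
		sync_file_range(fd, off, chunk, SYNC_FILE_RANGE_WRITE);
		off += chunk;
	}
	fsync(fd);		/* should now be cheaper: most pages already on their way */
	close(fd);
	return 0;
}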

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pgbench throttling latency limit

2014-08-26 Thread Rukh Meski
Hi Fabien,

On Sun, Aug 24, 2014 at 9:16 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 Find attached a new version:
  - fix dropped percent computation in the final report
  - simplify progress report code

I have reviewed this patch.

Is the patch in a patch format which has context?
  Yes.
Does it apply cleanly to the current git master?
  Yes.
Does it include reasonable tests, necessary doc patches, etc?
  Yes.

Does the patch actually implement that?
  Yes.
Do we want that?
  I think we do, yes.
Do we already have it?
  No.
Are there dangers?
  None that I can see.

Does the feature work as advertised?
  Almost, see below.
Are there corner cases the author has failed to consider?
  None that I can see.
Are there any assertion failures or crashes?
  No.

I can't make the -L option work at all.  If I do this:
  ./pgbench -R 100 -L 1
I get:
  pgbench: invalid option -- L
Which appears to be caused by the fact that the call to getopt_long()
has not been updated to reflect the new parameter.
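
(The mechanics, as a self-contained reminder rather than the pgbench code
itself: the short option letter must also appear in the optstring argument,
or getopt_long() rejects it.)

#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
	static const struct option long_options[] = {
		{"limit", required_argument, NULL, 'L'},
		{NULL, 0, NULL, 0}
	};
	int			c;

	/* "L:" must appear here too, or "-L 1" yields "invalid option -- L" */
	while ((c = getopt_long(argc, argv, "L:", long_options, NULL)) != -1)
	{
		if (c == 'L')
			printf("limit = %d ms\n", atoi(optarg));
		else
			return 1;
	}
	return 0;
}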

Also this part:
+-L, --limit=NUM  under throttling (--rate), skip transactions that\n
+                 far behind schedule in ms (default: do not skip)\n
I would suggest rewording this to something like "skip transactions
that are more than NUM milliseconds behind schedule (default: do not
skip)".

Marking Waiting for Author until these small issues have been fixed.


Thanks,

♜




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-26 Thread Heikki Linnakangas

On 08/16/2014 02:19 AM, Tom Lane wrote:

I think the realistic alternatives at this point are either to
switch to all-lengths as in my test patch, or to use the hybrid approach
of Heikki's test patch.  IMO the major attraction of Heikki's patch
is that it'd be upward compatible with existing beta installations,
ie no initdb required (but thus, no opportunity to squeeze in a version
identifier either).  It's not showing up terribly well in the performance
tests I've been doing --- it's about halfway between HEAD and my patch on
that extract-a-key-from-a-PLAIN-stored-column test.  But, just as with my
patch, there are things that could be done to micro-optimize it by
touching a bit more code.

I did some quick stats comparing compressed sizes for the delicio.us
data, printing quartiles as per Josh's lead:

all-lengths {440,569,609,655,1257}
Heikki's patch  {456,582,624,671,1274}
HEAD{493,636,684,744,1485}

(As before, this is pg_column_size of the jsonb within a table whose rows
are wide enough to force tuptoaster.c to try to compress the jsonb;
otherwise many of these values wouldn't get compressed.)  These documents
don't have enough keys to trigger the first_success_by issue, so that
HEAD doesn't look too awful, but still there's about an 11% gain from
switching from offsets to lengths.  Heikki's method captures much of
that but not all.

Personally I'd prefer to go to the all-lengths approach, but a large
part of that comes from a subjective assessment that the hybrid approach
is too messy.  Others might well disagree.


It's not too pretty, no. But it would be nice to not have to make a 
tradeoff between lookup speed and compressibility.


Yet another idea is to store all lengths, but add an additional array of 
offsets to JsonbContainer. The array would contain the offset of, say, 
every 16th element. It would be very small compared to the lengths 
array, but would greatly speed up random access on a large array/object.
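
(A toy sketch of that layout: per-element lengths plus one stored offset per
16 elements, so a random access sums at most 15 lengths. Names and stride are
illustrative only.)

#include <stdio.h>
#include <stdint.h>

#define STRIDE 16				/* store one absolute offset per 16 elements */
#define NELEMS 100

typedef struct
{
	uint32_t	lengths[NELEMS];							/* compresses well */
	uint32_t	offsets[(NELEMS + STRIDE - 1) / STRIDE];	/* small side array */
} ToyContainer;

/* Offset of element i: start from the nearest stored offset, add lengths. */
static uint32_t
element_offset(const ToyContainer *c, int i)
{
	uint32_t	off = c->offsets[i / STRIDE];
	int			j;

	for (j = (i / STRIDE) * STRIDE; j < i; j++)
		off += c->lengths[j];
	return off;
}

int
main(void)
{
	ToyContainer c;
	uint32_t	off = 0;
	int			i;

	for (i = 0; i < NELEMS; i++)
	{
		if (i % STRIDE == 0)
			c.offsets[i / STRIDE] = off;
		c.lengths[i] = 5;		/* pretend every element is 5 bytes */
		off += c.lengths[i];
	}
	printf("element 50 starts at %u\n", element_offset(&c, 50));
	return 0;
}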


- Heikki





Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Fabien COELHO



What are the other settings here? checkpoint_segments,
checkpoint_timeout, wal_buffers?


They simply are the defaults:

  checkpoint_segments = 3
  checkpoint_timeout = 5min
  wal_buffers = -1

I did some tests with checkpoint_segments = 1: the problem is just more frequent 
but shorter. I also reduced wal_segsize down to 1MB, which also made it 
even more frequent but much shorter, so the overall result was an 
improvement, with 5% to 3% of transactions lost instead of 10-14%, if I 
recall correctly. I have found no solution on this path.



Could you show the output of log_checkpoints during that run? Checkpoint
spreading only works halfway efficiently if all checkpoints are
triggered by time and not by xlog.


I do 500-second tests, so there could be at most 2 timeout-triggered 
checkpoints. Given the write load it takes about 2 minutes to fill the 
3 x 16 MB segments (8 kB * 50 tps (there is one page modified per transaction) 
* 120 s ~ 48 MB), so checkpoints are triggered by xlog. The maths are 
consistent with the logs (not sure which proves which, though:-):


  LOG:  received SIGHUP, reloading configuration files
  LOG:  parameter log_checkpoints changed to on
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 5713 buffers (34.9%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=51.449 s, sync=4.857 s,
total=56.485 s; sync files=12, longest=2.160 s, average=0.404 s
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 6235 buffers (38.1%); 0 transaction log
file(s) added, 0 removed, 3 recycled; write=53.500 s, sync=5.102 s,
total=58.670 s; sync files=8, longest=2.689 s, average=0.637 s
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 6250 buffers (38.1%); 0 transaction log
file(s) added, 0 removed, 3 recycled; write=53.888 s, sync=4.504 s,
total=58.495 s; sync files=8, longest=2.627 s, average=0.563 s
  LOG:  checkpoint starting: xlog
  LOG: checkpoint complete: wrote 6148 buffers (37.5%); 0 transaction log
file(s) added, 0 removed, 3 recycled; write=53.313 s, sync=6.437 s,
total=59.834 s; sync files=8, longest=3.680 s, average=0.804 s
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 6240 buffers (38.1%); 0 transaction log
file(s) added, 0 removed, 3 recycled; write=149.008 s, sync=5.448 s,
total=154.566 s; sync files=9, longest=3.788 s, average=0.605 s

Note that my current effective solution is to do as if 
checkpoint_timeout = 0.2s: it works fine if I do my own spreading.


--
Fabien.




Re: [HACKERS] Verbose output of pg_dump not show schema name

2014-08-26 Thread Heikki Linnakangas

On 08/26/2014 10:10 AM, Michael Paquier wrote:

On Tue, Aug 26, 2014 at 3:48 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

AFAICS, the namespace can never be NULL in any of these. There is a
selectSourceSchema(fout, tbinfo->dobj.namespace->dobj.name) call before or
after printing the message, so if tbinfo->dobj.namespace is NULL, you'll
crash anyway. Please double-check, and remove the dead code if you agree.

Ah right, this field is used in many places. Even for
pg_backup_archiver.c, the portion of code processing data always has
the namespace set. I am sure that Fabrizio would have done that
quickly, but as I was on this thread I simplified the patch as


Ok thanks, committed.

- Heikki





Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Andres Freund
On 2014-08-26 10:49:31 +0200, Fabien COELHO wrote:
 
 What are the other settings here? checkpoint_segments,
 checkpoint_timeout, wal_buffers?
 
 They simply are the defaults:
 
   checkpoint_segments = 3
   checkpoint_timeout = 5min
   wal_buffers = -1
 
 I did some tests with checkpoint_segments = 1: the problem is just more frequent
 but shorter. I also reduced wal_segsize down to 1MB, which also made it even
 more frequent but much shorter, so the overall result was an improvement
 with 5% to 3% of transactions lost instead of 10-14%, if I recall correctly.
 I have found no solution on this path.

Uh. I'm not surprised you're facing utterly horrible performance with
this. Did you try using a *large* checkpoint_segments setting? To
achieve high performance you likely will have to make checkpoint_timeout
*longer* and increase checkpoint_segments until *all* checkpoints are
started because of time.

There's three reasons:
a) if checkpoint_timeout + completion_target is large and the checkpoint
isn't executed prematurely, most of the dirty data has been written out
by the kernel's background flush processes.

b) The amount of WAL written with less frequent checkpoints is often
*significantly* lower because fewer full page writes need to be
done. I've seen production reductions of *more* than a factor of 4.

c) If checkpoints are infrequent enough, the penalty of them causing
problems, especially if not using ext4, plays less of a role overall.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347

2014-08-26 Thread Andrew Gierth
 Erik == Erik Rijkers e...@xs4all.nl writes:

 Erik The patches did not apply anymore so I applied at 73eba19aebe0.
 Erik There they applied OK, and make && make check was OK.

I'll look and rebase if need be.

 -- WARNING:  unrecognized node type: 347

Can't reproduce this - are you sure it's not a mis-build?

-- 
Andrew (irc:RhodiumToad)




Re: [HACKERS] Proposal for CSN based snapshots

2014-08-26 Thread Jeff Davis
On Fri, 2014-06-13 at 13:24 +0300, Heikki Linnakangas wrote:
 Attached is a new patch. It now keeps the current pg_clog unchanged, but 
 adds a new pg_csnlog besides it. pg_csnlog is more similar to 
 pg_subtrans than pg_clog: it's not WAL-logged, is reset at startup, and 
 segments older than GlobalXmin can be truncated.

It appears that this patch weakens the idea of hint bits. Even if
HEAP_XMIN_COMMITTED is set, it still needs to find out if it's in the
snapshot.

That's fast if the xid is less than snap->xmin, but otherwise it needs
to do a fetch from the csnlog, which is exactly what the hint bits are
designed to avoid. And we can't get around this, because the whole point
of this patch is to remove the xip array from the snapshot.
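
(Schematically, the lookup path being discussed looks like the following; the
csnlog fetch is faked here, this only shows where the extra lookup lands
relative to the xmin fast path.)

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t ToyTransactionId;
typedef uint64_t ToyCommitSeqNo;

typedef struct
{
	ToyTransactionId xmin;			/* all xids below this are settled */
	ToyCommitSeqNo snapshot_csn;	/* replaces the xip[] array */
} ToySnapshot;

/* stand-in for a pg_csnlog fetch: the part hint bits used to let us skip */
static ToyCommitSeqNo
csnlog_get_commit_csn(ToyTransactionId xid)
{
	return (ToyCommitSeqNo) xid * 10;	/* fake data for illustration */
}

static bool
xid_visible_in_snapshot(ToyTransactionId xid, const ToySnapshot *snap)
{
	if (xid < snap->xmin)
		return true;				/* fast path: no csnlog access needed */

	/* even with HEAP_XMIN_COMMITTED already set, we still end up here */
	return csnlog_get_commit_csn(xid) <= snap->snapshot_csn;
}

int
main(void)
{
	ToySnapshot snap = {.xmin = 100, .snapshot_csn = 1500};

	printf("xid 42 visible: %d\n", xid_visible_in_snapshot(42, &snap));
	printf("xid 170 visible: %d\n", xid_visible_in_snapshot(170, &snap));
	return 0;
}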

If the transaction was committed a long time ago, then we could set
PD_ALL_VISIBLE and the VM bit, and a scan wouldn't even look at the hint
bit. If it was committed recently, then it's probably greater than the
recentXmin. I think there's still room for a hint bit to technically be
useful, but it seems quite narrow unless I'm missing something (and a
narrowly-useful hint bit doesn't seem to be useful at all).

I'm not complaining, and I hope this is not a showstopper for this
patch, but I think it's worth discussing.

Regards,
Jeff Davis






Re: [HACKERS] replicating DROP commands across servers

2014-08-26 Thread Andres Freund
On 2014-06-13 15:50:50 -0400, Alvaro Herrera wrote:
 Here's a patch implementing the proposed idea.  This is used in the
 Bidirectional Replication stuff by Simon/Andres; it works well.

I think there have been some changes to this patch since July, care to
resend a new version?

I think it's appropriate to mark the patch as Waiting for Author
instead of Ready for Committer till then.

Greetings,

Andres Freund




Re: [HACKERS] pgbench throttling latency limit

2014-08-26 Thread Fabien COELHO


Hello Rukh,


I have reviewed this patch.


Thanks!


[...] I get: pgbench: invalid option -- L
Which appears to be caused by the fact that the call to getopt_long()
has not been updated to reflect the new parameter.


Indeed, I only tested/used it with the --limit= syntax.


Also this part:
+-L, --limit=NUM  under throttling (--rate), skip transactions that\n
+                 far behind schedule in ms (default: do not skip)\n
I would suggest rewording this to something like "skip transactions
that are more than NUM milliseconds behind schedule (default: do not
skip)".


Done, with milliseconds written as ms to keep it short.


Marking Waiting for Author until these small issues have been fixed.


Please find attached a new version which fixes these two points.

--
Fabien.diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 2f7d80e..de77817 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -141,6 +141,13 @@ double		sample_rate = 0.0;
 int64		throttle_delay = 0;
 
 /*
+ * When under throttling, execution time slots which are more than
+ * this late (in us) are skipped, and the corresponding transaction
+ * will be counted as somehow aborted.
+ */
+int64		throttle_latency_limit = 0;
+
+/*
  * tablespace selection
  */
 char	   *tablespace = NULL;
@@ -238,6 +245,7 @@ typedef struct
 	int64		throttle_trigger;		/* previous/next throttling (us) */
 	int64		throttle_lag;	/* total transaction lag behind throttling */
 	int64		throttle_lag_max;		/* max transaction lag */
+	int64		throttle_latency_skipped; /* lagging transactions skipped */
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -250,6 +258,7 @@ typedef struct
 	int64		sqlats;
 	int64		throttle_lag;
 	int64		throttle_lag_max;
+	int64		throttle_latency_skipped;
 } TResult;
 
 /*
@@ -367,6 +376,8 @@ usage(void)
 		   "  -f, --file=FILENAME      read transaction script from FILENAME\n"
 		 "  -j, --jobs=NUM           number of threads (default: 1)\n"
 		 "  -l, --log                write transaction times to log file\n"
+		 "  -L, --limit=NUM          under throttling (--rate), skip transactions that\n"
+		 "                           are more than NUM ms behind schedule (default: do not skip)\n"
 		 "  -M, --protocol=simple|extended|prepared\n"
 		 "                           protocol for submitting queries (default: simple)\n"
 		 "  -n, --no-vacuum          do not run VACUUM before tests\n"
@@ -1016,6 +1027,24 @@ top:
 
 		thread->throttle_trigger += wait;
 
+		if (throttle_latency_limit)
+		{
+			instr_time	now;
+			int64		now_us;
+			INSTR_TIME_SET_CURRENT(now);
+			now_us = INSTR_TIME_GET_MICROSEC(now);
+			while (thread->throttle_trigger < now_us - throttle_latency_limit)
+			{
+				/* if too far behind, this slot is skipped, and we
+				 * iterate till the next nearly on time slot.
+				 */
+				int64 wait = (int64) (throttle_delay *
+					1.00055271703 * -log(getrand(thread, 1, 10000) / 10000.0));
+				thread->throttle_trigger += wait;
+				thread->throttle_latency_skipped ++;
+			}
+		}
+
 		st->until = thread->throttle_trigger;
 		st->sleeping = 1;
 		st->throttling = true;
@@ -2351,7 +2380,8 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 			 TState *threads, int nthreads,
 			 instr_time total_time, instr_time conn_total_time,
 			 int64 total_latencies, int64 total_sqlats,
-			 int64 throttle_lag, int64 throttle_lag_max)
+			 int64 throttle_lag, int64 throttle_lag_max,
+			 int64 throttle_latency_skipped)
 {
 	double		time_include,
 tps_include,
@@ -2417,6 +2447,10 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 		 */
 		printf("rate limit schedule lag: avg %.3f (max %.3f) ms\n",
 			   0.001 * throttle_lag / normal_xacts, 0.001 * throttle_lag_max);
+		if (throttle_latency_limit)
+			printf("number of skipped transactions: " INT64_FORMAT " (%.3f %%)\n",
+				   throttle_latency_skipped,
+				   100.0 * throttle_latency_skipped / (throttle_latency_skipped + normal_xacts));
 	}
 
 	printf("tps = %f (including connections establishing)\n", tps_include);
@@ -2505,6 +2539,7 @@ main(int argc, char **argv)
 		{"sampling-rate", required_argument, NULL, 4},
 		{"aggregate-interval", required_argument, NULL, 5},
 		{"rate", required_argument, NULL, 'R'},
+		{"limit", required_argument, NULL, 'L'},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -2534,6 +2569,7 @@ main(int argc, char **argv)
 	int64		total_sqlats = 0;
 	int64		throttle_lag = 0;
 	int64		throttle_lag_max = 0;
+	int64		throttle_latency_skipped = 0;
 
 	int			i;
 
@@ -2578,7 +2614,7 @@ main(int argc, char **argv)
 	state = (CState *) pg_malloc(sizeof(CState));
 	memset(state, 0, sizeof(CState));
 
-	while ((c = getopt_long(argc, argv, "ih:nvp:dqSNc:j:Crs:t:T:U:lf:D:F:M:P:R:", long_options, &optindex)) != -1)
+	while ((c = getopt_long(argc, argv, "ih:nvp:dqSNc:j:Crs:t:T:U:lf:D:F:M:P:R:L:", long_options, &optindex)) != -1)
 	{
 		switch (c)
 		{
@@ -2775,6 +2811,18 @@ main(int argc, char **argv)
 	throttle_delay = 

Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347

2014-08-26 Thread Andrew Gierth
 Andrew == Andrew Gierth and...@tao11.riddles.org.uk writes:

 Erik == Erik Rijkers e...@xs4all.nl writes:

 Erik The patches did not apply anymore so I applied at 73eba19aebe0.
 Erik There they applied OK, and make  make check was OK.

 Andrew I'll look and rebase if need be.

They apply cleanly for me at 2bde297 whether with git apply or patch,
except for the contrib one (which you don't need unless you want to
run the contrib regression tests without applying the gsp-u patch).

-- 
Andrew (irc:RhodiumToad)




Re: [HACKERS] What in the world is happening with castoroides and protosciurus?

2014-08-26 Thread Dave Page
On Tue, Aug 26, 2014 at 1:46 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 For the last month or so, these two buildfarm animals (which I believe are
 the same physical machine) have been erratically failing with errors that
 reflect low-order differences in floating-point calculations.

 A recent example is at

 http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=protosciurusdt=2014-08-25%2010%3A39%3A52

 where the only regression diff is

 *** 
 /export/home/dpage/pgbuildfarm/protosciurus/HEAD/pgsql.22860/src/test/regress/expected/hash_index.out
Mon Aug 25 11:41:00 2014
 --- 
 /export/home/dpage/pgbuildfarm/protosciurus/HEAD/pgsql.22860/src/test/regress/results/hash_index.out
 Mon Aug 25 11:57:26 2014
 ***
 *** 171,179 
   SELECT h.seqno AS i8096, h.random AS f1234_1234
  FROM hash_f8_heap h
  WHERE h.random = '-1234.1234'::float8;
 !  i8096 | f1234_1234
 ! ---+
 !   8906 | -1234.1234
   (1 row)

   UPDATE hash_f8_heap
 --- 171,179 
   SELECT h.seqno AS i8096, h.random AS f1234_1234
  FROM hash_f8_heap h
  WHERE h.random = '-1234.1234'::float8;
 !  i8096 |f1234_1234
 ! ---+---
 !   8906 | -1234.12356777216
   (1 row)

   UPDATE hash_f8_heap

 ... a result that certainly makes no sense.  The results are not
 repeatable, failing in equally odd ways in different tests on different
 runs.  This is happening in all the back branches too, not just HEAD.

 Has there been a system software update on this machine a month or so ago?
 If not, it's hard to think anything except that the floating point
 hardware on this box has developed problems.

There hasn't been a software update, but something happened about two
months ago, and we couldn't get to the bottom of exactly what it was -
essentially, castoroides started failing with "C compiler cannot
create executables". It appeared that the compiler was missing from
the path, even though the config hadn't changed. Our working theory is
that there was previously a symlink to the compiler in one of the
directories in the path, that somehow got removed. The issue was fixed
by adding the actual compiler location to the path.

However, that would have only affected castoroides, and not
protosciurus which runs under a different environment config. I have
no idea what is causing the current issue - the machine is stable
software-wise, and only has private builds of dependency libraries
updated periodically (which are not used for the buildfarm). If I had
to hazard a guess, I'd suggest this is an early symptom of an old
machine which is starting to give up.

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Fabien COELHO



Uh. I'm not surprised you're facing utterly horrible performance with
this. Did you try using a *large* checkpoints_segments setting? To
achieve high performance


I do not seek high performance per se, I seek lower maximum latency.

I think that the current settings and parameters are designed for high 
throughput, but do not allow controlling the latency even with a small 
load.


you likely will have to make checkpoint_timeout *longer* and increase 
checkpoint_segments until *all* checkpoints are started because of 
time.


Well, as I want to test a *small* load in a *reasonable* time, I did 
not enlarge the number of segments, otherwise it would take ages.


If I put a checkpoint_timeout = 1min and checkpoint_completion_target = 
0.9 so that the checkpoints are triggered by the timeout,


  LOG:  checkpoint starting: time
  LOG:  checkpoint complete: wrote 4476 buffers (27.3%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=53.645 s, sync=5.127 s,
total=58.927 s; sync files=12, longest=2.890 s, average=0.427 s
  ...

The result is basically the same (well, 18% of transactions lost, but the 
results do not seem to be stable from one run to the next), only there are 
more checkpoints.


I fail to understand how multiplying both the segments and time would 
solve the latency problem. If I set 30 segments then it takes 20 minutes 
to fill them, and if I put the timeout to 15min then I'll have to wait for 15 
minutes to test.



There's three reasons:
a) if checkpoint_timeout + completion_target is large and the checkpoint
isn't executed prematurely, most of the dirty data has been written out
by the kernel's background flush processes.


Why would they be written by the kernel if bgwriter has not sent them??


b) The amount of WAL written with less frequent checkpoints is often
*significantly* lower because fewer full page writes need to be
done. I've seen production reduction of *more* than a factor of 4.


Sure, I understand that, but ISTM that this test does not exercise this 
issue, the load is small, the full page writes do not matter much.



c) If checkpoint's are infrequent enough, the penalty of them causing
problems, especially if not using ext4, plays less of a role overall.


I think that what you suggest would only delay the issue, not solve it.

I'll try to ran a long test.

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench throttling latency limit

2014-08-26 Thread Fabien COELHO



Marking Waiting for Author until these small issues have been fixed.


I've put it back to Needs review. Feel free to set it to Ready if it 
is ok for you.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Memory-bounded HashAgg

2014-08-26 Thread Heikki Linnakangas

Summary of this thread so far:

There was a lot of discussion comparing this with Tomas Vondra's Hash 
Join patch. The conclusion was that while it would be nice to be able to 
dump transition state to disk, for aggregates like array_agg, the patch 
is fine as it is. Dumping transition states would require much more 
work, and this is already useful without it. Moreover, solving the 
array_agg problem later won't require a rewrite; rather, it'll build on 
top of this.


You listed a number of open items in the original post, and these are 
still outstanding:



* costing
* EXPLAIN details for disk usage
* choose number of partitions intelligently
* performance testing


I think this is enough for this commitfest - we have consensus on the 
design. For the next one, please address those open items, and resubmit.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Selectivity estimation for inet operators

2014-08-26 Thread Emre Hasegeli
 I agree with you that we can support other join types and anti join later.
 If others don’t have any objection to doing other parts later I will mark as 
 Ready For Committer.

I updated the patch to cover semi and anti joins with eqjoinsel_semi().
I think it is better than returning a constant.  The new version is
attached with the new version of the test script.  Can you please
look at it again and mark it as ready for committer if it seems okay
to you?
diff --git a/src/backend/utils/adt/network_selfuncs.c 
b/src/backend/utils/adt/network_selfuncs.c
index d0d806f..eca9e7c 100644
--- a/src/backend/utils/adt/network_selfuncs.c
+++ b/src/backend/utils/adt/network_selfuncs.c
@@ -1,32 +1,669 @@
 /*-
  *
  * network_selfuncs.c
  *   Functions for selectivity estimation of inet/cidr operators
  *
- * Currently these are just stubs, but we hope to do better soon.
+ * Estimates are based on null fraction, distinct value count, most common
+ * values, and histogram of inet/cidr datatypes.
  *
  * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  *
  * IDENTIFICATION
  *   src/backend/utils/adt/network_selfuncs.c
  *
  *-
  */
 #include "postgres.h"
 
+#include <math.h>
+
+#include "access/htup_details.h"
+#include "catalog/pg_collation.h"
+#include "catalog/pg_operator.h"
+#include "catalog/pg_statistic.h"
+#include "utils/lsyscache.h"
 #include "utils/inet.h"
+#include "utils/selfuncs.h"
 
 
+/* Default selectivity constant for the inet overlap operator */
+#define DEFAULT_OVERLAP_SEL 0.01
+
+/* Default selectivity constant for the other operators */
+#define DEFAULT_INCLUSION_SEL 0.005
+
+/* Default selectivity for given operator */
+#define DEFAULT_SEL(operator) \
+   ((operator) == OID_INET_OVERLAP_OP ? \
+   DEFAULT_OVERLAP_SEL : DEFAULT_INCLUSION_SEL)
+
+static Selectivity networkjoinsel_inner(Oid operator,
+VariableStatData *vardata1, 
VariableStatData *vardata2);
+extern double eqjoinsel_semi(Oid operator, VariableStatData *vardata1,
+  VariableStatData *vardata2, RelOptInfo *inner_rel);
+extern RelOptInfo *find_join_input_rel(PlannerInfo *root, Relids relids);
+static short int inet_opr_order(Oid operator);
+static Selectivity inet_his_inclusion_selec(Datum *values, int nvalues,
+Datum *constvalue, short int 
opr_order);
+static Selectivity inet_mcv_join_selec(Datum *values1, float4 *numbers1,
+   int nvalues1, Datum *values2, float4 
*numbers2,
+   int nvalues2, Oid operator);
+static Selectivity inet_mcv_his_selec(Datum *mcv_values, float4 *mcv_numbers,
+  int mcv_nvalues, Datum *his_values, int 
his_nvalues,
+  int red_nvalues, short int opr_order,
+  Selectivity *max_selec_pointer);
+static Selectivity inet_his_inclusion_join_selec(Datum *his1_values,
+ int his1_nvalues, Datum *his2_values, 
int his2_nvalues,
+ int red_nvalues, 
short int opr_order);
+static short int inet_inclusion_cmp(inet *left, inet *right,
+  short int opr_order);
+static short int inet_masklen_inclusion_cmp(inet *left, inet *right,
+  short int opr_order);
+static short int inet_his_match_divider(inet *boundary, inet *query,
+  short int opr_order);
+
+/*
+ * Selectivity estimation for the subnet inclusion operators
+ */
 Datum
 networksel(PG_FUNCTION_ARGS)
 {
-   PG_RETURN_FLOAT8(0.001);
+   PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
+   Oid operator = PG_GETARG_OID(1);
+   List   *args = (List *) PG_GETARG_POINTER(2);
+   int varRelid = PG_GETARG_INT32(3),
+   his_nvalues;
+   VariableStatData vardata;
+   Node   *other;
+   boolvaronleft;
+   Selectivity selec,
+   max_mcv_selec;
+   Datum   constvalue,
+  *his_values;
+   Form_pg_statistic stats;
+   double  nullfrac;
+   FmgrInfoproc;
+
+   /*
+* If expression is not (variable op something) or (something op
+* variable), then punt and return a default estimate.
+*/
+   if (!get_restriction_variable(root, args, varRelid,
+ &vardata, &other, &varonleft))
+   PG_RETURN_FLOAT8(DEFAULT_SEL(operator));
+
+   

Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Andres Freund
On 2014-08-26 11:34:36 +0200, Fabien COELHO wrote:
 
 Uh. I'm not surprised you're facing utterly horrible performance with
 this. Did you try using a *large* checkpoints_segments setting? To
 achieve high performance
 
 I do not seek high performance per se, I seek lower maximum latency.

So?

 I think that the current settings and parameters are designed for high
 throughput, but do not allow controlling the latency even with a small load.

The way you're setting them is tuned for 'basically no write
activity'.

 you likely will have to make checkpoint_timeout *longer* and increase
 checkpoint_segments until *all* checkpoints are started because of time.
 
 Well, as I want to test a *small* load in a *reasonable* time, I did not
 enlarge the number of segments, otherwise it would take ages.

Well, that way you're testing something basically meaningless. That's
not helpful either.

 If I put a checkpoint_timeout = 1min and checkpoint_completion_target =
 0.9 so that the checkpoints are triggered by the timeout,
 
   LOG:  checkpoint starting: time
   LOG:  checkpoint complete: wrote 4476 buffers (27.3%); 0 transaction log
 file(s) added, 0 removed, 0 recycled; write=53.645 s, sync=5.127 s,
 total=58.927 s; sync files=12, longest=2.890 s, average=0.427 s
   ...

 The result is basically the same (well, 18% of transactions lost, but the results
 do not seem to be stable from one run to the next), only there are more
 checkpoints.

With these settings you're fsyncing the entire data directy once a
minute. Nearly entirely from the OS's buffer cache, because the OS's
writeback logic didn't have time to kick in.

 I fail to understand how multiplying both the segments and time would solve
 the latency problem. If I set 30 segments then it takes 20 minutes to fill
 them, and if I put the timeout to 15min then I'll have to wait for 15 minutes to
 test.

a) The kernel's writeback logic only kicks in with delay. b) The amount
of writes you're doing with short checkpoint intervals is overall
significantly higher than with longer intervals. That obviously has
impact on latency as well as throughput. c) the time it fills for
segments to be filled is mostly irrelevant. The phase that's very likely
causing troubles for you is the fsyncs issued at the end of a
checkpoint.

 There's three reasons:
 a) if checkpoint_timeout + completion_target is large and the checkpoint
 isn't executed prematurely, most of the dirty data has been written out
 by the kernel's background flush processes.
 
 Why would they be written by the kernel if bgwriter has not sent them??

I think you're misunderstanding how spread checkpoints work. When the
checkpointer process starts a spread checkpoint it first writes all
buffers to the kernel in a paced manner. That pace is determined by
checkpoint_completion_target and checkpoint_timeout. Once all buffers
that are old enough to need to be checkpointed have been written out, the
checkpointer fsync()s all the on-disk files. That part is *NOT*
paced. Then it can go on to remove old WAL files.

The latency problem is almost certainly created by the fsync()s
mentioned above. When they're executed, the kernel starts flushing out a
lot of dirty buffers at once - creating very deep IO queues, which makes
it take a long time to process synchronous additions (WAL flushes, reads)
to that queue.
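
To make that shape concrete, here is a toy, self-contained C simulation of
the behaviour described above - this is *not* the checkpointer code, and all
names and numbers in it are made up purely for illustration:

#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double
elapsed_secs(const struct timespec *start)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return (now.tv_sec - start->tv_sec) + (now.tv_nsec - start->tv_nsec) / 1e9;
}

int
main(void)
{
	const int	nbuffers = 100;				/* stand-in for the dirty buffers */
	const double completion_target = 0.9;	/* "checkpoint_completion_target" */
	const double timeout = 10.0;			/* "checkpoint_timeout", seconds */
	struct timespec start;

	clock_gettime(CLOCK_MONOTONIC, &start);

	/* Phase 1: buffer writes, paced against completion_target * timeout. */
	for (int i = 0; i < nbuffers; i++)
	{
		/* a real checkpointer would hand dirty page i to the kernel here */
		double		done = (double) (i + 1) / nbuffers;

		while (done > elapsed_secs(&start) / (completion_target * timeout))
			usleep(10 * 1000);				/* ahead of schedule: sleep */
	}

	/*
	 * Phase 2 is NOT paced: every touched file would be fsync()ed back to
	 * back here, which is where the deep IO queues come from.
	 */
	printf("after %.1f s of paced writes, fsync everything at once\n",
		   elapsed_secs(&start));
	return 0;
}

Only phase 1 is spread over completion_target * timeout; phase 2 pushes all
the dirty data the kernel has accumulated to disk at once.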

 c) If checkpoint's are infrequent enough, the penalty of them causing
 problems, especially if not using ext4, plays less of a role overall.
 
 I think that what you suggest would only delay the issue, not solve it.

The amount of dirty data that needs to be flushed is essentially
bounded. If you have a stall of roughly the same magnitude (say a factor
of two different), the smaller once a minute, the larger once an
hour. Obviously the once-an-hour one will have a better latency in many,
many more transactions.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Compute attr_needed for child relations (was Re: [HACKERS] inherit support for foreign tables)

2014-08-26 Thread Shigeru Hanada
Hi Fujita-san,

I reviewed the v4 patch, and here are some comments from me.

* After applying this patch, pull_varattnos() should not be called in
unnecessary places.  For developers who want a list of
columns-to-be-processed for some purpose, it would be nice to mention
when they should use pull_varattnos() and when they should not, maybe
at the comments of pull_varattnos() implementation.

* deparseReturningList() and postgresGetForeignRelSize() in
contrib/postgres_fdw/ also call pull_varattnos() to determine which
column to be in the SELECT clause of remote query.  Shouldn't these be
replaced in the same manner?  Other pull_varattnos() calls are for
restrictions, so IIUC they can't be replaced.

* Through this review it occurred to me that a lazy evaluation approach might
fit attr_needed.  I mean that we leave attr_needed for child relations
empty, and fill it at the first request for it.  This would avoid
useless computation of attr_needed for child relations, though this
idea has been considered but thrown away before...


2014-08-20 18:55 GMT+09:00 Etsuro Fujita fujita.ets...@lab.ntt.co.jp:
 Hi Ashutish,


 (2014/08/14 22:30), Ashutosh Bapat wrote:

 On Thu, Aug 14, 2014 at 10:05 AM, Etsuro Fujita
 fujita.ets...@lab.ntt.co.jp mailto:fujita.ets...@lab.ntt.co.jp wrote:

 (2014/08/08 18:51), Etsuro Fujita wrote:
   (2014/06/30 22:48), Tom Lane wrote:
   I wonder whether it isn't time to change that.  It was coded
 like that
   originally only because calculating the values would've been a
 waste of
   cycles at the time.  But this is at least the third place
 where it'd be
   useful to have attr_needed for child rels.

   I've revised the patch.

 There was a problem with the previous patch, which will be described
 below.  Attached is the updated version of the patch addressing that.


 Here are some more comments


 Thank you for the further review!


 attr_needed also has the attributes used in the restriction clauses as
 seen in distribute_qual_to_rels(), so, it looks unnecessary to call
 pull_varattnos() on the clauses in baserestrictinfo in functions
 check_selective_binary_conversion(), remove_unused_subquery_outputs(),
 check_index_only().


 IIUC, I think it's *necessary* to do that in those functions since the
 attributes used in the restriction clauses in baserestrictinfo are not added
 to attr_needed due to the following code in distribute_qual_to_rels.

 /*
  * If it's a join clause (either naturally, or because delayed by
  * outer-join rules), add vars used in the clause to targetlists of
 their
  * relations, so that they will be emitted by the plan nodes that scan
  * those relations (else they won't be available at the join node!).
  *
  * Note: if the clause gets absorbed into an EquivalenceClass then this
  * may be unnecessary, but for now we have to do it to cover the case
  * where the EC becomes ec_broken and we end up reinserting the original
  * clauses into the plan.
  */
 if (bms_membership(relids) == BMS_MULTIPLE)
 {
 List   *vars = pull_var_clause(clause,
PVC_RECURSE_AGGREGATES,
PVC_INCLUDE_PLACEHOLDERS);

 add_vars_to_targetlist(root, vars, relids, false);
 list_free(vars);

 }

 Although in case of RTE_RELATION, the 0th entry in attr_needed
 corresponds to FirstLowInvalidHeapAttributeNumber + 1, it's always safer
 to use RelOptInfo::min_attr, in case get_relation_info() wants to
 change assumption or somewhere down the line some other part of code
 wants to change attr_needed information. It may be unlikely, that it
 would change, but it will be better to stick to comments in RelOptInfo
   443 AttrNumber  min_attr;   /* smallest attrno of rel (often
 0) */
   444 AttrNumber  max_attr;   /* largest attrno of rel */
   445 Relids *attr_needed;/* array indexed [min_attr ..
 max_attr] */


 Good point!  Attached is the revised version of the patch.


 Thanks,

 Best regards,
 Etsuro Fujita



-- 
Shigeru HANADA


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347

2014-08-26 Thread Erik Rijkers
On Tue, August 26, 2014 11:13, Andrew Gierth wrote:
 Andrew == Andrew Gierth and...@tao11.riddles.org.uk writes:

 Erik == Erik Rijkers e...@xs4all.nl writes:

  Erik The patches did not apply anymore so I applied at 73eba19aebe0.
  Erik There they applied OK, and make  make check was OK.

  Andrew I'll look and rebase if need be.

 They apply cleanly for me at 2bde297 whether with git apply or patch,
 except for the contrib one (which you don't need unless you want to
 run the contrib regression tests without applying the gsp-u patch).


Ah, I had not realised that.  Excluding that contrib-patch and only applying 
these three:

gsp1.patch
gsp2.patch
gsp-doc.patch

does indeed work (applies, compiles).

Thank you,


Erik Rijkers



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for CSN based snapshots

2014-08-26 Thread Heikki Linnakangas

On 08/26/2014 12:03 PM, Jeff Davis wrote:

On Fri, 2014-06-13 at 13:24 +0300, Heikki Linnakangas wrote:

Attached is a new patch. It now keeps the current pg_clog unchanged, but
adds a new pg_csnlog besides it. pg_csnlog is more similar to
pg_subtrans than pg_clog: it's not WAL-logged, is reset at startup, and
segments older than GlobalXmin can be truncated.


It appears that this patch weakens the idea of hint bits. Even if
HEAP_XMIN_COMMITTED is set, it still needs to find out if it's in the
snapshot.

That's fast if the xid is less than snap->xmin, but otherwise it needs
to do a fetch from the csnlog, which is exactly what the hint bits are
designed to avoid. And we can't get around this, because the whole point
of this patch is to remove the xip array from the snapshot.


Yeah. This patch in the current state is likely much much slower than 
unpatched master, except in extreme cases where you have thousands of 
connections and short transactions so that without the patch, you spend 
most of the time acquiring snapshots.


My current thinking is that access to the csnlog will need to be made 
faster. Currently, it's just another SLRU, but I'm sure we can do better 
than that. Or add a backend-private cache on top of it; that might 
already alleviate it enough..
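
Just to sketch the backend-private cache idea (this is something I am making
up for illustration, not code from the patch; the types and names are
invented):

#include <stdint.h>

typedef uint32_t MyTransactionId;	/* invented stand-ins, not PG's typedefs */
typedef uint64_t MyCommitSeqNo;

#define CSN_CACHE_SIZE 1024			/* per backend, must be a power of two */

typedef struct
{
	MyTransactionId xid;
	MyCommitSeqNo csn;
} CsnCacheEntry;

static CsnCacheEntry csn_cache[CSN_CACHE_SIZE];

/*
 * Direct-mapped, backend-local cache in front of the csnlog: repeated
 * visibility checks of the same xid skip the SLRU lookup entirely.  A real
 * implementation would have to avoid caching (or later invalidate) entries
 * for transactions that were still in progress when first looked up.
 */
static MyCommitSeqNo
csn_lookup_cached(MyTransactionId xid,
				  MyCommitSeqNo (*fetch_from_csnlog) (MyTransactionId))
{
	CsnCacheEntry *e = &csn_cache[xid & (CSN_CACHE_SIZE - 1)];

	if (e->xid != xid)
	{
		e->xid = xid;
		e->csn = fetch_from_csnlog(xid);	/* the expensive part */
	}
	return e->csn;
}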


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] replication commands and log_statements

2014-08-26 Thread Fujii Masao
On Wed, Aug 20, 2014 at 1:14 PM, Michael Paquier
michael.paqu...@gmail.com wrote:



 On Wed, Aug 20, 2014 at 2:06 AM, Robert Haas robertmh...@gmail.com wrote:

 On Sat, Aug 16, 2014 at 10:27 AM, Amit Kapila amit.kapil...@gmail.com
 wrote:
  I think ideally it would have been better if we could have logged
  replication commands under separate log_level, but as still there
  is no consensus on extending log_statement and nobody is even
  willing to pursue, it seems okay to go ahead and log these under
  'all' level.

 I think the consensus is clearly for a separate GUC.

 +1.

Okay. Attached is the updated version of the patch which I posted before.
This patch follows the consensus and adds separate parameter
log_replication_command.

Regards,

-- 
Fujii Masao
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 4558,4563  FROM pg_stat_activity;
--- 4558,4579 
/listitem
   /varlistentry
  
+  <varlistentry id="guc-log-replication-command" xreflabel="log_replication_command">
+   <term><varname>log_replication_command</varname> (<type>boolean</type>)
+   <indexterm>
+    <primary><varname>log_replication_command</> configuration parameter</primary>
+   </indexterm>
+   </term>
+   <listitem>
+    <para>
+     Causes each replication command to be logged in the server log.
+     See <xref linkend="protocol-replication"> for more information about
+     replication commands. The default value is <literal>off</>.
+     Only superusers can change this setting.
+    </para>
+   </listitem>
+  </varlistentry>
+ 
   <varlistentry id="guc-log-temp-files" xreflabel="log_temp_files">
    <term><varname>log_temp_files</varname> (<type>integer</type>)
    <indexterm>
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***
*** 108,113  bool		am_db_walsender = false;	/* Connected to a database? */
--- 108,114 
  int			max_wal_senders = 0;	/* the maximum number of concurrent walsenders */
  int			wal_sender_timeout = 60 * 1000;		/* maximum time to send one
   * WAL data message */
+ bool		log_replication_command = false;
  
  /*
   * State for WalSndWakeupRequest
***
*** 1268,1280  exec_replication_command(const char *cmd_string)
  	MemoryContext old_context;
  
  	/*
  	 * CREATE_REPLICATION_SLOT ... LOGICAL exports a snapshot until the next
  	 * command arrives. Clean up the old stuff if there's anything.
  	 */
  	SnapBuildClearExportedSnapshot();
  
- 	elog(DEBUG1, "received replication command: %s", cmd_string);
- 
  	CHECK_FOR_INTERRUPTS();
  
  	cmd_context = AllocSetContextCreate(CurrentMemoryContext,
--- 1269,1287 
  	MemoryContext old_context;
  
  	/*
+ 	 * Log replication command if log_replication_command is enabled.
+ 	 * Even when it's disabled, log the command with DEBUG1 level for
+ 	 * backward compatibility.
+ 	 */
+ 	ereport(log_replication_command ? LOG : DEBUG1,
+ 			(errmsg("received replication command: %s", cmd_string)));
+ 
+ 	/*
  	 * CREATE_REPLICATION_SLOT ... LOGICAL exports a snapshot until the next
  	 * command arrives. Clean up the old stuff if there's anything.
  	 */
  	SnapBuildClearExportedSnapshot();
  
  	CHECK_FOR_INTERRUPTS();
  
  	cmd_context = AllocSetContextCreate(CurrentMemoryContext,
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***
*** 925,930  static struct config_bool ConfigureNamesBool[] =
--- 925,939 
  		NULL, NULL, NULL
  	},
  	{
+ 		{"log_replication_command", PGC_SUSET, LOGGING_WHAT,
+ 			gettext_noop("Logs each replication command."),
+ 			NULL
+ 		},
+ 		&log_replication_command,
+ 		false,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"debug_assertions", PGC_INTERNAL, PRESET_OPTIONS,
  			gettext_noop("Shows whether the running server has assertion checks enabled."),
  			NULL,
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***
*** 431,436 
--- 431,437 
  	# e.g. '%u%%%d '
#log_lock_waits = off			# log lock waits >= deadlock_timeout
  #log_statement = 'none'			# none, ddl, mod, all
+ #log_replication_command = off
  #log_temp_files = -1			# log temporary files equal or larger
  	# than the specified size in kilobytes;
  	# -1 disables, 0 logs all temp files
*** a/src/include/replication/walsender.h
--- b/src/include/replication/walsender.h
***
*** 25,30  extern bool wake_wal_senders;
--- 25,31 
  /* user-settable parameters */
  extern int	max_wal_senders;
  extern int	wal_sender_timeout;
+ extern bool	log_replication_command;
  
  extern void InitWalSender(void);
  extern void exec_replication_command(const char *query_string);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Fwd: Re: [HACKERS] proposal: new long psql parameter --on-error-stop]

2014-08-26 Thread Andres Freund
On 2014-06-29 18:54:50 +0530, Abhijit Menon-Sen wrote:
 At 2014-06-29 20:35:04 +0900, maumau...@gmail.com wrote:
 
  Thanks, I marked it as ready for committer.  I hope Fujii san or
  another committer will commit this, refining English expression if
  necessary.
 
 Since it was just a matter of editing, I went through the patch and
 corrected various minor errors (typos, awkwardness, etc.). I agree
 that this is now ready for committer.
 
 I've attached the updated diff.

I'm looking at committing this, but I wonder: Shouldn't this be
accessible from inside psql as well? I.e. as a backslash command?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Switch pg_basebackup to use -X stream instead of -X fetch by default?

2014-08-26 Thread Oskari Saarenmaa

On 25/08/14 14:35, Andres Freund wrote:

currently pg_basebackup uses fetch mode when only -x is specified -
which imo isn't a very good thing to use due to the increased risk of
not fetching everything.
How about switching to stream mode for 9.5+?


+1.  I was just wondering why it's not the default a few days ago.

/ Oskari



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Verbose output of pg_dump not show schema name

2014-08-26 Thread Fabrízio de Royes Mello
On Tue, Aug 26, 2014 at 4:10 AM, Michael Paquier michael.paqu...@gmail.com
wrote:

 On Tue, Aug 26, 2014 at 3:48 AM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
  AFAICS, the namespace can never be NULL in any of these. There is a
  selectSourceSchema(fout, tbinfo-dobj.namespace-dobj.name) call
before or
  after printing the message, so if tbinfo-dobj.namespace is NULL, you'll
  crash anyway. Please double-check, and remove the dead code if you
agree.
 Ah right, this field is used in many places. Even for
 pg_backup_archiver.c, the portion of code processing data always has
 the namespace set. I am sure that Fabrizio would have done that
 quickly, but as I was on this thread I simplified the patch as
 attached.


Thanks Michael... last night I was working on a refactoring in
tablecmds.c.

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
 Timbira: http://www.timbira.com.br
 Blog: http://fabriziomello.github.io
 Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello
 Github: http://github.com/fabriziomello


Re: [Fwd: Re: [HACKERS] proposal: new long psql parameter --on-error-stop]

2014-08-26 Thread Petr Jelinek

On 26/08/14 13:20, Andres Freund wrote:


I'm looking at committing this, but I wonder: Shouldn't this be
accessible from inside psql as well? I.e. as a backslash command?



+1


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Fwd: Re: [HACKERS] proposal: new long psql parameter --on-error-stop]

2014-08-26 Thread Pavel Stehule
2014-08-26 13:30 GMT+02:00 Petr Jelinek p...@2ndquadrant.com:

 On 26/08/14 13:20, Andres Freund wrote:


 I'm looking at committing this, but I wonder: Shouldn't this be
 accessible from inside psql as well? I.e. as a backslash command?


 +1


do you have an idea for the command name?  \?+

Pavel






 --
  Petr Jelinek  http://www.2ndQuadrant.com/

  PostgreSQL Development, 24x7 Support, Training  Services



Re: [Fwd: Re: [HACKERS] proposal: new long psql parameter --on-error-stop]

2014-08-26 Thread Andres Freund
On 2014-08-26 13:44:16 +0200, Pavel Stehule wrote:
 2014-08-26 13:30 GMT+02:00 Petr Jelinek p...@2ndquadrant.com:
 
  On 26/08/14 13:20, Andres Freund wrote:
 
 
  I'm looking at committing this, but I wonder: Shouldn't this be
  accessible from inside psql as well? I.e. as a backslash command?
 
 
  +1
 
 
 have you idea about command name?  \?+

Some ideas:

\hv
\help-variables
\? set
\? variables


Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-08-26 Thread Fujii Masao
On Tue, Aug 19, 2014 at 6:37 PM, Rahila Syed rahilasye...@gmail.com wrote:
 Hello,
 Thank you for comments.

Could you tell me where the patch for single block in one run is?
 Please find attached patch for single block compression in one run.

Thanks! I ran the benchmark using pgbench and compared the results.
I'd like to share the results.

[RESULT]
Amount of WAL generated during the benchmark. Unit is MB.

          Multiple    Single
off          202.0     201.5
on          6051.0    6053.0
pglz        3543.0    3567.0
lz4         3344.0    3485.0
snappy      3354.0    3449.5

Latency average during the benchmark. Unit is ms.

          Multiple    Single
off           19.1      19.0
on            55.3      57.3
pglz          45.0      45.9
lz4           44.2      44.7
snappy        43.4      43.3

These results show that FPW compression is really helpful for decreasing
the WAL volume and improving the performance.

The compression ratio by lz4 or snappy is better than that by pglz. But
it's difficult to conclude which lz4 or snappy is best, according to these
results.

ISTM that compression-of-multiple-pages-at-a-time approach can compress
WAL more than compression-of-single-... does.

[HOW TO BENCHMARK]
Create a pgbench database with scale factor 1000.

Change the data type of the column filler on each pgbench table
from CHAR(n) to TEXT, and fill the data with the result of pgcrypto's
gen_random_uuid() in order to avoid empty columns, e.g.,

 alter table pgbench_accounts alter column filler type text using
gen_random_uuid()::text

After creating the test database, run pgbench as follows. The
number of transactions executed during the benchmark is almost the same
across runs because the -R option is used.

  pgbench -c 64 -j 64 -r -R 400 -T 900 -M prepared

checkpoint_timeout is 5min, so it's expected that checkpoint was
executed at least two times during the benchmark.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347

2014-08-26 Thread Andrew Gierth
 Erik == Erik Rijkers e...@xs4all.nl writes:

  They apply cleanly for me at 2bde297 whether with git apply or
  patch, except for the contrib one (which you don't need unless you
  want to run the contrib regression tests without applying the
  gsp-u patch).

 Erik Ah, I had not realised that.  Excluding that contrib-patch and
 Erik only applying these three:

 Erik gsp1.patch
 Erik gsp2.patch
 Erik gsp-doc.patch

 Erik does indeed work (applies, compiles).

I put up a rebased contrib patch anyway (linked off the CF).

Did the unrecognized node type error go away, or do we still need to
look into that?

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] a7ae1dc has broken the windows builds

2014-08-26 Thread David Rowley
I've attached a small patch which should get the windows builds up and
running again.

They're currently failing from this:

c:\prog\bf\root\HEAD\pgsql.build\pgsql.sln (default target) (1) ->
c:\prog\bf\root\HEAD\pgsql.build\pg_upgrade_support.vcxproj (default
target) (67) ->
(Link target) ->
  pg_upgrade_support.obj : error LNK2001: unresolved external symbol
IsBinaryUpgrade
[c:\prog\bf\root\HEAD\pgsql.build\pg_upgrade_support.vcxproj]
  .\Release\pg_upgrade_support\pg_upgrade_support.dll : fatal error
LNK1120: 1 unresolved externals
[c:\prog\bf\root\HEAD\pgsql.build\pg_upgrade_support.vcxproj]

0 Warning(s)
2 Error(s)

Regards

David Rowley


IsBinaryUpgrade_fix.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to support SEMI and ANTI join removal

2014-08-26 Thread David Rowley
On Tue, Aug 5, 2014 at 10:35 PM, David Rowley dgrowle...@gmail.com wrote:


 Currently most of my changes are in analyzejoin.c, but I did also have to
 make changes to load the foreign key constraints so that they were
 available to the planner. One thing that is currently lacking, which would
 likely be needed, before the finished patch is ready, would be a
 relhasfkeys column in pg_class. Such a column would mean that it would be
 possible to skip scanning pg_constraint for foreign keys when there's none
 to find. I'll delay implementing that until I get a bit more feedback to
 weather this patch would be a welcome addition to the existing join removal
 code or not.


I've modified this patch to include a new relhasfkey column in pg_class,
and then only attempt to load the foreign keys in get_relation_info() if
the pg_class flag is true.

Currently what I'm not quite sure on is the best place to be clearing this
flag. I see that relhaspkey is cleared during vacuum, but only if there's
no indexes at all on the relation. It seems that it will remain set to
true after vacuum, if the primary key is dropped and there's still other
indexes on the relation. My guess here is that this is done so that
pg_constraint does not have to be checked to see if a PK exists, which is
why I'm not sure if this would be the correct place for me to do the same
in order to determine if there's any FKs on the relation. Though I can't
quite think where else I might clear this flag.

Any ideas or feedback on this would be welcome

Regards

David Rowley


semianti_join_removal_3ab5ccb_2014-08-27.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347

2014-08-26 Thread Erik Rijkers
On Tue, August 26, 2014 14:24, Andrew Gierth wrote:
 Erik == Erik Rijkers e...@xs4all.nl writes:

   They apply cleanly for me at 2bde297 whether with git apply or
   patch, except for the contrib one (which you don't need unless you
   want to run the contrib regression tests without applying the
   gsp-u patch).

  Erik Ah, I had not realised that.  Excluding that contrib-patch and
  Erik only applying these three:

  Erik gsp1.patch
  Erik gsp2.patch
  Erik gsp-doc.patch

  Erik does indeed work (applies, compiles).

 I put up a rebased contrib patch anyway (linked off the CF).

 Did the unrecognized node type error go away, or do we still need to
 look into that?


Yes, it did go away; looks fine now:

 select brand , size , grouping(brand, size) , sum(sales) from items_sold group 
by rollup(brand, size) ;
 brand | size | grouping | sum
-------+------+----------+-----
 Bar   | L    |        0 |   5
 Bar   | M    |        0 |  15
 Bar   |      |        1 |  20
 Foo   | L    |        0 |  10
 Foo   | M    |        0 |  20
 Foo   |      |        1 |  30
       |      |        3 |  50
(7 rows)


I'm a bit unclear why the bottom-row 'grouping' value is 3.  Shouldn't that be 
2?

But I'm still reading the documentation so it's perhaps too early to ask...

Thanks,

Erik Rijkers




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to add support of IF NOT EXISTS to others CREATE statements

2014-08-26 Thread Heikki Linnakangas

On 04/14/2014 10:31 PM, Fabrízio de Royes Mello wrote:

On Tue, Apr 1, 2014 at 2:46 PM, Robert Haas robertmh...@gmail.com wrote:



Where this is a bit more interesting is in the case of sequences, where
resetting the sequence to zero may cause further inserts into an
existing table to fail.


Yeah.  Sequences do have contained data, which makes COR harder to

define

--- that's part of the reason why we have CINE not COR for tables, and
maybe we have to do the same for sequences.  The point being exactly
that if you use CINE, you're implicitly accepting that you don't know
the ensuing state fully.


Yeah.  I think CINE is more sensible than COR for sequences, for
precisely the reason that they do have contained data (even if it's
basically only one value).



The attached patch contains CINE for sequences.

I just strip this code from the patch rejected before.


Committed with minor changes:

* The documentation promised too much. It said that it would not throw 
an error if a sequence with the same name exists. In fact, it will not 
throw an error if any relation with the same name exists. I rewrote that 
paragraph to emphasize that more, re-using the phrases from the CREATE 
TABLE manual page.


* don't call RangeVarGetAndCheckCreationNamespace unnecessarily when IF 
NOT EXISTS is not used.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] add line number as prompt option to psql

2014-08-26 Thread Andres Freund
On 2014-08-21 11:43:48 +0900, Sawada Masahiko wrote:
 On Wed, Aug 20, 2014 at 9:00 PM, Jeevan Chalke
 jeevan.cha...@enterprisedb.com wrote:
  Hi,
 
  I have reviewed this:
 
  I have initialized cur_lineno to UINTMAX - 2, and then observed the following
  behaviour to check wrap-around.
 
  postgres=# \set PROMPT1 '%/[%l]%R%# '
  postgres[18446744073709551613]=# \set PROMPT2 '%/[%l]%R%# '
  postgres[18446744073709551613]=# select
  postgres[18446744073709551614]-# a
  postgres[18446744073709551615]-# ,
  postgres[0]-# b
  postgres[1]-# from
  postgres[2]-# dual;
 
  It is wrapping to 0, whereas the line number always starts with 1. Any issues?
 
  I would like to ignore this, as UINTMAX lines are too much for an input
  buffer to hold. There is almost no chance of hitting this.
 
 
  However, I think you need to use UINT64_FORMAT while printing a uint64
  number. Currently it is %u, which wraps around at UINT_MAX.
  See how pset.lineno is displayed.
 
  Apart from this, I didn't see any issues in my testing.
 
  Patch applies cleanly. make/make install/initdb/make check all work fine.
 
 
 Thank you for reviewing the patch!
 Attached patch is latest version patch.
 I modified the output format of cur_lineno.

I like the feature - and I wanted to commit it, but enough stuff turned
up that I needed to fix that it warrants some new testing.

Stuff I've changed:
* removed include of limits.h - that probably was a remnant from a
  previous version
* removed a trailing whitespace
* expanded the documentation about %l. "The current line number" isn't
  very clear. Of a file? Of all lines ever entered in psql? It's now
  "The line number inside the current statement, starting from
  <literal>1</>."
* Correspondingly I've changed the variable's name to stmt_lineno.
* COPY FROM ... STDIN/PROMPT3 was broken because a) the prompt was only generated
  once b) the lineno wasn't incremented.
* CTRL-C didn't reset the line number.
*  Unfortunately I've noticed here that the prompting is broken in some
common cases:

postgres[1]=# SELECT 1,
postgres[2]-# '2
postgres[2]'# 2b
postgres[2]'# 2c
postgres[2]'# 2d',
postgres[3]-# 3;
┌──┬──┬──┐
│ ?column? │ ?column? │ ?column? │
├──┼──┼──┤
│1 │ 2   ↵│3 │
│  │ 2b  ↵│  │
│  │ 2c  ↵│  │
│  │ 2d   │  │
└──┴──┴──┘
(1 row)

postgres[1]=# SELECT 1,
'2
2b
2c
2d',
3
postgres[7]-#

That's rather inconsistent...


I've attached my version of the patch. Note that I've got rid of all the
PSCAN_INCOMPLETE checks... Please test!

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
From 476d799d74f2ea8eefc3480f176b3726c35cf425 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Tue, 26 Aug 2014 13:35:49 +0200
Subject: [PATCH] Add psql PROMPT variable showing which line of a statement is
 being edited.

The new %l substitution shows the line number inside a (potentially
multi-line) statement starting from one.

Author: Sawada Masahiko, editorialized by me.
Reviewed-By: Jeevan Chalke, Alvaro Herrera
---
 doc/src/sgml/ref/psql-ref.sgml |  9 +
 src/bin/psql/copy.c| 15 +--
 src/bin/psql/mainloop.c| 17 +
 src/bin/psql/prompt.c  |  5 +
 src/bin/psql/settings.h|  1 +
 5 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 74d4618..db314c3 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -3316,6 +3316,15 @@ testdb=&gt; <userinput>INSERT INTO my_table VALUES (:'content');</userinput>
   </varlistentry>
 
   <varlistentry>
+<term><literal>%l</literal></term>
+<listitem>
+ <para>
+  The line number inside the current statement, starting from <literal>1</>.
+ </para>
+</listitem>
+  </varlistentry>
+
+  <varlistentry>
 <term><literal>%</literal><replaceable class="parameter">digits</replaceable></term>
 <listitem>
 <para>
diff --git a/src/bin/psql/copy.c b/src/bin/psql/copy.c
index c759abf..6908742 100644
--- a/src/bin/psql/copy.c
+++ b/src/bin/psql/copy.c
@@ -517,8 +517,8 @@ bool
 handleCopyIn(PGconn *conn, FILE *copystream, bool isbinary, PGresult **res)
 {
 	bool		OK;
-	const char *prompt;
 	char		buf[COPYBUFSIZ];
+	bool		showprompt = false;
 
 	/*
 	 * Establish longjmp destination for exiting from wait-for-input. (This is
@@ -540,21 +540,20 @@ handleCopyIn(PGconn *conn, FILE *copystream, bool isbinary, PGresult **res)
 	/* Prompt if interactive input */
 	if (isatty(fileno(copystream)))
 	{
+		showprompt = true;
 		if (!pset.quiet)
 			puts(_("Enter data to be copied followed by a newline.\n"
 				   "End with a backslash and a period on a line by itself."));
-		prompt = get_prompt(PROMPT_COPY);
 	}
-	else
-		prompt = NULL;
 
 	OK = true;
 
 	if 

Re: [HACKERS] a7ae1dc has broken the windows builds

2014-08-26 Thread Andres Freund
On 2014-08-27 00:25:43 +1200, David Rowley wrote:
 I've attached a small patch which should get the windows builds up and
 running again.

Pushed, thanks.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to add support of IF NOT EXISTS to others CREATE statements

2014-08-26 Thread Fabrízio de Royes Mello
On Tue, Aug 26, 2014 at 10:20 AM, Heikki Linnakangas 
hlinnakan...@vmware.com wrote:

 On 04/14/2014 10:31 PM, Fabrízio de Royes Mello wrote:

 On Tue, Apr 1, 2014 at 2:46 PM, Robert Haas robertmh...@gmail.com
 wrote:


  Where this is a bit more interesting is in the case of sequences, where
 resetting the sequence to zero may cause further inserts into an
 existing table to fail.


 Yeah.  Sequences do have contained data, which makes COR harder to

 define

 --- that's part of the reason why we have CINE not COR for tables, and
 maybe we have to do the same for sequences.  The point being exactly
 that if you use CINE, you're implicitly accepting that you don't know
 the ensuing state fully.


 Yeah.  I think CINE is more sensible than COR for sequences, for
 precisely the reason that they do have contained data (even if it's
 basically only one value).


 The attached patch contains CINE for sequences.

 I just strip this code from the patch rejected before.


 Committed with minor changes:

 * The documentation promised too much. It said that it would not throw an
 error if a sequence with the same name exists. In fact, it will not throw
 an error if any relation with the same name exists. I rewrote that
 paragraph to emphasize that more, re-using the phrases from the CREATE
 TABLE manual page.

 * don't call RangeVarGetAndCheckCreationNamespace unnecessarily when IF
 NOT EXISTS is not used.


Thanks!


-- 
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
 Timbira: http://www.timbira.com.br
 Blog: http://fabriziomello.github.io
 Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello
 Github: http://github.com/fabriziomello


Re: [HACKERS] Patch to support SEMI and ANTI join removal

2014-08-26 Thread Heikki Linnakangas

On 08/26/2014 03:28 PM, David Rowley wrote:

Any ideas or feedback on this would be welcome


Before someone spends time reviewing this patch, are you sure this is 
worth the effort? It seems like a very narrow use case to me. I understand 
removing LEFT and INNER joins, but the case for SEMI and ANTI joins 
seems a lot thinner. Unnecessary LEFT and INNER joins can easily creep 
into a query when views are used, for example, but I can't imagine that 
happening for a SEMI or ANTI join. Maybe I'm lacking imagination. If 
someone has run into a query in the wild that would benefit from this, 
please raise your hand.


If I understood correctly, you're planning to work on INNER join removal 
too. How much of the code in this patch is also required for INNER join 
removal, and how much is specific to SEMI and ANTI joins?


Just so everyone is on the same page on what kind of queries this helps 
with, here are some examples from the added regression tests:



-- Test join removals for semi and anti joins
CREATE TEMP TABLE b (id INT NOT NULL PRIMARY KEY, val INT);
CREATE TEMP TABLE a (id INT NOT NULL PRIMARY KEY, b_id INT REFERENCES b(id));
-- should remove semi join to b
EXPLAIN (COSTS OFF)
SELECT id FROM a WHERE b_id IN(SELECT id FROM b);
  QUERY PLAN
--
 Seq Scan on a
   Filter: (b_id IS NOT NULL)
(2 rows)

-- should remove semi join to b
EXPLAIN (COSTS OFF)
SELECT id FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.b_id = id);
  QUERY PLAN
--
 Seq Scan on a
   Filter: (b_id IS NOT NULL)
(2 rows)

-- should remove anti join to b
EXPLAIN (COSTS OFF)
SELECT id FROM a WHERE NOT EXISTS(SELECT 1 FROM b WHERE a.b_id = id);
QUERY PLAN
--
 Seq Scan on a
   Filter: (b_id IS NULL)
(2 rows)


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] replicating DROP commands across servers

2014-08-26 Thread Alvaro Herrera
Andres Freund wrote:
 On 2014-06-13 15:50:50 -0400, Alvaro Herrera wrote:
  Here's a patch implementing the proposed idea.  This is used in the
  Bidirectional Replication stuff by Simon/Andres; it works well.
 
 I think there's been some changes to this patch since july, care to
 resend a new version?

Sure, here it is.

The only difference with the previous version is that it now also
supports column defaults.  This was found to be a problem when you drop
a sequence that some column default depends on -- for example a column
declared SERIAL, or a sequence marked with ALTER SEQUENCE OWNED BY.  The
new code is able to drop both the sequence and the default value
(leaving, of course, the rest of the column intact.)  This required
adding support for such objects in get_object_address.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***
*** 17564,17569  FOR EACH ROW EXECUTE PROCEDURE suppress_redundant_updates_trigger();
--- 17564,17582 
  <entry>Object sub-id (e.g. attribute number for columns)</entry>
 </row>
 <row>
+ <entry><literal>original</literal></entry>
+ <entry><type>bool</type></entry>
+ <entry>Flag used to identify the root object of the deletion</entry>
+</row>
+<row>
+ <entry><literal>normal</literal></entry>
+ <entry><type>bool</type></entry>
+ <entry>
+  Flag indicating that there's a normal dependency relationship
+  in the dependency graph leading to this object
+ </entry>
+</row>
+<row>
  <entry><literal>object_type</literal></entry>
  <entry><type>text</type></entry>
  <entry>Type of the object</entry>
***
*** 17593,17598  FOR EACH ROW EXECUTE PROCEDURE suppress_redundant_updates_trigger();
--- 17606,17627 
   identifier present in the identity is quoted if necessary.
  </entry>
 </row>
+<row>
+ <entry><literal>address_names</literal></entry>
+ <entry><type>text[]</type></entry>
+ <entry>
+  An array that, together with <literal>address_args</literal>,
+  can be used by the C-language function getObjectAddress() to
+  recreate the object address in a remote server containing a similar object.
+ </entry>
+</row>
+<row>
+ <entry><literal>address_args</literal></entry>
+ <entry><type>text[]</type></entry>
+ <entry>
+  See <literal>address_names</literal> above.
+ </entry>
+</row>
    </tbody>
   </tgroup>
  </informaltable>
*** a/src/backend/catalog/dependency.c
--- b/src/backend/catalog/dependency.c
***
*** 203,218  deleteObjectsInList(ObjectAddresses *targetObjects, Relation *depRel,
  	/*
  	 * Keep track of objects for event triggers, if necessary.
  	 */
! 	if (trackDroppedObjectsNeeded())
  	{
  		for (i = 0; i < targetObjects->numrefs; i++)
  		{
! 			ObjectAddress *thisobj = targetObjects->refs + i;
! 
! 			if ((!(flags & PERFORM_DELETION_INTERNAL)) &&
! 				EventTriggerSupportsObjectClass(getObjectClass(thisobj)))
  			{
! 				EventTriggerSQLDropAddObject(thisobj);
  			}
  		}
  	}
--- 203,227 
  	/*
  	 * Keep track of objects for event triggers, if necessary.
  	 */
! 	if (trackDroppedObjectsNeeded() && !(flags & PERFORM_DELETION_INTERNAL))
  	{
  		for (i = 0; i < targetObjects->numrefs; i++)
  		{
! 			const ObjectAddress *thisobj = &targetObjects->refs[i];
! 			const ObjectAddressExtra *extra = &targetObjects->extras[i];
! 			bool	original = false;
! 			bool	normal = false;
! 
! 			if (extra->flags & DEPFLAG_ORIGINAL)
! 				original = true;
! 			if (extra->flags & DEPFLAG_NORMAL)
! 				normal = true;
! 			if (extra->flags & DEPFLAG_REVERSE)
! 				normal = true;
! 
! 			if (EventTriggerSupportsObjectClass(getObjectClass(thisobj)))
  			{
! 				EventTriggerSQLDropAddObject(thisobj, original, normal);
  			}
  		}
  	}
*** a/src/backend/catalog/objectaddress.c
--- b/src/backend/catalog/objectaddress.c
***
*** 417,422  static const ObjectPropertyType ObjectProperty[] =
--- 417,513 
  	}
  };
  
+ /*
+  * This struct maps the object types as returned by getObjectTypeDescription
+  * into ObjType enum values.  Note that some enum values can be obtained by
+  * different names, and that some string object types do not have corresponding
+  * values in the enum.  The user of this map must be careful to test for
+  * invalid values being returned.
+  *
+  * This follows the order of getObjectTypeDescription.
+  */
+ static const struct object_type_map
+ {
+ 	const char *tm_name;
+ 	ObjectType	tm_type;
+ }
+ ObjectTypeMap[] =
+ {
+ 	/* OCLASS_CLASS */
+ 	{ "table", OBJECT_TABLE },
+ 	{ "index", OBJECT_INDEX },
+ 	{ "sequence", OBJECT_SEQUENCE },
+ 	{ "toast table", -1 },		/* unmapped */
+ 	{ "view", OBJECT_VIEW },
+ 	{ "materialized view", OBJECT_MATVIEW },
+ 	{ "composite type", OBJECT_COMPOSITE },
+ 	{ "foreign table", OBJECT_FOREIGN_TABLE },
+ 

Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-26 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 08/16/2014 02:19 AM, Tom Lane wrote:
 I think the realistic alternatives at this point are either to
 switch to all-lengths as in my test patch, or to use the hybrid approach
 of Heikki's test patch. ...
 Personally I'd prefer to go to the all-lengths approach, but a large
 part of that comes from a subjective assessment that the hybrid approach
 is too messy.  Others might well disagree.

 It's not too pretty, no. But it would be nice to not have to make a 
 tradeoff between lookup speed and compressibility.

 Yet another idea is to store all lengths, but add an additional array of 
 offsets to JsonbContainer. The array would contain the offset of, say, 
 every 16th element. It would be very small compared to the lengths 
 array, but would greatly speed up random access on a large array/object.

That does nothing to address my basic concern about the patch, which is
that it's too complicated and therefore bug-prone.  Moreover, it'd lose
on-disk compatibility which is really the sole saving grace of the
proposal.

My feeling about it at this point is that the apparent speed gain from
using offsets is illusory: in practically all real-world cases where there
are enough keys or array elements for it to matter, costs associated with
compression (or rather failure to compress) will dominate any savings we
get from offset-assisted lookups.  I agree that the evidence for this
opinion is pretty thin ... but the evidence against it is nonexistent.
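
For reference, the store-an-offset-every-Nth-element idea quoted above would
look roughly like this - the field and macro names here are invented, this is
not the actual JsonbContainer layout:

#include <stdint.h>

#define JB_OFFSET_STRIDE 16		/* one stored offset per 16 elements */

/*
 * Per-element lengths (small, repetitive values that compress well) plus a
 * sparse array of absolute offsets, so random access has to sum at most
 * JB_OFFSET_STRIDE - 1 lengths instead of scanning from the start.
 */
typedef struct JsonbContainerSketch
{
	uint32_t	nelems;
	const uint32_t *lengths;	/* one entry per element */
	const uint32_t *offsets;	/* one entry per JB_OFFSET_STRIDE elements */
} JsonbContainerSketch;

static uint32_t
element_offset(const JsonbContainerSketch *c, uint32_t i)
{
	uint32_t	off = c->offsets[i / JB_OFFSET_STRIDE];

	for (uint32_t j = i - i % JB_OFFSET_STRIDE; j < i; j++)
		off += c->lengths[j];
	return off;
}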

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-08-26 Thread Amit Kapila
On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas robertmh...@gmail.com wrote:

 Incidentally, while I generally think your changes to the locking regimen
in StrategyGetBuffer() are going in the right direction, they need
significant cleanup.  Your patch adds two new spinlocks, freelist_lck and
victimbuf_lck, that mostly but not-quite replace BufFreelistLock, and
you've now got StrategyGetBuffer() running with no lock at all when
accessing some things that used to be protected by BufFreelistLock;
specifically, you're doing StrategyControl-numBufferAllocs++ and
SetLatch(StrategyControl-bgwriterLatch) without any locking.  That's not
OK.  I think you should get rid of BufFreelistLock completely and just
decide that freelist_lck will protect everything the freeNext links, plus
everything in StrategyControl except for nextVictimBuffer.  victimbuf_lck
will protect nextVictimBuffer and nothing else.

 Then, in StrategyGetBuffer, acquire the freelist_lck at the point where
the LWLock is acquired today.  Increment StrategyControl->numBufferAllocs;
save the values of StrategyControl->bgwriterLatch; pop a buffer off the
freelist if there is one, saving its identity.  Release the spinlock.
 Then, set the bgwriterLatch if needed.  In the first loop, first check
whether the buffer we previously popped from the freelist is pinned or has
a non-zero usage count and return it if not, holding the buffer header
lock.  Otherwise, reacquire the spinlock just long enough to pop a new
potential victim and then loop around.
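
For concreteness, a rough sketch of that flow is below.  It is only an
illustration of the proposed protocol: freelist_lck and victimbuf_lck do not
exist yet, the other names come from the 9.4 freelist code, and the
usage-count decrement done by the real clock sweep is omitted.

    volatile BufferDesc *buf = NULL;
    Latch      *bgwriterLatch;

    SpinLockAcquire(&StrategyControl->freelist_lck);
    StrategyControl->numBufferAllocs++;
    bgwriterLatch = StrategyControl->bgwriterLatch;
    StrategyControl->bgwriterLatch = NULL;
    if (StrategyControl->firstFreeBuffer >= 0)
    {
        buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
        StrategyControl->firstFreeBuffer = buf->freeNext;
        buf->freeNext = FREENEXT_NOT_IN_LIST;
    }
    SpinLockRelease(&StrategyControl->freelist_lck);

    if (bgwriterLatch)
        SetLatch(bgwriterLatch);

    for (;;)
    {
        if (buf != NULL)
        {
            LockBufHdr(buf);
            if (buf->refcount == 0 && buf->usage_count == 0)
                return buf;        /* caller keeps the buffer header lock */
            UnlockBufHdr(buf);
        }

        /* pop the next potential victim under the narrow spinlock */
        SpinLockAcquire(&StrategyControl->victimbuf_lck);
        buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
        if (++StrategyControl->nextVictimBuffer >= NBuffers)
        {
            StrategyControl->nextVictimBuffer = 0;
            StrategyControl->completePasses++;  /* see discussion below */
        }
        SpinLockRelease(&StrategyControl->victimbuf_lck);
    }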


Today, while working on updating the patch to improve locking
I found that as now we are going to have a new process, we need
a separate latch in StrategyControl to wakeup that process.
Another point is I think it will be better to protect
StrategyControl->completePasses with victimbuf_lck rather than
freelist_lck, as when we are going to update it we will already be
holding the victimbuf_lck and it doesn't make much sense to release
the victimbuf_lck and reacquire freelist_lck to update it.

I thought it is better to mention about above points so that if you have
any different thoughts about it, then it is better to discuss them now
rather than after I take performance data with this locking protocol.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-08-26 Thread Tom Lane
Amit Kapila amit.kapil...@gmail.com writes:
 On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas robertmh...@gmail.com wrote:
 I think you should get rid of BufFreelistLock completely and just
 decide that freelist_lck will protect everything: the freeNext links, plus
 everything in StrategyControl except for nextVictimBuffer.  victimbuf_lck
 will protect nextVictimBuffer and nothing else.

 Another point is I think it will be better to protect
 StrategyControl->completePasses with victimbuf_lck rather than
 freelist_lck, as when we are going to update it we will already be
 holding the victimbuf_lck and it doesn't make much sense to release
 the victimbuf_lck and reacquire freelist_lck to update it.

I'm rather concerned by this cavalier assumption that we can protect
fields a,b,c with one lock and fields x,y,z in the same struct with some
other lock.

A minimum requirement for that to work safely at all is that the fields
are of atomically fetchable/storable widths; which might be okay here
but it's a restriction that bears thinking about (and documenting).

But quite aside from safety, the fields are almost certainly going to
be in the same cache line which means contention between processes that
are trying to fetch or store them concurrently.  For a patch whose sole
excuse for existence is to improve performance, that should be a very
scary concern.
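
For illustration, one common mitigation if the two-lock split is kept is to
pad the struct so that the fields guarded by different spinlocks cannot share
a cache line.  A minimal sketch (the 64-byte line size and the field grouping
are assumptions for the example, not taken from the patch):

    #define CACHE_LINE_SIZE 64

    typedef struct
    {
        /* fields protected by freelist_lck */
        slock_t     freelist_lck;
        int         firstFreeBuffer;
        uint32      numBufferAllocs;

        char        pad[CACHE_LINE_SIZE];  /* keep the groups on separate lines */

        /* fields protected by victimbuf_lck */
        slock_t     victimbuf_lck;
        int         nextVictimBuffer;
        uint32      completePasses;
    } PaddedStrategyControl;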

(And yes, I realize these issues already affect the freelist.  Perhaps
that's part of the reason we have performance issues with it.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #11208: Refresh Materialized View Concurrently bug using user Postgres

2014-08-26 Thread Kevin Grittner
Kevin Grittner kgri...@ymail.com wrote:
 Kevin Grittner kgri...@ymail.com wrote:
  bemanuel...@gmail.com bemanuel...@gmail.com wrote:

  tjma_dw=> set role user_dw;

  tjma_dw=> CREATE TABLE foo_data AS SELECT i, md5(random()::text) FROM
  generate_series(1, 10) i;
  SELECT 10
  tjma_dw=> CREATE MATERIALIZED VIEW mv_foo AS SELECT * FROM foo_data;
  SELECT 10
  tjma_dw=> ALTER MATERIALIZED VIEW mv_foo OWNER TO user_dw;
  ALTER MATERIALIZED VIEW
  tjma_dw=> REFRESH MATERIALIZED VIEW mv_foo;
  REFRESH MATERIALIZED VIEW
  tjma_dw=> ALTER TABLE foo_data OWNER TO user_dw;
  ALTER TABLE
  tjma_dw=> REFRESH MATERIALIZED VIEW mv_foo;
  REFRESH MATERIALIZED VIEW
  tjma_dw=> create unique index on mv_foo (i);
  CREATE INDEX

  /pgsql/pg94/bin/psql -Upostgres -p 5434 tjma_dw

  tjma_dw=# refresh materialized view CONCURRENTLY mv_foo;
  ERROR:  permission denied for relation pg_temp_432971_2
  CONTEXT:  SQL statement "DELETE FROM public.mv_foo mv WHERE ctid
  OPERATOR(pg_catalog.=) ANY (SELECT diff.tid FROM
 pg_temp_10.pg_temp_432971_2
  diff WHERE diff.tid IS NOT NULL AND diff.newdata IS NULL)"

  Yeah, that's a bug

 Attached is my proposed fix.  I will push it in a day or two if there
 are no objections.

Done.  I think we will have a third beta release, which should
include this fix.

The master branch needed to be adjusted from the initially posted
patch because of changes there.  That version is attached.

Thanks for testing the beta and for the report!

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

refresh-concurrently-fix-v2.patch.gz
Description: application/tgz

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-08-26 Thread Amit Kapila
On Tue, Aug 26, 2014 at 8:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Amit Kapila amit.kapil...@gmail.com writes:
  Another point is I think it will be better to protect
  StrategyControl->completePasses with victimbuf_lck rather than
  freelist_lck, as when we are going to update it we will already be
  holding the victimbuf_lck and it doesn't make much sense to release
  the victimbuf_lck and reacquire freelist_lck to update it.

 I'm rather concerned by this cavalier assumption that we can protect
 fields a,b,c with one lock and fields x,y,z in the same struct with some
 other lock.

In some cases, it could be beneficial especially when a,b,c are
going to be more frequently accessed as compare to x,y,z.

 A minimum requirement for that to work safely at all is that the fields
 are of atomically fetchable/storable widths; which might be okay here
 but it's a restriction that bears thinking about (and documenting).

 But quite aside from safety, the fields are almost certainly going to
 be in the same cache line which means contention between processes that
 are trying to fetch or store them concurrently.

I think patch will reduce the contention for some of such variables
(which are accessed during clock sweep) as it will minimize the need
to perform clock sweep by backends.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] pg_dump refactor patch to remove global variables

2014-08-26 Thread Peter Eisentraut
On 8/15/14 7:30 PM, Joachim Wieland wrote:
 Attached is a patch that doesn't add any new functionality or
 features, all it does is get rid of the global variables that
 pg_dump.c is full of.

I'm getting a compiler error:

In file included from pg_dump.c:60:
In file included from ./pg_backup_archiver.h:32:
./pg_backup.h:212:3: error: redefinition of typedef 'DumpOptions' is a
C11 feature [-Werror,-Wtypedef-redefinition]
} DumpOptions;
  ^
./pg_dump.h:507:29: note: previous definition is here
typedef struct _dumpOptions DumpOptions;
^
1 error generated.
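
For reference, this is the usual "typedef repeated in two headers" problem:
C89/C99 forbid redefining a typedef even identically, and C11 relaxed that,
hence clang's wording.  A minimal sketch of the conventional fix (the field
is purely illustrative):

    /* exactly one header owns the typedef, e.g. pg_dump.h: */
    typedef struct _dumpOptions DumpOptions;

    /* the other header completes the struct through its tag only, instead
     * of writing "typedef struct _dumpOptions { ... } DumpOptions;" again: */
    struct _dumpOptions
    {
        int     dumpSections;   /* illustrative field only */
    };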



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] Per table autovacuum vacuum cost limit behaviour strange

2014-08-26 Thread Alvaro Herrera
Mark Kirkwood wrote:
 On 06/05/14 16:28, Amit Kapila wrote:

 On Mon, May 5, 2014 at 11:57 AM, Mark Kirkwood
 mark.kirkw...@catalyst.net.nz wrote:

 I could think of 2 ways to change this:
 
 a. if user has specified cost_limit value for table, then it just uses it
  rather than rebalancing based on value of system-wide guc variable
  autovacuum_vacuum_cost_limit
 b. another could be to restrict setting per-table value to be lesser than
  system-wide value?
 
 The former is used for auto vacuum parameters like scale_factor and
 later is used for parameters like freeze_max_age.
 
 Thoughts?
 
 Alvaro, do you think above options makes sense to solve this problem?

I've been giving some thought to this.  Really, there is no way to
handle this sensibly while at the same time keeping the documented
behavior -- or in other words, what we have documented is not useful
behavior.  Your option (b) above is an easy solution to the problem,
however it means that the user will have serious trouble configuring the
system in scenarios such as volatile tables, as Mark says -- essentially
that will foreclose the option of using autovacuum for them.

I'm not sure I like your (a) proposal much better.  One problem there is
that if you set the values for a table to be exactly the same values as
in postgresql.conf, it will behave completely different because it will
not participate in balancing.  To me this seems to violate POLA.

I checked Haribabu's latest patch in this thread, and didn't like it
much.  If you set up a table to have cost_delay=1000, it runs at that
speed when vacuumed alone; but if there are two workers, it goes at half
the speed even if the other one is configured with a very small
cost_delay (in essence, waste the allocated I/O bandwidth).  Three
workers, it goes at a third of the speed -- again, even if the other
tables are configured to go much slower than the volatile one.  This
seems too simplistic.  It might be okay when you have only one or two
very large or high-churn tables, and small numbers of workers, but it's
not unreasonable to think that you might have lots more workers if your
DB has many high-churn tables.


So my proposal is a bit more complicated.  First we introduce the notion
of a single number, to enable sorting and computations: the delay
equivalent, which is the cost_limit divided by cost_delay.  The higher
the value is for any table, the faster it is vacuumed.  (It makes sense
in physical terms: a higher cost_limit makes it faster, because vacuum
sleeps less often; and a higher cost_delay makes it go slower, because
vacuum sleeps for longer.)  Now, the critical issue is to notice that
not all tables are equal; they can be split into two groups: those that go
faster than the global delay equivalent
(i.e. the effective values of GUC variables
autovacuum_vacuum_cost_limit/autovacuum_vacuum_cost_delay), and those
that go equal or slower.  For the latter group, the rebalancing
algorithm distributes the allocated I/O by the global vars, in a
pro-rated manner.  For the former group (tables vacuumed faster than
global delay equiv), to rebalance we don't consider the global delay
equiv but the delay equiv of the fastest table currently being vacuumed.

Suppose we have two tables, delay_equiv=10 each (which is the default
value).  If they are both vacuumed in parallel, then we distribute a
delay_equiv of 5 to each (so set cost_limit=100, cost_delay=20).  As
soon as one of them finishes, the remaining one is allowed to upgrade to
delay_equiv=10 (cost_limit=200, cost_delay=20).

Now add a third table, delay_equiv=500 (cost_limit=10000, cost_delay=20;
this is Mark's volatile table).  If it's being vacuumed on its own, just
assign cost_limit=10000, cost_delay=20, as normal.  If one of the other
two tables are being vacuumed, that one will use delay_equiv=10, as per
above.  To balance the volatile table, we take the delay_equiv of this
one and subtract the already handed-out delay_equiv of 10; so we set the
volatile table to delay_equiv=490 (cost_limit=9800, cost_delay=20).

If we do it this way, the whole system is running at the full speed
enabled by the fastest table we have set the per-table options, but also
we have scaled things so that the slow tables go slow and the fast
tables go fast.

As a more elaborate example, add a fourth table with delay_equiv=50
(cost_limit=1000, cost_delay=20).  This is also faster than the global
vars, so we put it in the first group.  If all four tables are being
vacuumed in parallel, we have the two slow tables going at delay_equiv=5
each (cost_limit=100, cost_delay=20); then there are delay_equiv=490 to
distribute among the remaining ones; pro-rating this we have
delay_equiv=445 (cost_limit=8900, cost_delay=20) for the volatile table
and delay_equiv=45 (cost_limit=900, cost_delay=20) for the other one.

If one of the slowest tables finished vacuuming, the other one will
speed up to delay_equiv=10, and the two fastest ones will go on
unchanged.  
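
To make the arithmetic easy to check, here is a small self-contained sketch
of the pro-rating rule described above, reproducing the four-table example.
It is only a model of the proposal, not autovacuum code.

    #include <stdio.h>

    /*
     * delay_equiv = cost_limit / cost_delay.  Tables at or below the global
     * delay_equiv split the global budget pro-rata; tables above it split
     * (fastest - budget already handed to the slow group) pro-rata.
     */
    int
    main(void)
    {
        const double global = 200.0 / 20.0;         /* default GUCs: 10 */
        const double tables[] = {10, 10, 500, 50};  /* the example above */
        const int    ntab = 4;
        double       slow_sum = 0, fast_sum = 0, fastest = 0;

        for (int i = 0; i < ntab; i++)
        {
            if (tables[i] <= global)
                slow_sum += tables[i];
            else
            {
                fast_sum += tables[i];
                if (tables[i] > fastest)
                    fastest = tables[i];
            }
        }

        for (int i = 0; i < ntab; i++)
        {
            double  share;
            int     eq;

            if (tables[i] <= global)
                share = global * tables[i] / slow_sum;
            else
                share = (fastest - global) * tables[i] / fast_sum;

            eq = (int) (share + 0.5);
            printf("table %d: delay_equiv %g -> %d (cost_limit=%d, cost_delay=20)\n",
                   i + 1, tables[i], eq, eq * 20);
        }
        return 0;
    }

It prints 5, 5, 445 and 45, i.e. cost_limit 100, 100, 8900 and 900 at
cost_delay=20, matching the numbers above.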

Re: [HACKERS] Switch pg_basebackup to use -X stream instead of -X fetch by default?

2014-08-26 Thread Magnus Hagander
On Mon, Aug 25, 2014 at 1:35 PM, Andres Freund and...@2ndquadrant.com wrote:
 Hi,

 currently pg_basebackup uses fetch mode when only -x is specified -
 which imo isn't a very good thing to use due to the increased risk of
 not fetching everything.
 How about switching to stream mode for 9.5+?

I think the original reasons were to not change the default behaviour
with a new feature, and secondly because defaulting to -X requires two
replication connections rather than one.

I think the first reason is gone now, and the risk/damage of the two
connections is probably smaller than running out of WAL. -x is a good
default for smaller systems, but -X is a safer one for bigger ones. So
I agree that changing the default mode would make sense.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Switch pg_basebackup to use -X stream instead of -X fetch by default?

2014-08-26 Thread Andres Freund
On 2014-08-26 18:40:27 +0200, Magnus Hagander wrote:
 On Mon, Aug 25, 2014 at 1:35 PM, Andres Freund and...@2ndquadrant.com wrote:
  Hi,
 
  currently pg_basebackup uses fetch mode when only -x is specified -
  which imo isn't a very good thing to use due to the increased risk of
  not fetching everything.
  How about switching to stream mode for 9.5+?
 
 I think the original reasons were to not change the default behaviour
 with a new feature, and secondly because defaulting to -X requires two
 replication connections rather than one.

Right.

 I think the first reason is gone now, and the risk/damage of the two
 connections is probably smaller than running out of WAL.

Especially as that will fail pretty nearly immediately instead at the
end of the basebackup...

 -x is a good
 default for smaller systems, but -X is a safer one for bigger ones. So
 I agree that changing the default mode would make sense.

Cool.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Switch pg_basebackup to use -X stream instead of -X fetch by default?

2014-08-26 Thread Magnus Hagander
On Tue, Aug 26, 2014 at 6:51 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-08-26 18:40:27 +0200, Magnus Hagander wrote:
 On Mon, Aug 25, 2014 at 1:35 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  Hi,
 
  currently pg_basebackup uses fetch mode when only -x is specified -
  which imo isn't a very good thing to use due to the increased risk of
  not fetching everything.
  How about switching to stream mode for 9.5+?

 I think the original reasons were to not change the default behaviour
 with a new feature, and secondly because defaulting to -X requires two
 replication connections rather than one.

 Right.

 I think the first reason is gone now, and the risk/damage of the two
 connections is probably smaller than running out of WAL.

 Especially as that will fail pretty nearly immediately instead at the
 end of the basebackup...

Yeah, I don't think the problem was actually pg_basebackup failing as
much as pg_basebackup getting in the way of regular replication
standbys. Which I think is also a smaller problem now, given that it's
a more common thing to do backups through replication protocol.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for CSN based snapshots

2014-08-26 Thread Jeff Davis
On Tue, 2014-08-26 at 13:45 +0300, Heikki Linnakangas wrote:
 My current thinking is that access to the csnlog will need to be made 
 faster. Currently, it's just another SLRU, but I'm sure we can do better 
 than that. Or add a backend-private cache on top of it; that might 
 already alleviate it enough..

Aren't those the same ideas that have been proposed for eliminating hint
bits?

If this patch makes taking snapshots much faster, then perhaps it's
enough incentive to follow through on one of those ideas. I am certainly
open to removing hint bits (which was probably clear from earlier
discussions) if there are significant wins elsewhere and the penalty is
not too great.

To continue the discussion on this patch, let's assume that we make
TransactionIdGetCommitLSN sufficiently fast. If that's the case, can't
we make HeapTupleSatisfiesMVCC look more like:

  LsnMin = TransactionIdGetCommitLSN(xmin);
  if (!COMMITLSN_IS_COMMITTED(LsnMin))
     LsnMin = BIG_LSN;

  LsnMax = TransactionIdGetCommitLSN(xmax);
  if (!COMMITLSN_IS_COMMITTED(LsnMax))
     LsnMax = BIG_LSN;

  return (snapshot->snapshotlsn >= LsnMin &&
          snapshot->snapshotlsn < LsnMax);

There would need to be some handling for locked tuples, or tuples
related to the current transaction, of course. But I still think it
would turn out simpler; perhaps by enough to save a few cycles.

Regards,
Jeff Davis






-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for CSN based snapshots

2014-08-26 Thread Greg Stark
On Tue, Aug 26, 2014 at 11:45 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 It appears that this patch weakens the idea of hint bits. Even if
 HEAP_XMIN_COMMITTED is set, it still needs to find out if it's in the
 snapshot.

 That's fast if the xid is less than snap->xmin, but otherwise it needs
 to do a fetch from the csnlog, which is exactly what the hint bits are
 designed to avoid. And we can't get around this, because the whole point
 of this patch is to remove the xip array from the snapshot.


 Yeah. This patch in the current state is likely much much slower than
 unpatched master, except in extreme cases where you have thousands of
 connections and short transactions so that without the patch, you spend most
 of the time acquiring snapshots.


Interesting analysis.

I suppose the equivalent of hint bits would be to actually write the
CSN of the transaction into the record when the hint bit is set.

I don't immediately see how to make that practical. One thought would
be to have a list of xids in the page header with their corresponding
csn -- which starts to sound a lot like Oracle's Interested
Transaction List. But I don't see how to make that work for the
hundreds of possible xids on the page.

The worst case for visibility resolution is you have a narrow table
that has random access DML happening all the time, each update is a
short transaction and there are a very high rate of such transactions
spread out uniformly over a very large table. That means any given
page has over 200 rows with random xids spread over a very large range
of xids.

Currently the invariant hint bits give us is that each xid needs to be
looked up in the clog only a more or less fixed number of times, in
that scenario only once since the table is very large and the
transactions short lived.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Compute attr_needed for child relations (was Re: [HACKERS] inherit support for foreign tables)

2014-08-26 Thread Tom Lane
Etsuro Fujita fujita.ets...@lab.ntt.co.jp writes:
 [ attr_needed-v4.patch ]

I looked this over, and TBH I'm rather disappointed.  The patch adds
150 lines of dubiously-correct code in order to save ... uh, well,
actually it *adds* code, because the places that are supposedly getting
a benefit are changed like this:

*** 799,806  check_selective_binary_conversion(RelOptInfo *baserel,
}
  
/* Collect all the attributes needed for joins or final output. */
!   pull_varattnos((Node *) baserel->reltargetlist, baserel->relid,
!  attrs_used);
  
/* Add all the attributes used by restriction clauses. */
foreach(lc, baserel->baserestrictinfo)
--- 799,810 
}
  
/* Collect all the attributes needed for joins or final output. */
!   for (i = baserel->min_attr; i <= baserel->max_attr; i++)
!   {
!   if (!bms_is_empty(baserel->attr_needed[i - baserel->min_attr]))
!   attrs_used = bms_add_member(attrs_used,
!   i - FirstLowInvalidHeapAttributeNumber);
!   }
  
/* Add all the attributes used by restriction clauses. */
foreach(lc, baserel->baserestrictinfo)

That's not simpler, it's not easier to understand, and it's probably not
faster either.  We could address some of those complaints by encapsulating
the above loop into a utility function, but still, I come away with the
feeling that it's not worth changing this.  Considering that all the
places that are doing this then proceed to use pull_varattnos to add on
attnos from the restriction clauses, it seems like using pull_varattnos
on the reltargetlist isn't such a bad thing after all.

So I'm inclined to reject this.  It seemed like a good idea in the
abstract, but the concrete result isn't very attractive, and doesn't
seem like an improvement over what we have.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-26 Thread Josh Berkus
On 08/26/2014 07:51 AM, Tom Lane wrote:
 My feeling about it at this point is that the apparent speed gain from
 using offsets is illusory: in practically all real-world cases where there
 are enough keys or array elements for it to matter, costs associated with
 compression (or rather failure to compress) will dominate any savings we
 get from offset-assisted lookups.  I agree that the evidence for this
 opinion is pretty thin ... but the evidence against it is nonexistent.

Well, I have shown one test case which shows where lengths is a net
penalty.  However, for that to be the case, you have to have the
following conditions *all* be true:

* lots of top-level keys
* short values
* rows which are on the borderline for TOAST
* table which fits in RAM

... so that's a special case and if it's sub-optimal, no biggie.  Also,
it's not like it's an order-of-magnitude slower.

Anyway,  I called for feedback on my blog, and have gotten some:

http://www.databasesoup.com/2014/08/the-great-jsonb-tradeoff.html

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-26 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 Anyway,  I called for feedback on my blog, and have gotten some:
 http://www.databasesoup.com/2014/08/the-great-jsonb-tradeoff.html

I was hoping you'd get some useful data from that, but so far it seems
like a rehash of points made in the on-list thread :-(

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-26 Thread Josh Berkus
On 08/26/2014 11:40 AM, Tom Lane wrote:
 Josh Berkus j...@agliodbs.com writes:
 Anyway,  I called for feedback on my blog, and have gotten some:
 http://www.databasesoup.com/2014/08/the-great-jsonb-tradeoff.html
 
 I was hoping you'd get some useful data from that, but so far it seems
 like a rehash of points made in the on-list thread :-(
 
   regards, tom lane

yah, me too. :-(

Unfortunately even the outside commenters don't seem to understand that
storage size *is* related to speed; it's exchanging I/O speed for CPU speed.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-26 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 On 08/26/2014 11:40 AM, Tom Lane wrote:
 I was hoping you'd get some useful data from that, but so far it seems
 like a rehash of points made in the on-list thread :-(

 Unfortunately even the outside commenters don't seem to understand that
 storage size *is* related to speed; it's exchanging I/O speed for CPU speed.

Yeah, exactly.  Given current hardware trends, data compression is
becoming more of a win not less as time goes on: CPU cycles are cheap
even compared to main memory access, let alone mass storage.  So I'm
thinking we want to adopt a compression-friendly data format even if
it measures out as a small loss currently.

I wish it were cache-friendly too, per the upthread tangent about having
to fetch keys from all over the place within a large JSON object.

... and while I was typing that sentence, lightning struck.  The existing
arrangement of object subfields with keys and values interleaved is just
plain dumb.  We should rearrange that as all the keys in order, then all
the values in the same order.  Then the keys are naturally adjacent in
memory and object-key searches become much more cache-friendly: you
probably touch most of the key portion of the object, but none of the
values portion, until you know exactly what part of the latter to fetch.
This approach might complicate the lookup logic marginally but I bet not
very much; and it will be a huge help if we ever want to do smart access
to EXTERNAL (non-compressed) JSON values.
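
As a toy illustration of why that helps (this is not the jsonb code; plain
arrays stand in for the real JEntry machinery), a key search walks only the
packed key region and touches the value region once, after the match:

    #include <stdint.h>
    #include <string.h>

    typedef struct
    {
        int             nkeys;
        const uint32_t *keylen;   /* nkeys key lengths, then value lengths */
        const char     *data;     /* all key bytes first, then all values */
    } ToyObject;

    /* Returns the key's index, or -1; only the key region is scanned. */
    static int
    toy_find_key(const ToyObject *obj, const char *key, size_t keysz)
    {
        const char *kp = obj->data;
        int         i;

        for (i = 0; i < obj->nkeys; i++)
        {
            if (obj->keylen[i] == keysz && memcmp(kp, key, keysz) == 0)
                return i;         /* caller now knows which value to fetch */
            kp += obj->keylen[i];
        }
        return -1;
    }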

I will go prototype that just to see how much code rearrangement is
required.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-26 Thread Andres Freund
On 2014-08-26 15:01:27 -0400, Tom Lane wrote:
 Josh Berkus j...@agliodbs.com writes:
  On 08/26/2014 11:40 AM, Tom Lane wrote:
  I was hoping you'd get some useful data from that, but so far it seems
  like a rehash of points made in the on-list thread :-(
 
  Unfortunately even the outside commenters don't seem to understand that
  storage size *is* related to speed; it's exchanging I/O speed for CPU speed.
 
 Yeah, exactly.  Given current hardware trends, data compression is
 becoming more of a win not less as time goes on: CPU cycles are cheap
 even compared to main memory access, let alone mass storage.  So I'm
 thinking we want to adopt a compression-friendly data format even if
 it measures out as a small loss currently.

On the other hand the majority of databases these days fit into main
memory due to its increasing sizes and postgres is more often CPU than
IO bound.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-26 Thread Laurence Rowe
On 26 August 2014 11:34, Josh Berkus j...@agliodbs.com wrote:

 On 08/26/2014 07:51 AM, Tom Lane wrote:
  My feeling about it at this point is that the apparent speed gain from
  using offsets is illusory: in practically all real-world cases where
 there
  are enough keys or array elements for it to matter, costs associated with
  compression (or rather failure to compress) will dominate any savings we
  get from offset-assisted lookups.  I agree that the evidence for this
  opinion is pretty thin ... but the evidence against it is nonexistent.

 Well, I have shown one test case which shows where lengths is a net
 penalty.  However, for that to be the case, you have to have the
 following conditions *all* be true:

 * lots of top-level keys
 * short values
 * rows which are on the borderline for TOAST
 * table which fits in RAM

 ... so that's a special case and if it's sub-optimal, no bigee.  Also,
 it's not like it's an order-of-magnitude slower.

  Anyway,  I called for feedback on my blog, and have gotten some:

 http://www.databasesoup.com/2014/08/the-great-jsonb-tradeoff.html


It would be really interesting to see your results with column STORAGE
EXTERNAL for that benchmark. I think it is important to separate out the
slowdown due to decompression now being needed vs that inherent in the new
format; we can always switch off compression on a per-column basis using
STORAGE EXTERNAL.


My JSON data has smallish objects with a small number of keys; it barely
compresses at all with the patch and shows similar results to Arthur's
data. Across ~500K rows I get:

encoded=# select count(properties->'submitted_by') from compressed;
 count

 431948
(1 row)

Time: 250.512 ms

encoded=# select count(properties->'submitted_by') from uncompressed;
 count

 431948
(1 row)

Time: 218.552 ms


Laurence


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-26 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-08-26 15:01:27 -0400, Tom Lane wrote:
 Yeah, exactly.  Given current hardware trends, data compression is
 becoming more of a win not less as time goes on: CPU cycles are cheap
 even compared to main memory access, let alone mass storage.  So I'm
 thinking we want to adopt a compression-friendly data format even if
 it measures out as a small loss currently.

  On the other hand the majority of databases these days fit into main
 memory due to its increasing sizes and postgres is more often CPU than
 IO bound.

Well, better data compression helps make that true ;-).  And don't forget
cache effects; actual main memory is considered slow these days.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-26 Thread Claudio Freire
On Tue, Aug 26, 2014 at 4:01 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Josh Berkus j...@agliodbs.com writes:
 On 08/26/2014 11:40 AM, Tom Lane wrote:
 I was hoping you'd get some useful data from that, but so far it seems
 like a rehash of points made in the on-list thread :-(

  Unfortunately even the outside commenters don't seem to understand that
  storage size *is* related to speed; it's exchanging I/O speed for CPU speed.

 Yeah, exactly.  Given current hardware trends, data compression is
 becoming more of a win not less as time goes on: CPU cycles are cheap
 even compared to main memory access, let alone mass storage.  So I'm
 thinking we want to adopt a compression-friendly data format even if
 it measures out as a small loss currently.

 I wish it were cache-friendly too, per the upthread tangent about having
 to fetch keys from all over the place within a large JSON object.


What about my earlier proposal?

An in-memory compressed representation would greatly help cache
locality, more so if you pack keys as you mentioned.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-26 Thread Andres Freund
On 2014-08-26 15:17:13 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2014-08-26 15:01:27 -0400, Tom Lane wrote:
  Yeah, exactly.  Given current hardware trends, data compression is
  becoming more of a win not less as time goes on: CPU cycles are cheap
  even compared to main memory access, let alone mass storage.  So I'm
  thinking we want to adopt a compression-friendly data format even if
  it measures out as a small loss currently.
 
   On the other hand the majority of databases these days fit into main
  memory due to its increasing sizes and postgres is more often CPU than
  IO bound.
 
 Well, better data compression helps make that true ;-).

People disable toast compression though because it results in better
performance :(. Part of that could be fixed by a faster compression
method, part of it by decompressing less often. But still.

 And don't forget cache effects; actual main memory is considered slow
 these days.

Right. But that plays the other way round too. Compressed datums need to
be copied to be accessed uncompressed. Whereas at least in comparison to
inline compressed datums that's not necessary.

Anyway, that's just to say that I don't really agree that CPU overhead
is a worthy price to pay for storage efficiency if the gains are small.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for CSN based snapshots

2014-08-26 Thread Jeff Davis
On Tue, 2014-08-26 at 19:25 +0100, Greg Stark wrote:
 I don't immediately see how to make that practical. One thought would
 be to have a list of xids in the page header with their corresponding
 csn -- which starts to sound a lot like Oracle's Interested
 Transaction List. But I don't see how to make that work for the
 hundreds of possible xids on the page.

I feel like that's moving in the wrong direction. That's still causing a
lot of modifications to a data page when the data is not changing, and
that's bad for a lot of cases that I'm interested in (checksums are one
example).

We are mixing two kinds of data: user data and visibility information.
Each is changed under different circumstances and has different
characteristics, and I'm beginning to think they shouldn't be mixed at
all.

What if we just devised a structure specially designed to hold
visibility information, put all of the visibility information there, and
data pages would only change where there is a real, user-initiated
I/U/D. Vacuum could still clear out dead tuples from the data area, but
it would do the rest of its work on the visibility structure. It could
even be a clever structure that could compress away large static areas
until they become active again.

Maybe this wouldn't work for all tables, but could be an option for big
tables with low update rates.

 The worst case for visibility resolution is you have a narrow table
 that has random access DML happening all the time, each update is a
 short transaction and there are a very high rate of such transactions
 spread out uniformly over a very large table. That means any given
 page has over 200 rows with random xids spread over a very large range
 of xids.

That's not necessarily a bad case, unless the CLOG/CSNLOG lookup is a
significant fraction of the effort to update a tuple.

That would be a bad case if you introduce scans into the equation as
well, but that's not a problem if the all-visible bit is set.

 Currently the invariant hint bits give us is that each xid needs to be
 looked up in the clog only a more or less fixed number of times, in
 that scenario only once since the table is very large and the
 transactions short lived.

A backend-local cache might accomplish that, as well (would still need
to do a lookup, but no locks or contention). There would be some
challenges around invalidation (for xid wraparound) and pre-warming the
cache (so establishing a lot of connections doesn't cause a lot of CLOG
access).
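
A minimal sketch of such a backend-local cache, direct-mapped and sized
arbitrarily; only final statuses are cached, and the invalidation and
pre-warming issues mentioned above are exactly what it does not solve:

    #include <stdbool.h>
    #include <stdint.h>

    #define XID_CACHE_SLOTS 1024            /* power of two, arbitrary */

    typedef struct
    {
        uint32_t    xid;                    /* 0 = empty slot */
        uint8_t     status;                 /* committed or aborted only */
    } XidCacheEntry;

    static XidCacheEntry xid_cache[XID_CACHE_SLOTS];

    /* On a hit, fill *status and skip the shared CLOG/CSN-log lookup. */
    static bool
    xid_cache_lookup(uint32_t xid, uint8_t *status)
    {
        XidCacheEntry *e = &xid_cache[xid & (XID_CACHE_SLOTS - 1)];

        if (e->xid == xid)
        {
            *status = e->status;
            return true;
        }
        return false;
    }

    /* Called after a real lookup, only for transactions that have ended. */
    static void
    xid_cache_insert(uint32_t xid, uint8_t status)
    {
        XidCacheEntry *e = &xid_cache[xid & (XID_CACHE_SLOTS - 1)];

        e->xid = xid;
        e->status = status;
    }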

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-26 Thread Peter Geoghegan
On Tue, Aug 26, 2014 at 12:27 PM, Andres Freund and...@2ndquadrant.com wrote:
 Anyway, that's just to say that I don't really agree that CPU overhead
 is a worthy price to pay for storage efficiency if the gains are small.

+1

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New Model For Role Attributes and Fine Grained Permssions

2014-08-26 Thread Robert Haas
On Thu, Aug 21, 2014 at 10:49 PM, Stephen Frost sfr...@snowman.net wrote:
 * Heikki Linnakangas (hlinnakan...@vmware.com) wrote:
 On 08/19/2014 04:27 AM, Brightwell, Adam wrote:
 This is a proof-of-concept patch for a new model around role attributes
 and fine grained permissions meant to alleviate the current over dependence
 on superuser.

 Hmm. How does this get us any closer to fine-grained permissions?

 This patch, by itself, does not- but it adds the structure to allow us
 to add more permissions without having to add more fields to pg_authid,
 and we could build into pg_permission the WITH ADMIN OPTION and
 grantor bits that the normal GRANT syntax already supports- but
 without having to chew up additional bits for these granted permissions
 which are applied to roles instead of objects in the system.

 As for specific examples where this could be used beyond those
 presented, consider things like:

 CREATE EXTENSION
 CREATE TABLESPACE
 COPY (server-side)

I'm not opposed to this in theory, but I agree with Heikki that the
syntax you've picked (as well as the internal terminology of the patch)
is not sufficiently unambiguous.  "Permission" could refer to a lot
of things, and the GRANT syntax does a lot of things already.  Since
the patch appears to aim to supplant what we current do with ALTER
ROLE .. [NO] {CREATEDB | CREATEROLE }, how about something like:

ALTER ROLE role_name { GRANT | REVOKE } CAPABILITY capability_name;

In terms of catalog structure, I doubt that a
row-per-capability-per-user makes sense.  Why not just add a new
rolcapability column to pg_authid?  A single int64 interpreted as a
bitmask would give us room for 64 capabilities at a fraction of the
storage and lookup cost of a separate catalog.  If that's not enough,
the column could be stored as an integer array or vector or something,
but a bigger problem is that Tom's head will likely explode if you
tell him you're going to have more than 64 of these things.
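
A sketch of the bitmask idea, with capability names and bit positions
invented purely for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    #define CAP_CREATEDB            (UINT64_C(1) << 0)
    #define CAP_CREATEROLE          (UINT64_C(1) << 1)
    #define CAP_CREATE_EXTENSION    (UINT64_C(1) << 2)
    #define CAP_CREATE_TABLESPACE   (UINT64_C(1) << 3)
    #define CAP_SERVER_SIDE_COPY    (UINT64_C(1) << 4)

    static bool
    role_has_capability(uint64_t rolcapability, uint64_t cap)
    {
        return (rolcapability & cap) == cap;
    }

    /* GRANT sets bits, REVOKE clears them. */
    static uint64_t
    grant_capability(uint64_t rolcapability, uint64_t cap)
    {
        return rolcapability | cap;
    }

    static uint64_t
    revoke_capability(uint64_t rolcapability, uint64_t cap)
    {
        return rolcapability & ~cap;
    }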

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Memory-bounded HashAgg

2014-08-26 Thread Jeff Davis
On Tue, 2014-08-26 at 12:39 +0300, Heikki Linnakangas wrote:
 I think this is enough for this commitfest - we have consensus on the 
 design. For the next one, please address those open items, and resubmit.

Agreed, return with feedback.

I need to get the accounting patch in first, which needs to address some
performance issues, but there's a chance of wrapping those up quickly.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-26 Thread Josh Berkus
On 08/26/2014 12:27 PM, Andres Freund wrote:
 Anyway, that's just to say that I don't really agree that CPU overhead
 is a worthy price to pay for storage efficiency if the gains are small.

But in this case the gains aren't small; we're talking up to 60% smaller
storage.

Testing STORAGE EXTERNAL soon.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-08-26 Thread Fabien COELHO


Hello Jeff,


The culprit I found is bgwriter, which is basically doing nothing to
prevent the coming checkpoint IO storm, even though there would be ample
time to write the accumulating dirty pages so that checkpoint would find a
clean field and pass in a blink. Indeed, at the end of the 500 seconds
throttled test, pg_stat_bgwriter says:


Are you doing pg_stat_reset_shared('bgwriter') after running pgbench -i?


Yes, I did.


You don't want your steady state stats polluted by the bulk load.


Sure!


  buffers_checkpoint = 19046
  buffers_clean = 2995


Out of curiosity, what does buffers_backend show?


   buffers_backend = 157


In any event, this almost certainly is a red herring.


Possibly! It is pretty easy to reproduce... can anyone try?

Whichever of the three ways are being used to write out the buffers, it 
is the checkpointer that is responsible for fsyncing them, and that is 
where your drama is almost certainly occurring. Writing out with one 
path rather than another isn't going to change things, unless you 
change the fsync.


Well, ISTM that the OS does not need to wait for fsync to start writing 
pages if it has received one minute of buffer writes at 50 writes per 
second; some scheduler should start handling the flow somewhere... So if 
bgwriter were to write the buffers it would help, but maybe there is a 
better way.



Also, are you familiar with checkpoint_completion_target, and what is it
set to?


The default 0.5. Moving to 0.9 seems to worsen the situation.

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-08-26 Thread Robert Haas
On Fri, Aug 22, 2014 at 2:46 PM, Peter Geoghegan p...@heroku.com wrote:
 On Fri, Aug 22, 2014 at 7:19 AM, Robert Haas robertmh...@gmail.com wrote:
 Patch 0002 no longer applies; please rebase.

 I attach rebased patch.

 Note that there is currently a bug in the master branch:

 +   if (len2 >= tss->buflen2)
 +   {
 +   pfree(tss->buf2);
 +   tss->buflen1 = Max(len2 + 1, Min(tss->buflen2 * 2, MaxAllocSize));
 +   tss->buf2 = MemoryContextAlloc(ssup->ssup_cxt, tss->buflen2);
 +   }

You didn't say what the bug is; after inspection, I believe it's that
line 4 begins with tss->buflen1 rather than tss->buflen2.
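
In other words, the committed fix presumably just makes that line update the
right field:

 +   tss->buflen2 = Max(len2 + 1, Min(tss->buflen2 * 2, MaxAllocSize));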

I have committed a fix for that problem.  Let me know if I missed
something else.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for CSN based snapshots

2014-08-26 Thread Greg Stark
On Tue, Aug 26, 2014 at 8:32 PM, Jeff Davis pg...@j-davis.com wrote:
 We are mixing two kinds of data: user data and visibility information.
 Each is changed under different circumstances and has different
 characteristics, and I'm beginning to think they shouldn't be mixed at
 all.

 What if we just devised a structure specially designed to hold
 visibility information, put all of the visibility information there, and
 data pages would only change where there is a real, user-initiated
 I/U/D. Vacuum could still clear out dead tuples from the data area, but
 it would do the rest of its work on the visibility structure. It could
 even be a clever structure that could compress away large static areas
 until they become active again.

Well fundamentally the reason the visibility information is with the
user data is so that we don't need to do two i/os to access the data.
The whole point of hint bits is to guarantee that after some amount of
time you can read data directly out of the heap page and use it
without doing any additional I/O.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-08-26 Thread Peter Geoghegan
On Tue, Aug 26, 2014 at 12:59 PM, Robert Haas robertmh...@gmail.com wrote:
 I have committed a fix for that problem.  Let me know if I missed
 something else.

Yes, that's all I meant.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.5] Custom Plan API

2014-08-26 Thread Robert Haas
On Fri, Aug 22, 2014 at 9:48 PM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 One thing I was pointed out, it is the reason why I implemented
 DDL support, is that intermediation of c-language function also
 loads the extension module implicitly. It is an advantage, but
 not sure whether it shall be supported from the beginning.

That is definitely an advantage of the DDL-based approach, but I think
it's too much extra machinery for not enough real advantage.  Sounds
like we all agree, so ...

 I'd like to follow this direction, and start stripping the DDL support.

...please make it so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for CSN based snapshots

2014-08-26 Thread Jeff Davis
On Tue, 2014-08-26 at 21:00 +0100, Greg Stark wrote:
 Well fundamentally the reason the visibility information is with the
 user data is so that we don't need to do two i/os to access the data.
 The whole point of hint bits is to guarantee that after some amount of
 time you can read data directly out of the heap page and use it
 without doing any additional I/O.

If the data is that static, then the visibility information would be
highly compressible, and surely in shared_buffers already.

(Yes, it would need to be pinned, which has a cost.)

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-08-26 Thread Robert Haas
On Sat, Aug 23, 2014 at 6:39 PM, Greg Stark st...@mit.edu wrote:
 On Sat, Aug 23, 2014 at 9:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Ah.  Okay, but then what's wrong with the original proposal of use ceil()
 instead of floor()?  Basically I think the idea of treating fractions
 less than one differently from fractions greater than one is bogus; nobody
 will ever find that intuitive.

 Or make it an error to specify a value that rounds to 0 but isn't 0.

I liked David Johnston's even stronger suggestion upthread: make it an
error to specify a value that requires rounding of any kind.  In other
words, if the minimum granularity is 1 minute, you can specify that as
60 seconds instead, but if you write 59 seconds, we error out.  Maybe
that seems pedantic, but I don't think users will much appreciate the
discovery that 30 seconds means 60 seconds.  They'll be happier to be
told that up front than having to work it out afterward.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-08-26 Thread Andres Freund
On 2014-08-26 16:16:32 -0400, Robert Haas wrote:
 On Sat, Aug 23, 2014 at 6:39 PM, Greg Stark st...@mit.edu wrote:
  On Sat, Aug 23, 2014 at 9:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 
  Ah.  Okay, but then what's wrong with the original proposal of use ceil()
  instead of floor()?  Basically I think the idea of treating fractions
  less than one differently from fractions greater than one is bogus; nobody
  will ever find that intuitive.
 
  Or make it an error to specify a value that rounds to 0 but isn't 0.
 
 I liked David Johnston's even stronger suggestion upthread: make it an
 error to specify a value that requires rounding of any kind.  In other
 words, if the minimum granularity is 1 minute, you can specify that as
 60 seconds instead, but if you write 59 seconds, we error out.  Maybe
 that seems pedantic, but I don't think users will much appreciate the
 discovery that 30 seconds means 60 seconds.  They'll be happier to be
 told that up front than having to work it out afterward.

Is the whole topic actually practically relevant? Afaics there are no
GUCs with a time unit bigger than GUC_UNIT_S except log_rotation_age
where it surely doesn't matter and no space unit bigger than
GUC_UNIT_BLOCKS/GUC_UNIT_XBLOCKS.
In theory I agree with you/$subject, but I don't really see this worth
much effort.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Switch pg_basebackup to use -X stream instead of -X fetch by default?

2014-08-26 Thread Peter Eisentraut
On 8/26/14 12:40 PM, Magnus Hagander wrote:
 I think the first reason is gone now, and the risk/damage of the two
 connections is probably smaller than running out of WAL. -x is a good
 default for smaller systems, but -X is a safer one for bigger ones. So
 I agree that changing the default mode would make sense.

I would seriously consider just removing one of the modes.  Having two
modes is complex enough, and then having different defaults in different
versions, and fuzzy recommendations like, it's better for smaller
systems, it's quite confusing.

I don't think it's a fundamental problem to say, you need 2 connections
to use this feature.  (For example, you need a second connection to
issue a cancel request.  Nobody has ever complained about that.)



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Switch pg_basebackup to use -X stream instead of -X fetch by default?

2014-08-26 Thread Andres Freund
On 2014-08-26 16:41:44 -0400, Peter Eisentraut wrote:
 On 8/26/14 12:40 PM, Magnus Hagander wrote:
  I think the first reason is gone now, and the risk/damage of the two
  connections is probably smaller than running out of WAL. -x is a good
  default for smaller systems, but -X is a safer one for bigger ones. So
  I agree that changing the default mode would make sense.
 
 I would seriously consider just removing one of the modes.  Having two
 modes is complex enough, and then having different defaults in different
 versions, and fuzzy recommendations like, it's better for smaller
 systems, it's quite confusing.

Happy with removing the option and just accepting -X for backward
compat.

 I don't think it's a fundamental problem to say, you need 2 connections
 to use this feature.  (For example, you need a second connection to
 issue a cancel request.  Nobody has ever complained about that.)

Well, replication connections are more limited in number than normal
connections... And cancel requests are very short lived.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-08-26 Thread Peter Eisentraut
On 8/23/14 6:39 PM, Greg Stark wrote:
 Or make it an error to specify a value that rounds to 0 but isn't 0.

That's what I would prefer.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-08-26 Thread Peter Eisentraut
On 8/26/14 4:22 PM, Andres Freund wrote:
 Is the whole topic actually practically relevant?

It's clearly not all that important, or otherwise we'd have heard about it
before now.

I suppose someone could do something like

wal_receiver_status_interval = 10ms

and end up silently turning the whole thing off instead of making it
very aggressive.

The mistake here is that the mathematically appropriate turn-off value
in this and similar cases is infinity, not zero.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Missing comment block at the top of streamutil.h and receivelog.h

2014-08-26 Thread Robert Haas
On Sat, Aug 23, 2014 at 11:59 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 As mentioned in $subject, the header files in src/bin/pg_basebackup do
 not have a comment block at the top and do not have any copyright
 text.
 Any reason for that? Shouldn't we have something for consistency with
 the other files like in the patch attached?

Probably that's a good idea, but do we really need Author: tags?  I
know we have those in a few places, but certainly not everywhere, and
as time goes by they tend to be less accurate reflections of who wrote
the latest code (as opposed to the original code).  Furthermore, every
time we include them, it tends to increase the demand to add even more
of them because, hey, everybody likes to be acknowledged.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Missing comment block at the top of streamutil.h and receivelog.h

2014-08-26 Thread Magnus Hagander
On Tue, Aug 26, 2014 at 11:03 PM, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Aug 23, 2014 at 11:59 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 As mentioned in $subject, the header files in src/bin/pg_basebackup do
 not have a comment block at the top and do not have any copyright
 text.
 Any reason for that? Shouldn't we have something for consistency with
 the other files like in the patch attached?

 Probably that's a good idea, but do we really need Author: tags?  I
 know we have those in a few places, but certainly not everywhere, and
 as time goes by they tend to be less accurate reflections of who wrote
 the latest code (as opposed to the original code).  Furthermore, every
 time we include them, it tends to increase the demand to add even more
 of them because, hey, everybody likes to be acknowledged.

Given that I'm the one named in it - nah, just drop it.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Switch pg_basebackup to use -X stream instead of -X fetch by default?

2014-08-26 Thread Magnus Hagander
On Tue, Aug 26, 2014 at 10:46 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-08-26 16:41:44 -0400, Peter Eisentraut wrote:
 On 8/26/14 12:40 PM, Magnus Hagander wrote:
  I think the first reason is gone now, and the risk/damage of the two
  connections is probably smaller than running out of WAL. -x is a good
  default for smaller systems, but -X is a safer one for bigger ones. So
  I agree that changing the default mode would make sense.

 I would seriously consider just removing one of the modes.  Having two
 modes is complex enough, and then having different defaults in different
 versions, and fuzzy recommendations like, it's better for smaller
 systems, it's quite confusing.

 Happy with removing the option and just accepting -X for backward
 compat.

Works for me - this is really the cleaner way of doing it...

If we do that, perhaps we should backpatch a deprecation notice into
the 9.4 docs?


 I don't think it's a fundamental problem to say, you need 2 connections
 to use this feature.  (For example, you need a second connection to
 issue a cancel request.  Nobody has ever complained about that.)

 Well, replication connections are more limited in number than normal
 connections... And cancel requests are very short lived.

Yeah. But as long as we document it clearly, we should be OK I think.
And it's fairly clearly documented now - just need to be sure not to
remove that when changing the -x stuff.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-08-26 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I liked David Johnston's even stronger suggestion upthread: make it an
 error to specify a value that requires rounding of any kind.  In other
 words, if the minimum granularity is 1 minute, you can specify that as
 60 seconds instead, but if you write 59 seconds, we error out.  Maybe
 that seems pedantic, but I don't think users will much appreciate the
 discovery that 30 seconds means 60 seconds.  They'll be happier to be
 told that up front than having to work it out afterward.

I think this is totally wrong.  The entire point of the GUC units system,
or at least a large part of the point, is that users should not have to
be intimately aware of the units in which a given value is measured
internally.  And that in turn means that the units had better be such
that users won't find them overly coarse.  If it matters a lot whether
59 seconds gets rounded to 60, then we didn't make a good choice of units
for the GUC in question; and we should fix that choice, not mess with the
rounding rules.

The case where this argument falls down is for special values, such as
where zero means something quite different from the smallest nonzero
value.  Peter suggested upthread that we should redefine any GUC values
for which that is true, but (a) I think that loses on backwards
compatibility grounds, and (b) ISTM zero is probably always special to
some extent.  A zero time delay for example is not likely to work.

Maybe we should leave the rounding behavior alone (there's not much
evidence that rounding in one direction is worse than another; although
I'd also be okay with changing to round-to-nearest), and confine ourselves
to throwing an error for the single case that an apparently nonzero input
value is truncated/rounded to zero as a result of units conversion.
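
For concreteness, a minimal standalone sketch of that rule (toy code, not
PostgreSQL's actual GUC-parsing code; the function name and the
seconds-to-minutes example are invented for illustration): keep the current
round-down conversion, but fail when a nonzero input would end up as zero.

#include <math.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Toy illustration: convert a value given in seconds to a setting whose
 * native unit is minutes, keeping round-down behavior, and reject only
 * inputs that are nonzero yet would truncate to zero.
 */
static bool
convert_s_to_min(double seconds, int *minutes_out)
{
	int			minutes = (int) floor(seconds / 60.0);

	if (minutes == 0 && seconds > 0)
		return false;			/* nonzero input would become zero: error out */
	*minutes_out = minutes;
	return true;
}

int
main(void)
{
	int			m;

	printf("120 s -> %s\n", convert_s_to_min(120, &m) ? "ok" : "error");	/* ok: 2 min */
	printf("59 s  -> %s\n", convert_s_to_min(59, &m) ? "ok" : "error");		/* error */
	return 0;
}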

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-08-26 Thread David G Johnston
Tom Lane-2 wrote:
 Robert Haas <robertmhaas@> writes:
 I liked David Johnston's even stronger suggestion upthread: make it an
 error to specify a value that requires rounding of any kind.  In other
 words, if the minimum granularity is 1 minute, you can specify that as
 60 seconds instead, but if you write 59 seconds, we error out.  Maybe
 that seems pedantic, but I don't think users will much appreciate the
 discovery that 30 seconds means 60 seconds.  They'll be happier to be
 told that up front than having to work it out afterward.
 
 I think this is totally wrong.  The entire point of the GUC units system,
 or at least a large part of the point, is that users should not have to
 be intimately aware of the units in which a given value is measured
 internally.  And that in turn means that the units had better be such
 that users won't find them overly coarse.  If it matters a lot whether
 59 seconds gets rounded to 60, then we didn't make a good choice of units
 for the GUC in question; and we should fix that choice, not mess with the
 rounding rules.
 
 The case where this argument falls down is for special values, such as
 where zero means something quite different from the smallest nonzero
 value.  Peter suggested upthread that we should redefine any GUC values
 for which that is true, but (a) I think that loses on backwards
 compatibility grounds, and (b) ISTM zero is probably always special to
 some extent.  A zero time delay for example is not likely to work.
 
 Maybe we should leave the rounding behavior alone (there's not much
 evidence that rounding in one direction is worse than another; although
 I'd also be okay with changing to round-to-nearest), and confine ourselves
 to throwing an error for the single case that an apparently nonzero input
 value is truncated/rounded to zero as a result of units conversion.

To Andres' point:

SELECT unit, count(*) FROM pg_settings WHERE unit <> '' GROUP BY unit; (9.3
/ Ubuntu)

min (1 - log_rotation_age)
s (10)
ms (13)

kb (7)
8kb (6)

I don't know about the size implications but they seem to be non-existent. 
That any setting critically matters at +/- 1s or 1ms doesn't seem likely in
practice.  Even +/- 1min for a setting, if it did matter at extreme scale,
would be recognizable by the user in practice as a rounding artifact and
compensated for.

At this point, throwing an error for any input specified at a finer precision
than the setting's unit is my preference.  I would not change the rounding
rules, for the simple reason that there is no obvious improvement to be had,
so why introduce a pointless change that - however marginal and unlikely -
will be user-visible.

The complaint to overcome is avoiding an interpretation of zero when the
precision of the input is less than the GUC unit.  Lacking any concrete
complaints about our round-down policy I don't see where a change there is
worthwhile.  

Fixing zero as a special value falls under the same category.  As
mathematically pure as using infinity may be, the trade-off for practicality
and usability seems, even in light of this complaint, like the correct one
to have made.

David J.

--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/proposal-rounding-up-time-value-less-than-its-unit-tp5811102p5816409.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-26 Thread Tom Lane
I wrote:
 I wish it were cache-friendly too, per the upthread tangent about having
 to fetch keys from all over the place within a large JSON object.

 ... and while I was typing that sentence, lightning struck.  The existing
 arrangement of object subfields with keys and values interleaved is just
 plain dumb.  We should rearrange that as all the keys in order, then all
 the values in the same order.  Then the keys are naturally adjacent in
 memory and object-key searches become much more cache-friendly: you
 probably touch most of the key portion of the object, but none of the
 values portion, until you know exactly what part of the latter to fetch.
 This approach might complicate the lookup logic marginally but I bet not
 very much; and it will be a huge help if we ever want to do smart access
 to EXTERNAL (non-compressed) JSON values.

 I will go prototype that just to see how much code rearrangement is
 required.

This looks pretty good from a coding point of view.  I have not had time
yet to see if it affects the speed of the benchmark cases we've been
trying.  I suspect that it won't make much difference in them.  I think
if we do decide to make an on-disk format change, we should seriously
consider including this change.

The same concept could be applied to offset-based storage of course,
although I rather doubt that we'd make that combination of choices since
it would be giving up on-disk compatibility for benefits that are mostly
in the future.
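
To picture the keys-then-values layout in isolation, here is a toy,
self-contained sketch (illustrative only, not jsonb code; the struct and
function names are invented): with all keys stored contiguously ahead of
all values, the binary search touches only the key region until the
matching index is known, and only then reaches into the value region.

#include <stdio.h>
#include <string.h>

typedef struct
{
	int			npairs;
	const char **keys;			/* npairs keys, sorted */
	const char **values;		/* values stored in the same order as keys */
} ToyObject;

static const char *
toy_lookup(const ToyObject *obj, const char *key)
{
	int			lo = 0,
				hi = obj->npairs;

	while (lo < hi)				/* binary search over the key region only */
	{
		int			mid = lo + (hi - lo) / 2;
		int			cmp = strcmp(obj->keys[mid], key);

		if (cmp == 0)
			return obj->values[mid];	/* touch the value region only now */
		if (cmp < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return NULL;
}

int
main(void)
{
	const char *keys[] = {"alpha", "beta", "gamma"};
	const char *vals[] = {"1", "2", "3"};
	ToyObject	obj = {3, keys, vals};

	printf("beta -> %s\n", toy_lookup(&obj, "beta"));
	return 0;
}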

Attached are two patches: one is a delta against the last jsonb-lengths
patch I posted, and the other is a merged patch showing the total change
from HEAD, for ease of application.

regards, tom lane

diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c
index e47eaea..4e7fe67 100644
*** a/src/backend/utils/adt/jsonb_util.c
--- b/src/backend/utils/adt/jsonb_util.c
***
*** 26,33 
   * in MaxAllocSize, and the number of elements (or pairs) must fit in the bits
   * reserved for that in the JsonbContainer.header field.
   *
!  * (the total size of an array's elements is also limited by JENTRY_LENMASK,
!  * but we're not concerned about that here)
   */
  #define JSONB_MAX_ELEMS (Min(MaxAllocSize / sizeof(JsonbValue), JB_CMASK))
  #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), JB_CMASK))
--- 26,33 
   * in MaxAllocSize, and the number of elements (or pairs) must fit in the bits
   * reserved for that in the JsonbContainer.header field.
   *
!  * (The total size of an array's or object's elements is also limited by
!  * JENTRY_LENMASK, but we're not concerned about that here.)
   */
  #define JSONB_MAX_ELEMS (Min(MaxAllocSize / sizeof(JsonbValue), JB_CMASK))
  #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), JB_CMASK))
*** findJsonbValueFromContainer(JsonbContain
*** 294,303 
  {
  	JEntry	   *children = container->children;
  	int			count = (container->header & JB_CMASK);
! 	JsonbValue *result = palloc(sizeof(JsonbValue));
  
  	Assert((flags & ~(JB_FARRAY | JB_FOBJECT)) == 0);
  
  	if (flags & JB_FARRAY & container->header)
  	{
  		char	   *base_addr = (char *) (children + count);
--- 294,309 
  {
  	JEntry	   *children = container->children;
  	int			count = (container->header & JB_CMASK);
! 	JsonbValue *result;
  
  	Assert((flags & ~(JB_FARRAY | JB_FOBJECT)) == 0);
  
+ 	/* Quick out without a palloc cycle if object/array is empty */
+ 	if (count <= 0)
+ 		return NULL;
+ 
+ 	result = palloc(sizeof(JsonbValue));
+ 
  	if (flags & JB_FARRAY & container->header)
  	{
  		char	   *base_addr = (char *) (children + count);
*** findJsonbValueFromContainer(JsonbContain
*** 323,329 
  		char	   *base_addr = (char *) (children + count * 2);
  		uint32	   *offsets;
  		uint32		lastoff;
! 		int			lastoffpos;
  		uint32		stopLow = 0,
  	stopHigh = count;
  
--- 329,335 
  		char	   *base_addr = (char *) (children + count * 2);
  		uint32	   *offsets;
  		uint32		lastoff;
! 		int			i;
  		uint32		stopLow = 0,
  	stopHigh = count;
  
*** findJsonbValueFromContainer(JsonbContain
*** 332,379 
  
  		/*
  		 * We use a cache to avoid redundant getJsonbOffset() computations
! 		 * inside the search loop.  Note that count may well be zero at this
! 		 * point; to avoid an ugly special case for initializing lastoff and
! 		 * lastoffpos, we allocate one extra array element.
  		 */
! 		offsets = (uint32 *) palloc((count * 2 + 1) * sizeof(uint32));
! 		offsets[0] = lastoff = 0;
! 		lastoffpos = 0;
  
  		/* Binary search on object/pair keys *only* */
  		while (stopLow < stopHigh)
  		{
  			uint32		stopMiddle;
- 			int			index;
  			int			difference;
  			JsonbValue	candidate;
  
  			stopMiddle = stopLow + (stopHigh - stopLow) / 2;
  
- 			/*
- 			 * Compensate for the fact that we're searching through pairs (not
- 			 * entries).
- 			 */
- 			index = stopMiddle * 2;
- 
- 			/* Update the offsets cache through at least 

Re: [HACKERS] Per table autovacuum vacuum cost limit behaviour strange

2014-08-26 Thread Mark Kirkwood

On 27/08/14 10:27, Alvaro Herrera wrote:

Alvaro Herrera wrote:


So my proposal is a bit more complicated.  First we introduce the notion
of a single number, to enable sorting and computations: the delay
equivalent, which is the cost_limit divided by cost_delay.
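
(As a toy illustration of that single number, not the patch's code, and with
invented example values: the delay equivalent is simply cost_limit divided
by cost_delay for each worker, which makes workers with different per-table
settings directly comparable and summable.)

#include <stdio.h>

typedef struct
{
	double		cost_limit;		/* vacuum_cost_limit for this worker */
	double		cost_delay;		/* vacuum_cost_delay, in milliseconds */
} WorkerCost;

static double
delay_equivalent(const WorkerCost *w)
{
	/* higher value = more I/O allowed per unit of sleep time */
	return w->cost_limit / w->cost_delay;
}

int
main(void)
{
	WorkerCost	standard = {200.0, 20.0};	/* default-ish settings */
	WorkerCost	fast = {1000.0, 10.0};		/* table with a higher per-table limit */

	printf("standard worker: %.1f, fast worker: %.1f\n",
		   delay_equivalent(&standard), delay_equivalent(&fast));
	return 0;
}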


Here's a patch that implements this idea.  As you see this is quite a
bit more complicated than Haribabu's proposal.

There are two holes in this:

1. if you ALTER DATABASE to change vacuum delay for a database, those
values are not considered in the global equiv delay.  I don't think this
is very important and anyway we haven't considered this very much, so
it's okay if we don't handle it.

2. If you have a fast worker that's only slightly faster than regular
workers, it will become slower in some cases.  This is explained in a
FIXME comment in the patch.

I don't really have any more time to invest in this, but I would like to
see it in 9.4.  Mark, would you test this?  Haribabu, how open are you
to fixing point (2) above?



Thanks Alvaro - I will take a look.

regards

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] delta relations in AFTER triggers

2014-08-26 Thread Kevin Grittner
Kevin Grittner kgri...@ymail.com wrote:

 I think this is approaching a committable state, although I think
 it should probably be broken down to four separate patches.

And here they are.  This should net to the same set of changes as
the prior post, but the changes are logically separated.  They are
labeled as v3 to match the last post.

trigger-transition-tables allows transition table names to be
specified in a REFERENCING clause of CREATE TRIGGER, per spec, and
creates tuplestores when needed in the TriggerData structure.  It
doesn't worry about who does what with that data.  This doesn't
depend on anything else.
15 files changed, 577 insertions(+), 43 deletions(-)

spi-tuplestore-registry allows tuplestores, with associated name
and TupleDesc, to be registered with the current SPI connection.
Queries planned or executed on that connection will recognize the
name as a tuplestore relation.  It doesn't care who is registering
the tuplestores or what happens to them.  It doesn't depend on
anything else.
5 files changed, 445 insertions(+)

executor-tuplestore-relations covers parse analysis,
planner/optimizer, and executor layers.  It pulls information from
the registry in a couple places, but is not very intertwined with
SPI.  That is the only registry it knows right now, but it should
be easy to add other registries if needed.  It doesn't care where
the tuplestore came from, so we should be able to use this for
other things.  I have it in mind to use it for incremental
maintenance of materialized views, but I expect that other uses
will be found.  It has a logical dependency on the
spi-tuplestore-registry patch.  While it doesn't have a logical
dependency on trigger-transition-tables, they both modified some of
the same files, and this won't apply cleanly unless
trigger-transition-tables is applied first.  If you hand-correct
the failed hunks, it compiles and runs fine without
trigger-transition-tables.
30 files changed, 786 insertions(+), 9 deletions(-)

plpgsql-after-trigger-transition-tables takes the tuplestores from
TriggerData and registers them with SPI before trigger planning and
execution.  It depends on the trigger-transition-tables and
spi-tuplestore-registry patches to build, and won't do anything
useful at run time without the executor-tuplestore-relations patch.
3 files changed, 37 insertions(+), 11 deletions(-)

Hopefully this will make review easier.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

trigger-transition-tables-v3.patch.gz
Description: application/tgz


spi-tuplestore-registry-v3.patch.gz
Description: application/tgz


executor-tuplestore-relations-v3.patch.gz
Description: application/tgz


plpgsql-after-trigger-transition-tables-v3.patch.gz
Description: application/tgz

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   >