Re: [HACKERS] postgresql latency bgwriter not doing its job
Hello Josh,

> So I think that you're confusing the roles of bgwriter vs. spread checkpoint. What you're experiencing above is pretty common for non-spread checkpoints on slow storage (and RAID5 is slow for DB updates, no matter how fast the disks are), or for attempts to do spread checkpoint on filesystems which don't support it (e.g. Ext3, HFS+). In either case, what's happening is that the *OS* is freezing all logical and physical IO while it works to write out all of RAM, which makes me suspect you're using Ext3 or HFS+.

I'm using ext4 on debian wheezy with postgresql 9.4b2.

I agree that the OS may be able to help, but this aspect does not currently work for me at all out of the box. The "all of RAM" is really a few thousand 8 kB pages written randomly, a few dozen MB. Also, if pg needs advanced OS tweaking to handle a small load, ISTM that it fails at simplicity:-(

As for checkpoint spreading, raising checkpoint_completion_target to 0.9 degrades the situation (20% of transactions are more than 200 ms late instead of 10%, and bgwriter wrote less than 1 page per second on a 500 s run). So maybe there is a bug here somewhere.

> Making the bgwriter more aggressive adds a significant risk of writing the same pages multiple times between checkpoints, so it's not a simple fix.

Hmmm... This must be balanced with the risk of being offline. Not all people are interested in throughput at the price of latency, so there could be settings that help latency, even at the price of reducing throughput (average tps). After that, it is the administrator's choice to tune pg for higher throughput or lower latency.

Note that writing some least-recently-used page multiple times does not seem to be any issue at all for me under small/medium load, especially as the system has nothing else to do: if you have nothing else to do, there is no cost in writing a page, even if you may have to write it again some time later, and it helps prevent dirty pages from accumulating.
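For concreteness, the knobs discussed above are all plain postgresql.conf settings. A purely illustrative latency-oriented sketch follows — the values are assumptions for discussion, not settings reported in this thread, and note that in my runs raising checkpoint_completion_target actually made things worse:

```
# checkpoint spreading (9.4-era settings)
checkpoint_timeout = 5min              # default
checkpoint_segments = 16               # xlog-triggered checkpoint threshold
checkpoint_completion_target = 0.9     # default 0.5; spread writes further

# make the background writer wake more often and write more per round
bgwriter_delay = 50ms                  # default 200ms
bgwriter_lru_maxpages = 400            # default 100
```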
So it seems to me that pg can help, it is not only/merely an OS issue. -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Escaping from blocked send() reprised.
Sorry, I was absorbed by other tasks. Thank you for reviewing this.

> On 07/01/2014 06:26 AM, Kyotaro HORIGUCHI wrote:
>> At Mon, 30 Jun 2014 11:27:47 -0400, Robert Haas robertmh...@gmail.com wrote in CA+TgmoZfcGzAEmtbyoCe6VdHnq085x+ox752zuJ2AKN=wc8...@mail.gmail.com
>>> 1. I think it's the case that there are platforms around where a signal won't cause send() to return EINTR and I'd be entirely unsurprised if SSL_write() doesn't necessarily return EINTR in that case. I'm not sure what, if anything, we can do about that.
>
> We use a custom write routine with SSL_write, where we call send() ourselves, so that's not a problem as long as we put the check in the right place (in secure_raw_write(), after my recent SSL refactoring - the patch needs to be rebased).

man 2 send on FreeBSD has no description of EINTR. And even on Linux, send() won't return EINTR in most cases; at least I haven't seen it do so. So send() = -1 with EINTR seems to me to be merely an equivalent of send() = 0. I have no idea what difference the implementer had in mind.

> As the patch stands, there's a race condition: if the SIGTERM arrives *before* the send() call, the send() won't return EINTR anyway. So there's a chance that you still block. Calling pq_terminate_backend() again will dislodge it (assuming send() returns with EINTR on signal),

Yes, that window wouldn't be closed without introducing something more. EINTR is set only when nothing was sent by the call, so AFAICS the chance of getting EINTR is far smaller than one might expect.

> but I don't think we want to define the behavior as "usually, pq_terminate_backend() will kill a backend that's blocked on sending to the client, but sometimes you have to call it twice (or more!) to really kill it".

I agree that avoiding that would be desirable behavior, if there is any measure to achieve it. But I think it's still better than doing kill -9 and engulfing all innocent backends.

> A more robust way is to set ImmediateInterruptOK before calling send().
> That wouldn't let you send data that can be sent without blocking though. For that, you could put the socket to non-blocking mode, and sleep with select(), also waiting for the process' latch at the same time (die() sets the latch, so that will wake up the select() if a termination request arrives).

I considered it, but select() frequently (rather, in most cases where send() blocks due to send-buffer exhaustion) fails to predict that the following send() will block. (If my memory is correct.) So the final problem would be the blocked send()...

> Is it actually safe to process the die-interrupt where send() is called? ProcessInterrupts() does ereport(FATAL, ...), which will attempt to send a message to the client. If that happens in the middle of constructing some other message, that will violate the protocol.

So I strongly agree with you, if select() works as the man page gives the impression it does.

>>> 2. I think it would be reasonable to try to kill off the connection without notifying the client if we're unable to send the data to the client in a reasonable period of time. But I'm unsure what a reasonable period of time means. This patch would basically do it after no delay at all, which seems like it might be too aggressive. However, I'm not sure.
>>
>> I think there's no such reasonable time.
>
> I agree it's pretty hard to define any reasonable timeout here. I think it would be fine to just cut the connection; even if you don't block while sending, you'll probably reach a CHECK_FOR_INTERRUPT() somewhere higher in the stack and kill the connection almost as abruptly anyway. (You can't violate the protocol, however.)

Yes, closing the blocked connection seems one of the smarter ways, and checking which interrupt occurred could avoid a protocol violation. But the problem is that there seems to be no means to close the socket from anywhere other than the blocked process: a dup(2)'ed handle cannot release the resource by itself.
regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] postgresql latency bgwriter not doing its job
On Monday, August 25, 2014, Fabien COELHO coe...@cri.ensmp.fr wrote:

> I have not found any means to force bgwriter to send writes when it can. (Well, I have: create a process which sends CHECKPOINT every 0.2 seconds... it works more or less, but this is not my point:-)

There is scan_whole_pool_milliseconds, which currently forces bgwriter to circle the buffer pool at least once every 2 minutes. It is currently fixed, but it should be trivial to turn it into an experimental guc that you could use to test your hypothesis.

Cheers, Jeff
Re: [HACKERS] postgresql latency bgwriter not doing its job
[oops, wrong from, resent...]

Hello Jeff,

>> The culprit I found is bgwriter, which is basically doing nothing to prevent the coming checkpoint IO storm, even though there would be ample time to write the accumulating dirty pages so that checkpoint would find a clean field and pass in a blink. Indeed, at the end of the 500 seconds throttled test, pg_stat_bgwriter says:
>
> Are you doing pg_stat_reset_shared('bgwriter') after running pgbench -i?

Yes, I did.

> You don't want your steady state stats polluted by the bulk load.

Sure!

>> buffers_checkpoint = 19046
>> buffers_clean = 2995
>
> Out of curiosity, what does buffers_backend show?

buffers_backend = 157

> In any event, this almost certainly is a red herring.

Possibly. It is pretty easy to reproduce, though.

> Whichever of the three ways are being used to write out the buffers, it is the checkpointer that is responsible for fsyncing them, and that is where your drama is almost certainly occurring. Writing out with one path rather than another isn't going to change things, unless you change the fsync.

Well, I agree partially. ISTM that the OS does not need to wait for fsync to start writing pages: if it is receiving one minute of buffer writes at 50 writes per second, I would have thought that some scheduler would start handling the flow before the fsync... So I thought that having bgwriter write the buffers would help, but maybe there is a better way.

> Also, are you familiar with checkpoint_completion_target, and what is it set to?

The default 0.5. Moving to 0.9 seems to worsen the situation.

-- Fabien.
Re: [HACKERS] Escaping from blocked send() reprised.
On 08/26/2014 09:17 AM, Kyotaro HORIGUCHI wrote:
>> but I don't think we want to define the behavior as "usually, pq_terminate_backend() will kill a backend that's blocked on sending to the client, but sometimes you have to call it twice (or more!) to really kill it".
>
> I agree that avoiding that would be desirable behavior, if there is any measure to achieve it. But I think it's still better than doing kill -9 and engulfing all innocent backends.
>
>> A more robust way is to set ImmediateInterruptOK before calling send(). That wouldn't let you send data that can be sent without blocking though. For that, you could put the socket to non-blocking mode, and sleep with select(), also waiting for the process' latch at the same time (die() sets the latch, so that will wake up the select() if a termination request arrives).
>
> I considered it, but select() frequently (rather, in most cases where send() blocks due to send-buffer exhaustion) fails to predict that the following send() will block. (If my memory is correct.) So the final problem would be the blocked send()...

My point was to put the socket in non-blocking mode, so that send() will return immediately with EAGAIN instead of blocking, if the send buffer is full. See WalSndWriteData for how that would work; it does something similar.

>> Is it actually safe to process the die-interrupt where send() is called? ProcessInterrupts() does ereport(FATAL, ...), which will attempt to send a message to the client. If that happens in the middle of constructing some other message, that will violate the protocol.
>
> So I strongly agree with you, if select() works as the man page gives the impression it does.

Not sure what you mean, but the above is a fatal problem with the patch right now, regardless of how you do the sleeping.

>>> 2. I think it would be reasonable to try to kill off the connection without notifying the client if we're unable to send the data to the client in a reasonable period of time. But I'm unsure what a reasonable period of time means.
>>> This patch would basically do it after no delay at all, which seems like it might be too aggressive. However, I'm not sure.
>>
>> I think there's no such reasonable time.
>
> I agree it's pretty hard to define any reasonable timeout here. I think it would be fine to just cut the connection; even if you don't block while sending, you'll probably reach a CHECK_FOR_INTERRUPT() somewhere higher in the stack and kill the connection almost as abruptly anyway. (You can't violate the protocol, however.)

>> Yes, closing the blocked connection seems one of the smarter ways, and checking which interrupt occurred could avoid a protocol violation. But the problem is that there seems to be no means to close the socket from anywhere other than the blocked process: a dup(2)'ed handle cannot release the resource by itself.

I didn't understand that; surely you can just close() the socket? There is no dup(2) involved. And we don't necessarily need to close the socket, we just need to avoid writing to it when we're already in the middle of sending a message.

I'm marking this as Waiting on Author in the commitfest app, because: 1. the protocol violation needs to be avoided one way or another, and 2. the behavior needs to be consistent, so that a single pg_terminate_backend() is always enough to kill the connection.

- Heikki
Re: [HACKERS] Verbose output of pg_dump not show schema name
On Tue, Aug 26, 2014 at 3:48 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
> AFAICS, the namespace can never be NULL in any of these. There is a selectSourceSchema(fout, tbinfo->dobj.namespace->dobj.name) call before or after printing the message, so if tbinfo->dobj.namespace is NULL, you'll crash anyway. Please double-check, and remove the dead code if you agree.

Ah right, this field is used in many places. Even for pg_backup_archiver.c, the portion of code processing data always has the namespace set. I am sure that Fabrizio would have done that quickly, but as I was on this thread I simplified the patch as attached.

Regards,
-- Michael

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 3aebac8..7c0616d 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -546,8 +546,13 @@ RestoreArchive(Archive *AHX)
 			/* Both schema and data objects might now have ownership/ACLs */
 			if ((te->reqs & (REQ_SCHEMA | REQ_DATA)) != 0)
 			{
-				ahlog(AH, 1, "setting owner and privileges for %s %s\n",
-					  te->desc, te->tag);
+				/* Show namespace if available */
+				if (te->namespace)
+					ahlog(AH, 1, "setting owner and privileges for %s \"%s\".\"%s\"\n",
+						  te->desc, te->namespace, te->tag);
+				else
+					ahlog(AH, 1, "setting owner and privileges for %s \"%s\"\n",
+						  te->desc, te->tag);
 				_printTocEntry(AH, te, ropt, false, true);
 			}
 		}
@@ -621,7 +626,13 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te,
 		if ((reqs & REQ_SCHEMA) != 0)	/* We want the schema */
 		{
-			ahlog(AH, 1, "creating %s %s\n", te->desc, te->tag);
+			/* Show namespace if available */
+			if (te->namespace)
+				ahlog(AH, 1, "creating %s \"%s\".\"%s\"\n",
+					  te->desc, te->namespace, te->tag);
+			else
+				ahlog(AH, 1, "creating %s \"%s\"\n", te->desc, te->tag);
+
 			_printTocEntry(AH, te, ropt, false, false);
 			defnDumped = true;
@@ -713,8 +724,9 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te,
 				_becomeOwner(AH, te);
 				_selectOutputSchema(AH, te->namespace);
-				ahlog(AH, 1, "processing data for table \"%s\"\n",
-					  te->tag);
+				/* Show namespace if available */
+				ahlog(AH, 1, "processing data for table \"%s\".\"%s\"\n",
+					  te->namespace, te->tag);
 				/*
 				 * In parallel restore, if we created the table earlier in
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 5c0f95f..c084ee9 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -1400,7 +1400,8 @@ dumpTableData_copy(Archive *fout, void *dcontext)
 	const char *column_list;
 	if (g_verbose)
-		write_msg(NULL, "dumping contents of table %s\n", classname);
+		write_msg(NULL, "dumping contents of table \"%s\".\"%s\"\n",
+				  tbinfo->dobj.namespace->dobj.name, classname);
 	/*
 	 * Make sure we are in proper schema. We will qualify the table name
@@ -5019,7 +5020,8 @@ getIndexes(Archive *fout, TableInfo tblinfo[], int numTables)
 			continue;
 		if (g_verbose)
-			write_msg(NULL, "reading indexes for table \"%s\"\n",
+			write_msg(NULL, "reading indexes for table \"%s\".\"%s\"\n",
+					  tbinfo->dobj.namespace->dobj.name,
 					  tbinfo->dobj.name);
 		/* Make sure we are in proper schema so indexdef is right */
@@ -5385,7 +5387,8 @@ getConstraints(Archive *fout, TableInfo tblinfo[], int numTables)
 			continue;
 		if (g_verbose)
-			write_msg(NULL, "reading foreign key constraints for table \"%s\"\n",
+			write_msg(NULL, "reading foreign key constraints for table \"%s\".\"%s\"\n",
+					  tbinfo->dobj.namespace->dobj.name,
 					  tbinfo->dobj.name);
 		/*
@@ -5723,7 +5726,8 @@ getTriggers(Archive *fout, TableInfo tblinfo[], int numTables)
 			continue;
 		if (g_verbose)
-			write_msg(NULL, "reading triggers for table \"%s\"\n",
+			write_msg(NULL, "reading triggers for table \"%s\".\"%s\"\n",
+					  tbinfo->dobj.namespace->dobj.name,
 					  tbinfo->dobj.name);
 		/*
@@ -6336,8 +6340,9 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 		 * the output of an indexscan on pg_attribute_relid_attnum_index.
 		 */
 		if (g_verbose)
-			write_msg(NULL, "finding the columns and types of table \"%s\"\n",
-					  tbinfo->dobj.name);
+			write_msg(NULL, "finding the columns and types of table \"%s\".\"%s\"\n",
+					  tbinfo->dobj.namespace->dobj.name,
+					  tbinfo->dobj.name);
 		resetPQExpBuffer(q);
@@ -6548,8 +6553,9 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 			int numDefaults;
 			if (g_verbose)
-				write_msg(NULL, "finding default expressions of table \"%s\"\n",
-						  tbinfo->dobj.name);
+				write_msg(NULL, "finding default expressions of table \"%s\".\"%s\"\n",
+						  tbinfo->dobj.namespace->dobj.name,
+						  tbinfo->dobj.name);
 			resetPQExpBuffer(q);
 			if (fout->remoteVersion >= 70300)
@@ -6672,7 +6678,8 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 			int numConstrs;
 			if (g_verbose)
-				write_msg(NULL, "finding check constraints for table
Re: [HACKERS] Escaping from blocked send() reprised.
Hello,

>> I considered it, but select() frequently (rather, in most cases where send() blocks due to send-buffer exhaustion) fails to predict that the following send() will block. (If my memory is correct.) So the final problem would be the blocked send()...
>
> My point was to put the socket in non-blocking mode, so that send() will return immediately with EAGAIN instead of blocking, if the send buffer is full. See WalSndWriteData for how that would work; it does something similar.

I confused it with what I did while writing this patch: select() plus a *blocking* send(). Sorry for confusing the discussion. I now understand correctly what you mean, and it sounds reasonable.

>>> I agree it's pretty hard to define any reasonable timeout here. I think it would be fine to just cut the connection; even if you don't block while sending, you'll probably reach a CHECK_FOR_INTERRUPT() somewhere higher in the stack and kill the connection almost as abruptly anyway. (You can't violate the protocol, however.)
>>
>> Yes, closing the blocked connection seems one of the smarter ways, and checking which interrupt occurred could avoid a protocol violation. But the problem is that there seems to be no means to close the socket from anywhere other than the blocked process: a dup(2)'ed handle cannot release the resource by itself.
>
> I didn't understand that; surely you can just close() the socket? There is no dup(2) involved. And we don't necessarily need to close the socket, we just need to avoid writing to it when we're already in the middle of sending a message.

My assumption there was select() and a *blocking* send(), so the cited sentence was beside the point from the start.

> I'm marking this as Waiting on Author in the commitfest app, because: 1. the protocol violation needs to be avoided one way or another, and 2. the behavior needs to be consistent, so that a single pg_terminate_backend() is always enough to kill the connection.

Thank you for the suggestion. I think I can go forward with that and will come up with a new patch.
regards, -- Kyotaro Horiguchi NTT Open Source Software Center
[HACKERS] Commitfest status
The first week of the commitfest is now behind us.

There are still 15 patches in Needs Review state with no reviewer assigned. Please pick a patch and review! There are 20 patches in Needs Review state with a reviewer assigned. If you have signed up as the reviewer, please proceed with the review.

If you have submitted a patch for this commitfest, please check the status of your patch. If it is in Waiting on Author state, you are expected to submit a new version of the patch, addressing any review comments you have received. If you don't have the time to update your patch in the next few days, please mark your patch as Returned with Feedback and resubmit for the next commitfest. If your patch is in Waiting on Author state but you don't know what you should do to it, ask for clarification.

Committers: please pick a patch that's been marked as Ready for Committer, verify that it has been adequately reviewed, and proceed to commit or bounce it back.

- Heikki
Re: [HACKERS] postgresql latency bgwriter not doing its job
Hello Amit,

> I think another thing to know here is why exactly the checkpoint storm is causing tps to drop so steeply.

Yep. Actually it is not strictly 0, but a few tps that I rounded to 0.

progress: 63.0 s, 47.0 tps, lat 2.810 ms stddev 5.194, lag 0.354 ms
progress: 64.1 s, 11.9 tps, lat 81.481 ms stddev 218.026, lag 74.172 ms
progress: 65.2 s, 1.9 tps, lat 950.455 ms stddev 125.602, lag 1421.203 ms
progress: 66.1 s, 4.5 tps, lat 604.790 ms stddev 440.073, lag 2418.128 ms
progress: 67.2 s, 6.0 tps, lat 322.440 ms stddev 68.276, lag 3146.302 ms
progress: 68.0 s, 2.4 tps, lat 759.509 ms stddev 62.141, lag 4229.239 ms
progress: 69.4 s, 3.6 tps, lat 440.335 ms stddev 369.207, lag 4842.371 ms

Transactions are 4.8 seconds behind schedule at this point.

> One reason could be that backends might need to write more WAL due to full_page_writes; another could be contention around the buffer content_lock. To dig more into the reason, the same tests can be tried with full_page_writes = off and/or synchronous_commit = off to see if WAL writes are causing tps to go down.

Given the small flow of updates, I do not think that there should be reason to get that big a write contention between WAL and checkpoint.

I tried with full_page_writes = off for 500 seconds: same overall behavior, 8.5% of transactions are stuck (instead of 10%). However, in detail pg_stat_bgwriter is quite different:

buffers_checkpoint = 13906
buffers_clean = 20748
buffers_backend = 472

That seems to suggest that bgwriter did some stuff for once, but that did not change the result much in the end. This would imply that my suggestion to make bgwriter write more would not fix the problem alone.

With synchronous_commit = off, the situation is much improved, with only 0.3% of transactions stuck. Not a surprise.
However, I would not recommend that as a solution:-)

Currently, the only way I was able to solve the issue while still writing to disk is to send CHECKPOINT every 0.2 s, as if I had set checkpoint_timeout = 0.2s (although this is not currently allowed).

> Similarly for checkpoints, use checkpoint_completion_target to spread the checkpoint writes, as suggested by Jeff as well, to see if that can mitigate the problem.

I had already tried, and retried after Jeff's suggestion. This does not seem to mitigate anything; on the contrary.

-- Fabien.
Re: [HACKERS] pg_receivexlog and replication slots
On Tue, Aug 19, 2014 at 2:49 PM, Michael Paquier michael.paqu...@gmail.com wrote:
> On Mon, Aug 18, 2014 at 4:01 PM, Michael Paquier michael.paqu...@gmail.com wrote:
>> On Mon, Aug 18, 2014 at 3:48 PM, Fujii Masao masao.fu...@gmail.com wrote:
>>> On Mon, Aug 18, 2014 at 2:38 PM, Michael Paquier michael.paqu...@gmail.com wrote:
> And now looking at your patch there is additional refactoring possible with IDENTIFY_SYSTEM and pg_basebackup as well... And attached is a rebased patch doing so.

I reworked this patch a bit to take into account the fact that the number of columns to check after running IDENTIFY_SYSTEM is not always 4, as pointed out by Andres:
- pg_basebackup = 3
- pg_receivexlog = 3
- pg_recvlogical = 4

Regards,
-- Michael

From c54c987f34fbdc7d7cdf59af114a247029be532e Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Mon, 18 Aug 2014 14:32:44 +0900
Subject: [PATCH] Add support for physical slot creation/deletion in pg_receivexlog

Physical slot creation can be done with --create and drop with --drop. In both cases --slot is needed. Code for replication slot creation and drop is refactored with what was available in pg_recvlogical; the same applies to IDENTIFY_SYSTEM, to simplify the whole set.
---
 doc/src/sgml/ref/pg_receivexlog.sgml   |  29 +++
 src/bin/pg_basebackup/pg_basebackup.c  |  21 +----
 src/bin/pg_basebackup/pg_receivexlog.c | 108 +++++++++++++++-
 src/bin/pg_basebackup/pg_recvlogical.c | 119 ++----------------
 src/bin/pg_basebackup/streamutil.c     | 150 +++++++++++++++++++++
 src/bin/pg_basebackup/streamutil.h     |  10 +++
 6 files changed, 279 insertions(+), 158 deletions(-)

diff --git a/doc/src/sgml/ref/pg_receivexlog.sgml b/doc/src/sgml/ref/pg_receivexlog.sgml
index 5916b8f..7e46005 100644
--- a/doc/src/sgml/ref/pg_receivexlog.sgml
+++ b/doc/src/sgml/ref/pg_receivexlog.sgml
@@ -72,6 +72,35 @@ PostgreSQL documentation
   <title>Options</title>

   <para>
+   <application>pg_receivexlog</application> can run in one of two following
+   modes, which control physical replication slot:
+
+   <variablelist>
+
+    <varlistentry>
+     <term><option>--create</option></term>
+     <listitem>
+      <para>
+       Create a new physical replication slot with the name specified in
+       <option>--slot</option>, then exit.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--drop</option></term>
+     <listitem>
+      <para>
+       Drop the replication slot with the name specified
+       in <option>--slot</option>, then exit.
+      </para>
+     </listitem>
+    </varlistentry>
+   </variablelist>
+
+  </para>
+
+  <para>
    The following command-line options control the location and format of
    the output.
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 8b9acea..c5d9ec2 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -1569,8 +1569,8 @@ BaseBackup(void)
 {
 	PGresult   *res;
 	char	   *sysidentifier;
-	uint32		latesttli;
-	uint32		starttli;
+	TimeLineID	latesttli;
+	TimeLineID	starttli;
 	char	   *basebkp;
 	char		escaped_label[MAXPGPATH];
 	char	   *maxrate_clause = NULL;
@@ -1624,23 +1624,8 @@ BaseBackup(void)
 	/*
 	 * Run IDENTIFY_SYSTEM so we can get the timeline
 	 */
-	res = PQexec(conn, "IDENTIFY_SYSTEM");
-	if (PQresultStatus(res) != PGRES_TUPLES_OK)
-	{
-		fprintf(stderr, _("%s: could not send replication command \"%s\": %s"),
-				progname, "IDENTIFY_SYSTEM", PQerrorMessage(conn));
+	if (!RunIdentifySystem(conn, &sysidentifier, &latesttli, NULL, 3))
 		disconnect_and_exit(1);
-	}
-	if (PQntuples(res) != 1 || PQnfields(res) < 3)
-	{
-		fprintf(stderr,
-				_("%s: could not identify system: got %d rows and %d fields, expected %d rows and %d or more fields\n"),
-				progname, PQntuples(res), PQnfields(res), 1, 3);
-		disconnect_and_exit(1);
-	}
-	sysidentifier = pg_strdup(PQgetvalue(res, 0, 0));
-	latesttli = atoi(PQgetvalue(res, 0, 1));
-	PQclear(res);

 	/*
 	 * Start the actual backup
diff --git a/src/bin/pg_basebackup/pg_receivexlog.c b/src/bin/pg_basebackup/pg_receivexlog.c
index a8b9ad3..a191c29 100644
--- a/src/bin/pg_basebackup/pg_receivexlog.c
+++ b/src/bin/pg_basebackup/pg_receivexlog.c
@@ -38,6 +38,8 @@ static int	noloop = 0;
 static int	standby_message_timeout = 10 * 1000;	/* 10 sec = default */
 static int	fsync_interval = 0;	/* 0 = default */
 static volatile bool time_to_abort = false;
+static bool do_create_slot = false;
+static bool do_drop_slot = false;

 static void usage(void);
@@ -78,6 +80,9 @@ usage(void)
 	printf(_("  -w, --no-password      never prompt for password\n"));
 	printf(_("  -W, --password         force password prompt (should happen automatically)\n"));
 	printf(_("  -S, --slot=SLOTNAME    replication slot to use\n"));
+	printf(_("\nOptional actions:\n"));
+	printf(_("      --create           create a new replication slot (for the slot's name see --slot)\n"));
+	printf(_("      --drop
Re: [HACKERS] postgresql latency bgwriter not doing its job
Hello again,

>> I have not found any means to force bgwriter to send writes when it can. (Well, I have: create a process which sends CHECKPOINT every 0.2 seconds... it works more or less, but this is not my point:-)
>
> There is scan_whole_pool_milliseconds, which currently forces bgwriter to circle the buffer pool at least once every 2 minutes. It is currently fixed, but it should be trivial to turn it into an experimental guc that you could use to test your hypothesis.

I recompiled with the variable coldly set to 1000 instead of 120000. The situation is slightly degraded (15% of transactions were above 200 ms late). However, it seems that bgwriter did not write many more pages:

buffers_checkpoint = 26065
buffers_clean = 5263
buffers_backend = 367

Or I may have a problem interpreting pg_stat_bgwriter. It seems that changing this value is not enough to persuade bgwriter to write more pages. Or I may have done something wrong, but I do not know what.

-- Fabien.
Re: [HACKERS] Concurrently option for reindexdb
On 2014-08-26 12:44:43 +0900, Michael Paquier wrote:
> On Tue, Aug 26, 2014 at 12:28 PM, Fujii Masao masao.fu...@gmail.com wrote:
>> +many. Although I'm not sure if we managed to find a safe relation swap.
> Well, we didn't AFAIK. With the latest patch provided I could not really find any hole in the logic, and Andres felt that something may be wrong miles away. If I'd revisit the patch now with a rebased version, maybe I might find something...

I don't think it was miles away, but I'll look into the rebased version.

>> That safe relation swap is possible if an AccessExclusive lock is taken. Right? That means that REINDEX CONCURRENTLY is not completely concurrent, but I think that many users would be satisfied with even this feature.
> This would block as well isolation tests on this feature, something not that welcome for a feature calling itself concurrently,

Right. But it's much better than what we have now. Possibly we can rename the feature... :/

> but it would greatly simplify the patch and reduce deadlock occurrences if done right with the exclusive locks (no need to check for past snapshots when using ShareUpdateExclusiveLock?).

I'm not sure if you really can get rid of the waiting for past snapshots without making the feature much more heavyweight than necessary.

> Reading this thread, the consensus would be to use an exclusive lock for the swap and be done. Well, if there are enough votes for this approach I wouldn't mind resending an updated patch for the next CF.

I always was of the opinion that an exclusive lock is still *MUCH* better than what we have today.

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] postgresql latency bgwriter not doing its job
On 2014-08-26 08:12:48 +0200, Fabien COELHO wrote:
> As for checkpoint spreading, raising checkpoint_completion_target to 0.9 degrades the situation (20% of transactions are more than 200 ms late instead of 10%, and bgwriter wrote less than 1 page per second on a 500 s run). So maybe there is a bug here somewhere.

What are the other settings here? checkpoint_segments, checkpoint_timeout, wal_buffers? Could you show the output of log_checkpoints during that run?

Checkpoint spreading only works halfway efficiently if all checkpoints are triggered by time and not by xlog.

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
On Mon, August 25, 2014 07:21, Andrew Gierth wrote:
> Here is the new version of our grouping sets patch. This version supersedes the previous post.

The patches did not apply anymore, so I applied them at 73eba19aebe0. There they applied OK, and make check was OK.

drop table if exists items_sold;
create table items_sold as
  select * from (
    values ('Foo', 'L', 10)
         , ('Foo', 'M', 20)
         , ('Bar', 'M', 15)
         , ('Bar', 'L', 5)
  ) as f(brand, size, sales);

select brand, size, grouping(brand, size), sum(sales)
from items_sold
group by rollup(brand, size);
-- WARNING: unrecognized node type: 347

I suppose that's not correct.

thanks, Erik Rijkers
Re: [HACKERS] postgresql latency bgwriter not doing its job
Hello Andres, checkpoint when the segments are full... the server is unresponsive about 10% of the time (one in ten transactions is late by more than 200 ms). That's ext4 I guess? Yes! Did you check whether xfs yields a, err, more predictable performance? No. I cannot test that easily without reinstalling the box. I did some quick tests with ZFS/FreeBSD which seemed to freeze the same, but not in the very same conditions. Maybe I could try again. [...] Note that it would *not* be a good idea to make the bgwriter write out everything, as much as possible - that'd turn sequential write io into random write io. Hmmm. I'm not sure it would necessarily be the case, it depends on how bgwriter would choose the pages to write? If they are chosen randomly then indeed that could be bad. If there is a big sequential write, should not the backend do the write directly anyway? ISTM that currently checkpoint is mostly random writes anyway, at least with the OLTP write load of pgbench. I'm just trying to be able to start them earlier so that they can be completed quickly. So although bgwriter is not the solution, ISTM that pg has no reason to wait for minutes before starting to write dirty pages, if it has nothing else to do. If the OS does some retention later and cannot spread the load, as Josh suggests, this could also be a problem, but currently the OS seems not to have much to write (but WAL) till the checkpoint. -- Fabien.
Re: [HACKERS] postgresql latency bgwriter not doing its job
On 2014-08-26 10:25:29 +0200, Fabien COELHO wrote: Did you check whether xfs yields a, err, more predictable performance? No. I cannot test that easily without reinstalling the box. I did some quick tests with ZFS/FreeBSD which seemed to freeze the same, but not in the very same conditions. Maybe I could try again. After Robert and I went to LSF/MM this spring I sent a test program for precisely this problem and while it could *crash* machines when using ext4, xfs yielded much more predictable performance. There's a problem with prioritization of write vs read IO that's apparently FS dependent. [...] Note that it would *not* be a good idea to make the bgwriter write out everything, as much as possible - that'd turn sequential write io into random write io. Hmmm. I'm not sure it would necessarily be the case, it depends on how bgwriter would choose the pages to write? If they are chosen randomly then indeed that could be bad. They essentially have to be random to fulfil its role of reducing the likelihood of a backend having to write out a buffer itself. Consider how the clock sweep algorithm (not that I am happy with it) works. When looking for a new victim buffer all backends scan the buffer cache in one continuous cycle. If they find a buffer with a usagecount == 0 they'll use that one and throw away its contents. Otherwise they reduce the usagecount by 1 and move on. What the bgwriter *tries* to do is to write out buffers with usagecount == 0 that are dirty and will soon be visited in the clock cycle, to avoid having the backends do that. If there is a big sequential write, should not the backend do the write directly anyway? ISTM that currently checkpoint is mostly random writes anyway, at least with the OLTP write load of pgbench. I'm just trying to be able to start them earlier so that they can be completed quickly. If the IO scheduling worked - which it really doesn't in many cases - there'd really be no need to make it finish fast.
I think you should try to tune spread checkpoints to have less impact, not make bgwriter do something it's not written for. So although bgwriter is not the solution, ISTM that pg has no reason to wait for minutes before starting to write dirty pages, if it has nothing else to do. That precisely *IS* a spread checkpoint. If the OS does some retention later and cannot spread the load, as Josh suggests, this could also be a problem, but currently the OS seems not to have much to write (but WAL) till the checkpoint. The actual problem is that the writes by the checkpointer - done in the background - aren't flushed out of the OS's page cache eagerly enough. Then, when the final phase of the checkpoint comes, where relation files need to be fsynced, some filesystems essentially stall while trying to write out lots and lots of dirty buffers. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] pgbench throttling latency limit
Hi Fabien, On Sun, Aug 24, 2014 at 9:16 AM, Fabien COELHO coe...@cri.ensmp.fr wrote: Find attached a new version: - fix dropped percent computation in the final report - simplify progress report code I have reviewed this patch. Is the patch in a patch format which has context? Yes. Does it apply cleanly to the current git master? Yes. Does it include reasonable tests, necessary doc patches, etc? Yes. Does the patch actually implement that? Yes. Do we want that? I think we do, yes. Do we already have it? No. Are there dangers? None that I can see. Does the feature work as advertised? Almost, see below. Are there corner cases the author has failed to consider? None that I can see. Are there any assertion failures or crashes? No. I can't make the -L option work at all. If I do this:

./pgbench -R 100 -L 1

I get:

pgbench: invalid option -- L

which appears to be caused by the fact that the call to getopt_long() has not been updated to reflect the new parameter. Also this part:

+-L, --limit=NUM under throttling (--rate), skip transactions that\n
+ far behind schedule in ms (default: do not skip)\n

I would suggest rewording this to something like "skip transactions that are more than NUM milliseconds behind schedule (default: do not skip)". Marking "Waiting for Author" until these small issues have been fixed. Thanks, ♜
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/16/2014 02:19 AM, Tom Lane wrote: I think the realistic alternatives at this point are either to switch to all-lengths as in my test patch, or to use the hybrid approach of Heikki's test patch. IMO the major attraction of Heikki's patch is that it'd be upward compatible with existing beta installations, ie no initdb required (but thus, no opportunity to squeeze in a version identifier either). It's not showing up terribly well in the performance tests I've been doing --- it's about halfway between HEAD and my patch on that extract-a-key-from-a-PLAIN-stored-column test. But, just as with my patch, there are things that could be done to micro-optimize it by touching a bit more code. I did some quick stats comparing compressed sizes for the delicio.us data, printing quartiles as per Josh's lead:

all-lengths     {440,569,609,655,1257}
Heikki's patch  {456,582,624,671,1274}
HEAD            {493,636,684,744,1485}

(As before, this is pg_column_size of the jsonb within a table whose rows are wide enough to force tuptoaster.c to try to compress the jsonb; otherwise many of these values wouldn't get compressed.) These documents don't have enough keys to trigger the first_success_by issue, so that HEAD doesn't look too awful, but still there's about an 11% gain from switching from offsets to lengths. Heikki's method captures much of that but not all. Personally I'd prefer to go to the all-lengths approach, but a large part of that comes from a subjective assessment that the hybrid approach is too messy. Others might well disagree. It's not too pretty, no. But it would be nice to not have to make a tradeoff between lookup speed and compressibility. Yet another idea is to store all lengths, but add an additional array of offsets to JsonbContainer. The array would contain the offset of, say, every 16th element. It would be very small compared to the lengths array, but would greatly speed up random access on a large array/object.
- Heikki
Re: [HACKERS] postgresql latency bgwriter not doing its job
What are the other settings here? checkpoint_segments, checkpoint_timeout, wal_buffers? They simply are the defaults:

checkpoint_segments = 3
checkpoint_timeout = 5min
wal_buffers = -1

I did some tests with checkpoint_segments = 1, the problem is just more frequent but shorter. I also reduced wal_segsize down to 1MB, which also made it even more frequent but much shorter, so the overall result was an improvement with 5% to 3% of transactions lost instead of 10-14%, if I recall correctly. I have found no solution on this path. Could you show the output of log_checkpoints during that run? Checkpoint spreading only works halfway efficiently if all checkpoints are triggered by time and not by xlog. I do 500 second tests, so there could be at most 2 timeout-triggered checkpoints. Given the write load it takes about 2 minutes to fill the three 16 MB segments (8 kB * 50 tps (there is one page modified per transaction) * 120 s ~ 48 MB), so checkpoints are triggered by xlog. The maths are consistent with the logs (not sure which proves which, though:-):

LOG: received SIGHUP, reloading configuration files
LOG: parameter "log_checkpoints" changed to "on"
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 5713 buffers (34.9%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=51.449 s, sync=4.857 s, total=56.485 s; sync files=12, longest=2.160 s, average=0.404 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6235 buffers (38.1%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=53.500 s, sync=5.102 s, total=58.670 s; sync files=8, longest=2.689 s, average=0.637 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6250 buffers (38.1%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=53.888 s, sync=4.504 s, total=58.495 s; sync files=8, longest=2.627 s, average=0.563 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6148 buffers (37.5%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=53.313 s, sync=6.437 s, total=59.834 s; sync files=8, longest=3.680 s, average=0.804 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6240 buffers (38.1%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=149.008 s, sync=5.448 s, total=154.566 s; sync files=9, longest=3.788 s, average=0.605 s

Note that my current effective solution is to do as if checkpoint_timeout = 0.2s: it works fine if I do my own spreading. -- Fabien.
Re: [HACKERS] Verbose output of pg_dump not show schema name
On 08/26/2014 10:10 AM, Michael Paquier wrote: On Tue, Aug 26, 2014 at 3:48 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: AFAICS, the namespace can never be NULL in any of these. There is a selectSourceSchema(fout, tbinfo->dobj.namespace->dobj.name) call before or after printing the message, so if tbinfo->dobj.namespace is NULL, you'll crash anyway. Please double-check, and remove the dead code if you agree. Ah right, this field is used in many places. Even for pg_backup_archiver.c, the portion of code processing data always has the namespace set. I am sure that Fabrizio would have done that quickly, but as I was on this thread I simplified the patch as [...] Ok thanks, committed. - Heikki
Re: [HACKERS] postgresql latency bgwriter not doing its job
On 2014-08-26 10:49:31 +0200, Fabien COELHO wrote: What are the other settings here? checkpoint_segments, checkpoint_timeout, wal_buffers? They simply are the defaults: checkpoint_segments = 3 checkpoint_timeout = 5min wal_buffers = -1 I did some tests with checkpoint_segments = 1, the problem is just more frequent but shorter. I also reduced wal_segsize down to 1MB, which also made it even more frequent but much shorter, so the overall result was an improvement with 5% to 3% of transactions lost instead of 10-14%, if I recall correctly. I have found no solution on this path. Uh. I'm not surprised you're facing utterly horrible performance with this. Did you try using a *large* checkpoint_segments setting? To achieve high performance you likely will have to make checkpoint_timeout *longer* and increase checkpoint_segments until *all* checkpoints are started because of time. There's three reasons: a) if checkpoint_timeout + completion_target is large and the checkpoint isn't executed prematurely, most of the dirty data has been written out by the kernel's background flush processes. b) The amount of WAL written with less frequent checkpoints is often *significantly* lower because fewer full page writes need to be done. I've seen production reductions of *more* than a factor of 4. c) If checkpoints are infrequent enough, the penalty of them causing problems, especially if not using ext4, plays less of a role overall. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
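The tuning advice above might translate into a postgresql.conf fragment like the following. The numbers are illustrative only, not recommendations from the thread; the point is to raise both settings until log_checkpoints shows every checkpoint "starting: time" rather than "starting: xlog".

```
# illustrative values: raise until all checkpoints are time-triggered
checkpoint_segments = 64
checkpoint_timeout = 30min
checkpoint_completion_target = 0.9
log_checkpoints = on
```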
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
Erik == Erik Rijkers e...@xs4all.nl writes: Erik The patches did not apply anymore so I applied at 73eba19aebe0. Erik There they applied OK, and make && make check was OK. I'll look and rebase if need be. -- WARNING: unrecognized node type: 347 Can't reproduce this - are you sure it's not a mis-build? -- Andrew (irc:RhodiumToad)
Re: [HACKERS] Proposal for CSN based snapshots
On Fri, 2014-06-13 at 13:24 +0300, Heikki Linnakangas wrote: Attached is a new patch. It now keeps the current pg_clog unchanged, but adds a new pg_csnlog besides it. pg_csnlog is more similar to pg_subtrans than pg_clog: it's not WAL-logged, is reset at startup, and segments older than GlobalXmin can be truncated. It appears that this patch weakens the idea of hint bits. Even if HEAP_XMIN_COMMITTED is set, it still needs to find out if it's in the snapshot. That's fast if the xid is less than snap->xmin, but otherwise it needs to do a fetch from the csnlog, which is exactly what the hint bits are designed to avoid. And we can't get around this, because the whole point of this patch is to remove the xip array from the snapshot. If the transaction was committed a long time ago, then we could set PD_ALL_VISIBLE and the VM bit, and a scan wouldn't even look at the hint bit. If it was committed recently, then it's probably greater than the recentXmin. I think there's still room for a hint bit to technically be useful, but it seems quite narrow unless I'm missing something (and a narrowly-useful hint bit doesn't seem to be useful at all). I'm not complaining, and I hope this is not a showstopper for this patch, but I think it's worth discussing. Regards, Jeff Davis
Re: [HACKERS] replicating DROP commands across servers
On 2014-06-13 15:50:50 -0400, Alvaro Herrera wrote: Here's a patch implementing the proposed idea. This is used in the Bidirectional Replication stuff by Simon/Andres; it works well. I think there have been some changes to this patch since July, care to resend a new version? I think it's appropriate to mark the patch as "Waiting for Author" instead of "Ready for Committer" till then. Greetings, Andres Freund
Re: [HACKERS] pgbench throttling latency limit
Hello Rukh, I have reviewed this patch. Thanks! [...] I get: pgbench: invalid option -- L Which appears to be caused by the fact that the call to getopt_long() has not been updated to reflect the new parameter. Indeed, I only tested/used it with the --limit= syntax. Also this part: +-L, --limit=NUM under throttling (--rate), skip transactions that\n + far behind schedule in ms (default: do not skip)\n I would suggest rewording this to something like "skip transactions that are more than NUM milliseconds behind schedule (default: do not skip)". Done, with milliseconds written as "ms" to keep it short. Marking "Waiting for Author" until these small issues have been fixed. Please find attached a new version which fixes these two points. -- Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 2f7d80e..de77817 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -141,6 +141,13 @@ double sample_rate = 0.0;
 int64		throttle_delay = 0;
 
 /*
+ * When under throttling, execution time slots which are more than
+ * this late (in us) are skipped, and the corresponding transaction
+ * will be counted as somehow aborted.
+ */
+int64		throttle_latency_limit = 0;
+
+/*
  * tablespace selection
  */
 char	   *tablespace = NULL;
@@ -238,6 +245,7 @@ typedef struct
 	int64		throttle_trigger;	/* previous/next throttling (us) */
 	int64		throttle_lag;		/* total transaction lag behind throttling */
 	int64		throttle_lag_max;	/* max transaction lag */
+	int64		throttle_latency_skipped;	/* lagging transactions skipped */
 } TState;
 
 #define INVALID_THREAD ((pthread_t) 0)
@@ -250,6 +258,7 @@ typedef struct
 	int64		sqlats;
 	int64		throttle_lag;
 	int64		throttle_lag_max;
+	int64		throttle_latency_skipped;
 } TResult;
 
 /*
@@ -367,6 +376,8 @@ usage(void)
 	"  -f, --file=FILENAME      read transaction script from FILENAME\n"
 	"  -j, --jobs=NUM           number of threads (default: 1)\n"
 	"  -l, --log                write transaction times to log file\n"
+	"  -L, --limit=NUM          under throttling (--rate), skip transactions that\n"
+	"                           are more than NUM ms behind schedule (default: do not skip)\n"
 	"  -M, --protocol=simple|extended|prepared\n"
 	"                           protocol for submitting queries (default: simple)\n"
 	"  -n, --no-vacuum          do not run VACUUM before tests\n"
@@ -1016,6 +1027,24 @@ top:
 			thread->throttle_trigger += wait;
 
+			if (throttle_latency_limit)
+			{
+				instr_time	now;
+				int64		now_us;
+
+				INSTR_TIME_SET_CURRENT(now);
+				now_us = INSTR_TIME_GET_MICROSEC(now);
+				while (thread->throttle_trigger < now_us - throttle_latency_limit)
+				{
+					/* if too far behind, this slot is skipped, and we
+					 * iterate till the next nearly on time slot.
+					 */
+					int64		wait = (int64) (throttle_delay *
+						1.00055271703 * -log(getrand(thread, 1, 10000) / 10000.0));
+
+					thread->throttle_trigger += wait;
+					thread->throttle_latency_skipped++;
+				}
+			}
+
 			st->until = thread->throttle_trigger;
 			st->sleeping = 1;
 			st->throttling = true;
@@ -2351,7 +2380,8 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 			 TState *threads, int nthreads,
 			 instr_time total_time, instr_time conn_total_time,
 			 int64 total_latencies, int64 total_sqlats,
-			 int64 throttle_lag, int64 throttle_lag_max)
+			 int64 throttle_lag, int64 throttle_lag_max,
+			 int64 throttle_latency_skipped)
 {
 	double		time_include,
 				tps_include,
@@ -2417,6 +2447,10 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 		 */
 		printf("rate limit schedule lag: avg %.3f (max %.3f) ms\n",
 			   0.001 * throttle_lag / normal_xacts, 0.001 * throttle_lag_max);
+		if (throttle_latency_limit)
+			printf("number of skipped transactions: " INT64_FORMAT " (%.3f %%)\n",
+				   throttle_latency_skipped,
+				   100.0 * throttle_latency_skipped / (throttle_latency_skipped + normal_xacts));
 	}
 
 	printf("tps = %f (including connections establishing)\n", tps_include);
@@ -2505,6 +2539,7 @@ main(int argc, char **argv)
 		{"sampling-rate", required_argument, NULL, 4},
 		{"aggregate-interval", required_argument, NULL, 5},
 		{"rate", required_argument, NULL, 'R'},
+		{"limit", required_argument, NULL, 'L'},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -2534,6 +2569,7 @@ main(int argc, char **argv)
 	int64		total_sqlats = 0;
 	int64		throttle_lag = 0;
 	int64		throttle_lag_max = 0;
+	int64		throttle_latency_skipped = 0;
 
 	int			i;
 
@@ -2578,7 +2614,7 @@ main(int argc, char **argv)
 	state = (CState *) pg_malloc(sizeof(CState));
 	memset(state, 0, sizeof(CState));
 
-	while ((c = getopt_long(argc, argv, "ih:nvp:dqSNc:j:Crs:t:T:U:lf:D:F:M:P:R:", long_options, &optindex)) != -1)
+	while ((c = getopt_long(argc, argv, "ih:nvp:dqSNc:j:Crs:t:T:U:lf:D:F:M:P:R:L:", long_options, &optindex)) != -1)
 	{
 		switch (c)
 		{
@@ -2775,6 +2811,18 @@ main(int argc, char **argv)
 				throttle_delay =
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
Andrew == Andrew Gierth and...@tao11.riddles.org.uk writes: Erik == Erik Rijkers e...@xs4all.nl writes: Erik The patches did not apply anymore so I applied at 73eba19aebe0. Erik There they applied OK, and make make check was OK. Andrew I'll look and rebase if need be. They apply cleanly for me at 2bde297 whether with git apply or patch, except for the contrib one (which you don't need unless you want to run the contrib regression tests without applying the gsp-u patch). -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] What in the world is happening with castoroides and protosciurus?
On Tue, Aug 26, 2014 at 1:46 AM, Tom Lane t...@sss.pgh.pa.us wrote: For the last month or so, these two buildfarm animals (which I believe are the same physical machine) have been erratically failing with errors that reflect low-order differences in floating-point calculations. A recent example is at http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=protosciurus&dt=2014-08-25%2010%3A39%3A52 where the only regression diff is

*** /export/home/dpage/pgbuildfarm/protosciurus/HEAD/pgsql.22860/src/test/regress/expected/hash_index.out	Mon Aug 25 11:41:00 2014
--- /export/home/dpage/pgbuildfarm/protosciurus/HEAD/pgsql.22860/src/test/regress/results/hash_index.out	Mon Aug 25 11:57:26 2014
***************
*** 171,179 ****
  SELECT h.seqno AS i8096, h.random AS f1234_1234
     FROM hash_f8_heap h
     WHERE h.random = '-1234.1234'::float8;
!  i8096 | f1234_1234
! -------+------------
!   8906 | -1234.1234
  (1 row)

  UPDATE hash_f8_heap
--- 171,179 ----
  SELECT h.seqno AS i8096, h.random AS f1234_1234
     FROM hash_f8_heap h
     WHERE h.random = '-1234.1234'::float8;
!  i8096 |    f1234_1234
! -------+-------------------
!   8906 | -1234.12356777216
  (1 row)

  UPDATE hash_f8_heap

... a result that certainly makes no sense. The results are not repeatable, failing in equally odd ways in different tests on different runs. This is happening in all the back branches too, not just HEAD. Has there been a system software update on this machine a month or so ago? If not, it's hard to think anything except that the floating point hardware on this box has developed problems. There hasn't been a software update, but something happened about two months ago, and we couldn't get to the bottom of exactly what it was - essentially, castoroides started failing with "C compiler cannot create executables". It appeared that the compiler was missing from the path, however the config hadn't changed. Our working theory is that there was previously a symlink to the compiler in one of the directories in the path, that somehow got removed.
The issue was fixed by adding the actual compiler location to the path. However, that would have only affected castoroides, and not protosciurus which runs under a different environment config. I have no idea what is causing the current issue - the machine is stable software-wise, and only has private builds of dependency libraries updated periodically (which are not used for the buildfarm). If I had to hazard a guess, I'd suggest this is an early symptom of an old machine which is starting to give up. -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] postgresql latency bgwriter not doing its job
Uh. I'm not surprised you're facing utterly horrible performance with this. Did you try using a *large* checkpoint_segments setting? To achieve high performance I do not seek high performance per se, I seek lower maximum latency. I think that the current settings and parameters are designed for high throughput, but do not allow controlling the latency even with a small load. you likely will have to make checkpoint_timeout *longer* and increase checkpoint_segments until *all* checkpoints are started because of time. Well, as I want to test a *small* load in a *reasonable* time, I did not enlarge the number of segments, otherwise it would take ages. If I put checkpoint_timeout = 1min and checkpoint_completion_target = 0.9 so that the checkpoints are triggered by the timeout,

LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 4476 buffers (27.3%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=53.645 s, sync=5.127 s, total=58.927 s; sync files=12, longest=2.890 s, average=0.427 s
...

the result is basically the same (well, 18% of transactions lost, but the result does not seem to be stable from one run to the next), only there are more checkpoints. I fail to understand how multiplying both the segments and the time would solve the latency problem. If I set 30 segments then it takes 20 minutes to fill them, and if I put the timeout to 15min then I'll have to wait for 15 minutes to test. There's three reasons: a) if checkpoint_timeout + completion_target is large and the checkpoint isn't executed prematurely, most of the dirty data has been written out by the kernel's background flush processes. Why would they be written by the kernel if bgwriter has not sent them?? b) The amount of WAL written with less frequent checkpoints is often *significantly* lower because fewer full page writes need to be done. I've seen production reductions of *more* than a factor of 4.
Sure, I understand that, but ISTM that this test does not exercise this issue: the load is small, so the full page writes do not matter much. c) If checkpoints are infrequent enough, the penalty of them causing problems, especially if not using ext4, plays less of a role overall. I think that what you suggest would only delay the issue, not solve it. I'll try to run a long test. -- Fabien.
Re: [HACKERS] pgbench throttling latency limit
Marking "Waiting for Author" until these small issues have been fixed. I've put it back to "Needs review". Feel free to set it to "Ready" if it is ok for you. -- Fabien.
Re: [HACKERS] 9.5: Memory-bounded HashAgg
Summary of this thread so far: There was a lot of discussion comparing this with Tomas Vondra's Hash Join patch. The conclusion was that while it would be nice to be able to dump transition state to disk, for aggregates like array_agg, the patch is fine as it is. Dumping transition states would require much more work, and this is already useful without it. Moreover, solving the array_agg problem later won't require a rewrite; rather, it'll build on top of this. You listed a number of open items in the original post, and these are still outstanding: * costing * EXPLAIN details for disk usage * choose number of partitions intelligently * performance testing I think this is enough for this commitfest - we have consensus on the design. For the next one, please address those open items, and resubmit. - Heikki
Re: [HACKERS] Selectivity estimation for inet operators
I agree with you that we can support other join types and anti joins later. If others don't have any objection to doing the other parts later, I will mark it as Ready for Committer.

I updated the patch to cover semi and anti joins with eqjoinsel_semi(). I think it is better than returning a constant. The new version is attached, together with the new version of the test script. Can you please look at it again and mark it as ready for committer if it seems okay to you?

diff --git a/src/backend/utils/adt/network_selfuncs.c b/src/backend/utils/adt/network_selfuncs.c
index d0d806f..eca9e7c 100644
--- a/src/backend/utils/adt/network_selfuncs.c
+++ b/src/backend/utils/adt/network_selfuncs.c
@@ -1,32 +1,669 @@
 /*-------------------------------------------------------------------------
  *
  * network_selfuncs.c
  *	  Functions for selectivity estimation of inet/cidr operators
  *
- * Currently these are just stubs, but we hope to do better soon.
+ * Estimates are based on null fraction, distinct value count, most common
+ * values, and histogram of inet/cidr datatypes.
  *
  * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  *
  * IDENTIFICATION
  *	  src/backend/utils/adt/network_selfuncs.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
+#include <math.h>
+
+#include "access/htup_details.h"
+#include "catalog/pg_collation.h"
+#include "catalog/pg_operator.h"
+#include "catalog/pg_statistic.h"
+#include "utils/lsyscache.h"
 #include "utils/inet.h"
+#include "utils/selfuncs.h"
+
+/* Default selectivity constant for the inet overlap operator */
+#define DEFAULT_OVERLAP_SEL 0.01
+
+/* Default selectivity constant for the other operators */
+#define DEFAULT_INCLUSION_SEL 0.005
+
+/* Default selectivity for given operator */
+#define DEFAULT_SEL(operator) \
+	((operator) == OID_INET_OVERLAP_OP ? \
+	 DEFAULT_OVERLAP_SEL : DEFAULT_INCLUSION_SEL)
+
+static Selectivity networkjoinsel_inner(Oid operator,
+				VariableStatData *vardata1, VariableStatData *vardata2);
+extern double eqjoinsel_semi(Oid operator, VariableStatData *vardata1,
+				VariableStatData *vardata2, RelOptInfo *inner_rel);
+extern RelOptInfo *find_join_input_rel(PlannerInfo *root, Relids relids);
+static short int inet_opr_order(Oid operator);
+static Selectivity inet_his_inclusion_selec(Datum *values, int nvalues,
+				Datum *constvalue, short int opr_order);
+static Selectivity inet_mcv_join_selec(Datum *values1, float4 *numbers1,
+				int nvalues1, Datum *values2, float4 *numbers2,
+				int nvalues2, Oid operator);
+static Selectivity inet_mcv_his_selec(Datum *mcv_values, float4 *mcv_numbers,
+				int mcv_nvalues, Datum *his_values, int his_nvalues,
+				int red_nvalues, short int opr_order,
+				Selectivity *max_selec_pointer);
+static Selectivity inet_his_inclusion_join_selec(Datum *his1_values,
+				int his1_nvalues, Datum *his2_values, int his2_nvalues,
+				int red_nvalues, short int opr_order);
+static short int inet_inclusion_cmp(inet *left, inet *right,
+				short int opr_order);
+static short int inet_masklen_inclusion_cmp(inet *left, inet *right,
+				short int opr_order);
+static short int inet_his_match_divider(inet *boundary, inet *query,
+				short int opr_order);
+
+/*
+ * Selectivity estimation for the subnet inclusion operators
+ */
 Datum
 networksel(PG_FUNCTION_ARGS)
 {
-	PG_RETURN_FLOAT8(0.001);
+	PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
+	Oid			operator = PG_GETARG_OID(1);
+	List	   *args = (List *) PG_GETARG_POINTER(2);
+	int			varRelid = PG_GETARG_INT32(3),
+				his_nvalues;
+	VariableStatData vardata;
+	Node	   *other;
+	bool		varonleft;
+	Selectivity selec,
+				max_mcv_selec;
+	Datum		constvalue,
+			   *his_values;
+	Form_pg_statistic stats;
+	double		nullfrac;
+	FmgrInfo	proc;
+
+	/*
+	 * If expression is not (variable op something) or (something op
+	 * variable), then punt and return a default estimate.
+	 */
+	if (!get_restriction_variable(root, args, varRelid,
+								  &vardata, &other, &varonleft))
+		PG_RETURN_FLOAT8(DEFAULT_SEL(operator));
+
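To illustrate the kind of computation the new estimator performs, the MCV-versus-MCV contribution to join selectivity can be sketched as follows. This is a generic illustration with made-up names, not the patch's C code: the patch works on pg_statistic MCV slots and inet comparison internals, but the underlying idea is the same.

```python
def mcv_join_selectivity(mcv1, mcv2, op):
    """Join-selectivity contribution of two most-common-value lists: the
    probability mass of value pairs for which the operator matches.
    mcv1/mcv2 are lists of (value, frequency); op is the join operator
    expressed as a Python predicate."""
    return sum(f1 * f2
               for v1, f1 in mcv1
               for v2, f2 in mcv2
               if op(v1, v2))

# Toy statistics for two inet columns (values and frequencies are made up).
mcv_a = [("10.0.0.0/8", 0.5), ("192.168.0.0/16", 0.2)]
mcv_b = [("10.0.0.0/8", 0.4), ("172.16.0.0/12", 0.1)]

# With plain equality as the operator, only the 10.0.0.0/8 pair matches.
sel = mcv_join_selectivity(mcv_a, mcv_b, lambda a, b: a == b)  # 0.5 * 0.4
```

The histogram-versus-histogram and MCV-versus-histogram parts of the patch follow the same "sum matching probability mass" shape, with inclusion comparisons instead of equality.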
Re: [HACKERS] postgresql latency bgwriter not doing its job
On 2014-08-26 11:34:36 +0200, Fabien COELHO wrote:

Uh. I'm not surprised you're facing utterly horrible performance with this. Did you try using a *large* checkpoint_segments setting to achieve high performance?

I do not seek high performance per se, I seek lower maximum latency.

So?

I think that the current settings and parameters are designed for high throughput, but do not allow controlling latency even with a small load.

The way you're setting them is tuned for 'basically no write activity'. You will likely have to make checkpoint_timeout *longer* and increase checkpoint_segments until *all* checkpoints are started because of time.

Well, as I want to test a *small* load in a *reasonable* time, I did not enlarge the number of segments, otherwise it would take ages.

Well, that way you're testing something basically meaningless. That's not helpful either.

If I put checkpoint_timeout = 1min and checkpoint_completion_target = 0.9 so that the checkpoints are triggered by the timeout:

LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 4476 buffers (27.3%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=53.645 s, sync=5.127 s, total=58.927 s; sync files=12, longest=2.890 s, average=0.427 s
...

The result is basically the same (well, 18% of transactions lost, but the results do not seem to be stable from one run to the next), only there are more checkpoints.

With these settings you're fsyncing the entire data directory once a minute. Nearly entirely from the OS's buffer cache, because the OS's writeback logic didn't have time to kick in.

I fail to understand how multiplying both the segments and the time would solve the latency problem. If I set 30 segments then it takes 20 minutes to fill them, and if I put the timeout at 15min then I'll have to wait 15 minutes to test.

a) The kernel's writeback logic only kicks in with delay.
b) The amount of writes you're doing with short checkpoint intervals is overall significantly higher than with longer intervals. That obviously has an impact on latency as well as throughput.

c) The time it takes for segments to be filled is mostly irrelevant. The phase that's very likely causing trouble for you is the fsyncs issued at the end of a checkpoint. There are three reasons: a) if checkpoint_timeout + completion_target is large and the checkpoint isn't executed prematurely, most of the dirty data has been written out by the kernel's background flush processes.

Why would they be written by the kernel if bgwriter has not sent them??

I think you're misunderstanding how spread checkpoints work. When the checkpointer process starts a spread checkpoint it first writes all buffers to the kernel in a paced manner. That pace is determined by checkpoint_completion_target and checkpoint_timeout. Once all buffers that are old enough to need to be checkpointed have been written out, the checkpointer fsync()s all the on-disk files. That part is *NOT* paced. Then it can go on to remove old WAL files.

The latency problem is almost certainly created by the fsync()s mentioned above. When they're executed, the kernel starts flushing out a lot of dirty buffers at once - creating very deep IO queues which make it take long to process synchronous additions (WAL flushes, reads) to that queue.

c) If checkpoints are infrequent enough, the penalty of them causing problems, especially if not using ext4, plays less of a role overall.

I think that what you suggest would only delay the issue, not solve it.

The amount of dirty data that needs to be flushed is essentially bounded. If you have stalls of roughly the same magnitude (say, within a factor of two), the smaller one happens once a minute, the larger once an hour. Obviously the once-an-hour one will give better latency for many, many more transactions.
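The pacing described above can be put into rough numbers (this is a simplified model with names of my own, not the actual checkpointer code):

```python
def spread_checkpoint_rate(dirty_buffers, checkpoint_timeout_s, completion_target):
    """Rough model of spread-checkpoint pacing: buffer writes are spread over
    checkpoint_timeout * checkpoint_completion_target seconds.  The fsync()
    phase at the end is NOT paced, which is where the latency spike comes
    from."""
    write_window_s = checkpoint_timeout_s * completion_target
    return dirty_buffers / write_window_s   # buffers handed to the kernel per second

# Using the numbers from the checkpoint log quoted earlier in the thread:
# 4476 dirty 8 kB buffers, checkpoint_timeout = 1min, completion_target = 0.9.
rate = spread_checkpoint_rate(4476, 60, 0.9)
```

That comes out to roughly 83 buffers/s, i.e. about 0.65 MiB/s of paced writes - trivially low, which supports the point that the pain comes from the unpaced fsync phase rather than from the paced write phase.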
Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: Compute attr_needed for child relations (was Re: [HACKERS] inherit support for foreign tables)
Hi Fujita-san,

I reviewed the v4 patch, and here are some comments from me.

* After applying this patch, pull_varattnos() should not be called in unnecessary places. For developers who want the list of columns-to-be-processed for some purpose, it would be nice to mention when they should use pull_varattnos() and when they should not, maybe in the comments of the pull_varattnos() implementation.

* deparseReturningList() and postgresGetForeignRelSize() in contrib/postgres_fdw/ also call pull_varattnos() to determine which columns should be in the SELECT clause of the remote query. Shouldn't these be replaced in the same manner? Other pull_varattnos() calls are for restrictions, so IIUC they can't be replaced.

* Through this review I thought up that a lazy evaluation approach might fit attr_needed. I mean that we leave attr_needed for child relations empty, and fill it at the first request for it. This would avoid useless computation of attr_needed for child relations, though this idea has been considered but thrown away before...

2014-08-20 18:55 GMT+09:00 Etsuro Fujita fujita.ets...@lab.ntt.co.jp:

Hi Ashutosh,

(2014/08/14 22:30), Ashutosh Bapat wrote: On Thu, Aug 14, 2014 at 10:05 AM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: (2014/08/08 18:51), Etsuro Fujita wrote: (2014/06/30 22:48), Tom Lane wrote:

I wonder whether it isn't time to change that. It was coded like that originally only because calculating the values would've been a waste of cycles at the time. But this is at least the third place where it'd be useful to have attr_needed for child rels.

I've revised the patch. There was a problem with the previous patch, which will be described below. Attached is the updated version of the patch addressing that.

Here are some more comments

Thank you for the further review!
attr_needed also has the attributes used in the restriction clauses, as seen in distribute_qual_to_rels(), so it looks unnecessary to call pull_varattnos() on the clauses in baserestrictinfo in the functions check_selective_binary_conversion(), remove_unused_subquery_outputs(), and check_index_only().

IIUC, I think it's *necessary* to do that in those functions, since the attributes used in the restriction clauses in baserestrictinfo are not added to attr_needed, due to the following code in distribute_qual_to_rels:

	/*
	 * If it's a join clause (either naturally, or because delayed by
	 * outer-join rules), add vars used in the clause to targetlists of their
	 * relations, so that they will be emitted by the plan nodes that scan
	 * those relations (else they won't be available at the join node!).
	 *
	 * Note: if the clause gets absorbed into an EquivalenceClass then this
	 * may be unnecessary, but for now we have to do it to cover the case
	 * where the EC becomes ec_broken and we end up reinserting the original
	 * clauses into the plan.
	 */
	if (bms_membership(relids) == BMS_MULTIPLE)
	{
		List	   *vars = pull_var_clause(clause,
										   PVC_RECURSE_AGGREGATES,
										   PVC_INCLUDE_PLACEHOLDERS);

		add_vars_to_targetlist(root, vars, relids, false);
		list_free(vars);
	}

Although in the case of RTE_RELATION the 0th entry in attr_needed corresponds to FirstLowInvalidHeapAttributeNumber + 1, it's always safer to use RelOptInfo::min_attr, in case get_relation_info() wants to change that assumption or somewhere down the line some other part of the code wants to change the attr_needed information. It may be unlikely that it would change, but it will be better to stick to the comments in RelOptInfo:

	AttrNumber	min_attr;		/* smallest attrno of rel (often 0) */
	AttrNumber	max_attr;		/* largest attrno of rel */
	Relids	   *attr_needed;	/* array indexed [min_attr .. max_attr] */

Good point! Attached is the revised version of the patch.
Thanks,

Best regards,
Etsuro Fujita

--
Shigeru HANADA
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
On Tue, August 26, 2014 11:13, Andrew Gierth wrote:

"Andrew" == Andrew Gierth and...@tao11.riddles.org.uk writes:
"Erik" == Erik Rijkers e...@xs4all.nl writes:

 Erik> The patches did not apply anymore so I applied at 73eba19aebe0.
 Erik> There they applied OK, and make && make check was OK.

 Andrew> I'll look and rebase if need be.

They apply cleanly for me at 2bde297 whether with git apply or patch, except for the contrib one (which you don't need unless you want to run the contrib regression tests without applying the gsp-u patch).

Ah, I had not realised that. Excluding that contrib patch and only applying these three:

gsp1.patch
gsp2.patch
gsp-doc.patch

does indeed work (applies, compiles).

Thank you,

Erik Rijkers
Re: [HACKERS] Proposal for CSN based snapshots
On 08/26/2014 12:03 PM, Jeff Davis wrote:

On Fri, 2014-06-13 at 13:24 +0300, Heikki Linnakangas wrote: Attached is a new patch. It now keeps the current pg_clog unchanged, but adds a new pg_csnlog besides it. pg_csnlog is more similar to pg_subtrans than pg_clog: it's not WAL-logged, is reset at startup, and segments older than GlobalXmin can be truncated.

It appears that this patch weakens the idea of hint bits. Even if HEAP_XMIN_COMMITTED is set, it still needs to find out if the xid is in the snapshot. That's fast if the xid is less than snap->xmin, but otherwise it needs to do a fetch from the csnlog, which is exactly what the hint bits are designed to avoid. And we can't get around this, because the whole point of this patch is to remove the xip array from the snapshot.

Yeah. This patch in its current state is likely much, much slower than unpatched master, except in extreme cases where you have thousands of connections and short transactions, so that without the patch you spend most of the time acquiring snapshots.

My current thinking is that access to the csnlog will need to be made faster. Currently it's just another SLRU, but I'm sure we can do better than that. Or add a backend-private cache on top of it; that might already alleviate it enough...

- Heikki
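The visibility check being discussed can be sketched conceptually like this (hypothetical names and a plain dict standing in for the csnlog; this is not the patch's code, just the fast-path/slow-path structure Jeff describes):

```python
def committed_before_snapshot(xid, snapshot_xmin, snapshot_csn, csnlog):
    """Fast path: any xid below the snapshot's xmin committed before the
    snapshot was taken, so no lookup is needed.  Slow path: consult the
    csnlog -- the extra fetch that hint bits alone can no longer avoid,
    since the snapshot carries no xip array to test membership against."""
    if xid < snapshot_xmin:
        return True
    commit_csn = csnlog.get(xid)        # None => not committed (or unknown)
    return commit_csn is not None and commit_csn <= snapshot_csn

# Toy csnlog: xid -> commit sequence number.
csnlog = {100: 50, 101: 120}
```

The SLRU/caching discussion above is about making that `csnlog.get(xid)` slow path cheap enough to tolerate.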
Re: [HACKERS] replication commands and log_statements
On Wed, Aug 20, 2014 at 1:14 PM, Michael Paquier michael.paqu...@gmail.com wrote:

On Wed, Aug 20, 2014 at 2:06 AM, Robert Haas robertmh...@gmail.com wrote: On Sat, Aug 16, 2014 at 10:27 AM, Amit Kapila amit.kapil...@gmail.com wrote: I think ideally it would have been better if we could have logged replication commands under a separate log level, but as there is still no consensus on extending log_statement and nobody is even willing to pursue it, it seems okay to go ahead and log these under the 'all' level.

I think the consensus is clearly for a separate GUC.

+1.

Okay. Attached is the updated version of the patch which I posted before. This patch follows the consensus and adds a separate parameter, log_replication_command.

Regards,

--
Fujii Masao

*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 4558,4563 ****
--- 4558,4579 ----
       </listitem>
      </varlistentry>
  
+     <varlistentry id="guc-log-replication-command" xreflabel="log_replication_command">
+      <term><varname>log_replication_command</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>log_replication_command</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Causes each replication command to be logged in the server log.
+        See <xref linkend="protocol-replication"> for more information about
+        replication command. The default value is <literal>off</>.
+        Only superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+ 
      <varlistentry id="guc-log-temp-files" xreflabel="log_temp_files">
       <term><varname>log_temp_files</varname> (<type>integer</type>)
       <indexterm>

*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 108,113 ****
--- 108,114 ----
  bool		am_db_walsender = false;	/* Connected to a database? */
  int			max_wal_senders = 0;	/* the maximum number of concurrent walsenders */
  int			wal_sender_timeout = 60 * 1000;	/* maximum time to send one
  											 * WAL data message */
+ bool		log_replication_command = false;
  
  /*
   * State for WalSndWakeupRequest
***************
*** 1268,1280 ****
  exec_replication_command(const char *cmd_string)
  	MemoryContext old_context;
  
  	/*
  	 * CREATE_REPLICATION_SLOT ... LOGICAL exports a snapshot until the next
  	 * command arrives. Clean up the old stuff if there's anything.
  	 */
  	SnapBuildClearExportedSnapshot();
  
- 	elog(DEBUG1, "received replication command: %s", cmd_string);
- 
  	CHECK_FOR_INTERRUPTS();
  
  	cmd_context = AllocSetContextCreate(CurrentMemoryContext,
--- 1269,1287 ----
  	MemoryContext old_context;
  
  	/*
+ 	 * Log replication command if log_replication_command is enabled.
+ 	 * Even when it's disabled, log the command with DEBUG1 level for
+ 	 * backward compatibility.
+ 	 */
+ 	ereport(log_replication_command ? LOG : DEBUG1,
+ 			(errmsg("received replication command: %s", cmd_string)));
+ 
+ 	/*
  	 * CREATE_REPLICATION_SLOT ... LOGICAL exports a snapshot until the next
  	 * command arrives. Clean up the old stuff if there's anything.
  	 */
  	SnapBuildClearExportedSnapshot();
  
  	CHECK_FOR_INTERRUPTS();
  
  	cmd_context = AllocSetContextCreate(CurrentMemoryContext,

*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 925,930 ****
--- 925,939 ----
  		NULL, NULL, NULL
  	},
  	{
+ 		{"log_replication_command", PGC_SUSET, LOGGING_WHAT,
+ 			gettext_noop("Logs each replication command."),
+ 			NULL
+ 		},
+ 		&log_replication_command,
+ 		false,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"debug_assertions", PGC_INTERNAL, PRESET_OPTIONS,
  			gettext_noop("Shows whether the running server has assertion checks enabled."),
  			NULL,

*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 431,436 ****
--- 431,437 ----
  					# e.g. '<%u%%%d> '
  #log_lock_waits = off			# log lock waits >= deadlock_timeout
  #log_statement = 'none'			# none, ddl, mod, all
+ #log_replication_command = off
  #log_temp_files = -1			# log temporary files equal or larger
  					# than the specified size in kilobytes;
  					# -1 disables, 0 logs all temp files

*** a/src/include/replication/walsender.h
--- b/src/include/replication/walsender.h
***************
*** 25,30 ****
--- 25,31 ----
  extern bool wake_wal_senders;
  
  /* user-settable parameters */
  extern int	max_wal_senders;
  extern int	wal_sender_timeout;
+ extern bool log_replication_command;
  
  extern void InitWalSender(void);
  extern void exec_replication_command(const char *query_string);
Re: [Fwd: Re: [HACKERS] proposal: new long psql parameter --on-error-stop]
On 2014-06-29 18:54:50 +0530, Abhijit Menon-Sen wrote:

At 2014-06-29 20:35:04 +0900, maumau...@gmail.com wrote: Thanks, I marked it as ready for committer. I hope Fujii-san or another committer will commit this, refining the English expression if necessary.

Since it was just a matter of editing, I went through the patch and corrected various minor errors (typos, awkwardness, etc.). I agree that this is now ready for committer. I've attached the updated diff.

I'm looking at committing this, but I wonder: shouldn't this be accessible from inside psql as well? I.e. as a backslash command?

Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Switch pg_basebackup to use -X stream instead of -X fetch by default?
On 25/08/14 14:35, Andres Freund wrote:

currently pg_basebackup uses fetch mode when only -x is specified - which imo isn't a very good thing to use due to the increased risk of not fetching everything. How about switching to stream mode for 9.5+?

+1. I was just wondering why it's not the default a few days ago.

/ Oskari
Re: [HACKERS] Verbose output of pg_dump not show schema name
On Tue, Aug 26, 2014 at 4:10 AM, Michael Paquier michael.paqu...@gmail.com wrote:

On Tue, Aug 26, 2014 at 3:48 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: AFAICS, the namespace can never be NULL in any of these. There is a selectSourceSchema(fout, tbinfo->dobj.namespace->dobj.name) call before or after printing the message, so if tbinfo->dobj.namespace is NULL, you'll crash anyway. Please double-check, and remove the dead code if you agree.

Ah right, this field is used in many places. Even for pg_backup_archiver.c, the portion of code processing data always has the namespace set. I am sure that Fabrizio would have done that quickly, but as I was on this thread I simplified the patch as attached.

Thanks Michael... last night I was working on a refactoring in tablecmds.c.

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
Timbira: http://www.timbira.com.br
Blog: http://fabriziomello.github.io
Linkedin: http://br.linkedin.com/in/fabriziomello
Twitter: http://twitter.com/fabriziomello
Github: http://github.com/fabriziomello
Re: [Fwd: Re: [HACKERS] proposal: new long psql parameter --on-error-stop]
On 26/08/14 13:20, Andres Freund wrote:

I'm looking at committing this, but I wonder: Shouldn't this be accessible from inside psql as well? I.e. as a backslash command?

+1

--
Petr Jelinek	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [Fwd: Re: [HACKERS] proposal: new long psql parameter --on-error-stop]
2014-08-26 13:30 GMT+02:00 Petr Jelinek p...@2ndquadrant.com:

On 26/08/14 13:20, Andres Freund wrote: I'm looking at committing this, but I wonder: Shouldn't this be accessible from inside psql as well? I.e. as a backslash command?

+1

Do you have an idea for the command name? \?+

Pavel

--
Petr Jelinek	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [Fwd: Re: [HACKERS] proposal: new long psql parameter --on-error-stop]
On 2014-08-26 13:44:16 +0200, Pavel Stehule wrote:

2014-08-26 13:30 GMT+02:00 Petr Jelinek p...@2ndquadrant.com: On 26/08/14 13:20, Andres Freund wrote: I'm looking at committing this, but I wonder: Shouldn't this be accessible from inside psql as well? I.e. as a backslash command?

+1

Do you have an idea for the command name? \?+

Some ideas:

\hv
\help-variables
\? set
\? variables

Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Tue, Aug 19, 2014 at 6:37 PM, Rahila Syed rahilasye...@gmail.com wrote:

Hello,

Thank you for the comments.

Could you tell me where the patch for "single block in one run" is?

Please find attached the patch for single-block compression in one run.

Thanks! I ran the benchmark using pgbench and compared the results. I'd like to share them.

[RESULT]

Amount of WAL generated during the benchmark. Unit is MB.

          Multiple    Single
off       202.0       201.5
on        6051.0      6053.0
pglz      3543.0      3567.0
lz4       3344.0      3485.0
snappy    3354.0      3449.5

Latency average during the benchmark. Unit is ms.

          Multiple    Single
off       19.1        19.0
on        55.3        57.3
pglz      45.0        45.9
lz4       44.2        44.7
snappy    43.4        43.3

These results show that FPW compression is really helpful for decreasing the WAL volume and improving the performance. The compression ratio with lz4 or snappy is better than with pglz, but it's difficult to conclude from these results which of lz4 or snappy is best. ISTM that the compression-of-multiple-pages-at-a-time approach can compress WAL more than compression of a single page does.

[HOW TO BENCHMARK]

Create the pgbench database with scale factor 1000. Change the data type of the column "filler" on each pgbench table from CHAR(n) to TEXT, and fill the data with the result of pgcrypto's gen_random_uuid() in order to avoid empty columns, e.g.,

alter table pgbench_accounts alter column filler type text using gen_random_uuid()::text

After creating the test database, run pgbench as follows. The number of transactions executed during the benchmark is almost the same across runs because the -R option is used.

pgbench -c 64 -j 64 -r -R 400 -T 900 -M prepared

checkpoint_timeout is 5min, so it's expected that a checkpoint was executed at least twice during the benchmark.

Regards,

--
Fujii Masao
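To put the WAL-volume table into relative terms, the savings versus uncompressed full-page writes work out as follows (a quick calculation over the multiple-blocks column of the numbers above):

```python
# WAL volumes (MB) from the "Multiple" column of the benchmark table above.
wal_mb = {"on": 6051.0, "pglz": 3543.0, "lz4": 3344.0, "snappy": 3354.0}

def wal_saving_pct(method):
    """Percent of WAL volume saved relative to plain full_page_writes = on
    (no compression)."""
    return 100.0 * (1.0 - wal_mb[method] / wal_mb["on"])
```

That gives roughly 41.4% savings for pglz, 44.7% for lz4, and 44.6% for snappy, which matches the observation that lz4 and snappy are close to each other and both ahead of pglz.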
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
"Erik" == Erik Rijkers e...@xs4all.nl writes:

They apply cleanly for me at 2bde297 whether with git apply or patch, except for the contrib one (which you don't need unless you want to run the contrib regression tests without applying the gsp-u patch).

 Erik> Ah, I had not realised that. Excluding that contrib patch and
 Erik> only applying these three:

 Erik> gsp1.patch
 Erik> gsp2.patch
 Erik> gsp-doc.patch

 Erik> does indeed work (applies, compiles).

I put up a rebased contrib patch anyway (linked off the CF). Did the "unrecognized node type" error go away, or do we still need to look into that?

--
Andrew (irc:RhodiumToad)
[HACKERS] a7ae1dc has broken the windows builds
I've attached a small patch which should get the windows builds up and running again. They're currently failing with this:

c:\prog\bf\root\HEAD\pgsql.build\pgsql.sln (default target) (1) ->
c:\prog\bf\root\HEAD\pgsql.build\pg_upgrade_support.vcxproj (default target) (67) ->
(Link target) ->
  pg_upgrade_support.obj : error LNK2001: unresolved external symbol IsBinaryUpgrade [c:\prog\bf\root\HEAD\pgsql.build\pg_upgrade_support.vcxproj]
  .\Release\pg_upgrade_support\pg_upgrade_support.dll : fatal error LNK1120: 1 unresolved externals [c:\prog\bf\root\HEAD\pgsql.build\pg_upgrade_support.vcxproj]

    0 Warning(s)
    2 Error(s)

Regards

David Rowley

IsBinaryUpgrade_fix.patch
Description: Binary data
Re: [HACKERS] Patch to support SEMI and ANTI join removal
On Tue, Aug 5, 2014 at 10:35 PM, David Rowley dgrowle...@gmail.com wrote:

Currently most of my changes are in analyzejoins.c, but I did also have to make changes to load the foreign key constraints so that they were available to the planner. One thing that is currently lacking, which would likely be needed before the patch is finished, would be a relhasfkeys column in pg_class. Such a column would mean that it would be possible to skip scanning pg_constraint for foreign keys when there are none to find. I'll delay implementing that until I get a bit more feedback on whether this patch would be a welcome addition to the existing join removal code or not.

I've modified this patch to include a new relhasfkey column in pg_class, and to only attempt to load the foreign keys in get_relation_info() if the pg_class flag is true.

Currently what I'm not quite sure about is the best place to clear this flag. I see that relhaspkey is cleared during vacuum, but only if there are no indexes at all on the relation. It seems that it will remain set to true after vacuum if the primary key is dropped and there are still other indexes on the relation. My guess here is that this is done so that pg_constraint does not have to be checked to see if a PK exists, which is why I'm not sure if this would be the correct place for me to do the same in order to determine if there are any FKs on the relation. Though I can't quite think where else I might clear this flag.

Any ideas or feedback on this would be welcome.

Regards

David Rowley

semianti_join_removal_3ab5ccb_2014-08-27.patch.gz
Description: GNU Zip compressed data
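To illustrate the removal condition the patch is after (this is a deliberately simplified model with hypothetical names, not the planner code): a semi-join against a table is redundant when a foreign key from the outer relation references it on exactly the join columns, because the FK then guarantees every outer row a match.

```python
from collections import namedtuple

# Simplified stand-in for catalog FK metadata -- not PostgreSQL's structures.
ForeignKey = namedtuple("ForeignKey", "con_rel con_cols ref_rel ref_cols")

def semijoin_removable(fkeys, outer_rel, inner_rel, join_cols):
    """A semi-join filtering outer_rel by matches in inner_rel is a no-op
    when an FK from outer_rel to inner_rel covers exactly the join columns:
    every outer row is guaranteed at least one inner match, so the join
    eliminates nothing.  join_cols is a list of (outer_col, inner_col)."""
    return any(fk.con_rel == outer_rel and
               fk.ref_rel == inner_rel and
               set(join_cols) == set(zip(fk.con_cols, fk.ref_cols))
               for fk in fkeys)

fks = [ForeignKey("orders", ("customer_id",), "customers", ("id",))]
```

A real implementation has to be more careful than this sketch - notably, FK columns that allow NULLs break the "every row has a match" guarantee, and the constraint must be validated - which is part of why the patch needs planner access to the full FK catalog data rather than just a boolean.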
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
On Tue, August 26, 2014 14:24, Andrew Gierth wrote:

"Erik" == Erik Rijkers e...@xs4all.nl writes:

They apply cleanly for me at 2bde297 whether with git apply or patch, except for the contrib one (which you don't need unless you want to run the contrib regression tests without applying the gsp-u patch).

 Erik> Ah, I had not realised that. Excluding that contrib patch and
 Erik> only applying these three:

 Erik> gsp1.patch
 Erik> gsp2.patch
 Erik> gsp-doc.patch

 Erik> does indeed work (applies, compiles).

I put up a rebased contrib patch anyway (linked off the CF). Did the unrecognized node type error go away, or do we still need to look into that?

Yes, it did go away; it looks fine now:

select brand
     , size
     , grouping(brand, size)
     , sum(sales)
from items_sold
group by rollup(brand, size);

 brand | size | grouping | sum
-------+------+----------+-----
 Bar   | L    |        0 |   5
 Bar   | M    |        0 |  15
 Bar   |      |        1 |  20
 Foo   | L    |        0 |  10
 Foo   | M    |        0 |  20
 Foo   |      |        1 |  30
       |      |        3 |  50
(7 rows)

I'm a bit unclear why the bottom-row 'grouping' value is 3. Shouldn't that be 2? But I'm still reading the documentation, so it's perhaps too early to ask...

Thanks,

Erik Rijkers
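For reference on the "why 3?" question: GROUPING() returns a bitmask with one bit per argument, leftmost argument in the most significant bit, where a bit is set when that column is rolled up (not part of the current grouping set). A quick model of that semantics:

```python
def grouping(*rolled_up):
    """Bitmask semantics of SQL GROUPING(): the leftmost argument maps to the
    most significant bit; a bit is 1 when that column is aggregated away in
    the current row."""
    result = 0
    for flag in rolled_up:
        result = (result << 1) | int(bool(flag))
    return result

# Grand-total row of the rollup above: both brand and size are rolled up.
total = grouping(True, True)    # binary 11 -> 3
# Per-brand subtotal: only size is rolled up.
subtotal = grouping(False, True)  # binary 01 -> 1
```

So the grand-total row reads 3 (binary 11, both columns aggregated), not 2, which matches the query output above.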
Re: [HACKERS] Patch to add support of IF NOT EXISTS to others CREATE statements
On 04/14/2014 10:31 PM, Fabrízio de Royes Mello wrote:

On Tue, Apr 1, 2014 at 2:46 PM, Robert Haas robertmh...@gmail.com wrote: Where this is a bit more interesting is in the case of sequences, where resetting the sequence to zero may cause further inserts into an existing table to fail.

Yeah. Sequences do have contained data, which makes COR harder to define --- that's part of the reason why we have CINE not COR for tables, and maybe we have to do the same for sequences. The point being exactly that if you use CINE, you're implicitly accepting that you don't know the ensuing state fully.

Yeah. I think CINE is more sensible than COR for sequences, for precisely the reason that they do have contained data (even if it's basically only one value).

The attached patch contains CINE for sequences. I just stripped this code from the previously rejected patch.

Committed with minor changes:

* The documentation promised too much. It said that it would not throw an error if a sequence with the same name exists. In fact, it will not throw an error if any relation with the same name exists. I rewrote that paragraph to emphasize that more, re-using the phrases from the CREATE TABLE manual page.

* Don't call RangeVarGetAndCheckCreationNamespace unnecessarily when IF NOT EXISTS is not used.

- Heikki
Re: [HACKERS] add line number as prompt option to psql
On 2014-08-21 11:43:48 +0900, Sawada Masahiko wrote:

On Wed, Aug 20, 2014 at 9:00 PM, Jeevan Chalke jeevan.cha...@enterprisedb.com wrote:

Hi, I have reviewed this:

I initialized cur_lineno to UINTMAX - 2 and then observed the following behaviour to check the wrap-around:

postgres=# \set PROMPT1 '%/[%l]%R%# '
postgres[18446744073709551613]=# \set PROMPT2 '%/[%l]%R%# '
postgres[18446744073709551613]=# select
postgres[18446744073709551614]-# a
postgres[18446744073709551615]-# ,
postgres[0]-# b
postgres[1]-# from
postgres[2]-# dual;

It wraps to 0, whereas line numbers always start at 1. Any issues? I would like to ignore this, as UINTMAX lines are too many for an input buffer to hold; there is almost no chance of hitting this. However, I think you need to use UINT64_FORMAT while printing a uint64 number. Currently it is %u, which wraps around at UINT_MAX. See how pset.lineno is displayed.

Apart from this, I didn't see any issues in my testing. The patch applies cleanly; make / make install / initdb / make check all work well.

Thank you for reviewing the patch! Attached is the latest version of the patch. I modified the output format of cur_lineno.

I like the feature - and I wanted to commit it, but enough stuff turned up that I needed to fix that it warrants some new testing. Stuff I've changed:

* removed include of limits.h - that probably was a remnant from a previous version
* removed a trailing whitespace
* expanded the documentation about %l. "The current line number" isn't very clear. Of a file? Of all lines ever entered in psql? It's now "The line number inside the current statement, starting from <literal>1</>."
* Correspondingly I've changed the variable's name to stmt_lineno.
* COPY FROM ... STDIN/PROMPT3 was broken because a) the prompt was only generated once b) the lineno wasn't incremented.
* CTRL-C didn't reset the line number.
* Unfortunately I've noticed here that the prompting is broken in some common cases:

postgres[1]=# SELECT 1,
postgres[2]-# '2
postgres[2]'# 2b
postgres[2]'# 2c
postgres[2]'# 2d',
postgres[3]-# 3;
┌──────────┬──────────┬──────────┐
│ ?column? │ ?column? │ ?column? │
├──────────┼──────────┼──────────┤
│        1 │ 2       ↵│        3 │
│          │ 2b      ↵│          │
│          │ 2c      ↵│          │
│          │ 2d       │          │
└──────────┴──────────┴──────────┘
(1 row)

postgres[1]=# SELECT 1, '2
2b
2c
2d', 3
postgres[7]-#

That's rather inconsistent...

I've attached my version of the patch. Note that I've got rid of all the PSCAN_INCOMPLETE checks... Please test!

Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From 476d799d74f2ea8eefc3480f176b3726c35cf425 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Tue, 26 Aug 2014 13:35:49 +0200
Subject: [PATCH] Add psql PROMPT variable showing which line of a statement is
 being edited.

The new %l substitution shows the line number inside a (potentially
multi-line) statement starting from one.

Author: Sawada Masahiko, editorialized by me.
Reviewed-By: Jeevan Chalke, Alvaro Herrera
---
 doc/src/sgml/ref/psql-ref.sgml |  9 +++++++++
 src/bin/psql/copy.c            | 15 +++++++--------
 src/bin/psql/mainloop.c        | 17 +++++++++++++++++
 src/bin/psql/prompt.c          |  5 +++++
 src/bin/psql/settings.h        |  1 +
 5 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 74d4618..db314c3 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -3316,6 +3316,15 @@ testdb=&gt; <userinput>INSERT INTO my_table VALUES (:'content');</userinput>
       </varlistentry>
 
       <varlistentry>
+        <term><literal>%l</literal></term>
+        <listitem>
+          <para>
+            The line number inside the current statement, starting from <literal>1</>.
+          </para>
+        </listitem>
+      </varlistentry>
+
+      <varlistentry>
         <term><literal>%</literal><replaceable class="parameter">digits</replaceable></term>
         <listitem>
           <para>
diff --git a/src/bin/psql/copy.c b/src/bin/psql/copy.c
index c759abf..6908742 100644
--- a/src/bin/psql/copy.c
+++ b/src/bin/psql/copy.c
@@ -517,8 +517,8 @@ bool
 handleCopyIn(PGconn *conn, FILE *copystream, bool isbinary, PGresult **res)
 {
 	bool		OK;
-	const char *prompt;
 	char		buf[COPYBUFSIZ];
+	bool		showprompt = false;
 
 	/*
 	 * Establish longjmp destination for exiting from wait-for-input. (This is
@@ -540,21 +540,20 @@ handleCopyIn(PGconn *conn, FILE *copystream, bool isbinary, PGresult **res)
 	/* Prompt if interactive input */
 	if (isatty(fileno(copystream)))
 	{
+		showprompt = true;
 		if (!pset.quiet)
 			puts(_("Enter data to be copied followed by a newline.\n"
 				   "End with a backslash and a period on a line by itself."));
-		prompt = get_prompt(PROMPT_COPY);
 	}
-	else
-		prompt = NULL;
 
 	OK = true;
 
 	if
Re: [HACKERS] a7ae1dc has broken the windows builds
On 2014-08-27 00:25:43 +1200, David Rowley wrote: I've attached a small patch which should get the windows builds up and running again. Pushed, thanks. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch to add support of IF NOT EXISTS to others CREATE statements
On Tue, Aug 26, 2014 at 10:20 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

  On 04/14/2014 10:31 PM, Fabrízio de Royes Mello wrote:

    On Tue, Apr 1, 2014 at 2:46 PM, Robert Haas robertmh...@gmail.com wrote:

      Where this is a bit more interesting is in the case of sequences, where resetting the sequence to zero may cause further inserts into an existing table to fail.

      Yeah. Sequences do have contained data, which makes COR harder to define --- that's part of the reason why we have CINE not COR for tables, and maybe we have to do the same for sequences. The point being exactly that if you use CINE, you're implicitly accepting that you don't know the ensuing state fully.

    Yeah. I think CINE is more sensible than COR for sequences, for precisely the reason that they do have contained data (even if it's basically only one value).

    The attached patch contains CINE for sequences. I just stripped this code from the previously rejected patch.

  Committed with minor changes:

  * The documentation promised too much. It said that it would not throw an error if a sequence with the same name exists. In fact, it will not throw an error if any relation with the same name exists. I rewrote that paragraph to emphasize that more, re-using the phrases from the CREATE TABLE manual page.

  * Don't call RangeVarGetAndCheckCreationNamespace unnecessarily when IF NOT EXISTS is not used.

Thanks!

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
Timbira: http://www.timbira.com.br
Blog: http://fabriziomello.github.io
Linkedin: http://br.linkedin.com/in/fabriziomello
Twitter: http://twitter.com/fabriziomello
Github: http://github.com/fabriziomello
Re: [HACKERS] Patch to support SEMI and ANTI join removal
On 08/26/2014 03:28 PM, David Rowley wrote:

  Any ideas or feedback on this would be welcome

Before someone spends time reviewing this patch, are you sure this is worth the effort? It seems like a very narrow use case to me. I understand removing LEFT and INNER joins, but the case for SEMI and ANTI joins seems a lot thinner. Unnecessary LEFT and INNER joins can easily creep into a query when views are used, for example, but I can't imagine that happening for a SEMI or ANTI join. Maybe I'm lacking imagination. If someone has run into a query in the wild that would benefit from this, please raise your hand.

If I understood correctly, you're planning to work on INNER join removal too. How much of the code in this patch is also required for INNER join removal, and how much is specific to SEMI and ANTI joins?

Just so everyone is on the same page on what kind of queries this helps with, here are some examples from the added regression tests:

-- Test join removals for semi and anti joins
CREATE TEMP TABLE b (id INT NOT NULL PRIMARY KEY, val INT);
CREATE TEMP TABLE a (id INT NOT NULL PRIMARY KEY, b_id INT REFERENCES b(id));

-- should remove semi join to b
EXPLAIN (COSTS OFF) SELECT id FROM a WHERE b_id IN (SELECT id FROM b);
          QUERY PLAN
-------------------------------
 Seq Scan on a
   Filter: (b_id IS NOT NULL)
(2 rows)

-- should remove semi join to b
EXPLAIN (COSTS OFF) SELECT id FROM a WHERE EXISTS (SELECT 1 FROM b WHERE a.b_id = id);
          QUERY PLAN
-------------------------------
 Seq Scan on a
   Filter: (b_id IS NOT NULL)
(2 rows)

-- should remove anti join to b
EXPLAIN (COSTS OFF) SELECT id FROM a WHERE NOT EXISTS (SELECT 1 FROM b WHERE a.b_id = id);
        QUERY PLAN
---------------------------
 Seq Scan on a
   Filter: (b_id IS NULL)
(2 rows)

- Heikki
Re: [HACKERS] replicating DROP commands across servers
Andres Freund wrote: On 2014-06-13 15:50:50 -0400, Alvaro Herrera wrote: Here's a patch implementing the proposed idea. This is used in the Bidirectional Replication stuff by Simon/Andres; it works well. I think there's been some changes to this patch since july, care to resend a new version? Sure, here it is. The only difference with the previous version is that it now also supports column defaults. This was found to be a problem when you drop a sequence that some column default depends on -- for example a column declared SERIAL, or a sequence marked with ALTER SEQUENCE OWNED BY. The new code is able to drop both the sequence and the default value (leaving, of course, the rest of the column intact.) This required adding support for such objects in get_object_address. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services *** a/doc/src/sgml/func.sgml --- b/doc/src/sgml/func.sgml *** *** 17564,17569 FOR EACH ROW EXECUTE PROCEDURE suppress_redundant_updates_trigger(); --- 17564,17582 entryObject sub-id (e.g. attribute number for columns)/entry /row row + entryliteraloriginal/literal/entry + entrytypebool/type/entry + entryFlag used to identify the root object of the deletion/entry +/row +row + entryliteralnormal/literal/entry + entrytypebool/type/entry + entry + Flag indicating that there's a normal dependency relationship + in the dependency graph leading to this object + /entry +/row +row entryliteralobject_type/literal/entry entrytypetext/type/entry entryType of the object/entry *** *** 17593,17598 FOR EACH ROW EXECUTE PROCEDURE suppress_redundant_updates_trigger(); --- 17606,17627 identifier present in the identity is quoted if necessary. 
/entry /row +row + entryliteraladdress_names/literal/entry + entrytypetext[]/type/entry + entry + An array that, together with literaladdress_args/literal, + can be used by the C-language function getObjectAddress() to + recreate the object address in a remote server containing a similar object. + /entry +/row +row + entryliteraladdress_args/literal/entry + entrytypetext[]/type/entry + entry + See literaladdress_names/literal above. + /entry +/row /tbody /tgroup /informaltable *** a/src/backend/catalog/dependency.c --- b/src/backend/catalog/dependency.c *** *** 203,218 deleteObjectsInList(ObjectAddresses *targetObjects, Relation *depRel, /* * Keep track of objects for event triggers, if necessary. */ ! if (trackDroppedObjectsNeeded()) { for (i = 0; i targetObjects-numrefs; i++) { ! ObjectAddress *thisobj = targetObjects-refs + i; ! ! if ((!(flags PERFORM_DELETION_INTERNAL)) ! EventTriggerSupportsObjectClass(getObjectClass(thisobj))) { ! EventTriggerSQLDropAddObject(thisobj); } } } --- 203,227 /* * Keep track of objects for event triggers, if necessary. */ ! if (trackDroppedObjectsNeeded() !(flags PERFORM_DELETION_INTERNAL)) { for (i = 0; i targetObjects-numrefs; i++) { ! const ObjectAddress *thisobj = targetObjects-refs[i]; ! const ObjectAddressExtra *extra = targetObjects-extras[i]; ! bool original = false; ! bool normal = false; ! ! if (extra-flags DEPFLAG_ORIGINAL) ! original = true; ! if (extra-flags DEPFLAG_NORMAL) ! normal = true; ! if (extra-flags DEPFLAG_REVERSE) ! normal = true; ! ! if (EventTriggerSupportsObjectClass(getObjectClass(thisobj))) { ! EventTriggerSQLDropAddObject(thisobj, original, normal); } } } *** a/src/backend/catalog/objectaddress.c --- b/src/backend/catalog/objectaddress.c *** *** 417,422 static const ObjectPropertyType ObjectProperty[] = --- 417,513 } }; + /* + * This struct maps the object types as returned by getObjectTypeDescription + * into ObjType enum values. 
Note that some enum values can be obtained by + * different names, and that some string object types do not have corresponding + * values in the enum. The user of this map must be careful to test for + * invalid values being returned. + * + * This follows the order of getObjectTypeDescription. + */ + static const struct object_type_map + { + const char *tm_name; + ObjectType tm_type; + } + ObjectTypeMap[] = + { + /* OCLASS_CLASS */ + { table, OBJECT_TABLE }, + { index, OBJECT_INDEX }, + { sequence, OBJECT_SEQUENCE }, + { toast table, -1 }, /* unmapped */ + { view, OBJECT_VIEW }, + { materialized view, OBJECT_MATVIEW }, + { composite type, OBJECT_COMPOSITE }, + { foreign table, OBJECT_FOREIGN_TABLE }, +
Re: [HACKERS] jsonb format is pessimal for toast compression
Heikki Linnakangas hlinnakan...@vmware.com writes: On 08/16/2014 02:19 AM, Tom Lane wrote: I think the realistic alternatives at this point are either to switch to all-lengths as in my test patch, or to use the hybrid approach of Heikki's test patch. ... Personally I'd prefer to go to the all-lengths approach, but a large part of that comes from a subjective assessment that the hybrid approach is too messy. Others might well disagree. It's not too pretty, no. But it would be nice to not have to make a tradeoff between lookup speed and compressibility. Yet another idea is to store all lengths, but add an additional array of offsets to JsonbContainer. The array would contain the offset of, say, every 16th element. It would be very small compared to the lengths array, but would greatly speed up random access on a large array/object. That does nothing to address my basic concern about the patch, which is that it's too complicated and therefore bug-prone. Moreover, it'd lose on-disk compatibility which is really the sole saving grace of the proposal. My feeling about it at this point is that the apparent speed gain from using offsets is illusory: in practically all real-world cases where there are enough keys or array elements for it to matter, costs associated with compression (or rather failure to compress) will dominate any savings we get from offset-assisted lookups. I agree that the evidence for this opinion is pretty thin ... but the evidence against it is nonexistent. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas robertmh...@gmail.com wrote:

  Incidentally, while I generally think your changes to the locking regimen in StrategyGetBuffer() are going in the right direction, they need significant cleanup. Your patch adds two new spinlocks, freelist_lck and victimbuf_lck, that mostly but not-quite replace BufFreelistLock, and you've now got StrategyGetBuffer() running with no lock at all when accessing some things that used to be protected by BufFreelistLock; specifically, you're doing StrategyControl->numBufferAllocs++ and SetLatch(StrategyControl->bgwriterLatch) without any locking. That's not OK.

  I think you should get rid of BufFreelistLock completely and just decide that freelist_lck will protect everything: the freeNext links, plus everything in StrategyControl except for nextVictimBuffer. victimbuf_lck will protect nextVictimBuffer and nothing else.

  Then, in StrategyGetBuffer, acquire the freelist_lck at the point where the LWLock is acquired today. Increment StrategyControl->numBufferAllocs; save the value of StrategyControl->bgwriterLatch; pop a buffer off the freelist if there is one, saving its identity. Release the spinlock. Then, set the bgwriterLatch if needed. In the first loop, first check whether the buffer we previously popped from the freelist is pinned or has a non-zero usage count and return it if not, holding the buffer header lock. Otherwise, reacquire the spinlock just long enough to pop a new potential victim and then loop around.

Today, while working on updating the patch to improve locking, I found that since we are now going to have a new process, we need a separate latch in StrategyControl to wake up that process.

Another point is that I think it will be better to protect StrategyControl->completePasses with victimbuf_lck rather than freelist_lck: when we are going to update it we will already be holding the victimbuf_lck, and it doesn't make much sense to release the victimbuf_lck and reacquire freelist_lck just to update it.

I thought it better to mention the above points now, so that if you have any different thoughts about them we can discuss them before I take performance data with this locking protocol.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
Amit Kapila amit.kapil...@gmail.com writes:

  On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas robertmh...@gmail.com wrote:

    I think you should get rid of BufFreelistLock completely and just decide that freelist_lck will protect everything: the freeNext links, plus everything in StrategyControl except for nextVictimBuffer. victimbuf_lck will protect nextVictimBuffer and nothing else.

  Another point is that I think it will be better to protect StrategyControl->completePasses with victimbuf_lck rather than freelist_lck, as when we are going to update it we will already be holding the victimbuf_lck and it doesn't make much sense to release the victimbuf_lck and reacquire freelist_lck to update it.

I'm rather concerned by this cavalier assumption that we can protect fields a,b,c with one lock and fields x,y,z in the same struct with some other lock. A minimum requirement for that to work safely at all is that the fields are of atomically fetchable/storable widths; which might be okay here, but it's a restriction that bears thinking about (and documenting).

But quite aside from safety, the fields are almost certainly going to be in the same cache line, which means contention between processes that are trying to fetch or store them concurrently. For a patch whose sole excuse for existence is to improve performance, that should be a very scary concern. (And yes, I realize these issues already affect the freelist. Perhaps that's part of the reason we have performance issues with it.)

regards, tom lane
Re: [HACKERS] [BUGS] BUG #11208: Refresh Materialized View Concurrently bug using user Postgres
Kevin Grittner kgri...@ymail.com wrote:

  Kevin Grittner kgri...@ymail.com wrote:

    bemanuel...@gmail.com bemanuel...@gmail.com wrote:

      tjma_dw=> set role user_dw;
      tjma_dw=> CREATE TABLE foo_data AS SELECT i, md5(random()::text) FROM generate_series(1, 10) i;
      SELECT 10
      tjma_dw=> CREATE MATERIALIZED VIEW mv_foo AS SELECT * FROM foo_data;
      SELECT 10
      tjma_dw=> ALTER MATERIALIZED VIEW mv_foo OWNER TO user_dw;
      ALTER MATERIALIZED VIEW
      tjma_dw=> REFRESH MATERIALIZED VIEW mv_foo;
      REFRESH MATERIALIZED VIEW
      tjma_dw=> ALTER TABLE foo_data OWNER TO user_dw;
      ALTER TABLE
      tjma_dw=> REFRESH MATERIALIZED VIEW mv_foo;
      REFRESH MATERIALIZED VIEW
      tjma_dw=> create unique index on mv_foo (i);
      CREATE INDEX

      /pgsql/pg94/bin/psql -Upostgres -p 5434 tjma_dw
      tjma_dw=# refresh materialized view CONCURRENTLY mv_foo;
      ERROR:  permission denied for relation pg_temp_432971_2
      CONTEXT:  SQL statement "DELETE FROM public.mv_foo mv WHERE ctid OPERATOR(pg_catalog.=) ANY (SELECT diff.tid FROM pg_temp_10.pg_temp_432971_2 diff WHERE diff.tid IS NOT NULL AND diff.newdata IS NULL)"

    Yeah, that's a bug.

  Attached is my proposed fix. I will push it in a day or two if there are no objections.

Done. I think we will have a third beta release, which should include this fix. The master branch needed to be adjusted from the initially posted patch because of changes there. That version is attached.

Thanks for testing the beta and for the report!

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

refresh-concurrently-fix-v2.patch.gz
Description: application/tgz
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Aug 26, 2014 at 8:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:

  Amit Kapila amit.kapil...@gmail.com writes:

    Another point is that I think it will be better to protect StrategyControl->completePasses with victimbuf_lck rather than freelist_lck, as when we are going to update it we will already be holding the victimbuf_lck and it doesn't make much sense to release the victimbuf_lck and reacquire freelist_lck to update it.

  I'm rather concerned by this cavalier assumption that we can protect fields a,b,c with one lock and fields x,y,z in the same struct with some other lock.

In some cases it could be beneficial, especially when a,b,c are going to be accessed more frequently than x,y,z.

  A minimum requirement for that to work safely at all is that the fields are of atomically fetchable/storable widths; which might be okay here but it's a restriction that bears thinking about (and documenting). But quite aside from safety, the fields are almost certainly going to be in the same cache line which means contention between processes that are trying to fetch or store them concurrently.

I think the patch will reduce contention on some of those variables (the ones accessed during the clock sweep), as it minimizes the need for backends to perform the clock sweep at all.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] pg_dump refactor patch to remove global variables
On 8/15/14 7:30 PM, Joachim Wieland wrote:

  Attached is a patch that doesn't add any new functionality or features, all it does is get rid of the global variables that pg_dump.c is full of.

I'm getting a compiler error:

In file included from pg_dump.c:60:
In file included from ./pg_backup_archiver.h:32:
./pg_backup.h:212:3: error: redefinition of typedef 'DumpOptions' is a C11 feature [-Werror,-Wtypedef-redefinition]
} DumpOptions;
  ^
./pg_dump.h:507:29: note: previous definition is here
typedef struct _dumpOptions DumpOptions;
                            ^
1 error generated.
Re: [HACKERS] Per table autovacuum vacuum cost limit behaviour strange
Mark Kirkwood wrote:

  On 06/05/14 16:28, Amit Kapila wrote:

    On Mon, May 5, 2014 at 11:57 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:

    I could think of two ways to change this:
    a. if the user has specified a cost_limit value for a table, just use it rather than rebalancing based on the value of the system-wide GUC variable autovacuum_vacuum_cost_limit
    b. restrict the per-table value to be less than the system-wide value

    The former is used for autovacuum parameters like scale_factor and the latter is used for parameters like freeze_max_age. Thoughts?

    Alvaro, do you think the above options make sense to solve this problem?

I've been giving some thought to this. Really, there is no way to handle this sensibly while at the same time keeping the documented behavior -- or in other words, what we have documented is not useful behavior.

Your option (b) above is an easy solution to the problem, but it means that the user will have serious trouble configuring the system in scenarios such as volatile tables, as Mark says -- essentially it forecloses the option of using autovacuum for them.

I'm not sure I like your (a) proposal much better. One problem there is that if you set the values for a table to be exactly the same values as in postgresql.conf, it will behave completely differently because it will not participate in balancing. To me this seems to violate POLA.

I checked Haribabu's latest patch in this thread, and didn't like it much. If you set up a table to have cost_delay=1000, it runs at that speed when vacuumed alone; but if there are two workers, it goes at half the speed even if the other one is configured with a very small cost_delay (in essence, wasting the allocated I/O bandwidth). With three workers, it goes at a third of the speed -- again, even if the other tables are configured to go much slower than the volatile one. This seems too simplistic. It might be okay when you have only one or two very large or high-churn tables, and small numbers of workers, but it's not unreasonable to think that you might have lots more workers if your DB has many high-churn tables.

So my proposal is a bit more complicated. First we introduce the notion of a single number, to enable sorting and computations: the "delay equivalent", which is the cost_limit divided by cost_delay. The higher the value is for any table, the faster it is vacuumed. (It makes sense in physical terms: a higher cost_limit makes it faster, because vacuum sleeps less often; and a higher cost_delay makes it go slower, because vacuum sleeps for longer.)

Now, the critical issue is to notice that not all tables are equal; they can be split in two groups: those that go faster than the global delay equivalent (i.e. the effective values of the GUC variables autovacuum_vacuum_cost_limit/autovacuum_vacuum_cost_delay), and those that go equal or slower. For the latter group, the rebalancing algorithm distributes the I/O allocated by the global vars, in a pro-rated manner. For the former group (tables vacuumed faster than the global delay equivalent), to rebalance we don't consider the global delay equivalent but the delay equivalent of the fastest table currently being vacuumed.

Suppose we have two tables, delay_equiv=10 each (which is the default value). If they are both vacuumed in parallel, then we distribute a delay_equiv of 5 to each (so set cost_limit=100, cost_delay=20). As soon as one of them finishes, the remaining one is allowed to upgrade to delay_equiv=10 (cost_limit=200, cost_delay=20).

Now add a third table, delay_equiv=500 (cost_limit=10000, cost_delay=20; this is Mark's volatile table). If it's being vacuumed on its own, just assign cost_limit=10000, cost_delay=20, as normal. If one of the other two tables is being vacuumed, that one will use delay_equiv=10, as per above. To balance the volatile table, we take its delay_equiv and subtract the already handed-out delay_equiv of 10; so we set the volatile table to delay_equiv=490 (cost_limit=9800, cost_delay=20).

If we do it this way, the whole system runs at the full speed enabled by the fastest table for which we have set per-table options, but we have also scaled things so that the slow tables go slow and the fast tables go fast.

As a more elaborate example, add a fourth table with delay_equiv=50 (cost_limit=1000, cost_delay=20). This is also faster than the global vars, so we put it in the first group. If all four tables are being vacuumed in parallel, we have the two slow tables going at delay_equiv=5 each (cost_limit=100, cost_delay=20); then there are delay_equiv=490 to distribute among the remaining ones; pro-rating this, we have delay_equiv=445 (cost_limit=8900, cost_delay=20) for the volatile table and delay_equiv=45 (cost_limit=900, cost_delay=20) for the other one. If one of the slowest tables finishes vacuuming, the other one will speed up to delay_equiv=10, and the two fastest ones will go on unchanged.
Re: [HACKERS] Switch pg_basebackup to use -X stream instead of -X fetch by default?
On Mon, Aug 25, 2014 at 1:35 PM, Andres Freund and...@2ndquadrant.com wrote: Hi, currently pg_basebackup uses fetch mode when only -x is specified - which imo isn't a very good thing to use due to the increased risk of not fetching everything. How about switching to stream mode for 9.5+? I think the original reasons were to not change the default behaviour with a new feature, and secondly because defaulting to -X requires two replication connections rather than one. I think the first reason is gone now, and the risk/damage of the two connections is probably smaller than running out of WAL. -x is a good default for smaller systems, but -X is a safer one for bigger ones. So I agree that changing the default mode would make sense. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Switch pg_basebackup to use -X stream instead of -X fetch by default?
On 2014-08-26 18:40:27 +0200, Magnus Hagander wrote:

  On Mon, Aug 25, 2014 at 1:35 PM, Andres Freund and...@2ndquadrant.com wrote:

    Hi, currently pg_basebackup uses fetch mode when only -x is specified - which imo isn't a very good thing to use due to the increased risk of not fetching everything. How about switching to stream mode for 9.5+?

  I think the original reasons were to not change the default behaviour with a new feature, and secondly because defaulting to -X requires two replication connections rather than one.

Right.

  I think the first reason is gone now, and the risk/damage of the two connections is probably smaller than running out of WAL.

Especially as that will fail pretty much immediately, instead of at the end of the base backup...

  -x is a good default for smaller systems, but -X is a safer one for bigger ones. So I agree that changing the default mode would make sense.

Cool.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Switch pg_basebackup to use -X stream instead of -X fetch by default?
On Tue, Aug 26, 2014 at 6:51 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-08-26 18:40:27 +0200, Magnus Hagander wrote: On Mon, Aug 25, 2014 at 1:35 PM, Andres Freund and...@2ndquadrant.com wrote: Hi, currently pg_basebackup uses fetch mode when only -x is specified - which imo isn't a very good thing to use due to the increased risk of not fetching everything. How about switching to stream mode for 9.5+? I think the original reasons were to not change the default behaviour with a new feature, and secondly because defaulting to -X requires two replication connections rather than one. Right. I think the first reason is gone now, and the risk/damage of the two connections is probably smaller than running out of WAL. Especially as that will fail pretty nearly immediately instead at the end of the basebackup... Yeah, I don't think the problem was actually pg_basebackup failing as much as pg_basebackup getting in the way of regular replication standbys. Which I think is also a smaller problem now, given that it's a more common thing to do backups through replication protocol. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal for CSN based snapshots
On Tue, 2014-08-26 at 13:45 +0300, Heikki Linnakangas wrote:

  My current thinking is that access to the csnlog will need to be made faster. Currently, it's just another SLRU, but I'm sure we can do better than that. Or add a backend-private cache on top of it; that might already alleviate it enough..

Aren't those the same ideas that have been proposed for eliminating hint bits? If this patch makes taking snapshots much faster, then perhaps it's enough incentive to follow through on one of those ideas. I am certainly open to removing hint bits (which was probably clear from earlier discussions) if there are significant wins elsewhere and the penalty is not too great.

To continue the discussion on this patch, let's assume that we make TransactionIdGetCommitLSN sufficiently fast. If that's the case, can't we make HeapTupleSatisfiesMVCC look more like:

    LsnMin = TransactionIdGetCommitLSN(xmin);
    if (!COMMITLSN_IS_COMMITTED(LsnMin))
        LsnMin = BIG_LSN;

    LsnMax = TransactionIdGetCommitLSN(xmax);
    if (!COMMITLSN_IS_COMMITTED(LsnMax))
        LsnMax = BIG_LSN;

    return (snapshot->snapshotlsn >= LsnMin &&
            snapshot->snapshotlsn < LsnMax);

There would need to be some handling for locked tuples, or tuples related to the current transaction, of course. But I still think it would turn out simpler; perhaps by enough to save a few cycles.

Regards,
Jeff Davis
Re: [HACKERS] Proposal for CSN based snapshots
On Tue, Aug 26, 2014 at 11:45 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

  It appears that this patch weakens the idea of hint bits. Even if HEAP_XMIN_COMMITTED is set, it still needs to find out if it's in the snapshot. That's fast if the xid is less than snap->xmin, but otherwise it needs to do a fetch from the csnlog, which is exactly what the hint bits are designed to avoid. And we can't get around this, because the whole point of this patch is to remove the xip array from the snapshot.

  Yeah. This patch in the current state is likely much much slower than unpatched master, except in extreme cases where you have thousands of connections and short transactions, so that without the patch you spend most of the time acquiring snapshots.

Interesting analysis. I suppose the equivalent of hint bits would be to actually write the CSN of the transaction into the record when the hint bit is set. I don't immediately see how to make that practical. One thought would be to have a list of xids in the page header with their corresponding csn -- which starts to sound a lot like Oracle's Interested Transaction List. But I don't see how to make that work for the hundreds of possible xids on the page.

The worst case for visibility resolution is when you have a narrow table with random-access DML happening all the time, each update a short transaction, and a very high rate of such transactions spread out uniformly over a very large table. That means any given page has over 200 rows with random xids spread over a very large range of xids. Currently the invariant the hint bits give us is that each xid needs to be looked up in the clog only a more or less fixed number of times -- in that scenario only once, since the table is very large and the transactions short-lived.

-- greg
Re: Compute attr_needed for child relations (was Re: [HACKERS] inherit support for foreign tables)
Etsuro Fujita fujita.ets...@lab.ntt.co.jp writes:

  [ attr_needed-v4.patch ]

I looked this over, and TBH I'm rather disappointed. The patch adds 150 lines of dubiously-correct code in order to save ... uh, well, actually it *adds* code, because the places that are supposedly getting a benefit are changed like this:

*** 799,806 **** check_selective_binary_conversion(RelOptInfo *baserel,
  	}
  
  	/* Collect all the attributes needed for joins or final output. */
! 	pull_varattnos((Node *) baserel->reltargetlist, baserel->relid,
! 				   &attrs_used);
  
  	/* Add all the attributes used by restriction clauses. */
  	foreach(lc, baserel->baserestrictinfo)
--- 799,810 ----
  	}
  
  	/* Collect all the attributes needed for joins or final output. */
! 	for (i = baserel->min_attr; i <= baserel->max_attr; i++)
! 	{
! 		if (!bms_is_empty(baserel->attr_needed[i - baserel->min_attr]))
! 			attrs_used = bms_add_member(attrs_used,
! 										i - FirstLowInvalidHeapAttributeNumber);
! 	}
  
  	/* Add all the attributes used by restriction clauses. */
  	foreach(lc, baserel->baserestrictinfo)

That's not simpler, it's not easier to understand, and it's probably not faster either. We could address some of those complaints by encapsulating the above loop into a utility function, but still, I come away with the feeling that it's not worth changing this. Considering that all the places that are doing this then proceed to use pull_varattnos to add on attnos from the restriction clauses, it seems like using pull_varattnos on the reltargetlist isn't such a bad thing after all.

So I'm inclined to reject this. It seemed like a good idea in the abstract, but the concrete result isn't very attractive, and doesn't seem like an improvement over what we have.

regards, tom lane
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/26/2014 07:51 AM, Tom Lane wrote: My feeling about it at this point is that the apparent speed gain from using offsets is illusory: in practically all real-world cases where there are enough keys or array elements for it to matter, costs associated with compression (or rather failure to compress) will dominate any savings we get from offset-assisted lookups. I agree that the evidence for this opinion is pretty thin ... but the evidence against it is nonexistent.

Well, I have shown one test case where lengths are a net penalty. However, for that to be the case, you have to have the following conditions *all* be true:

* lots of top-level keys
* short values
* rows which are on the borderline for TOAST
* table which fits in RAM

... so that's a special case, and if it's sub-optimal, no biggie. Also, it's not like it's an order-of-magnitude slower. Anyway, I called for feedback on my blog, and have gotten some: http://www.databasesoup.com/2014/08/the-great-jsonb-tradeoff.html -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] jsonb format is pessimal for toast compression
Josh Berkus j...@agliodbs.com writes: Anyway, I called for feedback on my blog, and have gotten some: http://www.databasesoup.com/2014/08/the-great-jsonb-tradeoff.html I was hoping you'd get some useful data from that, but so far it seems like a rehash of points made in the on-list thread :-( regards, tom lane
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/26/2014 11:40 AM, Tom Lane wrote: Josh Berkus j...@agliodbs.com writes: Anyway, I called for feedback on my blog, and have gotten some: http://www.databasesoup.com/2014/08/the-great-jsonb-tradeoff.html I was hoping you'd get some useful data from that, but so far it seems like a rehash of points made in the on-list thread :-( regards, tom lane

yah, me too. :-( Unfortunately even the outside commenters don't seem to understand that storage size *is* related to speed; it's exchanging I/O speed for CPU speed. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] jsonb format is pessimal for toast compression
Josh Berkus j...@agliodbs.com writes: On 08/26/2014 11:40 AM, Tom Lane wrote: I was hoping you'd get some useful data from that, but so far it seems like a rehash of points made in the on-list thread :-( Unfortunately even the outside commenters don't seem to understand that storage size *is* related to speed; it's exchanging I/O speed for CPU speed.

Yeah, exactly. Given current hardware trends, data compression is becoming more of a win, not less, as time goes on: CPU cycles are cheap even compared to main memory access, let alone mass storage. So I'm thinking we want to adopt a compression-friendly data format even if it measures out as a small loss currently. I wish it were cache-friendly too, per the upthread tangent about having to fetch keys from all over the place within a large JSON object.

... and while I was typing that sentence, lightning struck. The existing arrangement of object subfields with keys and values interleaved is just plain dumb. We should rearrange that as all the keys in order, then all the values in the same order. Then the keys are naturally adjacent in memory and object-key searches become much more cache-friendly: you probably touch most of the key portion of the object, but none of the values portion, until you know exactly what part of the latter to fetch. This approach might complicate the lookup logic marginally, but I bet not very much; and it will be a huge help if we ever want to do smart access to EXTERNAL (non-compressed) JSON values. I will go prototype that just to see how much code rearrangement is required. regards, tom lane
Re: [HACKERS] jsonb format is pessimal for toast compression
On 2014-08-26 15:01:27 -0400, Tom Lane wrote: Josh Berkus j...@agliodbs.com writes: On 08/26/2014 11:40 AM, Tom Lane wrote: I was hoping you'd get some useful data from that, but so far it seems like a rehash of points made in the on-list thread :-( Unfortunately even the outside commenters don't seem to understand that storage size *is* related to speed; it's exchanging I/O speed for CPU speed. Yeah, exactly. Given current hardware trends, data compression is becoming more of a win, not less, as time goes on: CPU cycles are cheap even compared to main memory access, let alone mass storage. So I'm thinking we want to adopt a compression-friendly data format even if it measures out as a small loss currently.

On the other hand, the majority of databases these days fit into main memory due to its increasing size, and postgres is more often CPU- than IO-bound. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] jsonb format is pessimal for toast compression
On 26 August 2014 11:34, Josh Berkus j...@agliodbs.com wrote: On 08/26/2014 07:51 AM, Tom Lane wrote: My feeling about it at this point is that the apparent speed gain from using offsets is illusory: in practically all real-world cases where there are enough keys or array elements for it to matter, costs associated with compression (or rather failure to compress) will dominate any savings we get from offset-assisted lookups. I agree that the evidence for this opinion is pretty thin ... but the evidence against it is nonexistent. Well, I have shown one test case where lengths are a net penalty. However, for that to be the case, you have to have the following conditions *all* be true: * lots of top-level keys * short values * rows which are on the borderline for TOAST * table which fits in RAM ... so that's a special case, and if it's sub-optimal, no biggie. Also, it's not like it's an order-of-magnitude slower. Anyway, I called for feedback on my blog, and have gotten some: http://www.databasesoup.com/2014/08/the-great-jsonb-tradeoff.html

It would be really interesting to see your results with column STORAGE EXTERNAL for that benchmark. I think it is important to separate out the slowdown due to decompression now being needed vs that inherent in the new format; we can always switch off compression on a per-column basis using STORAGE EXTERNAL. My JSON data has smallish objects with a small number of keys; it barely compresses at all with the patch and shows similar results to Arthur's data. Across ~500K rows I get:

encoded=# select count(properties->'submitted_by') from compressed;
 count
--------
 431948
(1 row)

Time: 250.512 ms

encoded=# select count(properties->'submitted_by') from uncompressed;
 count
--------
 431948
(1 row)

Time: 218.552 ms

Laurence
Re: [HACKERS] jsonb format is pessimal for toast compression
Andres Freund and...@2ndquadrant.com writes: On 2014-08-26 15:01:27 -0400, Tom Lane wrote: Yeah, exactly. Given current hardware trends, data compression is becoming more of a win, not less, as time goes on: CPU cycles are cheap even compared to main memory access, let alone mass storage. So I'm thinking we want to adopt a compression-friendly data format even if it measures out as a small loss currently. On the other hand, the majority of databases these days fit into main memory due to its increasing size, and postgres is more often CPU- than IO-bound.

Well, better data compression helps make that true ;-). And don't forget cache effects; actual main memory is considered slow these days. regards, tom lane
Re: [HACKERS] jsonb format is pessimal for toast compression
On Tue, Aug 26, 2014 at 4:01 PM, Tom Lane t...@sss.pgh.pa.us wrote: Josh Berkus j...@agliodbs.com writes: On 08/26/2014 11:40 AM, Tom Lane wrote: I was hoping you'd get some useful data from that, but so far it seems like a rehash of points made in the on-list thread :-( Unfortunately even the outside commenters don't seem to understand that storage size *is* related to speed; it's exchanging I/O speed for CPU speed. Yeah, exactly. Given current hardware trends, data compression is becoming more of a win, not less, as time goes on: CPU cycles are cheap even compared to main memory access, let alone mass storage. So I'm thinking we want to adopt a compression-friendly data format even if it measures out as a small loss currently. I wish it were cache-friendly too, per the upthread tangent about having to fetch keys from all over the place within a large JSON object.

What about my earlier proposal? An in-memory compressed representation would greatly help cache locality, more so if you pack keys as you mentioned.
Re: [HACKERS] jsonb format is pessimal for toast compression
On 2014-08-26 15:17:13 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-08-26 15:01:27 -0400, Tom Lane wrote: Yeah, exactly. Given current hardware trends, data compression is becoming more of a win, not less, as time goes on: CPU cycles are cheap even compared to main memory access, let alone mass storage. So I'm thinking we want to adopt a compression-friendly data format even if it measures out as a small loss currently. On the other hand, the majority of databases these days fit into main memory due to its increasing size, and postgres is more often CPU- than IO-bound. Well, better data compression helps make that true ;-).

People disable toast compression though because it results in better performance :(. Part of that could be fixed by a faster compression method, part of it by decompressing less often. But still.

And don't forget cache effects; actual main memory is considered slow these days.

Right. But that plays the other way round too. Compressed datums need to be copied to be accessed uncompressed. Whereas at least in comparison to inline compressed datums that's not necessary. Anyway, that's just to say that I don't really agree that CPU overhead is a worthy price to pay for storage efficiency if the gains are small. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Proposal for CSN based snapshots
On Tue, 2014-08-26 at 19:25 +0100, Greg Stark wrote: I don't immediately see how to make that practical. One thought would be to have a list of xids in the page header with their corresponding csn -- which starts to sound a lot like Oracle's Interested Transaction List. But I don't see how to make that work for the hundreds of possible xids on the page.

I feel like that's moving in the wrong direction. That's still causing a lot of modifications to a data page when the data is not changing, and that's bad for a lot of cases that I'm interested in (checksums are one example). We are mixing two kinds of data: user data and visibility information. Each is changed under different circumstances and has different characteristics, and I'm beginning to think they shouldn't be mixed at all. What if we just devised a structure specially designed to hold visibility information, put all of the visibility information there, and data pages would only change where there is a real, user-initiated I/U/D. Vacuum could still clear out dead tuples from the data area, but it would do the rest of its work on the visibility structure. It could even be a clever structure that could compress away large static areas until they become active again. Maybe this wouldn't work for all tables, but could be an option for big tables with low update rates.

The worst case for visibility resolution is a narrow table with random-access DML happening all the time: each update is a short transaction and there is a very high rate of such transactions spread out uniformly over a very large table. That means any given page has over 200 rows with random xids spread over a very large range of xids.

That's not necessarily a bad case, unless the CLOG/CSNLOG lookup is a significant fraction of the effort to update a tuple. That would be a bad case if you introduce scans into the equation as well, but that's not a problem if the all-visible bit is set. 
Currently the invariant the hint bits give us is that each xid needs to be looked up in the clog only a more or less fixed number of times -- in that scenario only once, since the table is very large and the transactions short-lived.

A backend-local cache might accomplish that, as well (would still need to do a lookup, but no locks or contention). There would be some challenges around invalidation (for xid wraparound) and pre-warming the cache (so establishing a lot of connections doesn't cause a lot of CLOG access). Regards, Jeff Davis
Re: [HACKERS] jsonb format is pessimal for toast compression
On Tue, Aug 26, 2014 at 12:27 PM, Andres Freund and...@2ndquadrant.com wrote: Anyway, that's just to say that I don't really agree that CPU overhead is a worthy price to pay for storage efficiency if the gains are small. +1 -- Peter Geoghegan
Re: [HACKERS] New Model For Role Attributes and Fine Grained Permssions
On Thu, Aug 21, 2014 at 10:49 PM, Stephen Frost sfr...@snowman.net wrote: * Heikki Linnakangas (hlinnakan...@vmware.com) wrote: On 08/19/2014 04:27 AM, Brightwell, Adam wrote: This is a proof-of-concept patch for a new model around role attributes and fine grained permissions meant to alleviate the current over dependence on superuser. Hmm. How does this get us any closer to fine-grained permissions? This patch, by itself, does not - but it adds the structure to allow us to add more permissions without having to add more fields to pg_authid, and we could build into pg_permission the WITH ADMIN OPTION and grantor bits that the normal GRANT syntax already supports - but without having to chew up additional bits for these granted permissions which are applied to roles instead of objects in the system. As for specific examples where this could be used beyond those presented, consider things like:

  CREATE EXTENSION
  CREATE TABLESPACE
  COPY (server-side)

I'm not opposed to this in theory, but I agree with Heikki that the syntax you've picked (as well as the internal terminology of the patch) is not sufficiently unambiguous. Permission could refer to a lot of things, and the GRANT syntax does a lot of things already. Since the patch appears to aim to supplant what we currently do with ALTER ROLE .. [NO] {CREATEDB | CREATEROLE }, how about something like:

  ALTER ROLE role_name { GRANT | REVOKE } CAPABILITY capability_name;

In terms of catalog structure, I doubt that a row-per-capability-per-user makes sense. Why not just add a new rolcapability column to pg_authid? A single int64 interpreted as a bitmask would give us room for 64 capabilities at a fraction of the storage and lookup cost of a separate catalog. If that's not enough, the column could be stored as an integer array or vector or something, but a bigger problem is that Tom's head will likely explode if you tell him you're going to have more than 64 of these things. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] 9.5: Memory-bounded HashAgg
On Tue, 2014-08-26 at 12:39 +0300, Heikki Linnakangas wrote: I think this is enough for this commitfest - we have consensus on the design. For the next one, please address those open items, and resubmit. Agreed, returned with feedback. I need to get the accounting patch in first, which needs to address some performance issues, but there's a chance of wrapping those up quickly. Regards, Jeff Davis
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/26/2014 12:27 PM, Andres Freund wrote: Anyway, that's just to say that I don't really agree that CPU overhead is a worthy price to pay for storage efficiency if the gains are small. But in this case the gains aren't small; we're talking up to 60% smaller storage. Testing STORAGE EXTERNAL soon. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] postgresql latency bgwriter not doing its job
Hello Jeff,

The culprit I found is bgwriter, which is basically doing nothing to prevent the coming checkpoint IO storm, even though there would be ample time to write the accumulating dirty pages so that checkpoint would find a clean field and pass in a blink. Indeed, at the end of the 500 seconds throttled test, pg_stat_bgwriter says: Are you doing pg_stat_reset_shared('bgwriter') after running pgbench -i? Yes, I did. You don't want your steady state stats polluted by the bulk load. Sure! buffers_checkpoint = 19046 buffers_clean = 2995 Out of curiosity, what does buffers_backend show? buffers_backend = 157 In any event, this almost certainly is a red herring. Possibly! It is pretty easy to reproduce... can anyone try?

Whichever of the three ways are being used to write out the buffers, it is the checkpointer that is responsible for fsyncing them, and that is where your drama is almost certainly occurring. Writing out with one path rather than another isn't going to change things, unless you change the fsync. Well, ISTM that the OS does not need to wait for fsync to start writing pages if it has received one minute of buffer writes at 50 writes per second; some scheduler should start handling the flow somewhere... So if bgwriter were to write the buffers it would help, but maybe there is a better way.

Also, are you familiar with checkpoint_completion_target, and what is it set to? The default 0.5. Moving to 0.9 seems to worsen the situation. -- Fabien.
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Fri, Aug 22, 2014 at 2:46 PM, Peter Geoghegan p...@heroku.com wrote: On Fri, Aug 22, 2014 at 7:19 AM, Robert Haas robertmh...@gmail.com wrote: Patch 0002 no longer applies; please rebase. I attach rebased patch. Note that there is currently a bug in the master branch:

+ if (len2 >= tss->buflen2)
+ {
+ 	pfree(tss->buf2);
+ 	tss->buflen1 = Max(len2 + 1, Min(tss->buflen2 * 2, MaxAllocSize));
+ 	tss->buf2 = MemoryContextAlloc(ssup->ssup_cxt, tss->buflen2);
+ }

You didn't say what the bug is; after inspection, I believe it's that line 4 begins with tss->buflen1 rather than tss->buflen2. I have committed a fix for that problem. Let me know if I missed something else. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal for CSN based snapshots
On Tue, Aug 26, 2014 at 8:32 PM, Jeff Davis pg...@j-davis.com wrote: We are mixing two kinds of data: user data and visibility information. Each is changed under different circumstances and has different characteristics, and I'm beginning to think they shouldn't be mixed at all. What if we just devised a structure specially designed to hold visibility information, put all of the visibility information there, and data pages would only change where there is a real, user-initiated I/U/D. Vacuum could still clear out dead tuples from the data area, but it would do the rest of its work on the visibility structure. It could even be a clever structure that could compress away large static areas until they become active again. Well fundamentally the reason the visibility information is with the user data is so that we don't need to do two i/os to access the data. The whole point of hint bits is to guarantee that after some amount of time you can read data directly out of the heap page and use it without doing any additional I/O. -- greg
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Aug 26, 2014 at 12:59 PM, Robert Haas robertmh...@gmail.com wrote: I have committed a fix for that problem. Let me know if I missed something else. Yes, that's all I meant. -- Peter Geoghegan
Re: [HACKERS] [v9.5] Custom Plan API
On Fri, Aug 22, 2014 at 9:48 PM, Kohei KaiGai kai...@kaigai.gr.jp wrote: One thing that was pointed out to me, and the reason why I implemented DDL support, is that intermediation of a C-language function also loads the extension module implicitly. It is an advantage, but not sure whether it shall be supported from the beginning. That is definitely an advantage of the DDL-based approach, but I think it's too much extra machinery for not enough real advantage. Sounds like we all agree, so ... I'd like to follow this direction, and start stripping the DDL support. ...please make it so. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal for CSN based snapshots
On Tue, 2014-08-26 at 21:00 +0100, Greg Stark wrote: Well fundamentally the reason the visibility information is with the user data is so that we don't need to do two i/os to access the data. The whole point of hint bits is to guarantee that after some amount of time you can read data directly out of the heap page and use it without doing any additional I/O. If the data is that static, then the visibility information would be highly compressible, and surely in shared_buffers already. (Yes, it would need to be pinned, which has a cost.) Regards, Jeff Davis
Re: [HACKERS] proposal: rounding up time value less than its unit.
On Sat, Aug 23, 2014 at 6:39 PM, Greg Stark st...@mit.edu wrote: On Sat, Aug 23, 2014 at 9:38 PM, Tom Lane t...@sss.pgh.pa.us wrote: Ah. Okay, but then what's wrong with the original proposal of using ceil() instead of floor()? Basically I think the idea of treating fractions less than one differently from fractions greater than one is bogus; nobody will ever find that intuitive. Or make it an error to specify a value that rounds to 0 but isn't 0.

I liked David Johnston's even stronger suggestion upthread: make it an error to specify a value that requires rounding of any kind. In other words, if the minimum granularity is 1 minute, you can specify that as 60 seconds instead, but if you write 59 seconds, we error out. Maybe that seems pedantic, but I don't think users will much appreciate the discovery that 30 seconds means 60 seconds. They'll be happier to be told that up front than having to work it out afterward. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] proposal: rounding up time value less than its unit.
On 2014-08-26 16:16:32 -0400, Robert Haas wrote: On Sat, Aug 23, 2014 at 6:39 PM, Greg Stark st...@mit.edu wrote: On Sat, Aug 23, 2014 at 9:38 PM, Tom Lane t...@sss.pgh.pa.us wrote: Ah. Okay, but then what's wrong with the original proposal of using ceil() instead of floor()? Basically I think the idea of treating fractions less than one differently from fractions greater than one is bogus; nobody will ever find that intuitive. Or make it an error to specify a value that rounds to 0 but isn't 0. I liked David Johnston's even stronger suggestion upthread: make it an error to specify a value that requires rounding of any kind. In other words, if the minimum granularity is 1 minute, you can specify that as 60 seconds instead, but if you write 59 seconds, we error out. Maybe that seems pedantic, but I don't think users will much appreciate the discovery that 30 seconds means 60 seconds. They'll be happier to be told that up front than having to work it out afterward.

Is the whole topic actually practically relevant? Afaics there's no GUCs with a time unit bigger than GUC_UNIT_S except log_rotation_age, where it surely doesn't matter, and no space unit bigger than GUC_UNIT_BLOCKS/GUC_UNIT_XBLOCKS. In theory I agree with you/$subject, but I don't really see this worth much effort. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Switch pg_basebackup to use -X stream instead of -X fetch by default?
On 8/26/14 12:40 PM, Magnus Hagander wrote: I think the first reason is gone now, and the risk/damage of the two connections is probably smaller than running out of WAL. -x is a good default for smaller systems, but -X is a safer one for bigger ones. So I agree that changing the default mode would make sense. I would seriously consider just removing one of the modes. Having two modes is complex enough, and then having different defaults in different versions, and fuzzy recommendations like, it's better for smaller systems, it's quite confusing. I don't think it's a fundamental problem to say, you need 2 connections to use this feature. (For example, you need a second connection to issue a cancel request. Nobody has ever complained about that.)
Re: [HACKERS] Switch pg_basebackup to use -X stream instead of -X fetch by default?
On 2014-08-26 16:41:44 -0400, Peter Eisentraut wrote: On 8/26/14 12:40 PM, Magnus Hagander wrote: I think the first reason is gone now, and the risk/damage of the two connections is probably smaller than running out of WAL. -x is a good default for smaller systems, but -X is a safer one for bigger ones. So I agree that changing the default mode would make sense. I would seriously consider just removing one of the modes. Having two modes is complex enough, and then having different defaults in different versions, and fuzzy recommendations like, it's better for smaller systems, it's quite confusing. Happy with removing the option and just accepting -X for backward compat. I don't think it's a fundamental problem to say, you need 2 connections to use this feature. (For example, you need a second connection to issue a cancel request. Nobody has ever complained about that.) Well, replication connections are more limited in number than normal connections... And cancel requests are very short lived. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] proposal: rounding up time value less than its unit.
On 8/23/14 6:39 PM, Greg Stark wrote: Or make it an error to specify a value that rounds to 0 but isn't 0. That's what I would prefer.
Re: [HACKERS] proposal: rounding up time value less than its unit.
On 8/26/14 4:22 PM, Andres Freund wrote: Is the whole topic actually practically relevant? It's clearly not all that important, or otherwise we'd have heard about it before now. I suppose someone could do something like wal_receiver_status_interval = 10ms and end up silently turning the whole thing off instead of making it very aggressive. The mistake here is that the mathematically appropriate turn-off value in this and similar cases is infinity, not zero.
Re: [HACKERS] Missing comment block at the top of streamutil.h and receivelog.h
On Sat, Aug 23, 2014 at 11:59 PM, Michael Paquier michael.paqu...@gmail.com wrote: As mentioned in $subject, the header files in src/bin/pg_basebackup do not have a comment block at the top and do not have any copyright text. Any reason for that? Shouldn't we have something for consistency with the other files like in the patch attached? Probably that's a good idea, but do we really need Author: tags? I know we have those in a few places, but certainly not everywhere, and as time goes by they tend to be less accurate reflections of who wrote the latest code (as opposed to the original code). Furthermore, every time we include them, it tends to increase the demand to add even more of them because, hey, everybody likes to be acknowledged. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Missing comment block at the top of streamutil.h and receivelog.h
On Tue, Aug 26, 2014 at 11:03 PM, Robert Haas robertmh...@gmail.com wrote: On Sat, Aug 23, 2014 at 11:59 PM, Michael Paquier michael.paqu...@gmail.com wrote: As mentioned in $subject, the header files in src/bin/pg_basebackup do not have a comment block at the top and do not have any copyright text. Any reason for that? Shouldn't we have something for consistency with the other files like in the patch attached? Probably that's a good idea, but do we really need Author: tags? I know we have those in a few places, but certainly not everywhere, and as time goes by they tend to be less accurate reflections of who wrote the latest code (as opposed to the original code). Furthermore, every time we include them, it tends to increase the demand to add even more of them because, hey, everybody likes to be acknowledged. Given that I'm the one named in it - nah, just drop it. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] Switch pg_basebackup to use -X stream instead of -X fetch by default?
On Tue, Aug 26, 2014 at 10:46 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> On 2014-08-26 16:41:44 -0400, Peter Eisentraut wrote:
>> On 8/26/14 12:40 PM, Magnus Hagander wrote:
>>> I think the first reason is gone now, and the risk/damage of the two
>>> connections is probably smaller than running out of WAL. -x is a good
>>> default for smaller systems, but -X is a safer one for bigger ones. So
>>> I agree that changing the default mode would make sense.
>> I would seriously consider just removing one of the modes. Having two
>> modes is complex enough, and then having different defaults in
>> different versions, and fuzzy recommendations like "it's better for
>> smaller systems", it's quite confusing.
> Happy with removing the option and just accepting -X for backward
> compat.

Works for me - this is really the cleaner way of doing it... If we do that, perhaps we should backpatch a deprecation notice into the 9.4 docs?

>> I don't think it's a fundamental problem to say, you need 2 connections
>> to use this feature. (For example, you need a second connection to
>> issue a cancel request. Nobody has ever complained about that.)
> Well, replication connections are more limited in number than normal
> connections... And cancel requests are very short lived.

Yeah. But as long as we document it clearly, we should be OK I think. And it's fairly clearly documented now - just need to be sure not to remove that when changing the -x stuff.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] proposal: rounding up time value less than its unit.
Robert Haas <robertmh...@gmail.com> writes:
> I liked David Johnston's even stronger suggestion upthread: make it an
> error to specify a value that requires rounding of any kind. In other
> words, if the minimum granularity is 1 minute, you can specify that as
> 60 seconds instead, but if you write 59 seconds, we error out. Maybe
> that seems pedantic, but I don't think users will much appreciate the
> discovery that 30 seconds means 60 seconds. They'll be happier to be
> told that up front than having to work it out afterward.

I think this is totally wrong. The entire point of the GUC units system, or at least a large part of the point, is that users should not have to be intimately aware of the units in which a given value is measured internally. And that in turn means that the units had better be such that users won't find them overly coarse. If it matters a lot whether 59 seconds gets rounded to 60, then we didn't make a good choice of units for the GUC in question; and we should fix that choice, not mess with the rounding rules.

The case where this argument falls down is for special values, such as where zero means something quite different from the smallest nonzero value. Peter suggested upthread that we should redefine any GUC values for which that is true, but (a) I think that loses on backwards compatibility grounds, and (b) ISTM zero is probably always special to some extent. A zero time delay for example is not likely to work.

Maybe we should leave the rounding behavior alone (there's not much evidence that rounding in one direction is worse than another; although I'd also be okay with changing to round-to-nearest), and confine ourselves to throwing an error for the single case that an apparently nonzero input value is truncated/rounded to zero as a result of units conversion.

			regards, tom lane
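Tom's compromise - keep the round-down conversion, but reject an apparently nonzero input that would truncate to zero - can be sketched as follows. This is a toy model in Python with invented names, not the actual guc.c conversion code:

```python
# Toy sketch of the proposed rule (names invented for illustration):
# units conversion still rounds down, but a nonzero input that would
# become zero in the GUC's internal unit raises an error instead of
# being silently interpreted as the (special) zero value.

def convert_guc_units(value, input_unit_ms, internal_unit_ms):
    """Convert `value`, given in input units, to the GUC's internal unit."""
    result = (value * input_unit_ms) // internal_unit_ms  # round down
    if value > 0 and result == 0:
        # e.g. "30s" for a GUC measured in minutes
        raise ValueError("value is nonzero but rounds down to zero")
    return result

# 90 seconds into a minutes-based GUC still rounds down to 1 min
print(convert_guc_units(90, 1000, 60000))   # 1
```

Under this rule, 30 seconds for a minutes-based GUC raises an error instead of silently becoming zero, 90 seconds rounds down to 1 minute exactly as today, and an explicit zero passes through unchanged, preserving its special meaning.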
Re: [HACKERS] proposal: rounding up time value less than its unit.
Tom Lane-2 wrote:
> Robert Haas <robertmhaas@> writes:
>> I liked David Johnston's even stronger suggestion upthread: make it an
>> error to specify a value that requires rounding of any kind. In other
>> words, if the minimum granularity is 1 minute, you can specify that as
>> 60 seconds instead, but if you write 59 seconds, we error out. Maybe
>> that seems pedantic, but I don't think users will much appreciate the
>> discovery that 30 seconds means 60 seconds. They'll be happier to be
>> told that up front than having to work it out afterward.
>
> I think this is totally wrong. The entire point of the GUC units
> system, or at least a large part of the point, is that users should not
> have to be intimately aware of the units in which a given value is
> measured internally. And that in turn means that the units had better
> be such that users won't find them overly coarse. If it matters a lot
> whether 59 seconds gets rounded to 60, then we didn't make a good
> choice of units for the GUC in question; and we should fix that choice,
> not mess with the rounding rules.
>
> The case where this argument falls down is for special values, such as
> where zero means something quite different from the smallest nonzero
> value. Peter suggested upthread that we should redefine any GUC values
> for which that is true, but (a) I think that loses on backwards
> compatibility grounds, and (b) ISTM zero is probably always special to
> some extent. A zero time delay for example is not likely to work.
>
> Maybe we should leave the rounding behavior alone (there's not much
> evidence that rounding in one direction is worse than another; although
> I'd also be okay with changing to round-to-nearest), and confine
> ourselves to throwing an error for the single case that an apparently
> nonzero input value is truncated/rounded to zero as a result of units
> conversion.

To Andres' point:

SELECT unit, count(*) FROM pg_settings WHERE unit <> '' GROUP BY unit;

(9.3 / Ubuntu)
 min  (1 - log_rotation_age)
 s    (10)
 ms   (13)
 kB   (7)
 8kB  (6)

I don't know about the size implications, but they seem to be non-existent. That any setting critically matters at +/- 1s or 1ms doesn't seem likely in practice. Even +/- 1min for a setting, if it did matter at extreme scale, would be recognizable by the user in practice as a rounding artifact and compensated for. At this point, throwing an error for any input specified with finer precision than the GUC's unit is my preference.

I would not change the rounding rules, for the simple reason that there is no obvious improvement to be had, so why introduce a pointless change that - however marginal and unlikely - will be user-visible? The complaint to overcome is avoiding an interpretation of zero when the precision of the input is finer than the GUC's unit. Lacking any concrete complaints about our round-down policy, I don't see where a change there is worthwhile.

Fixing zero as a special value falls under the same category. As mathematically pure as using infinity may be, the trade-off for practicality and usability seems, even in light of this complaint, like the correct one to have made.

David J.
Re: [HACKERS] jsonb format is pessimal for toast compression
I wrote:
> I wish it were cache-friendly too, per the upthread tangent about
> having to fetch keys from all over the place within a large JSON
> object.
>
> ... and while I was typing that sentence, lightning struck. The
> existing arrangement of object subfields with keys and values
> interleaved is just plain dumb. We should rearrange that as all the
> keys in order, then all the values in the same order. Then the keys are
> naturally adjacent in memory and object-key searches become much more
> cache-friendly: you probably touch most of the key portion of the
> object, but none of the values portion, until you know exactly what
> part of the latter to fetch. This approach might complicate the lookup
> logic marginally, but I bet not very much; and it will be a huge help
> if we ever want to do smart access to EXTERNAL (non-compressed) JSON
> values.
>
> I will go prototype that just to see how much code rearrangement is
> required.

This looks pretty good from a coding point of view. I have not had time yet to see if it affects the speed of the benchmark cases we've been trying. I suspect that it won't make much difference in them. I think if we do decide to make an on-disk format change, we should seriously consider including this change.

The same concept could be applied to offset-based storage of course, although I rather doubt that we'd make that combination of choices, since it would be giving up on-disk compatibility for benefits that are mostly in the future.

Attached are two patches: one is a delta against the last jsonb-lengths patch I posted, and the other is a merged patch showing the total change from HEAD, for ease of application.

			regards, tom lane

diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c
index e47eaea..4e7fe67 100644
*** a/src/backend/utils/adt/jsonb_util.c
--- b/src/backend/utils/adt/jsonb_util.c
***************
*** 26,33 ****
   * in MaxAllocSize, and the number of elements (or pairs) must fit in the bits
   * reserved for that in the JsonbContainer.header field.
   *
!  * (the total size of an array's elements is also limited by JENTRY_LENMASK,
!  * but we're not concerned about that here)
   */
  #define JSONB_MAX_ELEMS (Min(MaxAllocSize / sizeof(JsonbValue), JB_CMASK))
  #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), JB_CMASK))
--- 26,33 ----
   * in MaxAllocSize, and the number of elements (or pairs) must fit in the bits
   * reserved for that in the JsonbContainer.header field.
   *
!  * (The total size of an array's or object's elements is also limited by
!  * JENTRY_LENMASK, but we're not concerned about that here.)
   */
  #define JSONB_MAX_ELEMS (Min(MaxAllocSize / sizeof(JsonbValue), JB_CMASK))
  #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), JB_CMASK))
*************** findJsonbValueFromContainer(JsonbContain
*** 294,303 ****
  {
  	JEntry	   *children = container->children;
  	int			count = (container->header & JB_CMASK);
! 	JsonbValue *result = palloc(sizeof(JsonbValue));
  
  	Assert((flags & ~(JB_FARRAY | JB_FOBJECT)) == 0);
  
  	if (flags & JB_FARRAY & container->header)
  	{
  		char	   *base_addr = (char *) (children + count);
--- 294,309 ----
  {
  	JEntry	   *children = container->children;
  	int			count = (container->header & JB_CMASK);
! 	JsonbValue *result;
  
  	Assert((flags & ~(JB_FARRAY | JB_FOBJECT)) == 0);
  
+ 	/* Quick out without a palloc cycle if object/array is empty */
+ 	if (count <= 0)
+ 		return NULL;
+ 
+ 	result = palloc(sizeof(JsonbValue));
+ 
  	if (flags & JB_FARRAY & container->header)
  	{
  		char	   *base_addr = (char *) (children + count);
*************** findJsonbValueFromContainer(JsonbContain
*** 323,329 ****
  		char	   *base_addr = (char *) (children + count * 2);
  		uint32	   *offsets;
  		uint32		lastoff;
! 		int			lastoffpos;
  		uint32		stopLow = 0,
  					stopHigh = count;
--- 329,335 ----
  		char	   *base_addr = (char *) (children + count * 2);
  		uint32	   *offsets;
  		uint32		lastoff;
! 		int			i;
  		uint32		stopLow = 0,
  					stopHigh = count;
*************** findJsonbValueFromContainer(JsonbContain
*** 332,379 ****
  		/*
  		 * We use a cache to avoid redundant getJsonbOffset() computations
! 		 * inside the search loop.  Note that count may well be zero at this
! 		 * point; to avoid an ugly special case for initializing lastoff and
! 		 * lastoffpos, we allocate one extra array element.
  		 */
! 		offsets = (uint32 *) palloc((count * 2 + 1) * sizeof(uint32));
! 		offsets[0] = lastoff = 0;
! 		lastoffpos = 0;
  
  		/* Binary search on object/pair keys *only* */
  		while (stopLow < stopHigh)
  		{
  			uint32		stopMiddle;
- 			int			index;
  			int			difference;
  			JsonbValue	candidate;
  
  			stopMiddle = stopLow + (stopHigh - stopLow) / 2;
  
- 			/*
- 			 * Compensate for the fact that we're searching through pairs (not
- 			 * entries).
- 			 */
- 			index = stopMiddle * 2;
- 
- 			/* Update the offsets cache through at least
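The keys-then-values rearrangement Tom describes can be illustrated with a toy model (this is not jsonb's actual on-disk representation; the names are invented):

```python
import bisect

# Toy model of the layout change: instead of interleaving
# [k1, v1, k2, v2, ...], store all sorted keys first and the values
# after them in the same order. A key search then touches only the
# contiguous key region - the cache-locality win described above -
# and fetches exactly one value once the key's position is known.

def make_object(d):
    keys = sorted(d)                   # all keys, adjacent "in memory"
    values = [d[k] for k in keys]      # all values, in the same order
    return keys, values

def find_key(obj, key):
    keys, values = obj
    i = bisect.bisect_left(keys, key)  # binary search over keys only
    if i < len(keys) and keys[i] == key:
        return values[i]               # one targeted fetch from the values region
    return None
```

With the interleaved layout, the same binary search would stride across alternating keys and values, pulling untouched value bytes into cache on every probe; segregating the keys keeps the search's working set dense.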
Re: [HACKERS] Per table autovacuum vacuum cost limit behaviour strange
On 27/08/14 10:27, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
>> So my proposal is a bit more complicated. First we introduce the
>> notion of a single number, to enable sorting and computations: the
>> delay equivalent, which is the cost_limit divided by cost_delay.
>
> Here's a patch that implements this idea. As you see this is quite a
> bit more complicated than Haribabu's proposal.
>
> There are two holes in this:
>
> 1. If you ALTER DATABASE to change vacuum delay for a database, those
> values are not considered in the global equiv delay. I don't think this
> is very important and anyway we haven't considered this very much, so
> it's okay if we don't handle it.
>
> 2. If you have a fast worker that's only slightly faster than regular
> workers, it will become slower in some cases. This is explained in a
> FIXME comment in the patch.
>
> I don't really have any more time to invest in this, but I would like
> to see it in 9.4. Mark, would you test this? Haribabu, how open are you
> to fixing point (2) above?

Thanks Alvaro - I will take a look.

regards

Mark
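As a rough illustration of the "delay equivalent" bookkeeping - hypothetical names and an equal-share policy, not the patch's actual balancing logic:

```python
# Toy sketch: a worker's I/O rate is proportional to
# cost_limit / cost_delay, so that single number ("delay equivalent")
# makes workers with different cost_delay settings comparable. The
# launcher can then hand each worker a cost_limit such that the sum of
# the workers' delay equivalents matches the global budget. Splitting
# the budget equally here is an illustrative choice only.

def delay_equivalent(cost_limit, cost_delay):
    return cost_limit / cost_delay

def rebalance(worker_delays, global_limit, global_delay):
    """Return a new cost_limit per worker (one per entry in
    worker_delays), splitting the global delay-equivalent budget
    equally among the workers."""
    budget = delay_equivalent(global_limit, global_delay)
    share = budget / len(worker_delays)
    return [share * d for d in worker_delays]
```

With the defaults-like numbers global_limit=200, global_delay=20, two workers each running at cost_delay=20 each get cost_limit=100, and their combined delay equivalent (5 + 5) equals the global one (10).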
Re: [HACKERS] delta relations in AFTER triggers
Kevin Grittner <kgri...@ymail.com> wrote:

> I think this is approaching a committable state, although I think it
> should probably be broken down to four separate patches.

And here they are. This should net to the same set of changes as the prior post, but the changes are logically separated. They are labeled as v3 to match the last post.

trigger-transition-tables allows transition table names to be specified in a REFERENCING clause of CREATE TRIGGER, per spec, and creates tuplestores when needed in the TriggerData structure. It doesn't worry about who does what with that data. This doesn't depend on anything else.
 15 files changed, 577 insertions(+), 43 deletions(-)

spi-tuplestore-registry allows tuplestores, with associated name and TupleDesc, to be registered with the current SPI connection. Queries planned or executed on that connection will recognize the name as a tuplestore relation. It doesn't care who is registering the tuplestores or what happens to them. It doesn't depend on anything else.
 5 files changed, 445 insertions(+)

executor-tuplestore-relations covers parse analysis, planner/optimizer, and executor layers. It pulls information from the registry in a couple of places, but is not very intertwined with SPI. That is the only registry it knows right now, but it should be easy to add other registries if needed. It doesn't care where the tuplestore came from, so we should be able to use this for other things. I have it in mind to use it for incremental maintenance of materialized views, but I expect that other uses will be found. It has a logical dependency on the spi-tuplestore-registry patch. While it doesn't have a logical dependency on trigger-transition-tables, they both modify some of the same files, and this won't apply cleanly unless trigger-transition-tables is applied first. If you hand-correct the failed hunks, it compiles and runs fine without trigger-transition-tables.
 30 files changed, 786 insertions(+), 9 deletions(-)

plpgsql-after-trigger-transition-tables takes the tuplestores from TriggerData and registers them with SPI before trigger planning and execution. It depends on the trigger-transition-tables and spi-tuplestore-registry patches to build, and won't do anything useful at run time without the executor-tuplestore-relations patch.
 3 files changed, 37 insertions(+), 11 deletions(-)

Hopefully this will make review easier.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:
trigger-transition-tables-v3.patch.gz (application/tgz)
spi-tuplestore-registry-v3.patch.gz (application/tgz)
executor-tuplestore-relations-v3.patch.gz (application/tgz)
plpgsql-after-trigger-transition-tables-v3.patch.gz (application/tgz)
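The spi-tuplestore-registry concept - named tuplestores registered with a connection so that later query processing can resolve them like relations - might be sketched like this (invented API, not the patch's actual C interface):

```python
# Toy sketch of the registry idea: each tuplestore is registered under a
# name, together with a row description, on a per-connection registry.
# Parse analysis can then resolve that name the way it resolves a
# relation name. All names here are illustrative, not the patch's.

class TuplestoreRegistry:
    def __init__(self):
        self._stores = {}

    def register(self, name, tupdesc, rows):
        if name in self._stores:
            raise KeyError(f"tuplestore {name!r} already registered")
        self._stores[name] = (tupdesc, rows)

    def lookup(self, name):
        # returns (tupdesc, rows) or None, like a failed relation lookup
        return self._stores.get(name)

    def unregister(self, name):
        self._stores.pop(name, None)

# e.g. an AFTER trigger registering its transition tables under the
# names given in the trigger's REFERENCING clause
reg = TuplestoreRegistry()
reg.register("old_rows", ("id", "val"), [(1, "before")])
reg.register("new_rows", ("id", "val"), [(1, "after")])
```

The appeal of the split Kevin describes is visible even in this toy: the registry neither knows who filled the tuplestores (triggers, matview maintenance, something else) nor what the executor does with them.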